CAPSTONE - NLP
Automatic Ticket Assignment
1. Amit (Mentor)
2. Pranov (Student)
3. Hari (Student)
4. Bhargav (Student)
5. Nikhil (Student)
6. Vamsi (Student)
Abstract
Automating the task of Ticket Assignment by using advanced Machine Learning algorithms.
Table of Contents
Summary
Problem Statement
Primary Objective
Significant Findings
Benchmark
Overview and Walkthrough of solution
Machine Learning methodology
EDA and data pre-processing
Basic EDA
Data Pre-processing
Model Evaluation
Observations & Next Steps
Comparison to Benchmark
Visualisations
Implications
Limitations
Closing Reflections
Summary
Problem Statement
An incident is an unplanned interruption to an IT service, or a reduction in the quality of an IT service, that affects users and the business. The main goal of the Incident Management process is to provide a quick fix, workaround, or solution that resolves the interruption and restores the service to its full capacity, ensuring no business impact.
In the support process, incoming incidents are analysed and assessed by the organization's support teams to fulfil the request. In many organizations, better allocation and more effective usage of valuable support resources directly results in substantial cost savings.
Powerful AI techniques that classify incidents to the right functional groups can help organizations reduce issue resolution time and free support staff to focus on more productive tasks.
Primary Objective
Automate incident assignment to the right functional groups, based on the provided data, to a level of accuracy that saves time and effort for the L1/L2 groups.
The effort and time saved by automation should result in quicker and better issue resolutions.
Significant Findings
Based on the problem description and information provided:
1. L1/L2 spend a lot of time (almost 1 FTE) assigning issues to L3/functional groups. Apart from assigning issues, L1/L2 also resolve issues.
2. There were multiple instances of incidents getting assigned to the wrong functional groups.
3. Additional effort was needed by functional teams to re-assign incidents to the right groups.
4. As a consequence, some incidents sat in the queue and were not addressed in a timely manner, resulting in poor customer service.
Benchmark
In a real-world scenario:
1. Even if 50% of the assignment work is automated, it will save a lot of time and effort for L1/L2.
2. It will also improve assignment accuracy and speed up issue resolution.
3. In other words, the time and effort saved translate into cost savings and customer satisfaction.
Overview and Walkthrough of solution
Our objective in this project is to automate ticket assignment to the respective groups.
Machine Learning methodology for automatic assignment solution
Below are the key steps followed to implement the solution.
1. Understand the problem and identify the goal.
a. We identified that the major issues with the existing solution (manual assignment of incidents to functional groups) are that L1/L2 groups have to put a lot of time and effort into incident assignments, and that there are also instances of wrong assignments.
b. The objective is to automate incident assignments to the right functional groups, based on the provided data, to a level of accuracy that saves time and effort for the L1/L2 groups.
2. EDA (Exploratory Data Analysis) and data pre-processing of given dataset
a. Understanding the characteristics of the data in the given dataset.
i. In the given dataset, we have 8500 rows and 4 columns: Short
description, Description, Caller and Assignment group.
ii. There are 74 unique teams/assignment groups to which the received
tickets are assigned.
iii. GRP_0 has nearly half of the total number of tickets received.
iv. Many assignment groups receive fewer than 100 tickets.
v. Many groups have only one ticket assigned, which does not add much value to any machine learning framework.
So, based on the analysis, all groups with fewer than 100 (or 200) tickets have been aggregated into one group to make the data a little less imbalanced.
b. Data pre-processing was done in this project to convert the raw data into model-consumable data.
i. Tokenisation of the descriptions to prepare them for processing.
ii. Removal of stop words and other commonly occurring words.
iii. N-gram models to improve classification.
iv. TF-IDF embeddings for traditional models.
v. GloVe embeddings for neural-network-based models.
A few other techniques, such as polarity analysis and text readability analysis, were applied to check for correlation between these metrics and the classification.
Finally, the given dataset, after aggregation of groups and consolidation of
text, was divided into Train/Validation and Test datasets.
The final accuracy is measured on Test dataset.
3. Model evaluation based on performance metrics done in this project.
For this project, several models as described below were evaluated
a. Traditional models like Naïve Bayes, SVM, Random Forest and Xgboost.
b. Neural Networks like Bi-Directional LSTM, CNN with LSTM and GRU,
Transformer-Attention model and an ensemble Bi-Directional LSTM model.
Models were compared using a classification report with class-wise metrics (Precision, Recall, F1 score and Support). Overall accuracy was also measured.
4. Finalize and tune the model
After trying different traditional models and neural networks, the ensemble model constructed by combining similar Bi-Directional LSTM models gave the best results in this project.
It was able to predict Group 0, which has the majority of assignments, with a precision of 0.83 and an overall accuracy of 0.73.
5. Predict using the model, with a certain confidence level, in real life.
a. The model developed in the project predicts Group 0 assignments with a
precision of 0.83.
b. This has already exceeded the benchmark based on the current manual
process.
c. By using this automation, time and effort on assignments will be reduced.
In the future, with more advanced techniques like SMOTE for data balancing, the model can predict with a better confidence level.
EDA and data pre-processing
Basic EDA
EDA(Exploratory Data Analysis) is a method of analysing a dataset guided by what we look
for, how we look, and how we interpret.
The tools used in this project for EDA are statistical and graphical techniques which are often
simple graphs and plots:
1. Plotting the provided (raw) data as graphs such as histograms, probability plots, etc.
A countplot was used in this project to see the frequency distribution of all ticket assignments.
A similar plot was produced after creation of the "Grouped Assignment" class, which comprises all classes with fewer than 100/200 tickets assigned to them.
2. Verifying presence of NULLs and Identifying rows with missing values
After identifying the missing values, they were handled by replacing missing Descriptions with Short Descriptions and vice versa.
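A minimal pandas sketch of this swap; the file name is illustrative, while the column names are as given in the EDA above:

import pandas as pd

# Load the ticket dataset (file name assumed for illustration).
df = pd.read_excel("input_data.xlsx")

# Fill a missing Description from the Short description, and vice versa.
df["Description"] = df["Description"].fillna(df["Short description"])
df["Short description"] = df["Short description"].fillna(df["Description"])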
Data Pre-processing
Data Pre-processing refers to analysis and activities involved in transforming the data to such
a state that a machine can easily parse it.
Raw text is unstructured, consisting of words, dates, numbers and facts that cannot be directly interpreted by a machine.
The text data is first cleaned and prepared so that it can be consumed by an ML algorithm.
Tokenization is the process of chopping the raw text into pieces, called tokens.
Tokens are usually equated to words, but they can also be small groups of words or character sequences.
In this project, Tokenization has been used in several contexts:
1. Tokenize to remove stop words using NLTK.
2. Tokenize at word level, N-gram level and character level for TF-IDF
3. Tokenize the text for neural network model building.
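A minimal sketch of the first of these, tokenizing and removing English stop words with NLTK; the sample sentence is illustrative:

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("punkt")
nltk.download("stopwords")

stop_words = set(stopwords.words("english"))

def clean_tokens(text):
    # Tokenize, lower-case, and drop punctuation and English stop words.
    tokens = word_tokenize(text.lower())
    return [t for t in tokens if t.isalpha() and t not in stop_words]

print(clean_tokens("Unable to log in to the ERP system since this morning"))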
Stop words are usually the most common words in a language.
But, depending on the problem and context, a few other commonly used words within the corpus can also be treated as stop words.
Stop words and punctuation are usually removed before giving the text to ML algorithms, to improve performance.
In this project, common English stop words were removed.
There was also an attempt to identify commonly used words within the corpus and remove them, but this did not yield better results.
Word clouds are graphical representations of word frequency that give greater prominence
to most frequent words in the text.
They can be used to identify stop words.
Figures: a word cloud highlighting the stop words in our project data, and the same cloud after stop-word removal.
An N-gram is a sequence of N words. For example, "The Final Report" has three words and so is a trigram.
If we assign a probability to the occurrence of an N-gram, it can help make next-word predictions. We can also check whether there is a relation between certain N-grams and the classification.
Figure: top 20 trigrams in the ticket descriptions.
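A sketch of how such trigram counts can be produced with scikit-learn's CountVectorizer; the sample documents are illustrative:

from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "unable to login to erp system",
    "password reset not working in erp system",
    "unable to login to vpn",
]

# Count every trigram in the corpus and list the 20 most frequent ones.
vec = CountVectorizer(ngram_range=(3, 3))
counts = vec.fit_transform(docs)
freqs = counts.sum(axis=0).A1  # total frequency of each trigram
top20 = sorted(zip(vec.get_feature_names_out(), freqs), key=lambda x: -x[1])[:20]
for gram, freq in top20:
    print(gram, freq)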
Polarity refers to the emotion expressed in a sentence; it can be positive, negative or neutral.
In this project, an attempt was made to see whether there is a relation between polarity and classification to a certain group.
The TextBlob package in Python can be used to determine polarity.
Figure: distribution of ticket assignments by polarity.
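A minimal sketch of scoring polarity with TextBlob, assuming the DataFrame df from the EDA above:

from textblob import TextBlob

# Polarity ranges from -1.0 (negative) through 0 (neutral) to +1.0 (positive).
df["polarity"] = df["Description"].apply(lambda t: TextBlob(str(t)).sentiment.polarity)
print(df[["Assignment group", "polarity"]].groupby("Assignment group").mean())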
Readability (index) is a metric to quantify how hard a text is to read. There are several
methods to determine readability indices like Flesch Reading Ease, Dale Chall Readability
Score, and Gunning Fog Index.
TextStat library in Python can be used to determine these readability indices.
In this project, an attempt was made to see whether there is a relation between the readability index and the classification (that is, group assignment).
It was observed that the readability of descriptions assigned to Group_0 is higher, going by the Flesch Reading Ease score.
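A small sketch using the TextStat library; the sample description is illustrative:

import textstat

desc = "Unable to connect to the company VPN after the latest update"
print(textstat.flesch_reading_ease(desc))
print(textstat.dale_chall_readability_score(desc))
print(textstat.gunning_fog(desc))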
Aggregations can be done to put the data in a better perspective.
Behaviour of this grouped data is more stable than individual data objects.
In the dataset provided for this project, there are a lot of assignment groups with a frequency of less than 100 tickets. All such groups with fewer than 100/200 tickets have been aggregated into one group to make the data a little less imbalanced.
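A sketch of this aggregation in pandas, assuming the DataFrame df and column names from the EDA above; the threshold is the 100/200 cut-off discussed:

# Map every group with fewer tickets than the threshold into one class.
THRESHOLD = 100  # 200 was also tried in this project

counts = df["Assignment group"].value_counts()
rare = counts[counts < THRESHOLD].index
df["Assignment group"] = df["Assignment group"].replace(
    {g: "Grouped Assignment" for g in rare}
)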
Feature Encoding or Feature extraction is the process of converting text to numbers so that
the Machine Learning algorithm can efficiently process it.
TF-IDF, which stands for term frequency — inverse document frequency, is a scoring
measure intended to reflect how relevant a term is in a given document.
The idea is that if a word occurs frequently in a document, we should boost its relevance, as it should be more meaningful than other words (TF). But if it also occurs frequently in many other documents, maybe it is just a frequent word rather than a relevant or meaningful one (IDF).
How does TF-IDF work?
The product of TF and IDF gives the TF-IDF score.
Consider a term t and assign it a weight in document d such that the weight is:
• highest when t occurs many times within the document;
• lower when the term occurs in many documents;
• lowest when the term occurs in almost all documents.
In this project, TF-IDF has been used to vectorise N-gram datasets (derived from raw data).
This vectorised data is then used as input for traditional ML models.
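A minimal sketch with scikit-learn, assuming train_texts and test_texts hold the consolidated descriptions after the split; parameters such as max_features and the alternative analyzers are illustrative:

from sklearn.feature_extraction.text import TfidfVectorizer

# Word-level TF-IDF; analogous vectorizers can be built at the n-gram
# level (ngram_range=(2, 3)) and character level (analyzer="char").
tfidf = TfidfVectorizer(max_features=5000)
X_train_tfidf = tfidf.fit_transform(train_texts)  # fit on training data only
X_test_tfidf = tfidf.transform(test_texts)        # reuse the fitted vocabulary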
Word embeddings are a type of word representation that allows words with similar meaning
to have a similar representation. They are basically lower dimensional, dense vector
representations for words.
While TF-IDF vectors can also be called embeddings, they are sparse vectors that do not add much semantic meaning to the representation.
GloVe from Stanford and Word2vec from Google are popular word embeddings used for
NLP.
GloVe stands for Global Vectors for Word Representation.
• It is an unsupervised learning algorithm for obtaining vector representations for words.
• It captures word-word co-occurrences across the entire corpus.
• It suppresses rare co-occurrences and prevents frequent co-occurrences from dominating.
• GloVe is often reported to be more accurate than Word2vec.
Figure: GloVe compared with other techniques.
Finally, GloVe pre-trained weights can be used as the embedding for neural network models.
In this project, GloVe pre-trained weights are used as the embedding layer for the different neural-network-based models.
Model Evaluation
Let's try a couple of classification algorithms for our model building.
Classification Algorithms
The classification step usually involves a statistical model like Naïve Bayes, Logistic
Regression, Support Vector Machines, or Neural Networks:
• Naïve Bayes: a family of probabilistic algorithms that uses Bayes' Theorem to predict the category of a text.
• Logistic Regression: a very well-known statistical algorithm used to predict the probability of a class (Y) given a set of features (X).
• Support Vector Machines: a non-probabilistic model that represents text examples as points in a multidimensional space. Examples of different categories are mapped to distinct regions within that space; new texts are then assigned a category based on their similarity to existing texts and the regions they are mapped to.
• Deep Learning: a diverse set of algorithms that attempt to mimic the human brain by employing artificial neural networks to process data.
Traditional ML Models
Machine learning is an algorithm or model that learns patterns in data and then predicts
similar patterns in new data. For example, if you want to classify children’s books, it would
mean that instead of setting up precise rules for what constitutes a children’s book,
developers can feed the computer hundreds of examples of children’s books. The computer
finds the patterns in these books and uses that pattern to identify future books in that
category.
Essentially, ML is a subset of artificial intelligence that enables computers to learn without
being explicitly programmed with predefined rules. It focuses on the development of
computer programs that can teach themselves to grow and change when exposed to new data.
This predictive ability, in addition to the computer’s ability to process massive amounts of
data, enables ML to handle complex business situations with efficiency and accuracy.
Let’s look at the different machine learning models used for this particular problem
statement.
Naïve Bayes
It is a classification technique based on Bayes’ Theorem with an assumption of independence among
predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature
in a class is unrelated to the presence of any other feature.
For example, a fruit may be considered to be an apple if it is red, round, and about 3 inches in
diameter. Even if these features depend on each other or upon the existence of the other features, all
of these properties independently contribute to the probability that this fruit is an apple and that is
why it is known as ‘Naive’.
The Naive Bayes model is easy to build and particularly useful for very large data sets. Along with its simplicity, Naive Bayes is known to sometimes outperform even highly sophisticated classification methods.
Bayes' theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x) and P(x|c):
P(c|x) = P(x|c) · P(c) / P(x)
 P(c|x) is the posterior probability of class (c, target) given predictor (x, attributes).
 P(c) is the prior probability of class.
 P(x|c) is the likelihood which is the probability of predictor given class.
 P(x) is the prior probability of predictor.
See code implementation of Naive Bayes for this particular problem -
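A minimal sketch of such a classifier, using MultinomialNB (a common choice for TF-IDF features) on the vectorised data from the previous section; variable names are assumed:

from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

nb = MultinomialNB()
nb.fit(X_train_tfidf, y_train)   # y_train: encoded assignment groups
print(classification_report(y_test, nb.predict(X_test_tfidf)))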
Results:
The accuracy and precision achieved were far too low for this model to be considered.
Support Vector Machine
Support Vector Machine (SVM) is a supervised machine learning algorithm that can be used for both classification and regression challenges, though it is mostly used for classification. In the SVM algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features), with the value of each feature being the value of a particular coordinate. We then perform classification by finding the hyperplane that best differentiates the classes.
Support vectors are simply the coordinates of individual observations. The SVM classifier is the frontier (hyperplane/line) that best segregates the classes.
Please see code snippet used to implement SVM -
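A minimal sketch along the same lines, using scikit-learn's LinearSVC (a linear-kernel SVM, which suits sparse TF-IDF features); variable names are assumed:

from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

svm = LinearSVC()
svm.fit(X_train_tfidf, y_train)
print(classification_report(y_test, svm.predict(X_test_tfidf)))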
Again, the results were not good enough to be considered.
Random Forest
Random forest is a bootstrapping (bagging) algorithm built on decision tree (CART) models. Say we have 1000 observations in the complete population with 10 variables. Random forest builds multiple CART models with different samples and different initial variables. For instance, it will take a random sample of 100 observations and 5 randomly chosen initial variables to build a CART model. It repeats the process (say) 10 times and then makes a final prediction for each observation. The final prediction is a function of the individual predictions; for classification it is typically the majority vote (for regression, the mean).
Code snippet for Random Forest Model below –
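A minimal sketch of a random forest on the same features; the hyperparameters are illustrative:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train_tfidf, y_train)
print(classification_report(y_test, rf.predict(X_test_tfidf)))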
The results with traditional ML models are not encouraging, so let's try deep learning models and see how they perform.
Deep Learning
Deep learning text classification model architectures generally consist of the following
components connected in sequence.
Architecture:
Let's go through them one by one with the code developed for our automated ticket classification problem.
Embedding Layer
Word Embedding is a representation of text where words that have the same
meaning have a similar representation. In other words, it represents words in a
coordinate system where related words, based on a corpus of relationships, are placed
closer together. In deep learning frameworks such as TensorFlow and Keras, this part is usually handled by an embedding layer, which stores a lookup table mapping words (represented by numeric indexes) to their dense vector representations.
Two popular examples of methods of learning word embeddings from text include:
Word2Vec
Word2Vec is the first neural embedding model (or at least the first to gain popularity, in 2013) and is still the one used by most researchers. Doc2Vec, its descendant, is likewise the most popular model for paragraph representation, and was inspired by Word2Vec.
GloVe (Global Vectors for Word Representation)
The global word representation approach captures the meaning of one word embedding using the structure of the whole observed corpus; word frequency and co-occurrence counts are the main measures on which the majority of unsupervised algorithms are based. The GloVe model trains on global co-occurrence counts of words and makes efficient use of statistics by minimizing least-squares error, producing, as a result, a word vector space with meaningful substructure. Such an outline sufficiently preserves word similarities in terms of vector distance.
Please see below code used for glove word embedding.
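A minimal sketch of parsing the GloVe file and building an embedding matrix for our vocabulary; word_index is assumed to come from a fitted Keras Tokenizer:

import numpy as np

EMBEDDING_DIM = 200

# Parse the pre-trained GloVe file into a word -> vector dictionary.
embeddings_index = {}
with open("glove.6B.200d.txt", encoding="utf-8") as f:
    for line in f:
        values = line.split()
        embeddings_index[values[0]] = np.asarray(values[1:], dtype="float32")

# Build the embedding matrix for our own vocabulary; rows for words
# not found in GloVe stay as zero vectors.
embedding_matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM))
for word, i in word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector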
From the above, we have picked "glove.6B.200d.txt" because it contains GloVe embeddings trained on a corpus of 6 billion tokens, with each word represented as a 200-dimensional vector.
Since the embedding size is defined as 200, we require the 200-dimensional GloVe pre-trained embedding file. Similarly, for any problem, the embedding file should be chosen to match the embedding size defined in the model.
Deep Network
Deep network takes the sequence of embedding vectors as input and converts them to
a compressed representation. The compressed representation effectively captures all
the information in the sequence of words in the text. The deep network part is usually an RNN or some form of it, like LSTM/GRU. Dropout is added to overcome the tendency to overfit, a very common problem with RNN-based networks.
LSTM Networks
Long Short Term Memory networks — usually just called “LSTMs” — are a special
kind of RNN, capable of learning long-term dependencies. They were introduced
by Hochreiter & Schmidhuber (1997), and were refined and popularized by many
people in following work. They work tremendously well on a large variety of
problems, and are now widely used.
LSTMs are explicitly designed to avoid the long-term dependency problem.
Remembering information for long periods of time is practically their default
behaviour, not something they struggle to learn!
All recurrent neural networks have the form of a chain of repeating modules of neural
network. LSTMs also have this chain like structure, but the repeating module has a
different structure. Instead of having a single neural network layer, there are four,
interacting in a very special way.
In the above diagram, each line carries an entire vector, from the output of one node
to the inputs of others. The pink circles represent pointwise operations, like vector
addition, while the yellow boxes are learned neural network layers. Lines merging
denote concatenation, while a line forking denotes its content being copied and the
copies going to different locations.
LSTM Model
Here we have imported the necessary libraries for LSTM model building
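A minimal sketch of such a model in Keras, using the GloVe embedding matrix built earlier. The 200-dimensional embedding and the 64-unit and 9-unit dense layers match the description that follows, while the sequence length, LSTM width, dropout rate and loss are illustrative assumptions:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense, Dropout

MAX_LEN = 100      # padded sequence length (assumed)
NUM_CLASSES = 9    # target groups after aggregation

model = Sequential([
    # GloVe weights loaded earlier; kept frozen during training.
    Embedding(input_dim=embedding_matrix.shape[0], output_dim=200,
              weights=[embedding_matrix], input_length=MAX_LEN,
              trainable=False),
    Bidirectional(LSTM(128)),          # reads the sequence in both directions
    Dropout(0.3),                      # regularization against overfitting
    Dense(64, activation="relu"),
    Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(loss="sparse_categorical_crossentropy",  # integer-encoded labels assumed
              optimizer="adam", metrics=["accuracy"])
model.summary()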
To use Keras on text data, we first have to pre-process it. For this, we have used GloVe word embeddings to embed the text into 200-dimensional vectors. In the above code, we have used a Bidirectional LSTM, an extension of the traditional LSTM that can improve model performance on sequence classification problems.
The following is a graphical representation of how the Bidirectional LSTM in the above code works internally.
Bidirectional recurrent neural networks(RNN) are really just putting two independent
RNNs together. The input sequence is fed in normal time order for one network, and
in reverse time order for another. The outputs of the two networks are usually
concatenated at each time step, though there are other options, e.g. summation.
This structure allows the networks to have both backward and forward information
about the sequence at every time step.
Now let's look at the fully connected layer and why we require it.
Fully Connected Layer
The fully connected layer takes the deep representation from the RNN/LSTM/GRU
and transforms it into the final output classes or class scores. This component comprises fully connected layers along with batch normalization and, optionally, dropout layers for regularization.
Why do we need dense layers in the above model?
After introducing neural networks and linear layers, and after stating the limitations of
linear layers, we introduce here the dense (non-linear) layers.
In general, they have the same formula as the linear layers, wx + b, but the end result is passed through a non-linear function called an activation function.
The “Deep” in deep-learning comes from the notion of increased complexity resulting
by stacking several consecutive (hidden) non-linear layers. Here are some graphs of
the most famous activation functions:
In the model above, the first dense layer has 64 units with "relu" as the activation function, since ReLU keeps outputs on the positive side and suits multi-layer neural networks.
Output Layer
Based on the problem at hand, this layer can use either sigmoid for binary classification or softmax for binary as well as multi-class classification output.
Finally, the second dense layer, i.e. the output layer, has 9 units with "softmax" as the activation function; the number of units must be 9 because our target classification has 9 groups, as defined during data pre-processing.
With above, we have completed the model building and now let’s understand the
performance of the model on train & test data.
Model Scoring & Selection
Our model scoring and selection is based on the standard evaluation metrics Accuracy,
Precision, and F1 score, which are defined as follows:
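Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 × (Precision × Recall) / (Precision + Recall)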
where:
· TP represents the number of true positive classifications, i.e., the records with the actual label A that have been correctly classified ("predicted") as label A.
· TN is the number of true negative classifications, i.e., the records with an actual label not equal to A that have been correctly classified as not belonging to label A.
· FP is the number of false positive classifications, i.e., records with an actual label other than A that have been incorrectly classified as belonging to category A.
· FN is the number of false negatives, i.e., records with a label equal to A that have been incorrectly classified as not belonging to category A.
Now let's look at the performance of our model.
Let's examine the precision, recall, etc. of each individual group. Before that, let's understand what precision, recall, etc. mean.
The combination of predicted and actual classifications is known as the "confusion matrix". Against the background of these definitions, the various evaluation metrics provide the following insights:
· Accuracy: The proportion of the total number of model predictions that were
correct.
· Precision (also called positive predictive value): The proportion of correct
predictions relative to all predictions for that specific class.
· Recall (also called true positive rate, TPR, hit rate or sensitivity): The proportion of
true positives relative to all the actual positives.
· False positive rate (also called false alarm rate): The proportion of false positives
relative to all the actual negatives (FPR).
· F1: (The more robust) Harmonic mean of Precision and Recall.
Let's have a look at the confusion matrix as well.
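A minimal sketch of producing the confusion matrix and per-group classification report with scikit-learn; the padded test sequences and encoded labels are assumed from the earlier pre-processing:

import numpy as np
from sklearn.metrics import confusion_matrix, classification_report

probs = model.predict(X_test)          # softmax probabilities per group
y_pred = np.argmax(probs, axis=1)      # most likely group per ticket
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))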
The Bi-Directional LSTM, attempted on the dataset after aggregating all groups with a frequency of less than 200, gave an accuracy of 71% and a precision of 71%.
Now, let's try a CNN model and examine its performance.
CNN(Convolutional Neural Network)
A Convolutional Neural Network (ConvNet/CNN) is a Deep Learning algorithm
which can take in an input image, assign importance (learnable weights and biases) to various
aspects/objects in the image and be able to differentiate one from the other. The pre-
processing required in a ConvNet is much lower as compared to other classification
algorithms. While in primitive methods filters are hand-engineered, with enough training,
ConvNets have the ability to learn these filters/characteristics.
The architecture of a ConvNet is analogous to that of the connectivity pattern of Neurons in
the Human Brain and was inspired by the organization of the Visual Cortex. Individual
neurons respond to stimuli only in a restricted region of the visual field known as the
Receptive Field. A collection of such fields overlaps to cover the entire visual area.
From the above, we can see that the convolution and max-pooling layers are the major parts of a CNN. Let's understand them one by one.
Convolutional Layer
Convolutional layers are the major building blocks used in convolutional neural networks.
A convolution is the simple application of a filter to an input that results in an activation.
Repeated application of the same filter to an input results in a map of activations called a
feature map, indicating the locations and strength of a detected feature in an input, such as an
image.
Max Pooling Layer
Maximum pooling, or max pooling, is a pooling operation that calculates the maximum, or
largest, value in each patch of each feature map. We can make the max pooling operation concrete by applying it to the output feature map of a convolution and manually calculating the first row of the pooled feature map.
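A minimal sketch of a Conv1D text classifier in Keras, reusing the embedding matrix and constants from the LSTM section; the filter counts and kernel sizes are illustrative assumptions:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Embedding, Conv1D, MaxPooling1D,
                                     GlobalMaxPooling1D, Dense)

cnn = Sequential([
    Embedding(input_dim=embedding_matrix.shape[0], output_dim=200,
              weights=[embedding_matrix], input_length=MAX_LEN,
              trainable=False),
    Conv1D(filters=128, kernel_size=5, activation="relu"),  # learn n-gram-like filters
    MaxPooling1D(pool_size=2),                              # keep the strongest activations
    Conv1D(filters=64, kernel_size=5, activation="relu"),
    GlobalMaxPooling1D(),
    Dense(64, activation="relu"),
    Dense(NUM_CLASSES, activation="softmax"),
])
cnn.compile(loss="sparse_categorical_crossentropy",
            optimizer="adam", metrics=["accuracy"])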
From the above, we can see that the validation accuracy is 60%, so the CNN is not performing well on this classification problem.
Let's take stock of our observations so far and the next course of action.
Observations & Next Steps:
• Traditional ML models with TF-IDF embeddings did not give encouraging results.
• Multiple deep learning sequential models with GloVe embeddings were attempted and their results compared to arrive at the best model. The two best models are highlighted below through their results.
• The Bi-Directional LSTM, attempted on the dataset after aggregating all groups with a frequency of less than 200, gave an accuracy of 71% and a precision of 71%.
• The accuracy and precision improved further to 73% and 76% respectively when an ensemble of 7 Bi-LSTMs was built (see the sketch after this list).
• In both of the above attempts, the precision for Group_0 was even higher: 82% and 86% respectively.
• Since Group_0 constitutes close to 50% of the total population, with 82% precision we are likely to achieve automation of close to 40% with Group_0 alone.
• With an average precision of 76%, we expect to save around 269 minutes of reading time (for reading 8500 manually raised tickets), considering each document takes 2.5 seconds to read.
• The team will continue experimenting to improve the precision further, and with it the level of automation.
• Considering that this is a highly imbalanced dataset, there is always a trade-off between automating the routing of tickets to more groups and routing tickets more accurately to fewer groups.
• The ensemble model gives us the best result and is recommended for deployment in the production environment.
• Future work includes applying SMOTE to balance the data and re-running the models to see whether the results improve.
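The report does not specify how the 7 Bi-LSTMs were combined; below is a minimal sketch of one common combination rule, averaging the members' softmax outputs (variable names assumed):

import numpy as np

def ensemble_predict(models, X):
    # Average the softmax outputs of all members, then take the argmax.
    avg_probs = np.mean([m.predict(X) for m in models], axis=0)
    return np.argmax(avg_probs, axis=1)

# models: list of 7 independently trained Bi-Directional LSTM models
y_pred = ensemble_predict(models, X_test)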
Comparison to Benchmark
With the models we have built, we have been able to achieve roughly 75% accuracy and precision. This means that about 75% of our predictions are accurate, which is a massive benefit to the company in reducing manual routing of tickets. We started the project with the assumption that if we could automate even 50% of the task, the company would be much better off. With the application of AI, we have gone well past 50% automation, reaching around 75%. Another important point is that for Group_0, which constitutes close to 50% of total tickets, we have achieved more than 85% precision. Putting all this together, we can confidently say that we have exceeded the benchmark laid out at the outset.
Visualisations
The visualizations supporting the insights gleaned from the data (count plots of assignment-group frequencies, word clouds before and after stop-word removal, top-trigram charts, polarity and readability distributions, and confusion matrices) are presented in the EDA and model evaluation sections above.
Implications
Classification of text is a very common problem in the corporate world. Traditionally, assigning tickets, customer queries, etc. to the appropriate team was largely done manually, which is cumbersome and adds no value. With the advancements in AI, all corporations should look to apply NLP solutions to automate these tasks with fairly good levels of accuracy. This automates redundant, non-value-added activities, releasing staff for more complex work.
In this project, we were able to achieve automated classification of tickets to the appropriate teams with an accuracy of 73% and a precision of 75%. Across the deep learning models built for this classification, accuracy ranged from 70% to 73%. We achieved close to 86% precision for Group_0, which helps us automate more than 40% of the total tickets.
The bottom line for this project is that we may sacrifice automation of rare ticket assignments, but we will automate the assignment of the more frequent issues, which will still be a great help to the company.
Since Group_0 constitutes close to 50% of the total population, with 86% precision we are likely to achieve automation of close to 40% with Group_0 alone.
With an average precision of 76%, we expect to save around 269 minutes of reading time (for reading 8500 manually raised tickets), considering each document takes 2.5 seconds to read.
Limitations
The dataset had only 8500 observations and was highly imbalanced. There were 74 classes in the target variable, with a single class constituting about 50% of the total observations.
There are 55 groups whose observations each amount to less than 1% of the total. This is essentially a problem of high cardinality, where the majority of classes have too few observations for any machine learning algorithm to learn the patterns and predict or classify accurately on unseen data.
A lot of ticket assignments have a frequency of less than 100. We end up with only 9 classes when we aggregate all groups with fewer than 200 assignments.
This was our best approach given the above limitation, and it gives any machine learning algorithm a better chance to learn the underlying patterns in the data.
The category holding all the aggregated tickets will still need to be reviewed manually, but ticket assignment for the remaining categories can potentially be automated.
However, all the tickets tagged to the common group should be monitored and information collected until we have enough observations for all the underlying classes.
Once we have a good number of observations for all classes, we can recommend that the company revisit the solution for a more inclusive one.
Closing Reflections
"Not everything that counts can be counted, and not everything that can be counted counts" -
Albert Einstein
Natural Language Processing (NLP) is a subfield of artificial intelligence that helps
computers understand human language. Using NLP, machines can make sense of
unstructured data so that we can gain valuable insights.
We use explicit and implicit rules to convey an idea during conversations. Most of the words we use flow out simply because of the probability that a given word comes after or before another word we are using. This probability can be reverse-engineered from a corpus of things we say or write.
Deep Learning and NLP algorithms help us achieve this, and with them we can train machines to understand text. This helps automate many manual tasks, such as reading texts and classifying them by theme. Companies use resources to read complaints received from customers and route them to the specific teams that address them. We may have to read reported issues and classify them by root cause, such as accuracy, completeness or timeliness, for further action. Companies may have to read through customer reviews to identify dissatisfied customers.
In this project, we learnt to apply various NLP techniques to generate insights from text data, to identify the problems associated with the dataset, and then to leverage multiple NLP and deep learning techniques to solve the problem. We calculated the readability standard of the texts, the reading time per description, and the polarity of each description, all of which were useful insights.
We then converted the plain text to a sequential numeric format via embeddings (using pre-computed models) so that it could be fed to our deep learning models. This project gave us the opportunity to experiment with multiple approaches and algorithms to arrive at the solution that works best given the limitations. The use of Bi-Directional LSTM and Transformer layers seemed to give us the best results, and reiterated the common-sense observation that words with context convey more than words alone. We also saw that hybrid models, such as combinations of CNN with LSTM/GRU/Bi-LSTM, can sometimes give good results, though not in this case.
Lastly, we saw that the ensemble of Bi-LSTMs gave us the best results, indicating that synergy among resources can achieve more than any single good resource.
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一fhwihughh
 

Recently uploaded (20)

How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
 
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
 
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxNLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 

Automation of IT Ticket Assignment using NLP and Deep Learning

The effort and time saved by automation should result in quicker and better issue resolutions.

Significant Findings
Based on the problem description and available information:
1. L1/L2 teams spend a lot of time (almost 1 FTE) assigning issues to L3/functional groups. Apart from assigning issues, L1/L2 also resolve issues.
2. There were multiple instances of incidents being assigned to the wrong functional groups.
3. Additional effort was needed from functional teams to re-assign incidents to the right functional groups.
4. As a consequence, some incidents sat in a queue and were not addressed in time, resulting in poor customer service.

Benchmark
In a real-world scenario:
1. Even if 50% of the assignment work is automated, it will save a lot of time and effort for L1/L2.
2. Automation will also improve assignment accuracy and speed up issue resolution.
3. In other words, the time and effort saved translate into cost savings and customer satisfaction.

Overview and Walkthrough of Solution
Our objective in this project is to automate ticket assignment to the respective groups.

Machine Learning Methodology for the Automatic Assignment Solution
Below are the key steps followed to implement the solution.
1. Understand the problem and identify the goal.
   a. The major issues with the existing solution (manual assignment of incidents to functional groups) are that L1/L2 groups put a lot of time and effort into incident assignment, and that there are also instances of wrong assignments.
   b. The objective is to automate incident assignment to the right functional groups, based on the provided data, to a level of accuracy that saves time and effort for the L1/L2 groups.
2. EDA (Exploratory Data Analysis) and data pre-processing of the given dataset.
   a. Understanding the characteristics of the data:
      i. The dataset has 8500 rows and 4 columns: Short description, Description, Caller and Assignment group.
      ii. There are 74 unique teams/assignment groups to which received tickets are assigned.
      iii. GRP_0 accounts for nearly half of all tickets received.
      iv. Many groups receive fewer than 100 tickets.
      v. Many groups have only one ticket assigned, which does not add much value to any machine learning framework.
Based on this analysis, all groups with fewer than 100 (or 200) tickets were aggregated into one group to make the data a little less imbalanced.
   b. Data pre-processing performed in this project to convert the raw data into model-consumable data:
      i. Tokenisation of the descriptions to prepare them for processing.
      ii. Removal of stop words and other commonly occurring words.
      iii. N-gram models to improve classification.
      iv. TF-IDF embeddings for traditional models.
      v. GloVe embeddings for neural-network-based models.
   A few other techniques, such as polarity analysis and text-readability analysis, were applied to check for correlation between these metrics and the classification. Finally, the dataset, after aggregation of groups and consolidation of text, was divided into train/validation and test sets. The final accuracy is measured on the test set.
3. Model evaluation based on performance metrics. Several models were evaluated:
   a. Traditional models such as Naïve Bayes, SVM, Random Forest and XGBoost.
   b. Neural networks such as a Bi-Directional LSTM, CNN combined with LSTM and GRU, a Transformer-Attention model, and an ensemble of Bi-Directional LSTMs.
   Models were compared using a classification report with class-wise metrics (Precision, Recall, F1 score and Support); overall accuracy was also measured.
4. Finalise and tune the model. After trying the different traditional models and neural networks, the ensemble constructed by combining similar Bi-Directional LSTM models gave the best results in this project. It was able to predict Group 0, which has the majority of assignments, with a precision of 0.83 and an overall accuracy of 0.73.
5. Predict using the model, with a certain confidence level, in real life.
   a. The model developed in this project predicts Group 0 assignments with a precision of 0.83.
   b. This already exceeds the benchmark based on the current manual process.
   c. By using this automation, time and effort spent on assignments will be reduced. In future, with techniques such as SMOTE for data balancing, the model may predict with a better confidence level.
EDA and data pre-processing

Basic EDA
EDA (Exploratory Data Analysis) is a method of analysing a dataset guided by what we look for, how we look, and how we interpret. The tools used for EDA in this project are statistical and graphical techniques, often simple graphs and plots:
1. Plotting the provided (raw) data as graphs such as histograms and probability plots. A countplot was used in this project to see the frequency distribution of ticket assignments across all groups, and a similar plot was drawn after creating the "Grouped Assignment" class, which comprises all classes with fewer than 100/200 tickets assigned to them.
2. Verifying the presence of NULLs and identifying rows with missing values.
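A minimal sketch of this basic EDA, assuming pandas and seaborn; the input file name is hypothetical, while the column names follow the dataset description:

```python
# Basic EDA sketch: frequency of assignment groups and missing values.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_excel("input_data.xlsx")   # Short description, Description, Caller, Assignment group

# Countplot: frequency distribution of ticket assignments across the 74 groups
plt.figure(figsize=(14, 4))
sns.countplot(x="Assignment group", data=df,
              order=df["Assignment group"].value_counts().index)
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()

# Verify presence of NULLs and identify rows with missing values
print(df.isnull().sum())
print(df[df["Short description"].isnull() | df["Description"].isnull()])
```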
After identifying the missing values, they were handled by replacing missing Descriptions with Short Descriptions and vice versa.

Data Pre-processing
Data pre-processing refers to the analysis and activities involved in transforming the data into a state that a machine can easily parse. Raw text is unstructured, consisting of words, dates, numbers and facts that cannot be interpreted by a machine as-is. The text data is first cleaned and prepared so that it can be consumed by an ML algorithm.

Tokenization is the process of chopping raw text into pieces called tokens. Tokens are usually equated with words, but they can also be small groups of words or character sequences. In this project, tokenization has been used in several contexts:
1. Tokenize to remove stop words using NLTK.
2. Tokenize at word level, N-gram level and character level for TF-IDF.
3. Tokenize the text for neural-network model building.

Stop words are usually the most common words in a language, but depending on the problem and context, other words commonly used within the corpus can also be treated as stop words. Stop words and punctuation are usually removed before feeding the text to ML algorithms, to improve performance. In this project, common English stop words were removed. An attempt was also made to identify and remove words commonly used within the corpus, but this did not yield better results.
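A sketch of the tokenisation and stop-word removal step with NLTK, reusing df from the EDA sketch above (the cleaned column name is illustrative):

```python
# Tokenise the descriptions and drop English stop words with NLTK.
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("punkt")
nltk.download("stopwords")

stop_words = set(stopwords.words("english"))

def clean_text(text):
    tokens = word_tokenize(str(text).lower())
    # keep alphabetic tokens that are not stop words
    return " ".join(t for t in tokens if t.isalpha() and t not in stop_words)

df["clean_description"] = df["Description"].apply(clean_text)
```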
Word clouds are graphical representations of word frequency that give greater prominence to the most frequent words in a text. They can be used to identify stop words. A word cloud was drawn highlighting the stop words in our project data.
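A sketch of how such a word cloud can be produced, assuming the wordcloud package:

```python
# Word cloud over the raw descriptions; the most frequent words stand out.
from wordcloud import WordCloud
import matplotlib.pyplot as plt

text = " ".join(df["Description"].astype(str))
wc = WordCloud(width=800, height=400, background_color="white").generate(text)

plt.figure(figsize=(10, 5))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```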
A second word cloud was drawn after removal of stop words.

An N-gram is a sequence of N words. For example, "The Final Report" has three words and so is a trigram. If we assign a probability to the occurrence of an N-gram, it can help make next-word predictions. We can also check whether there is a relation between certain N-grams and the classification.
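A sketch of extracting the most frequent trigrams with scikit-learn, assuming the cleaned text column from the earlier sketch:

```python
# Count trigrams and list the 20 most frequent ones.
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(ngram_range=(3, 3))
X = vec.fit_transform(df["clean_description"])

counts = X.sum(axis=0).A1                 # total occurrences of each trigram
trigrams = vec.get_feature_names_out()
top20 = sorted(zip(trigrams, counts), key=lambda p: -p[1])[:20]
for gram, n in top20:
    print(f"{gram}: {n}")
```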
The top 20 trigrams in the ticket descriptions were plotted.

Polarity is the emotion expressed in a sentence; it can be positive, negative or neutral. In this project, an attempt was made to see whether there is a relation between polarity and classification to a certain group. The TextBlob package in Python can be used to determine polarity.
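A sketch of the polarity computation with TextBlob (polarity ranges from -1 for negative to +1 for positive):

```python
# Compute a polarity score per description and compare across groups.
from textblob import TextBlob

df["polarity"] = df["Description"].astype(str).apply(
    lambda t: TextBlob(t).sentiment.polarity)

print(df.groupby("Assignment group")["polarity"].mean().sort_values())
```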
The assignments were then represented by polarity.

Readability (index) is a metric that quantifies how hard a text is to read. There are several readability indices, such as the Flesch Reading Ease, the Dale-Chall Readability Score, and the Gunning Fog Index; the TextStat library in Python can be used to compute them. In this project, an attempt was made to see whether there is a relation between readability index and classification (that is, group assignment). It was observed that, by the Flesch Reading Ease score, the readability of descriptions assigned to Group_0 is higher.
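A sketch of the readability computation with the textstat package:

```python
# Flesch Reading Ease per description; higher scores mean easier text.
import textstat

df["flesch_score"] = df["Description"].astype(str).apply(
    textstat.flesch_reading_ease)

print(df.groupby("Assignment group")["flesch_score"].mean().sort_values(ascending=False))
```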
Aggregations can be performed to put the data in better perspective; the behaviour of grouped data is more stable than that of individual data objects. In the dataset provided for this project, many assignment groups have a frequency of less than 100 tickets. All groups with fewer than 100/200 tickets were therefore aggregated into one group to make the data a little less imbalanced.

Feature encoding (or feature extraction) is the process of converting text to numbers so that a machine learning algorithm can process it efficiently. TF-IDF, which stands for term frequency - inverse document frequency, is a scoring measure intended to reflect how relevant a term is in a given document. The idea is that if a word occurs frequently in a document, we should boost its relevance, as it should be more meaningful than other words (TF). But if it also occurs frequently in many other documents, it is probably just a frequent word rather than a relevant or meaningful one (IDF).

How does TF-IDF work? The product of TF and IDF gives the TF-IDF score. For a term t in document d, the weight is assigned such that it is:
• highest when t occurs many times within the document;
• lower when the term occurs in many documents;
• lowest when the term occurs in almost all documents.

In this project, TF-IDF has been used to vectorise the N-gram datasets derived from the raw data. This vectorised data is then used as input for the traditional ML models.
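A sketch of the TF-IDF vectorisation at word, N-gram and character level, assuming scikit-learn (feature counts are illustrative):

```python
# Three TF-IDF views of the cleaned text, used as inputs to the traditional models.
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_word  = TfidfVectorizer(analyzer="word", max_features=5000)
tfidf_ngram = TfidfVectorizer(analyzer="word", ngram_range=(2, 3), max_features=5000)
tfidf_char  = TfidfVectorizer(analyzer="char", ngram_range=(2, 3), max_features=5000)

X_word  = tfidf_word.fit_transform(df["clean_description"])
X_ngram = tfidf_ngram.fit_transform(df["clean_description"])
X_char  = tfidf_char.fit_transform(df["clean_description"])
```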
Word embeddings are a type of word representation that allows words with similar meaning to have a similar representation. They are lower-dimensional, dense vector representations of words. While TF-IDF can also be considered an embedding, it consists of sparse vectors which do not add much meaning to the representation. GloVe from Stanford and Word2vec from Google are popular word embeddings used in NLP. GloVe stands for Global Vectors for Word Representation:
• It is an unsupervised learning algorithm for obtaining vector representations of words.
• It captures word-word co-occurrences across the entire corpus well.
• It suppresses rare co-occurrences and prevents frequent co-occurrences from dominating.
• GloVe is often more accurate than Word2vec.

In this project, GloVe pre-trained weights are used as the embedding layer for the different neural-network-based models.

Model Evaluation
Let's try a couple of classification algorithms for our model building.

Classification Algorithms
The classification step usually involves a statistical model like Naïve Bayes, Logistic Regression, Support Vector Machines, or Neural Networks:
• Naïve Bayes: a family of probabilistic algorithms that uses Bayes' Theorem to predict the category of a text.
• Logistic Regression: a very well-known statistical algorithm used to predict the probability of an outcome (Y) given a set of features (X).
• Support Vector Machines: a non-probabilistic model which represents text examples as points in a multidimensional space. Examples of different categories are mapped to distinct regions within that space; new texts are then assigned a category based on their similarity to existing texts and the regions they map to.
• Deep Learning: a diverse set of algorithms that attempt to mimic the human brain by employing artificial neural networks to process data.

Traditional ML Models
Machine learning is an algorithm or model that learns patterns in data and then predicts similar patterns in new data. For example, to classify children's books, instead of setting up precise rules for what constitutes a children's book, developers can feed the computer hundreds of examples of children's books. The computer finds the patterns in these books and uses those patterns to identify future books in that category. Essentially, ML is a subset of artificial intelligence that enables computers to learn without being explicitly programmed with predefined rules. It focuses on the development of computer programs that can teach themselves to grow and change when exposed to new data. This predictive ability, together with the computer's ability to process massive amounts of data, enables ML to handle complex business situations with efficiency and accuracy. Let's look at the different machine learning models used for this particular problem statement.

Naïve Bayes
This is a classification technique based on Bayes' Theorem, with an assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. For example, a fruit may be considered an apple if it is red, round, and about 3 inches in diameter. Even if these features depend on each other or on the existence of other features, all of these properties independently contribute to the probability that the fruit is an apple; that is why it is known as 'naive'. A Naive Bayes model is easy to build and particularly useful for very large datasets. Along with its simplicity, Naive Bayes is known to sometimes outperform even highly sophisticated classification methods. Bayes' theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x) and P(x|c):

P(c|x) = P(x|c) · P(c) / P(x)
• P(c|x) is the posterior probability of class c (target) given predictor x (attributes).
• P(c) is the prior probability of the class.
• P(x|c) is the likelihood: the probability of the predictor given the class.
• P(x) is the prior probability of the predictor.

A code sketch of Naive Bayes for this particular problem is shown below.
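A minimal sketch, assuming scikit-learn and the word-level TF-IDF features built earlier (the train/test split shown is illustrative):

```python
# Multinomial Naive Bayes on TF-IDF features.
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

y = df["Assignment group"]
X_train, X_test, y_train, y_test = train_test_split(
    X_word, y, test_size=0.2, random_state=42, stratify=y)

nb = MultinomialNB()
nb.fit(X_train, y_train)
print(classification_report(y_test, nb.predict(X_test)))
```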
Results: the accuracy and precision achieved were too low to be considered.

Support Vector Machine
A Support Vector Machine (SVM) is a supervised machine learning algorithm which can be used for both classification and regression challenges, though it is mostly used for classification. In the SVM algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features), with the value of each feature being the value of a particular coordinate. We then perform classification by finding the hyper-plane that best differentiates the classes. Support vectors are simply the coordinates of individual observations; the SVM classifier is the frontier (hyper-plane/line) that best segregates the classes.

A code sketch used to implement the SVM is shown below.
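A sketch reusing the split from the Naive Bayes example; LinearSVC is a common choice for sparse TF-IDF features, though the exact SVM variant used in the project may differ:

```python
# Linear SVM on the same TF-IDF features.
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

svm = LinearSVC(C=1.0)
svm.fit(X_train, y_train)
print(classification_report(y_test, svm.predict(X_test)))
```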
Again, the results were not good enough to be considered.

Random Forest
Random forest is a bootstrapping-style (bagging) algorithm built on decision-tree (CART) models. Say we have 1000 observations in the complete population with 10 variables. Random forest builds multiple CART models with different samples and different initial variables. For instance, it may take a random sample of 100 observations and 5 randomly chosen initial variables to build one CART model, repeat the process (say) 10 times, and then make a final prediction for each observation. The final prediction is a function of the individual predictions; it can simply be their mean, or a majority vote for classification. (Illustrative diagram omitted.)

A code sketch for the Random Forest model is shown below.
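A sketch of the Random Forest model on the same features (hyperparameters are illustrative):

```python
# Random Forest: an ensemble of decision trees trained on bootstrapped samples.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)
print(classification_report(y_test, rf.predict(X_test)))
```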
The results with traditional ML models are not encouraging, so let's try deep learning models and compare performance.

Deep Learning
Deep learning text classification architectures generally consist of the following components connected in sequence. Let's go through them one by one, with the code developed for our automated ticket classification problem.

Embedding Layer
A word embedding is a representation of text where words with the same meaning have a similar representation. In other words, it represents words in a coordinate system where related words, based on a corpus of relationships, are placed closer together. In deep learning frameworks such as TensorFlow and Keras, this part is usually handled by an embedding layer which stores a lookup table mapping words, represented by numeric indexes, to their dense vector representations. Two popular methods of learning word embeddings from text are:

Word2Vec
Word2Vec is the first neural embedding model (or at least the first to gain popularity, in 2013) and is still one of the most widely used by researchers. Doc2Vec, its descendant, is the most popular model for paragraph representation and was inspired by Word2Vec.

GloVe (Global Vectors for Word Representation)
The global word representation approach captures the meaning of a word embedding using the structure of the whole observed corpus; word frequency and co-occurrence counts are the main measures on which most unsupervised algorithms are based. The GloVe model trains on global co-occurrence counts of words and makes sufficient use of statistics by minimising least-squares error, producing a word vector space with meaningful substructure. Such an outline sufficiently preserves word similarities in terms of vector distance.

The code below loads the GloVe word embeddings.
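A sketch of loading the GloVe vectors and building the embedding matrix; it assumes glove.6B.200d.txt is available locally and that `tokenizer` is a fitted Keras Tokenizer:

```python
# Parse the GloVe file into a word -> vector dict, then build the embedding matrix.
import numpy as np

EMBEDDING_DIM = 200
embeddings_index = {}
with open("glove.6B.200d.txt", encoding="utf-8") as f:
    for line in f:
        values = line.split()
        embeddings_index[values[0]] = np.asarray(values[1:], dtype="float32")

vocab_size = len(tokenizer.word_index) + 1
embedding_matrix = np.zeros((vocab_size, EMBEDDING_DIM))
for word, i in tokenizer.word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:          # words not in GloVe stay all-zero
        embedding_matrix[i] = vector
```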
We picked "glove.6B.200d.txt" because it contains GloVe embeddings trained on a corpus of 6 billion tokens, with each word represented as a 200-dimensional vector. Since the embedding size is defined as 200, we need the 200-dimensional GloVe file; in general, the embedding file should match the embedding size defined for the problem.

Deep Network
The deep network takes the sequence of embedding vectors as input and converts it into a compressed representation that effectively captures the information in the sequence of words in the text. The deep network part is usually an RNN or one of its variants, such as LSTM/GRU. Dropout is added to counteract the tendency to overfit, a very common problem with RNN-based networks.
LSTM Networks
Long Short-Term Memory networks, usually just called "LSTMs", are a special kind of RNN capable of learning long-term dependencies. They were introduced by Hochreiter & Schmidhuber (1997), and were refined and popularised by many people in subsequent work. They work tremendously well on a large variety of problems and are now widely used. LSTMs are explicitly designed to avoid the long-term dependency problem; remembering information for long periods of time is practically their default behaviour, not something they struggle to learn.

All recurrent neural networks have the form of a chain of repeating modules of neural network. LSTMs also have this chain-like structure, but the repeating module is different: instead of a single neural network layer, there are four, interacting in a very special way. In the usual diagram of an LSTM cell, each line carries an entire vector from the output of one node to the inputs of others; the pink circles represent pointwise operations, like vector addition, while the yellow boxes are learned neural network layers. Lines merging denote concatenation, while a line forking denotes its content being copied, with the copies going to different locations.
LSTM Model
We first import the necessary libraries for building the LSTM model. To use Keras on text data, we pre-process it: the text is tokenized, padded to a fixed length, and embedded into 200-dimensional vectors using the GloVe word embeddings. In the code below, we use a Bidirectional LSTM, an extension of traditional LSTMs that can improve model performance on sequence classification problems.
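A sketch of the Bi-Directional LSTM classifier in Keras, assuming the embedding matrix from the GloVe sketch and inputs padded to MAX_LEN (layer sizes are illustrative):

```python
# Embedding (frozen GloVe weights) -> Bi-LSTM -> dense -> softmax over the 9 groups.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dropout, Dense

MAX_LEN = 200
NUM_CLASSES = 9

model = Sequential([
    Embedding(vocab_size, EMBEDDING_DIM, weights=[embedding_matrix],
              input_length=MAX_LEN, trainable=False),
    Bidirectional(LSTM(128)),
    Dropout(0.3),
    Dense(64, activation="relu"),
    Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="adam", metrics=["accuracy"])
model.summary()
```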
Bidirectional recurrent neural networks (RNNs) are really just two independent RNNs put together. The input sequence is fed in normal time order to one network and in reverse time order to the other. The outputs of the two networks are usually concatenated at each time step, though other options such as summation exist. This structure gives the network both backward and forward information about the sequence at every time step.

Now let's understand the fully connected layer and why we need it.

Fully Connected Layer
The fully connected layer takes the deep representation from the RNN/LSTM/GRU and transforms it into the final output classes or class scores. This component comprises fully connected layers along with batch normalisation and, optionally, dropout layers for regularisation.

Why do we need dense layers in this model? Dense (non-linear) layers have the same form as linear layers, wx + b, but the result is passed through a non-linear function called an activation function. The "deep" in deep learning comes from the increased complexity that results from stacking several consecutive (hidden) non-linear layers. (A chart of the best-known activation functions is omitted here.)
In the Bi-LSTM sketch above, we used 64 units in the first dense layer with ReLU as the activation function, since our outputs always reside on the positive side and this is a multi-layer neural network.

Output Layer
Depending on the problem at hand, this layer uses either sigmoid for binary classification or softmax for binary as well as multi-class classification output.
Finally, we used 9 units in the second (output) dense layer with softmax as the activation function; the number of units is 9 because, after data pre-processing, our target has 9 classification groups. With that, the model building is complete; now let's look at the performance of the model on the train and test data.

Model Scoring & Selection
Our model scoring and selection is based on the standard evaluation metrics Accuracy, Precision, Recall and F1 score, defined as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 × Precision × Recall / (Precision + Recall)

where:
• TP is the number of true positive classifications: records with actual label A that have been correctly classified ("predicted") as label A.
• TN is the number of true negative classifications: records with an actual label not equal to A that have been correctly classified as not belonging to label A.
• FP is the number of false positive classifications: records with an actual label other than A that have been incorrectly classified as belonging to category A.
• FN is the number of false negatives: records with label A that have been incorrectly classified as not belonging to category A.

Now let's look at the performance of our model, including the per-group precision and recall. Before that, let's recap what precision, recall and the related metrics mean.
The combination of predicted and actual classifications is known as the "confusion matrix". Against the background of these definitions, the various evaluation metrics provide the following insights:
• Accuracy: the proportion of the total number of model predictions that were correct.
• Precision (also called positive predictive value): the proportion of correct predictions relative to all predictions for that specific class.
• Recall (also called true positive rate, TPR, hit rate or sensitivity): the proportion of true positives relative to all actual positives.
• False positive rate (also called false alarm rate, FPR): the proportion of false positives relative to all actual negatives.
• F1: the (more robust) harmonic mean of precision and recall.

The classification report and confusion matrix can be produced as sketched below.
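A sketch of producing both for the validation data; X_val and y_val are assumed to be padded sequences and integer class labels:

```python
# Per-class precision/recall/F1 and the confusion matrix for the Keras model.
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

y_pred = np.argmax(model.predict(X_val), axis=1)
print(classification_report(y_val, y_pred))
print(confusion_matrix(y_val, y_pred))
```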
A Bi-Directional LSTM attempted on the dataset, after aggregating all groups with fewer than 200 tickets, gave an accuracy of 71% and a precision of 71%. Now let's try a CNN model and examine its performance.

CNN (Convolutional Neural Network)
A Convolutional Neural Network (ConvNet/CNN) is a deep learning algorithm which can take an input image, assign importance (learnable weights and biases) to various aspects/objects in the image, and differentiate one from another. The pre-processing required by a ConvNet is much lower than for other classification algorithms: while in primitive methods filters are hand-engineered, with enough training ConvNets can learn these filters/characteristics. The architecture of a ConvNet is analogous to the connectivity pattern of neurons in the human brain and was inspired by the organisation of the visual cortex. Individual neurons respond to stimuli only in a restricted region of the visual field known as the receptive field; a collection of such fields overlaps to cover the entire visual area. Convolution and max-pooling layers are the major parts of a CNN network. Let's understand them one by one.
Convolutional Layer
Convolutional layers are the major building blocks of convolutional neural networks. A convolution is the simple application of a filter to an input that results in an activation. Repeated application of the same filter to an input produces a map of activations called a feature map, indicating the locations and strength of a detected feature in the input.

Max Pooling Layer
Maximum pooling, or max pooling, is a pooling operation that calculates the maximum (largest) value in each patch of each feature map. Applying it to the output feature map of a convolution keeps the strongest responses while reducing the size of the representation.

A sketch of the CNN-based text classifier attempted in this project is shown below.
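A sketch of a CNN text classifier along these lines: a 1-D convolution over the embedded sequence followed by max pooling; the exact layers used in the project may differ:

```python
# Embedding -> Conv1D -> max pooling -> dense -> softmax.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Embedding, Conv1D, MaxPooling1D,
                                     GlobalMaxPooling1D, Dense, Dropout)

cnn = Sequential([
    Embedding(vocab_size, EMBEDDING_DIM, weights=[embedding_matrix],
              input_length=MAX_LEN, trainable=False),
    Conv1D(filters=128, kernel_size=5, activation="relu"),
    MaxPooling1D(pool_size=2),
    Conv1D(filters=64, kernel_size=5, activation="relu"),
    GlobalMaxPooling1D(),
    Dense(64, activation="relu"),
    Dropout(0.3),
    Dense(NUM_CLASSES, activation="softmax"),
])
cnn.compile(loss="sparse_categorical_crossentropy",
            optimizer="adam", metrics=["accuracy"])
```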
In our experiments, the CNN's validation accuracy was only about 60%, so the CNN does not perform well on this classification problem. Let's summarise our observations so far and the next course of action.

Observations & Next Steps
• Traditional ML models with TF-IDF embeddings did not give encouraging results.
• Multiple deep learning sequential models with GloVe embeddings were attempted and their results compared to arrive at the best model. The two best models are highlighted below.
• A Bi-Directional LSTM attempted on the dataset, after aggregating all groups with fewer than 200 tickets, gave an accuracy of 71% and a precision of 71%.
• The accuracy and precision improved further, to 73% and 76% respectively, when an ensemble of 7 Bi-LSTMs was built (a sketch of the ensembling step follows this list).
• In both of the above attempts, the precision for Group_0 was even higher: 82% and 86% respectively.
• Since Group_0 constitutes close to 50% of the total population, with 82% precision we are likely to achieve automation of close to 40% with Group_0 alone.
• With an average precision of 76%, we expect to save around 269 minutes of reading time (for reading the 8500 manually raised tickets), considering each document takes about 2.5 seconds to read.
• The team will continue experimenting to improve precision further, and with it the level of automation.
• Considering that this is a highly imbalanced dataset, there is always a trade-off between automating the routing of tickets to more groups and automating routing more accurately to fewer groups.
• The ensemble model gives the best result and is recommended for installation in the production environment.
• Future work includes applying SMOTE to balance the data and re-running the models to see whether the results improve.
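A minimal sketch of the ensembling step referenced above, assuming a hypothetical build_bilstm() factory that returns a compiled model like the one shown earlier:

```python
# Train 7 Bi-LSTMs and average their softmax outputs before taking the argmax.
import numpy as np

models = [build_bilstm() for _ in range(7)]
for m in models:
    m.fit(X_train_seq, y_train_seq, epochs=5, batch_size=64, verbose=0)  # names illustrative

probs = np.mean([m.predict(X_val) for m in models], axis=0)  # average class probabilities
y_pred = np.argmax(probs, axis=1)
```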
Comparison to Benchmark
With the models we have built, we have been able to achieve more than 75% accuracy and precision. This means that 75% of our predictions are accurate, which is a massive benefit to the company in reducing manual routing of tickets. We started the project with the assumption that if we could automate even 50% of the task, the company would be much better off. With the application of AI we have gone well past that: we are achieving more than 75% automation. Another important point is that for Group_0, which constitutes close to 50% of total tickets, we have achieved more than 85% precision. Putting all this together, we can confidently say that we have exceeded the benchmark laid out at the outset.

Visualisations
The relevant visualisations supporting these insights (the assignment-group count plots, the word clouds before and after stop-word removal, the top-trigram plot, and the polarity and readability views) appear in the EDA and pre-processing sections above.

Implications
Classification of text is a very common problem in the corporate world. Traditionally, assigning tickets, customer queries and the like to the appropriate team was largely done manually, which is cumbersome and non-value-added. With the advancements in AI, corporations should look to NLP solutions to automate these tasks with fairly good accuracy; this automates redundant, non-value-added activities and releases staff for more complex work. In this project, we achieved automated classification of tickets to the appropriate teams with an accuracy of 73% and a precision of 75%. Considering all the deep learning models built for this classification, we have a confidence interval of 70% to 73% for accuracy. We are achieving close to 86% precision on Group_0, which helps us automate more than 40% of the total tickets.
The bottom line for this project is that we may sacrifice automation of rare ticket assignments, but we will automate the assignment of the more frequent issues, which is still a great help to the company. Since Group_0 constitutes close to 50% of the total population, with 86% precision we are likely to achieve automation of close to 40% with Group_0 alone. With an average precision of 76%, we expect to save around 269 minutes of reading time for the 8500 manually raised tickets, considering each document takes about 2.5 seconds to read (8500 × 2.5 s ≈ 354 minutes of total reading; 76% of that is roughly 269 minutes).

Limitations
The dataset had only 8500 observations and was highly imbalanced. The target variable had 74 classes, with a single class constituting about 50% of the total observations; 55 groups each had fewer observations than 1% of the total. This is essentially a problem of high cardinality, where the majority of classes have too few observations for any machine learning algorithm to learn their patterns and classify unseen data accurately. Many ticket assignments have a frequency of less than 100, and we end up with only 9 classes when we aggregate all groups with fewer than 200 assignments. This was our best approach given the limitation, as it gives any machine learning algorithm a better chance of learning the underlying patterns in the data. The category containing all the aggregated tickets will still need to be reviewed manually, but ticket assignment for the remaining categories can potentially be automated. However, all tickets tagged to the common group should be monitored, and information collected, until there are enough observations for all the underlying classes. Once there are enough observations for every class, we would recommend that the company revisit the solution for a more all-inclusive one.

Closing Reflections
"Not everything that counts can be counted, and not everything that can be counted counts" - Albert Einstein

Natural Language Processing (NLP) is a subfield of artificial intelligence that helps computers understand human language. Using NLP, machines can make sense of unstructured data so that we can gain valuable insights. We use explicit and implicit rules to convey an idea during conversation, and most of the words we use flow out simply because of the probability that a given word will come after or before another word. This probability can be reverse-engineered from a corpus of things we say or write. Deep learning and NLP algorithms help us achieve this, and with them we can train machines to understand text. This helps automate many manual tasks, such as reading texts and classifying them to a theme.
Companies use resources to read complaints received from customers and route them to the specific teams that can address them. We may have to read reported issues and classify them to root causes like accuracy, completeness or timeliness for further action. Companies may have to read through customer reviews to identify dissatisfied customers.

In this project, we learnt to apply various NLP techniques to generate insights from text data, identify the problems associated with the dataset, and then leverage multiple NLP and deep learning techniques to solve a problem. We calculated the readability standard of the texts, the reading time per description, and the polarity of each description, all of which were useful insights. We then converted the plain text to a sequential numeric format using an embedding (with pre-computed models) so that it could be fed to our deep learning models.

This project gave us the opportunity to experiment with multiple approaches and algorithms to arrive at the final solution, which gives the best results given the limitations. The use of Bi-Directional LSTM and Transformer layers gave us the best results, and reiterated the common-sense notion that words with context convey more than words by themselves. We saw that hybrid models, such as combinations of CNN and LSTM/GRU/Bi-LSTM, can also be applied and can sometimes give good results, though not in this case. Lastly, the ensemble of Bi-LSTMs gave us the best results, indicating that synergy among resources can give better results than any single good resource.