AUTOMATED HELPDESK
FINAL YEAR PROJECT (7TH SEM)
SUBMITTED BY
NIKHIL PATHANIA
PARTHA PRATIM KURMI
PRANAV SHARMA
RISHABH KUMAR
SOURAV KUMAR PAUL
PRESENTATION TIMELINE
Theoretical NLP, Knowledge Base, Design
By – Pranav Sharma
Practical NLP Application, Forming of Tokens
By – Rishabh Kumar
Clustering
By – Sourav Kr Paul
TensorFlow
By – Nikhil Pathania
Query Model
By – Partha Pratim Kurmi
PROJECT TIMELINE
Problem Formulation – Sep 2016
Literature Survey – Sep–Oct 2016
Design Methodology – Nov 2016
Synchronizing Modules – Nov 2016
Basic Implementation – Jan–Feb 2017
Working Model – Mar 2017
Accuracy Improvements – Mar–Apr 2017
PROBLEM STATEMENT
Automate the tasks of customer care centers.
AIM – Build a system that answers questions like:
"How to recharge my mobile?" - PayTM
"How to pay my bills?" - PayTM
"Why is my refund not credited?" - Book My Show
TRAINING MODEL
1.1 Raw Data → 1.2 NLP → 1.3 Preprocessing → 1.4 Knowledge Base → 1.5 Clustering → O/P
INFORMATION RETRIEVAL
• Data Sources
• FAQs
• Past forum data
• Proper data extraction model
• Knowledge base
DATA EXTRACTION MODELS
WHY NLP?
NLP
• 3-step process.
• Extends with clustering.
• Fast, accurate.
PATTERN MATCHING
• 4-step process.
• No extension with clustering.
• Smaller domain.
Example
Knowledge Base - “The CEO of IBM is Samuel Palmisano.”
Query - “Who is the CEO of IBM?”
Format - Q is A
TRAINING MODEL
1.1 Raw Data → 1.2 NLP → 1.3 Preprocessing → 1.4 Knowledge Base → 1.5 Clustering → O/P
NATURAL LANGUAGE PROCESSING
• Problem Domain – English.
• Aim.
• Origin - Turing Test.
• Annotating the sentence.
• Clouds exist on Mars. => <cloud, exist, mars>
• Kernel sentences, T Expressions.
KERNEL SENTENCE, T-EXP
• Kernel Sentences.
• Ternary Expressions.
• <Subject, Relation, Object>
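As a toy illustration (not our full parser), a kernel sentence can be held as a Python tuple in the <Subject, Relation, Object> form:

    # Toy illustration: a kernel sentence stored as a ternary expression.
    # A real system would derive this from a parse, not hard-code it.
    kernel_sentence = "Clouds exist on Mars"
    t_expression = ("cloud", "exist", "mars")   # <Subject, Relation, Object>
    print(t_expression)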
AN EXAMPLE
KNOWLEDGE BASE
• What is it?
• What to store? Proper data structure.
• Mapping to original set.
• NLP Annotations, parameterized variants.
TRAINING MODEL
1.1 Raw Data → 1.2 NLP → 1.3 Preprocessing → 1.4 Knowledge Base → 1.5 Clustering → O/P
PREPROCESSING:-
• Tokenization
• Stop words removal.
• Stemming.
• POS Tagging.
NLTK ( NATURAL LANGUAGE TOOLKIT )
• Suite of libraries.
• Python Support.
• A few of the libraries we will be using :-
• Lexical analysis.
• Parts of speech tagger
TOKENIZATION:-
Tokenization (word_tokenize)
• Breaking stream into meaningful elements.
• Stream may or may not be a meaningful sentence.
EXAMPLE:-
"Recharge your mobile by visiting this link"
After tokenization:-
['Recharge', 'your', 'mobile', 'by', 'visiting', 'this', 'link']
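A minimal sketch of this step with NLTK (assuming the 'punkt' tokenizer data has been downloaded):

    import nltk
    from nltk.tokenize import word_tokenize

    nltk.download('punkt', quiet=True)   # one-time download of the tokenizer model
    tokens = word_tokenize("Recharge your mobile by visiting this link")
    print(tokens)
    # ['Recharge', 'your', 'mobile', 'by', 'visiting', 'this', 'link']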
STOP WORDS :-
E.g. “is”, “for”, “the”, “in”, etc.
Target :- REMOVE THE STOP WORDS
STOP WORDS REMOVED BY NLTK:-
EXAMPLE :-
From Tokenization
['Recharge', 'your', 'mobile', 'by', 'visiting', 'this', 'link']
After Stop Words removal
['Recharge', 'mobile', 'visiting', 'link']
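A minimal sketch of stop-word removal using NLTK's English stop-word list (assuming the 'stopwords' corpus is downloaded):

    import nltk
    from nltk.corpus import stopwords

    nltk.download('stopwords', quiet=True)
    stop_words = set(stopwords.words('english'))

    tokens = ['Recharge', 'your', 'mobile', 'by', 'visiting', 'this', 'link']
    filtered = [t for t in tokens if t.lower() not in stop_words]
    print(filtered)   # ['Recharge', 'mobile', 'visiting', 'link']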
STEMMING:-
Word = Stem + Affixes
Example:- playing = play(stem) + ing(affixes)
TARGET:- Removing affixes from word (called stemming)
E.g. plays, playing, playful all reduced to 'play'
Library in NLTK :- PorterStemmer
EXAMPLE :-
From Stop words removal :-
['Recharge', 'mobile', 'visiting', 'link']
After Stemming :-
['Recharge', 'mobile', 'visit', 'link']   // input for clustering is generated
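A short sketch with NLTK's PorterStemmer (note that the stemmer lowercases its output and may clip stems slightly differently from the example above, e.g. 'visiting' -> 'visit'):

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    tokens = ['Recharge', 'mobile', 'visiting', 'link']
    stems = [stemmer.stem(t) for t in tokens]
    print(stems)   # e.g. 'visiting' is reduced to 'visit'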
POS TAGGING:-
POS (part of speech) = the grammatical category of a token, such as verb, noun, etc.
Target :- Tag the tokens with their POS in a universal tag format.
EXAMPLE :-
From Stemming:-
['Recharge', 'mobile', 'visit', 'link']
After POS Tagging:-
[('Recharge', 'NN'), ('mobile', 'NN'), ('visit', 'VBG'), ('link', 'NN')]
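A minimal sketch with NLTK's pos_tag (assuming the 'averaged_perceptron_tagger' data is downloaded; the exact tags may differ slightly from the slide output):

    import nltk
    from nltk import pos_tag

    nltk.download('averaged_perceptron_tagger', quiet=True)
    tokens = ['Recharge', 'mobile', 'visit', 'link']
    print(pos_tag(tokens))   # list of (token, tag) pairs, e.g. ('link', 'NN')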
TRAINING MODEL
1.1 Raw Data → 1.2 NLP → 1.3 Preprocessing → 1.4 Knowledge Base → 1.5 Clustering → O/P
DOCUMENT CLUSTERING – WHAT AND WHY?
• Unsupervised document organization
• Automatic topic organization
• Topic extraction
• Fast Information retrieval and filtering
EXAMPLES
• Web document clustering for search users.
• QA document clustering to solve common problems and questions.
WHY K-MEANS? WHY NOT A HIERARCHICAL ALGORITHM?
• Time complexity – K-means is more efficient than hierarchical clustering and still gives sufficient information for our purposes.
CLUSTERING
• Algorithm (sketched below)
• Find the k most dissimilar documents
• Assign them as the k initial centroids
• Repeat until no change:
• For each document, find its most similar cluster
• Use the cosine similarity function
• Recalculate the centroid of each cluster
• Stop if no document was reassigned
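A compact sketch of the loop above, assuming the documents are already TF-IDF vectors (numpy arrays) and the initial centroids have been chosen; this is an illustration, not our production code:

    import numpy as np

    def cosine_sim(a, b):
        # cosine similarity, guarding against zero vectors
        denom = (np.linalg.norm(a) * np.linalg.norm(b)) or 1.0
        return float(np.dot(a, b)) / denom

    def kmeans(docs, centroids, max_iter=100):
        for _ in range(max_iter):
            # assign each document to its most similar centroid
            labels = [max(range(len(centroids)),
                          key=lambda k: cosine_sim(d, centroids[k]))
                      for d in docs]
            # recalculate each centroid as the mean of its cluster
            new_centroids = []
            for k in range(len(centroids)):
                members = [d for d, l in zip(docs, labels) if l == k]
                new_centroids.append(np.mean(members, axis=0) if members else centroids[k])
            # stop if no centroid (hence no assignment) changed
            if all(np.allclose(c, n) for c, n in zip(centroids, new_centroids)):
                break
            centroids = new_centroids
        return labels, centroids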
K-MEANS USING JACCARD DISTANCE MEASURE
• Problems in the simple K-Means procedure:
• Greedy algorithm
• Doesn't guarantee the best solution.
• JACCARD distance measure
• Find the k most dissimilar documents (sketched below).
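A small sketch of seed selection using the Jaccard distance on token sets (a greedy farthest-first pick, as an illustration):

    def jaccard_distance(a, b):
        a, b = set(a), set(b)
        return 1.0 - len(a & b) / (len(a | b) or 1)

    def pick_seeds(token_docs, k):
        seeds = [0]                       # start from the first document
        while len(seeds) < k:
            # greedily add the document farthest (on average) from the chosen seeds
            rest = [i for i in range(len(token_docs)) if i not in seeds]
            nxt = max(rest, key=lambda i: sum(jaccard_distance(token_docs[i], token_docs[s])
                                              for s in seeds))
            seeds.append(nxt)
        return seeds

    docs = [["recharge", "mobile", "visit", "link"],
            ["recharge", "landline", "visit", "link"],
            ["cancel", "ticket", "process"],
            ["add", "money", "wallet"]]
    print(pick_seeds(docs, 3))            # -> [0, 2, 3], matching {{0},{2},{3}}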
OUTPUT OF PREPROCESSING
• Possible text documents are :
• Recharge mobile visit link
• Recharge landline visit link
• Cancel ticket process
• Add money wallet
CALCULATING TF-IDF VECTORS
• Term Frequency – Inverse Document Frequency
• A weight that ranks the importance of a term
• Terms frequent in a document but rare across the document set get high weight
• Ex: "College name NITS" – 'name' is frequent but not rare, so it carries little weight
TF-IDF VECTOR SPACE
Doc      Add    Cancel   Recharge   landline   link   mobile   money   process   ticket   visit   wallet
Doc 0    0.00   0.00     0.17       0.00       0.17   0.35     0.00    0.00      0.00     0.17    0.00
Doc 1    0.00   0.00     0.17       0.35       0.17   0.00     0.00    0.00      0.00     0.17    0.00
Doc 2    0.00   0.46     0.00       0.00       0.00   0.00     0.00    0.46      0.46     0.00    0.00
Doc 3    0.46   0.00     0.00       0.00       0.00   0.00     0.46    0.00      0.00     0.00    0.46
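A short sketch of computing such a TF-IDF matrix with scikit-learn (the exact numbers differ from the table above because scikit-learn uses a slightly different TF-IDF formula and normalisation):

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["recharge mobile visit link",
            "recharge landline visit link",
            "cancel ticket process",
            "add money wallet"]

    vec = TfidfVectorizer()
    tfidf = vec.fit_transform(docs)            # 4 x 11 document-term matrix
    print(vec.get_feature_names_out())         # the 11 terms
    print(tfidf.toarray().round(2))            # one TF-IDF vector per document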
SELECT K CLUSTERS (K = 3)
• Use the Jaccard distance measure – seed documents {{0},{2},{3}}
Document i   Document j   Similarity
0            1            0.6
0            2            0.00
0            3            0.00
1            2            0.00
1            3            0.00
2            3            0.00
AFTER FIRST ITERATION
• Assign each document to its most similar cluster – {{0,1},{2},{3}}
• Centroid centers after the 1st iteration:
Cluster   Add    Cancel   Recharge   landline   link   mobile   money   process   ticket   visit   wallet
{0,1}     0.00   0.00     0.17       0.17       0.17   0.17     0.00    0.00      0.00     0.17    0.00
{2}       0.00   0.46     0.00       0.00       0.00   0.00     0.00    0.46      0.46     0.00    0.00
{3}       0.46   0.00     0.00       0.00       0.00   0.00     0.46    0.00      0.00     0.00    0.46
CLUSTERING OUTPUT
• { { Recharge mobile visit link, Recharge landline visit link },
{ Cancel ticket process },
{ Add money wallet }
}
TRAINING MODEL
1.1 Raw Data → 1.2 NLP → 1.3 Preprocessing → 1.4 Knowledge Base → 1.5 Clustering → O/P
TENSORFLOW
• What
• Why
• Where
PROGRAMMING MODEL AND BASIC CONCEPTS
• Computation Graph
• Nodes
• Tensors
• Session
• Extend
• Run
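A minimal TensorFlow 1.x-style sketch (the API current at the time of this project) showing a computation graph being built and then executed through a session:

    import tensorflow as tf   # TensorFlow 1.x graph/session API

    # Nodes (operations) are added to a computation graph; tensors flow along its edges.
    a = tf.constant(2.0, name="a")
    b = tf.constant(3.0, name="b")
    c = tf.add(a, b, name="c")        # a node with two inputs and one output tensor

    # The client creates a Session and calls run() with the outputs it needs computed.
    with tf.Session() as sess:
        print(sess.run(c))            # -> 5.0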
COMPUTATION GRAPH
IMPLEMENTATION
• Single Device Execution
• Multi Device Execution
• Cross Device Communication
SINGLE DEVICE EXECUTION
CROSS DEVICE COMMUNICATION
PERFORMANCE
• Data Parallel Training
• Model Parallel Training
• Concurrent Step for Model Computation Pipelining
DATA PARALLEL TRAINING
MODEL PARALLEL AND CONCURRENT STEPS
CLUSTERING USING TENSORFLOW
• Training Sets
• Nodes
• Data flow
• Feed as Input
• Output
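A small TensorFlow 1.x-style sketch of one assignment step: TF-IDF vectors are fed as input, the graph computes cosine similarities against the current centroids, and the output is a cluster index per document (the matrices here are random stand-ins, not our real data):

    import numpy as np
    import tensorflow as tf

    doc_matrix = np.random.rand(4, 11).astype("float32")    # stand-in for TF-IDF vectors
    seed_matrix = doc_matrix[[0, 2, 3]]                      # stand-in for k = 3 centroids

    docs = tf.placeholder(tf.float32, shape=[None, 11], name="docs")
    centroids = tf.placeholder(tf.float32, shape=[None, 11], name="centroids")

    # cosine similarity = dot product of L2-normalised rows
    similarity = tf.matmul(tf.nn.l2_normalize(docs, 1),
                           tf.nn.l2_normalize(centroids, 1),
                           transpose_b=True)
    assignments = tf.argmax(similarity, 1)                   # most similar cluster per doc

    with tf.Session() as sess:
        labels = sess.run(assignments,
                          feed_dict={docs: doc_matrix, centroids: seed_matrix})
        print(labels)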
QUERY MODEL
2.1 Query → 2.2 NLP → 2.3 Preprocessing → 2.4 Recommendation Engine → O/P
RECOMMENDATION ENGINE
• The Recommendation Engine analyzes the available data to answer the question
• The various steps are:
1. Data collection
2. Preprocessing and Transformations
3. Classifier Ensemble
PREPROCESSING AND TRANSFORMATIONS
• The training set consists of FAQs, past forum data, etc.
• Given a question, we want to deduce its genre from the text
• Only the text of the question is extracted.
• Feature selection evaluates the importance of a word using TF-IDF
PREPROCESSING AND TRANSFORMATIONS
• Training set derived from the key parts of speech in each sentence
Example:          How to recharge my mobile
Part of speech:   Verb, Noun, Object
Decision label:   Task, Electronics
PREPROCESSING AND TRANSFORMATIONS
• Preprocessed query: "recharge mobile"
• Find its TF-IDF vector
• Compare it with the distinct clusters using cosine similarity (sketched below)
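A minimal sketch of this comparison using scikit-learn (the cluster texts and variable names are illustrative):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # one representative "document" per cluster (illustrative)
    cluster_texts = ["recharge mobile visit link recharge landline visit link",
                     "cancel ticket process",
                     "add money wallet"]

    vec = TfidfVectorizer().fit(cluster_texts)
    centroids = vec.transform(cluster_texts)

    query_vec = vec.transform(["recharge mobile"])     # the preprocessed query
    scores = cosine_similarity(query_vec, centroids)[0]
    print(scores.argmax(), scores)                      # index of the most similar cluster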
CLASSIFIER ENSEMBLE
• Ensemble modelling is used for classification, combining three classifiers:
• Naïve Bayesian using FAQ training set
• POS Naïve Bayesian
• Threshold Biasing classifier
ENSEMBLE STRUCTURE
• Learning algorithm that uses multiple classifiers
• Classify using a weighted vote for their decisions
• The classifier with better precision carries more weight in the vote (sketched below)
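A hedged sketch of the weighted vote (the classifier objects and weights are placeholders, not our trained models):

    from collections import Counter

    def ensemble_predict(classifiers, weights, question):
        """Weighted majority vote over the genre predictions of several classifiers."""
        votes = Counter()
        for clf, weight in zip(classifiers, weights):
            votes[clf.predict(question)] += weight    # weight reflects classifier precision
        return votes.most_common(1)[0][0]             # genre with the highest tally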
RESULTS
• Documents are hand-tagged with the genres
• In the ensemble approach, we use a bag of the predicted genres
• The count of each genre is tallied
• The top-tallied genre is used to generate the result
• Answer is "recharge mobile visit link"
QUERY MODEL
2.1 Query → 2.2 NLP → 2.3 Preprocessing → 2.4 Recommendation Engine → O/P
INNOVATION
• Sections Removed
• User friendly
• Reduced manpower
• Future plans to integrate with the college website.
CONCLUSION AND OUTCOMES
The outcomes of this project include (but are not limited to) the following points :-
1. Complete Designed Architecture.
2. Proper modules and uses defined.
3. Model solution to the problem.
Hence we conclude that the theoretical and survey aspects of the problem are complete. We have selected the most suitable technical solutions after surveying the existing alternatives. A working model is therefore expected from the team soon.
LITERATURE SURVEY
1. Natural Language Annotations for Question Answering – Boris Katz, Gary Borchardt, Sue Felshin
2. Using English for Indexing and Retrieving – Boris Katz
3. Recommendation engine: Matching individual/group profiles for better shopping experience – Sanjeev Kulkarni, Ashok M. Sanpal, Ravindra R. Mudholkar, Kiran Kumari
4. Recommendation engine for Reddit – Hoang Nguyen, Rachel Richards, C. C. Chan, Kathy J. Liszka
5. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems – Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo
6. Executing a program on the MIT tagged-token dataflow architecture. IEEE Trans. Comput., 1990 – Arvind, Rishiyur S. Nikhil
7. An efficient K-Means Algorithm integrated with Jaccard Distance Measure for Document Clustering – Mushfeq-Us-Saleheen Shameem, Raihana Ferdous
8. An Intelligent Similarity Measure for Effective Text Document Clustering – M. L. Aishwarya, K. Selvi
9. K Means Clustering with Tf-idf Weights – Jonathan Zong
10. Comparison Between K-Mean and Hierarchical Algorithm Using Query Redirection – Manpreet Kaur, Usvir Kaur
11. Question Answering System on Education Acts Using NLP Techniques – Dr. M. M. Raghuwanshi
12. Affective – Hierarchical Classification of Text – An Approach Using NLP Toolkit – Dr. R. Venkatesan
13. Building high-level features using large scale unsupervised learning. In ICML 2012 – Quoc Le, Marc'Aurelio Ranzato, Rajat Monga, Andrew Ng
14. Preprocessing Techniques for Text Mining – An Overview – Dr. S. Vijayarani, Ms. J. Ilamathi, Ms. Nithya
THANK YOU !!


Editor's Notes

  • #28 For example, as the amount of online information increases rapidly, users as well as information-retrieval systems need to classify the desired documents against a specific query – web document clustering for search users; QA document clustering to solve common problems and questions.
  • #30 Hierarchical algorithms include single link, complete linkage, and group average. Applications are online and offline; online applications are usually constrained by efficiency problems compared to offline applications. Hierarchical algorithms produce more in-depth information for detailed analyses, while K-means is more efficient and provides sufficient information for most purposes.
  • #34 TF is the ratio of the number of occurrences of a word in its document to the total number of words in that document, i.e. the fraction of the document that is a particular term. IDF is the ratio of the number of documents in the corpus to the number of documents containing the given term; inverting the document frequency and taking the logarithm assigns a higher weight to rarer terms. Ex: "College name NITS" – "name" is frequent but not rare.
  • #40 Where – used for conducting research and for deploying ML into production. Wide range of applications in fields like NLP, recommendation engines, geographical information extraction and computational drug discovery.
  • #41 Nodes are instantiations of operations (multiple inputs and outputs); a tensor is an arbitrary-dimensional array (flowing from outputs to inputs). Client programs interact with TensorFlow by creating a session. Initially there are no nodes and edges. The session interface supports Extend and Run: Extend augments the graph with nodes and edges, and Run takes the set of output names that need to be computed.
  • #42 A Variable is a special kind of operation that returns a handle to a persistent, mutable tensor that survives across executions of the graph.
  • #43 Single device: nodes of the graph are executed in an order that respects the dependencies between the nodes. Multi device: deciding which device to place the computation of each node of the graph on, and managing communication of data across device boundaries.
  • #45 Feasibility of a device: a greedy heuristic chooses the one that gives the best result, and the kernel must implement the particular operation. Any cross-device edge from x to y is replaced by a send node and a receive node.
  • #46 Data parallel training: one simple technique for speeding up SGD is to parallelize the computation of the gradient for a mini-batch across the mini-batch elements.