Determining Customer Satisfaction Using Machine Learning

DETERMINING CUSTOMER SATISFACTION
IN E-COMMERCE
Chaza Alkis, Abdurrahim Derric
Department of Computer Engineering
Yildiz Technical University, 34220 Istanbul, Türkiye
shaza.alqays@hotmail.com, abdelrahimdarrige@gmail.com
Abstract—This work aims to give customers greater
confidence in their purchases and to give sellers more
information about their products and about what customers
want. Using machine learning, the comments left on a product
are analyzed so that comments added later can be evaluated
automatically and the product can be rated as good or bad.
Keywords—Machine Learning, User Review.
I. INTRODUCTION
The customer satisfaction determination system for
e-commerce helps customers keep their purchases under
control by predicting whether a chosen product is likely to
be rated as good or not. The system aims to convert informal
customer comments into a more structured form, so that
prediction and related operations become faster, simpler
and easier. In this way, it aims to establish healthier
communication with the producer or seller and to keep
customer satisfaction as high as possible.
II. NATURAL LANGUAGE PROCESSING
Natural Language Processing [1], usually shortened to
NLP, is a branch of artificial intelligence that deals with
the interaction between computers and humans using
natural language. It is the technology used to help
computers understand human language.
The ultimate objective of NLP is to read, decipher,
understand, and make sense of human languages in a
manner that is valuable.
Most NLP techniques rely on machine learning to derive
meaning from human languages.
III. TEXT VECTORIZATION
Machine learning algorithms operate on a numeric
feature space, expecting input as a two-dimensional array
where rows are instances and columns are features. In order
to perform machine learning on text, we need to transform
our documents into vector representations such that we
can apply numeric machine learning. This process is called
feature extraction or more simply, vectorization [2], and is
an essential first step toward language-aware analysis.
IV. CLASSIFIERS
A. Naïve Bayes
Naïve Bayes classifiers are a collection of classification
algorithms based on Bayes’ Theorem [3]. Naïve Bayes is not
a single algorithm but a family of algorithms that share a
common principle: every pair of features being classified is
assumed to be independent of each other, given the class.
Bayes’ Theorem states that
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
where $A$ and $B$ are events and $P(B) \neq 0$.
B. Bernoulli naïve Bayes
This is similar to the multinomial naïve Bayes [3] variant,
but the predictors are Boolean variables. The features that
we use to predict the class variable take only the values yes
or no, for example whether a word occurs in the text or not.
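The Bernoulli model described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used in the study: the function names, the toy comments, and the use of Laplace smoothing are all assumptions made for the example.

```python
import math
from collections import defaultdict


def train_bernoulli_nb(docs, labels, vocabulary):
    """Estimate class priors and per-class word-presence probabilities."""
    class_counts = defaultdict(int)
    word_counts = defaultdict(lambda: defaultdict(int))
    for tokens, label in zip(docs, labels):
        class_counts[label] += 1
        # Only presence matters, so each word is counted once per document.
        for word in set(tokens) & set(vocabulary):
            word_counts[label][word] += 1
    total = len(docs)
    priors = {c: n / total for c, n in class_counts.items()}
    # Laplace smoothing (assumed here) keeps every probability strictly
    # between 0 and 1, so the logarithms below are always defined.
    likelihoods = {
        c: {w: (word_counts[c][w] + 1) / (class_counts[c] + 2) for w in vocabulary}
        for c in class_counts
    }
    return priors, likelihoods


def predict_bernoulli_nb(tokens, priors, likelihoods):
    """Return the class with the highest log-posterior score."""
    present = set(tokens)
    scores = {}
    for c, prior in priors.items():
        score = math.log(prior)
        for word, p in likelihoods[c].items():
            # A word contributes p if present and (1 - p) if absent.
            score += math.log(p if word in present else 1.0 - p)
        scores[c] = score
    return max(scores, key=scores.get)


# Hypothetical toy comments, already tokenized.
docs = [["good", "fast"], ["good", "quality"], ["bad", "broken"], ["bad", "slow"]]
labels = ["pos", "pos", "neg", "neg"]
vocab = ["good", "bad", "fast", "slow", "broken", "quality"]

priors, likelihoods = train_bernoulli_nb(docs, labels, vocab)
prediction = predict_bernoulli_nb(["good", "product"], priors, likelihoods)
```

Note that absent words also influence the score, which is what distinguishes the Bernoulli variant from a model that only looks at the words that occur.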
C. Support Vector Machines
A Support Vector Machine (SVM) [4] is a supervised
machine learning algorithm that can be employed for
both classification and regression purposes. SVMs are more
commonly used in classification problems and as such, this
is what we will focus on in this part.
SVMs are based on the idea of finding a hyperplane that
best divides a dataset into two classes, as shown in Figure
1.
Figure 1 SVM

D. Stochastic gradient descent
The gradient descent algorithm may be infeasible when
the training data size is huge. Thus, a stochastic version of
the algorithm is often used instead.
To motivate the use of stochastic optimization algorithms,
note that when training deep learning models, we often
consider the objective function as a sum of a finite number
of functions:
$$f(x) = \frac{1}{n} \sum_{i=1}^{n} f_i(x)$$
where $f_i(x)$ is a loss function based on the training data
instance indexed by $i$.
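As a rough sketch of the idea, the loop below updates the parameter using the gradient of a single randomly chosen $f_i$ per step instead of the full sum. The function names, the squared-error loss, the toy data and the learning rate are assumptions chosen for illustration only.

```python
import random


def sgd(grad_fi, n, x0, learning_rate=0.05, steps=200, seed=0):
    """Minimize (1/n) * sum_i f_i(x) using one sampled gradient per step."""
    rng = random.Random(seed)
    x = x0
    for _ in range(steps):
        i = rng.randrange(n)                 # pick one training instance
        x -= learning_rate * grad_fi(i, x)   # step against its gradient
    return x


# Hypothetical one-dimensional regression data with y = 2 * x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]


def grad_fi(i, w):
    # Gradient of f_i(w) = 0.5 * (w * xs[i] - ys[i]) ** 2 with respect to w.
    return (w * xs[i] - ys[i]) * xs[i]


w = sgd(grad_fi, n=len(xs), x0=0.0)
```

Each step costs one gradient evaluation regardless of $n$, which is why the stochastic version remains feasible when the training set is very large.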
E. The random forest algorithm
Random forest is a flexible, easy-to-use machine learning
algorithm that produces good results most of the time, even
without hyper-parameter tuning. It is also one of the most
widely used algorithms because of its simplicity and the fact
that it can be used for both classification and regression
tasks.
Figure 2 shows what a random forest [5] with two trees
looks like.
Figure 2 Random forest tree
F. Logistic regression
Logistic regression is the appropriate regression analysis
to conduct when the dependent variable is dichotomous
(binary). Like all regression analyses, the logistic regression
is a predictive analysis. Logistic regression [6] is used
to describe data and to explain the relationship between
one dependent binary variable and one or more nominal,
ordinal, interval or ratio-level independent variables.
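To make the relationship between a binary dependent variable and an independent variable concrete, the following sketch fits a one-feature logistic regression by batch gradient descent. It is only an illustration under assumed names, data and hyper-parameters; it is not the classifier configuration used in the study.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, learning_rate=0.5, epochs=200):
    """Fit p(y = 1 | x) = sigmoid(w * x + b) by minimizing the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradients of the average log loss with respect to w and b.
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum((sigmoid(w * x + b) - y) for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Hypothetical data: negative feature values belong to class 0, positive to class 1.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

w, b = fit_logistic(xs, ys)
p_high = sigmoid(w * 1.5 + b)    # estimated probability of class 1 at x = 1.5
p_low = sigmoid(w * -1.5 + b)    # estimated probability of class 1 at x = -1.5
```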
V. MODELS OF REPRESENTATION
A. Bag of words
The bag-of-words model is a simplifying representation
used in natural language processing and information
retrieval (IR). In this model, a text (such as a sentence or a
document) is represented as the bag (multi-set) of its words,
disregarding grammar and even word order but keeping
multiplicity. The approach is very simple and flexible, and
can be used in a myriad of ways for extracting features from
documents.
A bag-of-words is a representation of text that describes
the occurrence of words within a document. It involves two
things:
1. A vocabulary of known words.
2. A measure of the presence of known words.
It is called a “bag” of words, because any information about
the order or structure of words in the document is discarded.
The model is only concerned with whether known words
occur in the document, not where in the document.
"A very common feature extraction procedures for sentences
and documents is the bag-of-words approach (BOW). In this
approach, we look at the histogram of the words within the
text, i.e. considering each word count as a feature" [7].
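The two ingredients listed above, a vocabulary and a measure of word presence, can be sketched as follows. The helper names and the two toy comments are hypothetical, and whitespace splitting stands in for a real tokenizer.

```python
from collections import Counter


def build_vocabulary(documents):
    """Collect the sorted set of known words across all documents."""
    return sorted({word for doc in documents for word in doc.lower().split()})


def bag_of_words(document, vocabulary):
    """Count how often each vocabulary word occurs, ignoring word order."""
    counts = Counter(document.lower().split())
    return [counts.get(word, 0) for word in vocabulary]


# Hypothetical comments.
documents = ["good product good price", "bad product"]
vocabulary = build_vocabulary(documents)
vectors = [bag_of_words(doc, vocabulary) for doc in documents]
```

Each row of `vectors` is one document and each column one vocabulary word, which is the two-dimensional layout that machine learning algorithms expect.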
B. Bag of ngrams
Bag-of-n-grams (BON) models are commonly used for
representing text. One of the main drawbacks of the
traditional BON model is that it ignores the semantics of the
n-grams.
N-grams [8] are contiguous sequences of n items in a
sentence. N can be 1, 2 or any other positive integer,
although usually we do not consider very large N because
such n-grams rarely appear in many different places.
When performing machine learning tasks related to natural
language processing, we usually need to generate n-grams
from input sentences. For example, in text classification
tasks, in addition to using each individual token found in
the corpus, we may want to add bi-grams or tri-grams as
features to represent our documents.
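Generating word n-grams from a sentence can be sketched as below. The function name and the example sentence are made up for illustration, and whitespace splitting is assumed in place of a proper tokenizer.

```python
def word_ngrams(sentence, n):
    """Return all contiguous sequences of n tokens in the sentence."""
    tokens = sentence.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "the product is good"
unigrams = word_ngrams(sentence, 1)
bigrams = word_ngrams(sentence, 2)
trigrams = word_ngrams(sentence, 3)
```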
There are two types of n-grams: word-based n-grams and
character-based n-grams. We explain both of them, but we
focus on character-based n-grams.
1) Word-based n-grams: Word-based text
representations require language-dependent tools, such
as a tokenizer (to split the message into tokens) and
usually a lemmatizer plus a list of stop words (to reduce
the dimensionality of the problem). Moreover, effective
lemmatizers are not available for every natural language,
especially for morphologically rich languages. Word
n-grams [9], i.e., contiguous sequences of n words, attempt
to take advantage of contextual phrasal information.
However, word n-grams considerably increase the
dimensionality of the problem, and the results so far are
not encouraging.
2) Character-based n-grams: In training and testing,
we focus on a different but simple text representation.
In particular, each message is considered as a bag of
character n-grams, that is, strings of length n. Character
n-grams are able to capture information on various levels:
lexical (|the |, |free|), word-class (|ed |, |ing |),
punctuation mark usage (|!!!|, |f.r.|), etc. In addition, they
are robust to grammatical errors (e.g., the word-tokens
‘assignment’ and ‘asignment’ share the majority of their
character n-grams) and to strange usage of abbreviations,
punctuation marks, etc. The bag of character n-grams
representation is language-independent and does not require
any text preprocessing (tokenizer, lemmatizer, or other
‘deep’ NLP tools). It has already been used in several tasks
including language identification, authorship attribution,
and topic-based text categorization with remarkable results
in comparison to word-based representations.
An important characteristic of character-level n-grams
[9] is that they avoid (at least to a great extent) the problem
of sparse data that arises when using word-level n-grams.
That is, there are far fewer character combinations than
word combinations; therefore, fewer n-grams will have zero
frequency. On the other hand, the proposed representation
still produces a considerably larger feature set in comparison
with traditional bag-of-words representations. Therefore,
learning algorithms able to deal with high-dimensional
spaces should be used.
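The robustness of character n-grams to spelling errors can be checked with a small sketch. The helper name and the misspelled word are chosen only for illustration, and a set is used instead of a bag to keep the comparison simple.

```python
def char_ngrams(text, n):
    """Return the set of all character substrings of length n."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


correct = char_ngrams("assignment", 3)
misspelled = char_ngrams("asignment", 3)   # hypothetical spelling error

shared = correct & misspelled
overlap = len(shared) / len(correct | misspelled)   # Jaccard similarity
```

Here the two spellings share six of their character 3-grams, whereas a word-level representation would treat them as two unrelated tokens.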
C. FastText
The fastText module allows training word embeddings from
a training corpus, with the additional ability to obtain word
vectors for out-of-vocabulary words.
fastText is a library for efficient learning of word
representations and sentence classification. It is
written in C++ and supports multiprocessing during
training. FastText allows you to train supervised and
unsupervised representations of words and sentences.
These representations (embeddings) can be used for
numerous applications, such as data compression, features
for additional models, candidate selection, or
initializers for transfer learning.
FastText supports training continuous bag of words (CBOW)
or skip-gram models using negative sampling, softmax or
hierarchical softmax loss functions.
1) Representation: FastText is able to achieve very
good performance for word representations and sentence
classification, especially in the case of rare words, by
making use of character-level information.
Each word is represented as a bag of character n-grams in
addition to the word itself. For example, for the word
matter, with n = 3, the fastText character n-grams are
<ma, mat, att, tte, ter, er>. The symbols < and >
are added as boundary symbols to distinguish an n-gram of
a word from a word itself. For example, if the word mat
is part of the vocabulary, it is represented as <mat>. This
helps preserve the meaning of shorter words that may show
up as n-grams of other words. Inherently, this also allows
you to capture meaning for suffixes and prefixes.
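The boundary-symbol scheme can be reproduced with a short sketch. This is an illustration of the idea only, not fastText's own code; the function name is an assumption, and a single fixed n is used for simplicity.

```python
def boundary_ngrams(word, n=3):
    """Wrap the word in < and > and return its character n-grams in order."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]


ngrams = boundary_ngrams("matter")    # n-grams of the word "matter"
whole_word = "<mat>"                  # the separate word "mat", with boundaries
is_distinct = whole_word not in ngrams
```

Because the whole word mat is stored as <mat>, it does not collide with the 3-gram mat that occurs inside matter.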
2) Reading data: While the training for fastText is
multi-threaded, reading the data in is done via a single
thread. The parsing and tokenization are done when the
input data is read.
FastText [10] initializes a couple of vectors to keep track
of the input information. Vector size can be limiting when
training on a large corpus, and can effectively be increased
while maintaining performance.
A word gets discarded only if, during the training stage, a
random draw from a uniform distribution between 0 and
1 is greater than the probability of discard.
3) Training: Once the input and hidden vectors are
initialized, multiple training threads are kicked off. All the
training threads hold a shared pointer to the matrices for
input and hidden vectors. The threads all read from the
input file, updating the model with each input line that is
read.
The input vectors are the vector representations of the
original word and of all the n-grams of that word. The loss
is computed in the forward pass, and the resulting weight
updates propagate all the way back to the vectors of the
input layer in the backpropagation pass. This tuning of the
input vector weights during the backpropagation pass is
what allows us to learn representations that maximize
co-occurrence similarity. The learning rate determines how
much each particular instance affects the weights.
VI. PRODUCT ANALYSIS
The ratio of negative to positive comments for a given
product is generally unbalanced; in our comment dataset
there are about six times as many positive comments as
negative ones.
For the preceding semantic analysis of text, a balanced
dataset of comments was used and high scores were achieved.
In this section we treat the comments of each product,
both negative and positive, as a dataset of its own,
taking the imbalance of the data into account.
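On unbalanced data, accuracy alone can be misleading, which is why precision, recall and F-score are also reported. The sketch below shows one common way to compute the four measures for a chosen class. The function name, the toy labels and the choice of the negative class as the class of interest are assumptions; the exact averaging used for the figures reported in this paper is not specified.

```python
def binary_metrics(y_true, y_pred, target):
    """Compute accuracy, precision, recall and F-score for the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == target and p == target)
    fp = sum(1 for t, p in pairs if t != target and p == target)
    fn = sum(1 for t, p in pairs if t == target and p != target)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return accuracy, precision, recall, f_score


# Hypothetical product with a 6:1 ratio of positive to negative comments,
# scored against a classifier that always predicts "pos".
y_true = ["pos"] * 6 + ["neg"]
y_pred = ["pos"] * 7

accuracy, precision, recall, f_score = binary_metrics(y_true, y_pred, target="neg")
```

In this toy case the accuracy is 6/7 although not a single negative comment is detected, so recall and F-score for the negative class are zero.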
The product datasets were extracted from a comment dataset
of 40,000 comments. Considering only products with at least
100 comments, the study results are as follows.
A. Using BoW
The highest scores were achieved with the SVM classifier,
except for accuracy, which was highest with random
forest.
Accuracy: 0.8807
Precision: 0.7065
Recall: 0.6478
F-score: 0.6546
These results are illustrated in Figure 3.
Figure 3 BoW - all products
B. Using BoN
The highest scores were achieved with SVM.
Accuracy: 0.8784
Precision: 0.7178
Recall: 0.6506
F-score: 0.6590

Figure 4 BoN - all products
C. Using fastText
The highest scores were achieved with SVM, SVR, and
logistic regression.
Accuracy: 0.98
Precision: 0.98
Recall: 0.98
F-score: 0.98
Figure 5 Fasttext - all products
VII. CONCLUSION
Text cannot be processed directly by a computer; we need
to find a suitable way to represent it numerically so that
the computer can handle it. For this purpose, three
representations were used: bag of words, bag of n-grams
and fastText.
We observed high accuracy and good results with bag of
words, which is the simplest way to represent text, although
this representation is relatively slow. We also observed
good results with the bag of character n-grams;
nevertheless, it is a very time-consuming representation.
With fastText, we noticed that choosing a large vector size
yields better results, but this representation showed a lower
accuracy than the others.
Finally, there are many representations and algorithms that
can be applied to any text, but no single one is best in all
respects; various methods should be tried and the most
appropriate one chosen.
The comparison of the three algorithms for products is
shown in Figure 6.
Figure 6 Algorithm comparison
REFERENCES
[1] D. M. J. Garbade. (2018) A simple introduction to natural language processing. [Online]. Available: https://becominghuman.ai/a-simple-introduction-to-natural-language-processing-ea66a1747b32
[2] O’Reilly. (2019) Text vectorization and transformation pipelines. [Online]. Available: https://www.oreilly.com/library/view/applied-text-analysis/9781491963036/ch04.html
[3] R. Gandhi. (2018) Naive bayes classifier. [Online]. Available: https://towardsdatascience.com/naive-bayes-classifier-81d512f50a7c
[4] N. Bambrick and AYLIEN. (2016) Support vector machines. [Online]. Available: https://www.kdnuggets.com/2016/07/support-vector-machines-simple-explanation.html
[5] N. Donges. (2018) The random forest algorithm. [Online]. Available: https://towardsdatascience.com/the-random-forest-algorithm-d457d499ffcd
[6] Statistics Solutions. (2019) Logistic regression. [Online]. Available: https://www.statisticssolutions.com/what-is-logistic-regression/
[7] Y. Goldberg, “Synthesis lectures on human language technologies,” in Neural Network Methods in Natural Language Processing. Morgan and Claypool Publishers, 2017, p. 69.
[8] A. A. Yeung. (2018) Generating n-grams from sentences python. [Online]. Available: http://www.albertauyeung.com/post/generating-ngrams-python/
[9] I. Kanaris, K. Kanaris, I. Houvardas, and E. Stamatatos, “Words vs. character n-grams for anti-spam filtering,” International Journal on Artificial Intelligence Tools, vol. XX, no. X, pp. 1–20, 2006.
[10] N. Subedi. (2018) Fasttext: Under the hood. [Online]. Available: https://towardsdatascience.com/fasttext-under-the-hood-11efc57b2b3
