DETERMINING CUSTOMER SATISFACTION
IN E-COMMERCE
Chaza Alkis, Abdurrahim Derric
Department of Computer Engineering
Yildiz Technical University, 34220 Istanbul, Türkiye
shaza.alqays@hotmail.com, abdelrahimdarrige@gmail.com
Abstract—This work aims to give the customer a high degree of
confidence, and the seller more information about products and
customer preferences, by applying modern technology and machine
learning to the comments left on a product: newly added comments
are evaluated automatically, and the product is thereby rated as
good or bad.
Keywords—Machine Learning, User Review.
I. INTRODUCTION
The customer satisfaction determination system in
e-commerce keeps your purchases under control by giving
you the opportunity to predict whether the chosen product
is marked as good or not. This program aims to convert
amateurish information into a more professional structure
and to make predictions and other related transactions
faster, simpler and easier. The goal is thus to establish
healthier communication with the producer or seller and to
keep customer satisfaction as high as possible.
II. NATURAL LANGUAGE PROCESSING
Natural Language Processing is the technology used to
help computers understand natural human language.
Natural Language Processing [1], usually shortened to
NLP, is a branch of artificial intelligence that deals with
the interaction between computers and humans using
natural language.
The ultimate objective of NLP is to read, decipher,
understand, and make sense of the human languages in a
manner that is valuable.
Most NLP techniques rely on machine learning to derive
meaning from human languages.
III. TEXT VECTORIZATION
Machine learning algorithms operate on a numeric
feature space, expecting input as a two-dimensional array
where rows are instances and columns are features. In order
to perform machine learning on text, we need to transform
our documents into vector representations such that we
can apply numeric machine learning. This process is called
feature extraction or more simply, vectorization [2], and is
an essential first step toward language-aware analysis.
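As a minimal sketch of this step (assuming Python and scikit-learn, which the paper does not name; the example comments are invented), documents become rows of a two-dimensional feature matrix:

```python
# A hedged sketch of text vectorization; scikit-learn is an assumed tool.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "great product, fast shipping",
    "broken on arrival, very disappointing",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# rows are instances (documents), columns are features (terms)
print(X.shape)
```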
IV. CLASSIFIERS
A. Naïve Bayes
Naïve Bayes classifiers are a collection of classification
algorithms based on Bayes’ Theorem [3]. It is not a single
algorithm but a family of algorithms that all share a
common principle: every pair of features being classified
is independent of each other.
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
where $A$ and $B$ are events and $P(B) \neq 0$.
B. Bernoulli naïve Bayes
This is similar to multinomial naïve Bayes [3], but the
predictors are Boolean variables. The parameters we use to
predict the class variable take only the values yes or no,
for example whether a word occurs in the text or not.
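A minimal sketch of this idea, assuming scikit-learn; the comments and labels below are invented for illustration:

```python
# Bernoulli naive Bayes over binary word-occurrence features.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

comments = ["love it", "terrible quality", "works great", "waste of money"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative (hypothetical labels)

# binary=True records only whether a word occurs, not how often
X = CountVectorizer(binary=True).fit_transform(comments)
clf = BernoulliNB().fit(X, labels)

print(clf.predict(X[:1]))  # predicted class for the first comment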
C. Support Vector Machines
A Support Vector Machine (SVM) [4] is a supervised
machine learning algorithm that can be employed for
both classification and regression purposes. SVMs are more
commonly used in classification problems and as such, this
is what we will focus on in this part.
SVMs are based on the idea of finding a hyperplane that
best divides a dataset into two classes, as shown in Figure
1.
Figure 1 SVM
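The following sketch (scikit-learn assumed, toy data invented) fits such a separating hyperplane:

```python
# A linear SVM learns the hyperplane that best divides the two classes.
import numpy as np
from sklearn.svm import LinearSVC

X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])

clf = LinearSVC().fit(X, y)
print(clf.coef_, clf.intercept_)  # parameters of the separating hyperplane
```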
D. Stochastic gradient descent
The gradient descent algorithm may be infeasible when
the training data size is huge. Thus, a stochastic version of
the algorithm is often used instead.
To motivate the use of stochastic optimization algorithms,
note that when training deep learning models, we often
consider the objective function as a sum of a finite number
of functions:
$$f(x) = \frac{1}{n} \sum_{i=1}^{n} f_i(x)$$
where $f_i(x)$ is a loss function based on the training data
instance indexed by $i$.
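To make the stochastic update concrete, here is a toy sketch (our own illustration, not the paper's model): each step uses the gradient of a single randomly chosen $f_i$ instead of the full sum.

```python
# Toy SGD: minimize f(x) = (1/n) * sum_i (x - a_i)^2 one instance at a time.
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=1000)      # one "training instance" a_i per component f_i
x, lr = 0.0, 0.01

for _ in range(5000):
    i = rng.integers(len(a))   # draw a single instance index
    grad_i = 2 * (x - a[i])    # gradient of f_i alone, not of the full sum
    x -= lr * grad_i           # stochastic gradient step

print(x, a.mean())  # x converges near the minimizer of the averaged loss
```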
E. The random forest algorithm
Random Forest is a flexible, easy-to-use machine learning
algorithm that produces a great result most of the time,
even without hyper-parameter tuning. It is also one of the
most used algorithms because of its simplicity and because
it can be used for both classification and regression tasks.
Figure 2 shows what a random forest [5] with two trees
looks like.
Figure 2 Random forest tree
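A corresponding sketch (scikit-learn assumed, synthetic data):

```python
# Random forest with near-default settings; little tuning is needed.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

print(clf.score(X, y))  # training accuracy of the ensemble
```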
F. Logistic regression
Logistic regression is the appropriate regression analysis
to conduct when the dependent variable is dichotomous
(binary). Like all regression analyses, logistic regression
is a predictive analysis. Logistic regression [6] is used
to describe data and to explain the relationship between
one dependent binary variable and one or more nominal,
ordinal, interval or ratio-level independent variables.
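As a small illustration (scikit-learn assumed, synthetic data), the model outputs a probability for the binary outcome:

```python
# Logistic regression for a dichotomous (binary) dependent variable.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=1)
clf = LogisticRegression().fit(X, y)

print(clf.predict_proba(X[:2]))  # modeled probabilities of the two classes
```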
V. MODELS OF REPRESENTATION
A. Bag of words
The bag-of-words model is a simplifying representation
used in natural language processing and information
retrieval (IR). In this model, a text (such as a sentence or a
document) is represented as the bag (multi-set) of its words,
disregarding grammar and even word order but keeping
multiplicity. The approach is very simple and flexible, and
can be used in a myriad of ways for extracting features from
documents.
A bag-of-words is a representation of text that describes
the occurrence of words within a document. It involves two
things:
1. A vocabulary of known words.
2. A measure of the presence of known words.
It is called a “bag” of words, because any information about
the order or structure of words in the document is discarded.
The model is only concerned with whether known words
occur in the document, not where in the document.
"A very common feature extraction procedures for sentences
and documents is the bag-of-words approach (BOW). In this
approach, we look at the histogram of the words within the
text, i.e. considering each word count as a feature" [7].
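A short sketch of the two ingredients above, the vocabulary and the word counts (scikit-learn assumed, toy documents invented):

```python
# Bag of words: the vocabulary of known words plus per-document counts;
# word order is discarded but multiplicity is kept.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["good good product", "product not good"]
vec = CountVectorizer()
X = vec.fit_transform(docs)

print(vec.get_feature_names_out())  # vocabulary: ['good' 'not' 'product']
print(X.toarray())                  # counts: [[2 0 1], [1 1 1]]
```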
B. Bag of ngrams
Bag-of-ngrams (BON) models are commonly used for
representing text. One of the main drawbacks of traditional
BON is that it ignores the semantics of n-grams.
N-grams [8] are contiguous sequences of n items in a
sentence. N can be 1, 2 or any other positive integer,
although we usually do not consider very large N because
those n-grams rarely appear in many different places.
When performing machine learning tasks related to natural
language processing, we usually need to generate n-grams
from input sentences. For example, in text classification
tasks, in addition to using each individual token found in
the corpus, we may want to add bi-grams or tri-grams as
features to represent our documents.
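Generating such n-grams takes only a few lines; a small sketch (our own illustration):

```python
# Word n-grams as contiguous token windows of length n.
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "this product works really well".split()
print(ngrams(tokens, 2))  # bi-grams
print(ngrams(tokens, 3))  # tri-grams
```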
There are two types of n-grams, word-based and
character-based. We will explain both, but we will focus
on character-based n-grams.
1) Word based ngrams: Word based text
representations require language-dependent tools, such
as a tokenizer (to split the message into tokens) and
usually a lemmatizer plus a list of stop words (to reduce
the dimensionality of the problem). Moreover, effective
lemmatizers are not available for every natural language,
especially for morphologically rich languages. Word
n-grams [9], i.e., contiguous sequences of n words, attempt
to take advantage of contextual phrasal information.
However, word n-grams considerably increase the
dimensionality of the problem, and the results so far are
not encouraging.
2) Character based ngrams: In training and testing,
we focus on a different but simple text representation.
In particular, each message is considered as a bag of
character n-grams, that is, strings of length n. Character
n-grams are able to capture information on various levels:
lexical (|the |, |free|), word-class (|ed |, |ing |),
punctuation mark usage (|!!!|, |f.r.|), etc. In addition, they
are robust to grammatical errors (e.g., the word tokens
‘assignment’ and ‘asignment’ share the majority of their
character n-grams) and to strange usage of abbreviations,
punctuation marks, etc. The bag of character n-grams
representation is language-independent and does not
require any text preprocessing (tokenizer, lemmatizer, or
other ‘deep’ NLP tools). It has already been used in
several tasks, including language identification, authorship
attribution, and topic-based text categorization, with
remarkable results in comparison to word-based
representations.
An important characteristic of character-level n-grams
[9] is that they avoid (at least to a great extent) the
sparse-data problem that arises when using word-level
n-grams. That is, there are far fewer character combinations
than word combinations, so fewer n-grams will have zero
frequency. On the other hand, this representation still
produces a considerably larger feature set than traditional
bag-of-words representations. Therefore, learning
algorithms able to deal with high-dimensional spaces
should be used.
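A character n-gram bag can be produced without any tokenizer or lemmatizer; a sketch assuming scikit-learn's character analyzer:

```python
# Character 3-grams; analyzer="char_wb" builds n-grams inside word
# boundaries, so misspellings still share most features with clean text.
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(analyzer="char_wb", ngram_range=(3, 3))
X = vec.fit_transform(["free!!! offer", "fre offer"])

print(X.shape)                        # documents x character 3-gram features
print(vec.get_feature_names_out()[:8])
```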
C. FastText
The fastText module allows training word embeddings
from a corpus, with the additional ability to obtain word
vectors for out-of-vocabulary words.
fastText is a library for efficient learning of word
representations and sentence classification. It is
written in C++ and supports multiprocessing during
training. FastText allows you to train supervised and
unsupervised representations of words and sentences.
These representations (embeddings) can be used for
numerous applications: data compression, features for
additional models, candidate selection, or initializers for
transfer learning.
FastText supports training continuous bag of words (CBOW)
or skip-gram models using negative sampling, softmax or
hierarchical softmax loss functions.
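As a sketch of such training (using gensim's FastText implementation as an assumed stand-in for the C++ tool; the tiny corpus is invented):

```python
# Skip-gram fastText with character n-grams of length 3..6.
from gensim.models import FastText

sentences = [["great", "phone", "fast", "delivery"],
             ["battery", "drains", "very", "fast"]]

model = FastText(sentences, vector_size=50, sg=1, min_n=3, max_n=6,
                 min_count=1, epochs=10)

# Out-of-vocabulary words still receive vectors built from their n-grams.
print(model.wv["batery"][:5])  # deliberate misspelling, still embeddable
```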
1) Representation: FastText is able to achieve really
good performance for word representations and sentence
classification, especially in the case of rare words, by
making use of character-level information.
Each word is represented as a bag of character n-grams in
addition to the word itself. For example, for the word
matter with n = 3, the fastText representation of the
character n-grams is <ma, mat, att, tte, ter, er>. The
symbols < and > are added as boundary markers to
distinguish the n-grams of a word from the word itself;
for example, if the word mat is part of the vocabulary, it
is represented as <mat>. This helps preserve the meaning
of shorter words that may show up as n-grams of other
words. Inherently, this also allows suffixes and prefixes
to carry meaning.
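The boundary-marked n-grams described above can be reproduced in a few lines (our own sketch):

```python
# Character n-grams of a word after adding the < and > boundary symbols.
def fasttext_ngrams(word, n=3):
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(fasttext_ngrams("matter"))  # ['<ma', 'mat', 'att', 'tte', 'ter', 'er>']
```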
2) Reading data: While the training for fastText is
multi-threaded, reading the data in is done via a single
thread. Parsing and tokenization are done when the input
data is read.
FastText [10] initializes a couple of vectors to keep track
of the input information. Vector size can be limiting when
training on a large corpus, and can effectively be increased
while maintaining performance.
A word gets discarded only if, during the training stage, a
random draw from a uniform distribution between 0 and
1 is greater than the probability of discard.
3)Training: Once the input and hidden vectors are
initialized, multiple training threads are kicked off. All the
training threads hold a shared pointer to the matrices for
input and hidden vectors. The threads all read from the
input file, updating the model with each input line that is
read.
The input vectors are the vector representations of the
original word and all the n-grams of that word. The loss
is computed, which sets the weights for the forward pass;
these propagate all the way back to the input-layer vectors
in the back-propagation pass. This tuning of the input
vector weights during back propagation is what allows us
to learn representations that maximize co-occurrence
similarity. The learning rate affects how much each
particular instance affects the weights.
VI. PRODUCT ANALYSIS
The ratio of negative to positive comments for a given
product is generally unbalanced; in our comment dataset,
positive comments outnumber negative ones by about six
to one.
For the previous semantic analysis of text, a balanced
dataset of comments was used and high rates were
achieved. In this section we consider the comments of each
product, both negative and positive, as a dataset of its own,
taking this imbalance into account.
Product datasets were extracted from a comment dataset
of 40,000 comments. Considering products with at least 100
comments, the study results are as follows.
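The per-product metrics reported below can be computed as in the following sketch (scikit-learn assumed; the labels are placeholders, not our data):

```python
# Accuracy, precision, recall and F-score for one product's comments.
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

y_true = [1, 1, 1, 1, 1, 0]  # unbalanced: positives dominate, as in our data
y_pred = [1, 1, 1, 0, 1, 0]

print("Accuracy: ", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))
print("F-score:  ", f1_score(y_true, y_pred))
```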
A. Using BoW
The highest rates were achieved when the SVM classifier
was used, except for accuracy, which was highest when
random forest was used.
Accuracy: 0.8807
Precision: 0.7065
Recall: 0.6478
F-score: 0.6546
See Figure 3 for details.
Figure 3 BoW - all products
B. Using BoN
The highest rates were achieved when SVM was used.
Accuracy: 0.8784
Precision: 0.7178
Recall: 0.6506
F-score: 0.6590
See Figure 4 for details.
Figure 4 BoN - all products
C. Using fastText
The highest rates were achieved when SVM, SVR, and
logistic regression were used.
Accuracy: 0.98
Precision: 0.98
Recall: 0.98
F-score: 0.98
See Figure 5 for details.
Figure 5 fastText - all products
VII. CONCLUSION
Text cannot be processed by a computer directly; we need
a suitable way to represent it digitally so that it can be
handled computationally. For this purpose, three
representations were used: bag of words, bag of n-grams,
and fastText.
We observed high accuracy and good results with bag of
words, the simplest way to represent text, although this
representation is relatively slow. We also observed good
results with bags of character n-grams; nevertheless, it is
a very time-consuming representation.
With fastText, we noticed that choosing a larger vector
size yields better results, but this representation showed a
lower accuracy than the others.
Finally, there are many representations and algorithms that
can be applied to any text, but no single one is best in all
respects; various methods should be tried and the most
appropriate one chosen.
Comparing the three approaches across all products yields
the result shown in Figure 6.
Figure 6 Algorithm comparison
REFERENCES
[1] D. M. J. Garbade. (2018) A simple introduction to natural language processing. [Online]. Available: https://becominghuman.ai/a-simple-introduction-to-natural-language-processing-ea66a1747b32
[2] O’Reilly. (2019) Text vectorization and transformation pipelines. [Online]. Available: https://www.oreilly.com/library/view/applied-text-analysis/9781491963036/ch04.html
[3] R. Gandhi. (2018) Naive Bayes classifier. [Online]. Available: https://towardsdatascience.com/naive-bayes-classifier-81d512f50a7c
[4] N. Bambrick and AYLIEN. (2016) Support vector machines. [Online]. Available: https://www.kdnuggets.com/2016/07/support-vector-machines-simple-explanation.html
[5] N. Donges. (2018) The random forest algorithm. [Online]. Available: https://towardsdatascience.com/the-random-forest-algorithm-d457d499ffcd
[6] Statistics Solutions. (2019) Logistic regression. [Online]. Available: https://www.statisticssolutions.com/what-is-logistic-regression/
[7] Y. Goldberg, Neural Network Methods in Natural Language Processing, ser. Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers, 2017, p. 69.
[8] A. A. Yeung. (2018) Generating n-grams from sentences in Python. [Online]. Available: http://www.albertauyeung.com/post/generating-ngrams-python/
[9] I. Kanaris, K. Kanaris, I. Houvardas, and E. Stamatatos, “Words vs. character n-grams for anti-spam filtering,” International Journal on Artificial Intelligence Tools, vol. XX, no. X, pp. 1–20, 2006.
[10] N. Subedi. (2018) FastText: Under the hood. [Online]. Available: https://towardsdatascience.com/fasttext-under-the-hood-11efc57b2b3
