SlideShare a Scribd company logo
1 of 13
Download to read offline
E-PROCEEDING OF THE 6TH INTERNATIONAL CONFERENCE ON
ARTIFICIAL INTELLIGENCE AND COMPUTER SCIENCE 2019
E-PROCEEDING OF THE 6TH INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND
COMPUTER SCIENCE (AICS 2019). (e-ISBN 978-967-0792-34-7). 29 July 2019, Puri Pujangga Hotel, Bangi
Selangor, Malaysia. Organized by https://worldconferences.net Page 6
EVALUATING THE IMPACT OF REMOVING LESS IMPORTANT TERMS ON
SENTIMENT ANALYSIS
Salhana Amad Darwis, Duc Nghia Pham, Ang Jia Pheng, Ong Hong Hoe
Artificial Intelligence Laboratory,
MIMOS Berhad
{salhana.darwis, nghia.pham, jp.ang, hh.ong} @mimos.my
ABSTRACT
Sentiment analysis is an important task in Natural Language Processing (NLP) that analyses
and predicts people s opinion from te tual data. It is a complex process due to the interactions
with computer science, linguistics, psychology and social science disciplines. There is no
straight forward rule to analyse and predict sentiment. Supervised learning methods, which
adopt learning models from human, are being widely used by NLP researchers and experts to
predict sentiment. However, this approach is tricky due to the challenges in ensuring the
quality of the manually labelled training dataset. In this study, we investigated the use of
ling istic factors to impro e the model s accuracy. We gathered two datasets: (i) 125,000
annotated sentences from Amazon product reviews, and (ii) 11,250 annotated sentences from
financial news articles. We then pre-processed the data, identified the less important terms that
exist in the dataset, the linguistic features and their effect towards the correctness of predicted
sentiment. Our experimental results showed that punctuation separation and removal of
supporting POS words improves precision accuracy in larger-generic dataset rather than in
smaller-context sensitive dataset.
Field of Research: sentiment analysis, supervised learning, NLP, linguistics
----------------------------------------------------------------------------------------------------------------
1. Introduction
Modern technologies enable people to express their opinions, thoughts and feelings about what
they experience and things happening around them through various online channels such as
social media, online surveys, blogs, etc. Such massively growing social data is indeed
meaningful and useful for businesses and organisations to gather feedbacks on their products
and services, or to study and analyse social issues, psychological issues, political situations,
etc. However, due to the variety, velocity and volume of such data, it is difficult to digest and
summarise social data into a meaningful form [13]. Sentiment analysis, also known as opinion
mining, specifically analyses people opinions, feelings, and thoughts that are usually
expressed in textual data through the above channels [10].
Due to the practical value of sentiment analysis in several application areas, more and more
efforts have been spent to address this task and the results are very promising. However,
researchers working on it are facing tremendous challenges when dealing with textual data.
Sentiment analysis is a part of NLP tasks which is known as one of the complex problem areas
in Artificial Intelligence [6]. Solely using linguistic method for sentiment analysis is not
sufficient as it does not address uncertainties that require a tacit and cannot
learn from past experience when interpreting opinion; both are important elements in analysing
sentiment [4,6]. For example, the words: I , am , happ , not , to , the , this and mo ie
exist in both sentences below, whereby the negative sentiment is expressed sarcastically:
E-PROCEEDING OF THE 6TH INTERNATIONAL CONFERENCE ON
ARTIFICIAL INTELLIGENCE AND COMPUTER SCIENCE 2019
E-PROCEEDING OF THE 6TH INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND
COMPUTER SCIENCE (AICS 2019). (e-ISBN 978-967-0792-34-7). 29 July 2019, Puri Pujangga Hotel, Bangi
Selangor, Malaysia. Organized by https://worldconferences.net Page 7
Positive - I am happ to atch this mo ie beca se the ticket price is not e pensi e
Negative - I am happ to lea e the cinema because the movie is not good
In the above example, by using linguistic rules, it is difficult to distinguish which words or
phrases that determine positive or negative sentiment as the element of sarcasm was expressed
indirectly as a negative sentiment. Due to such linguistic constraints, many researchers from
both academic and industries prefer the machine learning approaches for sentiment analysis
[1,17,19,20] that enables the machine to learn from examples and past experience just like a
human does.
On the other hand, using machine learning on textual data exposes other challenges and
drawbacks as machine learning methods also have their own limitations:
a) Using simpler words representation model a a a
ensure the performance in sentiment analysis [19]. It looks at the word counts [7]
discarded sequence of words in a sentence and word to word relations. Some important
linguistic information from the words, phrases and sentences may not be identified during
processing.
b) In supervised learning, a good training dataset has to be set up as the input for the machine
learning on NLP context to ensure it learn correctly from sufficient examples. Enormous
work needs to be done in preparing the training dataset, it is tedious, high in
computational cost and yet it may suffer from inaccuracy if less optimal examples fed
into the training dataset.
c) Using mathematical model with word2vec representation approach for instance, causes
the word vector grow when more training data provided. The omission of pre-processing
and lack of linguistic information analysis such as punctuation, word variance, part of
speech, stop words, and noises may lead to the word vector grow even more with higher
dimensions [9,15,16,18]. This factor increases the complexity in text processing and may
impact the performance of sentiment analysis.
In this paper, we present our work on sentiment analysis using supervised learning for English
text. We employed fastText, an open sourced algorithm for text classification developed for
Facebook. Studying the dataset, we noticed that there was a large variance of words repeatedly
occurred in the dataset. In our experimental study, we identified less important words in the
dataset, and then evaluated the impact of dimensionality reduction (by removing these words
from the dataset) towards the performance of fastText. Finally, we provide our conclusion and
suggestion for future works.
2. Machine Learning for Sentiment Analysis
In order for a system to be categorised as intelligent, it must have the ability to learn. In
developing intelligent systems, researchers are grappling to find solutions for common
bottlenecks in AI which are to handle uncertainty, ambiguity and to applying human tacit
knowledge [6]. Machine learning addresses these obstacles by having the ability to learn which
could not be addressed by conventional computing methods or rules. Over the years, various
researches were carried out to compare the performance of methods for sentiment analysis.
Popular machine learning methods for sentiment analysis are Naïve Bayes, k-
NearestNeighbour (k-NN), Support Vector Machine (SVM), Long Short-Term Memory
(LSTM), and Neural Networks (NN) [1,6,17,19,20]. Most of these methods are supervised
learning.
E-PROCEEDING OF THE 6TH INTERNATIONAL CONFERENCE ON
ARTIFICIAL INTELLIGENCE AND COMPUTER SCIENCE 2019
E-PROCEEDING OF THE 6TH INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND
COMPUTER SCIENCE (AICS 2019). (e-ISBN 978-967-0792-34-7). 29 July 2019, Puri Pujangga Hotel, Bangi
Selangor, Malaysia. Organized by https://worldconferences.net Page 8
2.1 Supervised Learning
Supervised learning is a machine learning approach whereby the machine learns from labelled
or annotated data. The objective of supervised learning is to build intelligent system that can
learn from input-output training samples [14]. The most interesting aspect of machine learning
is its learning ability [6] which resemble a natural process of human learning. In sentiment
analysis using supervised learning, selected samples of sentences from different scenarios are
annotated with sentiment from human judgement a a a a a a
a a a .
Supervised learning, however, requires a good training dataset as the input to ensure the
effective learning process. Learning is one of the most important aspect in intelligence and it
is a complex human process. As such, training a machine to learn can lead to complexities and
challenges. Building an effective learning strategy is one of the hardest questions raised in
supervised learning and neural computing [6]. Hence, preparing a training dataset for
supervised learning can be tricky. As the system learns from more examples, the complexity
increases and the vector size grows with higher dimension of word representation. Training
time will also increase with the growing data. Figure 1 illustrates the distribution of words in a
multi-dimensional vector space upon completion of training in a supervised learning.
Figure 1: Illustration of word vector in multiple dimensions upon completion of training a dataset.
2.2 fastText
While many supervised learning methods take a long time to train on large training dataset,
fastText (an open sourced text classifier for Facebook) was introduced to simplify the
complexity in handling large dataset [2,8]. It improves upon traditional learning method for
text classification through various ways: (i) it uses hidden layers to benefit from reusable
variables, (ii) it uses hierarchical softmax to optimise the learning process and reduce running
time and computational complexity, and (iii) it uses n-Grams instead of 'bags of words' to
capture partial information about local word order, and yet remains comparable to other
methods that capture actual word order [2,8].
3. The Complexity in Understanding Natural Language
Interpreting human language is a complex cognitive process as it relies not only on the syntactic
aspect of the language used but also on how the language is perceived by human when
expressing and interpreting thoughts. While various levels of knowledge involved in natural
language understanding process such as syntactic knowledge, semantic knowledge and
pragmatic knowledge [6], there is no straight forwards rule when analysing natural language.
3.1 Part of Speech (POS)
In linguistics, a language is spoken, written and understood by following set of rules such as
the POS rule. The role of POS has been studied in relation with psycho-linguistic effect of
E-PROCEEDING OF THE 6TH INTERNATIONAL CONFERENCE ON
ARTIFICIAL INTELLIGENCE AND COMPUTER SCIENCE 2019
E-PROCEEDING OF THE 6TH INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND
COMPUTER SCIENCE (AICS 2019). (e-ISBN 978-967-0792-34-7). 29 July 2019, Puri Pujangga Hotel, Bangi
Selangor, Malaysia. Organized by https://worldconferences.net Page 9
human in constructing and understanding sentences. Due to its importance and vital role in
language, the identification of POS, also known as POS tagging, has become a pre-requisite
step of language analysis in many computational linguistic applications to provide more
linguistic information before other NLP methods are applied to the text. POS tagging is
important to assign a word in a sentence with morpho-syntactic category [21]. Table 1
summarises different POS types and their role adapted from an English grammar book [3].
Table 1: Summary of POS types and roles.
Word Class Role Example
Noun Identifies a person, a thing, an idea, quality or state girl, engineer, friend, horse, wall,
flower, country, anger, courage, life
Verb Describes what a person or thing does or what happens run, kick, eat
Adjective Describes a noun, giving more information about
people, animal or things represented by noun or
pronoun
big, tall, long, hungry, beautiful
Adverb Gives information about a verb, adjective, or other
adverb
quietly, loudly, badly, accurately
Pronoun Used in place of a noun that is already known or has
already been mentioned to avoid repeating the noun.
she, him, that, something, them
Preposition Used in front of nouns or pronouns to show the
relationship between them. They describe the position,
time, or the way in which something is done
after, in, to, on, with
Conjunction Used to connect phrases, clauses, and sentences. and, because, but, for, if,
or, and when.
Determiner Introduces a noun and usually used before noun a, an, the, every, this, those
Exclamation
/Interjection
Expresses strong emotion, such as surprise, pleasure,
or anger.
How wonderful! Hello! Well done!
Despite of the crucial role of POS, there are POS types that only play supporting roles to
provide description to other words with different POS. Referring to Table 1, POS classes that
play supporting role are Adverb, Determiner, Conjunction, Pronoun, and Preposition [17]. In
this paper, we refer them as supporting POS . Some of them are frequently used but carry less
information in computational linguistic perspective.
3.2 Stop Words
In NLP, stop words (such as a , an , the , e er , this , and , and beca se ) are words
with lesser information that commonly exist in textual document written in natural language.
Stop words are commonly identified from supporting POS such as determiner, pronoun, and
preposition [9]. It is a common pre-requisite steps to remove stop words in pre-processing stage
in NLP. Even though the exhaustive removal of stop words is a debatable practice [5], the
research on stop words removal shows encouraging trend amongst NLP experts from academic
and industry through various projects [5,11,12].
4. Methodology
In this research, we focus on text pre-processing step to identify less important words in the
training datasets that could impair the effectiveness of the learning. We study various POS
classes and stop words. In this section, we experiment the effect of dimensionality reduction of
dataset by removing stop words and the less important words such as words with supporting
E-PROCEEDING OF THE 6TH INTERNATIONAL CONFERENCE ON
ARTIFICIAL INTELLIGENCE AND COMPUTER SCIENCE 2019
E-PROCEEDING OF THE 6TH INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND
COMPUTER SCIENCE (AICS 2019). (e-ISBN 978-967-0792-34-7). 29 July 2019, Puri Pujangga Hotel, Bangi
Selangor, Malaysia. Organized by https://worldconferences.net Page 10
POS towards the precision and correctness of sentiment analysis trained using fastText
algorithm. The overall sentiment analysis process is illustrated in Figure 2.
Step 1 Step 2 Step 3 Step 4
------------------------------------------------------------------------------------ Step 5
Step 6
Figure 2: The sentiment analysis process.
4.1 The Datasets
As illustrated in Figure 2, step 1 involves collecting English dataset of sentiment annotated
sentences. For this research, we prepared two datasets as described below:
(a) Amazon Dataset: consists of 125,000 sentences from Amazon product reviews (that
were written in informal language). These sentences are annotated with either positive
or negative sentiment. This dataset was later split into a training set of 100,000
sentences and a test set of 25,000 sentences.
(b) Financial Dataset: consists of 11,250 sentences from newspaper articles on Financial
Market (that were written in formal language). These sentences are annotated with
either positive, negative or neutral sentiment. The dataset was then split into a training
set of 9,000 sentences and a test set of 2,250 sentences.
4.2 Pre-processing
In this work, we focused on the dimensionality reduction of the training and test dataset as
illustrated in step 2, Figure 2 through data pre-processing steps: tokenisation, punctuation
separation, POS identification, stop words removal and words with supporting POS removal.
Note that, during pre-processing, we ensured that the punctuation was detached from words to
allow fastText to identify words as an individual item. For example, the punctuation in the
string eat. was separated as eat and . . We refer this process as punctuation separation.
Next, we proceeded with removal of less important words in two main parts:
First Part:
Firstly, we performed stop words removal on the dataset, consists of few sections:
1. Removal of maximum list of stop words collection from various sources of researches
and python (consists of 340 unique words) [11,12]
Raw Data
Sentences
with
Manual
Sentiment
Annotation
(Input File)
Training
Dataset
Word
embedding
Model
(output)
Sentiment
Analysis Training
Supervise the
dataset using
fastText
Data Pre-
processing
Tokenization,
punctuation,
POS, stop words
Classification
Model
(output)
New Query
Sentiment
Prediction
Sentiment
Predictions
(output)
Model
testing
Precision
Recall
(output)
E-PROCEEDING OF THE 6TH INTERNATIONAL CONFERENCE ON
ARTIFICIAL INTELLIGENCE AND COMPUTER SCIENCE 2019
E-PROCEEDING OF THE 6TH INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND
COMPUTER SCIENCE (AICS 2019). (e-ISBN 978-967-0792-34-7). 29 July 2019, Puri Pujangga Hotel, Bangi
Selangor, Malaysia. Organized by https://worldconferences.net Page 11
2. Removal of minimal list of stop words collection from python (consist of 127 unique
words)
3. Removal of minimal stop words with supporting POS- Determiner, Conjunction,
Preposition and Pronoun, by their respective POS
Second Part:
Secondly, we performed removal of words with POS that only play supporting role to other
main POS, referred to as supporting POS, consist of POS types: Conjunction, Determiner,
Preposition and Pronoun (as mentioned in section 3.1). We gathered maximum list of words
with these supporting POS to be removed and validated their POS category with online
Cambridge Dictionary and an English grammar book [3]. These word lists are the extended
collection to contain maximum words with the respective POS regardless they are in stop words
list or not. This is to enable analysis when comparing of the effect of solely removing stop
words versus removing all words with supporting POS regardless of their existence in stop
words list. In this second part implementation, we only focused to perform below step:
1. Removal of words with supporting POS - Conjunction, Determiner, Preposition and
Pronoun from the dataset.
4.3 Training the Model
In step 4, we trained the model using the pre-processed dataset (which is shown in step 3) using
fastText. This step was repeated for both Amazon dataset and Financial dataset. The training
parameters were set with learning rate, lr = 1.0, epoch = 25, and word n-gram = 5. This step is
referred to as supervised , where all the samples in dataset is trained using fastText classifier
algorithm. Upon completion of training, two models were generated: (1) word embedding
model, and (2) classification model.
4.4 Evaluating the Model
Finally, in step 5 we evaluate the performance of the newly trained models (which were
generated in step 4) using the corresponding test set. We then calculate precision and recall
results of this sentiment analysis model. In step 6, we performed sentiment prediction query on
live cases to evaluate the correctness of sentiment prediction of the generated model.
5. Results & Discussion
The datasets described in section 4.1 were used to conduct experiments to find out the effect
of pre-processing on Amazon Dataset and Financial Dataset towards the precision, recall and
correctness of sentiment prediction. We also looked at the impact of pre-processing on the
vector size of the embedding models that were generated from these tests (by evaluating the
Vector file size ) a a a a -processing steps.
The results are discussed in the following sections (Section 5.1 and 5.2)
5.1 Removal of Stop Words and Supporting POS from Amazon Dataset
Experiment 1: Removal of Stop Words from Amazon dataset.
Table 2 shows the results of our experiment after removing stop words from the Amazon
dataset. Set A is the baseline where no change was made to the original Amazon dataset. We
then separated punctuations from words in the original dataset, resulted in set B. This set was
E-PROCEEDING OF THE 6TH INTERNATIONAL CONFERENCE ON
ARTIFICIAL INTELLIGENCE AND COMPUTER SCIENCE 2019
E-PROCEEDING OF THE 6TH INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND
COMPUTER SCIENCE (AICS 2019). (e-ISBN 978-967-0792-34-7). 29 July 2019, Puri Pujangga Hotel, Bangi
Selangor, Malaysia. Organized by https://worldconferences.net Page 12
then used to create the subsequent sets C H, by removing either all the stop words or only the
stop words with a certain POS role.
Table 2: Experiment 1 results Removal of stop words from Amazon dataset.
Amazon dataset Evaluation Total words Word embedding model
Stop word removal Precision Recall Count
Reduction
(%)
Size (KBs)
Reduction
(%)
Set A - Original
Training Dataset
0.907 0.907 9,179,234 - 410,166 -
Set B Punctuation
Separation
0.920 0.920 9,179,234 - 249,031 39.280
Set C - Stop words
(Maximum)
0.896 0.896 4,927,534 46.32% 217,425 46.990
Set D - Stop words
(Minimum)
0.903 0.903 5,401,919 41.15% 211,048 48.545
Set E - Stop words
(POS Conjunction)
0.919 0.919 8,884,591 3.21% 213,034 48.061
Set F - Stop words
(POS Determiner)
0.919 0.919 8,199,831 10.67% 212,778 48.123
Set G - Stop words
(POS Pronoun)
0.919 0.919 8,594,147 6.37% 212,721 48.137
Set H - Stop words
(POS Preposition)
0.918 0.918 8,314,093 9.42% 212,836 48.109
From this experiment, we observed that separation of punctuation from words (set B)
contributes to the highest precision and recall at 0.92, while reducing the vector size by 39.28%.
On the other hand, removal of stop words from the punctuation-separated dataset had a negative
impact on the performance of fastText (see results of sets C and D) compared to the original
set A. Meanwhile, removal of stop words by a specific POS role on top of punctuation
separation (sets E H) resulted in slight improvements on precision and recall compared to the
original set A. Out of the four support POS roles tested, removal of stop words with preposition
role produced the least performance boost. However, the difference in precision and recall to
the other three POS roles (which were on par on performance) is very negligible (0.001).
Table 3: Experiment 2 results Removal of words with supporting POS roles (Determiner, Conjunction, Preposition,
or Pronoun) from the Amazon dataset.
Amazon dataset Evaluation Total words Word embedding model
Supporting POS removal Precision Recall Count
Reduction
(%)
Size
(KBs)
Reduction
(%)
Set A - Original Training
Dataset
0.907 0.907 9,179,234 - 410,166 -
Set B Punctuation
Separation
0.920 0.920 9,179,234 - 249,031 39.280
Set C - All words
(POS Determiner)
0.917 0.917 7,908,000 13.84% 212,603 48.166
Set D - All words
(POS Conjunction)
0.919 0.919 8,593,953 6.37% 212,662 48.152
Set E - All words
(POS Preposition)
0.919 0.919 8,214,290 10.51% 212,520 48.186
Set F - All words
(POS Pronoun)
0.918 0.918 8,664,127 5.61% 212,792 48.121
E-PROCEEDING OF THE 6TH INTERNATIONAL CONFERENCE ON
ARTIFICIAL INTELLIGENCE AND COMPUTER SCIENCE 2019
E-PROCEEDING OF THE 6TH INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND
COMPUTER SCIENCE (AICS 2019). (e-ISBN 978-967-0792-34-7). 29 July 2019, Puri Pujangga Hotel, Bangi
Selangor, Malaysia. Organized by https://worldconferences.net Page 13
Experiment 2: Removal of All Words with Supporting POS from Amazon dataset.
Table 3 shows our experimental result on removal of words with supporting POS roles
(determiner, conjunction, preposition, or pronoun) from the Amazon dataset. Similar to
experiment 1, set B was created by separating punctuations from words; and sets C F were
created by removing words with respective support POS role from set B.
The previous observation in Experiment 1 also applicable here: removal of words with
supporting POS roles improved fastText a a a
on the original set A. However, the impact of these word removals remained slightly less than
just simply separating punctuations, even though these settings can significantly reduce the size
of the word embedding model. Among four supporting POS roles tested, removal of
determiners produced the highest reduction of 13.67% total words. However, its impact on
a a a a a a POS .
5.2 Removal of All Words with Supporting POS from Financial Dataset
Experiment 3: Removal of Stop Words from Financial Dataset
Table 4 shows the results of our experiment after removing stop words from the Financial
dataset. Set A is the baseline where no change was made to the original Financial dataset. We
then separated punctuations from words in the original dataset, resulted in set B. This set was
then used to create the subsequent sets C H, by removing either all the stop words or only the
stop words with a certain POS role.
Table 4: Experiment 3 results Removal of stop words from Financial dataset.
Financial dataset Evaluation Total words Word embedding model
Stop word removal Precision Recall Count
Reduction
(%)
Size (KBs)
Reduction
(%)
Set A - Original
Training Dataset
0.809 0.809 234,110 - 25,301 -
Set B Punctuation
Separation
0.809 0.809 234,110 - 16,928 33.00
Set C - Stop words
(Maximum)
0.787 0.787 166,853 28.72 16,334 35.44
Set D - Stop words
(Minimum)
0.793 0.793 173,139 26.04 16,498 34.79
Set E - Stop words
(POS Conjunction)
0.806 0.806 232,264 0.79 16,563 34.53
Set F - Stop words
(POS Determiner)
0.801 0.801 216,902 7.35 16,799 33.60
Set G -Stop words
(POS Pronoun)
0.799 0.799 232,821 0.55 16,813 33.54
Set H - Stop words
(POS Preposition)
0.794 0.794 212,009 9.44 16,552 34.59
Table 4 shows that separation of punctuations from words (set B) neither increased nor
decreased the performance of fastText in comparison to its performance on the original dataset
(set A). This is contrast to our finding in experiments 1 & 2 where separation of punctuations
from words gave the highest accuracy boost to fastText.
In addition, removal of stop words, regardless of whether POS role filter was applied or not,
decreased the accuracy of fastText on this Financial dataset, ranging from -0.022 (set C:
E-PROCEEDING OF THE 6TH INTERNATIONAL CONFERENCE ON
ARTIFICIAL INTELLIGENCE AND COMPUTER SCIENCE 2019
E-PROCEEDING OF THE 6TH INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND
COMPUTER SCIENCE (AICS 2019). (e-ISBN 978-967-0792-34-7). 29 July 2019, Puri Pujangga Hotel, Bangi
Selangor, Malaysia. Organized by https://worldconferences.net Page 14
maximum list of stop words) to -0.003 (set E: conjunction). This observation is again contrast
to our finding on the larger Amazon dataset where removal of only stop words with a specific
POS role helped improve fastText a .
Experiment 4: Removal of All Words with Supporting POS from Financial Dataset
Table 5 shows our experimental result on removal of words with supporting POS roles
(Determiner, Conjunction, Preposition, or Pronoun) from the Financial dataset. Similar to
previous experiments, set B was created by separating punctuations from words; and sets C
F were created by removing words with respective support POS role from set B.
Table 5: Experiment 4 results - Removal of words with supporting POS roles (Determiner, Conjunction, Preposition, or
Pronoun) from Financial dataset.
Financial dataset Evaluation Total words Word embedding model
Supporting POS removal Precision Recall Count
Reduction
(%)
Size
(KBs)
Reduction
(%)
Set A - Original Training
Dataset
0.809 0.809 234,110 - 25,301 -
Set B - Punctuation
Separation
0.809 0.809 234,110 - 16,928 33.00
Set C - All words
(POS Determiner)
0.797 0.797 214,040 8.57% 16,909 33.16
Set D -All words
(POS Conjunction)
0.802 0.802 225,743 3.57% 16,959 32.97
Set E - All words
(POS Preposition)
0.789 0.789 206,336 11.86% 16,869 33.33
Set F - All words
(POS Pronoun)
0.802 0.802 231,503 1.12% 16,974 32.91
In contrast to our finding in experiment 2, removal of words with any supporting POS roles
(determiner, conjunction, preposition, or pronoun) worsened the accuracy of fastText on the
Financial dataset, with a negative difference ranging from -0.020 (set E: preposition) to -0.007
(set D: conjunction and set F: pronoun). Here, removal of prepositions (not determiners as
found in experiment 2) produced the highest reduction of 11.86% total words. It also reduced
the word embedding model the most by 33.33%.
5.3 Discussion
Punctuation separation provided the best result in the Amazon dataset due to the reduction of
noise from the original dataset. In the original dataset, punctuations were attached with word
causing unnecessary formation of new tokens which were regarded as a new word variance by
fastText. For example, both tokens eat and eat. exist in the original dataset, causing the
formation a new word eat. . Due to the large data volume in Amazon dataset, these word
variances grow tremendously and increase the size of word embedding model. They were
eliminated by punctuation separation. On the other hand, our Financial dataset consists of
formal written sentences and does not contain large vocabulary given its small size. Hence, the
low occurrence of punctuation-attached word variances, resulting in neither improvement nor
degrading in performance when pre-processing text with punctuation separation.
In our experiments, exhaustive removal of stop words decreases the precision and recall of
fastText on both Amazon and Financial datasets. Our stop words list is a general compilation
of frequently used words in computational linguistics with various POS and different level of
E-PROCEEDING OF THE 6TH INTERNATIONAL CONFERENCE ON
ARTIFICIAL INTELLIGENCE AND COMPUTER SCIENCE 2019
E-PROCEEDING OF THE 6TH INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND
COMPUTER SCIENCE (AICS 2019). (e-ISBN 978-967-0792-34-7). 29 July 2019, Puri Pujangga Hotel, Bangi
Selangor, Malaysia. Organized by https://worldconferences.net Page 15
importance and meaning. Exhaustive removal of all words from this list without carefully
identifying their roles in a particular dataset causes reduction in sentiment precision. For
example, the word sho ld , ha e , does exist in stop word list with POS category of Verb
that may carry some valuable meaning in a particular context even though these words can be
less important in other contexts. This finding also justifies some reasons whereby the
exhaustive stop words removal is a debatable practice among scholars in NLP [5].
Removal of words with POS role (determiner, conjunction, preposition and pronoun) decreases
the accuracy of fastText on Financial dataset but improves the accuracy on Amazon dataset.
This finding suggests that, in a specific domain dataset, there is lower occurrence of less
important words even though the same word may appear as unimportant in other generic
dataset. In Financial dataset, it seems to suggest that words with POS determiner, conjunction,
preposition and pronoun may carry some meaning in determining sentiment. As the supervised
learning model learn from lesser example in smaller dataset, unnecessary removal of words
may worsen the result. Furthermore, word sequence is being captured when using n-Gram
model on fastText. Hence, unnecessary removal of valuable words can worsen the accuracy.
The above finding is further supported in experiment 4 Set E whereby the removal of
preposition produced the highest reduction of words and word embedding model, and worst
precision. This is due to the nature of Financial a a . U
Amazon dataset which is written in free-style format, the Financial dataset contain various
indicators in financial market domain such as do n , up , above . T , a
with POS-preposition can contribute to negative impact on precision due to the removal of such
important keywords which plays role as important indicator in Financial dataset. Below are the
examples:
Financial dataset:
negative F&N was down 18 sen to RM35.02.
positive The Hang Seng was up 1.33%.
positive prices have remained above that level since April
Amazon dataset:
negative Very disappointed in this book, as it was to have 16 illustrations,
as well as other pictures.
The selection of words removal should be carefully studied to avoid important keyword
is removed from a particular dataset.
6. Conclusion
The computational linguistic practice to remove stop words and less important words with
supporting POS is relevant on text processing. However, not all of them should be removed.
Part of these words may contain meaningful elements on a particular context. The level of
words importance may vary from one dataset to another. For example, the words do n , up ,
above a a a F a a a a . The stop word list and words with
supporting POS to be removed should be used as general guideline in text pre-processing step.
However, the selection of exact words to be removed need to be meticulously studied based on
few aspects which are (1) the structure of dataset sentences: structured or unstructured (2) the
existence of domain specific terms (3) words distribution and writing style: formal or informal
(4) volume of the dataset.
E-PROCEEDING OF THE 6TH INTERNATIONAL CONFERENCE ON
ARTIFICIAL INTELLIGENCE AND COMPUTER SCIENCE 2019
E-PROCEEDING OF THE 6TH INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND
COMPUTER SCIENCE (AICS 2019). (e-ISBN 978-967-0792-34-7). 29 July 2019, Puri Pujangga Hotel, Bangi
Selangor, Malaysia. Organized by https://worldconferences.net Page 16
Basic pre-processing steps such as punctuation separation from word is important to reduce
irrelevant combination of words and punctuation in a larger and generic domain dataset. The
omission of punctuation processing in a larger and generic dataset, such as punctuation
separation may lead to unnecessary noise exist in the dataset that could cause the increment of
word embedding model size and impairment in the learning process.
7. Future Works
In future, we plan to focus in finding the actual less important stop words that can impair the
learning process of each datasets. Apart from that, we intend to further investigate the impact
of removing words that exist in stop words by various POS categories versus removing all
words with supporting POS roles. On the other hand, a separate research to identify higher
importance words in sentiment analysis should also be done along with identification of less
important words. This is to ensure words and phrases in the training dataset are given the
correct weightage during the supervised learning process.
8. References
[1] Annett, M., & Kondrak, G. (2008). A Comparison of Sentiment Analysis Techniques:
Polarizing Movie Blogs. Advances in Artificial Intelligence, (May), 25 35.
https://doi.org/10.1007/978-3-540-68825-9_3
[2] Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2016). Enriching Word Vectors
with Subword Information. Transactions of the Association for Computational
Linguistics. https://doi.org/5. 10.1162/tacl_a_00051
[3] Ca , R., M Ca , M., Ma , G., & O K , A. (2016). English Grammar Today The
Cambridge A-Z Grammar of English. Cambridge United Kingdom: Cambridge University
Press.
[4] Deng, S., Sinha, A. P., & Zhao, H. (2017). Resolving Ambiguity in Sentiment
Classification: The Role of Dependency Features. ACM Trans. Manage. Inf. Syst., 8(2
3), 4:1--4:13. https://doi.org/10.1145/3046684
[5] Dolamic, L., & Savoy, J. (2010). Brief communication: When stopword lists make the
difference. Journal of the American Society for Information Science and Technology,
61(1), 200 203. https://doi.org/10.1002/asi.21186
[6] G F L . (1998). A a I : S a S a s for Complex
Problem Solving. In Addison-Wesley (3rd Edition). Addison-Wesley.
[7] Goldberg, Y. (2017). Neural network methods for natural language processing (Synthesis
Lectures on Human Language Technologies). Morgan & Claypool Publisher, Vol. 10, pp.
1 309. https://doi.org/10.1162/COLI_r_00312
[8] Joulin, A., Grave, E., Bojanowski, P., & Mikolov, T. (2017). Bag of Tricks for Efficient
E-PROCEEDING OF THE 6TH INTERNATIONAL CONFERENCE ON
ARTIFICIAL INTELLIGENCE AND COMPUTER SCIENCE 2019
E-PROCEEDING OF THE 6TH INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND
COMPUTER SCIENCE (AICS 2019). (e-ISBN 978-967-0792-34-7). 29 July 2019, Puri Pujangga Hotel, Bangi
Selangor, Malaysia. Organized by https://worldconferences.net Page 17
Text Classification. Proceedings of the 15th Conference of the European Chapter of the
Association for Computational Linguistics: Volume 2, Short Papers, 2, 427 431.
https://doi.org/10.18653/v1/E17-2068
[9] Kumar, A. A., & Chandrasekhar, S. (2012). Text Data Pre-processing and Dimensionality
Reduction Techniques for Document Clustering. International Journal of Engineering
Research & Technology (IJERT), 1(5), 1 6. https://doi.org/2278-0181
[10] Kumar, A., & Panda, S. P. (2018). A Survey of Sentiment Analysis on Social Media.
International Journal for Research in Applied Science & Engineering Technology
(IJRASET), 6( February). https://doi.org/2321-9653
[11] Larsen, K. R. (2016). MIS Quarterly Article Stopword List. (January).
[12] Larsen, K. R., & Bong, C. H. (2016). A Tool for Addressing Construct Identity in
Literature Reviews and Meta-Analyses. MIS Quarterly, 40(3), 529 551.
https://doi.org/10.25300/misq/2016/40.3.01
[13] Liu, B. (2010). Sentiment Analysis and Subjectivity (To appear in Handbook of Natural
Language Processing) (second edi; N. Indurkhya & F. J. Damerau, Eds.).
http://www.cs.uic.edu/~liub/FBS/NLP-handbook-sentiment-
analysis.pdf%5Cnhttp://people.sabanciuniv.edu/berrin/proj102/1-BLiu-Sentiment
Analysis and Subjectivity-NLPHandbook-2010.pdf
[14] Liu, Q., & Wu, Y. (2012). Supervised Learning. In N. M. Seel (Ed.), Encyclopedia of the
Sciences of Learning (2012th ed.). https://doi.org/10.1007/978-1-4419-1428-6_451
[15] Mao, Y., Balasubramanian, K., & Lebanon, G. (2010). Dimensionality Reduction for Text
using Domain Knowledge. COLING 10 Proceedings of the 23rd International
Conference on Computational Linguistics: Posters, (August ), 801 809.
[16] Martins, C. A., Monard, M. C., & Matsubara, E. T. (2003). Reducing the Dimensionality
of Bag-of-Words Text Representation Used by Learning Algorithms. Proc of 3rd IASTED
International Conference on Artificial Intelligence and Applications, 228 233.
https://pdfs.semanticscholar.org/a90f/ca4c78b66fb0e28f4fe8086d28f3720c7999.pdf
[17] Pang, B., Lee, L., & Vaithyanathan, S. (2002). Thumbs up? Sentiment Classification using
Machine Learning Techniques. Proceedings of the Conference on Empirical Methods in
Natural Language Processing (EMNLP), 79 86.
https://doi.org/10.3115/1118693.1118704
[18] Ponmuthuramalingam, P., & Devi, T. (2010). Effective Dimension Reduction Techniques
for Text Documents. IJCSNS International Journal of Computer Science and Network
Security, 10(7). Retrieved from http://paper.ijcsns.org/07_book/201007/20100712.pdf
[19] R , E., Ha a , M., Wa a , M., J , M., E , ., & S a , M.
(2018). More than Bags of Words: Sentiment Analysis with Word Embeddings.
Communication Methods and Measures, 12(2 3), 140 157.
https://doi.org/10.1080/19312458.2018.1455817
E-PROCEEDING OF THE 6TH INTERNATIONAL CONFERENCE ON
ARTIFICIAL INTELLIGENCE AND COMPUTER SCIENCE 2019
E-PROCEEDING OF THE 6TH INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND
COMPUTER SCIENCE (AICS 2019). (e-ISBN 978-967-0792-34-7). 29 July 2019, Puri Pujangga Hotel, Bangi
Selangor, Malaysia. Organized by https://worldconferences.net Page 18
[20] Sadanandan, A. A., Osman, N. A., Saifuddin, H., Khairuddin, M., Pham, D. N., Hoe, H.,
& Lumpur, K. (2016). Improving Accuracy in Sentiment Analysis for Malay Language.
4th International Conference on Artificial Intelligence and Computer Science
(AICS2016), (November), 28 29.
[21] Vanroose, P. (2001). Part-of-Speech Tagging From an Information-Theoretic Point of
View. (22nd July). http://citeseer.ist.psu.edu/646161. html
View publication stats
View publication stats

More Related Content

What's hot

An overview of information extraction techniques for legal document analysis ...
An overview of information extraction techniques for legal document analysis ...An overview of information extraction techniques for legal document analysis ...
An overview of information extraction techniques for legal document analysis ...IJECEIAES
 
My harmony generating statistics from clinical text for monitoring clinical...
My harmony   generating statistics from clinical text for monitoring clinical...My harmony   generating statistics from clinical text for monitoring clinical...
My harmony generating statistics from clinical text for monitoring clinical...Conference Papers
 
SECURETI: Advanced SDLC and Project Management Tool for TI (Philippines)
SECURETI: Advanced SDLC and Project Management Tool for TI (Philippines)SECURETI: Advanced SDLC and Project Management Tool for TI (Philippines)
SECURETI: Advanced SDLC and Project Management Tool for TI (Philippines)AIRCC Publishing Corporation
 
CHALLENGES FOR MANAGING COMPLEX APPLICATION PORTFOLIOS: A CASE STUDY OF SOUTH...
CHALLENGES FOR MANAGING COMPLEX APPLICATION PORTFOLIOS: A CASE STUDY OF SOUTH...CHALLENGES FOR MANAGING COMPLEX APPLICATION PORTFOLIOS: A CASE STUDY OF SOUTH...
CHALLENGES FOR MANAGING COMPLEX APPLICATION PORTFOLIOS: A CASE STUDY OF SOUTH...IJMIT JOURNAL
 
Real Time Application for Career Guidance
Real Time Application for Career GuidanceReal Time Application for Career Guidance
Real Time Application for Career Guidanceijtsrd
 
CONNECTING O*NET® DATABASE TO CYBERSECURITY WORKFORCE PROFESSIONAL CERTIFICAT...
CONNECTING O*NET® DATABASE TO CYBERSECURITY WORKFORCE PROFESSIONAL CERTIFICAT...CONNECTING O*NET® DATABASE TO CYBERSECURITY WORKFORCE PROFESSIONAL CERTIFICAT...
CONNECTING O*NET® DATABASE TO CYBERSECURITY WORKFORCE PROFESSIONAL CERTIFICAT...IJITE
 
A Model for Encryption of a Text Phrase using Genetic Algorithm
A Model for Encryption of a Text Phrase using Genetic AlgorithmA Model for Encryption of a Text Phrase using Genetic Algorithm
A Model for Encryption of a Text Phrase using Genetic Algorithmijtsrd
 
Impact of Expert System as Tools for Efficient Teaching and Learning Process ...
Impact of Expert System as Tools for Efficient Teaching and Learning Process ...Impact of Expert System as Tools for Efficient Teaching and Learning Process ...
Impact of Expert System as Tools for Efficient Teaching and Learning Process ...rahulmonikasharma
 
A Survey Study on Higher Education Trends among Information Technology Prof...
A Survey Study  on  Higher Education Trends among Information Technology Prof...A Survey Study  on  Higher Education Trends among Information Technology Prof...
A Survey Study on Higher Education Trends among Information Technology Prof...Tharindu Weerasinghe
 
CHATBOT FOR COLLEGE RELATED QUERIES | J4RV4I1008
CHATBOT FOR COLLEGE RELATED QUERIES | J4RV4I1008CHATBOT FOR COLLEGE RELATED QUERIES | J4RV4I1008
CHATBOT FOR COLLEGE RELATED QUERIES | J4RV4I1008Journal For Research
 
1621027 kim in_hong_bachelor_thesis_b_sc_no_sig
1621027 kim in_hong_bachelor_thesis_b_sc_no_sig1621027 kim in_hong_bachelor_thesis_b_sc_no_sig
1621027 kim in_hong_bachelor_thesis_b_sc_no_sigAsmaImran10
 
Privacy Preserving Chatbot Conversations
Privacy Preserving Chatbot ConversationsPrivacy Preserving Chatbot Conversations
Privacy Preserving Chatbot ConversationsDebmalya Biswas
 
Smart information desk system with voice assistant for universities
Smart information desk system with voice assistant for universities Smart information desk system with voice assistant for universities
Smart information desk system with voice assistant for universities IJECEIAES
 
AUTOMATED TOOL FOR RESUME CLASSIFICATION USING SEMENTIC ANALYSIS
AUTOMATED TOOL FOR RESUME CLASSIFICATION USING SEMENTIC ANALYSIS AUTOMATED TOOL FOR RESUME CLASSIFICATION USING SEMENTIC ANALYSIS
AUTOMATED TOOL FOR RESUME CLASSIFICATION USING SEMENTIC ANALYSIS ijaia
 
IRJET - E-Assistant: An Interactive Bot for Banking Sector using NLP Process
IRJET -  	  E-Assistant: An Interactive Bot for Banking Sector using NLP ProcessIRJET -  	  E-Assistant: An Interactive Bot for Banking Sector using NLP Process
IRJET - E-Assistant: An Interactive Bot for Banking Sector using NLP ProcessIRJET Journal
 
SEARCH FOR ANSWERS IN DOMAIN-SPECIFIC SUPPORTED BY INTELLIGENT AGENTS
SEARCH FOR ANSWERS IN DOMAIN-SPECIFIC SUPPORTED BY INTELLIGENT AGENTSSEARCH FOR ANSWERS IN DOMAIN-SPECIFIC SUPPORTED BY INTELLIGENT AGENTS
SEARCH FOR ANSWERS IN DOMAIN-SPECIFIC SUPPORTED BY INTELLIGENT AGENTSijcsit
 
SEARCH FOR ANSWERS IN DOMAIN-SPECIFIC SUPPORTED BY INTELLIGENT AGENTS
SEARCH FOR ANSWERS IN DOMAIN-SPECIFIC SUPPORTED BY INTELLIGENT AGENTSSEARCH FOR ANSWERS IN DOMAIN-SPECIFIC SUPPORTED BY INTELLIGENT AGENTS
SEARCH FOR ANSWERS IN DOMAIN-SPECIFIC SUPPORTED BY INTELLIGENT AGENTSAIRCC Publishing Corporation
 
Improving Credit Card Fraud Detection: Using Machine Learning to Profile and ...
Improving Credit Card Fraud Detection: Using Machine Learning to Profile and ...Improving Credit Card Fraud Detection: Using Machine Learning to Profile and ...
Improving Credit Card Fraud Detection: Using Machine Learning to Profile and ...Melissa Moody
 
A DEVELOPMENT FRAMEWORK FOR A CONVERSATIONAL AGENT TO EXPLORE MACHINE LEARNIN...
A DEVELOPMENT FRAMEWORK FOR A CONVERSATIONAL AGENT TO EXPLORE MACHINE LEARNIN...A DEVELOPMENT FRAMEWORK FOR A CONVERSATIONAL AGENT TO EXPLORE MACHINE LEARNIN...
A DEVELOPMENT FRAMEWORK FOR A CONVERSATIONAL AGENT TO EXPLORE MACHINE LEARNIN...mlaij
 

What's hot (20)

An overview of information extraction techniques for legal document analysis ...
An overview of information extraction techniques for legal document analysis ...An overview of information extraction techniques for legal document analysis ...
An overview of information extraction techniques for legal document analysis ...
 
My harmony generating statistics from clinical text for monitoring clinical...
My harmony   generating statistics from clinical text for monitoring clinical...My harmony   generating statistics from clinical text for monitoring clinical...
My harmony generating statistics from clinical text for monitoring clinical...
 
SECURETI: Advanced SDLC and Project Management Tool for TI (Philippines)
SECURETI: Advanced SDLC and Project Management Tool for TI (Philippines)SECURETI: Advanced SDLC and Project Management Tool for TI (Philippines)
SECURETI: Advanced SDLC and Project Management Tool for TI (Philippines)
 
CHALLENGES FOR MANAGING COMPLEX APPLICATION PORTFOLIOS: A CASE STUDY OF SOUTH...
CHALLENGES FOR MANAGING COMPLEX APPLICATION PORTFOLIOS: A CASE STUDY OF SOUTH...CHALLENGES FOR MANAGING COMPLEX APPLICATION PORTFOLIOS: A CASE STUDY OF SOUTH...
CHALLENGES FOR MANAGING COMPLEX APPLICATION PORTFOLIOS: A CASE STUDY OF SOUTH...
 
Real Time Application for Career Guidance
Real Time Application for Career GuidanceReal Time Application for Career Guidance
Real Time Application for Career Guidance
 
CONNECTING O*NET® DATABASE TO CYBERSECURITY WORKFORCE PROFESSIONAL CERTIFICAT...
CONNECTING O*NET® DATABASE TO CYBERSECURITY WORKFORCE PROFESSIONAL CERTIFICAT...CONNECTING O*NET® DATABASE TO CYBERSECURITY WORKFORCE PROFESSIONAL CERTIFICAT...
CONNECTING O*NET® DATABASE TO CYBERSECURITY WORKFORCE PROFESSIONAL CERTIFICAT...
 
A Model for Encryption of a Text Phrase using Genetic Algorithm
A Model for Encryption of a Text Phrase using Genetic AlgorithmA Model for Encryption of a Text Phrase using Genetic Algorithm
A Model for Encryption of a Text Phrase using Genetic Algorithm
 
Impact of Expert System as Tools for Efficient Teaching and Learning Process ...
Impact of Expert System as Tools for Efficient Teaching and Learning Process ...Impact of Expert System as Tools for Efficient Teaching and Learning Process ...
Impact of Expert System as Tools for Efficient Teaching and Learning Process ...
 
A Survey Study on Higher Education Trends among Information Technology Prof...
A Survey Study  on  Higher Education Trends among Information Technology Prof...A Survey Study  on  Higher Education Trends among Information Technology Prof...
A Survey Study on Higher Education Trends among Information Technology Prof...
 
CHATBOT FOR COLLEGE RELATED QUERIES | J4RV4I1008
CHATBOT FOR COLLEGE RELATED QUERIES | J4RV4I1008CHATBOT FOR COLLEGE RELATED QUERIES | J4RV4I1008
CHATBOT FOR COLLEGE RELATED QUERIES | J4RV4I1008
 
1621027 kim in_hong_bachelor_thesis_b_sc_no_sig
1621027 kim in_hong_bachelor_thesis_b_sc_no_sig1621027 kim in_hong_bachelor_thesis_b_sc_no_sig
1621027 kim in_hong_bachelor_thesis_b_sc_no_sig
 
Privacy Preserving Chatbot Conversations
Privacy Preserving Chatbot ConversationsPrivacy Preserving Chatbot Conversations
Privacy Preserving Chatbot Conversations
 
Smart information desk system with voice assistant for universities
Smart information desk system with voice assistant for universities Smart information desk system with voice assistant for universities
Smart information desk system with voice assistant for universities
 
AUTOMATED TOOL FOR RESUME CLASSIFICATION USING SEMENTIC ANALYSIS
AUTOMATED TOOL FOR RESUME CLASSIFICATION USING SEMENTIC ANALYSIS AUTOMATED TOOL FOR RESUME CLASSIFICATION USING SEMENTIC ANALYSIS
AUTOMATED TOOL FOR RESUME CLASSIFICATION USING SEMENTIC ANALYSIS
 
IRJET - E-Assistant: An Interactive Bot for Banking Sector using NLP Process
IRJET -  	  E-Assistant: An Interactive Bot for Banking Sector using NLP ProcessIRJET -  	  E-Assistant: An Interactive Bot for Banking Sector using NLP Process
IRJET - E-Assistant: An Interactive Bot for Banking Sector using NLP Process
 
SEARCH FOR ANSWERS IN DOMAIN-SPECIFIC SUPPORTED BY INTELLIGENT AGENTS
SEARCH FOR ANSWERS IN DOMAIN-SPECIFIC SUPPORTED BY INTELLIGENT AGENTSSEARCH FOR ANSWERS IN DOMAIN-SPECIFIC SUPPORTED BY INTELLIGENT AGENTS
SEARCH FOR ANSWERS IN DOMAIN-SPECIFIC SUPPORTED BY INTELLIGENT AGENTS
 
SEARCH FOR ANSWERS IN DOMAIN-SPECIFIC SUPPORTED BY INTELLIGENT AGENTS
SEARCH FOR ANSWERS IN DOMAIN-SPECIFIC SUPPORTED BY INTELLIGENT AGENTSSEARCH FOR ANSWERS IN DOMAIN-SPECIFIC SUPPORTED BY INTELLIGENT AGENTS
SEARCH FOR ANSWERS IN DOMAIN-SPECIFIC SUPPORTED BY INTELLIGENT AGENTS
 
Improving Credit Card Fraud Detection: Using Machine Learning to Profile and ...
Improving Credit Card Fraud Detection: Using Machine Learning to Profile and ...Improving Credit Card Fraud Detection: Using Machine Learning to Profile and ...
Improving Credit Card Fraud Detection: Using Machine Learning to Profile and ...
 
A DEVELOPMENT FRAMEWORK FOR A CONVERSATIONAL AGENT TO EXPLORE MACHINE LEARNIN...
A DEVELOPMENT FRAMEWORK FOR A CONVERSATIONAL AGENT TO EXPLORE MACHINE LEARNIN...A DEVELOPMENT FRAMEWORK FOR A CONVERSATIONAL AGENT TO EXPLORE MACHINE LEARNIN...
A DEVELOPMENT FRAMEWORK FOR A CONVERSATIONAL AGENT TO EXPLORE MACHINE LEARNIN...
 
Financial Tracker using NLP
Financial Tracker using NLPFinancial Tracker using NLP
Financial Tracker using NLP
 

Similar to Evaluating the impact of removing less important terms on sentiment analysis

Dialectal Arabic sentiment analysis based on tree-based pipeline optimizatio...
Dialectal Arabic sentiment analysis based on tree-based pipeline  optimizatio...Dialectal Arabic sentiment analysis based on tree-based pipeline  optimizatio...
Dialectal Arabic sentiment analysis based on tree-based pipeline optimizatio...IJECEIAES
 
TOP CITED ARTICLES - The International Journal of Multimedia & Its Applicatio...
TOP CITED ARTICLES - The International Journal of Multimedia & Its Applicatio...TOP CITED ARTICLES - The International Journal of Multimedia & Its Applicatio...
TOP CITED ARTICLES - The International Journal of Multimedia & Its Applicatio...ijma
 
Learner Ontological Model for Intelligent Virtual Collaborative Learning Envi...
Learner Ontological Model for Intelligent Virtual Collaborative Learning Envi...Learner Ontological Model for Intelligent Virtual Collaborative Learning Envi...
Learner Ontological Model for Intelligent Virtual Collaborative Learning Envi...ijceronline
 
Usage of ChatGPT at Higher Learning Institutions.pptx
Usage of ChatGPT at Higher Learning Institutions.pptxUsage of ChatGPT at Higher Learning Institutions.pptx
Usage of ChatGPT at Higher Learning Institutions.pptxNurfadhlina Mohd Sharef
 
NLP-based personal learning assistant for school education
NLP-based personal learning assistant for school education NLP-based personal learning assistant for school education
NLP-based personal learning assistant for school education IJECEIAES
 
Insights to Problems, Research Trend and Progress in Techniques of Sentiment ...
Insights to Problems, Research Trend and Progress in Techniques of Sentiment ...Insights to Problems, Research Trend and Progress in Techniques of Sentiment ...
Insights to Problems, Research Trend and Progress in Techniques of Sentiment ...IJECEIAES
 
Articulo cientifico ijaerv13n12_05
Articulo cientifico ijaerv13n12_05Articulo cientifico ijaerv13n12_05
Articulo cientifico ijaerv13n12_05Nombre Apellidos
 
Sentiment Analysis in Social Media and Its Operations
Sentiment Analysis in Social Media and Its OperationsSentiment Analysis in Social Media and Its Operations
Sentiment Analysis in Social Media and Its OperationsIRJET Journal
 
User experience improvement of japanese language mobile learning application ...
User experience improvement of japanese language mobile learning application ...User experience improvement of japanese language mobile learning application ...
User experience improvement of japanese language mobile learning application ...IJECEIAES
 
Cloud Computing and Content Management Systems : A Case Study in Macedonian E...
Cloud Computing and Content Management Systems : A Case Study in Macedonian E...Cloud Computing and Content Management Systems : A Case Study in Macedonian E...
Cloud Computing and Content Management Systems : A Case Study in Macedonian E...neirew J
 
CLOUD COMPUTING AND CONTENT MANAGEMENT SYSTEMS: A CASE STUDY IN MACEDONIAN ED...
CLOUD COMPUTING AND CONTENT MANAGEMENT SYSTEMS: A CASE STUDY IN MACEDONIAN ED...CLOUD COMPUTING AND CONTENT MANAGEMENT SYSTEMS: A CASE STUDY IN MACEDONIAN ED...
CLOUD COMPUTING AND CONTENT MANAGEMENT SYSTEMS: A CASE STUDY IN MACEDONIAN ED...ijccsa
 
Integrated Social Media Knowledge Capture in Medical Domain of Indonesia
Integrated Social Media Knowledge Capture in Medical Domain of IndonesiaIntegrated Social Media Knowledge Capture in Medical Domain of Indonesia
Integrated Social Media Knowledge Capture in Medical Domain of IndonesiaTELKOMNIKA JOURNAL
 
ANALYSING SPEECH EMOTION USING NEURAL NETWORK ALGORITHM
ANALYSING SPEECH EMOTION USING NEURAL NETWORK ALGORITHMANALYSING SPEECH EMOTION USING NEURAL NETWORK ALGORITHM
ANALYSING SPEECH EMOTION USING NEURAL NETWORK ALGORITHMIRJET Journal
 
Predictive Analytics in Education Context
Predictive Analytics in Education ContextPredictive Analytics in Education Context
Predictive Analytics in Education ContextIJMTST Journal
 
Graph embedding approach to analyze sentiments on cryptocurrency
Graph embedding approach to analyze sentiments on cryptocurrencyGraph embedding approach to analyze sentiments on cryptocurrency
Graph embedding approach to analyze sentiments on cryptocurrencyIJECEIAES
 
Knowledge Sharing Among Employees of Assosa Technical, Vocational and Educati...
Knowledge Sharing Among Employees of Assosa Technical, Vocational and Educati...Knowledge Sharing Among Employees of Assosa Technical, Vocational and Educati...
Knowledge Sharing Among Employees of Assosa Technical, Vocational and Educati...IJSRED
 
A prior case study of natural language processing on different domain
A prior case study of natural language processing  on different domain A prior case study of natural language processing  on different domain
A prior case study of natural language processing on different domain IJECEIAES
 
A scalable, lexicon based technique for sentiment analysis
A scalable, lexicon based technique for sentiment analysisA scalable, lexicon based technique for sentiment analysis
A scalable, lexicon based technique for sentiment analysisijfcstjournal
 

Similar to Evaluating the impact of removing less important terms on sentiment analysis (20)

Dialectal Arabic sentiment analysis based on tree-based pipeline optimizatio...
Dialectal Arabic sentiment analysis based on tree-based pipeline  optimizatio...Dialectal Arabic sentiment analysis based on tree-based pipeline  optimizatio...
Dialectal Arabic sentiment analysis based on tree-based pipeline optimizatio...
 
TOP CITED ARTICLES - The International Journal of Multimedia & Its Applicatio...
TOP CITED ARTICLES - The International Journal of Multimedia & Its Applicatio...TOP CITED ARTICLES - The International Journal of Multimedia & Its Applicatio...
TOP CITED ARTICLES - The International Journal of Multimedia & Its Applicatio...
 
Learner Ontological Model for Intelligent Virtual Collaborative Learning Envi...
Learner Ontological Model for Intelligent Virtual Collaborative Learning Envi...Learner Ontological Model for Intelligent Virtual Collaborative Learning Envi...
Learner Ontological Model for Intelligent Virtual Collaborative Learning Envi...
 
Usage of ChatGPT at Higher Learning Institutions.pptx
Usage of ChatGPT at Higher Learning Institutions.pptxUsage of ChatGPT at Higher Learning Institutions.pptx
Usage of ChatGPT at Higher Learning Institutions.pptx
 
NLP-based personal learning assistant for school education
NLP-based personal learning assistant for school education NLP-based personal learning assistant for school education
NLP-based personal learning assistant for school education
 
Insights to Problems, Research Trend and Progress in Techniques of Sentiment ...
Insights to Problems, Research Trend and Progress in Techniques of Sentiment ...Insights to Problems, Research Trend and Progress in Techniques of Sentiment ...
Insights to Problems, Research Trend and Progress in Techniques of Sentiment ...
 
Articulo cientifico ijaerv13n12_05
Articulo cientifico ijaerv13n12_05Articulo cientifico ijaerv13n12_05
Articulo cientifico ijaerv13n12_05
 
Sentiment Analysis in Social Media and Its Operations
Sentiment Analysis in Social Media and Its OperationsSentiment Analysis in Social Media and Its Operations
Sentiment Analysis in Social Media and Its Operations
 
Ai in Higher Education
Ai in Higher EducationAi in Higher Education
Ai in Higher Education
 
User experience improvement of japanese language mobile learning application ...
User experience improvement of japanese language mobile learning application ...User experience improvement of japanese language mobile learning application ...
User experience improvement of japanese language mobile learning application ...
 
Cloud Computing and Content Management Systems : A Case Study in Macedonian E...
Cloud Computing and Content Management Systems : A Case Study in Macedonian E...Cloud Computing and Content Management Systems : A Case Study in Macedonian E...
Cloud Computing and Content Management Systems : A Case Study in Macedonian E...
 
CLOUD COMPUTING AND CONTENT MANAGEMENT SYSTEMS: A CASE STUDY IN MACEDONIAN ED...
CLOUD COMPUTING AND CONTENT MANAGEMENT SYSTEMS: A CASE STUDY IN MACEDONIAN ED...CLOUD COMPUTING AND CONTENT MANAGEMENT SYSTEMS: A CASE STUDY IN MACEDONIAN ED...
CLOUD COMPUTING AND CONTENT MANAGEMENT SYSTEMS: A CASE STUDY IN MACEDONIAN ED...
 
Integrated Social Media Knowledge Capture in Medical Domain of Indonesia
Integrated Social Media Knowledge Capture in Medical Domain of IndonesiaIntegrated Social Media Knowledge Capture in Medical Domain of Indonesia
Integrated Social Media Knowledge Capture in Medical Domain of Indonesia
 
ANALYSING SPEECH EMOTION USING NEURAL NETWORK ALGORITHM
ANALYSING SPEECH EMOTION USING NEURAL NETWORK ALGORITHMANALYSING SPEECH EMOTION USING NEURAL NETWORK ALGORITHM
ANALYSING SPEECH EMOTION USING NEURAL NETWORK ALGORITHM
 
RESEARCH-PAPER.docx
RESEARCH-PAPER.docxRESEARCH-PAPER.docx
RESEARCH-PAPER.docx
 
Predictive Analytics in Education Context
Predictive Analytics in Education ContextPredictive Analytics in Education Context
Predictive Analytics in Education Context
 
Graph embedding approach to analyze sentiments on cryptocurrency
Graph embedding approach to analyze sentiments on cryptocurrencyGraph embedding approach to analyze sentiments on cryptocurrency
Graph embedding approach to analyze sentiments on cryptocurrency
 
Knowledge Sharing Among Employees of Assosa Technical, Vocational and Educati...
Knowledge Sharing Among Employees of Assosa Technical, Vocational and Educati...Knowledge Sharing Among Employees of Assosa Technical, Vocational and Educati...
Knowledge Sharing Among Employees of Assosa Technical, Vocational and Educati...
 
A prior case study of natural language processing on different domain
A prior case study of natural language processing  on different domain A prior case study of natural language processing  on different domain
A prior case study of natural language processing on different domain
 
A scalable, lexicon based technique for sentiment analysis
A scalable, lexicon based technique for sentiment analysisA scalable, lexicon based technique for sentiment analysis
A scalable, lexicon based technique for sentiment analysis
 

More from Conference Papers

Ai driven occupational skills generator
Ai driven occupational skills generatorAi driven occupational skills generator
Ai driven occupational skills generatorConference Papers
 
Advanced resource allocation and service level monitoring for container orche...
Advanced resource allocation and service level monitoring for container orche...Advanced resource allocation and service level monitoring for container orche...
Advanced resource allocation and service level monitoring for container orche...Conference Papers
 
Adaptive authentication to determine login attempt penalty from multiple inpu...
Adaptive authentication to determine login attempt penalty from multiple inpu...Adaptive authentication to determine login attempt penalty from multiple inpu...
Adaptive authentication to determine login attempt penalty from multiple inpu...Conference Papers
 
Absorption spectrum analysis of dentine sialophosphoprotein (dspp) in orthodo...
Absorption spectrum analysis of dentine sialophosphoprotein (dspp) in orthodo...Absorption spectrum analysis of dentine sialophosphoprotein (dspp) in orthodo...
Absorption spectrum analysis of dentine sialophosphoprotein (dspp) in orthodo...Conference Papers
 
A deployment scenario a taxonomy mapping and keyword searching for the appl...
A deployment scenario   a taxonomy mapping and keyword searching for the appl...A deployment scenario   a taxonomy mapping and keyword searching for the appl...
A deployment scenario a taxonomy mapping and keyword searching for the appl...Conference Papers
 
Automated snomed ct mapping of clinical discharge summary data for cardiology...
Automated snomed ct mapping of clinical discharge summary data for cardiology...Automated snomed ct mapping of clinical discharge summary data for cardiology...
Automated snomed ct mapping of clinical discharge summary data for cardiology...Conference Papers
 
Automated login method selection in a multi modal authentication - login meth...
Automated login method selection in a multi modal authentication - login meth...Automated login method selection in a multi modal authentication - login meth...
Automated login method selection in a multi modal authentication - login meth...Conference Papers
 
Atomization of reduced graphene oxide ultra thin film for transparent electro...
Atomization of reduced graphene oxide ultra thin film for transparent electro...Atomization of reduced graphene oxide ultra thin film for transparent electro...
Atomization of reduced graphene oxide ultra thin film for transparent electro...Conference Papers
 
An enhanced wireless presentation system for large scale content distribution
An enhanced wireless presentation system for large scale content distribution An enhanced wireless presentation system for large scale content distribution
An enhanced wireless presentation system for large scale content distribution Conference Papers
 
An analysis of a large scale wireless image distribution system deployment
An analysis of a large scale wireless image distribution system deploymentAn analysis of a large scale wireless image distribution system deployment
An analysis of a large scale wireless image distribution system deploymentConference Papers
 
Validation of early testing method for e government projects by requirement ...
Validation of early testing method for e  government projects by requirement ...Validation of early testing method for e  government projects by requirement ...
Validation of early testing method for e government projects by requirement ...Conference Papers
 
The design and implementation of trade finance application based on hyperledg...
The design and implementation of trade finance application based on hyperledg...The design and implementation of trade finance application based on hyperledg...
The design and implementation of trade finance application based on hyperledg...Conference Papers
 
Towards predictive maintenance for marine sector in malaysia
Towards predictive maintenance for marine sector in malaysiaTowards predictive maintenance for marine sector in malaysia
Towards predictive maintenance for marine sector in malaysiaConference Papers
 
The new leaed (ii) ion selective electrode on free plasticizer film of pthfa ...
The new leaed (ii) ion selective electrode on free plasticizer film of pthfa ...The new leaed (ii) ion selective electrode on free plasticizer film of pthfa ...
The new leaed (ii) ion selective electrode on free plasticizer film of pthfa ...Conference Papers
 
Searchable symmetric encryption security definitions
Searchable symmetric encryption security definitionsSearchable symmetric encryption security definitions
Searchable symmetric encryption security definitionsConference Papers
 
Super convergence of autonomous things
Super convergence of autonomous thingsSuper convergence of autonomous things
Super convergence of autonomous thingsConference Papers
 
Study on performance of capacitor less ldo with different types of resistor
Study on performance of capacitor less ldo with different types of resistorStudy on performance of capacitor less ldo with different types of resistor
Study on performance of capacitor less ldo with different types of resistorConference Papers
 
Stil test pattern generation enhancement in mixed signal design
Stil test pattern generation enhancement in mixed signal designStil test pattern generation enhancement in mixed signal design
Stil test pattern generation enhancement in mixed signal designConference Papers
 
On premise ai platform - from dc to edge
On premise ai platform - from dc to edgeOn premise ai platform - from dc to edge
On premise ai platform - from dc to edgeConference Papers
 
Review of big data analytics (bda) architecture trends and analysis
Review of big data analytics (bda) architecture   trends and analysis Review of big data analytics (bda) architecture   trends and analysis
Review of big data analytics (bda) architecture trends and analysis Conference Papers
 

More from Conference Papers (20)

Ai driven occupational skills generator
Ai driven occupational skills generatorAi driven occupational skills generator
Ai driven occupational skills generator
 
Advanced resource allocation and service level monitoring for container orche...
Advanced resource allocation and service level monitoring for container orche...Advanced resource allocation and service level monitoring for container orche...
Advanced resource allocation and service level monitoring for container orche...
 
Adaptive authentication to determine login attempt penalty from multiple inpu...
Adaptive authentication to determine login attempt penalty from multiple inpu...Adaptive authentication to determine login attempt penalty from multiple inpu...
Adaptive authentication to determine login attempt penalty from multiple inpu...
 
Absorption spectrum analysis of dentine sialophosphoprotein (dspp) in orthodo...
Absorption spectrum analysis of dentine sialophosphoprotein (dspp) in orthodo...Absorption spectrum analysis of dentine sialophosphoprotein (dspp) in orthodo...
Absorption spectrum analysis of dentine sialophosphoprotein (dspp) in orthodo...
 
A deployment scenario a taxonomy mapping and keyword searching for the appl...
A deployment scenario   a taxonomy mapping and keyword searching for the appl...A deployment scenario   a taxonomy mapping and keyword searching for the appl...
A deployment scenario a taxonomy mapping and keyword searching for the appl...
 
Automated snomed ct mapping of clinical discharge summary data for cardiology...
Automated snomed ct mapping of clinical discharge summary data for cardiology...Automated snomed ct mapping of clinical discharge summary data for cardiology...
Automated snomed ct mapping of clinical discharge summary data for cardiology...
 
Automated login method selection in a multi modal authentication - login meth...
Automated login method selection in a multi modal authentication - login meth...Automated login method selection in a multi modal authentication - login meth...
Automated login method selection in a multi modal authentication - login meth...
 
Atomization of reduced graphene oxide ultra thin film for transparent electro...
Atomization of reduced graphene oxide ultra thin film for transparent electro...Atomization of reduced graphene oxide ultra thin film for transparent electro...
Atomization of reduced graphene oxide ultra thin film for transparent electro...
 
An enhanced wireless presentation system for large scale content distribution
An enhanced wireless presentation system for large scale content distribution An enhanced wireless presentation system for large scale content distribution
An enhanced wireless presentation system for large scale content distribution
 
An analysis of a large scale wireless image distribution system deployment
An analysis of a large scale wireless image distribution system deploymentAn analysis of a large scale wireless image distribution system deployment
An analysis of a large scale wireless image distribution system deployment
 
Validation of early testing method for e government projects by requirement ...
Validation of early testing method for e  government projects by requirement ...Validation of early testing method for e  government projects by requirement ...
Validation of early testing method for e government projects by requirement ...
 
The design and implementation of trade finance application based on hyperledg...
The design and implementation of trade finance application based on hyperledg...The design and implementation of trade finance application based on hyperledg...
The design and implementation of trade finance application based on hyperledg...
 
Towards predictive maintenance for marine sector in malaysia
Towards predictive maintenance for marine sector in malaysiaTowards predictive maintenance for marine sector in malaysia
Towards predictive maintenance for marine sector in malaysia
 
The new leaed (ii) ion selective electrode on free plasticizer film of pthfa ...
The new leaed (ii) ion selective electrode on free plasticizer film of pthfa ...The new leaed (ii) ion selective electrode on free plasticizer film of pthfa ...
The new leaed (ii) ion selective electrode on free plasticizer film of pthfa ...
 
Searchable symmetric encryption security definitions
Searchable symmetric encryption security definitionsSearchable symmetric encryption security definitions
Searchable symmetric encryption security definitions
 
Super convergence of autonomous things
Super convergence of autonomous thingsSuper convergence of autonomous things
Super convergence of autonomous things
 
Study on performance of capacitor less ldo with different types of resistor
Study on performance of capacitor less ldo with different types of resistorStudy on performance of capacitor less ldo with different types of resistor
Study on performance of capacitor less ldo with different types of resistor
 
Stil test pattern generation enhancement in mixed signal design
Stil test pattern generation enhancement in mixed signal designStil test pattern generation enhancement in mixed signal design
Stil test pattern generation enhancement in mixed signal design
 
On premise ai platform - from dc to edge
On premise ai platform - from dc to edgeOn premise ai platform - from dc to edge
On premise ai platform - from dc to edge
 
Review of big data analytics (bda) architecture trends and analysis
Review of big data analytics (bda) architecture   trends and analysis Review of big data analytics (bda) architecture   trends and analysis
Review of big data analytics (bda) architecture trends and analysis
 

Recently uploaded

Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Neo4j
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 

Recently uploaded (20)

Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 

Evaluating the impact of removing less important terms on sentiment analysis

  • 1. E-PROCEEDING OF THE 6TH INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND COMPUTER SCIENCE 2019 E-PROCEEDING OF THE 6TH INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND COMPUTER SCIENCE (AICS 2019). (e-ISBN 978-967-0792-34-7). 29 July 2019, Puri Pujangga Hotel, Bangi Selangor, Malaysia. Organized by https://worldconferences.net Page 6 EVALUATING THE IMPACT OF REMOVING LESS IMPORTANT TERMS ON SENTIMENT ANALYSIS Salhana Amad Darwis, Duc Nghia Pham, Ang Jia Pheng, Ong Hong Hoe Artificial Intelligence Laboratory, MIMOS Berhad {salhana.darwis, nghia.pham, jp.ang, hh.ong} @mimos.my ABSTRACT Sentiment analysis is an important task in Natural Language Processing (NLP) that analyses and predicts people s opinion from te tual data. It is a complex process due to the interactions with computer science, linguistics, psychology and social science disciplines. There is no straight forward rule to analyse and predict sentiment. Supervised learning methods, which adopt learning models from human, are being widely used by NLP researchers and experts to predict sentiment. However, this approach is tricky due to the challenges in ensuring the quality of the manually labelled training dataset. In this study, we investigated the use of ling istic factors to impro e the model s accuracy. We gathered two datasets: (i) 125,000 annotated sentences from Amazon product reviews, and (ii) 11,250 annotated sentences from financial news articles. We then pre-processed the data, identified the less important terms that exist in the dataset, the linguistic features and their effect towards the correctness of predicted sentiment. Our experimental results showed that punctuation separation and removal of supporting POS words improves precision accuracy in larger-generic dataset rather than in smaller-context sensitive dataset. Field of Research: sentiment analysis, supervised learning, NLP, linguistics ---------------------------------------------------------------------------------------------------------------- 1. Introduction Modern technologies enable people to express their opinions, thoughts and feelings about what they experience and things happening around them through various online channels such as social media, online surveys, blogs, etc. Such massively growing social data is indeed meaningful and useful for businesses and organisations to gather feedbacks on their products and services, or to study and analyse social issues, psychological issues, political situations, etc. However, due to the variety, velocity and volume of such data, it is difficult to digest and summarise social data into a meaningful form [13]. Sentiment analysis, also known as opinion mining, specifically analyses people opinions, feelings, and thoughts that are usually expressed in textual data through the above channels [10]. Due to the practical value of sentiment analysis in several application areas, more and more efforts have been spent to address this task and the results are very promising. However, researchers working on it are facing tremendous challenges when dealing with textual data. Sentiment analysis is a part of NLP tasks which is known as one of the complex problem areas in Artificial Intelligence [6]. Solely using linguistic method for sentiment analysis is not sufficient as it does not address uncertainties that require a tacit and cannot learn from past experience when interpreting opinion; both are important elements in analysing sentiment [4,6]. For example, the words: I , am , happ , not , to , the , this and mo ie exist in both sentences below, whereby the negative sentiment is expressed sarcastically:
  • 2. E-PROCEEDING OF THE 6TH INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND COMPUTER SCIENCE 2019 E-PROCEEDING OF THE 6TH INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND COMPUTER SCIENCE (AICS 2019). (e-ISBN 978-967-0792-34-7). 29 July 2019, Puri Pujangga Hotel, Bangi Selangor, Malaysia. Organized by https://worldconferences.net Page 7 Positive - I am happ to atch this mo ie beca se the ticket price is not e pensi e Negative - I am happ to lea e the cinema because the movie is not good In the above example, by using linguistic rules, it is difficult to distinguish which words or phrases that determine positive or negative sentiment as the element of sarcasm was expressed indirectly as a negative sentiment. Due to such linguistic constraints, many researchers from both academic and industries prefer the machine learning approaches for sentiment analysis [1,17,19,20] that enables the machine to learn from examples and past experience just like a human does. On the other hand, using machine learning on textual data exposes other challenges and drawbacks as machine learning methods also have their own limitations: a) Using simpler words representation model a a a ensure the performance in sentiment analysis [19]. It looks at the word counts [7] discarded sequence of words in a sentence and word to word relations. Some important linguistic information from the words, phrases and sentences may not be identified during processing. b) In supervised learning, a good training dataset has to be set up as the input for the machine learning on NLP context to ensure it learn correctly from sufficient examples. Enormous work needs to be done in preparing the training dataset, it is tedious, high in computational cost and yet it may suffer from inaccuracy if less optimal examples fed into the training dataset. c) Using mathematical model with word2vec representation approach for instance, causes the word vector grow when more training data provided. The omission of pre-processing and lack of linguistic information analysis such as punctuation, word variance, part of speech, stop words, and noises may lead to the word vector grow even more with higher dimensions [9,15,16,18]. This factor increases the complexity in text processing and may impact the performance of sentiment analysis. In this paper, we present our work on sentiment analysis using supervised learning for English text. We employed fastText, an open sourced algorithm for text classification developed for Facebook. Studying the dataset, we noticed that there was a large variance of words repeatedly occurred in the dataset. In our experimental study, we identified less important words in the dataset, and then evaluated the impact of dimensionality reduction (by removing these words from the dataset) towards the performance of fastText. Finally, we provide our conclusion and suggestion for future works. 2. Machine Learning for Sentiment Analysis In order for a system to be categorised as intelligent, it must have the ability to learn. In developing intelligent systems, researchers are grappling to find solutions for common bottlenecks in AI which are to handle uncertainty, ambiguity and to applying human tacit knowledge [6]. Machine learning addresses these obstacles by having the ability to learn which could not be addressed by conventional computing methods or rules. Over the years, various researches were carried out to compare the performance of methods for sentiment analysis. Popular machine learning methods for sentiment analysis are Naïve Bayes, k- NearestNeighbour (k-NN), Support Vector Machine (SVM), Long Short-Term Memory (LSTM), and Neural Networks (NN) [1,6,17,19,20]. Most of these methods are supervised learning.
  • 3. E-PROCEEDING OF THE 6TH INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND COMPUTER SCIENCE 2019 E-PROCEEDING OF THE 6TH INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND COMPUTER SCIENCE (AICS 2019). (e-ISBN 978-967-0792-34-7). 29 July 2019, Puri Pujangga Hotel, Bangi Selangor, Malaysia. Organized by https://worldconferences.net Page 8 2.1 Supervised Learning Supervised learning is a machine learning approach whereby the machine learns from labelled or annotated data. The objective of supervised learning is to build intelligent system that can learn from input-output training samples [14]. The most interesting aspect of machine learning is its learning ability [6] which resemble a natural process of human learning. In sentiment analysis using supervised learning, selected samples of sentences from different scenarios are annotated with sentiment from human judgement a a a a a a a a a . Supervised learning, however, requires a good training dataset as the input to ensure the effective learning process. Learning is one of the most important aspect in intelligence and it is a complex human process. As such, training a machine to learn can lead to complexities and challenges. Building an effective learning strategy is one of the hardest questions raised in supervised learning and neural computing [6]. Hence, preparing a training dataset for supervised learning can be tricky. As the system learns from more examples, the complexity increases and the vector size grows with higher dimension of word representation. Training time will also increase with the growing data. Figure 1 illustrates the distribution of words in a multi-dimensional vector space upon completion of training in a supervised learning. Figure 1: Illustration of word vector in multiple dimensions upon completion of training a dataset. 2.2 fastText While many supervised learning methods take a long time to train on large training dataset, fastText (an open sourced text classifier for Facebook) was introduced to simplify the complexity in handling large dataset [2,8]. It improves upon traditional learning method for text classification through various ways: (i) it uses hidden layers to benefit from reusable variables, (ii) it uses hierarchical softmax to optimise the learning process and reduce running time and computational complexity, and (iii) it uses n-Grams instead of 'bags of words' to capture partial information about local word order, and yet remains comparable to other methods that capture actual word order [2,8]. 3. The Complexity in Understanding Natural Language Interpreting human language is a complex cognitive process as it relies not only on the syntactic aspect of the language used but also on how the language is perceived by human when expressing and interpreting thoughts. While various levels of knowledge involved in natural language understanding process such as syntactic knowledge, semantic knowledge and pragmatic knowledge [6], there is no straight forwards rule when analysing natural language. 3.1 Part of Speech (POS) In linguistics, a language is spoken, written and understood by following set of rules such as the POS rule. The role of POS has been studied in relation with psycho-linguistic effect of
  • 4. E-PROCEEDING OF THE 6TH INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND COMPUTER SCIENCE 2019 E-PROCEEDING OF THE 6TH INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND COMPUTER SCIENCE (AICS 2019). (e-ISBN 978-967-0792-34-7). 29 July 2019, Puri Pujangga Hotel, Bangi Selangor, Malaysia. Organized by https://worldconferences.net Page 9 human in constructing and understanding sentences. Due to its importance and vital role in language, the identification of POS, also known as POS tagging, has become a pre-requisite step of language analysis in many computational linguistic applications to provide more linguistic information before other NLP methods are applied to the text. POS tagging is important to assign a word in a sentence with morpho-syntactic category [21]. Table 1 summarises different POS types and their role adapted from an English grammar book [3]. Table 1: Summary of POS types and roles. Word Class Role Example Noun Identifies a person, a thing, an idea, quality or state girl, engineer, friend, horse, wall, flower, country, anger, courage, life Verb Describes what a person or thing does or what happens run, kick, eat Adjective Describes a noun, giving more information about people, animal or things represented by noun or pronoun big, tall, long, hungry, beautiful Adverb Gives information about a verb, adjective, or other adverb quietly, loudly, badly, accurately Pronoun Used in place of a noun that is already known or has already been mentioned to avoid repeating the noun. she, him, that, something, them Preposition Used in front of nouns or pronouns to show the relationship between them. They describe the position, time, or the way in which something is done after, in, to, on, with Conjunction Used to connect phrases, clauses, and sentences. and, because, but, for, if, or, and when. Determiner Introduces a noun and usually used before noun a, an, the, every, this, those Exclamation /Interjection Expresses strong emotion, such as surprise, pleasure, or anger. How wonderful! Hello! Well done! Despite of the crucial role of POS, there are POS types that only play supporting roles to provide description to other words with different POS. Referring to Table 1, POS classes that play supporting role are Adverb, Determiner, Conjunction, Pronoun, and Preposition [17]. In this paper, we refer them as supporting POS . Some of them are frequently used but carry less information in computational linguistic perspective. 3.2 Stop Words In NLP, stop words (such as a , an , the , e er , this , and , and beca se ) are words with lesser information that commonly exist in textual document written in natural language. Stop words are commonly identified from supporting POS such as determiner, pronoun, and preposition [9]. It is a common pre-requisite steps to remove stop words in pre-processing stage in NLP. Even though the exhaustive removal of stop words is a debatable practice [5], the research on stop words removal shows encouraging trend amongst NLP experts from academic and industry through various projects [5,11,12]. 4. Methodology In this research, we focus on text pre-processing step to identify less important words in the training datasets that could impair the effectiveness of the learning. We study various POS classes and stop words. In this section, we experiment the effect of dimensionality reduction of dataset by removing stop words and the less important words such as words with supporting
  • 5. E-PROCEEDING OF THE 6TH INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND COMPUTER SCIENCE 2019 E-PROCEEDING OF THE 6TH INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND COMPUTER SCIENCE (AICS 2019). (e-ISBN 978-967-0792-34-7). 29 July 2019, Puri Pujangga Hotel, Bangi Selangor, Malaysia. Organized by https://worldconferences.net Page 10 POS towards the precision and correctness of sentiment analysis trained using fastText algorithm. The overall sentiment analysis process is illustrated in Figure 2. Step 1 Step 2 Step 3 Step 4 ------------------------------------------------------------------------------------ Step 5 Step 6 Figure 2: The sentiment analysis process. 4.1 The Datasets As illustrated in Figure 2, step 1 involves collecting English dataset of sentiment annotated sentences. For this research, we prepared two datasets as described below: (a) Amazon Dataset: consists of 125,000 sentences from Amazon product reviews (that were written in informal language). These sentences are annotated with either positive or negative sentiment. This dataset was later split into a training set of 100,000 sentences and a test set of 25,000 sentences. (b) Financial Dataset: consists of 11,250 sentences from newspaper articles on Financial Market (that were written in formal language). These sentences are annotated with either positive, negative or neutral sentiment. The dataset was then split into a training set of 9,000 sentences and a test set of 2,250 sentences. 4.2 Pre-processing In this work, we focused on the dimensionality reduction of the training and test dataset as illustrated in step 2, Figure 2 through data pre-processing steps: tokenisation, punctuation separation, POS identification, stop words removal and words with supporting POS removal. Note that, during pre-processing, we ensured that the punctuation was detached from words to allow fastText to identify words as an individual item. For example, the punctuation in the string eat. was separated as eat and . . We refer this process as punctuation separation. Next, we proceeded with removal of less important words in two main parts: First Part: Firstly, we performed stop words removal on the dataset, consists of few sections: 1. Removal of maximum list of stop words collection from various sources of researches and python (consists of 340 unique words) [11,12] Raw Data Sentences with Manual Sentiment Annotation (Input File) Training Dataset Word embedding Model (output) Sentiment Analysis Training Supervise the dataset using fastText Data Pre- processing Tokenization, punctuation, POS, stop words Classification Model (output) New Query Sentiment Prediction Sentiment Predictions (output) Model testing Precision Recall (output)
  • 6. E-PROCEEDING OF THE 6TH INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND COMPUTER SCIENCE 2019 E-PROCEEDING OF THE 6TH INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND COMPUTER SCIENCE (AICS 2019). (e-ISBN 978-967-0792-34-7). 29 July 2019, Puri Pujangga Hotel, Bangi Selangor, Malaysia. Organized by https://worldconferences.net Page 11 2. Removal of minimal list of stop words collection from python (consist of 127 unique words) 3. Removal of minimal stop words with supporting POS- Determiner, Conjunction, Preposition and Pronoun, by their respective POS Second Part: Secondly, we performed removal of words with POS that only play supporting role to other main POS, referred to as supporting POS, consist of POS types: Conjunction, Determiner, Preposition and Pronoun (as mentioned in section 3.1). We gathered maximum list of words with these supporting POS to be removed and validated their POS category with online Cambridge Dictionary and an English grammar book [3]. These word lists are the extended collection to contain maximum words with the respective POS regardless they are in stop words list or not. This is to enable analysis when comparing of the effect of solely removing stop words versus removing all words with supporting POS regardless of their existence in stop words list. In this second part implementation, we only focused to perform below step: 1. Removal of words with supporting POS - Conjunction, Determiner, Preposition and Pronoun from the dataset. 4.3 Training the Model In step 4, we trained the model using the pre-processed dataset (which is shown in step 3) using fastText. This step was repeated for both Amazon dataset and Financial dataset. The training parameters were set with learning rate, lr = 1.0, epoch = 25, and word n-gram = 5. This step is referred to as supervised , where all the samples in dataset is trained using fastText classifier algorithm. Upon completion of training, two models were generated: (1) word embedding model, and (2) classification model. 4.4 Evaluating the Model Finally, in step 5 we evaluate the performance of the newly trained models (which were generated in step 4) using the corresponding test set. We then calculate precision and recall results of this sentiment analysis model. In step 6, we performed sentiment prediction query on live cases to evaluate the correctness of sentiment prediction of the generated model. 5. Results & Discussion The datasets described in section 4.1 were used to conduct experiments to find out the effect of pre-processing on Amazon Dataset and Financial Dataset towards the precision, recall and correctness of sentiment prediction. We also looked at the impact of pre-processing on the vector size of the embedding models that were generated from these tests (by evaluating the Vector file size ) a a a a -processing steps. The results are discussed in the following sections (Section 5.1 and 5.2) 5.1 Removal of Stop Words and Supporting POS from Amazon Dataset Experiment 1: Removal of Stop Words from Amazon dataset. Table 2 shows the results of our experiment after removing stop words from the Amazon dataset. Set A is the baseline where no change was made to the original Amazon dataset. We then separated punctuations from words in the original dataset, resulted in set B. This set was
  • 7. E-PROCEEDING OF THE 6TH INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND COMPUTER SCIENCE 2019 E-PROCEEDING OF THE 6TH INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND COMPUTER SCIENCE (AICS 2019). (e-ISBN 978-967-0792-34-7). 29 July 2019, Puri Pujangga Hotel, Bangi Selangor, Malaysia. Organized by https://worldconferences.net Page 12 then used to create the subsequent sets C H, by removing either all the stop words or only the stop words with a certain POS role. Table 2: Experiment 1 results Removal of stop words from Amazon dataset. Amazon dataset Evaluation Total words Word embedding model Stop word removal Precision Recall Count Reduction (%) Size (KBs) Reduction (%) Set A - Original Training Dataset 0.907 0.907 9,179,234 - 410,166 - Set B Punctuation Separation 0.920 0.920 9,179,234 - 249,031 39.280 Set C - Stop words (Maximum) 0.896 0.896 4,927,534 46.32% 217,425 46.990 Set D - Stop words (Minimum) 0.903 0.903 5,401,919 41.15% 211,048 48.545 Set E - Stop words (POS Conjunction) 0.919 0.919 8,884,591 3.21% 213,034 48.061 Set F - Stop words (POS Determiner) 0.919 0.919 8,199,831 10.67% 212,778 48.123 Set G - Stop words (POS Pronoun) 0.919 0.919 8,594,147 6.37% 212,721 48.137 Set H - Stop words (POS Preposition) 0.918 0.918 8,314,093 9.42% 212,836 48.109 From this experiment, we observed that separation of punctuation from words (set B) contributes to the highest precision and recall at 0.92, while reducing the vector size by 39.28%. On the other hand, removal of stop words from the punctuation-separated dataset had a negative impact on the performance of fastText (see results of sets C and D) compared to the original set A. Meanwhile, removal of stop words by a specific POS role on top of punctuation separation (sets E H) resulted in slight improvements on precision and recall compared to the original set A. Out of the four support POS roles tested, removal of stop words with preposition role produced the least performance boost. However, the difference in precision and recall to the other three POS roles (which were on par on performance) is very negligible (0.001). Table 3: Experiment 2 results Removal of words with supporting POS roles (Determiner, Conjunction, Preposition, or Pronoun) from the Amazon dataset. Amazon dataset Evaluation Total words Word embedding model Supporting POS removal Precision Recall Count Reduction (%) Size (KBs) Reduction (%) Set A - Original Training Dataset 0.907 0.907 9,179,234 - 410,166 - Set B Punctuation Separation 0.920 0.920 9,179,234 - 249,031 39.280 Set C - All words (POS Determiner) 0.917 0.917 7,908,000 13.84% 212,603 48.166 Set D - All words (POS Conjunction) 0.919 0.919 8,593,953 6.37% 212,662 48.152 Set E - All words (POS Preposition) 0.919 0.919 8,214,290 10.51% 212,520 48.186 Set F - All words (POS Pronoun) 0.918 0.918 8,664,127 5.61% 212,792 48.121
  • 8. E-PROCEEDING OF THE 6TH INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND COMPUTER SCIENCE 2019 E-PROCEEDING OF THE 6TH INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND COMPUTER SCIENCE (AICS 2019). (e-ISBN 978-967-0792-34-7). 29 July 2019, Puri Pujangga Hotel, Bangi Selangor, Malaysia. Organized by https://worldconferences.net Page 13 Experiment 2: Removal of All Words with Supporting POS from Amazon dataset. Table 3 shows our experimental result on removal of words with supporting POS roles (determiner, conjunction, preposition, or pronoun) from the Amazon dataset. Similar to experiment 1, set B was created by separating punctuations from words; and sets C F were created by removing words with respective support POS role from set B. The previous observation in Experiment 1 also applicable here: removal of words with supporting POS roles improved fastText a a a on the original set A. However, the impact of these word removals remained slightly less than just simply separating punctuations, even though these settings can significantly reduce the size of the word embedding model. Among four supporting POS roles tested, removal of determiners produced the highest reduction of 13.67% total words. However, its impact on a a a a a a POS . 5.2 Removal of All Words with Supporting POS from Financial Dataset Experiment 3: Removal of Stop Words from Financial Dataset Table 4 shows the results of our experiment after removing stop words from the Financial dataset. Set A is the baseline where no change was made to the original Financial dataset. We then separated punctuations from words in the original dataset, resulted in set B. This set was then used to create the subsequent sets C H, by removing either all the stop words or only the stop words with a certain POS role. Table 4: Experiment 3 results Removal of stop words from Financial dataset. Financial dataset Evaluation Total words Word embedding model Stop word removal Precision Recall Count Reduction (%) Size (KBs) Reduction (%) Set A - Original Training Dataset 0.809 0.809 234,110 - 25,301 - Set B Punctuation Separation 0.809 0.809 234,110 - 16,928 33.00 Set C - Stop words (Maximum) 0.787 0.787 166,853 28.72 16,334 35.44 Set D - Stop words (Minimum) 0.793 0.793 173,139 26.04 16,498 34.79 Set E - Stop words (POS Conjunction) 0.806 0.806 232,264 0.79 16,563 34.53 Set F - Stop words (POS Determiner) 0.801 0.801 216,902 7.35 16,799 33.60 Set G -Stop words (POS Pronoun) 0.799 0.799 232,821 0.55 16,813 33.54 Set H - Stop words (POS Preposition) 0.794 0.794 212,009 9.44 16,552 34.59 Table 4 shows that separation of punctuations from words (set B) neither increased nor decreased the performance of fastText in comparison to its performance on the original dataset (set A). This is contrast to our finding in experiments 1 & 2 where separation of punctuations from words gave the highest accuracy boost to fastText. In addition, removal of stop words, regardless of whether POS role filter was applied or not, decreased the accuracy of fastText on this Financial dataset, ranging from -0.022 (set C:
  • 9. E-PROCEEDING OF THE 6TH INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND COMPUTER SCIENCE 2019 E-PROCEEDING OF THE 6TH INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND COMPUTER SCIENCE (AICS 2019). (e-ISBN 978-967-0792-34-7). 29 July 2019, Puri Pujangga Hotel, Bangi Selangor, Malaysia. Organized by https://worldconferences.net Page 14 maximum list of stop words) to -0.003 (set E: conjunction). This observation is again contrast to our finding on the larger Amazon dataset where removal of only stop words with a specific POS role helped improve fastText a . Experiment 4: Removal of All Words with Supporting POS from Financial Dataset Table 5 shows our experimental result on removal of words with supporting POS roles (Determiner, Conjunction, Preposition, or Pronoun) from the Financial dataset. Similar to previous experiments, set B was created by separating punctuations from words; and sets C F were created by removing words with respective support POS role from set B. Table 5: Experiment 4 results - Removal of words with supporting POS roles (Determiner, Conjunction, Preposition, or Pronoun) from Financial dataset. Financial dataset Evaluation Total words Word embedding model Supporting POS removal Precision Recall Count Reduction (%) Size (KBs) Reduction (%) Set A - Original Training Dataset 0.809 0.809 234,110 - 25,301 - Set B - Punctuation Separation 0.809 0.809 234,110 - 16,928 33.00 Set C - All words (POS Determiner) 0.797 0.797 214,040 8.57% 16,909 33.16 Set D -All words (POS Conjunction) 0.802 0.802 225,743 3.57% 16,959 32.97 Set E - All words (POS Preposition) 0.789 0.789 206,336 11.86% 16,869 33.33 Set F - All words (POS Pronoun) 0.802 0.802 231,503 1.12% 16,974 32.91 In contrast to our finding in experiment 2, removal of words with any supporting POS roles (determiner, conjunction, preposition, or pronoun) worsened the accuracy of fastText on the Financial dataset, with a negative difference ranging from -0.020 (set E: preposition) to -0.007 (set D: conjunction and set F: pronoun). Here, removal of prepositions (not determiners as found in experiment 2) produced the highest reduction of 11.86% total words. It also reduced the word embedding model the most by 33.33%. 5.3 Discussion Punctuation separation provided the best result in the Amazon dataset due to the reduction of noise from the original dataset. In the original dataset, punctuations were attached with word causing unnecessary formation of new tokens which were regarded as a new word variance by fastText. For example, both tokens eat and eat. exist in the original dataset, causing the formation a new word eat. . Due to the large data volume in Amazon dataset, these word variances grow tremendously and increase the size of word embedding model. They were eliminated by punctuation separation. On the other hand, our Financial dataset consists of formal written sentences and does not contain large vocabulary given its small size. Hence, the low occurrence of punctuation-attached word variances, resulting in neither improvement nor degrading in performance when pre-processing text with punctuation separation. In our experiments, exhaustive removal of stop words decreases the precision and recall of fastText on both Amazon and Financial datasets. Our stop words list is a general compilation of frequently used words in computational linguistics with various POS and different level of
  • 10. E-PROCEEDING OF THE 6TH INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND COMPUTER SCIENCE 2019 E-PROCEEDING OF THE 6TH INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND COMPUTER SCIENCE (AICS 2019). (e-ISBN 978-967-0792-34-7). 29 July 2019, Puri Pujangga Hotel, Bangi Selangor, Malaysia. Organized by https://worldconferences.net Page 15 importance and meaning. Exhaustive removal of all words from this list without carefully identifying their roles in a particular dataset causes reduction in sentiment precision. For example, the word sho ld , ha e , does exist in stop word list with POS category of Verb that may carry some valuable meaning in a particular context even though these words can be less important in other contexts. This finding also justifies some reasons whereby the exhaustive stop words removal is a debatable practice among scholars in NLP [5]. Removal of words with POS role (determiner, conjunction, preposition and pronoun) decreases the accuracy of fastText on Financial dataset but improves the accuracy on Amazon dataset. This finding suggests that, in a specific domain dataset, there is lower occurrence of less important words even though the same word may appear as unimportant in other generic dataset. In Financial dataset, it seems to suggest that words with POS determiner, conjunction, preposition and pronoun may carry some meaning in determining sentiment. As the supervised learning model learn from lesser example in smaller dataset, unnecessary removal of words may worsen the result. Furthermore, word sequence is being captured when using n-Gram model on fastText. Hence, unnecessary removal of valuable words can worsen the accuracy. The above finding is further supported in experiment 4 Set E whereby the removal of preposition produced the highest reduction of words and word embedding model, and worst precision. This is due to the nature of Financial a a . U Amazon dataset which is written in free-style format, the Financial dataset contain various indicators in financial market domain such as do n , up , above . T , a with POS-preposition can contribute to negative impact on precision due to the removal of such important keywords which plays role as important indicator in Financial dataset. Below are the examples: Financial dataset: negative F&N was down 18 sen to RM35.02. positive The Hang Seng was up 1.33%. positive prices have remained above that level since April Amazon dataset: negative Very disappointed in this book, as it was to have 16 illustrations, as well as other pictures. The selection of words removal should be carefully studied to avoid important keyword is removed from a particular dataset. 6. Conclusion The computational linguistic practice to remove stop words and less important words with supporting POS is relevant on text processing. However, not all of them should be removed. Part of these words may contain meaningful elements on a particular context. The level of words importance may vary from one dataset to another. For example, the words do n , up , above a a a F a a a a . The stop word list and words with supporting POS to be removed should be used as general guideline in text pre-processing step. However, the selection of exact words to be removed need to be meticulously studied based on few aspects which are (1) the structure of dataset sentences: structured or unstructured (2) the existence of domain specific terms (3) words distribution and writing style: formal or informal (4) volume of the dataset.
  • 11. E-PROCEEDING OF THE 6TH INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND COMPUTER SCIENCE 2019 E-PROCEEDING OF THE 6TH INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND COMPUTER SCIENCE (AICS 2019). (e-ISBN 978-967-0792-34-7). 29 July 2019, Puri Pujangga Hotel, Bangi Selangor, Malaysia. Organized by https://worldconferences.net Page 16 Basic pre-processing steps such as punctuation separation from word is important to reduce irrelevant combination of words and punctuation in a larger and generic domain dataset. The omission of punctuation processing in a larger and generic dataset, such as punctuation separation may lead to unnecessary noise exist in the dataset that could cause the increment of word embedding model size and impairment in the learning process. 7. Future Works In future, we plan to focus in finding the actual less important stop words that can impair the learning process of each datasets. Apart from that, we intend to further investigate the impact of removing words that exist in stop words by various POS categories versus removing all words with supporting POS roles. On the other hand, a separate research to identify higher importance words in sentiment analysis should also be done along with identification of less important words. This is to ensure words and phrases in the training dataset are given the correct weightage during the supervised learning process. 8. References [1] Annett, M., & Kondrak, G. (2008). A Comparison of Sentiment Analysis Techniques: Polarizing Movie Blogs. Advances in Artificial Intelligence, (May), 25 35. https://doi.org/10.1007/978-3-540-68825-9_3 [2] Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2016). Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics. https://doi.org/5. 10.1162/tacl_a_00051 [3] Ca , R., M Ca , M., Ma , G., & O K , A. (2016). English Grammar Today The Cambridge A-Z Grammar of English. Cambridge United Kingdom: Cambridge University Press. [4] Deng, S., Sinha, A. P., & Zhao, H. (2017). Resolving Ambiguity in Sentiment Classification: The Role of Dependency Features. ACM Trans. Manage. Inf. Syst., 8(2 3), 4:1--4:13. https://doi.org/10.1145/3046684 [5] Dolamic, L., & Savoy, J. (2010). Brief communication: When stopword lists make the difference. Journal of the American Society for Information Science and Technology, 61(1), 200 203. https://doi.org/10.1002/asi.21186 [6] G F L . (1998). A a I : S a S a s for Complex Problem Solving. In Addison-Wesley (3rd Edition). Addison-Wesley. [7] Goldberg, Y. (2017). Neural network methods for natural language processing (Synthesis Lectures on Human Language Technologies). Morgan & Claypool Publisher, Vol. 10, pp. 1 309. https://doi.org/10.1162/COLI_r_00312 [8] Joulin, A., Grave, E., Bojanowski, P., & Mikolov, T. (2017). Bag of Tricks for Efficient
  • 12. E-PROCEEDING OF THE 6TH INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND COMPUTER SCIENCE 2019 E-PROCEEDING OF THE 6TH INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND COMPUTER SCIENCE (AICS 2019). (e-ISBN 978-967-0792-34-7). 29 July 2019, Puri Pujangga Hotel, Bangi Selangor, Malaysia. Organized by https://worldconferences.net Page 17 Text Classification. Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, 2, 427 431. https://doi.org/10.18653/v1/E17-2068 [9] Kumar, A. A., & Chandrasekhar, S. (2012). Text Data Pre-processing and Dimensionality Reduction Techniques for Document Clustering. International Journal of Engineering Research & Technology (IJERT), 1(5), 1 6. https://doi.org/2278-0181 [10] Kumar, A., & Panda, S. P. (2018). A Survey of Sentiment Analysis on Social Media. International Journal for Research in Applied Science & Engineering Technology (IJRASET), 6( February). https://doi.org/2321-9653 [11] Larsen, K. R. (2016). MIS Quarterly Article Stopword List. (January). [12] Larsen, K. R., & Bong, C. H. (2016). A Tool for Addressing Construct Identity in Literature Reviews and Meta-Analyses. MIS Quarterly, 40(3), 529 551. https://doi.org/10.25300/misq/2016/40.3.01 [13] Liu, B. (2010). Sentiment Analysis and Subjectivity (To appear in Handbook of Natural Language Processing) (second edi; N. Indurkhya & F. J. Damerau, Eds.). http://www.cs.uic.edu/~liub/FBS/NLP-handbook-sentiment- analysis.pdf%5Cnhttp://people.sabanciuniv.edu/berrin/proj102/1-BLiu-Sentiment Analysis and Subjectivity-NLPHandbook-2010.pdf [14] Liu, Q., & Wu, Y. (2012). Supervised Learning. In N. M. Seel (Ed.), Encyclopedia of the Sciences of Learning (2012th ed.). https://doi.org/10.1007/978-1-4419-1428-6_451 [15] Mao, Y., Balasubramanian, K., & Lebanon, G. (2010). Dimensionality Reduction for Text using Domain Knowledge. COLING 10 Proceedings of the 23rd International Conference on Computational Linguistics: Posters, (August ), 801 809. [16] Martins, C. A., Monard, M. C., & Matsubara, E. T. (2003). Reducing the Dimensionality of Bag-of-Words Text Representation Used by Learning Algorithms. Proc of 3rd IASTED International Conference on Artificial Intelligence and Applications, 228 233. https://pdfs.semanticscholar.org/a90f/ca4c78b66fb0e28f4fe8086d28f3720c7999.pdf [17] Pang, B., Lee, L., & Vaithyanathan, S. (2002). Thumbs up? Sentiment Classification using Machine Learning Techniques. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 79 86. https://doi.org/10.3115/1118693.1118704 [18] Ponmuthuramalingam, P., & Devi, T. (2010). Effective Dimension Reduction Techniques for Text Documents. IJCSNS International Journal of Computer Science and Network Security, 10(7). Retrieved from http://paper.ijcsns.org/07_book/201007/20100712.pdf [19] R , E., Ha a , M., Wa a , M., J , M., E , ., & S a , M. (2018). More than Bags of Words: Sentiment Analysis with Word Embeddings. Communication Methods and Measures, 12(2 3), 140 157. https://doi.org/10.1080/19312458.2018.1455817
  • 13. E-PROCEEDING OF THE 6TH INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND COMPUTER SCIENCE 2019 E-PROCEEDING OF THE 6TH INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND COMPUTER SCIENCE (AICS 2019). (e-ISBN 978-967-0792-34-7). 29 July 2019, Puri Pujangga Hotel, Bangi Selangor, Malaysia. Organized by https://worldconferences.net Page 18 [20] Sadanandan, A. A., Osman, N. A., Saifuddin, H., Khairuddin, M., Pham, D. N., Hoe, H., & Lumpur, K. (2016). Improving Accuracy in Sentiment Analysis for Malay Language. 4th International Conference on Artificial Intelligence and Computer Science (AICS2016), (November), 28 29. [21] Vanroose, P. (2001). Part-of-Speech Tagging From an Information-Theoretic Point of View. (22nd July). http://citeseer.ist.psu.edu/646161. html View publication stats View publication stats