TWITTER PRODUCT REVIEWS USING NLP
A PREPRINT
Andrew Chan
Department of Computer Science
Virginia Tech
andrewclchan211@vt.edu
Devin Sohi
Department of Computer Science
Virginia Tech
devin1@vt.edu
Justin Cheng
Department of Computer Science
Virginia Tech
jcheng2000@vt.edu
Nicholas Frankenberg
Department of Computer Science
Virginia Tech
nicholascf@vt.edu
May 8, 2022
ABSTRACT
Product reviews help consumers make better decisions about which products to buy and why. To broaden the reach of
such analysis, we use product reviews on Twitter along with other datasets to create a rating system based on the
sentiments found in those reviews using sentiment analysis. This should make it easier for consumers to figure out
which products are worth buying and why. Initially, we evaluate standard traditional and deep learning models on our
datasets. However, labeling incoming data is too time-consuming and labor-intensive to consider for large-scale tasks.
To mitigate the time required to label such data, and if time allows, we plan to benchmark how well traditional
machine learning models and our bidirectional LSTMs perform with pseudo-labeling.
Keywords NLP · LSTM · Sentiment Analysis
1 Introduction
Sentiment analysis has been a popular topic in the natural language processing domain and has been closely tied
to product reviews and how consumers make decisions between products. The goal is to summarize the various
complexities and emotions in human language into various categories.
Traditionally, Naive Bayes over some embedding space has been used with a fair amount of success in the NLP domain [1].
More recently, deep learning models such as RNNs and LSTMs have been introduced for this problem and for NLP in general,
and are expected to perform better than Naive Bayes [2].
However, the labeling needed for this data can be very time-consuming and labor-intensive. This has spurred ongoing
research in semi-supervised learning [3].
Some notable recent results in semi-supervised learning have been published by Google Research: a simple unsupervised
consistency loss with synonym-based transformations achieved 95.5% accuracy on the IMDB dataset with only 20 of 25,000
data points labeled, versus 93.5% without semi-supervised learning [4].
However, the most basic approach to this problem is pseudo-labeling. In summary, a model is trained on labeled data
and asked to predict labels, called pseudo labels, for the unlabeled data; the model is then retrained on the combined
dataset. Pseudo-labeling has been theoretically justified [5] and experimentally proven successful [6, 7].¹
¹ We will pursue this approach for extra credit.
2 The Approach
Our approach to tackling the problem of Twitter product reviews is twofold. First, we tackle sentence-level sentiment
analysis in a supervised manner using the benchmarked IMDB dataset [8] along with a Twitter sentiment dataset [9] to
train our chosen model, the bidirectional LSTM. Second, to bias the model toward Twitter product reviews, we will, for
extra credit, add pseudo-labeling with an unlabeled Twitter product review dataset to improve performance.
2.1 Datasets
For handling sentiment analysis in a supervised manner, we use two datasets: the benchmarked IMDB dataset and a
Twitter-centric dataset called Twitter Sentiment Extraction [8, 9]. The IMDB dataset consists of about 50,000 movie
reviews, 25,000 of which are labeled positive and the rest negative. It has been benchmarked on PapersWithCode and
used in numerous research papers. We believe this dataset will help us build a good review-based sentiment analysis
model that can handle a wide variety of words and opinions. The downsides of using this dataset to train a model for
Twitter product reviews are that it is polarized, does not contain tweets, and some of its reviews reach up to 3,000
words. It is polarized because it only contains reviews rated 0-4 or 7-10 out of 10 stars, which might bias our model
to take more extreme stances on product reviews.
The Twitter Sentiment Extraction (TSE) dataset is a Kaggle competition dataset meant for standard Twitter sentiment
analysis [9]. It consists of 27,000 tweets labeled negative, neutral, or positive. This dataset will help focus our
model on how to interpret tweets. Tweets do not always have the same tone as ordinary sentences, so it is important
for our model to understand the context of the reviews. The pros of this dataset are that it is based on Twitter and
its short tweets allow quick model training. However, the dataset is also small; sentiment analysis often requires
many training examples, so our model will likely underfit.
Neither of the above datasets is explicitly meant for analyzing Twitter product reviews. The IMDB dataset targets movie
reviews, while the TSE dataset targets tweet sentiment regardless of context. A model trained on these datasets might
miss some of the intricacies particular to Twitter product reviews, so it would be prudent to add Twitter product
review training examples. Unfortunately, we were not able to find a labeled dataset of Twitter product reviews. There
is an unlabeled one on Kaggle, called Twitter Product Sentiment Analysis (TPSA), focused on a subset of product reviews
including Apple products [10]. The semi-supervised technique of pseudo-labeling would allow us to leverage this data to
give our models more context on Twitter product reviews. We plan to attempt this technique for extra credit and to
analyze whether it yields any gain in performance.
The final set of data is a small set of hand-picked tweets used to test the models trained on the two datasets above.
We refer to it as the "Hand-Labeled" dataset from this point on [9, 10]. It is curated to contain many cases that are
tricky for the models to handle, such as target-dependent or sarcastic tweets. Most of the tweets relate to reviews of
Apple products. Additionally, all the tweets are either positive or negative, and the label is taken from the point of
view of the Apple product, i.e., it is target-dependent. For example, "Google>Apple" would be classified as negative
from the point of view of the target Apple. Because the dataset uses sarcastic and target-dependent tweets, we expect a
comparatively low accuracy on it.
Worth mentioning is that the IMDB dataset has two output classes, while the TSE dataset has three. For the models
trained on each of these datasets, particular rating systems, e.g., a 5-star scale, would require scaling the output
values, which is outside the scope of our project. Before we can use the data from these datasets to train our models,
we need to preprocess it.
2.2 Preprocessing
In order to train our models on the above datasets, we must first preprocess the data. The different models that we have
chosen to compare will use different preprocessing techniques to fit their needs.
Our baseline Naive Bayes model uses a simple Term Frequency-Inverse Document Frequency (TF-IDF) preprocessing
technique [11]. This method vectorizes the frequency of words within a tweet and normalizes each term by its occurrence
throughout the dataset as a whole. A more thorough explanation of TF-IDF is provided in the appendix. An advantage of
this method is that it is very quick to run and yields a decent vectorization that Naive Bayes can exploit. Its
disadvantages are the complete loss of word ordering, the inability to handle words not seen during training, and a
high input dimension that would not be appropriate for more sophisticated models. Despite
this method’s simplicity, Naive Bayes is able to extract enough information from the vectorization to achieve adequate
results.
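For reference, the sketch below shows how a baseline of this kind could be assembled with scikit-learn. It is a minimal
illustration, not our exact code; the file name, column names, and split are assumptions.

```python
# Minimal sketch of a TF-IDF + Naive Bayes baseline (scikit-learn).
# The CSV file and its 'text'/'sentiment' columns are illustrative assumptions.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

df = pd.read_csv("tweets.csv")  # hypothetical labeled tweet file
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["sentiment"], test_size=0.2, random_state=42)

# TF-IDF turns each tweet into a sparse term-weight vector;
# Multinomial Naive Bayes then classifies on those weights.
model = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
model.fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```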
Our neural network based models (BiRNN and BiLSTM) take a different approach to preprocessing. Before the data is
handed to the model, a few steps are applied. The first is English stop word removal: removing words that provide
little to no meaning in a sentence, such as "the", "it", "to", and "a". The next step is lemmatization, the process of
reducing a word to its root, for example turning "going" into "go". These methods are used together to reduce the total
dictionary size and to help achieve a more meaningful encoding. Once they are complete, the resulting words are
translated into integers representing their index within the dictionary that has been built, and the neural networks
take these integers as their input. Because of the recurrent nature of these models, word order is preserved throughout
training. As the model trains, it learns an optimal encoding of the words in its embedding layer while minimizing the
loss. This approach has proven effective and encodes words meaningfully, in a way similar to an autoencoder. Given our
results and research, no external word embeddings are necessary.
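A minimal sketch of this preprocessing path is shown below, assuming NLTK for stop words and lemmatization; the
tokenization rule and the reserved padding/unknown indices are illustrative assumptions.

```python
# Sketch of the neural-network preprocessing path described above:
# stop-word removal, lemmatization, then integer indexing of the vocabulary.
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords")
nltk.download("wordnet")
STOP = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def clean(tweet: str) -> list[str]:
    # simple whitespace tokenization and alphabetic filtering (assumption)
    tokens = [t.lower() for t in tweet.split() if t.isalpha()]
    return [lemmatizer.lemmatize(t) for t in tokens if t not in STOP]

def build_vocab(tweets: list[str]) -> dict[str, int]:
    vocab = {"<pad>": 0, "<unk>": 1}  # reserved indices (assumption)
    for tweet in tweets:
        for tok in clean(tweet):
            vocab.setdefault(tok, len(vocab))
    return vocab

def encode(tweet: str, vocab: dict[str, int]) -> list[int]:
    return [vocab.get(tok, vocab["<unk>"]) for tok in clean(tweet)]
```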
2.3 Model Selection
For the models, we have selected Naive Bayes, BiRNN, and BiLSTM for our comparisons. While searching for a good
baseline, we considered a variety of models including Naive Bayes and SVM. We decided on Naive Bayes because many other
projects have used it with great success [1]. In addition, a Naive Bayes model was the most accurate model for the IMDB
dataset on Kaggle, which provides additional motivation for our choice.
Briefly, the Naive Bayes model takes the probabilities of a classification given different features and labels and
returns the most likely classification given the features. Essentially, it predicts a classification based on the
frequency of features for a given label, providing a context-free way to analyze the features and produce predictions.
The deep learning baseline we have chosen is a Bidirectional Recurrent Neural Network (BiRNN). We opted for a
bidirectional RNN as opposed to a single-direction RNN because a single-direction RNN can miss part of a word's
context, since it only reads the sequence in one direction. A BiRNN addresses this by also feeding the sequence of
words in reverse order when generating the weights for each word, which in turn provides more context for words that
are spelled the same but have different meanings.
Finally, we chose the BiLSTM model as our cutting-edge solution. Like the BiRNN, the BiLSTM also feeds the sequence of
words in both orders to minimize the context problem. As an improvement over the RNN, the BiLSTM uses a series of gates
to decide what information to keep and what to forget. Ideally, these gates will yield a network that can better
predict the classifications for the datasets.
We chose the BiRNN as the deep learning baseline because the BiLSTM was our chosen state-of-the-art model; since the
BiLSTM is essentially an improved form of the BiRNN, this provides a more direct comparison between the models.
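As a concrete illustration of the kind of bidirectional LSTM classifier being compared, a PyTorch sketch is given
below. The layer sizes, the number of classes, and the choice to classify from the concatenated final forward and
backward hidden states are assumptions, not the exact architecture used here.

```python
# Sketch of a bidirectional LSTM classifier (PyTorch); sizes are illustrative.
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=256,
                 num_layers=2, num_classes=3, dropout=0.2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=num_layers,
                            batch_first=True, bidirectional=True, dropout=dropout)
        self.fc = nn.Linear(2 * hidden_dim, num_classes)  # forward + backward states

    def forward(self, token_ids):                # token_ids: (batch, seq_len)
        embedded = self.embedding(token_ids)     # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(embedded)        # h_n: (2 * num_layers, batch, hidden_dim)
        # concatenate the last forward and last backward hidden states
        features = torch.cat([h_n[-2], h_n[-1]], dim=1)
        return self.fc(features)                 # raw class scores (logits)
```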
3 Experiments
We ran a few experiments on our models to compare their performance on different data. While we ran into some problems
because of a lack of GPU access, we were able to use Weights and Biases to handle our hyperparameter grid search and
find the optimal parameters for each of our models. The setup of the grid search, as well as the other experiments we
ran, is explained below.
3.1 Experimental Set-up
Because it is quick to train on, we decided to experiment on the TSE dataset first. A 64/16/20% train/validation/test
split was applied, and the splits were given to each of our three chosen models: Naive Bayes, BiRNN, and BiLSTM. Since
they are baseline models, we decided not to hyperparameter-tune the Naive Bayes and BiRNN, keeping their architectures
constant; the BiRNN reused the optimal hyperparameters found for the BiLSTM. The BiRNN and BiLSTM models were both
trained with early stopping for the number of epochs and the categorical cross-entropy loss function with the Adam
optimizer. Additionally, a batch size of 32 was used for both models. Two hidden cells were used for the BiLSTM model
while only one was used for the BiRNN. For our BiLSTM, we ran a full grid search over the hyperparameters described
below using
Weights and Biases. The four hyperparameters we focused on are the learning rate, embedding size, hidden size, and
dropout rate within the LSTM cell.
We chose these four hyperparameters because they are easy to change with the PyTorch API. The learning rate makes a big
difference in how quickly the model converges to a local minimum and in whether that local minimum is close to the
global minimum, so trying different values for it can lead to drastic differences in the model's performance. The
embedding size determines the size of the embedding layer and the dimension in which a word is represented; increasing
the dimension allows more nuances between words to be captured, at the possible expense of over-fitting what a word
means, which can cause context problems. The hidden size determines how much memory can be held within an LSTM cell,
an important property of the model. Lastly, dropout within the LSTM cell can prevent potential overfitting and is
common practice, though overfitting will likely not be a problem with smaller datasets. The table below shows the
different hyperparameter combinations we attempted:
Table 1: LSTM model parameters

Hyperparameters            Values
Epochs used                4
Learning rate              0.0001, 0.001, 0.01
Embedding size             300, 400, 500 (best above)
Hidden size                128, 256, 512 (best above)
Dropout within LSTM cell   0.2, 0.4, 0.6 (best above)
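A Weights and Biases sweep over these values could be configured roughly as sketched below; the training function body,
metric name, and project name are illustrative assumptions, not our exact configuration.

```python
# Sketch of a Weights & Biases grid sweep over the values in Table 1.
import wandb

sweep_config = {
    "method": "grid",  # random search was used instead for the IMDB runs
    "metric": {"name": "val_loss", "goal": "minimize"},
    "parameters": {
        "learning_rate": {"values": [0.0001, 0.001, 0.01]},
        "embedding_size": {"values": [300, 400, 500]},
        "hidden_size": {"values": [128, 256, 512]},
        "dropout": {"values": [0.2, 0.4, 0.6]},
    },
}

def train():
    run = wandb.init()
    cfg = run.config
    # build and train the BiLSTM with cfg.learning_rate, cfg.embedding_size, etc.,
    # logging wandb.log({"val_loss": ...}) each epoch (omitted here)

sweep_id = wandb.sweep(sweep_config, project="twitter-product-reviews")
wandb.agent(sweep_id, function=train)
```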
The second dataset we experimented on was the IMDB dataset. A 40/10/50% train/validation/test split was applied, and
the splits were given to each of our three chosen models: Naive Bayes, BiRNN, and BiLSTM. We chose these percentages
because the common benchmarks in the literature use 50% of the data for testing. The BiRNN and BiLSTM models were both
trained with early stopping for the number of epochs and the categorical cross-entropy loss function with the Adam
optimizer. Additionally, a batch size of 32 and 2 hidden cells were used for both BiRNN and BiLSTM models. Using the
possible hyperparameter combinations shown in the table above, we ran a random search with Weights and Biases; we chose
random search as opposed to grid search because the much longer training time made trying all combinations infeasible.
As in the TSE experiments, hyperparameter tuning was only done on the BiLSTM, and the BiRNN was run with the optimal
parameters found for the BiLSTM.
Once the hyperparameter searches were done, we evaluated the test accuracies by training new models with the same
hyperparameters.
As extra credit, we perform pseudo-labeling on the BiLSTM with optimal hyperparameters on the TSE dataset. The training
setup is the same as our standard supervised setup, but with an added unsupervised component. We split the data into
64% labeled data, 20% unlabeled data, and 16% validation. In every epoch we train over the labeled dataset, then the
unlabeled dataset, with the unsupervised loss being the cross entropy between the previous model's predictions and the
current model's output, weighted by an alpha value of 1.0.
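The loop below is a minimal PyTorch sketch of this setup; the model, optimizer, and data loader names are assumptions.
The loss structure follows the description above: supervised cross entropy on the labeled split, then cross entropy
against the previous epoch's predictions on the unlabeled split, weighted by alpha = 1.0.

```python
# Sketch of the pseudo-labeling loop described above (PyTorch).
import copy
import torch
import torch.nn.functional as F

alpha = 1.0
prev_model = copy.deepcopy(model)  # frozen copy used to produce pseudo labels

for epoch in range(num_epochs):
    model.train()
    # 1) supervised pass over the labeled split
    for tokens, labels in labeled_loader:
        optimizer.zero_grad()
        loss = F.cross_entropy(model(tokens), labels)
        loss.backward()
        optimizer.step()

    # 2) unsupervised pass over the unlabeled split
    for tokens in unlabeled_loader:
        with torch.no_grad():
            pseudo = prev_model(tokens).argmax(dim=1)  # previous model's predictions
        optimizer.zero_grad()
        loss = alpha * F.cross_entropy(model(tokens), pseudo)
        loss.backward()
        optimizer.step()

    prev_model = copy.deepcopy(model)  # refresh the pseudo-labeler for the next epoch
```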
We also use Captum integrated gradient analysis to visualize our results.
3.2 Results
3.2.1 TSE Dataset Results
After executing the grid search on our BiLSTM, the best hyperparameters were learning rate 0.001, embedding size 300,
hidden size 256, and dropout within the LSTM cell 0. The two figures below come from Weights and Biases and visualize
the different hyperparameter combinations and the corresponding results.
Figure 1: Hyperparameter Combinations
Figure 2: Validation Loss of 10 Best Combinations
It is hard to see clear patterns in Figure 1 in terms of which combinations of hyperparameters work best. However, we
did notice that combinations with a high learning rate and hidden size performed poorly, which was confirmed by a
second run of grid search. It is worth noting that in our second run of grid search (the figures above are for the
first run), the optimal hyperparameters were slightly different. However, in comparing the two runs, the first run
appeared to produce a slightly better optimal model in terms of fitting the validation data. The results were relatively
consistent across the two grid search runs (shown in appendix).
Based on our reading of relevant sentiment analysis research papers, the four metrics we used to evaluate the
performance of our optimal model on the TSE dataset were accuracy, test loss, average precision score, and macro AUC
score [1, 2]. A description of why we chose these metrics can be found in the appendix.
Table 2: Model comparisons on TSE dataset

Model         Average Test Loss   Test Accuracy   Average Precision Score (Macro)   AUROC Score
Naive Bayes   -                   0.607           -                                 -
BiRNN         0.915               0.629           0.641                             0.778
BiLSTM        0.879               0.666           0.6907                            0.805
Figure 3: Statistical Significance with n=10 for BiLSTM on Testing Data
Our BiLSTM model performed better than the BiRNN model and our Naive Bayes baseline, which is consistent with the
literature [1, 2]. Furthermore, the BiLSTM results were highly consistent when re-trained and re-tested on the TSE
dataset, as seen in the boxplot across 10 runs in Figure 3. The BiRNN slightly outperformed the Naive Bayes, though
their results are very close. As for why the BiLSTM outperforms the BiRNN, we believe our BiRNN suffers slightly from
the vanishing gradient problem, although, since tweets are short, this problem is not as pronounced as in the results
on the IMDB dataset. It is possible that hyperparameter tuning might bring the BiRNN closer to the BiLSTM in
performance.
3.2.2 IMDB Dataset Results
Our results on the IMDB dataset yield 89% and 87% accuracy for the BiLSTM and Naive Bayes respectively.
Table 3: Model comparisons on IMDB dataset

Model         Average Test Loss   Test Accuracy
Naive Bayes   -                   0.866
BiRNN         0.719               0.531
BiLSTM        0.421               0.890
These results met our expectations. We can compare our models' results to a few benchmarks, the first being a
pre-trained BERT model: when passing the IMDB data to the bert-base-multilingual-uncased-sentiment model, it applied
labels with an accuracy of 61%. This shows that domain specificity is important for this problem and that a general
pre-trained model might not be the best choice. Looking through a few benchmarks on paperswithcode.com, we found the
highest accuracy to be in the low 90s, with models more similar to ours performing in the mid 80s. This is a good
indication that our LSTM model performed as expected with its relatively high accuracy of 89%. We believe those few
extra accuracy points can be attributed to our thorough hyperparameter tuning analysis.
As with the TSE dataset, we used Weights and Biases for this analysis. We found the optimal hyperparameters to be
learning rate 0.001, embedding size 400, hidden size 512, and dropout 0. The learning rate had the biggest effect on
the model's performance, with embedding size also having a significant influence. The dropout and hidden size
contributed less but still shifted the model's performance.
We should note that the BiRNN model's performance dropped dramatically when switching from the TSE dataset to IMDB. We
suspect this can be attributed to the vanishing gradient problem. Since the sentence size gets
much larger in the IMDB dataset, the RNN could be having trouble converging. This problem is better addressed by the
LSTM making it a better option for this scenario.
3.2.3 Hand-Labeled Results
We tested our models on the Hand-Labeled data set to assess its generalization into our goal domain: Twitter product
reviews. On the BiLSTM model that was trained on IMDB data, the model had a 0.656 accuracy when predicting labels
on our hand-labeled data. The BiLSTM model trained on the TSE data had a 0.40 accuracy. The higher accuracy of the
IMDB model likely stems from a couple sources.
First of all, the IMDB dataset only has two output classes: positive and negative. Whereas the TSE dataset has three
output classes including neutral. Since the Hand-Labeled dataset is only made up of positive and negative classified
tweets, this appears to have confused the TSE trained model which predicted neutral much of the time. Additionally,
the IMDB dataset is much larger and contains a richer vocabulary. The IMDB trained model appeared to recognize
more of the words and their relationships in the Hand-Labeled dataset, than the TSE trained model did. We will further
elaborate on this point in the analysis section.
3.2.4 (Extra Credit) Pseudo-labeling Results
For our pseudo-labeling we achieved a validation accuracy of 0.6339% in 8 epochs on the TSE dataset, which is lower
than our baseline model. Perhaps adjusting the ratios of iterations between the unsupervised dataset and supervised
datasets would help or adjusting the alpha value to only ramp up at the end. Perhaps pseudo-labeling doesn’t work well
with this dataset. Additional experiments would be needed to make a valid conclusion.
4 Analysis
With how poorly both the TSE trained and IMDB trained models performed on the Hand-Labeled dataset, our target
domain, we were curious what went wrong. Some of the more obvious reasons are the datasets being different, leading
to a lacking vocabulary, and our models’ inability to handle target dependencies. As mentioned earlier, we were not
able to find a labeled dataset made to analyze product reviews making the first issue much harder to deal with. To make
our model adaptable to target dependencies, we would need a more complicated architecture such as those described
in [12, 13, 14, 15]. To gain an even deeper understanding of why the models are struggling, we used an attribution
technique known as integrated gradients using a third-party library called Captum.
Integrated gradients is an axiomatic attribution method which makes use of vector calculus, specifically the line integral,
to determine the impact each feature has on a prediction [16]. More information about this fascinating method can be
analyzed in the cited paper.
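A minimal sketch of how Captum's integrated gradients can be applied to a token-level model such as the BiLSTM above is
shown below; the embedding attribute name, the hypothetical encode_tweet helper, and the all-padding baseline are
assumptions about the setup rather than our exact code.

```python
# Sketch of token-level integrated gradients with Captum.
import torch
from captum.attr import LayerIntegratedGradients

model.eval()
lig = LayerIntegratedGradients(model, model.embedding)  # attribute w.r.t. the embedding layer

token_ids = encode_tweet("screw the reviews i thought wolverine was awesome")  # hypothetical helper
token_ids = torch.tensor([token_ids])
baseline = torch.zeros_like(token_ids)  # all-<pad> reference input (assumption)

pred_label = model(token_ids).argmax(dim=1)
# attributions: one score per token toward the predicted label
attributions = lig.attribute(token_ids, baselines=baseline, target=pred_label)
token_scores = attributions.sum(dim=-1).squeeze(0)  # collapse the embedding dimension
```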
Regarding Captum's visualization tool, the legend colors are somewhat unintuitive. After investigating Captum's source
code, we found that positive (green) means positive attribution to the model's predicted label, whereas negative (red)
means negative attribution to the predicted label. In a sense, green text is what positively influences the model
toward its decision, whereas red text is what pushes against it. Using Captum's integrated gradient visualization tool,
we were able to visualize how our model interprets our hand-labeled sarcasm examples.
Figure 4: Captum visualization on sarcastic hand-picked tweets
In this case, we hypothesize that our model is picking up on sarcasm in this example, but in the wrong way. In both
examples, it seems our model interprets both the positive and negative sections sarcastically, hence "great" and
"amazing" being read as negative. However, to interpret sarcasm, usually only the beginning or the end of the sentence
should be inverted. Notably, our model is bidirectional, so it is plausible that our classifier layer is
not complex enough to construct relations and stronger signals to interpret which direction of sarcasm should be correct,
and rather our model interprets that both directions should be inverted.
Because, in a sense, we are not training on the hand-picked set, it is useful to see how sarcastic examples are
interpreted in the training set, so we hand-picked some sarcastic examples from our training TSE dataset for Captum
visualization.
Figure 5: Captum visualization on sarcastic TSE tweets
Notably, many of the examples we picked for sarcasm became unintelligible after lemmatization, which likely makes
sarcasm harder to interpret. However, there is a somewhat similar example:
"Screw review, I think Wolverine awesome. But enough Dominic Monaghan liking," originally "Screw the reviews, I thought
Wolverine was awesome. But not enough Dominic Monaghan for my liking."
Once our model saw "Screw review, I think," it mistakenly interpreted "awesome" negatively, which implies our model is
incorrectly handling sarcasm. Our model has a negative attribution score here, which means it did not want to predict
positive, but the other options were worse, so it was forced to. We also note that during our preprocessing, "but not
enough" became "but enough," which have drastically different meanings.
Another one to note is:
"I wan na single rest life," originally "I don't wanna be single for the rest of my life."
In this case, our preprocessing removed key parts of the sentence, including the negation. Otherwise, note that our
model interprets "life" as highly negative (positive attribution for a prediction of 0), which potentially shows it
capturing temporal relations, while "na single" was interpreted as positive (negative attribution for a prediction of
0). This seems to be a general trend: the model does not interpret sentences the way a human would but instead finds
keywords (that do not always make sense) with temporal relations that swing its decision.
5 Challenges
We faced a few challenges while training our models. The computational expense of running the LSTM multiple times for
hyperparameter tuning was high enough that several of our accounts were kicked off Google Colab. We solved this problem
by subscribing to a Google Colab Pro account. In the future we will be more careful with how we use our resources and
try to avoid leaving code running overnight. Computational resources are a common issue when training larger models, so
we expected problems like this to arise.
6 Conclusion and Future work
We have performed experiments with bidirectional LSTMs on Twitter and review data and have shown that, as expected,
bidirectional LSTMs perform better than our RNN and Naive bayes baselines. We have briefly performed experiments
on pseudo-labeling but obtained suboptimal results on our Twitter (TSE) dataset. The carry over from our models
learning on the IMDB and TSE dataset proved suboptimal for Twitter product reviews. To improve our results toward
this domain, we would need a more complex training setup, i.e. better datasets, a different semi-supervised learning
setup, and target dependent capable models.
References
[1] S. Tripathi, R. Mehrotra, V. Bansal, and S. Upadhyay, “Analyzing sentiment using imdb dataset,” in 2020 12th
International Conference on Computational Intelligence and Communication Networks (CICN), 2020, pp. 30–33.
[2] C. Zhang and L. Liu, “Research on semantic sentiment analysis based on bilstm,” in 2021 4th International
Conference on Artificial Intelligence and Big Data (ICAIBD), 2021, pp. 377–381.
[3] B. Chen, J. Jiang, X. Wang, J. Wang, and M. Long, “Debiased pseudo labeling in self-training,” 2022.
[4] Q. Xie, Z. Dai, E. Hovy, M.-T. Luong, and Q. V. Le, “Unsupervised data augmentation for consistency training,”
2019.
[5] C. Wei, K. Shen, Y. Chen, and T. Ma, “Theoretical analysis of self-training with deep networks on
unlabeled data,” in International Conference on Learning Representations, 2021. [Online]. Available:
https://openreview.net/forum?id=rC8sJ4i6kaH
[6] K. Sohn, D. Berthelot, C.-L. Li, Z. Zhang, N. Carlini, E. D. Cubuk, A. Kurakin, H. Zhang, and C. Raffel, “Fixmatch:
Simplifying semi-supervised learning with consistency and confidence,” arXiv preprint arXiv:2001.07685, 2020.
[7] D. Zhou, O. Bousquet, T. Lal, J. Weston, and B. Schölkopf, “Learning with local and global
consistency,” in Advances in Neural Information Processing Systems, S. Thrun, L. Saul, and B. Schölkopf,
Eds., vol. 16. MIT Press, 2003. [Online]. Available: https://proceedings.neurips.cc/paper/2003/file/
87682805257e619d49b8e0dfdc14affa-Paper.pdf
[8] L. N., “Imdb dataset of 50k movie reviews,” 2019. [Online]. Available: https://www.kaggle.com/datasets/
lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
[9] “Tweet sentiment extraction.” [Online]. Available: https://www.kaggle.com/competitions/
tweet-sentiment-extraction/data
[10] B. Densil, “Twitter product sentiment analysis,” 2020. [Online]. Available: https://www.kaggle.com/datasets/
blessondensil294/twitter-product-sentiment-analysis
[11] P. Singh, “Fundamentals of bag of words and tf-idf,” Feb 2020. [Online]. Available: https:
//medium.com/analytics-vidhya/fundamentals-of-bag-of-words-and-tf-idf-9846d301ff22
[12] C. Sun, L. Huang, and X. Qiu, “Utilizing bert for aspect-based sentiment analysis via constructing auxiliary
sentence,” Mar 2019. [Online]. Available: https://arxiv.org/abs/1903.09588v1
[13] Y. Jo and A. H. Oh, "Aspect and sentiment unification model for online review analysis," in Proceedings of
the Fourth ACM International Conference on Web Search and Data Mining, Feb 2011. [Online]. Available:
https://dl.acm.org/doi/10.1145/1935826.1935932
[14] D. Tang, B. Qin, X. Feng, and T. Liu, “Effective lstms for target-dependent sentiment classification,” 2015.
[Online]. Available: https://arxiv.org/abs/1512.01100
[15] Z. Gao, A. Feng, X. Song, and X. Wu, “Target-dependent sentiment classification with bert,” IEEE Access, vol. 7,
pp. 154 290–154 299, 2019.
[16] M. Sundararajan, A. Taly, and Q. Yan, “Axiomatic attribution for deep networks,” 2017. [Online]. Available:
https://arxiv.org/abs/1703.01365
[17] G. Kour and R. Saabne, “Real-time segmentation of on-line handwritten arabic script,” in Frontiers in Handwriting
Recognition (ICFHR), 2014 14th International Conference on. IEEE, 2014, pp. 417–422.
[18] ——, “Fast classification of handwritten on-line arabic characters,” in Soft Computing and Pattern Recognition
(SoCPaR), 2014 6th International Conference of. IEEE, 2014, pp. 312–318.
[19] G. Hadash, E. Kermany, B. Carmeli, O. Lavi, G. Kour, and A. Jacovi, “Estimate and replace: A novel approach to
integrating deep neural networks with existing applications,” arXiv preprint arXiv:1804.09028, 2018.
[20] B. Pang, L. Lee, and S. Vaithyanathan, “Thumbs up? sentiment classification using machine learning techniques,”
2002.
[21] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, http://www.deeplearningbook.org.
[22] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss,
V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn:
Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
[23] S. Liao, J. Wang, R. Yu, K. Sato, and Z. Cheng, “Cnn for situations understanding based on
sentiment analysis of twitter data,” Procedia Computer Science, vol. 111, pp. 376–381, 2017, the
8th International Conference on Advances in Information Technology. [Online]. Available: https:
//www.sciencedirect.com/science/article/pii/S1877050917312103
[24] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for
language understanding,” 2019.
[25] G. Sidorov, F. Velasquez, E. Stamatatos, A. Gelbukh, and L. Chanona-Hernández, “Syntactic n-grams as machine
learning features for natural language processing,” Expert Systems with Applications, vol. 41, no. 3, p. 853–860,
Feb 2014. [Online]. Available: https://linkinghub.elsevier.com/retrieve/pii/S0957417413006271
[26] C. Manning, P. Raghavan, and H. Schütze, “Stemming and lemmatization.” [Online]. Available:
https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
[27] C. Khanna, “Text pre-processing: Stop words removal using differ-
ent libraries,” Feb 2021. [Online]. Available: https://towardsdatascience.com/
text-pre-processing-stop-words-removal-using-different-libraries-f20bac19929a
[28] M. Monette, “Anatomy of a tweet | twitter for business,” Jul 2014. [Online]. Available: https:
//openmoves.com/blog/anatomy-of-a-tweet/
[29] R. Liu, Y. Shi, C. Ji, and M. Jia, “A survey of sentiment analysis based on transfer learning,” IEEE Access, vol. 7,
pp. 85 401–85 412, 2019.
[30] R. Liu, Y. SHI, C. JI, and M. JIA, “A survey of sentiment analysis based on transfer learning,” IEEE Access,
vol. PP, pp. 1–1, 06 2019.
[31] “Semeval-2014 task 4.” [Online]. Available: https://alt.qcri.org/semeval2014/task4//
[32] KazAnova, “Sentiment140 dataset,” Sep 2017. [Online]. Available: https://www.kaggle.com/datasets/kazanova/
sentiment140
[33] Passionate-Nlp, “Twitter sentiment analysis,” Aug 2021. [Online]. Available: https://www.kaggle.com/datasets/
jp797498e/twitter-entity-sentiment-analysis
A Data
The first component of a tweet is the profile photo. According to twitter, a profile photo should be 400x400 pixels, less
than 2MB, and either a JPG, PNG or GIF. This photo can give insight to whether the account belongs to an individual
user or an organization. The photo can be analyzed using facial recognition to determine if a person is present in the
profile photo. We can also try to detect logos and text within the profile photo. Combining these two attributes can
be useful in classifying a twitter account as belonging to an individual or an organization. For the use of this project,
tweets from individuals are more valuable as they are less likely to have biases towards certain products.
The next component of a tweet is the account name and the @username. The account name is the portion of the username
that can be changed, while the @username cannot be changed after an account is made. This information may seem useless,
but some analysis can be performed on it. These usernames can be compared against common name banks to determine
whether they contain a person's name; accounts with a common name in their username more often belong to an individual.
The name can also be used to predict a user's gender, age, and ethnicity.
Every tweet has a timestamp of when it was posted. This information can be used to select tweets from a certain time
period or to look for trends over time. If we are looking for product reviews of a product that was released only a month
ago, it will be a good idea to remove any tweets from our search that are older than one month. For a product that has
been out for much longer we might be interested in how people’s perception of the product has shifted since its release.
We can use the timestamp to analyze this shift and detect if sentiments are becoming more positive or negative over
time. This might help us catch products that were over-promised before their release as their sentiments will be very
high before the release date and drop soon after.
The body of the tweet can contain both text and images. The text of a tweet should not be blindly treated as an
ordinary block of text, as tweets often contain hashtags, @mentions, links, emojis, and emoticons. A hashtag is used to
link the tweet to a certain topic; this can be useful in determining the subject of a tweet or linking it to others
with similar subjects. An @mention shows that the tweet has a relationship with the tagged account, and people often
include an @mention of the company or individuals responsible for making the product they are tweeting about. Links
attach the tweet to a website; these links could potentially be analyzed, but otherwise they should be removed from the
model [28].
We looked into how other Twitter researchers have handled emojis and emoticons when performing sentiment analysis and
did not find much research. We plan to incorporate emojis and emoticons into our model as they convey a lot of
information about the sentiment of a tweet.
A.1 Experimental Set-Up
A.2 Bag of Words
This method is the simplest way to vectorize a body of text into an array of frequencies. This is accomplished by
simply taking the number of times that a word occurs in the text [11]. An example sentence, “I walked my dog to my
neighborhood’s dog park,” would be encoded:
I 1
Walked 1
My 2
Dog 2
To 1
Neighborhood’s 1
Park 1
A.3 TF-IDF
Term Frequency-Inverse Document Frequency (TF-IDF) is an expansion of the Bag of Words vectorization technique that
attempts to normalize a word's frequency against the number of words in the document and the number of times the word
occurs in other documents in the corpus [11]. In the case of tweets, the TF-IDF of a given word would be calculated:
$$\text{TF-IDF} = \frac{\text{times the word appears in the tweet}}{\text{number of words in the tweet}} \cdot \log\left(\frac{\text{number of documents in the corpus}}{\text{number of documents in the corpus that contain the word}}\right)$$
Where the corpus is a collection of random tweets.
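As a small worked example with made-up numbers (purely illustrative, using the natural logarithm): for a word that
appears twice in a 10-word tweet, in a corpus of 1,000 tweets of which 50 contain the word,

$$\text{TF-IDF} = \frac{2}{10}\cdot\log\left(\frac{1000}{50}\right) = 0.2\cdot\log(20) \approx 0.2 \cdot 3.0 \approx 0.6.$$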
A.4 N-Gram
One issue with Bag of Words and TF-IDF vectorization is the ignorance of word order. These methods only take into
account the number of times a word occurs. N-Grams attempts to fix this issue. An N-Gram is simply the process of
taking n words and combining them into one unit [25]. For example, the sentence, “I am not happy with this product,”
could be divided into the 2-Grams:
“I am”, “am not”, “not happy”, “happy with”, “with this”, and “this product.”
Combining the words into 2-Grams allows us to analyze the phrase, “not happy” which is clearly negative. This negation
of the word “happy” would not have been caught without looking at 2-Grams. This method can also be used to analyze
3-Grams, 4-Grams and so on.
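A tiny sketch of 2-gram construction, assuming simple whitespace tokenization:

```python
# Build n-grams from a sentence (whitespace tokenization is a simplifying assumption).
def ngrams(text: str, n: int = 2) -> list[str]:
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams("I am not happy with this product"))
# ['i am', 'am not', 'not happy', 'happy with', 'with this', 'this product']
```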
A.5 Lemmatization
To simplify the feature space further we can use a method called Lemmatization. This is the process of condensing a
word down to its root. The root conversion of a word is known as its lemma. This process often involves turning plural
words singular or removing suffixes with no meaning [26]. An example would be simplifying the word “running” into
“run.”
A.6 English Stop Word Removal
A simple technique to reduce the feature space is to remove words that carry little to no meaning. This would include
words similar to: “the”, “an”, “it”, and “to.” This process is known as English stop word removal and is often used in
search engines to simplify user inputs to their core meaning [27].
A.7 Word Embedding
Some other Vectorization packages that are commonly used include Word2vec, Glove Embedding, and Fastext.
B Model Approaches
B.1 Rule-Based
Rule-based learning revolves around the use of a sentiment lexicon, which is basically just a vocabulary or dictionary of
words where the words have some sort of rating based on averaging scores given by language experts. The ratings can
either be discrete, otherwise known as polarity based lexicons, or continuous, known as valence based lexicons. Using
these rules, sentences can be analyzed and given discrete or continuous scores. The most common metric is whether a
sentence is positive, negative, or neutral in tone. VADER, created in 2014, is a popular rule based model that works well
on social media content. VADER uses a valence based lexicon and can be used to compute how positive or negative a
sentence on a scale. This scale is determined by adding the scores of individual words with modifications based on the
structure of the sentence before normalizing the score to be from -1 to 1.
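For illustration, scoring a tweet with VADER through the vaderSentiment package looks roughly like the sketch below;
the example tweet is made up.

```python
# Sketch of scoring a tweet with VADER's valence-based lexicon.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
scores = analyzer.polarity_scores("This phone is great, but the battery life is awful.")
# scores is a dict with 'neg', 'neu', 'pos' proportions and a normalized
# 'compound' score in [-1, 1]
print(scores)
```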
B.2 Machine Learning
B.2.1 Naive Bayes
Given a class vector $y$ and feature vectors $x_1, x_2, \dots, x_n$, traditional (binary) classification with Naive
Bayes works by calculating the following [22]:

$$P(y \mid x_1, \dots, x_n) = \frac{P(y)\, P(x_1, x_2, \dots, x_n \mid y)}{P(x_1, x_2, \dots, x_n)} \qquad (1)$$

Using the naive conditional independence assumption:

$$P(x_i \mid y, x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_n) = P(x_i \mid y) \qquad (2)$$
Then, Naive Bayes assigns the label with the highest posterior probability.
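Written out in the same notation, this is the standard decision rule:

$$\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)$$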
B.2.2 Random Forest
Random Forest Classifiers use a series of uncorrelated decision trees to protect the main decision from errors that some
of the trees may make.
B.2.3 SVM
The SVM is a classification model which can analyze features in higher dimensions using a kernel trick before finding
the optimal decision boundary or hyperplane for those features in the higher dimension.
B.3 Deep Learning
B.3.1 CNN
CNN is a type of neural network which makes use of a convolutional layer. This layer applies a convolution filter to the
data which enables detection of important features. The CNN is most known for its usage in computer vision, however,
can be applied to NLP problems.
B.3.2 RNN
The ordering of the words has a huge impact on a tweets meaning. Thus, a model that can process a tweet sequentially
would make sense. The RNN is a type of deep learning model that takes the input in a sequential manner and, thus,
can take into account word dependencies and relationships. Since its initial conception, the RNN model has evolved
substantially. Several of the more important evolutions are the bidirectional RNN, the long short term memory (LSTM),
gated recurrent unit (GRU), and the transformer architecture.
The bidirectional RNN allows the network to learn word dependencies and relationships both before and after a given
input word in a sentence. This is an extremely useful addition to the original RNN, as the context of a given word
might require knowing the words in front of and behind it. Both the LSTM and GRU architectures help reduce the
vanishing gradient problem, allowing long-term word dependencies to be learned more easily in a tweet; because tweets
are short, however, this might not be as much of a problem as in NLP tasks with larger inputs. Lastly, the transformer
architecture speeds up training by allowing parallelization, addressing a flaw in prior versions.
C Hyperparameter Tuning
Above we have some results on hyperparameter tuning using Weights and Biases, or WandB, on our model using the
Competition dataset. The sweep on the left was our first run of WandB while the one on the right is our second run, in
case the true optimal parameters ended up with a faulty result. All in all, since both sweeps seemed to report similar
results, we believe that the optimal parameters were found during the earlier sweep already.
This sweep above was for our model on the IMDB dataset.
D Team Leads
1. Research Lead - Andrew Chan
2. Data Scraping Lead - Devin Sohi
3. Model Lead - Nick Frankenberg
4. Data preprocessing lead - Justin Cheng
E Task assignment / Schedule
F Code availability
Code is available here: https://drive.google.com/drive/folders/1Qm1i_JuAcGrD2OvgUAq9vF_AQG5t1zzV?usp=sharing
 
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAmazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
Call Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort ServiceCall Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort Service
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 

SentimentAnalysisofTwitterProductReviewsDocument.pdf

The IMDB dataset consists of about 50,000 movie reviews, 25,000 of which are labeled positive and the rest negative. It has been benchmarked on PapersWithCode and used in numerous research papers. We believe this dataset will help us build a good review-based sentiment analysis model that can handle a wide variety of words and opinions. The cons of using it to train a model for Twitter product reviews are that it is polarized, does not contain tweets, and includes reviews of up to 3,000 words. It is polarized because it only contains reviews rated 0-4 or 7-10 stars out of 10, which might bias our model toward more extreme stances on product reviews.

The Twitter Sentiment Extraction (TSE) dataset is a competition dataset on Kaggle meant for standard Twitter sentiment analysis [9]. It consists of 27,000 tweets labeled negative, neutral, or positive. This dataset will help focus our model on how to interpret tweets, which do not always carry the same tone as ordinary sentences, so it is important for the model to understand the context of the reviews. The pros of this dataset are that it is Twitter-based and its short tweets allow for quick training times. Its small size is a con, however: sentiment analysis often requires many training examples, so our model will likely underfit.

Neither of the above datasets is explicitly meant for the domain of analyzing Twitter product reviews. The IMDB dataset targets movie reviews, while the TSE dataset targets tweet sentiment regardless of context, so a model trained on them might miss some of the intricacies particular to Twitter product reviews. It would therefore be prudent to add Twitter product review training examples. Unfortunately, we were not able to find a labeled dataset of Twitter product reviews. There is an unlabeled one on Kaggle, called Twitter Product Sentiment Analysis (TPSA), focused on a subset of product reviews including Apple products [10]. The semi-supervised technique of pseudo-labeling would allow us to leverage this data to give our models more context on how to understand Twitter product reviews. We plan on attempting this technique for extra credit and analyzing whether it yields any gain in performance.

The final set of data we are using is a tiny set of hand-picked tweets that will be used to test the models trained on the two datasets above. This dataset will be referred to as the "Hand-Labeled" dataset from this point on [9, 10]. It is curated to contain many cases that are tricky for the models to handle, such as target-dependent or sarcastic tweets.
Most of the tweets are related to reviews of Apple products. All of the tweets are either positive or negative, and the label is taken from the point of view of the Apple product, i.e., it is target dependent. For example, "Google>Apple" would be classified as negative from the point of view of the target Apple. Because the set contains sarcastic and target-dependent tweets, we expect comparatively low accuracy on this dataset.

It is worth mentioning that the IMDB dataset has two output classes while the TSE dataset has three. For the models trained on each of these datasets, mapping to a particular rating system, e.g. a 5-star scale, would require scaling the output values, which is outside the scope of our project. Before we can use the data from these datasets to train our models, we need to preprocess it.

2.2 Preprocessing

In order to train our models on the above datasets, we must first preprocess the data. The different models we have chosen to compare use different preprocessing techniques to fit their needs.

Our baseline Naive Bayes model uses a simple Term Frequency - Inverse Document Frequency (TF-IDF) representation [11]. This method vectorizes the frequency of words within a tweet and normalizes each term by its occurrence throughout the dataset as a whole; a more thorough explanation of TF-IDF is provided in the appendix. An advantage of this method is that it is very quick to run and yields a decent vectorization that Naive Bayes can exploit. Its disadvantages are the complete loss of word ordering, the inability to handle words not seen during training, and a high input dimension that would not be appropriate for more sophisticated models.
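As a minimal illustration (not the exact code used in this project), a TF-IDF + Naive Bayes baseline of this kind can be assembled in a few lines of scikit-learn; the example texts and labels below are placeholders rather than data from our datasets:

    # Sketch of a TF-IDF + Naive Bayes baseline with scikit-learn.
    # `texts` and `labels` are illustrative placeholders.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    texts = ["the battery life is great",
             "screen cracked after a week",
             "best purchase i have made this year",
             "the charger stopped working",
             "camera quality is amazing",
             "support was slow and unhelpful"]
    labels = ["positive", "negative", "positive", "negative", "positive", "negative"]

    model = make_pipeline(TfidfVectorizer(), MultinomialNB())
    model.fit(texts, labels)

    # predict the sentiment of an unseen tweet
    print(model.predict(["the camera is great but the battery is terrible"]))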
Despite this method's simplicity, Naive Bayes is able to extract enough information from the vectorization to achieve adequate results.

Our neural network based models (BiRNN and BiLSTM) take a different approach to preprocessing. Before the data is handed to the model, a few techniques are applied. English stop word removal is the first: it removes words that provide little to no meaning in a sentence, such as "the", "it", "to", and "a". The next step is lemmatization, the process of reducing a word to its root, for example turning "going" into "go". These methods are used together to reduce the total dictionary size and to help achieve a more meaningful encoding. Once these steps are complete, the resulting words are translated into integers representing their index in the dictionary that has been built, and the neural networks take these integers as input. Because of the recurrent nature of these models, word order is preserved throughout training. As the model trains, it learns an encoding of the words in its embedding layer while trying to reduce the loss. This method has proven effective and encodes words meaningfully, in a way similar to an autoencoder. Given our results and research, no external word embeddings are necessary.

2.3 Model Selection

For our comparisons we selected Naive Bayes, a BiRNN, and a BiLSTM. While looking for a good baseline model, we considered a variety of options, including Naive Bayes and SVMs. We decided on Naive Bayes because many other projects have used it with great success [1]. In addition, a Naive Bayes model was the most accurate model for the IMDB dataset on Kaggle, which provides additional motivation for our choice. Briefly, Naive Bayes takes the probabilities of a classification given different features and labels and returns the most likely classification given the features. Essentially, it predicts a class based on the frequency of features for a given label, which provides a context-free way to analyze the features and make predictions.

The deep learning baseline model we have chosen is a bidirectional recurrent neural network (BiRNN). We opted for a bidirectional RNN rather than a single-direction RNN because a single-direction RNN can miss the context of a word, since it only reads the sequence in one direction. A BiRNN addresses this by also feeding the sequence of words in reverse order when generating the weights for each word, which provides more context for words that are spelled the same but have different meanings.

Finally, we chose the BiLSTM as our cutting-edge solution. Like the BiRNN, the BiLSTM reads the sequence of words in both directions to reduce the context problem. As an improvement over the RNN, the LSTM cell uses a series of gates to decide what information to keep and what to forget; ideally, these gates produce a network that is better at predicting the correct classes for our datasets. Since the BiLSTM is essentially an improved form of the BiRNN, using the BiRNN as the deep learning baseline provides a cleaner comparison with our chosen state-of-the-art model.
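Our model code is not reproduced in this report; the following is a rough sketch of what a bidirectional LSTM classifier of this kind might look like in PyTorch. The class and variable names are our own, and the default sizes merely echo the hyperparameters discussed later (the IMDB experiments would use two output classes instead of three):

    # Illustrative BiLSTM sentiment classifier; a sketch, not the exact architecture.
    import torch
    import torch.nn as nn

    class BiLSTMClassifier(nn.Module):
        def __init__(self, vocab_size, embed_dim=300, hidden_dim=256,
                     num_layers=2, num_classes=3, dropout=0.2):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
            self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=num_layers,
                                batch_first=True, bidirectional=True,
                                dropout=dropout if num_layers > 1 else 0.0)
            self.classifier = nn.Linear(2 * hidden_dim, num_classes)

        def forward(self, token_ids):
            # token_ids: (batch, seq_len) integer word indices from the dictionary
            embedded = self.embedding(token_ids)
            _, (hidden, _) = self.lstm(embedded)
            # concatenate the final forward and backward hidden states of the top layer
            last = torch.cat([hidden[-2], hidden[-1]], dim=1)
            return self.classifier(last)

    model = BiLSTMClassifier(vocab_size=20000)
    logits = model(torch.randint(1, 20000, (32, 40)))  # e.g. a batch of 32 tweets, 40 tokens each
    print(logits.shape)  # torch.Size([32, 3])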
3 Experiments

We ran a few experiments on our models in order to compare their performance on different data. While we ran into some problems because of a lack of GPU access, we were able to use Weights and Biases to handle our hyperparameter grid search and find the optimal parameters for each of our models. The setup of the grid search, as well as the other experiments we ran, is explained below.

3.1 Experimental Set-up

Because it is quick to train on, we experimented on the TSE dataset first. A 64/16/20% train/validation/test split was applied and given to each of our three chosen models: Naive Bayes, BiRNN, and BiLSTM. Since they are baseline models, we decided not to hyperparameter-tune the Naive Bayes and BiRNN, keeping their architectures constant; the BiRNN was instead run with the optimal hyperparameters found for the BiLSTM. The BiRNN and BiLSTM models were both trained with early stopping on the number of epochs, the categorical cross-entropy loss, and the Adam optimizer. A batch size of 32 was used for both models, with two hidden cells for the BiLSTM and only one for the BiRNN.
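A minimal sketch of this training setup (cross-entropy loss, Adam, early stopping on validation loss) is shown below; the data loaders, patience value, and default learning rate are illustrative assumptions rather than the project's exact configuration:

    # Sketch of a supervised training loop with Adam, cross-entropy and
    # patience-based early stopping on validation loss.
    import copy
    import torch
    import torch.nn as nn

    def train(model, train_loader, val_loader, max_epochs=20, patience=2, lr=1e-3):
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        criterion = nn.CrossEntropyLoss()
        best_loss, best_state, bad_epochs = float("inf"), None, 0
        for epoch in range(max_epochs):
            model.train()
            for tokens, labels in train_loader:
                optimizer.zero_grad()
                loss = criterion(model(tokens), labels)
                loss.backward()
                optimizer.step()
            # validation pass used for early stopping
            model.eval()
            val_loss, n = 0.0, 0
            with torch.no_grad():
                for tokens, labels in val_loader:
                    val_loss += criterion(model(tokens), labels).item() * len(labels)
                    n += len(labels)
            val_loss /= n
            if val_loss < best_loss:
                best_loss, best_state, bad_epochs = val_loss, copy.deepcopy(model.state_dict()), 0
            else:
                bad_epochs += 1
                if bad_epochs >= patience:
                    break
        model.load_state_dict(best_state)
        return model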
For our BiLSTM, we ran a full grid search over the hyperparameters described below using Weights and Biases. The four hyperparameters we decided to focus on are the learning rate, the embedding size, the hidden size, and the dropout rate within the LSTM cell. We chose these four because they are easy to change with the PyTorch API. The learning rate makes a big difference in how quickly the model converges to a local minimum and in how close that local minimum is to a global one, so trying different values can lead to drastic differences in performance. The embedding size determines the size of the embedding layer and the dimension in which a word is represented; increasing it allows more nuances between words to be captured, at the possible expense of over-fitting what a word means, which can cause context problems. The hidden size determines how much memory can be held within an LSTM cell, an important parameter of the model. Lastly, dropout within the LSTM cell can prevent potential overfitting and is common practice, although overfitting is unlikely to be a problem with smaller datasets. The table below shows the hyperparameter combinations we attempted.

Table 1: BiLSTM hyperparameter search space
  Epochs used:                 4
  Learning rate:               0.0001, 0.001, 0.01
  Embedding size:              300, 400, 500
  Hidden size:                 128, 256, 512
  Dropout within LSTM cell:    0.2, 0.4, 0.6

The second dataset we experimented on was the IMDB dataset. A 40/10/50% train/validation/test split was applied and given to each of our three chosen models: Naive Bayes, BiRNN, and BiLSTM. We chose these percentages because the common benchmarks in the literature use 50% of the data for testing. The BiRNN and BiLSTM models were again trained with early stopping, the categorical cross-entropy loss, and the Adam optimizer; a batch size of 32 and two hidden cells were used for both models. Using the hyperparameter combinations shown in Table 1, we ran a random search with Weights and Biases. We chose random search rather than grid search because the much longer training time made trying all combinations infeasible. As with the TSE experiments, hyperparameter tuning was only done for the BiLSTM, and the BiRNN was run with the optimal parameters found for the BiLSTM. Once the hyperparameter searches were done, we evaluated accuracy by training new models with the same hyperparameters.

As extra credit, we plan on applying pseudo-labeling to the BiLSTM with its optimal hyperparameters on the TSE dataset. The training setup is the same as our standard supervised setup, with an additional unsupervised component. We split the data into 64% labeled data, 20% unlabeled data, and 16% validation. In every epoch we train over the labeled dataset and then the unlabeled dataset, with the unsupervised loss being the cross entropy between the previous model's predictions and the current model's output, weighted by an alpha value of 1.0. We also use Captum's integrated gradients analysis to visualize our results.
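A rough sketch of one epoch of this pseudo-labeling scheme is shown below; the loader names are assumptions, and the fixed alpha mirrors the value described above:

    # Sketch: one epoch of pseudo-labeling. `labeled_loader` yields (tokens, labels);
    # `unlabeled_loader` is assumed to yield batches of token ids only.
    import copy
    import torch
    import torch.nn as nn

    def pseudo_label_epoch(model, labeled_loader, unlabeled_loader, optimizer, alpha=1.0):
        criterion = nn.CrossEntropyLoss()
        # freeze a copy of the previous model to generate pseudo labels
        previous = copy.deepcopy(model).eval()
        model.train()
        for tokens, labels in labeled_loader:            # supervised pass
            optimizer.zero_grad()
            criterion(model(tokens), labels).backward()
            optimizer.step()
        for tokens in unlabeled_loader:                  # unsupervised pass
            with torch.no_grad():
                pseudo = previous(tokens).argmax(dim=1)  # previous model's predictions
            optimizer.zero_grad()
            (alpha * criterion(model(tokens), pseudo)).backward()
            optimizer.step()
        return model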
3.2 Results

3.2.1 TSE Dataset Results

After running the grid search on our BiLSTM, the best hyperparameters were a learning rate of 0.001, an embedding size of 300, a hidden size of 256, and a dropout of 0 within the LSTM cell. The two figures below come from Weights and Biases and visualize the different hyperparameter combinations and their corresponding results.
Figure 1: Hyperparameter Combinations

Figure 2: Validation Loss of 10 Best Combinations

It is hard to see noticeable patterns in Figure 1 in terms of which combinations of hyperparameters work best. However, we did notice that combinations using a high learning rate and a high hidden size performed poorly, which was confirmed with a second run of the grid search. It is worth noting that in the second run (the figures above are for the first run) the optimal hyperparameters were slightly different; comparing the two runs, the first run appeared to produce a slightly better optimal model in terms of fitting the validation data, and the results were relatively consistent across the two grid search runs (shown in the appendix).

Based on our readings of relevant sentiment analysis research papers, the four metrics we used to evaluate the performance of our optimal model on the TSE dataset were accuracy, test loss, average precision score, and macro AUC score [1, 2]. A description of why we chose these metrics can be found in the appendix.
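For reference, these metrics can be computed with scikit-learn roughly as follows; the toy labels and probabilities here are placeholders, not our results:

    # Sketch: accuracy, macro average precision and macro AUROC for a 3-class
    # problem, computed from predicted class probabilities (toy values).
    import numpy as np
    from sklearn.metrics import accuracy_score, average_precision_score, roc_auc_score
    from sklearn.preprocessing import label_binarize

    y_true = np.array([0, 2, 1, 2, 0, 1])        # gold labels
    probs = np.array([[0.7, 0.2, 0.1],
                      [0.1, 0.3, 0.6],
                      [0.2, 0.5, 0.3],
                      [0.1, 0.2, 0.7],
                      [0.6, 0.3, 0.1],
                      [0.3, 0.4, 0.3]])          # model output probabilities

    y_pred = probs.argmax(axis=1)
    y_onehot = label_binarize(y_true, classes=[0, 1, 2])

    print("accuracy:", accuracy_score(y_true, y_pred))
    print("avg precision (macro):", average_precision_score(y_onehot, probs, average="macro"))
    print("AUROC (macro):", roc_auc_score(y_true, probs, multi_class="ovr", average="macro"))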
Table 2: Model comparisons on the TSE dataset

  Model         Average Test Loss   Test Accuracy   Average Precision Score (Macro)   AUROC Score
  Naive Bayes   -                   0.607           -                                 -
  BiRNN         0.915               0.629           0.641                             0.778
  BiLSTM        0.879               0.666           0.6907                            0.805

Figure 3: Statistical Significance with n=10 for BiLSTM on Testing Data

Our BiLSTM model performed better than the BiRNN model and the Naive Bayes baseline, which is consistent with the literature [1, 2]. Furthermore, the BiLSTM results were highly consistent when re-trained and re-tested on the TSE dataset, as seen in the boxplot in Figure 3 across 10 runs. The BiRNN slightly outperformed Naive Bayes, although their results are very close. As for why the BiLSTM outperforms the BiRNN, we believe the BiRNN suffers slightly from the vanishing gradient problem; since tweets are short, however, the effect is not as pronounced as on the IMDB dataset. It is possible that hyperparameter tuning would bring the BiRNN closer to the BiLSTM in performance.

3.2.2 IMDB Dataset Results

Our results on the IMDB dataset yield 89% and 87% accuracy for the BiLSTM and Naive Bayes respectively.

Table 3: Model comparisons on the IMDB dataset

  Model         Average Test Loss   Test Accuracy
  Naive Bayes   -                   0.866
  BiRNN         0.719               0.531
  BiLSTM        0.421               0.890

These results met our expectations, and we can compare them to a few benchmarks. The first is a pre-trained BERT model: when passing the IMDB data to the bert-base-multilingual-uncased-sentiment model, it labeled the reviews with an accuracy of 61%, which shows that domain specificity matters for this problem and that a general pre-trained model might not be the best choice. Looking through benchmarks on paperswithcode.com, we found the highest reported accuracy to be in the low 90s, with models more similar to ours performing in the mid 80s. This is a good indication that our LSTM model performed as expected with its relatively high accuracy of 89%. We believe those few extra accuracy points can be attributed to our thorough hyperparameter tuning, which, as with the TSE dataset, was done with Weights and Biases. The optimal hyperparameters were a learning rate of 0.001, an embedding size of 400, a hidden size of 512, and a dropout of 0. The learning rate had the biggest effect on the model's performance, with the embedding size also having a significant influence; the dropout and hidden size contributed less but still shifted the model's performance.

We should note that the BiRNN's performance dropped dramatically when switching from the TSE dataset to IMDB. We suspect this can be attributed to the vanishing gradient problem.
Since the sentence size gets much larger in the IMDB dataset, the RNN could be having trouble converging; this problem is better addressed by the LSTM, making it the better option in this scenario.

3.2.3 Hand-Labeled Results

We tested our models on the Hand-Labeled dataset to assess how well they generalize to our goal domain, Twitter product reviews. The BiLSTM model trained on the IMDB data had an accuracy of 0.656 when predicting labels on the hand-labeled data, while the BiLSTM trained on the TSE data had an accuracy of 0.40. The higher accuracy of the IMDB-trained model likely stems from a couple of sources. First, the IMDB dataset only has two output classes, positive and negative, whereas the TSE dataset has three, including neutral. Since the Hand-Labeled dataset is made up of only positive and negative tweets, this appears to have confused the TSE-trained model, which predicted neutral much of the time. Additionally, the IMDB dataset is much larger and contains a richer vocabulary, so the IMDB-trained model appeared to recognize more of the words and their relationships in the Hand-Labeled dataset than the TSE-trained model did. We elaborate further on this point in the analysis section.

3.2.4 (Extra Credit) Pseudo-labeling Results

For our pseudo-labeling we achieved a validation accuracy of 0.6339 in 8 epochs on the TSE dataset, which is lower than our baseline model. Perhaps adjusting the ratio of iterations between the unsupervised and supervised datasets would help, or ramping up the alpha value only toward the end of training; perhaps pseudo-labeling simply does not work well with this dataset. Additional experiments would be needed to draw a valid conclusion.

4 Analysis

Given how poorly both the TSE-trained and IMDB-trained models performed on the Hand-Labeled dataset, our target domain, we were curious what went wrong. Some of the more obvious reasons are the mismatch between the datasets, leading to a lacking vocabulary, and our models' inability to handle target dependencies. As mentioned earlier, we were not able to find a labeled dataset built for analyzing product reviews, which makes the first issue much harder to deal with. To make our model handle target dependencies, we would need a more complicated architecture such as those described in [12, 13, 14, 15].

To gain a deeper understanding of why the models struggle, we used an attribution technique known as integrated gradients, via a third-party library called Captum. Integrated gradients is an axiomatic attribution method that uses vector calculus, specifically the line integral, to determine the impact each input feature has on a prediction [16]; more information about this method can be found in the cited paper. Regarding Captum's visualization tool, the legend colors are somewhat unintuitive: after investigating Captum's source code, we found that "positive" means positive (green) attribution toward the model's predicted label, whereas "negative" means negative (red) attribution toward the predicted label. In a sense, green text is what pushed the model toward its decision, and red text is what pushed against it. Using Captum's integrated gradients visualization, we were able to see how our model interprets the sarcastic tweets in our hand-labeled dataset.
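To illustrate how such token attributions are obtained (a sketch, not our exact code), Captum's LayerIntegratedGradients can be applied to a model's embedding layer. The tiny stand-in classifier and inputs below are placeholders; in practice the BiLSTM described earlier takes their place:

    # Sketch: integrated-gradients attributions over an embedding layer with Captum.
    import torch
    import torch.nn as nn
    from captum.attr import LayerIntegratedGradients

    class TinyClassifier(nn.Module):
        # stand-in for the BiLSTM; any embedding-based classifier works the same way
        def __init__(self, vocab_size=1000, embed_dim=16, num_classes=3):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
            self.fc = nn.Linear(embed_dim, num_classes)
        def forward(self, token_ids):
            return self.fc(self.embedding(token_ids).mean(dim=1))

    model = TinyClassifier().eval()
    token_ids = torch.randint(1, 1000, (1, 12))   # one encoded tweet (placeholder)
    baseline = torch.zeros_like(token_ids)        # all-padding reference input
    pred_class = model(token_ids).argmax(dim=1).item()

    lig = LayerIntegratedGradients(model, model.embedding)
    attributions, delta = lig.attribute(token_ids, baselines=baseline,
                                        target=pred_class,
                                        return_convergence_delta=True)
    scores = attributions.sum(dim=-1).squeeze(0)  # one score per token
    print(scores)  # positive values support the predicted label, negative values oppose it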
Figure 4: Captum visualization on sarcastic hand-picked tweets

In this case, we hypothesize that our model is picking up on sarcasm, but in the wrong way. In both examples it seems to interpret both the positive and negative sections sarcastically, hence "great" and "amazing" being read as negative. To interpret sarcasm correctly, usually only the beginning or the end of the sentence should be inverted. Notably, our model is bidirectional, so it is plausible that the classifier layer is not complex enough to construct the relations and stronger signals needed to decide which direction of the sarcasm should be inverted, and instead the model inverts both directions.
Because, in a sense, we are not training on the hand-picked set, it is useful to see how sarcastic examples are interpreted in the training set, so we hand-picked some sarcastic examples from our TSE training data for Captum visualization.

Figure 5: Captum visualization on sarcastic TSE tweets

Notably, many of the examples we picked for sarcasm became unintelligible once lemmatization was applied, which likely makes sarcasm harder to interpret. There is, however, a somewhat similar example: "Screw review, I think Wolverine awesome. But enough Dominic Monaghan liking," originally "Screw the reviews, I thought Wolverine was awesome. But not enough Dominic Monaghan for my liking." Once our model saw "Screw review, I think," it mistakenly interpreted "awesome" negatively, which implies it is handling sarcasm incorrectly. The model has a negative attribution score here, meaning it did not want to predict positive, but the other options were worse, so it was forced to. We also note that during our preprocessing "but not enough" became "but enough," which have drastically different meanings.

Another example worth noting is "I wan na single rest life," originally "I don't wanna be single for the rest of my life." Here our preprocessing removed key parts of the sentence, including the negation. Otherwise, note that the model interprets "life" as highly negative (positive attribution for a prediction of 0), which potentially shows it capturing temporal relations, while "na single" was interpreted as positive (negative attribution for a prediction of 0). This seems to be a general trend: the model is not interpreting sentences the way a human would, but is finding keywords (that do not always make sense) whose temporal relations swing its decision.

5 Challenges

There were a few challenges we faced while training our models. Running the LSTM many times for hyperparameter tuning was so computationally expensive that several of our accounts were kicked off of Google Colab. We solved this by subscribing to Google Colab Pro. In the future we will be more careful with how we use our resources and try to avoid leaving code running overnight. Computational resources are a common issue when training larger models, so we expected problems like this to arise.
6 Conclusion and Future Work

We have performed experiments with bidirectional LSTMs on Twitter and review data and have shown that, as expected, bidirectional LSTMs perform better than our RNN and Naive Bayes baselines. We briefly experimented with pseudo-labeling but obtained suboptimal results on our Twitter (TSE) dataset. The carry-over from training on the IMDB and TSE datasets proved suboptimal for Twitter product reviews. To improve our results in this domain, we would need a more complex training setup, i.e. better datasets, a different semi-supervised learning setup, and models capable of handling target dependence.
References

[1] S. Tripathi, R. Mehrotra, V. Bansal, and S. Upadhyay, "Analyzing sentiment using IMDb dataset," in 2020 12th International Conference on Computational Intelligence and Communication Networks (CICN), 2020, pp. 30-33.
[2] C. Zhang and L. Liu, "Research on semantic sentiment analysis based on BiLSTM," in 2021 4th International Conference on Artificial Intelligence and Big Data (ICAIBD), 2021, pp. 377-381.
[3] B. Chen, J. Jiang, X. Wang, J. Wang, and M. Long, "Debiased pseudo labeling in self-training," 2022.
[4] Q. Xie, Z. Dai, E. Hovy, M.-T. Luong, and Q. V. Le, "Unsupervised data augmentation for consistency training," 2019.
[5] C. Wei, K. Shen, Y. Chen, and T. Ma, "Theoretical analysis of self-training with deep networks on unlabeled data," in International Conference on Learning Representations, 2021. [Online]. Available: https://openreview.net/forum?id=rC8sJ4i6kaH
[6] K. Sohn, D. Berthelot, C.-L. Li, Z. Zhang, N. Carlini, E. D. Cubuk, A. Kurakin, H. Zhang, and C. Raffel, "FixMatch: Simplifying semi-supervised learning with consistency and confidence," arXiv preprint arXiv:2001.07685, 2020.
[7] D. Zhou, O. Bousquet, T. Lal, J. Weston, and B. Schölkopf, "Learning with local and global consistency," in Advances in Neural Information Processing Systems, S. Thrun, L. Saul, and B. Schölkopf, Eds., vol. 16. MIT Press, 2003. [Online]. Available: https://proceedings.neurips.cc/paper/2003/file/87682805257e619d49b8e0dfdc14affa-Paper.pdf
[8] L. N., "IMDB dataset of 50K movie reviews," 2019. [Online]. Available: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
[9] "Tweet sentiment extraction." [Online]. Available: https://www.kaggle.com/competitions/tweet-sentiment-extraction/data
[10] B. Densil, "Twitter product sentiment analysis," 2020. [Online]. Available: https://www.kaggle.com/datasets/blessondensil294/twitter-product-sentiment-analysis
[11] P. Singh, "Fundamentals of bag of words and TF-IDF," Feb 2020. [Online]. Available: https://medium.com/analytics-vidhya/fundamentals-of-bag-of-words-and-tf-idf-9846d301ff22
[12] C. Sun, L. Huang, and X. Qiu, "Utilizing BERT for aspect-based sentiment analysis via constructing auxiliary sentence," Mar 2019. [Online]. Available: https://arxiv.org/abs/1903.09588v1
[13] Y. Jo and A. H. Oh, "Aspect and sentiment unification model for online review analysis," in Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, Feb 2011. [Online]. Available: https://dl.acm.org/doi/10.1145/1935826.1935932
[14] D. Tang, B. Qin, X. Feng, and T. Liu, "Effective LSTMs for target-dependent sentiment classification," 2015. [Online]. Available: https://arxiv.org/abs/1512.01100
[15] Z. Gao, A. Feng, X. Song, and X. Wu, "Target-dependent sentiment classification with BERT," IEEE Access, vol. 7, pp. 154290-154299, 2019.
[16] M. Sundararajan, A. Taly, and Q. Yan, "Axiomatic attribution for deep networks," 2017. [Online]. Available: https://arxiv.org/abs/1703.01365
[17] G. Kour and R. Saabne, "Real-time segmentation of on-line handwritten Arabic script," in Frontiers in Handwriting Recognition (ICFHR), 2014 14th International Conference on. IEEE, 2014, pp. 417-422.
[18] G. Kour and R. Saabne, "Fast classification of handwritten on-line Arabic characters," in Soft Computing and Pattern Recognition (SoCPaR), 2014 6th International Conference of. IEEE, 2014, pp. 312-318.
[19] G. Hadash, E. Kermany, B. Carmeli, O. Lavi, G. Kour, and A. Jacovi, "Estimate and replace: A novel approach to integrating deep neural networks with existing applications," arXiv preprint arXiv:1804.09028, 2018.
[20] B. Pang, L. Lee, and S. Vaithyanathan, "Thumbs up? Sentiment classification using machine learning techniques," 2002.
[21] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, http://www.deeplearningbook.org.
[22] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825-2830, 2011.
[23] S. Liao, J. Wang, R. Yu, K. Sato, and Z. Cheng, "CNN for situations understanding based on sentiment analysis of Twitter data," Procedia Computer Science, vol. 111, pp. 376-381, 2017, the 8th International Conference on Advances in Information Technology. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S1877050917312103
[24] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," 2019.
[25] G. Sidorov, F. Velasquez, E. Stamatatos, A. Gelbukh, and L. Chanona-Hernández, "Syntactic n-grams as machine learning features for natural language processing," Expert Systems with Applications, vol. 41, no. 3, pp. 853-860, Feb 2014. [Online]. Available: https://linkinghub.elsevier.com/retrieve/pii/S0957417413006271
[26] C. Manning, P. Raghavan, and H. Schütze, "Stemming and lemmatization." [Online]. Available: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
[27] C. Khanna, "Text pre-processing: Stop words removal using different libraries," Feb 2021. [Online]. Available: https://towardsdatascience.com/text-pre-processing-stop-words-removal-using-different-libraries-f20bac19929a
[28] M. Monette, "Anatomy of a tweet | Twitter for business," Jul 2014. [Online]. Available: https://openmoves.com/blog/anatomy-of-a-tweet/
[29] R. Liu, Y. Shi, C. Ji, and M. Jia, "A survey of sentiment analysis based on transfer learning," IEEE Access, vol. 7, pp. 85401-85412, 2019.
[30] R. Liu, Y. Shi, C. Ji, and M. Jia, "A survey of sentiment analysis based on transfer learning," IEEE Access, vol. PP, pp. 1-1, 06 2019.
[31] "SemEval-2014 Task 4." [Online]. Available: https://alt.qcri.org/semeval2014/task4/
[32] KazAnova, "Sentiment140 dataset," Sep 2017. [Online]. Available: https://www.kaggle.com/datasets/kazanova/sentiment140
[33] Passionate-Nlp, "Twitter sentiment analysis," Aug 2021. [Online]. Available: https://www.kaggle.com/datasets/jp797498e/twitter-entity-sentiment-analysis

A Data

The first component of a tweet is the profile photo. According to Twitter, a profile photo should be 400x400 pixels, less than 2 MB, and either a JPG, PNG, or GIF. This photo can give insight into whether the account belongs to an individual user or an organization. The photo can be analyzed with facial recognition to determine whether a person is present, and we can also try to detect logos and text within it; combining these two attributes can be useful for classifying a Twitter account as belonging to an individual or an organization. For this project, tweets from individuals are more valuable, as they are less likely to be biased toward certain products.

The next component of a tweet is the account name and the @username. The account name is the portion of the username that can be changed, while the @username cannot be changed after an account is made. This information may seem useless, but some analysis can be performed on it. These usernames can be compared against common name banks to determine whether they contain a person's name; accounts with a common name in their username more often belong to an individual. The name can also be used to predict a user's gender, age, and ethnicity.

Every tweet has a timestamp of when it was posted. This information can be used to select tweets from a certain time period or to look for trends over time.
If we are looking for reviews of a product that was released only a month ago, it is a good idea to remove tweets older than one month from our search. For a product that has been out much longer, we might be interested in how people's perception has shifted since its release; we can use the timestamp to analyze this shift and detect whether sentiment is becoming more positive or negative over time. This might help us catch products that were over-promised before release, since their sentiment will be very high before the release date and drop soon after.

The body of the tweet can contain both text and images. The text of a tweet should not be blindly treated as an ordinary block of text, since tweets often contain hashtags, @mentions, links, emojis, and emoticons. A hashtag links the tweet to a certain topic, which can be useful for determining the subject of a tweet or connecting it to others with similar subjects. An @mention shows that the tweet has a relationship with the tagged account; people often include an @mention of the company or individuals responsible for the product they are tweeting about. Links attach the tweet to a website; these links could potentially be analyzed, but otherwise they should be removed before modeling [28].
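As a small illustration of this kind of clean-up (our own sketch; the regular expressions and example tweet are illustrative), URLs and @mentions can be stripped and hashtags reduced to their text:

    # Sketch: strip URLs and @mentions, keep hashtag text.
    import re

    def clean_tweet(text):
        text = re.sub(r"https?://\S+", " ", text)   # remove links
        text = re.sub(r"@\w+", " ", text)           # remove @mentions
        text = re.sub(r"#(\w+)", r"\1", text)       # keep hashtag text, drop '#'
        return re.sub(r"\s+", " ", text).strip()

    print(clean_tweet("Loving the new #iPhone camera! @Apple https://t.co/example"))
    # -> "Loving the new iPhone camera!"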
We have looked into how other Twitter researchers handle emojis and emoticons when performing sentiment analysis and have not found much research. We plan to incorporate emojis and emoticons into our model, as they carry a lot of information about the sentiment of a tweet.

A.1 Experimental Set-Up

A.2 Bag of Words

This method is the simplest way to vectorize a body of text into an array of frequencies: it simply counts the number of times each word occurs in the text [11]. An example sentence, "I walked my dog to my neighborhood's dog park," would be encoded:

  I: 1, walked: 1, my: 2, dog: 2, to: 1, neighborhood's: 1, park: 1

A.3 TF-IDF

Term Frequency - Inverse Document Frequency (TF-IDF) is an extension of the Bag of Words technique that normalizes a word's frequency against the number of words in the document and the number of other documents in the corpus that contain the word [11]. In the case of tweets, the TF-IDF of a given word w would be calculated as

  TF-IDF(w) = (times w appears in the tweet / number of words in the tweet) · log(number of documents in the corpus / number of documents in the corpus that contain w)

where the corpus is a collection of random tweets.

A.4 N-Gram

One issue with Bag of Words and TF-IDF vectorization is that they ignore word order and only count how often a word occurs. N-grams attempt to fix this: an n-gram simply combines n consecutive words into one unit [25]. For example, the sentence "I am not happy with this product" could be divided into the 2-grams "I am", "am not", "not happy", "happy with", "with this", and "this product." Combining words into 2-grams lets us analyze the phrase "not happy", which is clearly negative; this negation of "happy" would not have been caught without 2-grams. The same method extends to 3-grams, 4-grams, and so on.

A.5 Lemmatization

To simplify the feature space further we can use lemmatization, the process of condensing a word down to its root; the root form of a word is known as its lemma. This often involves turning plural words singular or removing suffixes with no meaning [26]. An example would be simplifying "running" to "run."

A.6 English Stop Word Removal

A simple technique to reduce the feature space is to remove words that carry little to no meaning, such as "the", "an", "it", and "to." This process is known as English stop word removal and is often used in search engines to reduce user queries to their core meaning [27].
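A minimal sketch tying these preprocessing steps together, using NLTK and scikit-learn (the sample sentences and the n-gram range are arbitrary):

    # Sketch: stop-word removal + lemmatization (NLTK), then bag-of-words /
    # n-gram counts (scikit-learn). The sample texts are placeholders.
    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from sklearn.feature_extraction.text import CountVectorizer

    nltk.download("stopwords")
    nltk.download("wordnet")

    lemmatizer = WordNetLemmatizer()
    stop_words = set(stopwords.words("english"))

    def preprocess(text):
        tokens = [t for t in text.lower().split() if t.isalpha() and t not in stop_words]
        # note: in practice a part-of-speech tag is passed, so verbs such as
        # "walked" reduce to "walk"; the default here lemmatizes nouns only
        return " ".join(lemmatizer.lemmatize(t) for t in tokens)

    docs = [preprocess("I walked my dog to my neighborhood dog park"),
            preprocess("the dog park was not crowded today")]

    vectorizer = CountVectorizer(ngram_range=(1, 2))   # unigrams and 2-grams
    counts = vectorizer.fit_transform(docs)
    print(vectorizer.get_feature_names_out())
    print(counts.toarray())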
A.7 Word Embedding

Some other commonly used vectorization packages include Word2vec, GloVe, and fastText.

B Model Approaches

B.1 Rule-Based

Rule-based learning revolves around a sentiment lexicon, which is essentially a vocabulary or dictionary of words where each word has a rating based on averaged scores given by language experts. The ratings can be discrete (polarity-based lexicons) or continuous (valence-based lexicons). Using these rules, sentences can be analyzed and given discrete or continuous scores; the most common output is whether a sentence is positive, negative, or neutral in tone. VADER, created in 2014, is a popular rule-based model that works well on social media content. It uses a valence-based lexicon and computes how positive or negative a sentence is on a scale, determined by adding the scores of individual words, applying modifications based on the structure of the sentence, and normalizing the result to lie between -1 and 1.

B.2 Machine Learning

B.2.1 Naive Bayes

Given a class variable y and feature vector x_1, x_2, ..., x_n, traditional (binary) classification with Naive Bayes works by calculating the following [22]:

  P(y | x_1, ..., x_n) = P(y) P(x_1, x_2, ..., x_n | y) / P(x_1, x_2, ..., x_n)   (1)

Using the naive conditional independence assumption,

  P(x_i | y, x_1, ..., x_{i-1}, x_{i+1}, ..., x_n) = P(x_i | y),   (2)

Naive Bayes then assigns the label with the highest posterior probability.

B.2.2 Random Forest

Random forest classifiers use an ensemble of uncorrelated decision trees to protect the overall decision from the errors that some individual trees may make.

B.2.3 SVM

The SVM is a classification model that can analyze features in a higher-dimensional space using the kernel trick before finding the optimal decision boundary, or hyperplane, for those features in that space.

B.3 Deep Learning

B.3.1 CNN

A CNN is a type of neural network that makes use of a convolutional layer, which applies convolution filters to the data and enables the detection of important features. The CNN is best known for its use in computer vision but can also be applied to NLP problems.
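For illustration only (we did not use a CNN in this project), a minimal 1-D convolutional text classifier might look like this in PyTorch; all names and sizes are assumptions:

    # Sketch: a tiny 1-D CNN over word embeddings for sentence classification.
    import torch
    import torch.nn as nn

    class TextCNN(nn.Module):
        def __init__(self, vocab_size, embed_dim=100, num_filters=64,
                     kernel_size=3, num_classes=3):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
            self.conv = nn.Conv1d(embed_dim, num_filters, kernel_size)
            self.classifier = nn.Linear(num_filters, num_classes)

        def forward(self, token_ids):
            x = self.embedding(token_ids).transpose(1, 2)  # (batch, embed_dim, seq_len)
            x = torch.relu(self.conv(x))                   # detect local n-gram features
            x = x.max(dim=2).values                        # max-pool over time
            return self.classifier(x)

    model = TextCNN(vocab_size=20000)
    print(model(torch.randint(1, 20000, (8, 30))).shape)   # torch.Size([8, 3])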
B.3.2 RNN

The ordering of words has a huge impact on a tweet's meaning, so a model that can process a tweet sequentially makes sense. The RNN is a type of deep learning model that consumes its input sequentially and can therefore take into account word dependencies and relationships. Since its initial conception, the RNN has evolved substantially; several of the more important evolutions are the bidirectional RNN, the long short-term memory (LSTM), the gated recurrent unit (GRU), and the transformer architecture.

The bidirectional RNN allows the model to learn word dependencies and relationships both before and after a given word in a sentence. This is an extremely useful addition to the original RNN, as the context of a given word might require knowing the words in front of and behind it. Both the LSTM and GRU architectures help reduce the vanishing gradient problem, allowing long-term word dependencies to be learned more easily; because tweets are short, however, this may be less of a problem than in NLP tasks with longer inputs. Lastly, the transformer architecture speeds up training by allowing parallelization, a flaw in the earlier designs.

C Hyperparameter Tuning

Above we show results of hyperparameter tuning with Weights and Biases (WandB) on our model using the competition (TSE) dataset. The sweep on the left was our first WandB run, while the one on the right is our second run, in case the true optimal parameters had produced a faulty result. Since both sweeps reported similar results, we believe the optimal parameters were already found during the earlier sweep. The sweep above was for our model on the IMDB dataset.

D Team Leads

1. Research Lead - Andrew Chan
2. Data Scraping Lead - Devin Sohi
3. Model Lead - Nick Frankenberg
4. Data Preprocessing Lead - Justin Cheng
E Task Assignment / Schedule

F Code Availability

Code is available here: https://drive.google.com/drive/folders/1Qm1i_JuAcGrD2OvgUAq9vF_AQG5t1zzV?usp=sharing