TWITTER PRODUCT REVIEWS USING NLP
A PREPRINT
Andrew Chan
Department of Computer Science
Virginia Tech
andrewclchan211@vt.edu
Devin Sohi
Department of Computer Science
Virginia Tech
devin1@vt.edu
Justin Cheng
Department of Computer Science
Virginia Tech
jcheng2000@vt.edu
Nicholas Frankenberg
Department of Computer Science
Virginia Tech
nicholascf@vt.edu
May 8, 2022
ABSTRACT
Product reviews help consumers make better decisions about which products to buy and why. To broaden the reach of
such analysis, we use product reviews on Twitter along with other datasets to create a rating system based on the
sentiments found in those reviews using sentiment analysis. This should make it easier for consumers to figure out
which products are worth buying and why. Initially, we evaluate standard traditional and deep learning models on our
datasets. However, labeling incoming data is too time-consuming and labor-intensive to consider for large-scale tasks.
To mitigate the time required to label such data, and if time allows, we plan to benchmark how well traditional
machine learning models and our bidirectional LSTMs perform with pseudo-labeling.
Keywords NLP · LSTM · Sentiment Analysis
1 Introduction
Sentiment analysis has been a popular topic in the natural language processing domain and has been closely tied
to product reviews and how consumers make decisions between products. The goal is to summarize the various
complexities and emotions in human language into various categories.
Traditionally, Naive Bayes over some embedding space has been used with a fair amount of success in the NLP domain [1].
More recently, deep learning models such as RNNs and LSTMs have been introduced for this problem and for NLP in general,
and are expected to perform better than Naive Bayes [2].
However, the labeling needed for this data can be very time-consuming and labor-intensive. This has spurred ongoing
research in semi-supervised learning [3].
Some notable recent results in semi-supervised learning have been published by Google Research: a simple unsupervised
consistency loss with synonym-based transformations achieved 95.5% accuracy on the IMDB dataset with only 20 of 25,000
data points labeled, versus 93.5% without semi-supervised learning [4].
However, the most basic approach to this problem is pseudo-labeling. In summary, a model is trained on labeled data
and asked to predict labels, called pseudo labels, for the unlabeled data; the model is then retrained on the combined
dataset. Pseudo-labeling has been theoretically justified [5] and experimentally proven successful [6, 7].¹
¹ We will pursue this approach for extra credit.
2 The Approach
Our approach to tackling the problem of Twitter product reviews is twofold. First, we tackle sentence-level sentiment
analysis in a supervised manner using the benchmarked IMDB dataset [8] along with a Twitter sentiment dataset [9] to
train our chosen model, the bidirectional LSTM. Second, to bias the model toward Twitter product reviews, we will, for
extra credit, add pseudo-labeling with an unlabeled Twitter product review dataset to improve performance.
2.1 Datasets
For handling sentiment analysis in a supervised manner, we use two datasets: the benchmarked IMDB dataset and a
Twitter-centric dataset called Twitter Sentiment Extraction [8, 9]. The IMDB dataset consists of about 50,000 movie
reviews, 25,000 of which are labeled positive and the rest negative. It has been benchmarked on PapersWithCode and
used in numerous research papers. We believe this dataset will help us build a good review-based sentiment analysis
model that can handle a wide variety of words and opinions. The downsides of using this dataset to train a model for
Twitter product reviews are that it is polarized, does not contain tweets, and some of its reviews reach up to 3,000
words. It is polarized because it only contains reviews rated 0-4 or 7-10 out of 10 stars, which might bias our model
to take more extreme stances on product reviews.
The Twitter Sentiment Extraction (TSE) dataset is a Kaggle competition dataset meant for standard Twitter sentiment
analysis [9]. It consists of 27,000 tweets labeled negative, neutral, or positive. This dataset will help focus our
model on how to interpret tweets. Tweets do not always have the same tone as ordinary sentences, so it is important
for our model to understand the context of the reviews. The pros of this dataset are that it is based on Twitter and
its short tweets allow quick model training. However, the dataset is also small; sentiment analysis often requires
many training examples, so our model will likely underfit.
Neither of the above datasets is explicitly meant for analyzing Twitter product reviews. The IMDB dataset targets movie
reviews, while the TSE dataset targets tweet sentiment regardless of context. A model trained on these datasets might
miss some of the intricacies particular to Twitter product reviews, so it would be prudent to add Twitter product
review training examples. Unfortunately, we were not able to find a labeled dataset of Twitter product reviews. There
is an unlabeled one on Kaggle, called Twitter Product Sentiment Analysis (TPSA), focused on a subset of product reviews
including Apple products [10]. The semi-supervised technique of pseudo-labeling would allow us to leverage this data to
give our models more context on Twitter product reviews. We plan to attempt this technique for extra credit and to
analyze whether it yields any gain in performance.
The final set of data is a small set of hand-picked tweets used to test the models trained on the two datasets above.
We refer to it as the "Hand-Labeled" dataset from this point on [9, 10]. It is curated to contain many cases that are
tricky for the models to handle, such as target-dependent or sarcastic tweets. Most of the tweets relate to reviews of
Apple products. Additionally, all the tweets are either positive or negative, and the label is taken from the point of
view of the Apple product, i.e., it is target-dependent. For example, "Google>Apple" would be classified as negative
from the point of view of the target Apple. Because the dataset uses sarcastic and target-dependent tweets, we expect a
comparatively low accuracy on it.
Worth mentioning is that the IMDB dataset has two output classes, while the TSE dataset has three. For the models
trained on each of these datasets, particular rating systems, e.g., a 5-star scale, would require scaling the output
values, which is outside the scope of our project. Before we can use the data from these datasets to train our models,
we need to preprocess it.
2.2 Preprocessing
In order to train our models on the above datasets, we must first preprocess the data. The different models that we have
chosen to compare will use different preprocessing techniques to fit their needs.
Our baseline Naive Bayes model uses a simple Term Frequency-Inverse Document Frequency (TF-IDF) preprocessing
technique [11]. This method vectorizes the frequency of words within a tweet and normalizes each term by its occurrence
throughout the dataset as a whole. A more thorough explanation of TF-IDF is provided in the appendix. An advantage of
this method is that it is very quick to run and yields a decent vectorization that Naive Bayes can exploit. Its
disadvantages are the complete loss of word ordering, the inability to handle words not seen during training, and a
high input dimension that would not be appropriate for more sophisticated models. Despite
this method’s simplicity, Naive Bayes is able to extract enough information from the vectorization to achieve adequate
results.
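For reference, the sketch below shows how a baseline of this kind could be assembled with scikit-learn. It is a minimal
illustration, not our exact code; the file name, column names, and split are assumptions.

```python
# Minimal sketch of a TF-IDF + Naive Bayes baseline (scikit-learn).
# The CSV file and its 'text'/'sentiment' columns are illustrative assumptions.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

df = pd.read_csv("tweets.csv")  # hypothetical labeled tweet file
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["sentiment"], test_size=0.2, random_state=42)

# TF-IDF turns each tweet into a sparse term-weight vector;
# Multinomial Naive Bayes then classifies on those weights.
model = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
model.fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```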
Our neural network based models (BiRNN and BiLSTM) take a different approach to preprocessing. Before the data is
handed to the model, a few steps are applied. The first is English stop word removal: removing words that provide
little to no meaning in a sentence, such as "the", "it", "to", and "a". The next step is lemmatization, the process of
reducing a word to its root, for example turning "going" into "go". These methods are used together to reduce the total
dictionary size and to help achieve a more meaningful encoding. Once they are complete, the resulting words are
translated into integers representing their index within the dictionary that has been built, and the neural networks
take these integers as their input. Because of the recurrent nature of these models, word order is preserved throughout
training. As the model trains, it learns an optimal encoding of the words in its embedding layer while minimizing the
loss. This approach has proven effective and encodes words meaningfully, in a way similar to an autoencoder. Given our
results and research, no external word embeddings are necessary.
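A minimal sketch of this preprocessing path is shown below, assuming NLTK for stop words and lemmatization; the
tokenization rule and the reserved padding/unknown indices are illustrative assumptions.

```python
# Sketch of the neural-network preprocessing path described above:
# stop-word removal, lemmatization, then integer indexing of the vocabulary.
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords")
nltk.download("wordnet")
STOP = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def clean(tweet: str) -> list[str]:
    # simple whitespace tokenization and alphabetic filtering (assumption)
    tokens = [t.lower() for t in tweet.split() if t.isalpha()]
    return [lemmatizer.lemmatize(t) for t in tokens if t not in STOP]

def build_vocab(tweets: list[str]) -> dict[str, int]:
    vocab = {"<pad>": 0, "<unk>": 1}  # reserved indices (assumption)
    for tweet in tweets:
        for tok in clean(tweet):
            vocab.setdefault(tok, len(vocab))
    return vocab

def encode(tweet: str, vocab: dict[str, int]) -> list[int]:
    return [vocab.get(tok, vocab["<unk>"]) for tok in clean(tweet)]
```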
2.3 Model Selection
For the models, we have selected Naive Bayes, BiRNN, and BiLSTM for our comparisons. While searching for a good
baseline, we considered a variety of models including Naive Bayes and SVM. We decided on Naive Bayes because many other
projects have used it with great success [1]. In addition, a Naive Bayes model was the most accurate model for the IMDB
dataset on Kaggle, which provides additional motivation for our choice.
Briefly, the Naive Bayes model takes the probabilities of a classification given different features and labels and
returns the most likely classification given the features. Essentially, it predicts a classification based on the
frequency of features for a given label, providing a context-free way to analyze the features and produce predictions.
The deep learning baseline we have chosen is a Bidirectional Recurrent Neural Network (BiRNN). We opted for a
bidirectional RNN as opposed to a single-direction RNN because a single-direction RNN can miss part of a word's
context, since it only reads the sequence in one direction. A BiRNN addresses this by also feeding the sequence of
words in reverse order when generating the weights for each word, which in turn provides more context for words that
are spelled the same but have different meanings.
Finally, we chose the BiLSTM model as our cutting-edge solution. Like the BiRNN, the BiLSTM also feeds the sequence of
words in both orders to minimize the context problem. As an improvement over the RNN, the BiLSTM uses a series of gates
to decide what information to keep and what to forget. Ideally, these gates will yield a network that can better
predict the classifications for the datasets.
We chose the BiRNN as the deep learning baseline because the BiLSTM was our chosen state-of-the-art model; since the
BiLSTM is essentially an improved form of the BiRNN, this provides a more direct comparison between the models.
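As a concrete illustration of the kind of bidirectional LSTM classifier being compared, a PyTorch sketch is given
below. The layer sizes, the number of classes, and the choice to classify from the concatenated final forward and
backward hidden states are assumptions, not the exact architecture used here.

```python
# Sketch of a bidirectional LSTM classifier (PyTorch); sizes are illustrative.
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=256,
                 num_layers=2, num_classes=3, dropout=0.2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=num_layers,
                            batch_first=True, bidirectional=True, dropout=dropout)
        self.fc = nn.Linear(2 * hidden_dim, num_classes)  # forward + backward states

    def forward(self, token_ids):                # token_ids: (batch, seq_len)
        embedded = self.embedding(token_ids)     # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(embedded)        # h_n: (2 * num_layers, batch, hidden_dim)
        # concatenate the last forward and last backward hidden states
        features = torch.cat([h_n[-2], h_n[-1]], dim=1)
        return self.fc(features)                 # raw class scores (logits)
```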
3 Experiments
We ran a few experiments on our models to compare their performance on different data. While we ran into some problems
because of a lack of GPU access, we were able to use Weights and Biases to handle our hyperparameter grid search and
find the optimal parameters for each of our models. The setup of the grid search, as well as the other experiments we
ran, is explained below.
3.1 Experimental Set-up
Because it is quick to train on, we decided to experiment on the TSE dataset first. A 64/16/20% train/validation/test
split was applied, and the splits were given to each of our three chosen models: Naive Bayes, BiRNN, and BiLSTM. Since
they are baseline models, we decided not to hyperparameter-tune the Naive Bayes and BiRNN, keeping their architectures
constant; the BiRNN reused the optimal hyperparameters found for the BiLSTM. The BiRNN and BiLSTM models were both
trained with early stopping for the number of epochs and the categorical cross-entropy loss function with the Adam
optimizer. Additionally, a batch size of 32 was used for both models. Two hidden cells were used for the BiLSTM model
while only one was used for the BiRNN. For our BiLSTM, we ran a full grid search over the hyperparameters described
below using
Weights and Biases. The four hyperparameters we focused on are the learning rate, embedding size, hidden size, and
dropout rate within the LSTM cell.
We chose these four hyperparameters because they are easy to change with the PyTorch API. The learning rate makes a big
difference in how quickly the model converges to a local minimum and in whether that local minimum is close to the
global minimum, so trying different values for it can lead to drastic differences in the model's performance. The
embedding size determines the size of the embedding layer and the dimension in which a word is represented; increasing
the dimension allows more nuances between words to be captured, at the possible expense of over-fitting what a word
means, which can cause context problems. The hidden size determines how much memory can be held within an LSTM cell,
an important property of the model. Lastly, dropout within the LSTM cell can prevent potential overfitting and is
common practice, though overfitting will likely not be a problem with smaller datasets. The table below shows the
different hyperparameter combinations we attempted:
Table 1: LSTM model parameters

Hyperparameters            Values
Epochs used                4
Learning rate              0.0001, 0.001, 0.01
Embedding size             300, 400, 500 (best above)
Hidden size                128, 256, 512 (best above)
Dropout within LSTM cell   0.2, 0.4, 0.6 (best above)
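A Weights and Biases sweep over these values could be configured roughly as sketched below; the training function body,
metric name, and project name are illustrative assumptions, not our exact configuration.

```python
# Sketch of a Weights & Biases grid sweep over the values in Table 1.
import wandb

sweep_config = {
    "method": "grid",  # random search was used instead for the IMDB runs
    "metric": {"name": "val_loss", "goal": "minimize"},
    "parameters": {
        "learning_rate": {"values": [0.0001, 0.001, 0.01]},
        "embedding_size": {"values": [300, 400, 500]},
        "hidden_size": {"values": [128, 256, 512]},
        "dropout": {"values": [0.2, 0.4, 0.6]},
    },
}

def train():
    run = wandb.init()
    cfg = run.config
    # build and train the BiLSTM with cfg.learning_rate, cfg.embedding_size, etc.,
    # logging wandb.log({"val_loss": ...}) each epoch (omitted here)

sweep_id = wandb.sweep(sweep_config, project="twitter-product-reviews")
wandb.agent(sweep_id, function=train)
```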
The second dataset we experimented on was the IMDB dataset. A 40/10/50% train/validation/test split was applied, and
the splits were given to each of our three chosen models: Naive Bayes, BiRNN, and BiLSTM. We chose these percentages
because the common benchmarks in the literature use 50% of the data for testing. The BiRNN and BiLSTM models were both
trained with early stopping for the number of epochs and the categorical cross-entropy loss function with the Adam
optimizer. Additionally, a batch size of 32 and 2 hidden cells were used for both BiRNN and BiLSTM models. Using the
possible hyperparameter combinations shown in the table above, we ran a random search with Weights and Biases; we chose
random search as opposed to grid search because the much longer training time made trying all combinations infeasible.
As in the TSE experiments, hyperparameter tuning was only done on the BiLSTM, and the BiRNN was run with the optimal
parameters found for the BiLSTM.
Once the hyperparameter searches were done, we evaluated the test accuracies by training new models with the same
hyperparameters.
As extra credit, we perform pseudo-labeling on the BiLSTM with optimal hyperparameters on the TSE dataset. The training
setup is the same as our standard supervised setup, but with an added unsupervised component. We split the data into
64% labeled data, 20% unlabeled data, and 16% validation. In every epoch we train over the labeled dataset, then the
unlabeled dataset, with the unsupervised loss being the cross entropy between the previous model's predictions and the
current model's output, weighted by an alpha value of 1.0.
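The loop below is a minimal PyTorch sketch of this setup; the model, optimizer, and data loader names are assumptions.
The loss structure follows the description above: supervised cross entropy on the labeled split, then cross entropy
against the previous epoch's predictions on the unlabeled split, weighted by alpha = 1.0.

```python
# Sketch of the pseudo-labeling loop described above (PyTorch).
import copy
import torch
import torch.nn.functional as F

alpha = 1.0
prev_model = copy.deepcopy(model)  # frozen copy used to produce pseudo labels

for epoch in range(num_epochs):
    model.train()
    # 1) supervised pass over the labeled split
    for tokens, labels in labeled_loader:
        optimizer.zero_grad()
        loss = F.cross_entropy(model(tokens), labels)
        loss.backward()
        optimizer.step()

    # 2) unsupervised pass over the unlabeled split
    for tokens in unlabeled_loader:
        with torch.no_grad():
            pseudo = prev_model(tokens).argmax(dim=1)  # previous model's predictions
        optimizer.zero_grad()
        loss = alpha * F.cross_entropy(model(tokens), pseudo)
        loss.backward()
        optimizer.step()

    prev_model = copy.deepcopy(model)  # refresh the pseudo-labeler for the next epoch
```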
We also use Captum integrated gradient analysis to visualize our results.
3.2 Results
3.2.1 TSE Dataset Results
After executing the grid search on our BiLSTM, the best hyperparameters were learning rate 0.001, embedding size 300,
hidden size 256, and dropout within the LSTM cell 0. The two figures below come from Weights and Biases and visualize
the different hyperparameter combinations and the corresponding results.
Figure 1: Hyperparameter Combinations
Figure 2: Validation Loss of 10 Best Combinations
It is hard to see clear patterns in Figure 1 in terms of which combinations of hyperparameters work best. However, we
did notice that combinations with a high learning rate and hidden size performed poorly, which was confirmed by a
second run of grid search. It is worth noting that in our second run of grid search (the figures above are for the
first run), the optimal hyperparameters were slightly different. However, in comparing the two runs, the first run
appeared to produce a slightly better optimal model in terms of fitting the validation data. The results were relatively
consistent across the two grid search runs (shown in appendix).
Based on our reading of relevant sentiment analysis research papers, the four metrics we used to evaluate the
performance of our optimal model on the TSE dataset were accuracy, test loss, average precision score, and macro AUC
score [1, 2]. A description of why we chose these metrics can be found in the appendix.
Table 2: Model comparisons on TSE dataset

Model         Average Test Loss   Test Accuracy   Average Precision Score (Macro)   AUROC Score
Naive Bayes   -                   0.607           -                                 -
BiRNN         0.915               0.629           0.641                             0.778
BiLSTM        0.879               0.666           0.6907                            0.805
Figure 3: Statistical Significance with n=10 for BiLSTM on Testing Data
Our BiLSTM model performed better than the BiRNN model and our Naive Bayes baseline, which is consistent with the
literature [1, 2]. Furthermore, the BiLSTM results were highly consistent when re-trained and re-tested on the TSE
dataset, as seen in the boxplot across 10 runs in Figure 3. The BiRNN slightly outperformed the Naive Bayes, though
their results are very close. As for why the BiLSTM outperforms the BiRNN, we believe our BiRNN suffers slightly from
the vanishing gradient problem, although, since tweets are short, this problem is not as pronounced as in the results
on the IMDB dataset. It is possible that hyperparameter tuning might bring the BiRNN closer to the BiLSTM in
performance.
3.2.2 IMDB Dataset Results
Our results on the IMDB dataset yield 89% and 87% accuracy for the BiLSTM and Naive Bayes respectively.
Table 3: Model comparisons on IMDB dataset

Model         Average Test Loss   Test Accuracy
Naive Bayes   -                   0.866
BiRNN         0.719               0.531
BiLSTM        0.421               0.890
These results met our expectations. We can compare our models' results to a few benchmarks, the first being a
pre-trained BERT model: when passing the IMDB data to the bert-base-multilingual-uncased-sentiment model, it applied
labels with an accuracy of 61%. This shows that domain specificity is important for this problem and that a general
pre-trained model might not be the best choice. Looking through a few benchmarks on paperswithcode.com, we found the
highest accuracy to be in the low 90s, with models more similar to ours performing in the mid 80s. This is a good
indication that our LSTM model performed as expected with its relatively high accuracy of 89%. We believe those few
extra accuracy points can be attributed to our thorough hyperparameter tuning analysis.
As with the TSE dataset, we used Weights and Biases for this analysis. We found the optimal hyperparameters to be
learning rate 0.001, embedding size 400, hidden size 512, and dropout 0. The learning rate had the biggest effect on
the model's performance, with embedding size also having a significant influence. The dropout and hidden size
contributed less but still shifted the model's performance.
We should note that the BiRNN model's performance dropped dramatically when switching from the TSE dataset to IMDB. We
suspect this can be attributed to the vanishing gradient problem. Since the sentence size gets
much larger in the IMDB dataset, the RNN could be having trouble converging. This problem is better addressed by the
LSTM making it a better option for this scenario.
3.2.3 Hand-Labeled Results
We tested our models on the Hand-Labeled data set to assess its generalization into our goal domain: Twitter product
reviews. On the BiLSTM model that was trained on IMDB data, the model had a 0.656 accuracy when predicting labels
on our hand-labeled data. The BiLSTM model trained on the TSE data had a 0.40 accuracy. The higher accuracy of the
IMDB model likely stems from a couple sources.
First of all, the IMDB dataset only has two output classes: positive and negative. Whereas the TSE dataset has three
output classes including neutral. Since the Hand-Labeled dataset is only made up of positive and negative classified
tweets, this appears to have confused the TSE trained model which predicted neutral much of the time. Additionally,
the IMDB dataset is much larger and contains a richer vocabulary. The IMDB trained model appeared to recognize
more of the words and their relationships in the Hand-Labeled dataset, than the TSE trained model did. We will further
elaborate on this point in the analysis section.
3.2.4 (Extra Credit) Pseudo-labeling Results
For our pseudo-labeling we achieved a validation accuracy of 0.6339% in 8 epochs on the TSE dataset, which is lower
than our baseline model. Perhaps adjusting the ratios of iterations between the unsupervised dataset and supervised
datasets would help or adjusting the alpha value to only ramp up at the end. Perhaps pseudo-labeling doesn’t work well
with this dataset. Additional experiments would be needed to make a valid conclusion.
4 Analysis
With how poorly both the TSE trained and IMDB trained models performed on the Hand-Labeled dataset, our target
domain, we were curious what went wrong. Some of the more obvious reasons are the datasets being different, leading
to a lacking vocabulary, and our models’ inability to handle target dependencies. As mentioned earlier, we were not
able to find a labeled dataset made to analyze product reviews making the first issue much harder to deal with. To make
our model adaptable to target dependencies, we would need a more complicated architecture such as those described
in [12, 13, 14, 15]. To gain an even deeper understanding of why the models are struggling, we used an attribution
technique known as integrated gradients using a third-party library called Captum.
Integrated gradients is an axiomatic attribution method which makes use of vector calculus, specifically the line integral,
to determine the impact each feature has on a prediction [16]. More information about this fascinating method can be
analyzed in the cited paper.
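A minimal sketch of how Captum's integrated gradients can be applied to a token-level model such as the BiLSTM above is
shown below; the embedding attribute name, the hypothetical encode_tweet helper, and the all-padding baseline are
assumptions about the setup rather than our exact code.

```python
# Sketch of token-level integrated gradients with Captum.
import torch
from captum.attr import LayerIntegratedGradients

model.eval()
lig = LayerIntegratedGradients(model, model.embedding)  # attribute w.r.t. the embedding layer

token_ids = encode_tweet("screw the reviews i thought wolverine was awesome")  # hypothetical helper
token_ids = torch.tensor([token_ids])
baseline = torch.zeros_like(token_ids)  # all-<pad> reference input (assumption)

pred_label = model(token_ids).argmax(dim=1)
# attributions: one score per token toward the predicted label
attributions = lig.attribute(token_ids, baselines=baseline, target=pred_label)
token_scores = attributions.sum(dim=-1).squeeze(0)  # collapse the embedding dimension
```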
Regarding Captum's visualization tool, the legend colors are somewhat unintuitive. After investigating Captum's source
code, we found that positive (green) means positive attribution to the model's predicted label, whereas negative (red)
means negative attribution to the predicted label. In a sense, green text is what positively influences the model
toward its decision, whereas red text is what pushes against it. Using Captum's integrated gradient visualization tool,
we were able to visualize how our model interprets our hand-labeled sarcasm examples.
Figure 4: Captum visualization on sarcastic hand-picked tweets
In this case, we hypothesize that our model is picking up on sarcasm in this example, but in the wrong way. In both
examples, it seems our model interprets both the positive and negative sections sarcastically, hence "great" and
"amazing" being read as negative. However, to interpret sarcasm, usually only the beginning or the end of the sentence
should be inverted. Notably, our model is bidirectional, so it is plausible that our classifier layer is
not complex enough to construct relations and stronger signals to interpret which direction of sarcasm should be correct,
and rather our model interprets that both directions should be inverted.
Because, in a sense, we are not training on the hand-picked set, it is useful to see how sarcastic examples are
interpreted in the training set, so we hand-picked some sarcastic examples from our training TSE dataset for Captum
visualization.
Figure 5: Captum visualization on sarcastic TSE tweets
Notably, many of the examples we picked for sarcasm became unintelligible after lemmatization, which likely makes
sarcasm harder to interpret. However, there is a somewhat similar example:
"Screw review, I think Wolverine awesome. But enough Dominic Monaghan liking," originally "Screw the reviews, I thought
Wolverine was awesome. But not enough Dominic Monaghan for my liking."
Once our model saw "Screw review, I think," it mistakenly interpreted "awesome" negatively, which implies our model is
incorrectly handling sarcasm. Our model has a negative attribution score here, which means it did not want to predict
positive, but the other options were worse, so it was forced to. We also note that during our preprocessing, "but not
enough" became "but enough," which have drastically different meanings.
Another one to note is:
"I wan na single rest life," originally "I don't wanna be single for the rest of my life."
In this case, our preprocessing removed key parts of the sentence, including the negation. Otherwise, note that our
model interprets "life" as highly negative (positive attribution for a prediction of 0), which potentially shows it
capturing temporal relations, while "na single" was interpreted as positive (negative attribution for a prediction of
0). This seems to be a general trend: the model does not interpret sentences the way a human would but instead finds
keywords (that do not always make sense) with temporal relations that swing its decision.
5 Challenges
We faced a few challenges while training our models. The computational expense of running the LSTM multiple times for
hyperparameter tuning was high enough that several of our accounts were kicked off Google Colab. We solved this problem
by subscribing to a Google Colab Pro account. In the future we will be more careful with how we use our resources and
try to avoid leaving code running overnight. Computational resources are a common issue when training larger models, so
we expected problems like this to arise.
6 Conclusion and Future work
We have performed experiments with bidirectional LSTMs on Twitter and review data and have shown that, as expected,
bidirectional LSTMs perform better than our RNN and Naive bayes baselines. We have briefly performed experiments
on pseudo-labeling but obtained suboptimal results on our Twitter (TSE) dataset. The carry over from our models
learning on the IMDB and TSE dataset proved suboptimal for Twitter product reviews. To improve our results toward
this domain, we would need a more complex training setup, i.e. better datasets, a different semi-supervised learning
setup, and target dependent capable models.
References
[1] S. Tripathi, R. Mehrotra, V. Bansal, and S. Upadhyay, “Analyzing sentiment using imdb dataset,” in 2020 12th
International Conference on Computational Intelligence and Communication Networks (CICN), 2020, pp. 30–33.
[2] C. Zhang and L. Liu, “Research on semantic sentiment analysis based on bilstm,” in 2021 4th International
Conference on Artificial Intelligence and Big Data (ICAIBD), 2021, pp. 377–381.
[3] B. Chen, J. Jiang, X. Wang, J. Wang, and M. Long, “Debiased pseudo labeling in self-training,” 2022.
[4] Q. Xie, Z. Dai, E. Hovy, M.-T. Luong, and Q. V. Le, “Unsupervised data augmentation for consistency training,”
2019.
[5] C. Wei, K. Shen, Y. Chen, and T. Ma, “Theoretical analysis of self-training with deep networks on
unlabeled data,” in International Conference on Learning Representations, 2021. [Online]. Available:
https://openreview.net/forum?id=rC8sJ4i6kaH
[6] K. Sohn, D. Berthelot, C.-L. Li, Z. Zhang, N. Carlini, E. D. Cubuk, A. Kurakin, H. Zhang, and C. Raffel, “Fixmatch:
Simplifying semi-supervised learning with consistency and confidence,” arXiv preprint arXiv:2001.07685, 2020.
[7] D. Zhou, O. Bousquet, T. Lal, J. Weston, and B. Schölkopf, “Learning with local and global
consistency,” in Advances in Neural Information Processing Systems, S. Thrun, L. Saul, and B. Schölkopf,
Eds., vol. 16. MIT Press, 2003. [Online]. Available: https://proceedings.neurips.cc/paper/2003/file/
87682805257e619d49b8e0dfdc14affa-Paper.pdf
[8] L. N., “Imdb dataset of 50k movie reviews,” 2019. [Online]. Available: https://www.kaggle.com/datasets/
lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
[9] “Tweet sentiment extraction.” [Online]. Available: https://www.kaggle.com/competitions/
tweet-sentiment-extraction/data
[10] B. Densil, “Twitter product sentiment analysis,” 2020. [Online]. Available: https://www.kaggle.com/datasets/
blessondensil294/twitter-product-sentiment-analysis
[11] P. Singh, “Fundamentals of bag of words and tf-idf,” Feb 2020. [Online]. Available: https:
//medium.com/analytics-vidhya/fundamentals-of-bag-of-words-and-tf-idf-9846d301ff22
[12] C. Sun, L. Huang, and X. Qiu, “Utilizing bert for aspect-based sentiment analysis via constructing auxiliary
sentence,” Mar 2019. [Online]. Available: https://arxiv.org/abs/1903.09588v1
[13] Y. Jo and A. H. Oh, "Aspect and sentiment unification model for online review analysis," in Proceedings of
the Fourth ACM International Conference on Web Search and Data Mining, Feb 2011. [Online]. Available:
https://dl.acm.org/doi/10.1145/1935826.1935932
[14] D. Tang, B. Qin, X. Feng, and T. Liu, “Effective lstms for target-dependent sentiment classification,” 2015.
[Online]. Available: https://arxiv.org/abs/1512.01100
[15] Z. Gao, A. Feng, X. Song, and X. Wu, “Target-dependent sentiment classification with bert,” IEEE Access, vol. 7,
pp. 154 290–154 299, 2019.
[16] M. Sundararajan, A. Taly, and Q. Yan, “Axiomatic attribution for deep networks,” 2017. [Online]. Available:
https://arxiv.org/abs/1703.01365
[17] G. Kour and R. Saabne, “Real-time segmentation of on-line handwritten arabic script,” in Frontiers in Handwriting
Recognition (ICFHR), 2014 14th International Conference on. IEEE, 2014, pp. 417–422.
[18] ——, “Fast classification of handwritten on-line arabic characters,” in Soft Computing and Pattern Recognition
(SoCPaR), 2014 6th International Conference of. IEEE, 2014, pp. 312–318.
[19] G. Hadash, E. Kermany, B. Carmeli, O. Lavi, G. Kour, and A. Jacovi, “Estimate and replace: A novel approach to
integrating deep neural networks with existing applications,” arXiv preprint arXiv:1804.09028, 2018.
[20] B. Pang, L. Lee, and S. Vaithyanathan, “Thumbs up? sentiment classification using machine learning techniques,”
2002.
[21] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, http://www.deeplearningbook.org.
[22] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss,
V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn:
Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
[23] S. Liao, J. Wang, R. Yu, K. Sato, and Z. Cheng, “Cnn for situations understanding based on
sentiment analysis of twitter data,” Procedia Computer Science, vol. 111, pp. 376–381, 2017, the
8th International Conference on Advances in Information Technology. [Online]. Available: https:
//www.sciencedirect.com/science/article/pii/S1877050917312103
[24] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for
language understanding,” 2019.
[25] G. Sidorov, F. Velasquez, E. Stamatatos, A. Gelbukh, and L. Chanona-Hernández, “Syntactic n-grams as machine
learning features for natural language processing,” Expert Systems with Applications, vol. 41, no. 3, p. 853–860,
Feb 2014. [Online]. Available: https://linkinghub.elsevier.com/retrieve/pii/S0957417413006271
[26] C. Manning, P. Raghavan, and H. Schütze, “Stemming and lemmatization.” [Online]. Available:
https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
[27] C. Khanna, “Text pre-processing: Stop words removal using differ-
ent libraries,” Feb 2021. [Online]. Available: https://towardsdatascience.com/
text-pre-processing-stop-words-removal-using-different-libraries-f20bac19929a
[28] M. Monette, “Anatomy of a tweet | twitter for business,” Jul 2014. [Online]. Available: https:
//openmoves.com/blog/anatomy-of-a-tweet/
[29] R. Liu, Y. Shi, C. Ji, and M. Jia, “A survey of sentiment analysis based on transfer learning,” IEEE Access, vol. 7,
pp. 85 401–85 412, 2019.
[30] R. Liu, Y. SHI, C. JI, and M. JIA, “A survey of sentiment analysis based on transfer learning,” IEEE Access,
vol. PP, pp. 1–1, 06 2019.
[31] “Semeval-2014 task 4.” [Online]. Available: https://alt.qcri.org/semeval2014/task4//
[32] KazAnova, “Sentiment140 dataset,” Sep 2017. [Online]. Available: https://www.kaggle.com/datasets/kazanova/
sentiment140
[33] Passionate-Nlp, “Twitter sentiment analysis,” Aug 2021. [Online]. Available: https://www.kaggle.com/datasets/
jp797498e/twitter-entity-sentiment-analysis
A Data
The first component of a tweet is the profile photo. According to twitter, a profile photo should be 400x400 pixels, less
than 2MB, and either a JPG, PNG or GIF. This photo can give insight to whether the account belongs to an individual
user or an organization. The photo can be analyzed using facial recognition to determine if a person is present in the
profile photo. We can also try to detect logos and text within the profile photo. Combining these two attributes can
be useful in classifying a twitter account as belonging to an individual or an organization. For the use of this project,
tweets from individuals are more valuable as they are less likely to have biases towards certain products.
The next component of a tweet is the account name and the @username. The account name is the portion of the username
that can be changed, while the @username cannot be changed after an account is made. This information may seem useless,
but some analysis can be performed on it. These usernames can be compared against common name banks to determine
whether they contain a person's name; accounts with a common name in their username more often belong to an individual.
The name can also be used to predict a user's gender, age, and ethnicity.
Every tweet has a timestamp of when it was posted. This information can be used to select tweets from a certain time
period or to look for trends over time. If we are looking for product reviews of a product that was released only a month
ago, it will be a good idea to remove any tweets from our search that are older than one month. For a product that has
been out for much longer we might be interested in how people’s perception of the product has shifted since its release.
We can use the timestamp to analyze this shift and detect if sentiments are becoming more positive or negative over
time. This might help us catch products that were over-promised before their release as their sentiments will be very
high before the release date and drop soon after.
The body of the tweet can contain both text and images. The text of a tweet should not be blindly treated as an
ordinary block of text, as tweets often contain hashtags, @mentions, links, emojis, and emoticons. A hashtag is used to
link the tweet to a certain topic; this can be useful in determining the subject of a tweet or linking it to others
with similar subjects. An @mention shows that the tweet has a relationship with the tagged account, and people often
include an @mention of the company or individuals responsible for making the product they are tweeting about. Links
attach the tweet to a website; these links could potentially be analyzed, but otherwise they should be removed from the
model [28].
We looked into how other Twitter researchers have handled emojis and emoticons when performing sentiment analysis and
did not find much research. We plan to incorporate emojis and emoticons into our model as they convey a lot of
information about the sentiment of a tweet.
A.1 Experimental Set-Up
A.2 Bag of Words
This method is the simplest way to vectorize a body of text into an array of frequencies. This is accomplished by
simply taking the number of times that a word occurs in the text [11]. An example sentence, “I walked my dog to my
neighborhood’s dog park,” would be encoded:
I 1
Walked 1
My 2
Dog 2
To 1
Neighborhood’s 1
Park 1
A.3 TF-IDF
Term Frequency-Inverse Document Frequency (TF-IDF) is an expansion of the Bag of Words vectorization technique that
attempts to normalize a word's frequency against the number of words in the document and the number of times the word
occurs in other documents in the corpus [11]. In the case of tweets, the TF-IDF of a given word would be calculated:
$$\text{TF-IDF} = \frac{\text{times the word appears in the tweet}}{\text{number of words in the tweet}} \cdot \log\left(\frac{\text{number of documents in the corpus}}{\text{number of documents in the corpus that contain the word}}\right)$$
Where the corpus is a collection of random tweets.
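As a small worked example with made-up numbers (purely illustrative, using the natural logarithm): for a word that
appears twice in a 10-word tweet, in a corpus of 1,000 tweets of which 50 contain the word,

$$\text{TF-IDF} = \frac{2}{10}\cdot\log\left(\frac{1000}{50}\right) = 0.2\cdot\log(20) \approx 0.2 \cdot 3.0 \approx 0.6.$$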
A.4 N-Gram
One issue with Bag of Words and TF-IDF vectorization is the ignorance of word order. These methods only take into
account the number of times a word occurs. N-Grams attempts to fix this issue. An N-Gram is simply the process of
taking n words and combining them into one unit [25]. For example, the sentence, “I am not happy with this product,”
could be divided into the 2-Grams:
“I am”, “am not”, “not happy”, “happy with”, “with this”, and “this product.”
Combining the words into 2-Grams allows us to analyze the phrase, “not happy” which is clearly negative. This negation
of the word “happy” would not have been caught without looking at 2-Grams. This method can also be used to analyze
3-Grams, 4-Grams and so on.
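A tiny sketch of 2-gram construction, assuming simple whitespace tokenization:

```python
# Build n-grams from a sentence (whitespace tokenization is a simplifying assumption).
def ngrams(text: str, n: int = 2) -> list[str]:
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams("I am not happy with this product"))
# ['i am', 'am not', 'not happy', 'happy with', 'with this', 'this product']
```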
A.5 Lemmatization
To simplify the feature space further we can use a method called Lemmatization. This is the process of condensing a
word down to its root. The root conversion of a word is known as its lemma. This process often involves turning plural
words singular or removing suffixes with no meaning [26]. An example would be simplifying the word “running” into
“run.”
A.6 English Stop Word Removal
A simple technique to reduce the feature space is to remove words that carry little to no meaning. This would include
words similar to: “the”, “an”, “it”, and “to.” This process is known as English stop word removal and is often used in
search engines to simplify user inputs to their core meaning [27].
A.7 Word Embedding
Some other Vectorization packages that are commonly used include Word2vec, Glove Embedding, and Fastext.
B Model Approaches
B.1 Rule-Based
Rule-based learning revolves around the use of a sentiment lexicon, which is basically just a vocabulary or dictionary of
words where the words have some sort of rating based on averaging scores given by language experts. The ratings can
either be discrete, otherwise known as polarity based lexicons, or continuous, known as valence based lexicons. Using
these rules, sentences can be analyzed and given discrete or continuous scores. The most common metric is whether a
sentence is positive, negative, or neutral in tone. VADER, created in 2014, is a popular rule based model that works well
on social media content. VADER uses a valence based lexicon and can be used to compute how positive or negative a
sentence on a scale. This scale is determined by adding the scores of individual words with modifications based on the
structure of the sentence before normalizing the score to be from -1 to 1.
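For illustration, scoring a tweet with VADER through the vaderSentiment package looks roughly like the sketch below;
the example tweet is made up.

```python
# Sketch of scoring a tweet with VADER's valence-based lexicon.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
scores = analyzer.polarity_scores("This phone is great, but the battery life is awful.")
# scores is a dict with 'neg', 'neu', 'pos' proportions and a normalized
# 'compound' score in [-1, 1]
print(scores)
```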
B.2 Machine Learning
B.2.1 Naive Bayes
Given a class vector $y$ and feature vectors $x_1, x_2, \dots, x_n$, traditional (binary) classification with Naive
Bayes works by calculating the following [22]:

$$P(y \mid x_1, \dots, x_n) = \frac{P(y)\, P(x_1, x_2, \dots, x_n \mid y)}{P(x_1, x_2, \dots, x_n)} \qquad (1)$$

Using the naive conditional independence assumption:

$$P(x_i \mid y, x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_n) = P(x_i \mid y) \qquad (2)$$
Then, Naive Bayes assigns the label with the highest posterior probability.
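Written out in the same notation, this is the standard decision rule:

$$\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)$$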
B.2.2 Random Forest
Random Forest Classifiers use a series of uncorrelated decision trees to protect the main decision from errors that some
of the trees may make.
B.2.3 SVM
The SVM is a classification model which can analyze features in higher dimensions using a kernel trick before finding
the optimal decision boundary or hyperplane for those features in the higher dimension.
B.3 Deep Learning
B.3.1 CNN
CNN is a type of neural network which makes use of a convolutional layer. This layer applies a convolution filter to the
data which enables detection of important features. The CNN is most known for its usage in computer vision, however,
can be applied to NLP problems.
B.3.2 RNN
The ordering of the words has a huge impact on a tweets meaning. Thus, a model that can process a tweet sequentially
would make sense. The RNN is a type of deep learning model that takes the input in a sequential manner and, thus,
can take into account word dependencies and relationships. Since its initial conception, the RNN model has evolved
substantially. Several of the more important evolutions are the bidirectional RNN, the long short term memory (LSTM),
gated recurrent unit (GRU), and the transformer architecture.
The bidirectional RNN allows the network to learn word dependencies and relationships both before and after a given
input word in a sentence. This is an extremely useful addition to the original RNN, as the context of a given word
might require knowing the words in front of and behind it. Both the LSTM and GRU architectures help reduce the
vanishing gradient problem, allowing long-term word dependencies to be learned more easily in a tweet; because tweets
are short, however, this might not be as much of a problem as in NLP tasks with larger inputs. Lastly, the transformer
architecture speeds up training by allowing parallelization, addressing a flaw in prior versions.
C Hyperparameter Tuning
Above we have some results on hyperparameter tuning using Weights and Biases, or WandB, on our model using the
Competition dataset. The sweep on the left was our first run of WandB while the one on the right is our second run, in
case the true optimal parameters ended up with a faulty result. All in all, since both sweeps seemed to report similar
results, we believe that the optimal parameters were found during the earlier sweep already.
This sweep above was for our model on the IMDB dataset.
D Team Leads
1. Research Lead - Andrew Chan
2. Data Scraping Lead - Devin Sohi
3. Model Lead - Nick Frankenberg
4. Data preprocessing lead - Justin Cheng
E Task assignment / Schedule
F Code availability
Code is available here: https://drive.google.com/drive/folders/1Qm1i_JuAcGrD2OvgUAq9vF_AQG5t1zzV?usp=sharing
 
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAmazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
Call Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort ServiceCall Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort Service
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 

SentimentAnalysisofTwitterProductReviewsDocument.pdf

The IMDB dataset consists of about 50,000 movie reviews, 25,000 of which are labeled positive and the rest negative. It has been benchmarked on PapersWithCode and used in numerous research papers. We believe this dataset will help us build a good review-based sentiment analysis model that can handle a wide variety of words and opinions. The cons of using it to train a model for Twitter product reviews are that it is polarized, does not contain tweets, and includes reviews of up to 3,000 words. It is polarized because it only contains reviews rated 0-4 or 7-10 stars out of 10, which might bias our model toward more extreme stances on product reviews.

The Twitter Sentiment Extraction (TSE) dataset is a competition dataset on Kaggle meant for standard Twitter sentiment analysis [9]. It consists of 27,000 tweets labeled negative, neutral, or positive. This dataset will help focus our model on how to interpret tweets, which do not always carry the same tone as ordinary sentences, so it is important for the model to understand the context of the reviews. The pros of this dataset are that it is Twitter-based and its short tweets allow for quick training times. Its small size is a con, however: sentiment analysis often requires many training examples, so our model will likely underfit.

Neither of the above datasets is explicitly meant for the domain of analyzing Twitter product reviews. The IMDB dataset targets movie reviews, while the TSE dataset targets tweet sentiment regardless of context, so a model trained on them might miss some of the intricacies particular to Twitter product reviews. It would therefore be prudent to add Twitter product review training examples. Unfortunately, we were not able to find a labeled dataset of Twitter product reviews. There is an unlabeled one on Kaggle, called Twitter Product Sentiment Analysis (TPSA), focused on a subset of product reviews including Apple products [10]. The semi-supervised technique of pseudo-labeling would allow us to leverage this data to give our models more context on how to understand Twitter product reviews. We plan on attempting this technique for extra credit and analyzing whether it yields any gain in performance.

The final set of data we are using is a tiny set of hand-picked tweets that will be used to test the models trained on the two datasets above. This dataset will be referred to as the "Hand-Labeled" dataset from this point on [9, 10]. It is curated to contain many cases that are tricky for the models to handle, such as target-dependent or sarcastic tweets.
Most of the tweets are related to reviews of Apple products. All of the tweets are either positive or negative, and the label is taken from the point of view of the Apple product, i.e., it is target dependent. For example, "Google>Apple" would be classified as negative from the point of view of the target Apple. Because the set contains sarcastic and target-dependent tweets, we expect comparatively low accuracy on this dataset.

It is worth mentioning that the IMDB dataset has two output classes while the TSE dataset has three. For the models trained on each of these datasets, mapping to a particular rating system, e.g. a 5-star scale, would require scaling the output values, which is outside the scope of our project. Before we can use the data from these datasets to train our models, we need to preprocess it.

2.2 Preprocessing

In order to train our models on the above datasets, we must first preprocess the data. The different models we have chosen to compare use different preprocessing techniques to fit their needs.

Our baseline Naive Bayes model uses a simple Term Frequency - Inverse Document Frequency (TF-IDF) representation [11]. This method vectorizes the frequency of words within a tweet and normalizes each term by its occurrence throughout the dataset as a whole; a more thorough explanation of TF-IDF is provided in the appendix. An advantage of this method is that it is very quick to run and yields a decent vectorization that Naive Bayes can exploit. Its disadvantages are the complete loss of word ordering, the inability to handle words not seen during training, and a high input dimension that would not be appropriate for more sophisticated models.
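As a minimal illustration (not the exact code used in this project), a TF-IDF + Naive Bayes baseline of this kind can be assembled in a few lines of scikit-learn; the example texts and labels below are placeholders rather than data from our datasets:

    # Sketch of a TF-IDF + Naive Bayes baseline with scikit-learn.
    # `texts` and `labels` are illustrative placeholders.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    texts = ["the battery life is great",
             "screen cracked after a week",
             "best purchase i have made this year",
             "the charger stopped working",
             "camera quality is amazing",
             "support was slow and unhelpful"]
    labels = ["positive", "negative", "positive", "negative", "positive", "negative"]

    model = make_pipeline(TfidfVectorizer(), MultinomialNB())
    model.fit(texts, labels)

    # predict the sentiment of an unseen tweet
    print(model.predict(["the camera is great but the battery is terrible"]))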
Despite this method's simplicity, Naive Bayes is able to extract enough information from the vectorization to achieve adequate results.

Our neural network based models (BiRNN and BiLSTM) take a different approach to preprocessing. Before the data is handed to the model, a few techniques are applied. English stop word removal is the first: it removes words that provide little to no meaning in a sentence, such as "the", "it", "to", and "a". The next step is lemmatization, the process of reducing a word to its root, for example turning "going" into "go". These methods are used together to reduce the total dictionary size and to help achieve a more meaningful encoding. Once these steps are complete, the resulting words are translated into integers representing their index in the dictionary that has been built, and the neural networks take these integers as input. Because of the recurrent nature of these models, word order is preserved throughout training. As the model trains, it learns an encoding of the words in its embedding layer while trying to reduce the loss. This method has proven effective and encodes words meaningfully, in a way similar to an autoencoder. Given our results and research, no external word embeddings are necessary.

2.3 Model Selection

For our comparisons we selected Naive Bayes, a BiRNN, and a BiLSTM. While looking for a good baseline model, we considered a variety of options, including Naive Bayes and SVMs. We decided on Naive Bayes because many other projects have used it with great success [1]. In addition, a Naive Bayes model was the most accurate model for the IMDB dataset on Kaggle, which provides additional motivation for our choice. Briefly, Naive Bayes takes the probabilities of a classification given different features and labels and returns the most likely classification given the features. Essentially, it predicts a class based on the frequency of features for a given label, which provides a context-free way to analyze the features and make predictions.

The deep learning baseline model we have chosen is a bidirectional recurrent neural network (BiRNN). We opted for a bidirectional RNN rather than a single-direction RNN because a single-direction RNN can miss the context of a word, since it only reads the sequence in one direction. A BiRNN addresses this by also feeding the sequence of words in reverse order when generating the weights for each word, which provides more context for words that are spelled the same but have different meanings.

Finally, we chose the BiLSTM as our cutting-edge solution. Like the BiRNN, the BiLSTM reads the sequence of words in both directions to reduce the context problem. As an improvement over the RNN, the LSTM cell uses a series of gates to decide what information to keep and what to forget; ideally, these gates produce a network that is better at predicting the correct classes for our datasets. Since the BiLSTM is essentially an improved form of the BiRNN, using the BiRNN as the deep learning baseline provides a cleaner comparison with our chosen state-of-the-art model.
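Our model code is not reproduced in this report; the following is a rough sketch of what a bidirectional LSTM classifier of this kind might look like in PyTorch. The class and variable names are our own, and the default sizes merely echo the hyperparameters discussed later (the IMDB experiments would use two output classes instead of three):

    # Illustrative BiLSTM sentiment classifier; a sketch, not the exact architecture.
    import torch
    import torch.nn as nn

    class BiLSTMClassifier(nn.Module):
        def __init__(self, vocab_size, embed_dim=300, hidden_dim=256,
                     num_layers=2, num_classes=3, dropout=0.2):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
            self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=num_layers,
                                batch_first=True, bidirectional=True,
                                dropout=dropout if num_layers > 1 else 0.0)
            self.classifier = nn.Linear(2 * hidden_dim, num_classes)

        def forward(self, token_ids):
            # token_ids: (batch, seq_len) integer word indices from the dictionary
            embedded = self.embedding(token_ids)
            _, (hidden, _) = self.lstm(embedded)
            # concatenate the final forward and backward hidden states of the top layer
            last = torch.cat([hidden[-2], hidden[-1]], dim=1)
            return self.classifier(last)

    model = BiLSTMClassifier(vocab_size=20000)
    logits = model(torch.randint(1, 20000, (32, 40)))  # e.g. a batch of 32 tweets, 40 tokens each
    print(logits.shape)  # torch.Size([32, 3])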
3 Experiments

We ran a few experiments on our models in order to compare their performance on different data. While we ran into some problems because of a lack of GPU access, we were able to use Weights and Biases to handle our hyperparameter grid search and find the optimal parameters for each of our models. The setup of the grid search, as well as the other experiments we ran, is explained below.

3.1 Experimental Set-up

Because it is quick to train on, we experimented on the TSE dataset first. A 64/16/20% train/validation/test split was applied and given to each of our three chosen models: Naive Bayes, BiRNN, and BiLSTM. Since they are baseline models, we decided not to hyperparameter-tune the Naive Bayes and BiRNN, keeping their architectures constant; the BiRNN was instead run with the optimal hyperparameters found for the BiLSTM. The BiRNN and BiLSTM models were both trained with early stopping on the number of epochs, the categorical cross-entropy loss, and the Adam optimizer. A batch size of 32 was used for both models, with two hidden cells for the BiLSTM and only one for the BiRNN.
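A minimal sketch of this training setup (cross-entropy loss, Adam, early stopping on validation loss) is shown below; the data loaders, patience value, and default learning rate are illustrative assumptions rather than the project's exact configuration:

    # Sketch of a supervised training loop with Adam, cross-entropy and
    # patience-based early stopping on validation loss.
    import copy
    import torch
    import torch.nn as nn

    def train(model, train_loader, val_loader, max_epochs=20, patience=2, lr=1e-3):
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        criterion = nn.CrossEntropyLoss()
        best_loss, best_state, bad_epochs = float("inf"), None, 0
        for epoch in range(max_epochs):
            model.train()
            for tokens, labels in train_loader:
                optimizer.zero_grad()
                loss = criterion(model(tokens), labels)
                loss.backward()
                optimizer.step()
            # validation pass used for early stopping
            model.eval()
            val_loss, n = 0.0, 0
            with torch.no_grad():
                for tokens, labels in val_loader:
                    val_loss += criterion(model(tokens), labels).item() * len(labels)
                    n += len(labels)
            val_loss /= n
            if val_loss < best_loss:
                best_loss, best_state, bad_epochs = val_loss, copy.deepcopy(model.state_dict()), 0
            else:
                bad_epochs += 1
                if bad_epochs >= patience:
                    break
        model.load_state_dict(best_state)
        return model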
For our BiLSTM, we ran a full grid search over the hyperparameters described below using Weights and Biases. The four hyperparameters we decided to focus on are the learning rate, the embedding size, the hidden size, and the dropout rate within the LSTM cell. We chose these four because they are easy to change with the PyTorch API. The learning rate makes a big difference in how quickly the model converges to a local minimum and in how close that local minimum is to a global one, so trying different values can lead to drastic differences in performance. The embedding size determines the size of the embedding layer and the dimension in which a word is represented; increasing it allows more nuances between words to be captured, at the possible expense of over-fitting what a word means, which can cause context problems. The hidden size determines how much memory can be held within an LSTM cell, an important parameter of the model. Lastly, dropout within the LSTM cell can prevent potential overfitting and is common practice, although overfitting is unlikely to be a problem with smaller datasets. The table below shows the hyperparameter combinations we attempted.

Table 1: BiLSTM hyperparameter search space
  Epochs used:                 4
  Learning rate:               0.0001, 0.001, 0.01
  Embedding size:              300, 400, 500
  Hidden size:                 128, 256, 512
  Dropout within LSTM cell:    0.2, 0.4, 0.6

The second dataset we experimented on was the IMDB dataset. A 40/10/50% train/validation/test split was applied and given to each of our three chosen models: Naive Bayes, BiRNN, and BiLSTM. We chose these percentages because the common benchmarks in the literature use 50% of the data for testing. The BiRNN and BiLSTM models were again trained with early stopping, the categorical cross-entropy loss, and the Adam optimizer; a batch size of 32 and two hidden cells were used for both models. Using the hyperparameter combinations shown in Table 1, we ran a random search with Weights and Biases. We chose random search rather than grid search because the much longer training time made trying all combinations infeasible. As with the TSE experiments, hyperparameter tuning was only done for the BiLSTM, and the BiRNN was run with the optimal parameters found for the BiLSTM. Once the hyperparameter searches were done, we evaluated accuracy by training new models with the same hyperparameters.

As extra credit, we plan on applying pseudo-labeling to the BiLSTM with its optimal hyperparameters on the TSE dataset. The training setup is the same as our standard supervised setup, with an additional unsupervised component. We split the data into 64% labeled data, 20% unlabeled data, and 16% validation. In every epoch we train over the labeled dataset and then the unlabeled dataset, with the unsupervised loss being the cross entropy between the previous model's predictions and the current model's output, weighted by an alpha value of 1.0. We also use Captum's integrated gradients analysis to visualize our results.
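A rough sketch of one epoch of this pseudo-labeling scheme is shown below; the loader names are assumptions, and the fixed alpha mirrors the value described above:

    # Sketch: one epoch of pseudo-labeling. `labeled_loader` yields (tokens, labels);
    # `unlabeled_loader` is assumed to yield batches of token ids only.
    import copy
    import torch
    import torch.nn as nn

    def pseudo_label_epoch(model, labeled_loader, unlabeled_loader, optimizer, alpha=1.0):
        criterion = nn.CrossEntropyLoss()
        # freeze a copy of the previous model to generate pseudo labels
        previous = copy.deepcopy(model).eval()
        model.train()
        for tokens, labels in labeled_loader:            # supervised pass
            optimizer.zero_grad()
            criterion(model(tokens), labels).backward()
            optimizer.step()
        for tokens in unlabeled_loader:                  # unsupervised pass
            with torch.no_grad():
                pseudo = previous(tokens).argmax(dim=1)  # previous model's predictions
            optimizer.zero_grad()
            (alpha * criterion(model(tokens), pseudo)).backward()
            optimizer.step()
        return model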
3.2 Results

3.2.1 TSE Dataset Results

After running the grid search on our BiLSTM, the best hyperparameters were a learning rate of 0.001, an embedding size of 300, a hidden size of 256, and a dropout of 0 within the LSTM cell. The two figures below come from Weights and Biases and visualize the different hyperparameter combinations and their corresponding results.
Figure 1: Hyperparameter Combinations

Figure 2: Validation Loss of 10 Best Combinations

It is hard to see noticeable patterns in Figure 1 in terms of which combinations of hyperparameters work best. However, we did notice that combinations using a high learning rate and a high hidden size performed poorly, which was confirmed with a second run of the grid search. It is worth noting that in the second run (the figures above are for the first run) the optimal hyperparameters were slightly different; comparing the two runs, the first run appeared to produce a slightly better optimal model in terms of fitting the validation data, and the results were relatively consistent across the two grid search runs (shown in the appendix).

Based on our readings of relevant sentiment analysis research papers, the four metrics we used to evaluate the performance of our optimal model on the TSE dataset were accuracy, test loss, average precision score, and macro AUC score [1, 2]. A description of why we chose these metrics can be found in the appendix.
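For reference, these metrics can be computed with scikit-learn roughly as follows; the toy labels and probabilities here are placeholders, not our results:

    # Sketch: accuracy, macro average precision and macro AUROC for a 3-class
    # problem, computed from predicted class probabilities (toy values).
    import numpy as np
    from sklearn.metrics import accuracy_score, average_precision_score, roc_auc_score
    from sklearn.preprocessing import label_binarize

    y_true = np.array([0, 2, 1, 2, 0, 1])        # gold labels
    probs = np.array([[0.7, 0.2, 0.1],
                      [0.1, 0.3, 0.6],
                      [0.2, 0.5, 0.3],
                      [0.1, 0.2, 0.7],
                      [0.6, 0.3, 0.1],
                      [0.3, 0.4, 0.3]])          # model output probabilities

    y_pred = probs.argmax(axis=1)
    y_onehot = label_binarize(y_true, classes=[0, 1, 2])

    print("accuracy:", accuracy_score(y_true, y_pred))
    print("avg precision (macro):", average_precision_score(y_onehot, probs, average="macro"))
    print("AUROC (macro):", roc_auc_score(y_true, probs, multi_class="ovr", average="macro"))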
Table 2: Model comparisons on the TSE dataset

  Model         Average Test Loss   Test Accuracy   Average Precision Score (Macro)   AUROC Score
  Naive Bayes   -                   0.607           -                                 -
  BiRNN         0.915               0.629           0.641                             0.778
  BiLSTM        0.879               0.666           0.6907                            0.805

Figure 3: Statistical Significance with n=10 for BiLSTM on Testing Data

Our BiLSTM model performed better than the BiRNN model and the Naive Bayes baseline, which is consistent with the literature [1, 2]. Furthermore, the BiLSTM results were highly consistent when re-trained and re-tested on the TSE dataset, as seen in the boxplot in Figure 3 across 10 runs. The BiRNN slightly outperformed Naive Bayes, although their results are very close. As for why the BiLSTM outperforms the BiRNN, we believe the BiRNN suffers slightly from the vanishing gradient problem; since tweets are short, however, the effect is not as pronounced as on the IMDB dataset. It is possible that hyperparameter tuning would bring the BiRNN closer to the BiLSTM in performance.

3.2.2 IMDB Dataset Results

Our results on the IMDB dataset yield 89% and 87% accuracy for the BiLSTM and Naive Bayes respectively.

Table 3: Model comparisons on the IMDB dataset

  Model         Average Test Loss   Test Accuracy
  Naive Bayes   -                   0.866
  BiRNN         0.719               0.531
  BiLSTM        0.421               0.890

These results met our expectations, and we can compare them to a few benchmarks. The first is a pre-trained BERT model: when passing the IMDB data to the bert-base-multilingual-uncased-sentiment model, it labeled the reviews with an accuracy of 61%, which shows that domain specificity matters for this problem and that a general pre-trained model might not be the best choice. Looking through benchmarks on paperswithcode.com, we found the highest reported accuracy to be in the low 90s, with models more similar to ours performing in the mid 80s. This is a good indication that our LSTM model performed as expected with its relatively high accuracy of 89%. We believe those few extra accuracy points can be attributed to our thorough hyperparameter tuning, which, as with the TSE dataset, was done with Weights and Biases. The optimal hyperparameters were a learning rate of 0.001, an embedding size of 400, a hidden size of 512, and a dropout of 0. The learning rate had the biggest effect on the model's performance, with the embedding size also having a significant influence; the dropout and hidden size contributed less but still shifted the model's performance.

We should note that the BiRNN's performance dropped dramatically when switching from the TSE dataset to IMDB. We suspect this can be attributed to the vanishing gradient problem.
Since the sentence size gets much larger in the IMDB dataset, the RNN could be having trouble converging; this problem is better addressed by the LSTM, making it the better option in this scenario.

3.2.3 Hand-Labeled Results

We tested our models on the Hand-Labeled dataset to assess how well they generalize to our goal domain, Twitter product reviews. The BiLSTM model trained on the IMDB data had an accuracy of 0.656 when predicting labels on the hand-labeled data, while the BiLSTM trained on the TSE data had an accuracy of 0.40. The higher accuracy of the IMDB-trained model likely stems from a couple of sources. First, the IMDB dataset only has two output classes, positive and negative, whereas the TSE dataset has three, including neutral. Since the Hand-Labeled dataset is made up of only positive and negative tweets, this appears to have confused the TSE-trained model, which predicted neutral much of the time. Additionally, the IMDB dataset is much larger and contains a richer vocabulary, so the IMDB-trained model appeared to recognize more of the words and their relationships in the Hand-Labeled dataset than the TSE-trained model did. We elaborate further on this point in the analysis section.

3.2.4 (Extra Credit) Pseudo-labeling Results

For our pseudo-labeling we achieved a validation accuracy of 0.6339 in 8 epochs on the TSE dataset, which is lower than our baseline model. Perhaps adjusting the ratio of iterations between the unsupervised and supervised datasets would help, or ramping up the alpha value only toward the end of training; perhaps pseudo-labeling simply does not work well with this dataset. Additional experiments would be needed to draw a valid conclusion.

4 Analysis

Given how poorly both the TSE-trained and IMDB-trained models performed on the Hand-Labeled dataset, our target domain, we were curious what went wrong. Some of the more obvious reasons are the mismatch between the datasets, leading to a lacking vocabulary, and our models' inability to handle target dependencies. As mentioned earlier, we were not able to find a labeled dataset built for analyzing product reviews, which makes the first issue much harder to deal with. To make our model handle target dependencies, we would need a more complicated architecture such as those described in [12, 13, 14, 15].

To gain a deeper understanding of why the models struggle, we used an attribution technique known as integrated gradients, via a third-party library called Captum. Integrated gradients is an axiomatic attribution method that uses vector calculus, specifically the line integral, to determine the impact each input feature has on a prediction [16]; more information about this method can be found in the cited paper. Regarding Captum's visualization tool, the legend colors are somewhat unintuitive: after investigating Captum's source code, we found that "positive" means positive (green) attribution toward the model's predicted label, whereas "negative" means negative (red) attribution toward the predicted label. In a sense, green text is what pushed the model toward its decision, and red text is what pushed against it. Using Captum's integrated gradients visualization, we were able to see how our model interprets the sarcastic tweets in our hand-labeled dataset.
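To illustrate how such token attributions are obtained (a sketch, not our exact code), Captum's LayerIntegratedGradients can be applied to a model's embedding layer. The tiny stand-in classifier and inputs below are placeholders; in practice the BiLSTM described earlier takes their place:

    # Sketch: integrated-gradients attributions over an embedding layer with Captum.
    import torch
    import torch.nn as nn
    from captum.attr import LayerIntegratedGradients

    class TinyClassifier(nn.Module):
        # stand-in for the BiLSTM; any embedding-based classifier works the same way
        def __init__(self, vocab_size=1000, embed_dim=16, num_classes=3):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
            self.fc = nn.Linear(embed_dim, num_classes)
        def forward(self, token_ids):
            return self.fc(self.embedding(token_ids).mean(dim=1))

    model = TinyClassifier().eval()
    token_ids = torch.randint(1, 1000, (1, 12))   # one encoded tweet (placeholder)
    baseline = torch.zeros_like(token_ids)        # all-padding reference input
    pred_class = model(token_ids).argmax(dim=1).item()

    lig = LayerIntegratedGradients(model, model.embedding)
    attributions, delta = lig.attribute(token_ids, baselines=baseline,
                                        target=pred_class,
                                        return_convergence_delta=True)
    scores = attributions.sum(dim=-1).squeeze(0)  # one score per token
    print(scores)  # positive values support the predicted label, negative values oppose it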
Figure 4: Captum visualization on sarcastic hand-picked tweets

In this case, we hypothesize that our model is picking up on sarcasm, but in the wrong way. In both examples it seems to interpret both the positive and negative sections sarcastically, hence "great" and "amazing" being read as negative. To interpret sarcasm correctly, usually only the beginning or the end of the sentence should be inverted. Notably, our model is bidirectional, so it is plausible that the classifier layer is not complex enough to construct the relations and stronger signals needed to decide which direction of the sarcasm should be inverted, and instead the model inverts both directions.
Because, in a sense, we are not training on the hand-picked set, it is useful to see how sarcastic examples are interpreted in the training set, so we hand-picked some sarcastic examples from our TSE training data for Captum visualization.

Figure 5: Captum visualization on sarcastic TSE tweets

Notably, many of the examples we picked for sarcasm became unintelligible once lemmatization was applied, which likely makes sarcasm harder to interpret. There is, however, a somewhat similar example: "Screw review, I think Wolverine awesome. But enough Dominic Monaghan liking," originally "Screw the reviews, I thought Wolverine was awesome. But not enough Dominic Monaghan for my liking." Once our model saw "Screw review, I think," it mistakenly interpreted "awesome" negatively, which implies it is handling sarcasm incorrectly. The model has a negative attribution score here, meaning it did not want to predict positive, but the other options were worse, so it was forced to. We also note that during our preprocessing "but not enough" became "but enough," which have drastically different meanings.

Another example worth noting is "I wan na single rest life," originally "I don't wanna be single for the rest of my life." Here our preprocessing removed key parts of the sentence, including the negation. Otherwise, note that the model interprets "life" as highly negative (positive attribution for a prediction of 0), which potentially shows it capturing temporal relations, while "na single" was interpreted as positive (negative attribution for a prediction of 0). This seems to be a general trend: the model is not interpreting sentences the way a human would, but is finding keywords (that do not always make sense) whose temporal relations swing its decision.

5 Challenges

There were a few challenges we faced while training our models. Running the LSTM many times for hyperparameter tuning was so computationally expensive that several of our accounts were kicked off of Google Colab. We solved this by subscribing to Google Colab Pro. In the future we will be more careful with how we use our resources and try to avoid leaving code running overnight. Computational resources are a common issue when training larger models, so we expected problems like this to arise.
6 Conclusion and Future Work

We have performed experiments with bidirectional LSTMs on Twitter and review data and have shown that, as expected, bidirectional LSTMs perform better than our RNN and Naive Bayes baselines. We briefly experimented with pseudo-labeling but obtained suboptimal results on our Twitter (TSE) dataset. The carry-over from training on the IMDB and TSE datasets proved suboptimal for Twitter product reviews. To improve our results in this domain, we would need a more complex training setup, i.e. better datasets, a different semi-supervised learning setup, and models capable of handling target dependence.
References

[1] S. Tripathi, R. Mehrotra, V. Bansal, and S. Upadhyay, "Analyzing sentiment using IMDb dataset," in 2020 12th International Conference on Computational Intelligence and Communication Networks (CICN), 2020, pp. 30-33.
[2] C. Zhang and L. Liu, "Research on semantic sentiment analysis based on BiLSTM," in 2021 4th International Conference on Artificial Intelligence and Big Data (ICAIBD), 2021, pp. 377-381.
[3] B. Chen, J. Jiang, X. Wang, J. Wang, and M. Long, "Debiased pseudo labeling in self-training," 2022.
[4] Q. Xie, Z. Dai, E. Hovy, M.-T. Luong, and Q. V. Le, "Unsupervised data augmentation for consistency training," 2019.
[5] C. Wei, K. Shen, Y. Chen, and T. Ma, "Theoretical analysis of self-training with deep networks on unlabeled data," in International Conference on Learning Representations, 2021. [Online]. Available: https://openreview.net/forum?id=rC8sJ4i6kaH
[6] K. Sohn, D. Berthelot, C.-L. Li, Z. Zhang, N. Carlini, E. D. Cubuk, A. Kurakin, H. Zhang, and C. Raffel, "FixMatch: Simplifying semi-supervised learning with consistency and confidence," arXiv preprint arXiv:2001.07685, 2020.
[7] D. Zhou, O. Bousquet, T. Lal, J. Weston, and B. Schölkopf, "Learning with local and global consistency," in Advances in Neural Information Processing Systems, S. Thrun, L. Saul, and B. Schölkopf, Eds., vol. 16. MIT Press, 2003. [Online]. Available: https://proceedings.neurips.cc/paper/2003/file/87682805257e619d49b8e0dfdc14affa-Paper.pdf
[8] L. N., "IMDB dataset of 50K movie reviews," 2019. [Online]. Available: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
[9] "Tweet sentiment extraction." [Online]. Available: https://www.kaggle.com/competitions/tweet-sentiment-extraction/data
[10] B. Densil, "Twitter product sentiment analysis," 2020. [Online]. Available: https://www.kaggle.com/datasets/blessondensil294/twitter-product-sentiment-analysis
[11] P. Singh, "Fundamentals of bag of words and TF-IDF," Feb 2020. [Online]. Available: https://medium.com/analytics-vidhya/fundamentals-of-bag-of-words-and-tf-idf-9846d301ff22
[12] C. Sun, L. Huang, and X. Qiu, "Utilizing BERT for aspect-based sentiment analysis via constructing auxiliary sentence," Mar 2019. [Online]. Available: https://arxiv.org/abs/1903.09588v1
[13] Y. Jo and A. H. Oh, "Aspect and sentiment unification model for online review analysis," in Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, Feb 2011. [Online]. Available: https://dl.acm.org/doi/10.1145/1935826.1935932
[14] D. Tang, B. Qin, X. Feng, and T. Liu, "Effective LSTMs for target-dependent sentiment classification," 2015. [Online]. Available: https://arxiv.org/abs/1512.01100
[15] Z. Gao, A. Feng, X. Song, and X. Wu, "Target-dependent sentiment classification with BERT," IEEE Access, vol. 7, pp. 154290-154299, 2019.
[16] M. Sundararajan, A. Taly, and Q. Yan, "Axiomatic attribution for deep networks," 2017. [Online]. Available: https://arxiv.org/abs/1703.01365
[17] G. Kour and R. Saabne, "Real-time segmentation of on-line handwritten Arabic script," in Frontiers in Handwriting Recognition (ICFHR), 2014 14th International Conference on. IEEE, 2014, pp. 417-422.
[18] G. Kour and R. Saabne, "Fast classification of handwritten on-line Arabic characters," in Soft Computing and Pattern Recognition (SoCPaR), 2014 6th International Conference of. IEEE, 2014, pp. 312-318.
[19] G. Hadash, E. Kermany, B. Carmeli, O. Lavi, G. Kour, and A. Jacovi, "Estimate and replace: A novel approach to integrating deep neural networks with existing applications," arXiv preprint arXiv:1804.09028, 2018.
[20] B. Pang, L. Lee, and S. Vaithyanathan, "Thumbs up? Sentiment classification using machine learning techniques," 2002.
[21] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, http://www.deeplearningbook.org.
[22] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825-2830, 2011.
[23] S. Liao, J. Wang, R. Yu, K. Sato, and Z. Cheng, "CNN for situations understanding based on sentiment analysis of Twitter data," Procedia Computer Science, vol. 111, pp. 376-381, 2017, the 8th International Conference on Advances in Information Technology. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S1877050917312103
[24] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," 2019.
[25] G. Sidorov, F. Velasquez, E. Stamatatos, A. Gelbukh, and L. Chanona-Hernández, "Syntactic n-grams as machine learning features for natural language processing," Expert Systems with Applications, vol. 41, no. 3, pp. 853-860, Feb 2014. [Online]. Available: https://linkinghub.elsevier.com/retrieve/pii/S0957417413006271
[26] C. Manning, P. Raghavan, and H. Schütze, "Stemming and lemmatization." [Online]. Available: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
[27] C. Khanna, "Text pre-processing: Stop words removal using different libraries," Feb 2021. [Online]. Available: https://towardsdatascience.com/text-pre-processing-stop-words-removal-using-different-libraries-f20bac19929a
[28] M. Monette, "Anatomy of a tweet | Twitter for business," Jul 2014. [Online]. Available: https://openmoves.com/blog/anatomy-of-a-tweet/
[29] R. Liu, Y. Shi, C. Ji, and M. Jia, "A survey of sentiment analysis based on transfer learning," IEEE Access, vol. 7, pp. 85401-85412, 2019.
[30] R. Liu, Y. Shi, C. Ji, and M. Jia, "A survey of sentiment analysis based on transfer learning," IEEE Access, vol. PP, pp. 1-1, 06 2019.
[31] "SemEval-2014 Task 4." [Online]. Available: https://alt.qcri.org/semeval2014/task4/
[32] KazAnova, "Sentiment140 dataset," Sep 2017. [Online]. Available: https://www.kaggle.com/datasets/kazanova/sentiment140
[33] Passionate-Nlp, "Twitter sentiment analysis," Aug 2021. [Online]. Available: https://www.kaggle.com/datasets/jp797498e/twitter-entity-sentiment-analysis

A Data

The first component of a tweet is the profile photo. According to Twitter, a profile photo should be 400x400 pixels, less than 2 MB, and either a JPG, PNG, or GIF. This photo can give insight into whether the account belongs to an individual user or an organization. The photo can be analyzed with facial recognition to determine whether a person is present, and we can also try to detect logos and text within it; combining these two attributes can be useful for classifying a Twitter account as belonging to an individual or an organization. For this project, tweets from individuals are more valuable, as they are less likely to be biased toward certain products.

The next component of a tweet is the account name and the @username. The account name is the portion of the username that can be changed, while the @username cannot be changed after an account is made. This information may seem useless, but some analysis can be performed on it. These usernames can be compared against common name banks to determine whether they contain a person's name; accounts with a common name in their username more often belong to an individual. The name can also be used to predict a user's gender, age, and ethnicity.

Every tweet has a timestamp of when it was posted. This information can be used to select tweets from a certain time period or to look for trends over time.
If we are looking for reviews of a product that was released only a month ago, it is a good idea to remove tweets older than one month from our search. For a product that has been out much longer, we might be interested in how people's perception has shifted since its release; we can use the timestamp to analyze this shift and detect whether sentiment is becoming more positive or negative over time. This might help us catch products that were over-promised before release, since their sentiment will be very high before the release date and drop soon after.

The body of the tweet can contain both text and images. The text of a tweet should not be blindly treated as an ordinary block of text, since tweets often contain hashtags, @mentions, links, emojis, and emoticons. A hashtag links the tweet to a certain topic, which can be useful for determining the subject of a tweet or connecting it to others with similar subjects. An @mention shows that the tweet has a relationship with the tagged account; people often include an @mention of the company or individuals responsible for the product they are tweeting about. Links attach the tweet to a website; these links could potentially be analyzed, but otherwise they should be removed before modeling [28].
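As a small illustration of this kind of clean-up (our own sketch; the regular expressions and example tweet are illustrative), URLs and @mentions can be stripped and hashtags reduced to their text:

    # Sketch: strip URLs and @mentions, keep hashtag text.
    import re

    def clean_tweet(text):
        text = re.sub(r"https?://\S+", " ", text)   # remove links
        text = re.sub(r"@\w+", " ", text)           # remove @mentions
        text = re.sub(r"#(\w+)", r"\1", text)       # keep hashtag text, drop '#'
        return re.sub(r"\s+", " ", text).strip()

    print(clean_tweet("Loving the new #iPhone camera! @Apple https://t.co/example"))
    # -> "Loving the new iPhone camera!"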
We have looked into how other Twitter researchers handle emojis and emoticons when performing sentiment analysis and have not found much research. We plan to incorporate emojis and emoticons into our model, as they carry a lot of information about the sentiment of a tweet.

A.1 Experimental Set-Up

A.2 Bag of Words

This method is the simplest way to vectorize a body of text into an array of frequencies: it simply counts the number of times each word occurs in the text [11]. An example sentence, "I walked my dog to my neighborhood's dog park," would be encoded:

  I: 1, walked: 1, my: 2, dog: 2, to: 1, neighborhood's: 1, park: 1

A.3 TF-IDF

Term Frequency - Inverse Document Frequency (TF-IDF) is an extension of the Bag of Words technique that normalizes a word's frequency against the number of words in the document and the number of other documents in the corpus that contain the word [11]. In the case of tweets, the TF-IDF of a given word w would be calculated as

  TF-IDF(w) = (times w appears in the tweet / number of words in the tweet) · log(number of documents in the corpus / number of documents in the corpus that contain w)

where the corpus is a collection of random tweets.

A.4 N-Gram

One issue with Bag of Words and TF-IDF vectorization is that they ignore word order and only count how often a word occurs. N-grams attempt to fix this: an n-gram simply combines n consecutive words into one unit [25]. For example, the sentence "I am not happy with this product" could be divided into the 2-grams "I am", "am not", "not happy", "happy with", "with this", and "this product." Combining words into 2-grams lets us analyze the phrase "not happy", which is clearly negative; this negation of "happy" would not have been caught without 2-grams. The same method extends to 3-grams, 4-grams, and so on.

A.5 Lemmatization

To simplify the feature space further we can use lemmatization, the process of condensing a word down to its root; the root form of a word is known as its lemma. This often involves turning plural words singular or removing suffixes with no meaning [26]. An example would be simplifying "running" to "run."

A.6 English Stop Word Removal

A simple technique to reduce the feature space is to remove words that carry little to no meaning, such as "the", "an", "it", and "to." This process is known as English stop word removal and is often used in search engines to reduce user queries to their core meaning [27].
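A minimal sketch tying these preprocessing steps together, using NLTK and scikit-learn (the sample sentences and the n-gram range are arbitrary):

    # Sketch: stop-word removal + lemmatization (NLTK), then bag-of-words /
    # n-gram counts (scikit-learn). The sample texts are placeholders.
    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from sklearn.feature_extraction.text import CountVectorizer

    nltk.download("stopwords")
    nltk.download("wordnet")

    lemmatizer = WordNetLemmatizer()
    stop_words = set(stopwords.words("english"))

    def preprocess(text):
        tokens = [t for t in text.lower().split() if t.isalpha() and t not in stop_words]
        # note: in practice a part-of-speech tag is passed, so verbs such as
        # "walked" reduce to "walk"; the default here lemmatizes nouns only
        return " ".join(lemmatizer.lemmatize(t) for t in tokens)

    docs = [preprocess("I walked my dog to my neighborhood dog park"),
            preprocess("the dog park was not crowded today")]

    vectorizer = CountVectorizer(ngram_range=(1, 2))   # unigrams and 2-grams
    counts = vectorizer.fit_transform(docs)
    print(vectorizer.get_feature_names_out())
    print(counts.toarray())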
A.7 Word Embedding

Some other commonly used vectorization packages include Word2vec, GloVe, and fastText.

B Model Approaches

B.1 Rule-Based

Rule-based learning revolves around a sentiment lexicon, which is essentially a vocabulary or dictionary of words where each word has a rating based on averaged scores given by language experts. The ratings can be discrete (polarity-based lexicons) or continuous (valence-based lexicons). Using these rules, sentences can be analyzed and given discrete or continuous scores; the most common output is whether a sentence is positive, negative, or neutral in tone. VADER, created in 2014, is a popular rule-based model that works well on social media content. It uses a valence-based lexicon and computes how positive or negative a sentence is on a scale, determined by adding the scores of individual words, applying modifications based on the structure of the sentence, and normalizing the result to lie between -1 and 1.

B.2 Machine Learning

B.2.1 Naive Bayes

Given a class variable y and feature vector x_1, x_2, ..., x_n, traditional (binary) classification with Naive Bayes works by calculating the following [22]:

  P(y | x_1, ..., x_n) = P(y) P(x_1, x_2, ..., x_n | y) / P(x_1, x_2, ..., x_n)   (1)

Using the naive conditional independence assumption,

  P(x_i | y, x_1, ..., x_{i-1}, x_{i+1}, ..., x_n) = P(x_i | y),   (2)

Naive Bayes then assigns the label with the highest posterior probability.

B.2.2 Random Forest

Random forest classifiers use an ensemble of uncorrelated decision trees to protect the overall decision from the errors that some individual trees may make.

B.2.3 SVM

The SVM is a classification model that can analyze features in a higher-dimensional space using the kernel trick before finding the optimal decision boundary, or hyperplane, for those features in that space.

B.3 Deep Learning

B.3.1 CNN

A CNN is a type of neural network that makes use of a convolutional layer, which applies convolution filters to the data and enables the detection of important features. The CNN is best known for its use in computer vision but can also be applied to NLP problems.
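For illustration only (we did not use a CNN in this project), a minimal 1-D convolutional text classifier might look like this in PyTorch; all names and sizes are assumptions:

    # Sketch: a tiny 1-D CNN over word embeddings for sentence classification.
    import torch
    import torch.nn as nn

    class TextCNN(nn.Module):
        def __init__(self, vocab_size, embed_dim=100, num_filters=64,
                     kernel_size=3, num_classes=3):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
            self.conv = nn.Conv1d(embed_dim, num_filters, kernel_size)
            self.classifier = nn.Linear(num_filters, num_classes)

        def forward(self, token_ids):
            x = self.embedding(token_ids).transpose(1, 2)  # (batch, embed_dim, seq_len)
            x = torch.relu(self.conv(x))                   # detect local n-gram features
            x = x.max(dim=2).values                        # max-pool over time
            return self.classifier(x)

    model = TextCNN(vocab_size=20000)
    print(model(torch.randint(1, 20000, (8, 30))).shape)   # torch.Size([8, 3])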
B.3.2 RNN

The ordering of words has a huge impact on a tweet's meaning, so a model that can process a tweet sequentially makes sense. The RNN is a type of deep learning model that consumes its input sequentially and can therefore take into account word dependencies and relationships. Since its initial conception, the RNN has evolved substantially; several of the more important evolutions are the bidirectional RNN, the long short-term memory (LSTM), the gated recurrent unit (GRU), and the transformer architecture.

The bidirectional RNN allows the model to learn word dependencies and relationships both before and after a given word in a sentence. This is an extremely useful addition to the original RNN, as the context of a given word might require knowing the words in front of and behind it. Both the LSTM and GRU architectures help reduce the vanishing gradient problem, allowing long-term word dependencies to be learned more easily; because tweets are short, however, this may be less of a problem than in NLP tasks with longer inputs. Lastly, the transformer architecture speeds up training by allowing parallelization, a flaw in the earlier designs.

C Hyperparameter Tuning

Above we show results of hyperparameter tuning with Weights and Biases (WandB) on our model using the competition (TSE) dataset. The sweep on the left was our first WandB run, while the one on the right is our second run, in case the true optimal parameters had produced a faulty result. Since both sweeps reported similar results, we believe the optimal parameters were already found during the earlier sweep. The sweep above was for our model on the IMDB dataset.

D Team Leads

1. Research Lead - Andrew Chan
2. Data Scraping Lead - Devin Sohi
3. Model Lead - Nick Frankenberg
4. Data Preprocessing Lead - Justin Cheng
E Task Assignment / Schedule

F Code Availability

Code is available here: https://drive.google.com/drive/folders/1Qm1i_JuAcGrD2OvgUAq9vF_AQG5t1zzV?usp=sharing