Sofia Dutta
Data 602 – Spring 2019
Semester Project
Comparing Word2vec, Doc2Vec
model driven Sentiment Analysis
using SVM, LR vs Keras CNN and
Bidirectional LSTM
with and without pre-trained Word
and Document Embeddings
INTRODUCTION
SENTIMENT ANALYSIS
WHAT IS A WORD VECTOR?
• Assume a vocabulary has five words:
King, Queen, Man, Woman, Child.
• A one-hot vector doesn’t allow
meaningful comparisons between words,
i.e. no semantics are available
• The solution is to use Word2Vec,
which uses a distributed representation
of words in a vector space.
One-hot encoded vector for ‘Queen’
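As a toy illustration (the five-word vocabulary comes from the slide; the encoding is just a sketch), one-hot vectors make every pair of distinct words orthogonal, so no similarity can be read off them:

```python
import numpy as np

vocab = ["King", "Queen", "Man", "Woman", "Child"]
# One-hot encode each word: a 5-dim vector with a single 1
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

print(one_hot["Queen"])  # [0. 1. 0. 0. 0.]
# Dot product between any two distinct words is 0 -> no notion of similarity
print(one_hot["King"] @ one_hot["Queen"])  # 0.0
```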
WHAT IS A WORD VECTOR?
• Word2Vec’s distributed representations
can help encode various aspects of a
word
• Aspects are represented by elements of
the vector and can help define a word
• Aspects can represent things like royalty,
gender or age in our “little language”
WHAT IS A WORD VECTOR?
• It has been observed that we can
perform simple algebraic operations on
the word vectors.
• We can remove the masculinity aspect
from King by performing the vector
operation: vector(“King”) – vector(“Man”)
• Following that we can add
vector(“Woman”) to the result from
above and obtain a vector that will be
closest to the vector representation of
the word Queen.
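The King − Man + Woman analogy can be sketched with invented toy vectors (the three dimensions here, loosely royalty/masculinity/age, are made up for illustration; real Word2Vec dimensions are not interpretable like this):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy distributed representations (invented): [royalty, masculinity, age]
vectors = {
    "King":  np.array([0.95, 0.90, 0.70]),
    "Queen": np.array([0.95, 0.05, 0.70]),
    "Man":   np.array([0.10, 0.90, 0.60]),
    "Woman": np.array([0.10, 0.05, 0.60]),
    "Child": np.array([0.05, 0.50, 0.05]),
}

# Remove the masculinity aspect from King, add Woman's aspects back
result = vectors["King"] - vectors["Man"] + vectors["Woman"]

# The nearest vocabulary word to the resulting vector
best = max(vectors, key=lambda w: cosine(result, vectors[w]))
print(best)  # Queen
```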
SYSTEM ARCHITECTURE
• Inputs: Movie review dataset (50K reviews) and Amazon laptop review dataset (40K+ reviews)
• Cleanup and pre-processing
• Path 1: Word and Document embeddings → Logistic regression, XGBoost, SVM, etc.
• Path 2: Keras word embeddings → Keras CNN, Keras Bidirectional LSTM
DATA OVERVIEW

| Data source | Columns | Reviews: Total | Positive | Negative | Max review: Positive | Negative | Comments |
|---|---|---|---|---|---|---|---|
| Amazon laptop reviews | Review, Rating | 40K+ | ~30K | ~10K | 20K | 3K | Distribution mass should be covered by 900 to 1000 characters |
| IMDb movie reviews | Review, Rating | 50K | 25K | 25K | 14K | 5K | Distribution mass should be covered by 1400 to 1500 characters |
DATA EXPLORATION: REVIEW LENGTH, LAPTOP REVIEWS
Laptop dataset review
lengths
DATA EXPLORATION: REVIEW LENGTH, MOVIE REVIEWS
Movie dataset review
lengths
DATA EXPLORATION: REVIEW LENGTH
Movie dataset review
lengths
Laptop dataset review
lengths
DATA EXPLORATION: REVIEW COUNT
Laptop dataset review count
Movie dataset review count
TEXT PRE-PROCESSING
• Input: review string
• Remove: HTML tags, URLs
• Convert: to lowercase
• Split: into words
• Remove: punctuation, empty strings, stop-words
• Return: concatenated tokens as a sentence
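The steps above can be sketched as a single cleaning function. This is a minimal stand-in, not the project's actual code: the regexes are simple approximations and the tiny stop-word set here stands in for a full list such as nltk's:

```python
import re
import string

# Tiny illustrative stop-word list (a real pipeline would use e.g. nltk's full set)
STOP_WORDS = {"a", "an", "the", "is", "was", "it", "this", "and", "of", "to"}

def preprocess(review: str) -> str:
    """Clean a raw review string following the steps listed above."""
    text = re.sub(r"<[^>]+>", " ", review)                 # remove HTML tags
    text = re.sub(r"https?://\S+", " ", text)              # remove URLs
    text = text.lower()                                    # convert to lowercase
    words = text.split()                                   # split into words
    words = [w.strip(string.punctuation) for w in words]   # strip punctuation
    words = [w for w in words if w and w not in STOP_WORDS]  # drop empties, stop-words
    return " ".join(words)                                 # concatenated tokens

print(preprocess("This laptop is <b>great</b>! See https://example.com"))
# -> "laptop great see"
```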
EXPLORING DATASET
Looking at high frequency words in each dataset using
nltk.FreqDist()
WORD2VEC: CONTINUOUS BAG-OF-WORDS
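The CBOW idea — predict a center word from the average of its context-word embeddings — can be shown with a minimal numpy forward pass (sizes and random weights invented for illustration; real training would backpropagate through these matrices):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["king", "queen", "man", "woman", "child"]
V, D = len(vocab), 3             # vocabulary size, embedding dimension
W_in = rng.normal(size=(V, D))   # input (context) embedding matrix
W_out = rng.normal(size=(D, V))  # output (prediction) weight matrix

def cbow_forward(context_ids):
    """Average the context word embeddings, then score every vocabulary word."""
    h = W_in[context_ids].mean(axis=0)             # hidden layer = mean of contexts
    scores = h @ W_out                             # one score per vocabulary word
    probs = np.exp(scores) / np.exp(scores).sum()  # softmax over the vocabulary
    return probs

# Probability distribution over the center word, given context ["king", "woman"]
probs = cbow_forward([vocab.index("king"), vocab.index("woman")])
print(probs.shape)  # (5,) -- sums to 1
```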
WORD2VEC CBOW MODEL EVALUATION
POSITIVE AND NEGATIVE REVIEW WORD CLOUDS
• Amazon dataset • IMDb movie review dataset
The positive-review cloud has a white background; the negative-review cloud has a black background
SENTIMENT ANALYSIS
• Word2Vec Word Embedding Based Sentiment Analysis using
LogisticRegression
• Word2Vec Word Embedding Based Sentiment Analysis using SVC
• Word2Vec Word Embedding Based Sentiment Analysis using
XGBClassifier
KERAS CONVOLUTIONAL NEURAL NETWORK
• Sentiment Analysis using Keras
Convolutional Neural
Networks (CNN)
• Sentiment Analysis using Pre-
trained Word2Vec Word
Embeddings with Keras CNN
BIDIRECTIONAL LONG SHORT-TERM MEMORY
• Sentiment Analysis using Pre-
trained Word2Vec Word
Embeddings with Keras CNN and
Bidirectional LSTM
• Sentiment Analysis using Pre-
trained Word2Vec Word
Embeddings with Keras
Bidirectional LSTM
DOC2VEC: DISTRIBUTED BAG-OF-WORDS
• Doc2Vec DBOW Based Sentiment
Analysis using LogisticRegression
• Doc2Vec DBOW Based Sentiment
Analysis using SVC
• Doc2Vec DBOW Based Sentiment
Analysis using XGBClassifier
DOC2VEC: DISTRIBUTED MEMORY
• DM (Concatenated)
• Doc2Vec DMC Based Sentiment Analysis using
LogisticRegression
• Doc2Vec DMC Based Sentiment Analysis using
SVC
• Doc2Vec DMC Based Sentiment Analysis using
XGBClassifier
• DM (Mean)
• Doc2Vec DMM Based Sentiment Analysis using
LogisticRegression
• Doc2Vec DMM Based Sentiment Analysis using
SVC
• Doc2Vec DMM Based Sentiment Analysis using
XGBClassifier
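The difference between the two Distributed Memory variants is how the document vector is combined with the context word vectors: DMC concatenates them, DMM averages them. A toy numpy sketch (vector values invented):

```python
import numpy as np

doc_vec = np.array([0.1, 0.2, 0.3])   # paragraph (document) vector
ctx = np.array([[0.4, 0.5, 0.6],      # context word vectors
                [0.7, 0.8, 0.9]])

# DMC: concatenate the document vector with every context word vector
h_dmc = np.concatenate([doc_vec] + list(ctx))   # length = (1 + n_ctx) * dim

# DMM: average the document vector together with the context word vectors
h_dmm = np.vstack([doc_vec, ctx]).mean(axis=0)  # length = dim

print(h_dmc.shape, h_dmm.shape)  # (9,) (3,)
```

The DMC hidden layer grows with the window size (and is sensitive to word order), while the DMM hidden layer stays at the embedding dimension.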
DOC2VEC: DBOW + DMC AND DBOW + DMM
• DBOW + DMC
• Doc2Vec DBOW + DMC Based Sentiment Analysis using LogisticRegression
• Doc2Vec DBOW + DMC Based Sentiment Analysis using SVC
• Doc2Vec DBOW + DMC Based Sentiment Analysis using XGBClassifier
• Doc2Vec DBOW + DMC Based Sentiment Analysis using Keras Neural Network
• DBOW + DMM
• Doc2Vec DBOW + DMM Based Sentiment Analysis using LogisticRegression
• Doc2Vec DBOW + DMM Based Sentiment Analysis using SVC
• Doc2Vec DBOW + DMM Based Sentiment Analysis using XGBClassifier
• Doc2Vec DBOW + DMM Based Sentiment Analysis using Keras Neural Network
RESULTS: WORD2VEC

Accuracy on IMDB movie reviews and Amazon laptop reviews:

| Model | IMDB Train | IMDB Validation | IMDB Test | Amazon Train | Amazon Validation | Amazon Test |
|---|---|---|---|---|---|---|
| LogisticRegression | .8753 | .8692 | .8706 | .9046 | .9060 | .9009 |
| SVC with linear kernel | .8755 | .8714 | .8690 | .9050 | .9089 | .9001 |
| XGBClassifier | .8695 | .8540 | .8506 | .9058 | .9057 | .8967 |
| Keras Convolutional Neural Networks (CNN) | .9994 | .8770 | .8788 | .9992 | .9239 | .9146 |
| Pre-trained Word2Vec Word Embedding + Keras CNN | .9593 | .8444 | .8268 | .9690 | .8969 | .8891 |
| Pre-trained Word2Vec Word Embedding + Keras CNN and Bidirectional LSTM | .9356 | .8854 | .8902 | .9567 | .9212 | .9144 |
| Pre-trained Word2Vec Word Embedding + Keras Bidirectional LSTM | .9011 | .8800 | .8786 | .9418 | .9205 | .9229 |
RESULTS: DOC2VEC SIMPLE MODELS

Accuracy on IMDB movie reviews and Amazon laptop reviews:

| Model | IMDB Train | IMDB Validation | IMDB Test | Amazon Train | Amazon Validation | Amazon Test |
|---|---|---|---|---|---|---|
| Doc2Vec DBOW + LogisticRegression | .8732 | .8738 | .8778 | .8982 | .9094 | .8955 |
| Doc2Vec DBOW + SVC with linear kernel | .8738 | .8742 | .8766 | .8985 | .9072 | .8969 |
| Doc2Vec DBOW + XGBClassifier | .8682 | .8500 | .8556 | .9008 | .8964 | .8849 |
| Doc2Vec DMC + LogisticRegression | .5939 | .5862 | .5986 | .8088 | .8137 | .7961 |
| Doc2Vec DMC + SVC with linear kernel | .5933 | .5864 | .5936 | .8086 | .8130 | .7956 |
| Doc2Vec DMC + XGBClassifier | .6214 | .5952 | .5958 | .8137 | .8193 | .7980 |
| Doc2Vec DMM + LogisticRegression | .8187 | .8196 | .8174 | .8472 | .8571 | .8307 |
| Doc2Vec DMM + SVC with linear kernel | .8193 | .8184 | .8190 | .8428 | .8554 | .8280 |
| Doc2Vec DMM + XGBClassifier | .8115 | .7858 | .7920 | .8494 | .8525 | .8282 |
RESULTS: DOC2VEC MODEL COMBOS

Accuracy on IMDB movie reviews and Amazon laptop reviews:

| Model | IMDB Train | IMDB Validation | IMDB Test | Amazon Train | Amazon Validation | Amazon Test |
|---|---|---|---|---|---|---|
| DBOW + DMC + LogisticRegression | .8741 | .8738 | .8778 | .8982 | .9099 | .8940 |
| DBOW + DMC + SVC with linear kernel | .8751 | .8744 | .8790 | .8990 | .9072 | .8977 |
| DBOW + DMC + XGBClassifier | .8692 | .8510 | .8574 | .9000 | .8969 | .8849 |
| DBOW + DMM + LogisticRegression | .8813 | .8742 | .8804 | .9024 | .9092 | .8994 |
| DBOW + DMM + SVC with linear kernel | .8809 | .8756 | .8786 | .9033 | .9092 | .9001 |
| DBOW + DMM + XGBClassifier | .8736 | .8548 | .8590 | .9016 | .8986 | .8854 |
| DBOW + DMC document embeddings + Keras Neural Network | .9088 | .8742 | .8746 | .9312 | .9040 | .9033 |
| DBOW + DMM document embeddings + Keras Neural Network | .9295 | .8722 | .8720 | .9475 | .9156 | .8984 |
CONFUSION MATRIX: WORD2VEC

Confusion matrix with accuracy, IMDB movie reviews and Amazon laptop reviews:

| Model | IMDB TN | FP | FN | TP | Acc. | Amazon TN | FP | FN | TP | Acc. |
|---|---|---|---|---|---|---|---|---|---|---|
| LR | 2185 | 344 | 303 | 2168 | 0.87 | 554 | 279 | 125 | 3117 | 0.90 |
| SVC | 2182 | 347 | 308 | 2163 | 0.87 | 557 | 276 | 131 | 3111 | 0.90 |
| XGBClassifier | 2127 | 402 | 345 | 2126 | 0.85 | 535 | 298 | 123 | 3119 | 0.90 |
| Keras Convolutional Neural Networks (CNN) | 2302 | 227 | 379 | 2092 | 0.88 | 591 | 242 | 106 | 3136 | 0.91 |
| Pre-trained Word2Vec Word Embedding + Keras CNN | 2110 | 419 | 447 | 2024 | 0.83 | 571 | 262 | 190 | 3052 | 0.89 |
| Pre-trained Word2Vec Word Embedding + Keras CNN and Bidirectional LSTM | 2208 | 321 | 228 | 2243 | 0.89 | 615 | 218 | 131 | 3111 | 0.91 |
| Pre-trained Word2Vec Word Embedding + Keras Bidirectional LSTM | 2176 | 353 | 254 | 2217 | 0.88 | 652 | 181 | 133 | 3109 | 0.92 |
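Each accuracy figure follows directly from the four confusion-matrix cells in its row; a quick check, using the Word2Vec LogisticRegression/IMDB cells (TN=2185, FP=344, FN=303, TP=2168):

```python
def accuracy(tn: int, fp: int, fn: int, tp: int) -> float:
    """Accuracy = correct predictions / all predictions."""
    return (tn + tp) / (tn + fp + fn + tp)

# Word2Vec + LogisticRegression on IMDB
print(round(accuracy(2185, 344, 303, 2168), 4))  # 0.8706
```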
CONFUSION MATRIX: DOC2VEC SIMPLE MODELS

Confusion matrix with accuracy, IMDB movie reviews and Amazon laptop reviews:

| Model | IMDB TN | FP | FN | TP | Acc. | Amazon TN | FP | FN | TP | Acc. |
|---|---|---|---|---|---|---|---|---|---|---|
| DBOW + LogisticRegression | 2205 | 324 | 287 | 2184 | 0.88 | 530 | 303 | 123 | 3119 | 0.90 |
| DBOW + SVC | 2204 | 325 | 292 | 2179 | 0.88 | 533 | 300 | 120 | 3122 | 0.90 |
| DBOW + XGBClassifier | 2152 | 377 | 345 | 2126 | 0.86 | 467 | 366 | 103 | 3139 | 0.88 |
| DMC + LogisticRegression | 1586 | 943 | 1064 | 1407 | 0.60 | 2 | 831 | 0 | 3242 | 0.80 |
| DMC + SVC | 1654 | 875 | 1157 | 1314 | 0.59 | 0 | 833 | 0 | 3242 | 0.80 |
| DMC + XGBClassifier | 1452 | 1077 | 944 | 1527 | 0.60 | 28 | 805 | 18 | 3224 | 0.80 |
| DMM + LogisticRegression | 2064 | 465 | 448 | 2023 | 0.82 | 246 | 587 | 103 | 3139 | 0.83 |
| DMM + SVC | 2062 | 467 | 438 | 2033 | 0.82 | 201 | 632 | 69 | 3173 | 0.83 |
| DMM + XGBClassifier | 1981 | 548 | 492 | 1979 | 0.79 | 193 | 640 | 60 | 3182 | 0.83 |
CONFUSION MATRIX: DOC2VEC MODEL COMBOS

Confusion matrix with accuracy, IMDB movie reviews and Amazon laptop reviews:

| Model | IMDB TN | FP | FN | TP | Acc. | Amazon TN | FP | FN | TP | Acc. |
|---|---|---|---|---|---|---|---|---|---|---|
| DBOW + DMC + LR | 2204 | 325 | 286 | 2185 | 0.88 | 524 | 309 | 123 | 3119 | 0.89 |
| DBOW + DMC + SVC | 2210 | 319 | 286 | 2185 | 0.88 | 529 | 304 | 113 | 3129 | 0.90 |
| DBOW + DMC + XGBClassifier | 2164 | 365 | 348 | 2123 | 0.86 | 472 | 361 | 108 | 3134 | 0.88 |
| DBOW + DMM + LR | 2206 | 323 | 275 | 2196 | 0.88 | 538 | 295 | 115 | 3127 | 0.90 |
| DBOW + DMM + SVC | 2209 | 320 | 287 | 2184 | 0.88 | 538 | 295 | 112 | 3130 | 0.90 |
| DBOW + DMM + XGBClassifier | 2166 | 363 | 342 | 2129 | 0.86 | 465 | 368 | 99 | 3143 | 0.89 |
| DBOW + DMC and Keras Neural Network | 2187 | 342 | 285 | 2186 | 0.87 | 564 | 269 | 125 | 3117 | 0.90 |
| DBOW + DMM and Keras Neural Network | 2142 | 387 | 253 | 2218 | 0.87 | 572 | 261 | 153 | 3089 | 0.90 |
CONCLUSION
Keras Bidirectional LSTM and Keras CNN + Bidirectional LSTM (best performers)
• with pre-trained Word2Vec word embeddings
Keras CNN
• with Tokenizer-based Keras word embeddings
Word2Vec
• LogisticRegression > SVC > XGBClassifier
Keras CNN
• with pre-trained Word2Vec word embeddings
Word2Vec > Doc2Vec
• DBOW > DMM > DMC; also DBOW + DMC > DMC and DBOW + DMM > DMM
• DBOW + DMM > DBOW + DMC
CONCLUSION
A Bidirectional LSTM is a Recurrent Neural Network (RNN). RNNs
have the advantage of being able to persist information: the
network considers current inputs as well as previously received
inputs. Hence, it works really well with sequence data like text,
time series, videos, DNA sequences, etc.
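How a recurrent state persists information, and how a bidirectional layer combines two passes, can be sketched in a few lines of numpy (weights and sizes invented; a real Keras Bidirectional LSTM uses gated cells and separate weights per direction):

```python
import numpy as np

def rnn_step(h, x, W_h, W_x):
    """One recurrent step: the new state mixes the previous state with the input."""
    return np.tanh(W_h @ h + W_x @ x)

rng = np.random.default_rng(1)
W_h, W_x = rng.normal(size=(4, 4)), rng.normal(size=(4, 2))
seq = [rng.normal(size=2) for _ in range(5)]   # a length-5 input sequence

# Forward pass: the hidden state carries information from earlier inputs
h, forward = np.zeros(4), []
for x in seq:
    h = rnn_step(h, x, W_h, W_x)
    forward.append(h)

# Backward pass over the reversed sequence
h, backward = np.zeros(4), []
for x in reversed(seq):
    h = rnn_step(h, x, W_h, W_x)
    backward.append(h)
backward.reverse()

# A bidirectional layer concatenates both directions at each time step
bi = [np.concatenate([f, b]) for f, b in zip(forward, backward)]
print(bi[0].shape)  # (8,)
```

At every position the concatenated state sees context from both the past and the future of the sequence, which is why the bidirectional variants performed best here.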

More Related Content

What's hot

Performance Tipping Points - Hitting Hardware Bottlenecks
Performance Tipping Points - Hitting Hardware BottlenecksPerformance Tipping Points - Hitting Hardware Bottlenecks
Performance Tipping Points - Hitting Hardware Bottlenecks
MongoDB
 
MongoDB World 2018: Enterprise Security in the Cloud
MongoDB World 2018: Enterprise Security in the CloudMongoDB World 2018: Enterprise Security in the Cloud
MongoDB World 2018: Enterprise Security in the Cloud
MongoDB
 
iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...
iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...
iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...
DataStax Academy
 
Scaling opensimulator inventory using nosql
Scaling opensimulator inventory using nosqlScaling opensimulator inventory using nosql
Scaling opensimulator inventory using nosql
David Daeschler
 
SparkSQL et Cassandra - Tool In Action Devoxx 2015
 SparkSQL et Cassandra - Tool In Action Devoxx 2015 SparkSQL et Cassandra - Tool In Action Devoxx 2015
SparkSQL et Cassandra - Tool In Action Devoxx 2015
Alexander DEJANOVSKI
 
World’s Best Data Modeling Tool
World’s Best Data Modeling ToolWorld’s Best Data Modeling Tool
World’s Best Data Modeling Tool
Artem Chebotko
 
Sizing MongoDB Clusters
Sizing MongoDB Clusters Sizing MongoDB Clusters
Sizing MongoDB Clusters
MongoDB
 
Workshop on Advanced Design Patterns for Amazon DynamoDB - DAT405 - re:Invent...
Workshop on Advanced Design Patterns for Amazon DynamoDB - DAT405 - re:Invent...Workshop on Advanced Design Patterns for Amazon DynamoDB - DAT405 - re:Invent...
Workshop on Advanced Design Patterns for Amazon DynamoDB - DAT405 - re:Invent...
Amazon Web Services
 

What's hot (8)

Performance Tipping Points - Hitting Hardware Bottlenecks
Performance Tipping Points - Hitting Hardware BottlenecksPerformance Tipping Points - Hitting Hardware Bottlenecks
Performance Tipping Points - Hitting Hardware Bottlenecks
 
MongoDB World 2018: Enterprise Security in the Cloud
MongoDB World 2018: Enterprise Security in the CloudMongoDB World 2018: Enterprise Security in the Cloud
MongoDB World 2018: Enterprise Security in the Cloud
 
iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...
iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...
iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...
 
Scaling opensimulator inventory using nosql
Scaling opensimulator inventory using nosqlScaling opensimulator inventory using nosql
Scaling opensimulator inventory using nosql
 
SparkSQL et Cassandra - Tool In Action Devoxx 2015
 SparkSQL et Cassandra - Tool In Action Devoxx 2015 SparkSQL et Cassandra - Tool In Action Devoxx 2015
SparkSQL et Cassandra - Tool In Action Devoxx 2015
 
World’s Best Data Modeling Tool
World’s Best Data Modeling ToolWorld’s Best Data Modeling Tool
World’s Best Data Modeling Tool
 
Sizing MongoDB Clusters
Sizing MongoDB Clusters Sizing MongoDB Clusters
Sizing MongoDB Clusters
 
Workshop on Advanced Design Patterns for Amazon DynamoDB - DAT405 - re:Invent...
Workshop on Advanced Design Patterns for Amazon DynamoDB - DAT405 - re:Invent...Workshop on Advanced Design Patterns for Amazon DynamoDB - DAT405 - re:Invent...
Workshop on Advanced Design Patterns for Amazon DynamoDB - DAT405 - re:Invent...
 

Similar to Comparison of Word2vec and Doc2Vec model driven Sentiment Analysis using SVM, LR, Keras CNN, Bidirectional LSTM with and without pre-trained Word and Document Embeddings

Dragon flow neutron lightning talk
Dragon flow neutron lightning talkDragon flow neutron lightning talk
Dragon flow neutron lightning talk
Eran Gampel
 
Dragonflow Austin Summit Talk
Dragonflow Austin Summit Talk Dragonflow Austin Summit Talk
Dragonflow Austin Summit Talk
Eran Gampel
 
A Technical Deep Dive on Protecting Acropolis Workloads with Rubrik
A Technical Deep Dive on Protecting Acropolis Workloads with RubrikA Technical Deep Dive on Protecting Acropolis Workloads with Rubrik
A Technical Deep Dive on Protecting Acropolis Workloads with Rubrik
NEXTtour
 
Day 4 - Cloud Migration - But How?
Day 4 - Cloud Migration - But How?Day 4 - Cloud Migration - But How?
Day 4 - Cloud Migration - But How?
Amazon Web Services
 
ARC209_A Day in the Life of A Netflix Engineer
ARC209_A Day in the Life of A Netflix EngineerARC209_A Day in the Life of A Netflix Engineer
ARC209_A Day in the Life of A Netflix Engineer
Amazon Web Services
 
Discovering Your AI Super Powers - Tips and Tricks to Jumpstart your AI Projects
Discovering Your AI Super Powers - Tips and Tricks to Jumpstart your AI ProjectsDiscovering Your AI Super Powers - Tips and Tricks to Jumpstart your AI Projects
Discovering Your AI Super Powers - Tips and Tricks to Jumpstart your AI Projects
Wee Hyong Tok
 
Высокопроизводительный инференс глубоких сетей на GPU с помощью TensorRT / Ма...
Высокопроизводительный инференс глубоких сетей на GPU с помощью TensorRT / Ма...Высокопроизводительный инференс глубоких сетей на GPU с помощью TensorRT / Ма...
Высокопроизводительный инференс глубоких сетей на GPU с помощью TensorRT / Ма...
Ontico
 
Introduction to Apache Cassandra™ + What’s New in 4.0
Introduction to Apache Cassandra™ + What’s New in 4.0Introduction to Apache Cassandra™ + What’s New in 4.0
Introduction to Apache Cassandra™ + What’s New in 4.0
DataStax
 
AWS Cloud Experience CA: Bases de Datos en AWS: distintas necesidades, distin...
AWS Cloud Experience CA: Bases de Datos en AWS: distintas necesidades, distin...AWS Cloud Experience CA: Bases de Datos en AWS: distintas necesidades, distin...
AWS Cloud Experience CA: Bases de Datos en AWS: distintas necesidades, distin...
Amazon Web Services LATAM
 
FULLTEXT02
FULLTEXT02FULLTEXT02
FULLTEXT02
Sathvik Katam
 
1000 node Cassandra cluster on Amazon's EKS? - Matt Overstreet (DoK Day EU 2022)
1000 node Cassandra cluster on Amazon's EKS? - Matt Overstreet (DoK Day EU 2022)1000 node Cassandra cluster on Amazon's EKS? - Matt Overstreet (DoK Day EU 2022)
1000 node Cassandra cluster on Amazon's EKS? - Matt Overstreet (DoK Day EU 2022)
DoKC
 
1000 node Cassandra cluster on Amazon's EKS?
1000 node Cassandra cluster on Amazon's EKS?1000 node Cassandra cluster on Amazon's EKS?
1000 node Cassandra cluster on Amazon's EKS?
DoKC
 
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of Things
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of ThingsDay 4 - Big Data on AWS - RedShift, EMR & the Internet of Things
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of Things
Amazon Web Services
 
OpenStack Dragonflow shenzhen and Hangzhou meetups
OpenStack Dragonflow shenzhen and Hangzhou  meetupsOpenStack Dragonflow shenzhen and Hangzhou  meetups
OpenStack Dragonflow shenzhen and Hangzhou meetups
Eran Gampel
 
AWS re:Invent 2017 Recap - Solutions Updates
AWS re:Invent 2017 Recap - Solutions UpdatesAWS re:Invent 2017 Recap - Solutions Updates
AWS re:Invent 2017 Recap - Solutions Updates
Amazon Web Services
 
[GeeCON2024] How I learned to stop worrying and love the dark silicon apocalypse
[GeeCON2024] How I learned to stop worrying and love the dark silicon apocalypse[GeeCON2024] How I learned to stop worrying and love the dark silicon apocalypse
[GeeCON2024] How I learned to stop worrying and love the dark silicon apocalypse
Tomasz Kowalczewski
 
Return on Ignite 2019: Azure, .NET, A.I. & Data
Return on Ignite 2019: Azure, .NET, A.I. & DataReturn on Ignite 2019: Azure, .NET, A.I. & Data
Return on Ignite 2019: Azure, .NET, A.I. & Data
MSDEVMTL
 
Dissolving the Problem (Making an ACID-Compliant Database Out of Apache Kafka®)
Dissolving the Problem (Making an ACID-Compliant Database Out of Apache Kafka®)Dissolving the Problem (Making an ACID-Compliant Database Out of Apache Kafka®)
Dissolving the Problem (Making an ACID-Compliant Database Out of Apache Kafka®)
confluent
 
AWS reInvent 2018 Recap - Solutions Updates Part 2
AWS reInvent 2018 Recap - Solutions Updates Part 2AWS reInvent 2018 Recap - Solutions Updates Part 2
AWS reInvent 2018 Recap - Solutions Updates Part 2
Amazon Web Services
 
Using MongoDB to Build a Fast and Scalable Content Repository
Using MongoDB to Build a Fast and Scalable Content RepositoryUsing MongoDB to Build a Fast and Scalable Content Repository
Using MongoDB to Build a Fast and Scalable Content Repository
MongoDB
 

Similar to Comparison of Word2vec and Doc2Vec model driven Sentiment Analysis using SVM, LR, Keras CNN, Bidirectional LSTM with and without pre-trained Word and Document Embeddings (20)

Dragon flow neutron lightning talk
Dragon flow neutron lightning talkDragon flow neutron lightning talk
Dragon flow neutron lightning talk
 
Dragonflow Austin Summit Talk
Dragonflow Austin Summit Talk Dragonflow Austin Summit Talk
Dragonflow Austin Summit Talk
 
A Technical Deep Dive on Protecting Acropolis Workloads with Rubrik
A Technical Deep Dive on Protecting Acropolis Workloads with RubrikA Technical Deep Dive on Protecting Acropolis Workloads with Rubrik
A Technical Deep Dive on Protecting Acropolis Workloads with Rubrik
 
Day 4 - Cloud Migration - But How?
Day 4 - Cloud Migration - But How?Day 4 - Cloud Migration - But How?
Day 4 - Cloud Migration - But How?
 
ARC209_A Day in the Life of A Netflix Engineer
ARC209_A Day in the Life of A Netflix EngineerARC209_A Day in the Life of A Netflix Engineer
ARC209_A Day in the Life of A Netflix Engineer
 
Discovering Your AI Super Powers - Tips and Tricks to Jumpstart your AI Projects
Discovering Your AI Super Powers - Tips and Tricks to Jumpstart your AI ProjectsDiscovering Your AI Super Powers - Tips and Tricks to Jumpstart your AI Projects
Discovering Your AI Super Powers - Tips and Tricks to Jumpstart your AI Projects
 
Высокопроизводительный инференс глубоких сетей на GPU с помощью TensorRT / Ма...
Высокопроизводительный инференс глубоких сетей на GPU с помощью TensorRT / Ма...Высокопроизводительный инференс глубоких сетей на GPU с помощью TensorRT / Ма...
Высокопроизводительный инференс глубоких сетей на GPU с помощью TensorRT / Ма...
 
Introduction to Apache Cassandra™ + What’s New in 4.0
Introduction to Apache Cassandra™ + What’s New in 4.0Introduction to Apache Cassandra™ + What’s New in 4.0
Introduction to Apache Cassandra™ + What’s New in 4.0
 
AWS Cloud Experience CA: Bases de Datos en AWS: distintas necesidades, distin...
AWS Cloud Experience CA: Bases de Datos en AWS: distintas necesidades, distin...AWS Cloud Experience CA: Bases de Datos en AWS: distintas necesidades, distin...
AWS Cloud Experience CA: Bases de Datos en AWS: distintas necesidades, distin...
 
FULLTEXT02
FULLTEXT02FULLTEXT02
FULLTEXT02
 
1000 node Cassandra cluster on Amazon's EKS? - Matt Overstreet (DoK Day EU 2022)
1000 node Cassandra cluster on Amazon's EKS? - Matt Overstreet (DoK Day EU 2022)1000 node Cassandra cluster on Amazon's EKS? - Matt Overstreet (DoK Day EU 2022)
1000 node Cassandra cluster on Amazon's EKS? - Matt Overstreet (DoK Day EU 2022)
 
1000 node Cassandra cluster on Amazon's EKS?
1000 node Cassandra cluster on Amazon's EKS?1000 node Cassandra cluster on Amazon's EKS?
1000 node Cassandra cluster on Amazon's EKS?
 
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of Things
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of ThingsDay 4 - Big Data on AWS - RedShift, EMR & the Internet of Things
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of Things
 
OpenStack Dragonflow shenzhen and Hangzhou meetups
OpenStack Dragonflow shenzhen and Hangzhou  meetupsOpenStack Dragonflow shenzhen and Hangzhou  meetups
OpenStack Dragonflow shenzhen and Hangzhou meetups
 
AWS re:Invent 2017 Recap - Solutions Updates
AWS re:Invent 2017 Recap - Solutions UpdatesAWS re:Invent 2017 Recap - Solutions Updates
AWS re:Invent 2017 Recap - Solutions Updates
 
[GeeCON2024] How I learned to stop worrying and love the dark silicon apocalypse
[GeeCON2024] How I learned to stop worrying and love the dark silicon apocalypse[GeeCON2024] How I learned to stop worrying and love the dark silicon apocalypse
[GeeCON2024] How I learned to stop worrying and love the dark silicon apocalypse
 
Return on Ignite 2019: Azure, .NET, A.I. & Data
Return on Ignite 2019: Azure, .NET, A.I. & DataReturn on Ignite 2019: Azure, .NET, A.I. & Data
Return on Ignite 2019: Azure, .NET, A.I. & Data
 
Dissolving the Problem (Making an ACID-Compliant Database Out of Apache Kafka®)
Dissolving the Problem (Making an ACID-Compliant Database Out of Apache Kafka®)Dissolving the Problem (Making an ACID-Compliant Database Out of Apache Kafka®)
Dissolving the Problem (Making an ACID-Compliant Database Out of Apache Kafka®)
 
AWS reInvent 2018 Recap - Solutions Updates Part 2
AWS reInvent 2018 Recap - Solutions Updates Part 2AWS reInvent 2018 Recap - Solutions Updates Part 2
AWS reInvent 2018 Recap - Solutions Updates Part 2
 
Using MongoDB to Build a Fast and Scalable Content Repository
Using MongoDB to Build a Fast and Scalable Content RepositoryUsing MongoDB to Build a Fast and Scalable Content Repository
Using MongoDB to Build a Fast and Scalable Content Repository
 

Recently uploaded

Walmart Business+ and Spark Good for Nonprofits.pdf
Walmart Business+ and Spark Good for Nonprofits.pdfWalmart Business+ and Spark Good for Nonprofits.pdf
Walmart Business+ and Spark Good for Nonprofits.pdf
TechSoup
 
LAND USE LAND COVER AND NDVI OF MIRZAPUR DISTRICT, UP
LAND USE LAND COVER AND NDVI OF MIRZAPUR DISTRICT, UPLAND USE LAND COVER AND NDVI OF MIRZAPUR DISTRICT, UP
LAND USE LAND COVER AND NDVI OF MIRZAPUR DISTRICT, UP
RAHUL
 
BÀI TẬP DẠY THÊM TIẾNG ANH LỚP 7 CẢ NĂM FRIENDS PLUS SÁCH CHÂN TRỜI SÁNG TẠO ...
BÀI TẬP DẠY THÊM TIẾNG ANH LỚP 7 CẢ NĂM FRIENDS PLUS SÁCH CHÂN TRỜI SÁNG TẠO ...BÀI TẬP DẠY THÊM TIẾNG ANH LỚP 7 CẢ NĂM FRIENDS PLUS SÁCH CHÂN TRỜI SÁNG TẠO ...
BÀI TẬP DẠY THÊM TIẾNG ANH LỚP 7 CẢ NĂM FRIENDS PLUS SÁCH CHÂN TRỜI SÁNG TẠO ...
Nguyen Thanh Tu Collection
 
Stack Memory Organization of 8086 Microprocessor
Stack Memory Organization of 8086 MicroprocessorStack Memory Organization of 8086 Microprocessor
Stack Memory Organization of 8086 Microprocessor
JomonJoseph58
 
SWOT analysis in the project Keeping the Memory @live.pptx
SWOT analysis in the project Keeping the Memory @live.pptxSWOT analysis in the project Keeping the Memory @live.pptx
SWOT analysis in the project Keeping the Memory @live.pptx
zuzanka
 
A Independência da América Espanhola LAPBOOK.pdf
A Independência da América Espanhola LAPBOOK.pdfA Independência da América Espanhola LAPBOOK.pdf
A Independência da América Espanhola LAPBOOK.pdf
Jean Carlos Nunes Paixão
 
Bonku-Babus-Friend by Sathyajith Ray (9)
Bonku-Babus-Friend by Sathyajith Ray  (9)Bonku-Babus-Friend by Sathyajith Ray  (9)
Bonku-Babus-Friend by Sathyajith Ray (9)
nitinpv4ai
 
Philippine Edukasyong Pantahanan at Pangkabuhayan (EPP) Curriculum
Philippine Edukasyong Pantahanan at Pangkabuhayan (EPP) CurriculumPhilippine Edukasyong Pantahanan at Pangkabuhayan (EPP) Curriculum
Philippine Edukasyong Pantahanan at Pangkabuhayan (EPP) Curriculum
MJDuyan
 
BIOLOGY NATIONAL EXAMINATION COUNCIL (NECO) 2024 PRACTICAL MANUAL.pptx
BIOLOGY NATIONAL EXAMINATION COUNCIL (NECO) 2024 PRACTICAL MANUAL.pptxBIOLOGY NATIONAL EXAMINATION COUNCIL (NECO) 2024 PRACTICAL MANUAL.pptx
BIOLOGY NATIONAL EXAMINATION COUNCIL (NECO) 2024 PRACTICAL MANUAL.pptx
RidwanHassanYusuf
 
Level 3 NCEA - NZ: A Nation In the Making 1872 - 1900 SML.ppt
Level 3 NCEA - NZ: A  Nation In the Making 1872 - 1900 SML.pptLevel 3 NCEA - NZ: A  Nation In the Making 1872 - 1900 SML.ppt
Level 3 NCEA - NZ: A Nation In the Making 1872 - 1900 SML.ppt
Henry Hollis
 
Electric Fetus - Record Store Scavenger Hunt
Electric Fetus - Record Store Scavenger HuntElectric Fetus - Record Store Scavenger Hunt
Electric Fetus - Record Store Scavenger Hunt
RamseyBerglund
 
HYPERTENSION - SLIDE SHARE PRESENTATION.
HYPERTENSION - SLIDE SHARE PRESENTATION.HYPERTENSION - SLIDE SHARE PRESENTATION.
HYPERTENSION - SLIDE SHARE PRESENTATION.
deepaannamalai16
 
RESULTS OF THE EVALUATION QUESTIONNAIRE.pptx
RESULTS OF THE EVALUATION QUESTIONNAIRE.pptxRESULTS OF THE EVALUATION QUESTIONNAIRE.pptx
RESULTS OF THE EVALUATION QUESTIONNAIRE.pptx
zuzanka
 
Benner "Expanding Pathways to Publishing Careers"
Benner "Expanding Pathways to Publishing Careers"Benner "Expanding Pathways to Publishing Careers"
Benner "Expanding Pathways to Publishing Careers"
National Information Standards Organization (NISO)
 
Temple of Asclepius in Thrace. Excavation results
Temple of Asclepius in Thrace. Excavation resultsTemple of Asclepius in Thrace. Excavation results
Temple of Asclepius in Thrace. Excavation results
Krassimira Luka
 
How to Setup Warehouse & Location in Odoo 17 Inventory
How to Setup Warehouse & Location in Odoo 17 InventoryHow to Setup Warehouse & Location in Odoo 17 Inventory
How to Setup Warehouse & Location in Odoo 17 Inventory
Celine George
 
Beyond Degrees - Empowering the Workforce in the Context of Skills-First.pptx
Beyond Degrees - Empowering the Workforce in the Context of Skills-First.pptxBeyond Degrees - Empowering the Workforce in the Context of Skills-First.pptx
Beyond Degrees - Empowering the Workforce in the Context of Skills-First.pptx
EduSkills OECD
 
Mule event processing models | MuleSoft Mysore Meetup #47
Mule event processing models | MuleSoft Mysore Meetup #47Mule event processing models | MuleSoft Mysore Meetup #47
Mule event processing models | MuleSoft Mysore Meetup #47
MysoreMuleSoftMeetup
 
BÀI TẬP BỔ TRỢ TIẾNG ANH LỚP 9 CẢ NĂM - GLOBAL SUCCESS - NĂM HỌC 2024-2025 - ...
BÀI TẬP BỔ TRỢ TIẾNG ANH LỚP 9 CẢ NĂM - GLOBAL SUCCESS - NĂM HỌC 2024-2025 - ...BÀI TẬP BỔ TRỢ TIẾNG ANH LỚP 9 CẢ NĂM - GLOBAL SUCCESS - NĂM HỌC 2024-2025 - ...
BÀI TẬP BỔ TRỢ TIẾNG ANH LỚP 9 CẢ NĂM - GLOBAL SUCCESS - NĂM HỌC 2024-2025 - ...
Nguyen Thanh Tu Collection
 
B. Ed Syllabus for babasaheb ambedkar education university.pdf
B. Ed Syllabus for babasaheb ambedkar education university.pdfB. Ed Syllabus for babasaheb ambedkar education university.pdf
B. Ed Syllabus for babasaheb ambedkar education university.pdf
BoudhayanBhattachari
 

Recently uploaded (20)

Walmart Business+ and Spark Good for Nonprofits.pdf
Walmart Business+ and Spark Good for Nonprofits.pdfWalmart Business+ and Spark Good for Nonprofits.pdf
Walmart Business+ and Spark Good for Nonprofits.pdf
 
LAND USE LAND COVER AND NDVI OF MIRZAPUR DISTRICT, UP
LAND USE LAND COVER AND NDVI OF MIRZAPUR DISTRICT, UPLAND USE LAND COVER AND NDVI OF MIRZAPUR DISTRICT, UP
LAND USE LAND COVER AND NDVI OF MIRZAPUR DISTRICT, UP
 
BÀI TẬP DẠY THÊM TIẾNG ANH LỚP 7 CẢ NĂM FRIENDS PLUS SÁCH CHÂN TRỜI SÁNG TẠO ...
BÀI TẬP DẠY THÊM TIẾNG ANH LỚP 7 CẢ NĂM FRIENDS PLUS SÁCH CHÂN TRỜI SÁNG TẠO ...BÀI TẬP DẠY THÊM TIẾNG ANH LỚP 7 CẢ NĂM FRIENDS PLUS SÁCH CHÂN TRỜI SÁNG TẠO ...
BÀI TẬP DẠY THÊM TIẾNG ANH LỚP 7 CẢ NĂM FRIENDS PLUS SÁCH CHÂN TRỜI SÁNG TẠO ...
 
Stack Memory Organization of 8086 Microprocessor
Stack Memory Organization of 8086 MicroprocessorStack Memory Organization of 8086 Microprocessor
Stack Memory Organization of 8086 Microprocessor
 
SWOT analysis in the project Keeping the Memory @live.pptx
SWOT analysis in the project Keeping the Memory @live.pptxSWOT analysis in the project Keeping the Memory @live.pptx
SWOT analysis in the project Keeping the Memory @live.pptx
 
A Independência da América Espanhola LAPBOOK.pdf
A Independência da América Espanhola LAPBOOK.pdfA Independência da América Espanhola LAPBOOK.pdf
A Independência da América Espanhola LAPBOOK.pdf
 
Bonku-Babus-Friend by Sathyajith Ray (9)
Bonku-Babus-Friend by Sathyajith Ray  (9)Bonku-Babus-Friend by Sathyajith Ray  (9)
Bonku-Babus-Friend by Sathyajith Ray (9)
 
Philippine Edukasyong Pantahanan at Pangkabuhayan (EPP) Curriculum
Philippine Edukasyong Pantahanan at Pangkabuhayan (EPP) CurriculumPhilippine Edukasyong Pantahanan at Pangkabuhayan (EPP) Curriculum
Philippine Edukasyong Pantahanan at Pangkabuhayan (EPP) Curriculum
 
BIOLOGY NATIONAL EXAMINATION COUNCIL (NECO) 2024 PRACTICAL MANUAL.pptx
BIOLOGY NATIONAL EXAMINATION COUNCIL (NECO) 2024 PRACTICAL MANUAL.pptxBIOLOGY NATIONAL EXAMINATION COUNCIL (NECO) 2024 PRACTICAL MANUAL.pptx
BIOLOGY NATIONAL EXAMINATION COUNCIL (NECO) 2024 PRACTICAL MANUAL.pptx
 
Level 3 NCEA - NZ: A Nation In the Making 1872 - 1900 SML.ppt
Level 3 NCEA - NZ: A  Nation In the Making 1872 - 1900 SML.pptLevel 3 NCEA - NZ: A  Nation In the Making 1872 - 1900 SML.ppt

Comparison of Word2vec and Doc2Vec model driven Sentiment Analysis using SVM, LR, Keras CNN, Bidirectional LSTM with and without pre-trained Word and Document Embeddings

  • 1. Sofia Dutta Data 602 – Spring 2019 Semester Project Comparing Word2vec, Doc2Vec model driven Sentiment Analysis using SVM, LR vs Keras CNN and Bidirectional LSTM with and without pre-trained Word and Document Embeddings
  • 3. WHAT IS A WORD VECTOR? • Assume a vocabulary has five words: King, Queen, Man, Woman, Child. • A one-hot vector doesn’t allow meaningful comparisons, i.e. no semantics are available • The solution is to use Word2Vec, which uses a distributed representation of words in vector space. One-hot encoded vector for ‘Queen’
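The limitation of one-hot encoding can be seen directly: every pair of distinct one-hot vectors is orthogonal, so the only comparison available is equality. A minimal numpy sketch for the five-word vocabulary above:

```python
import numpy as np

vocab = ["King", "Queen", "Man", "Woman", "Child"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

# Distinct one-hot vectors are orthogonal: the dot product carries no
# notion of similarity, only an equality test.
print(one_hot["King"] @ one_hot["Queen"])  # 0.0
print(one_hot["King"] @ one_hot["King"])   # 1.0
```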
  • 4. WHAT IS A WORD VECTOR? • Distributed representations of word2vec can help encode various aspects of a word • Aspects are represented by elements of the vector and can help define a word • Aspects can represent things like royalty, gender or age in our “little language”
  • 5. WHAT IS A WORD VECTOR? • It has been observed that we can perform simple algebraic operations on the word vectors. • We can remove the masculinity aspect from King by performing the vector operation: vector(“King”) – vector(“Man”) • Following that we can add vector(“Woman”) to the result from above and obtain a vector that will be closest to the vector representation of the word Queen.
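The King − Man + Woman ≈ Queen arithmetic can be illustrated with hand-made "aspect" vectors. The numbers below are purely illustrative, not trained embeddings; each element stands for a hypothetical aspect (royalty, masculinity, femininity, age):

```python
import numpy as np

# Illustrative aspect vectors: [royalty, masculinity, femininity, age]
vec = {
    "King":  np.array([0.99, 0.99, 0.05, 0.7]),
    "Queen": np.array([0.99, 0.05, 0.93, 0.6]),
    "Man":   np.array([0.01, 0.99, 0.03, 0.5]),
    "Woman": np.array([0.02, 0.01, 0.99, 0.5]),
    "Child": np.array([0.01, 0.30, 0.30, 0.1]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# vector("King") - vector("Man") + vector("Woman") lands nearest to Queen
target = vec["King"] - vec["Man"] + vec["Woman"]
closest = max(vec, key=lambda w: cosine(vec[w], target))
print(closest)  # Queen
```

With real Word2Vec embeddings the same query is answered by gensim's `most_similar(positive=["King", "Woman"], negative=["Man"])`.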
  • 7. SYSTEM ARCHITECTURE Movie review dataset (50K reviews), Amazon Laptop review dataset (40K+ reviews) → Cleanup and pre-processing → Word and Document embeddings → Logistic regression, XGBoost, SVM etc. | Keras word embeddings → Keras CNN, Keras Bidirectional LSTM
  • 8. DATA OVERVIEW
  Amazon laptop reviews: Columns: Review, Rating; Total: 40K+; Positive: ~30K; Negative: ~10K; Max review length: 20K characters (positive), 3K characters (negative); Comment: distribution mass should be covered by 900 to 1,000 characters
  IMDb movie reviews: Columns: Review, Rating; Total: 50K; Positive: 25K; Negative: 25K; Max review length: 14K characters (positive), 5K characters (negative); Comment: distribution mass should be covered by 1,400 to 1,500 characters
  • 9. DATA EXPLORATION: REVIEW LENGTH, LAPTOP REVIEWS Laptop dataset review lengths
  • 10. DATA EXPLORATION: REVIEW LENGTH, MOVIE REVIEWS Movie dataset review lengths
  • 11. DATA EXPLORATION: REVIEW LENGTH Movie dataset review lengths Laptop dataset review lengths
  • 12. DATA EXPLORATION: REVIEW COUNT Movie dataset review count, Laptop dataset review count
  • 13. TEXT PRE-PROCESSING Input • Review string Remove • HTML tags, URLs Convert • To lowercase Split • Into words Remove • Punctuations, Empty Strings, Stop-words Return • Concatenated tokens as a sentence
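The pre-processing pipeline above can be sketched as one function. This is a minimal sketch, not the project's exact code: the regexes, the small stand-in stop-word set, and the function name `clean_review` are assumptions (the project presumably used nltk's full stop-word list):

```python
import re
import string

STOP_WORDS = {"the", "a", "an", "is", "it", "this", "and"}  # stand-in for nltk's stopword list

def clean_review(text):
    text = re.sub(r"<[^>]+>", " ", text)       # remove HTML tags
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = text.lower()                        # convert to lowercase
    words = text.split()                       # split into words
    words = [w.strip(string.punctuation) for w in words]        # remove punctuation
    words = [w for w in words if w and w not in STOP_WORDS]     # drop empties, stop-words
    return " ".join(words)                     # return concatenated tokens as a sentence

print(clean_review("This laptop is <b>great</b>! See https://example.com"))
# laptop great see
```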
  • 14. EXPLORING DATASET Looking at high frequency words in each dataset using nltk.FreqDist()
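`nltk.FreqDist` is essentially a `collections.Counter` over tokens, so the high-frequency-word exploration can be reproduced with the standard library alone (the toy token list below is an assumption for illustration):

```python
from collections import Counter

# Counter mirrors nltk.FreqDist's term-frequency behaviour
tokens = "great laptop great screen bad battery great price".split()
freq = Counter(tokens)
print(freq.most_common(2))  # most frequent words first, e.g. ('great', 3)
```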
  • 16. WORD2VEC CBOW MODEL EVALUATION
  • 17. WORD2VEC CBOW MODEL EVALUATION
  • 18. WORD2VEC CBOW MODEL EVALUATION
  • 19. POSITIVE AND NEGATIVE REVIEW WORD CLOUDS • Amazon dataset • IMDb movie review dataset Positive cloud is with white background and negative cloud is with black background
  • 20. SENTIMENT ANALYSIS • Word2Vec Word Embedding Based Sentiment Analysis using LogisticRegression • Word2Vec Word Embedding Based Sentiment Analysis using SVC • Word2Vec Word Embedding Based Sentiment Analysis using XGBClassifier
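The Word2Vec-based classifiers above all follow the same recipe: represent each review as the mean of its word vectors, then fit a standard classifier. A minimal sketch with a toy hand-made embedding dict standing in for trained Word2Vec vectors (the 3-d vectors, vocabulary, and `doc_vector` helper are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy 3-d "word vectors" standing in for trained Word2Vec embeddings
emb = {"good":  np.array([ 1.0, 0.2, 0.0]),
       "great": np.array([ 0.9, 0.3, 0.1]),
       "bad":   np.array([-1.0, 0.1, 0.0]),
       "awful": np.array([-0.9, 0.2, 0.1])}

def doc_vector(tokens, emb, dim=3):
    vecs = [emb[t] for t in tokens if t in emb]  # skip out-of-vocabulary words
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

reviews = ["good great", "great good good", "bad awful", "awful bad bad"]
X = np.array([doc_vector(r.split(), emb) for r in reviews])
y = [1, 1, 0, 0]  # positive / negative labels

clf = LogisticRegression().fit(X, y)
print(clf.predict([doc_vector(["good"], emb)]))  # [1]
```

Swapping `LogisticRegression` for `SVC` or `XGBClassifier` changes only the last two lines.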
  • 21. KERAS CONVOLUTIONAL NEURAL NETWORK • Sentiment Analysis using Keras Convolutional Neural Networks (CNN) • Sentiment Analysis Using Pre-trained Word2Vec Word Embedding To Keras CNN
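A Keras CNN for sentiment classification typically stacks an embedding layer, 1-D convolutions, and global max pooling. This is a sketch, not the project's exact architecture; the vocabulary size, sequence length, and filter counts are placeholder assumptions:

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense

vocab_size, maxlen, embedding_dim = 5000, 100, 32  # placeholder sizes

model = Sequential([
    Embedding(vocab_size, embedding_dim),  # learned here; pre-trained Word2Vec vectors could be loaded instead
    Conv1D(64, 5, activation="relu"),      # n-gram-like feature detectors over the token sequence
    GlobalMaxPooling1D(),                  # keep the strongest activation per filter
    Dense(1, activation="sigmoid"),        # positive/negative probability
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# A batch of 2 padded, integer-encoded reviews
batch = np.random.randint(0, vocab_size, size=(2, maxlen))
preds = model.predict(batch, verbose=0)
print(preds.shape)  # (2, 1)
```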
  • 22. BIDIRECTIONAL LONG SHORT TERM MEMORY • Sentiment Analysis Using Pre-trained Word2Vec Word Embedding To Keras CNN And Bidirectional LSTM • Sentiment Analysis Using Pre-trained Word2Vec Word Embedding To Keras Bidirectional LSTM
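The bidirectional LSTM variant reads each review in both directions before classifying. Again a sketch with placeholder sizes, not the project's exact model:

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense

model = Sequential([
    Embedding(5000, 32),        # pre-trained Word2Vec vectors could be loaded into this layer
    Bidirectional(LSTM(64)),    # process the sequence left-to-right and right-to-left
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

preds = model.predict(np.random.randint(0, 5000, size=(2, 100)), verbose=0)
print(preds.shape)  # (2, 1)
```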
  • 23. DOC2VEC: DISTRIBUTED BAG-OF-WORDS • Doc2Vec DBOW Based Sentiment Analysis using LogisticRegression • Doc2Vec DBOW Based Sentiment Analysis using SVC • Doc2Vec DBOW Based Sentiment Analysis using XGBClassifier
  • 24. DOC2VEC: DISTRIBUTED MEMORY • DM (Concatenated) • Doc2Vec DMC Based Sentiment Analysis using LogisticRegression • Doc2Vec DMC Based Sentiment Analysis using SVC • Doc2Vec DMC Based Sentiment Analysis using XGBClassifier • DM (Mean) • Doc2Vec DMM Based Sentiment Analysis using LogisticRegression • Doc2Vec DMM Based Sentiment Analysis using SVC • Doc2Vec DMM Based Sentiment Analysis using XGBClassifier
  • 25. DOC2VEC: DBOW + DMC AND DBOW + DMM • DBOW + DMC • Doc2Vec DBOW + DMC Based Sentiment Analysis using LogisticRegression • Doc2Vec DBOW + DMC Based Sentiment Analysis using SVC • Doc2Vec DBOW + DMC Based Sentiment Analysis using XGBClassifier • Doc2Vec DBOW + DMC Based Sentiment Analysis using Keras Neural Network • DBOW + DMM • Doc2Vec DBOW + DMM Based Sentiment Analysis using LogisticRegression • Doc2Vec DBOW + DMM Based Sentiment Analysis using SVC • Doc2Vec DBOW + DMM Based Sentiment Analysis using XGBClassifier • Doc2Vec DBOW + DMM Based Sentiment Analysis using Keras Neural Network
  • 26. RESULTS: WORD2VEC (accuracy as Train / Validation / Test)
  LogisticRegression: IMDB .8753 / .8692 / .8706; Amazon .9046 / .9060 / .9009
  SVC with linear kernel: IMDB .8755 / .8714 / .8690; Amazon .9050 / .9089 / .9001
  XGBClassifier: IMDB .8695 / .8540 / .8506; Amazon .9058 / .9057 / .8967
  Keras Convolutional Neural Networks (CNN): IMDB .9994 / .8770 / .8788; Amazon .9992 / .9239 / .9146
  Pre-trained Word2Vec Word Embedding + Keras CNN: IMDB .9593 / .8444 / .8268; Amazon .9690 / .8969 / .8891
  Pre-trained Word2Vec Word Embedding + Keras CNN And Bidirectional LSTM: IMDB .9356 / .8854 / .8902; Amazon .9567 / .9212 / .9144
  Pre-trained Word2Vec Word Embedding + Keras Bidirectional LSTM: IMDB .9011 / .8800 / .8786; Amazon .9418 / .9205 / .9229
  • 27. RESULTS: DOC2VEC SIMPLE MODELS (accuracy as Train / Validation / Test)
  Doc2Vec DBOW:
  LogisticRegression: IMDB .8732 / .8738 / .8778; Amazon .8982 / .9094 / .8955
  SVC with linear kernel: IMDB .8738 / .8742 / .8766; Amazon .8985 / .9072 / .8969
  XGBClassifier: IMDB .8682 / .8500 / .8556; Amazon .9008 / .8964 / .8849
  Doc2Vec DMC:
  LogisticRegression: IMDB .5939 / .5862 / .5986; Amazon .8088 / .8137 / .7961
  SVC with linear kernel: IMDB .5933 / .5864 / .5936; Amazon .8086 / .8130 / .7956
  XGBClassifier: IMDB .6214 / .5952 / .5958; Amazon .8137 / .8193 / .7980
  Doc2Vec DMM:
  LogisticRegression: IMDB .8187 / .8196 / .8174; Amazon .8472 / .8571 / .8307
  SVC with linear kernel: IMDB .8193 / .8184 / .8190; Amazon .8428 / .8554 / .8280
  XGBClassifier: IMDB .8115 / .7858 / .7920; Amazon .8494 / .8525 / .8282
  • 28. RESULTS: DOC2VEC MODEL COMBOS (accuracy as Train / Validation / Test)
  DBOW + DMC:
  LogisticRegression: IMDB .8741 / .8738 / .8778; Amazon .8982 / .9099 / .8940
  SVC with linear kernel: IMDB .8751 / .8744 / .8790; Amazon .8990 / .9072 / .8977
  XGBClassifier: IMDB .8692 / .8510 / .8574; Amazon .9000 / .8969 / .8849
  DBOW + DMM:
  LogisticRegression: IMDB .8813 / .8742 / .8804; Amazon .9024 / .9092 / .8994
  SVC with linear kernel: IMDB .8809 / .8756 / .8786; Amazon .9033 / .9092 / .9001
  XGBClassifier: IMDB .8736 / .8548 / .8590; Amazon .9016 / .8986 / .8854
  Doc2Vec DBOW + DMC document embeddings + Keras Neural Network: IMDB .9088 / .8742 / .8746; Amazon .9312 / .9040 / .9033
  Doc2Vec DBOW + DMM document embeddings + Keras Neural Network: IMDB .9295 / .8722 / .8720; Amazon .9475 / .9156 / .8984
  • 29. CONFUSION MATRIX: WORD2VEC (TN, FP, FN, TP, Accuracy)
  LR: IMDB 2185, 344, 303, 2168 (0.87); Amazon 554, 279, 125, 3117 (0.90)
  SVC: IMDB 2182, 347, 308, 2163 (0.87); Amazon 557, 276, 131, 3111 (0.90)
  XGBClassifier: IMDB 2127, 402, 345, 2126 (0.85); Amazon 535, 298, 123, 3119 (0.90)
  Keras Convolutional Neural Networks (CNN): IMDB 2302, 227, 379, 2092 (0.88); Amazon 591, 242, 106, 3136 (0.91)
  Pre-trained Word2Vec Word Embedding To Keras CNN: IMDB 2110, 419, 447, 2024 (0.83); Amazon 571, 262, 190, 3052 (0.89)
  Pre-trained Word2Vec Word Embedding To Keras CNN And Bidirectional LSTM: IMDB 2208, 321, 228, 2243 (0.89); Amazon 615, 218, 131, 3111 (0.91)
  Pre-trained Word2Vec Word Embedding To Keras Bidirectional LSTM: IMDB 2176, 353, 254, 2217 (0.88); Amazon 652, 181, 133, 3109 (0.92)
  • 30. CONFUSION MATRIX: DOC2VEC SIMPLE MODELS (TN, FP, FN, TP, Accuracy)
  DBOW LogisticRegression: IMDB 2205, 324, 287, 2184 (0.88); Amazon 530, 303, 123, 3119 (0.90)
  DBOW SVC: IMDB 2204, 325, 292, 2179 (0.88); Amazon 533, 300, 120, 3122 (0.90)
  DBOW XGBClassifier: IMDB 2152, 377, 345, 2126 (0.86); Amazon 467, 366, 103, 3139 (0.88)
  DMC LogisticRegression: IMDB 1586, 943, 1064, 1407 (0.60); Amazon 2, 831, 0, 3242 (0.80)
  DMC SVC: IMDB 1654, 875, 1157, 1314 (0.59); Amazon 0, 833, 0, 3242 (0.80)
  DMC XGBClassifier: IMDB 1452, 1077, 944, 1527 (0.60); Amazon 28, 805, 18, 3224 (0.80)
  DMM LogisticRegression: IMDB 2064, 465, 448, 2023 (0.82); Amazon 246, 587, 103, 3139 (0.83)
  DMM SVC: IMDB 2062, 467, 438, 2033 (0.82); Amazon 201, 632, 69, 3173 (0.83)
  DMM XGBClassifier: IMDB 1981, 548, 492, 1979 (0.79); Amazon 193, 640, 60, 3182 (0.83)
  • 31. CONFUSION MATRIX: DOC2VEC MODEL COMBOS (TN, FP, FN, TP, Accuracy)
  DBOW + DMC LR: IMDB 2204, 325, 286, 2185 (0.88); Amazon 524, 309, 123, 3119 (0.89)
  DBOW + DMC SVC: IMDB 2210, 319, 286, 2185 (0.88); Amazon 529, 304, 113, 3129 (0.90)
  DBOW + DMC XGBClassifier: IMDB 2164, 365, 348, 2123 (0.86); Amazon 472, 361, 108, 3134 (0.88)
  DBOW + DMM LR: IMDB 2206, 323, 275, 2196 (0.88); Amazon 538, 295, 115, 3127 (0.90)
  DBOW + DMM SVC: IMDB 2209, 320, 287, 2184 (0.88); Amazon 538, 295, 112, 3130 (0.90)
  DBOW + DMM XGBClassifier: IMDB 2166, 363, 342, 2129 (0.86); Amazon 465, 368, 99, 3143 (0.89)
  DBOW + DMC And Keras Neural Network: IMDB 2187, 342, 285, 2186 (0.87); Amazon 564, 269, 125, 3117 (0.90)
  DBOW + DMM And Keras Neural Network: IMDB 2142, 387, 253, 2218 (0.87); Amazon 572, 261, 153, 3089 (0.90)
  • 32. CONCLUSION • Keras Bidirectional LSTM and Keras CNN + Bidirectional LSTM with pre-trained Word2Vec word embeddings • Keras CNN with Tokenizer • Word2Vec: LogisticRegression > SVC > XGBClassifier • Keras CNN with pre-trained Word2Vec word embeddings • Word2Vec > Doc2Vec • DBOW > DMM > DMC, and DBOW + DMC > DMC; also DBOW + DMM > DMM • DBOW + DMM > DBOW + DMC
  • 33. CONCLUSION Bidirectional LSTM is a Recurrent Neural Network (RNN). RNNs have the advantage of being able to persist information: the network considers current inputs as well as previously received inputs. Hence, it works really well with sequence data like text, time series, videos, DNA sequences, etc.

Editor's Notes

  1. Good evening folks. Today I am going to present my semester project for Data 602 – Introduction to Data Analysis and Machine Learning. In this project I carry out sentiment analysis using both the machine learning library Keras and common classification algorithms like SVM that use Word2Vec embeddings as their features. I will compare the results of these two approaches and present them in my final report.
  2. Let’s assume a vocabulary has five words: King, Queen, Man, Woman, Child. A one-hot encoded vector of a word in this language will have 1 in a single position to represent a specific word. All other elements will be zero. Such an encoding only allows comparisons in the form of equality testing. Meaningful comparisons cannot be performed because the words are independent of one another. Word2Vec, on the other hand, represents words using a distributed representation. Each word, represented by a vector, is defined by the combination of its various aspects. Aspects are represented in the elements of the vector. As a result, we can have aspects of royalty, gender, age, etc. in our “little language”. Once such a representation is created, we can perform algebraic operations with language. We can remove the masculinity of a King by performing vector(“King”) – vector(“Man”). Then we can add femininity to the result vector by adding vector(“Woman”) to it to obtain a vector that is closest to the vector representation of the word Queen. See image below
  3. The source for the movie review dataset is Stanford: https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz. The source URL for the Amazon product reviews is UIUC: http://sifaka.cs.uiuc.edu/~wang296/Data/LARA/Amazon/AmazonReviews.zip. There is no cost to accessing this data, accessing it does not require creation of an account, and accessing it does not violate any laws. The datasets I am using are Amazon laptop reviews and IMDb movie reviews. IMDb shape: before, the shape of the DataFrame is (50000, 2); after, (50000, 2). Amazon shape: before, (40762, 2); after, (40762, 2). I perform sentiment analysis using Word2Vec word embeddings with the classification algorithms Logistic regression, XGBoost, SVM, and Random Forest, and compare the results to a Keras Convolutional Neural Network.
  4. The first dataset of laptop product reviews from Amazon’s website contained 40,744 reviews. The dataset contained over 30K positive reviews and close to 10K negative reviews. The longest positive review length contained over 20,000 characters and longest negative review contained close to 3,000 characters. The box plot of review length shows exponential properties. Looking at the graph shown in Fig 5 it can be estimated that the mass of the distribution can probably be covered with a clipped length of 900 to 1,000 words. The review dataset contains just two columns, one with the rating and the other with the review text. The IMDb movie review dataset contains 50,000 reviews. The dataset contains equal number of reviews annotated as positive and negative. The longest positive review length contains about 14,000 characters and the longest negative review contains close to 5,000 characters. Similar to the first dataset, the box plot of review length shows exponential properties. Looking at the graph shown in Fig 6 it can be estimated that the mass of the distribution can probably be covered with a clipped length of 1,400 to 1,500 words. The review dataset contains just two columns, one with the rating and the other with the review text.
  9. The term frequency distribution of words in the review is obtained using nltk.FreqDist(). This provides us a rough idea of the main topic in the review dataset.
  10. Word embedding is a language modeling technique that uses vectors with several dimensions to represent words from large amounts of unstructured text data. Word embeddings can be generated using various methods like neural networks, co-occurrence matrix, probabilistic models, etc. There are two models for generating word embeddings, i.e. CBOW and Skip-gram. CBOW (Continuous Bag of Words) model CBOW model predicts the current word given a context of words. The input layer contains context words and the output layer contains current predicted word. The hidden layer contains the number of dimensions in which to represent current word at output layer. The CBOW architecture is shown in Fig 1.
  14. CNN is a class of deep, feed-forward artificial neural networks (where connections between nodes do not form a cycle) that use a variation of multilayer perceptrons designed to require minimal preprocessing. These are inspired by the animal visual cortex. I have taken reference from the Yoon Kim paper and the blog by Denny Britz. CNNs are generally used in computer vision; however, they have recently been applied to various NLP tasks and the results were promising.
  15. Recurrent Neural Network (RNN) is one of the most popular architectures used in Natural Language Processing (NLP) tasks because its recurrent structure is very suitable to process variable length text. RNN can utilize distributed representations of words by first converting the tokens comprising each text into vectors, which form a matrix. And this matrix includes two dimensions: the time-step dimension and the feature vector dimension. Then most existing models usually utilize one-dimensional (1D) max pooling operation or attention-based operation only on the time-step dimension to obtain a fixed-length vector. However, the features on the feature vector dimension are not mutually independent, and simply applying 1D pooling operation over the time-step dimension independently may destroy the structure of the feature representation. On the other hand, applying two-dimensional (2D) pooling operation over the two dimensions may sample more meaningful features for sequence modeling tasks. Compared with the state-of-the-art models, the proposed models achieve excellent performance on 4 out of 6 tasks. Specifically, one of the proposed models achieves highest accuracy on Stanford Sentiment Treebank binary classification and fine-grained classification tasks.
  16. Doc2Vec can be modeled using paragraph vector distributed bag of words (PV-DBOW or DBOW model) which is a model analogous to Skip-gram in Word2Vec. The document vectors are obtained by training a neural network on the task of predicting a probability distribution of words in a paragraph given a randomly-sampled word from the document.
  17. Often called paragraph vector distributed memory (PV-DM or DM model) is obtained by training a neural network on the task of inferring a center word based on context words and a context paragraph.