Sofia Dutta
Data 602 – Spring 2019
Semester Project
Comparing Word2vec, Doc2Vec
model driven Sentiment Analysis
using SVM, LR vs Keras CNN and
Bidirectional LSTM
with and without pre-trained Word
and Document Embeddings
INTRODUCTION
SENTIMENT ANALYSIS
WHAT IS A WORD VECTOR?
• Assume a vocabulary has five words:
King, Queen, Man, Woman, Child.
• A one-hot vector doesn’t allow
meaningful comparisons between words,
i.e. no semantics are available
• The solution is to use Word2Vec,
which uses a distributed representation
of words in a vector space.
One-hot encoded vector for ‘Queen’
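As a toy illustration (the five-word vocabulary comes from the slide; the encoding is just a sketch), one-hot vectors make every pair of distinct words orthogonal, so no similarity can be read off them:

```python
import numpy as np

vocab = ["King", "Queen", "Man", "Woman", "Child"]
# One-hot encode each word: a 5-dim vector with a single 1
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

print(one_hot["Queen"])  # [0. 1. 0. 0. 0.]
# Dot product between any two distinct words is 0 -> no notion of similarity
print(one_hot["King"] @ one_hot["Queen"])  # 0.0
```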
WHAT IS A WORD VECTOR?
• Word2Vec’s distributed representations
can help encode various aspects of a
word
• Aspects are represented by elements of
the vector and can help define a word
• Aspects can represent things like royalty,
gender or age in our “little language”
WHAT IS A WORD VECTOR?
• It has been observed that we can
perform simple algebraic operations on
the word vectors.
• We can remove the masculinity aspect
from King by performing the vector
operation: vector(“King”) – vector(“Man”)
• Following that we can add
vector(“Woman”) to the result from
above and obtain a vector that will be
closest to the vector representation of
the word Queen.
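The King − Man + Woman analogy can be sketched with invented toy vectors (the three dimensions here, loosely royalty/masculinity/age, are made up for illustration; real Word2Vec dimensions are not interpretable like this):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy distributed representations (invented): [royalty, masculinity, age]
vectors = {
    "King":  np.array([0.95, 0.90, 0.70]),
    "Queen": np.array([0.95, 0.05, 0.70]),
    "Man":   np.array([0.10, 0.90, 0.60]),
    "Woman": np.array([0.10, 0.05, 0.60]),
    "Child": np.array([0.05, 0.50, 0.05]),
}

# Remove the masculinity aspect from King, add Woman's aspects back
result = vectors["King"] - vectors["Man"] + vectors["Woman"]

# The nearest vocabulary word to the resulting vector
best = max(vectors, key=lambda w: cosine(result, vectors[w]))
print(best)  # Queen
```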
SYSTEM ARCHITECTURE
• Inputs: Movie review dataset (50K reviews) and Amazon laptop review dataset (40K+ reviews)
• Cleanup and pre-processing
• Path 1: Word and Document embeddings → Logistic regression, XGBoost, SVM, etc.
• Path 2: Keras word embeddings → Keras CNN, Keras Bidirectional LSTM
DATA OVERVIEW

| Data source | Columns | Reviews: Total | Positive | Negative | Max review: Positive | Negative | Comments |
|---|---|---|---|---|---|---|---|
| Amazon laptop reviews | Review, Rating | 40K+ | ~30K | ~10K | 20K | 3K | Distribution mass should be covered by 900 to 1000 characters |
| IMDb movie reviews | Review, Rating | 50K | 25K | 25K | 14K | 5K | Distribution mass should be covered by 1400 to 1500 characters |
DATA EXPLORATION: REVIEW LENGTH, LAPTOP REVIEWS
Laptop dataset review
lengths
DATA EXPLORATION: REVIEW LENGTH, MOVIE REVIEWS
Movie dataset review
lengths
DATA EXPLORATION: REVIEW LENGTH
Movie dataset review
lengths
Laptop dataset review
lengths
DATA EXPLORATION: REVIEW COUNT
Laptop dataset review count
Movie dataset review count
TEXT PRE-PROCESSING
• Input: review string
• Remove: HTML tags, URLs
• Convert: to lowercase
• Split: into words
• Remove: punctuation, empty strings, stop-words
• Return: concatenated tokens as a sentence
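The steps above can be sketched as a single cleaning function. This is a minimal stand-in, not the project's actual code: the regexes are simple approximations and the tiny stop-word set here stands in for a full list such as nltk's:

```python
import re
import string

# Tiny illustrative stop-word list (a real pipeline would use e.g. nltk's full set)
STOP_WORDS = {"a", "an", "the", "is", "was", "it", "this", "and", "of", "to"}

def preprocess(review: str) -> str:
    """Clean a raw review string following the steps listed above."""
    text = re.sub(r"<[^>]+>", " ", review)                 # remove HTML tags
    text = re.sub(r"https?://\S+", " ", text)              # remove URLs
    text = text.lower()                                    # convert to lowercase
    words = text.split()                                   # split into words
    words = [w.strip(string.punctuation) for w in words]   # strip punctuation
    words = [w for w in words if w and w not in STOP_WORDS]  # drop empties, stop-words
    return " ".join(words)                                 # concatenated tokens

print(preprocess("This laptop is <b>great</b>! See https://example.com"))
# -> "laptop great see"
```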
EXPLORING DATASET
Looking at high frequency words in each dataset using
nltk.FreqDist()
WORD2VEC: CONTINUOUS BAG-OF-WORDS
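The CBOW idea — predict a center word from the average of its context-word embeddings — can be shown with a minimal numpy forward pass (sizes and random weights invented for illustration; real training would backpropagate through these matrices):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["king", "queen", "man", "woman", "child"]
V, D = len(vocab), 3             # vocabulary size, embedding dimension
W_in = rng.normal(size=(V, D))   # input (context) embedding matrix
W_out = rng.normal(size=(D, V))  # output (prediction) weight matrix

def cbow_forward(context_ids):
    """Average the context word embeddings, then score every vocabulary word."""
    h = W_in[context_ids].mean(axis=0)             # hidden layer = mean of contexts
    scores = h @ W_out                             # one score per vocabulary word
    probs = np.exp(scores) / np.exp(scores).sum()  # softmax over the vocabulary
    return probs

# Probability distribution over the center word, given context ["king", "woman"]
probs = cbow_forward([vocab.index("king"), vocab.index("woman")])
print(probs.shape)  # (5,) -- sums to 1
```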
WORD2VEC CBOW MODEL EVALUATION
POSITIVE AND NEGATIVE REVIEW WORD CLOUDS
• Amazon dataset • IMDb movie review dataset
The positive-review cloud has a white background; the negative-review cloud has a black background
SENTIMENT ANALYSIS
• Word2Vec Word Embedding Based Sentiment Analysis using
LogisticRegression
• Word2Vec Word Embedding Based Sentiment Analysis using SVC
• Word2Vec Word Embedding Based Sentiment Analysis using
XGBClassifier
KERAS CONVOLUTIONAL NEURAL NETWORK
• Sentiment Analysis using Keras
Convolutional Neural
Networks (CNN)
• Sentiment Analysis using Pre-
trained Word2Vec Word
Embeddings with Keras CNN
BIDIRECTIONAL LONG SHORT-TERM MEMORY
• Sentiment Analysis using Pre-
trained Word2Vec Word
Embeddings with Keras CNN and
Bidirectional LSTM
• Sentiment Analysis using Pre-
trained Word2Vec Word
Embeddings with Keras
Bidirectional LSTM
DOC2VEC: DISTRIBUTED BAG-OF-WORDS
• Doc2Vec DBOW Based Sentiment
Analysis using LogisticRegression
• Doc2Vec DBOW Based Sentiment
Analysis using SVC
• Doc2Vec DBOW Based Sentiment
Analysis using XGBClassifier
DOC2VEC: DISTRIBUTED MEMORY
• DM (Concatenated)
• Doc2Vec DMC Based Sentiment Analysis using
LogisticRegression
• Doc2Vec DMC Based Sentiment Analysis using
SVC
• Doc2Vec DMC Based Sentiment Analysis using
XGBClassifier
• DM (Mean)
• Doc2Vec DMM Based Sentiment Analysis using
LogisticRegression
• Doc2Vec DMM Based Sentiment Analysis using
SVC
• Doc2Vec DMM Based Sentiment Analysis using
XGBClassifier
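The difference between the two Distributed Memory variants is how the document vector is combined with the context word vectors: DMC concatenates them, DMM averages them. A toy numpy sketch (vector values invented):

```python
import numpy as np

doc_vec = np.array([0.1, 0.2, 0.3])   # paragraph (document) vector
ctx = np.array([[0.4, 0.5, 0.6],      # context word vectors
                [0.7, 0.8, 0.9]])

# DMC: concatenate the document vector with every context word vector
h_dmc = np.concatenate([doc_vec] + list(ctx))   # length = (1 + n_ctx) * dim

# DMM: average the document vector together with the context word vectors
h_dmm = np.vstack([doc_vec, ctx]).mean(axis=0)  # length = dim

print(h_dmc.shape, h_dmm.shape)  # (9,) (3,)
```

The DMC hidden layer grows with the window size (and is sensitive to word order), while the DMM hidden layer stays at the embedding dimension.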
DOC2VEC: DBOW + DMC AND DBOW + DMM
• DBOW + DMC
• Doc2Vec DBOW + DMC Based Sentiment Analysis using LogisticRegression
• Doc2Vec DBOW + DMC Based Sentiment Analysis using SVC
• Doc2Vec DBOW + DMC Based Sentiment Analysis using XGBClassifier
• Doc2Vec DBOW + DMC Based Sentiment Analysis using Keras Neural Network
• DBOW + DMM
• Doc2Vec DBOW + DMM Based Sentiment Analysis using LogisticRegression
• Doc2Vec DBOW + DMM Based Sentiment Analysis using SVC
• Doc2Vec DBOW + DMM Based Sentiment Analysis using XGBClassifier
• Doc2Vec DBOW + DMM Based Sentiment Analysis using Keras Neural Network
RESULTS: WORD2VEC

Accuracy on IMDB movie reviews and Amazon laptop reviews:

| Model | IMDB Train | IMDB Validation | IMDB Test | Amazon Train | Amazon Validation | Amazon Test |
|---|---|---|---|---|---|---|
| LogisticRegression | .8753 | .8692 | .8706 | .9046 | .9060 | .9009 |
| SVC with linear kernel | .8755 | .8714 | .8690 | .9050 | .9089 | .9001 |
| XGBClassifier | .8695 | .8540 | .8506 | .9058 | .9057 | .8967 |
| Keras Convolutional Neural Networks (CNN) | .9994 | .8770 | .8788 | .9992 | .9239 | .9146 |
| Pre-trained Word2Vec Word Embedding + Keras CNN | .9593 | .8444 | .8268 | .9690 | .8969 | .8891 |
| Pre-trained Word2Vec Word Embedding + Keras CNN and Bidirectional LSTM | .9356 | .8854 | .8902 | .9567 | .9212 | .9144 |
| Pre-trained Word2Vec Word Embedding + Keras Bidirectional LSTM | .9011 | .8800 | .8786 | .9418 | .9205 | .9229 |
RESULTS: DOC2VEC SIMPLE MODELS

Accuracy on IMDB movie reviews and Amazon laptop reviews:

| Model | IMDB Train | IMDB Validation | IMDB Test | Amazon Train | Amazon Validation | Amazon Test |
|---|---|---|---|---|---|---|
| Doc2Vec DBOW + LogisticRegression | .8732 | .8738 | .8778 | .8982 | .9094 | .8955 |
| Doc2Vec DBOW + SVC with linear kernel | .8738 | .8742 | .8766 | .8985 | .9072 | .8969 |
| Doc2Vec DBOW + XGBClassifier | .8682 | .8500 | .8556 | .9008 | .8964 | .8849 |
| Doc2Vec DMC + LogisticRegression | .5939 | .5862 | .5986 | .8088 | .8137 | .7961 |
| Doc2Vec DMC + SVC with linear kernel | .5933 | .5864 | .5936 | .8086 | .8130 | .7956 |
| Doc2Vec DMC + XGBClassifier | .6214 | .5952 | .5958 | .8137 | .8193 | .7980 |
| Doc2Vec DMM + LogisticRegression | .8187 | .8196 | .8174 | .8472 | .8571 | .8307 |
| Doc2Vec DMM + SVC with linear kernel | .8193 | .8184 | .8190 | .8428 | .8554 | .8280 |
| Doc2Vec DMM + XGBClassifier | .8115 | .7858 | .7920 | .8494 | .8525 | .8282 |
RESULTS: DOC2VEC MODEL COMBOS

Accuracy on IMDB movie reviews and Amazon laptop reviews:

| Model | IMDB Train | IMDB Validation | IMDB Test | Amazon Train | Amazon Validation | Amazon Test |
|---|---|---|---|---|---|---|
| DBOW + DMC + LogisticRegression | .8741 | .8738 | .8778 | .8982 | .9099 | .8940 |
| DBOW + DMC + SVC with linear kernel | .8751 | .8744 | .8790 | .8990 | .9072 | .8977 |
| DBOW + DMC + XGBClassifier | .8692 | .8510 | .8574 | .9000 | .8969 | .8849 |
| DBOW + DMM + LogisticRegression | .8813 | .8742 | .8804 | .9024 | .9092 | .8994 |
| DBOW + DMM + SVC with linear kernel | .8809 | .8756 | .8786 | .9033 | .9092 | .9001 |
| DBOW + DMM + XGBClassifier | .8736 | .8548 | .8590 | .9016 | .8986 | .8854 |
| DBOW + DMC document embeddings + Keras Neural Network | .9088 | .8742 | .8746 | .9312 | .9040 | .9033 |
| DBOW + DMM document embeddings + Keras Neural Network | .9295 | .8722 | .8720 | .9475 | .9156 | .8984 |
CONFUSION MATRIX: WORD2VEC

Confusion matrix with accuracy, IMDB movie reviews and Amazon laptop reviews:

| Model | IMDB TN | FP | FN | TP | Acc. | Amazon TN | FP | FN | TP | Acc. |
|---|---|---|---|---|---|---|---|---|---|---|
| LR | 2185 | 344 | 303 | 2168 | 0.87 | 554 | 279 | 125 | 3117 | 0.90 |
| SVC | 2182 | 347 | 308 | 2163 | 0.87 | 557 | 276 | 131 | 3111 | 0.90 |
| XGBClassifier | 2127 | 402 | 345 | 2126 | 0.85 | 535 | 298 | 123 | 3119 | 0.90 |
| Keras Convolutional Neural Networks (CNN) | 2302 | 227 | 379 | 2092 | 0.88 | 591 | 242 | 106 | 3136 | 0.91 |
| Pre-trained Word2Vec Word Embedding + Keras CNN | 2110 | 419 | 447 | 2024 | 0.83 | 571 | 262 | 190 | 3052 | 0.89 |
| Pre-trained Word2Vec Word Embedding + Keras CNN and Bidirectional LSTM | 2208 | 321 | 228 | 2243 | 0.89 | 615 | 218 | 131 | 3111 | 0.91 |
| Pre-trained Word2Vec Word Embedding + Keras Bidirectional LSTM | 2176 | 353 | 254 | 2217 | 0.88 | 652 | 181 | 133 | 3109 | 0.92 |
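Each accuracy figure follows directly from the four confusion-matrix cells in its row; a quick check, using the Word2Vec LogisticRegression/IMDB cells (TN=2185, FP=344, FN=303, TP=2168):

```python
def accuracy(tn: int, fp: int, fn: int, tp: int) -> float:
    """Accuracy = correct predictions / all predictions."""
    return (tn + tp) / (tn + fp + fn + tp)

# Word2Vec + LogisticRegression on IMDB
print(round(accuracy(2185, 344, 303, 2168), 4))  # 0.8706
```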
CONFUSION MATRIX: DOC2VEC SIMPLE MODELS

Confusion matrix with accuracy, IMDB movie reviews and Amazon laptop reviews:

| Model | IMDB TN | FP | FN | TP | Acc. | Amazon TN | FP | FN | TP | Acc. |
|---|---|---|---|---|---|---|---|---|---|---|
| DBOW + LogisticRegression | 2205 | 324 | 287 | 2184 | 0.88 | 530 | 303 | 123 | 3119 | 0.90 |
| DBOW + SVC | 2204 | 325 | 292 | 2179 | 0.88 | 533 | 300 | 120 | 3122 | 0.90 |
| DBOW + XGBClassifier | 2152 | 377 | 345 | 2126 | 0.86 | 467 | 366 | 103 | 3139 | 0.88 |
| DMC + LogisticRegression | 1586 | 943 | 1064 | 1407 | 0.60 | 2 | 831 | 0 | 3242 | 0.80 |
| DMC + SVC | 1654 | 875 | 1157 | 1314 | 0.59 | 0 | 833 | 0 | 3242 | 0.80 |
| DMC + XGBClassifier | 1452 | 1077 | 944 | 1527 | 0.60 | 28 | 805 | 18 | 3224 | 0.80 |
| DMM + LogisticRegression | 2064 | 465 | 448 | 2023 | 0.82 | 246 | 587 | 103 | 3139 | 0.83 |
| DMM + SVC | 2062 | 467 | 438 | 2033 | 0.82 | 201 | 632 | 69 | 3173 | 0.83 |
| DMM + XGBClassifier | 1981 | 548 | 492 | 1979 | 0.79 | 193 | 640 | 60 | 3182 | 0.83 |
CONFUSION MATRIX: DOC2VEC MODEL COMBOS

Confusion matrix with accuracy, IMDB movie reviews and Amazon laptop reviews:

| Model | IMDB TN | FP | FN | TP | Acc. | Amazon TN | FP | FN | TP | Acc. |
|---|---|---|---|---|---|---|---|---|---|---|
| DBOW + DMC + LR | 2204 | 325 | 286 | 2185 | 0.88 | 524 | 309 | 123 | 3119 | 0.89 |
| DBOW + DMC + SVC | 2210 | 319 | 286 | 2185 | 0.88 | 529 | 304 | 113 | 3129 | 0.90 |
| DBOW + DMC + XGBClassifier | 2164 | 365 | 348 | 2123 | 0.86 | 472 | 361 | 108 | 3134 | 0.88 |
| DBOW + DMM + LR | 2206 | 323 | 275 | 2196 | 0.88 | 538 | 295 | 115 | 3127 | 0.90 |
| DBOW + DMM + SVC | 2209 | 320 | 287 | 2184 | 0.88 | 538 | 295 | 112 | 3130 | 0.90 |
| DBOW + DMM + XGBClassifier | 2166 | 363 | 342 | 2129 | 0.86 | 465 | 368 | 99 | 3143 | 0.89 |
| DBOW + DMC and Keras Neural Network | 2187 | 342 | 285 | 2186 | 0.87 | 564 | 269 | 125 | 3117 | 0.90 |
| DBOW + DMM and Keras Neural Network | 2142 | 387 | 253 | 2218 | 0.87 | 572 | 261 | 153 | 3089 | 0.90 |
CONCLUSION
Keras Bidirectional LSTM and Keras CNN + Bidirectional LSTM (best performers)
• with pre-trained Word2Vec word embeddings
Keras CNN
• with Tokenizer-based Keras word embeddings
Word2Vec
• LogisticRegression > SVC > XGBClassifier
Keras CNN
• with pre-trained Word2Vec word embeddings
Word2Vec > Doc2Vec
• DBOW > DMM > DMC; also DBOW + DMC > DMC and DBOW + DMM > DMM
• DBOW + DMM > DBOW + DMC
CONCLUSION
A Bidirectional LSTM is a Recurrent Neural Network (RNN). RNNs
have the advantage of being able to persist information: the
network considers current inputs as well as previously received
inputs. Hence, it works really well with sequence data like text,
time series, videos, DNA sequences, etc.
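How a recurrent state persists information, and how a bidirectional layer combines two passes, can be sketched in a few lines of numpy (weights and sizes invented; a real Keras Bidirectional LSTM uses gated cells and separate weights per direction):

```python
import numpy as np

def rnn_step(h, x, W_h, W_x):
    """One recurrent step: the new state mixes the previous state with the input."""
    return np.tanh(W_h @ h + W_x @ x)

rng = np.random.default_rng(1)
W_h, W_x = rng.normal(size=(4, 4)), rng.normal(size=(4, 2))
seq = [rng.normal(size=2) for _ in range(5)]   # a length-5 input sequence

# Forward pass: the hidden state carries information from earlier inputs
h, forward = np.zeros(4), []
for x in seq:
    h = rnn_step(h, x, W_h, W_x)
    forward.append(h)

# Backward pass over the reversed sequence
h, backward = np.zeros(4), []
for x in reversed(seq):
    h = rnn_step(h, x, W_h, W_x)
    backward.append(h)
backward.reverse()

# A bidirectional layer concatenates both directions at each time step
bi = [np.concatenate([f, b]) for f, b in zip(forward, backward)]
print(bi[0].shape)  # (8,)
```

At every position the concatenated state sees context from both the past and the future of the sequence, which is why the bidirectional variants performed best here.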

More Related Content

What's hot

Performance Tipping Points - Hitting Hardware Bottlenecks
Performance Tipping Points - Hitting Hardware BottlenecksPerformance Tipping Points - Hitting Hardware Bottlenecks
Performance Tipping Points - Hitting Hardware Bottlenecks
MongoDB
 
MongoDB World 2018: Enterprise Security in the Cloud
MongoDB World 2018: Enterprise Security in the CloudMongoDB World 2018: Enterprise Security in the Cloud
MongoDB World 2018: Enterprise Security in the Cloud
MongoDB
 
iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...
iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...
iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...
DataStax Academy
 
Scaling opensimulator inventory using nosql
Scaling opensimulator inventory using nosqlScaling opensimulator inventory using nosql
Scaling opensimulator inventory using nosql
David Daeschler
 
SparkSQL et Cassandra - Tool In Action Devoxx 2015
 SparkSQL et Cassandra - Tool In Action Devoxx 2015 SparkSQL et Cassandra - Tool In Action Devoxx 2015
SparkSQL et Cassandra - Tool In Action Devoxx 2015
Alexander DEJANOVSKI
 
World’s Best Data Modeling Tool
World’s Best Data Modeling ToolWorld’s Best Data Modeling Tool
World’s Best Data Modeling Tool
Artem Chebotko
 
Sizing MongoDB Clusters
Sizing MongoDB Clusters Sizing MongoDB Clusters
Sizing MongoDB Clusters
MongoDB
 
Workshop on Advanced Design Patterns for Amazon DynamoDB - DAT405 - re:Invent...
Workshop on Advanced Design Patterns for Amazon DynamoDB - DAT405 - re:Invent...Workshop on Advanced Design Patterns for Amazon DynamoDB - DAT405 - re:Invent...
Workshop on Advanced Design Patterns for Amazon DynamoDB - DAT405 - re:Invent...
Amazon Web Services
 

What's hot (8)

Performance Tipping Points - Hitting Hardware Bottlenecks
Performance Tipping Points - Hitting Hardware BottlenecksPerformance Tipping Points - Hitting Hardware Bottlenecks
Performance Tipping Points - Hitting Hardware Bottlenecks
 
MongoDB World 2018: Enterprise Security in the Cloud
MongoDB World 2018: Enterprise Security in the CloudMongoDB World 2018: Enterprise Security in the Cloud
MongoDB World 2018: Enterprise Security in the Cloud
 
iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...
iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...
iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...
 
Scaling opensimulator inventory using nosql
Scaling opensimulator inventory using nosqlScaling opensimulator inventory using nosql
Scaling opensimulator inventory using nosql
 
SparkSQL et Cassandra - Tool In Action Devoxx 2015
 SparkSQL et Cassandra - Tool In Action Devoxx 2015 SparkSQL et Cassandra - Tool In Action Devoxx 2015
SparkSQL et Cassandra - Tool In Action Devoxx 2015
 
World’s Best Data Modeling Tool
World’s Best Data Modeling ToolWorld’s Best Data Modeling Tool
World’s Best Data Modeling Tool
 
Sizing MongoDB Clusters
Sizing MongoDB Clusters Sizing MongoDB Clusters
Sizing MongoDB Clusters
 
Workshop on Advanced Design Patterns for Amazon DynamoDB - DAT405 - re:Invent...
Workshop on Advanced Design Patterns for Amazon DynamoDB - DAT405 - re:Invent...Workshop on Advanced Design Patterns for Amazon DynamoDB - DAT405 - re:Invent...
Workshop on Advanced Design Patterns for Amazon DynamoDB - DAT405 - re:Invent...
 

Similar to Comparison of Word2vec and Doc2Vec model driven Sentiment Analysis using SVM, LR, Keras CNN, Bidirectional LSTM with and without pre-trained Word and Document Embeddings

Dragon flow neutron lightning talk
Dragon flow neutron lightning talkDragon flow neutron lightning talk
Dragon flow neutron lightning talk
Eran Gampel
 
Dragonflow Austin Summit Talk
Dragonflow Austin Summit Talk Dragonflow Austin Summit Talk
Dragonflow Austin Summit Talk
Eran Gampel
 
A Technical Deep Dive on Protecting Acropolis Workloads with Rubrik
A Technical Deep Dive on Protecting Acropolis Workloads with RubrikA Technical Deep Dive on Protecting Acropolis Workloads with Rubrik
A Technical Deep Dive on Protecting Acropolis Workloads with Rubrik
NEXTtour
 
Day 4 - Cloud Migration - But How?
Day 4 - Cloud Migration - But How?Day 4 - Cloud Migration - But How?
Day 4 - Cloud Migration - But How?
Amazon Web Services
 
ARC209_A Day in the Life of A Netflix Engineer
ARC209_A Day in the Life of A Netflix EngineerARC209_A Day in the Life of A Netflix Engineer
ARC209_A Day in the Life of A Netflix Engineer
Amazon Web Services
 
Discovering Your AI Super Powers - Tips and Tricks to Jumpstart your AI Projects
Discovering Your AI Super Powers - Tips and Tricks to Jumpstart your AI ProjectsDiscovering Your AI Super Powers - Tips and Tricks to Jumpstart your AI Projects
Discovering Your AI Super Powers - Tips and Tricks to Jumpstart your AI Projects
Wee Hyong Tok
 
Высокопроизводительный инференс глубоких сетей на GPU с помощью TensorRT / Ма...
Высокопроизводительный инференс глубоких сетей на GPU с помощью TensorRT / Ма...Высокопроизводительный инференс глубоких сетей на GPU с помощью TensorRT / Ма...
Высокопроизводительный инференс глубоких сетей на GPU с помощью TensorRT / Ма...
Ontico
 
Introduction to Apache Cassandra™ + What’s New in 4.0
Introduction to Apache Cassandra™ + What’s New in 4.0Introduction to Apache Cassandra™ + What’s New in 4.0
Introduction to Apache Cassandra™ + What’s New in 4.0
DataStax
 
AWS Cloud Experience CA: Bases de Datos en AWS: distintas necesidades, distin...
AWS Cloud Experience CA: Bases de Datos en AWS: distintas necesidades, distin...AWS Cloud Experience CA: Bases de Datos en AWS: distintas necesidades, distin...
AWS Cloud Experience CA: Bases de Datos en AWS: distintas necesidades, distin...
Amazon Web Services LATAM
 
FULLTEXT02
FULLTEXT02FULLTEXT02
FULLTEXT02
Sathvik Katam
 
1000 node Cassandra cluster on Amazon's EKS? - Matt Overstreet (DoK Day EU 2022)
1000 node Cassandra cluster on Amazon's EKS? - Matt Overstreet (DoK Day EU 2022)1000 node Cassandra cluster on Amazon's EKS? - Matt Overstreet (DoK Day EU 2022)
1000 node Cassandra cluster on Amazon's EKS? - Matt Overstreet (DoK Day EU 2022)
DoKC
 
1000 node Cassandra cluster on Amazon's EKS?
1000 node Cassandra cluster on Amazon's EKS?1000 node Cassandra cluster on Amazon's EKS?
1000 node Cassandra cluster on Amazon's EKS?
DoKC
 
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of Things
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of ThingsDay 4 - Big Data on AWS - RedShift, EMR & the Internet of Things
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of Things
Amazon Web Services
 
OpenStack Dragonflow shenzhen and Hangzhou meetups
OpenStack Dragonflow shenzhen and Hangzhou  meetupsOpenStack Dragonflow shenzhen and Hangzhou  meetups
OpenStack Dragonflow shenzhen and Hangzhou meetups
Eran Gampel
 
AWS re:Invent 2017 Recap - Solutions Updates
AWS re:Invent 2017 Recap - Solutions UpdatesAWS re:Invent 2017 Recap - Solutions Updates
AWS re:Invent 2017 Recap - Solutions Updates
Amazon Web Services
 
[GeeCON2024] How I learned to stop worrying and love the dark silicon apocalypse
[GeeCON2024] How I learned to stop worrying and love the dark silicon apocalypse[GeeCON2024] How I learned to stop worrying and love the dark silicon apocalypse
[GeeCON2024] How I learned to stop worrying and love the dark silicon apocalypse
Tomasz Kowalczewski
 
Return on Ignite 2019: Azure, .NET, A.I. & Data
Return on Ignite 2019: Azure, .NET, A.I. & DataReturn on Ignite 2019: Azure, .NET, A.I. & Data
Return on Ignite 2019: Azure, .NET, A.I. & Data
MSDEVMTL
 
Dissolving the Problem (Making an ACID-Compliant Database Out of Apache Kafka®)
Dissolving the Problem (Making an ACID-Compliant Database Out of Apache Kafka®)Dissolving the Problem (Making an ACID-Compliant Database Out of Apache Kafka®)
Dissolving the Problem (Making an ACID-Compliant Database Out of Apache Kafka®)
confluent
 
AWS reInvent 2018 Recap - Solutions Updates Part 2
AWS reInvent 2018 Recap - Solutions Updates Part 2AWS reInvent 2018 Recap - Solutions Updates Part 2
AWS reInvent 2018 Recap - Solutions Updates Part 2
Amazon Web Services
 
Using MongoDB to Build a Fast and Scalable Content Repository
Using MongoDB to Build a Fast and Scalable Content RepositoryUsing MongoDB to Build a Fast and Scalable Content Repository
Using MongoDB to Build a Fast and Scalable Content Repository
MongoDB
 

Similar to Comparison of Word2vec and Doc2Vec model driven Sentiment Analysis using SVM, LR, Keras CNN, Bidirectional LSTM with and without pre-trained Word and Document Embeddings (20)

Dragon flow neutron lightning talk
Dragon flow neutron lightning talkDragon flow neutron lightning talk
Dragon flow neutron lightning talk
 
Dragonflow Austin Summit Talk
Dragonflow Austin Summit Talk Dragonflow Austin Summit Talk
Dragonflow Austin Summit Talk
 
A Technical Deep Dive on Protecting Acropolis Workloads with Rubrik
A Technical Deep Dive on Protecting Acropolis Workloads with RubrikA Technical Deep Dive on Protecting Acropolis Workloads with Rubrik
A Technical Deep Dive on Protecting Acropolis Workloads with Rubrik
 
Day 4 - Cloud Migration - But How?
Day 4 - Cloud Migration - But How?Day 4 - Cloud Migration - But How?
Day 4 - Cloud Migration - But How?
 
ARC209_A Day in the Life of A Netflix Engineer
ARC209_A Day in the Life of A Netflix EngineerARC209_A Day in the Life of A Netflix Engineer
ARC209_A Day in the Life of A Netflix Engineer
 
Discovering Your AI Super Powers - Tips and Tricks to Jumpstart your AI Projects
Discovering Your AI Super Powers - Tips and Tricks to Jumpstart your AI ProjectsDiscovering Your AI Super Powers - Tips and Tricks to Jumpstart your AI Projects
Discovering Your AI Super Powers - Tips and Tricks to Jumpstart your AI Projects
 
Высокопроизводительный инференс глубоких сетей на GPU с помощью TensorRT / Ма...
Высокопроизводительный инференс глубоких сетей на GPU с помощью TensorRT / Ма...Высокопроизводительный инференс глубоких сетей на GPU с помощью TensorRT / Ма...
Высокопроизводительный инференс глубоких сетей на GPU с помощью TensorRT / Ма...
 
Introduction to Apache Cassandra™ + What’s New in 4.0
Introduction to Apache Cassandra™ + What’s New in 4.0Introduction to Apache Cassandra™ + What’s New in 4.0
Introduction to Apache Cassandra™ + What’s New in 4.0
 
AWS Cloud Experience CA: Bases de Datos en AWS: distintas necesidades, distin...
AWS Cloud Experience CA: Bases de Datos en AWS: distintas necesidades, distin...AWS Cloud Experience CA: Bases de Datos en AWS: distintas necesidades, distin...
AWS Cloud Experience CA: Bases de Datos en AWS: distintas necesidades, distin...
 
FULLTEXT02
FULLTEXT02FULLTEXT02
FULLTEXT02
 
1000 node Cassandra cluster on Amazon's EKS? - Matt Overstreet (DoK Day EU 2022)
1000 node Cassandra cluster on Amazon's EKS? - Matt Overstreet (DoK Day EU 2022)1000 node Cassandra cluster on Amazon's EKS? - Matt Overstreet (DoK Day EU 2022)
1000 node Cassandra cluster on Amazon's EKS? - Matt Overstreet (DoK Day EU 2022)
 
1000 node Cassandra cluster on Amazon's EKS?
1000 node Cassandra cluster on Amazon's EKS?1000 node Cassandra cluster on Amazon's EKS?
1000 node Cassandra cluster on Amazon's EKS?
 
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of Things
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of ThingsDay 4 - Big Data on AWS - RedShift, EMR & the Internet of Things
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of Things
 
OpenStack Dragonflow shenzhen and Hangzhou meetups
OpenStack Dragonflow shenzhen and Hangzhou  meetupsOpenStack Dragonflow shenzhen and Hangzhou  meetups
OpenStack Dragonflow shenzhen and Hangzhou meetups
 
AWS re:Invent 2017 Recap - Solutions Updates
AWS re:Invent 2017 Recap - Solutions UpdatesAWS re:Invent 2017 Recap - Solutions Updates
AWS re:Invent 2017 Recap - Solutions Updates
 
[GeeCON2024] How I learned to stop worrying and love the dark silicon apocalypse
[GeeCON2024] How I learned to stop worrying and love the dark silicon apocalypse[GeeCON2024] How I learned to stop worrying and love the dark silicon apocalypse
[GeeCON2024] How I learned to stop worrying and love the dark silicon apocalypse
 
Return on Ignite 2019: Azure, .NET, A.I. & Data
Return on Ignite 2019: Azure, .NET, A.I. & DataReturn on Ignite 2019: Azure, .NET, A.I. & Data
Return on Ignite 2019: Azure, .NET, A.I. & Data
 
Dissolving the Problem (Making an ACID-Compliant Database Out of Apache Kafka®)
Dissolving the Problem (Making an ACID-Compliant Database Out of Apache Kafka®)Dissolving the Problem (Making an ACID-Compliant Database Out of Apache Kafka®)
Dissolving the Problem (Making an ACID-Compliant Database Out of Apache Kafka®)
 
AWS reInvent 2018 Recap - Solutions Updates Part 2
AWS reInvent 2018 Recap - Solutions Updates Part 2AWS reInvent 2018 Recap - Solutions Updates Part 2
AWS reInvent 2018 Recap - Solutions Updates Part 2
 
Using MongoDB to Build a Fast and Scalable Content Repository
Using MongoDB to Build a Fast and Scalable Content RepositoryUsing MongoDB to Build a Fast and Scalable Content Repository
Using MongoDB to Build a Fast and Scalable Content Repository
 

Recently uploaded

Walmart Business+ and Spark Good for Nonprofits.pdf
Walmart Business+ and Spark Good for Nonprofits.pdfWalmart Business+ and Spark Good for Nonprofits.pdf
Walmart Business+ and Spark Good for Nonprofits.pdf
TechSoup
 
LAND USE LAND COVER AND NDVI OF MIRZAPUR DISTRICT, UP
LAND USE LAND COVER AND NDVI OF MIRZAPUR DISTRICT, UPLAND USE LAND COVER AND NDVI OF MIRZAPUR DISTRICT, UP
LAND USE LAND COVER AND NDVI OF MIRZAPUR DISTRICT, UP
RAHUL
 
BÀI TẬP DẠY THÊM TIẾNG ANH LỚP 7 CẢ NĂM FRIENDS PLUS SÁCH CHÂN TRỜI SÁNG TẠO ...
BÀI TẬP DẠY THÊM TIẾNG ANH LỚP 7 CẢ NĂM FRIENDS PLUS SÁCH CHÂN TRỜI SÁNG TẠO ...BÀI TẬP DẠY THÊM TIẾNG ANH LỚP 7 CẢ NĂM FRIENDS PLUS SÁCH CHÂN TRỜI SÁNG TẠO ...
BÀI TẬP DẠY THÊM TIẾNG ANH LỚP 7 CẢ NĂM FRIENDS PLUS SÁCH CHÂN TRỜI SÁNG TẠO ...
Nguyen Thanh Tu Collection
 
Stack Memory Organization of 8086 Microprocessor
Stack Memory Organization of 8086 MicroprocessorStack Memory Organization of 8086 Microprocessor
Stack Memory Organization of 8086 Microprocessor
JomonJoseph58
 
SWOT analysis in the project Keeping the Memory @live.pptx
SWOT analysis in the project Keeping the Memory @live.pptxSWOT analysis in the project Keeping the Memory @live.pptx
SWOT analysis in the project Keeping the Memory @live.pptx
zuzanka
 
A Independência da América Espanhola LAPBOOK.pdf
A Independência da América Espanhola LAPBOOK.pdfA Independência da América Espanhola LAPBOOK.pdf
A Independência da América Espanhola LAPBOOK.pdf
Jean Carlos Nunes Paixão
 
Bonku-Babus-Friend by Sathyajith Ray (9)
Bonku-Babus-Friend by Sathyajith Ray  (9)Bonku-Babus-Friend by Sathyajith Ray  (9)
Bonku-Babus-Friend by Sathyajith Ray (9)
nitinpv4ai
 
Philippine Edukasyong Pantahanan at Pangkabuhayan (EPP) Curriculum
Philippine Edukasyong Pantahanan at Pangkabuhayan (EPP) CurriculumPhilippine Edukasyong Pantahanan at Pangkabuhayan (EPP) Curriculum
Philippine Edukasyong Pantahanan at Pangkabuhayan (EPP) Curriculum
MJDuyan
 
BIOLOGY NATIONAL EXAMINATION COUNCIL (NECO) 2024 PRACTICAL MANUAL.pptx
BIOLOGY NATIONAL EXAMINATION COUNCIL (NECO) 2024 PRACTICAL MANUAL.pptxBIOLOGY NATIONAL EXAMINATION COUNCIL (NECO) 2024 PRACTICAL MANUAL.pptx
BIOLOGY NATIONAL EXAMINATION COUNCIL (NECO) 2024 PRACTICAL MANUAL.pptx
RidwanHassanYusuf
 
Level 3 NCEA - NZ: A Nation In the Making 1872 - 1900 SML.ppt
Level 3 NCEA - NZ: A  Nation In the Making 1872 - 1900 SML.pptLevel 3 NCEA - NZ: A  Nation In the Making 1872 - 1900 SML.ppt
Level 3 NCEA - NZ: A Nation In the Making 1872 - 1900 SML.ppt
Henry Hollis
 
Electric Fetus - Record Store Scavenger Hunt
Electric Fetus - Record Store Scavenger HuntElectric Fetus - Record Store Scavenger Hunt
Electric Fetus - Record Store Scavenger Hunt
RamseyBerglund
 
HYPERTENSION - SLIDE SHARE PRESENTATION.
HYPERTENSION - SLIDE SHARE PRESENTATION.HYPERTENSION - SLIDE SHARE PRESENTATION.
HYPERTENSION - SLIDE SHARE PRESENTATION.
deepaannamalai16
 
RESULTS OF THE EVALUATION QUESTIONNAIRE.pptx
RESULTS OF THE EVALUATION QUESTIONNAIRE.pptxRESULTS OF THE EVALUATION QUESTIONNAIRE.pptx
RESULTS OF THE EVALUATION QUESTIONNAIRE.pptx
zuzanka
 
Benner "Expanding Pathways to Publishing Careers"
Benner "Expanding Pathways to Publishing Careers"Benner "Expanding Pathways to Publishing Careers"
Benner "Expanding Pathways to Publishing Careers"
National Information Standards Organization (NISO)
 
Temple of Asclepius in Thrace. Excavation results
Temple of Asclepius in Thrace. Excavation resultsTemple of Asclepius in Thrace. Excavation results
Temple of Asclepius in Thrace. Excavation results
Krassimira Luka
 
How to Setup Warehouse & Location in Odoo 17 Inventory
How to Setup Warehouse & Location in Odoo 17 InventoryHow to Setup Warehouse & Location in Odoo 17 Inventory
How to Setup Warehouse & Location in Odoo 17 Inventory
Celine George
 
Beyond Degrees - Empowering the Workforce in the Context of Skills-First.pptx
Beyond Degrees - Empowering the Workforce in the Context of Skills-First.pptxBeyond Degrees - Empowering the Workforce in the Context of Skills-First.pptx
Beyond Degrees - Empowering the Workforce in the Context of Skills-First.pptx
EduSkills OECD
 
Mule event processing models | MuleSoft Mysore Meetup #47
Mule event processing models | MuleSoft Mysore Meetup #47Mule event processing models | MuleSoft Mysore Meetup #47
Mule event processing models | MuleSoft Mysore Meetup #47
MysoreMuleSoftMeetup
 
BÀI TẬP BỔ TRỢ TIẾNG ANH LỚP 9 CẢ NĂM - GLOBAL SUCCESS - NĂM HỌC 2024-2025 - ...
BÀI TẬP BỔ TRỢ TIẾNG ANH LỚP 9 CẢ NĂM - GLOBAL SUCCESS - NĂM HỌC 2024-2025 - ...BÀI TẬP BỔ TRỢ TIẾNG ANH LỚP 9 CẢ NĂM - GLOBAL SUCCESS - NĂM HỌC 2024-2025 - ...
BÀI TẬP BỔ TRỢ TIẾNG ANH LỚP 9 CẢ NĂM - GLOBAL SUCCESS - NĂM HỌC 2024-2025 - ...
Nguyen Thanh Tu Collection
 
B. Ed Syllabus for babasaheb ambedkar education university.pdf
B. Ed Syllabus for babasaheb ambedkar education university.pdfB. Ed Syllabus for babasaheb ambedkar education university.pdf
B. Ed Syllabus for babasaheb ambedkar education university.pdf
BoudhayanBhattachari
 

Recently uploaded (20)

Walmart Business+ and Spark Good for Nonprofits.pdf
Walmart Business+ and Spark Good for Nonprofits.pdfWalmart Business+ and Spark Good for Nonprofits.pdf
Walmart Business+ and Spark Good for Nonprofits.pdf
 
LAND USE LAND COVER AND NDVI OF MIRZAPUR DISTRICT, UP
LAND USE LAND COVER AND NDVI OF MIRZAPUR DISTRICT, UPLAND USE LAND COVER AND NDVI OF MIRZAPUR DISTRICT, UP
LAND USE LAND COVER AND NDVI OF MIRZAPUR DISTRICT, UP
 
BÀI TẬP DẠY THÊM TIẾNG ANH LỚP 7 CẢ NĂM FRIENDS PLUS SÁCH CHÂN TRỜI SÁNG TẠO ...
BÀI TẬP DẠY THÊM TIẾNG ANH LỚP 7 CẢ NĂM FRIENDS PLUS SÁCH CHÂN TRỜI SÁNG TẠO ...BÀI TẬP DẠY THÊM TIẾNG ANH LỚP 7 CẢ NĂM FRIENDS PLUS SÁCH CHÂN TRỜI SÁNG TẠO ...
BÀI TẬP DẠY THÊM TIẾNG ANH LỚP 7 CẢ NĂM FRIENDS PLUS SÁCH CHÂN TRỜI SÁNG TẠO ...
 
Stack Memory Organization of 8086 Microprocessor
Stack Memory Organization of 8086 MicroprocessorStack Memory Organization of 8086 Microprocessor
Stack Memory Organization of 8086 Microprocessor
 
SWOT analysis in the project Keeping the Memory @live.pptx
SWOT analysis in the project Keeping the Memory @live.pptxSWOT analysis in the project Keeping the Memory @live.pptx
SWOT analysis in the project Keeping the Memory @live.pptx
 
A Independência da América Espanhola LAPBOOK.pdf
A Independência da América Espanhola LAPBOOK.pdfA Independência da América Espanhola LAPBOOK.pdf
A Independência da América Espanhola LAPBOOK.pdf
 
Bonku-Babus-Friend by Sathyajith Ray (9)
Bonku-Babus-Friend by Sathyajith Ray  (9)Bonku-Babus-Friend by Sathyajith Ray  (9)
Bonku-Babus-Friend by Sathyajith Ray (9)
 
Philippine Edukasyong Pantahanan at Pangkabuhayan (EPP) Curriculum
Philippine Edukasyong Pantahanan at Pangkabuhayan (EPP) CurriculumPhilippine Edukasyong Pantahanan at Pangkabuhayan (EPP) Curriculum
Philippine Edukasyong Pantahanan at Pangkabuhayan (EPP) Curriculum
 
BIOLOGY NATIONAL EXAMINATION COUNCIL (NECO) 2024 PRACTICAL MANUAL.pptx
BIOLOGY NATIONAL EXAMINATION COUNCIL (NECO) 2024 PRACTICAL MANUAL.pptxBIOLOGY NATIONAL EXAMINATION COUNCIL (NECO) 2024 PRACTICAL MANUAL.pptx
BIOLOGY NATIONAL EXAMINATION COUNCIL (NECO) 2024 PRACTICAL MANUAL.pptx
 
Level 3 NCEA - NZ: A Nation In the Making 1872 - 1900 SML.ppt
Level 3 NCEA - NZ: A  Nation In the Making 1872 - 1900 SML.pptLevel 3 NCEA - NZ: A  Nation In the Making 1872 - 1900 SML.ppt

Comparison of Word2vec and Doc2Vec model driven Sentiment Analysis using SVM, LR, Keras CNN, Bidirectional LSTM with and without pre-trained Word and Document Embeddings

  • 1. Sofia Dutta Data 602 – Spring 2019 Semester Project Comparing Word2vec, Doc2Vec model driven Sentiment Analysis using SVM, LR vs Keras CNN and Bidirectional LSTM with and without pre-trained Word and Document Embeddings
  • 3. WHAT IS A WORD VECTOR? • Assume a vocabulary has five words: King, Queen, Man, Woman, Child. • A one-hot vector doesn’t allow meaningful comparisons, i.e. no semantics are available • The solution is to use Word2Vec, which uses a distributed representation of words in vector space. One-hot encoded vector for ‘Queen’
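The limitation of one-hot encoding can be seen directly: every pair of distinct one-hot vectors is orthogonal, so the only comparison available is equality. A minimal numpy sketch for the five-word vocabulary above:

```python
import numpy as np

vocab = ["King", "Queen", "Man", "Woman", "Child"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

# Distinct one-hot vectors are orthogonal: the dot product carries no
# notion of similarity, only an equality test.
print(one_hot["King"] @ one_hot["Queen"])  # 0.0
print(one_hot["King"] @ one_hot["King"])   # 1.0
```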
  • 4. WHAT IS A WORD VECTOR? • Distributed representations of word2vec can help encode various aspects of a word • Aspects are represented by elements of the vector and can help define a word • Aspects can represent things like royalty, gender or age in our “little language”
  • 5. WHAT IS A WORD VECTOR? • It has been observed that we can perform simple algebraic operations on the word vectors. • We can remove the masculinity aspect from King by performing the vector operation: vector(“King”) – vector(“Man”) • Following that we can add vector(“Woman”) to the result from above and obtain a vector that will be closest to the vector representation of the word Queen.
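The King − Man + Woman ≈ Queen arithmetic can be illustrated with hand-made "aspect" vectors. The numbers below are purely illustrative, not trained embeddings; each element stands for a hypothetical aspect (royalty, masculinity, femininity, age):

```python
import numpy as np

# Illustrative aspect vectors: [royalty, masculinity, femininity, age]
vec = {
    "King":  np.array([0.99, 0.99, 0.05, 0.7]),
    "Queen": np.array([0.99, 0.05, 0.93, 0.6]),
    "Man":   np.array([0.01, 0.99, 0.03, 0.5]),
    "Woman": np.array([0.02, 0.01, 0.99, 0.5]),
    "Child": np.array([0.01, 0.30, 0.30, 0.1]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# vector("King") - vector("Man") + vector("Woman") lands nearest to Queen
target = vec["King"] - vec["Man"] + vec["Woman"]
closest = max(vec, key=lambda w: cosine(vec[w], target))
print(closest)  # Queen
```

With real Word2Vec embeddings the same query is answered by gensim's `most_similar(positive=["King", "Woman"], negative=["Man"])`.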
  • 7. SYSTEM ARCHITECTURE Movie review dataset (50K reviews), Amazon Laptop review dataset (40K+ reviews) → Cleanup and pre-processing → Word and Document embeddings → Logistic regression, XGBoost, SVM etc. | Keras word embeddings → Keras CNN, Keras Bidirectional LSTM
  • 8. DATA OVERVIEW
  Amazon laptop reviews: Columns: Review, Rating; Total: 40K+; Positive: ~30K; Negative: ~10K; Max review length: 20K characters (positive), 3K characters (negative); Comment: distribution mass should be covered by 900 to 1,000 characters
  IMDb movie reviews: Columns: Review, Rating; Total: 50K; Positive: 25K; Negative: 25K; Max review length: 14K characters (positive), 5K characters (negative); Comment: distribution mass should be covered by 1,400 to 1,500 characters
  • 9. DATA EXPLORATION: REVIEW LENGTH, LAPTOP REVIEWS Laptop dataset review lengths
  • 10. DATA EXPLORATION: REVIEW LENGTH, MOVIE REVIEWS Movie dataset review lengths
  • 11. DATA EXPLORATION: REVIEW LENGTH Movie dataset review lengths Laptop dataset review lengths
  • 12. DATA EXPLORATION: REVIEW COUNT Movie dataset review count, Laptop dataset review count
  • 13. TEXT PRE-PROCESSING Input • Review string Remove • HTML tags, URLs Convert • To lowercase Split • Into words Remove • Punctuations, Empty Strings, Stop-words Return • Concatenated tokens as a sentence
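The pre-processing pipeline above can be sketched as one function. This is a minimal sketch, not the project's exact code: the regexes, the small stand-in stop-word set, and the function name `clean_review` are assumptions (the project presumably used nltk's full stop-word list):

```python
import re
import string

STOP_WORDS = {"the", "a", "an", "is", "it", "this", "and"}  # stand-in for nltk's stopword list

def clean_review(text):
    text = re.sub(r"<[^>]+>", " ", text)       # remove HTML tags
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = text.lower()                        # convert to lowercase
    words = text.split()                       # split into words
    words = [w.strip(string.punctuation) for w in words]        # remove punctuation
    words = [w for w in words if w and w not in STOP_WORDS]     # drop empties, stop-words
    return " ".join(words)                     # return concatenated tokens as a sentence

print(clean_review("This laptop is <b>great</b>! See https://example.com"))
# laptop great see
```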
  • 14. EXPLORING DATASET Looking at high frequency words in each dataset using nltk.FreqDist()
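`nltk.FreqDist` is essentially a `collections.Counter` over tokens, so the high-frequency-word exploration can be reproduced with the standard library alone (the toy token list below is an assumption for illustration):

```python
from collections import Counter

# Counter mirrors nltk.FreqDist's term-frequency behaviour
tokens = "great laptop great screen bad battery great price".split()
freq = Counter(tokens)
print(freq.most_common(2))  # most frequent words first, e.g. ('great', 3)
```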
  • 16. WORD2VEC CBOW MODEL EVALUATION
  • 17. WORD2VEC CBOW MODEL EVALUATION
  • 18. WORD2VEC CBOW MODEL EVALUATION
  • 19. POSITIVE AND NEGATIVE REVIEW WORD CLOUDS • Amazon dataset • IMDb movie review dataset Positive cloud is with white background and negative cloud is with black background
  • 20. SENTIMENT ANALYSIS • Word2Vec Word Embedding Based Sentiment Analysis using LogisticRegression • Word2Vec Word Embedding Based Sentiment Analysis using SVC • Word2Vec Word Embedding Based Sentiment Analysis using XGBClassifier
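The Word2Vec-based classifiers above all follow the same recipe: represent each review as the mean of its word vectors, then fit a standard classifier. A minimal sketch with a toy hand-made embedding dict standing in for trained Word2Vec vectors (the 3-d vectors, vocabulary, and `doc_vector` helper are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy 3-d "word vectors" standing in for trained Word2Vec embeddings
emb = {"good":  np.array([ 1.0, 0.2, 0.0]),
       "great": np.array([ 0.9, 0.3, 0.1]),
       "bad":   np.array([-1.0, 0.1, 0.0]),
       "awful": np.array([-0.9, 0.2, 0.1])}

def doc_vector(tokens, emb, dim=3):
    vecs = [emb[t] for t in tokens if t in emb]  # skip out-of-vocabulary words
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

reviews = ["good great", "great good good", "bad awful", "awful bad bad"]
X = np.array([doc_vector(r.split(), emb) for r in reviews])
y = [1, 1, 0, 0]  # positive / negative labels

clf = LogisticRegression().fit(X, y)
print(clf.predict([doc_vector(["good"], emb)]))  # [1]
```

Swapping `LogisticRegression` for `SVC` or `XGBClassifier` changes only the last two lines.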
  • 21. KERAS CONVOLUTIONAL NEURAL NETWORK • Sentiment Analysis using Keras Convolutional Neural Networks (CNN) • Sentiment Analysis Using Pre-trained Word2Vec Word Embedding To Keras CNN
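A Keras CNN for sentiment classification typically stacks an embedding layer, 1-D convolutions, and global max pooling. This is a sketch, not the project's exact architecture; the vocabulary size, sequence length, and filter counts are placeholder assumptions:

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense

vocab_size, maxlen, embedding_dim = 5000, 100, 32  # placeholder sizes

model = Sequential([
    Embedding(vocab_size, embedding_dim),  # learned here; pre-trained Word2Vec vectors could be loaded instead
    Conv1D(64, 5, activation="relu"),      # n-gram-like feature detectors over the token sequence
    GlobalMaxPooling1D(),                  # keep the strongest activation per filter
    Dense(1, activation="sigmoid"),        # positive/negative probability
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# A batch of 2 padded, integer-encoded reviews
batch = np.random.randint(0, vocab_size, size=(2, maxlen))
preds = model.predict(batch, verbose=0)
print(preds.shape)  # (2, 1)
```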
  • 22. BIDIRECTIONAL LONG SHORT TERM MEMORY • Sentiment Analysis Using Pre-trained Word2Vec Word Embedding To Keras CNN And Bidirectional LSTM • Sentiment Analysis Using Pre-trained Word2Vec Word Embedding To Keras Bidirectional LSTM
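The bidirectional LSTM variant reads each review in both directions before classifying. Again a sketch with placeholder sizes, not the project's exact model:

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense

model = Sequential([
    Embedding(5000, 32),        # pre-trained Word2Vec vectors could be loaded into this layer
    Bidirectional(LSTM(64)),    # process the sequence left-to-right and right-to-left
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

preds = model.predict(np.random.randint(0, 5000, size=(2, 100)), verbose=0)
print(preds.shape)  # (2, 1)
```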
  • 23. DOC2VEC: DISTRIBUTED BAG-OF-WORDS • Doc2Vec DBOW Based Sentiment Analysis using LogisticRegression • Doc2Vec DBOW Based Sentiment Analysis using SVC • Doc2Vec DBOW Based Sentiment Analysis using XGBClassifier
  • 24. DOC2VEC: DISTRIBUTED MEMORY • DM (Concatenated) • Doc2Vec DMC Based Sentiment Analysis using LogisticRegression • Doc2Vec DMC Based Sentiment Analysis using SVC • Doc2Vec DMC Based Sentiment Analysis using XGBClassifier • DM (Mean) • Doc2Vec DMM Based Sentiment Analysis using LogisticRegression • Doc2Vec DMM Based Sentiment Analysis using SVC • Doc2Vec DMM Based Sentiment Analysis using XGBClassifier
  • 25. DOC2VEC: DBOW + DMC AND DBOW + DMM • DBOW + DMC • Doc2Vec DBOW + DMC Based Sentiment Analysis using LogisticRegression • Doc2Vec DBOW + DMC Based Sentiment Analysis using SVC • Doc2Vec DBOW + DMC Based Sentiment Analysis using XGBClassifier • Doc2Vec DBOW + DMC Based Sentiment Analysis using Keras Neural Network • DBOW + DMM • Doc2Vec DBOW + DMM Based Sentiment Analysis using LogisticRegression • Doc2Vec DBOW + DMM Based Sentiment Analysis using SVC • Doc2Vec DBOW + DMM Based Sentiment Analysis using XGBClassifier • Doc2Vec DBOW + DMM Based Sentiment Analysis using Keras Neural Network
  • 26. RESULTS: WORD2VEC (accuracy as Train / Validation / Test)
  LogisticRegression: IMDB .8753 / .8692 / .8706; Amazon .9046 / .9060 / .9009
  SVC with linear kernel: IMDB .8755 / .8714 / .8690; Amazon .9050 / .9089 / .9001
  XGBClassifier: IMDB .8695 / .8540 / .8506; Amazon .9058 / .9057 / .8967
  Keras Convolutional Neural Networks (CNN): IMDB .9994 / .8770 / .8788; Amazon .9992 / .9239 / .9146
  Pre-trained Word2Vec Word Embedding + Keras CNN: IMDB .9593 / .8444 / .8268; Amazon .9690 / .8969 / .8891
  Pre-trained Word2Vec Word Embedding + Keras CNN And Bidirectional LSTM: IMDB .9356 / .8854 / .8902; Amazon .9567 / .9212 / .9144
  Pre-trained Word2Vec Word Embedding + Keras Bidirectional LSTM: IMDB .9011 / .8800 / .8786; Amazon .9418 / .9205 / .9229
  • 27. RESULTS: DOC2VEC SIMPLE MODELS (accuracy as Train / Validation / Test)
  Doc2Vec DBOW:
  LogisticRegression: IMDB .8732 / .8738 / .8778; Amazon .8982 / .9094 / .8955
  SVC with linear kernel: IMDB .8738 / .8742 / .8766; Amazon .8985 / .9072 / .8969
  XGBClassifier: IMDB .8682 / .8500 / .8556; Amazon .9008 / .8964 / .8849
  Doc2Vec DMC:
  LogisticRegression: IMDB .5939 / .5862 / .5986; Amazon .8088 / .8137 / .7961
  SVC with linear kernel: IMDB .5933 / .5864 / .5936; Amazon .8086 / .8130 / .7956
  XGBClassifier: IMDB .6214 / .5952 / .5958; Amazon .8137 / .8193 / .7980
  Doc2Vec DMM:
  LogisticRegression: IMDB .8187 / .8196 / .8174; Amazon .8472 / .8571 / .8307
  SVC with linear kernel: IMDB .8193 / .8184 / .8190; Amazon .8428 / .8554 / .8280
  XGBClassifier: IMDB .8115 / .7858 / .7920; Amazon .8494 / .8525 / .8282
  • 28. RESULTS: DOC2VEC MODEL COMBOS (accuracy as Train / Validation / Test)
  DBOW + DMC:
  LogisticRegression: IMDB .8741 / .8738 / .8778; Amazon .8982 / .9099 / .8940
  SVC with linear kernel: IMDB .8751 / .8744 / .8790; Amazon .8990 / .9072 / .8977
  XGBClassifier: IMDB .8692 / .8510 / .8574; Amazon .9000 / .8969 / .8849
  DBOW + DMM:
  LogisticRegression: IMDB .8813 / .8742 / .8804; Amazon .9024 / .9092 / .8994
  SVC with linear kernel: IMDB .8809 / .8756 / .8786; Amazon .9033 / .9092 / .9001
  XGBClassifier: IMDB .8736 / .8548 / .8590; Amazon .9016 / .8986 / .8854
  Doc2Vec DBOW + DMC document embeddings + Keras Neural Network: IMDB .9088 / .8742 / .8746; Amazon .9312 / .9040 / .9033
  Doc2Vec DBOW + DMM document embeddings + Keras Neural Network: IMDB .9295 / .8722 / .8720; Amazon .9475 / .9156 / .8984
  • 29. CONFUSION MATRIX: WORD2VEC (TN, FP, FN, TP, Accuracy)
  LR: IMDB 2185, 344, 303, 2168 (0.87); Amazon 554, 279, 125, 3117 (0.90)
  SVC: IMDB 2182, 347, 308, 2163 (0.87); Amazon 557, 276, 131, 3111 (0.90)
  XGBClassifier: IMDB 2127, 402, 345, 2126 (0.85); Amazon 535, 298, 123, 3119 (0.90)
  Keras Convolutional Neural Networks (CNN): IMDB 2302, 227, 379, 2092 (0.88); Amazon 591, 242, 106, 3136 (0.91)
  Pre-trained Word2Vec Word Embedding To Keras CNN: IMDB 2110, 419, 447, 2024 (0.83); Amazon 571, 262, 190, 3052 (0.89)
  Pre-trained Word2Vec Word Embedding To Keras CNN And Bidirectional LSTM: IMDB 2208, 321, 228, 2243 (0.89); Amazon 615, 218, 131, 3111 (0.91)
  Pre-trained Word2Vec Word Embedding To Keras Bidirectional LSTM: IMDB 2176, 353, 254, 2217 (0.88); Amazon 652, 181, 133, 3109 (0.92)
  • 30. CONFUSION MATRIX: DOC2VEC SIMPLE MODELS (TN, FP, FN, TP, Accuracy)
  DBOW LogisticRegression: IMDB 2205, 324, 287, 2184 (0.88); Amazon 530, 303, 123, 3119 (0.90)
  DBOW SVC: IMDB 2204, 325, 292, 2179 (0.88); Amazon 533, 300, 120, 3122 (0.90)
  DBOW XGBClassifier: IMDB 2152, 377, 345, 2126 (0.86); Amazon 467, 366, 103, 3139 (0.88)
  DMC LogisticRegression: IMDB 1586, 943, 1064, 1407 (0.60); Amazon 2, 831, 0, 3242 (0.80)
  DMC SVC: IMDB 1654, 875, 1157, 1314 (0.59); Amazon 0, 833, 0, 3242 (0.80)
  DMC XGBClassifier: IMDB 1452, 1077, 944, 1527 (0.60); Amazon 28, 805, 18, 3224 (0.80)
  DMM LogisticRegression: IMDB 2064, 465, 448, 2023 (0.82); Amazon 246, 587, 103, 3139 (0.83)
  DMM SVC: IMDB 2062, 467, 438, 2033 (0.82); Amazon 201, 632, 69, 3173 (0.83)
  DMM XGBClassifier: IMDB 1981, 548, 492, 1979 (0.79); Amazon 193, 640, 60, 3182 (0.83)
  • 31. CONFUSION MATRIX: DOC2VEC MODEL COMBOS (TN, FP, FN, TP, Accuracy)
  DBOW + DMC LR: IMDB 2204, 325, 286, 2185 (0.88); Amazon 524, 309, 123, 3119 (0.89)
  DBOW + DMC SVC: IMDB 2210, 319, 286, 2185 (0.88); Amazon 529, 304, 113, 3129 (0.90)
  DBOW + DMC XGBClassifier: IMDB 2164, 365, 348, 2123 (0.86); Amazon 472, 361, 108, 3134 (0.88)
  DBOW + DMM LR: IMDB 2206, 323, 275, 2196 (0.88); Amazon 538, 295, 115, 3127 (0.90)
  DBOW + DMM SVC: IMDB 2209, 320, 287, 2184 (0.88); Amazon 538, 295, 112, 3130 (0.90)
  DBOW + DMM XGBClassifier: IMDB 2166, 363, 342, 2129 (0.86); Amazon 465, 368, 99, 3143 (0.89)
  DBOW + DMC And Keras Neural Network: IMDB 2187, 342, 285, 2186 (0.87); Amazon 564, 269, 125, 3117 (0.90)
  DBOW + DMM And Keras Neural Network: IMDB 2142, 387, 253, 2218 (0.87); Amazon 572, 261, 153, 3089 (0.90)
  • 32. CONCLUSION • Keras Bidirectional LSTM and Keras CNN + Bidirectional LSTM with pre-trained Word2Vec word embeddings • Keras CNN with Tokenizer • Word2Vec: LogisticRegression > SVC > XGBClassifier • Keras CNN with pre-trained Word2Vec word embeddings • Word2Vec > Doc2Vec • DBOW > DMM > DMC, and DBOW + DMC > DMC; also DBOW + DMM > DMM • DBOW + DMM > DBOW + DMC
  • 33. CONCLUSION Bidirectional LSTM is a Recurrent Neural Network (RNN). RNNs have the advantage of being able to persist information: the network considers current inputs as well as previously received inputs. Hence, it works really well with sequence data like text, time series, videos, DNA sequences, etc.

Editor's Notes

  1. Good evening folks. Today I am going to present my semester project for Data 602 – Introduction to Data Analysis and Machine Learning. In this project I carry out sentiment analysis using both the machine learning library Keras and common classification algorithms like SVM that use Word2Vec embeddings as their features. I will compare the results of these two approaches and present them in my final report.
  2. Let’s assume a vocabulary has five words: King, Queen, Man, Woman, Child. A one-hot encoded vector of a word in this language will have 1 in a single position to represent a specific word. All other elements will be zero. Such an encoding only allows comparisons in the form of equality testing. Meaningful comparisons cannot be performed because the words are independent of one another. Word2Vec, on the other hand, represents words using a distributed representation. Each word, represented by a vector, is defined by the combination of its various aspects. Aspects are represented in the elements of the vector. As a result, we can have aspects of royalty, gender, age, etc. in our “little language”. Once such a representation is created, we can perform algebraic operations with language. We can remove the masculinity of a King by performing vector(“King”) – vector(“Man”). Then we can add femininity to the result vector by adding vector(“Woman”) to it to obtain a vector that is closest to the vector representation of the word Queen. See image below
  3. The source for the movie review dataset is Stanford: https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz. The source URL for the Amazon product reviews is UIUC: http://sifaka.cs.uiuc.edu/~wang296/Data/LARA/Amazon/AmazonReviews.zip. There is no cost to accessing this data, accessing it does not require creation of an account, and accessing it does not violate any laws. The datasets I am using are Amazon laptop reviews and IMDb movie reviews. IMDb shape: before, the shape of the DataFrame is (50000, 2); after, (50000, 2). Amazon shape: before, (40762, 2); after, (40762, 2). I perform sentiment analysis using Word2Vec word embeddings with the classification algorithms Logistic regression, XGBoost, SVM, and Random Forest, and compare the results to a Keras Convolutional Neural Network.
  4. The first dataset of laptop product reviews from Amazon’s website contained 40,744 reviews. The dataset contained over 30K positive reviews and close to 10K negative reviews. The longest positive review length contained over 20,000 characters and longest negative review contained close to 3,000 characters. The box plot of review length shows exponential properties. Looking at the graph shown in Fig 5 it can be estimated that the mass of the distribution can probably be covered with a clipped length of 900 to 1,000 words. The review dataset contains just two columns, one with the rating and the other with the review text. The IMDb movie review dataset contains 50,000 reviews. The dataset contains equal number of reviews annotated as positive and negative. The longest positive review length contains about 14,000 characters and the longest negative review contains close to 5,000 characters. Similar to the first dataset, the box plot of review length shows exponential properties. Looking at the graph shown in Fig 6 it can be estimated that the mass of the distribution can probably be covered with a clipped length of 1,400 to 1,500 words. The review dataset contains just two columns, one with the rating and the other with the review text.
  9. The term frequency distribution of words in the review is obtained using nltk.FreqDist(). This provides us a rough idea of the main topic in the review dataset.
  10. Word embedding is a language modeling technique that uses vectors with several dimensions to represent words from large amounts of unstructured text data. Word embeddings can be generated using various methods like neural networks, co-occurrence matrix, probabilistic models, etc. There are two models for generating word embeddings, i.e. CBOW and Skip-gram. CBOW (Continuous Bag of Words) model CBOW model predicts the current word given a context of words. The input layer contains context words and the output layer contains current predicted word. The hidden layer contains the number of dimensions in which to represent current word at output layer. The CBOW architecture is shown in Fig 1.
  14. CNN is a class of deep, feed-forward artificial neural networks (where connections between nodes do not form a cycle) that use a variation of multilayer perceptrons designed to require minimal preprocessing. These are inspired by the animal visual cortex. I have taken reference from the Yoon Kim paper and the blog by Denny Britz. CNNs are generally used in computer vision; however, they have recently been applied to various NLP tasks and the results were promising.
  15. Recurrent Neural Network (RNN) is one of the most popular architectures used in Natural Language Processing (NLP) tasks because its recurrent structure is very suitable to process variable length text. RNN can utilize distributed representations of words by first converting the tokens comprising each text into vectors, which form a matrix. And this matrix includes two dimensions: the time-step dimension and the feature vector dimension. Then most existing models usually utilize one-dimensional (1D) max pooling operation or attention-based operation only on the time-step dimension to obtain a fixed-length vector. However, the features on the feature vector dimension are not mutually independent, and simply applying 1D pooling operation over the time-step dimension independently may destroy the structure of the feature representation. On the other hand, applying two-dimensional (2D) pooling operation over the two dimensions may sample more meaningful features for sequence modeling tasks. Compared with the state-of-the-art models, the proposed models achieve excellent performance on 4 out of 6 tasks. Specifically, one of the proposed models achieves highest accuracy on Stanford Sentiment Treebank binary classification and fine-grained classification tasks.
  16. Doc2Vec can be modeled using paragraph vector distributed bag of words (PV-DBOW or DBOW model) which is a model analogous to Skip-gram in Word2Vec. The document vectors are obtained by training a neural network on the task of predicting a probability distribution of words in a paragraph given a randomly-sampled word from the document.
  17. Often called paragraph vector distributed memory (PV-DM or DM model) is obtained by training a neural network on the task of inferring a center word based on context words and a context paragraph.