Applying word vectors sentiment analysis
1. Applying Word Vectors for Sentiment Analysis & Text Analysis while Browsing
Abdullah Khan Zehady
Department Of Computer Science,
Purdue University
2. Movie Review Sentiment Analysis
● Collected from Kaggle ML Competition.
● Data
o Format: “Review Index” “Review” “Sentiment (0/1)”
o LabeledTrainData: 25000 movie reviews
o TestData: 25000 movie reviews
3. Approach 1: Bag Of Words - Baseline
● Data Preprocessing
o Removal of HTML, non-letters, stopwords, and extra spaces + lowercase conversion
● Creating Features from Bag Of Words
o 5000 most frequent words (feature matrix: 25000 x 5000)
o { the, cat, sat, on, hat, dog, ate, and } ---> { 2, 1, 1, 1, 1, 0, 0, 0 }
o { the, cat, sat, on, hat, dog, ate, and } ---> { 3, 1, 0, 0, 1, 1, 1, 1}
● Supervised Learning
o Random Forest Classifier with 100 trees
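The preprocessing and counting steps on this slide can be sketched in plain Python. This is a minimal illustration, not the author's code: the helper names `clean_review` and `bag_of_words`, the tiny `STOPWORDS` set, and the two example sentences are assumptions chosen so that the counts match the two vectors shown above. The classifier step is left out.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; a real run would use a fuller list).
STOPWORDS = {"a", "an", "is", "it", "this", "of"}

def clean_review(raw, stopwords=frozenset()):
    """Strip HTML tags and non-letters, lowercase, split on whitespace, drop stopwords."""
    no_html = re.sub(r"<[^>]+>", " ", raw)
    letters_only = re.sub(r"[^a-zA-Z]", " ", no_html)
    return [w for w in letters_only.lower().split() if w not in stopwords]

def bag_of_words(tokens, vocabulary):
    """Count how often each vocabulary word occurs in the token list."""
    counts = Counter(tokens)
    return [counts[w] for w in vocabulary]

vocabulary = ["the", "cat", "sat", "on", "hat", "dog", "ate", "and"]

# Stopwords are kept here so the counts line up with the slide's example vectors.
v1 = bag_of_words(clean_review("The cat sat on the hat"), vocabulary)
v2 = bag_of_words(clean_review("The dog ate the cat and the hat"), vocabulary)
print(v1)  # [2, 1, 1, 1, 1, 0, 0, 0]
print(v2)  # [3, 1, 0, 0, 1, 1, 1, 1]

# With HTML, punctuation, and stopwords removed:
print(clean_review("<br />This movie is GREAT!!", STOPWORDS))  # ['movie', 'great']
```

In the setup described on the slide, the vocabulary would be the 5000 most frequent words and each of the 25000 reviews would become one such count vector.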
4. Approach 2: TF-IDF Word Weight
● Review Vector ← TF-IDF word weight
Approach 3: Vector Averaging
● Word2Vec word vectors (Dim = 300)
o Review Vector ← Element wise Average
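As a rough sketch of the element-wise average, assuming word vectors are available as a plain dictionary from word to list of floats: the function name `average_vector` and the 3-dimensional toy vectors are hypothetical, and a real run would use the 300-dimensional Word2Vec vectors mentioned above.

```python
def average_vector(tokens, word_vectors, dim):
    """Element-wise mean of the vectors of all tokens present in word_vectors."""
    total = [0.0] * dim
    found = 0
    for tok in tokens:
        vec = word_vectors.get(tok)
        if vec is None:
            continue  # skip words with no vector
        total = [t + v for t, v in zip(total, vec)]
        found += 1
    if found == 0:
        return total  # all-zero vector when no token is known
    return [t / found for t in total]

# Toy 3-dimensional vectors (made up for illustration).
toy_vectors = {"good": [1.0, 0.0, 2.0], "movie": [3.0, 2.0, 0.0]}
review_vec = average_vector(["good", "movie", "unseen"], toy_vectors, dim=3)
print(review_vec)  # [2.0, 1.0, 1.0]
```

Averaging discards word order and gives every word equal weight, which may be one reason this approach can trail the simpler count-based baselines.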
Approach 4: Bag Of Centroids
● K-Means Clustering to find word clusters
● Number of Features = Number of Clusters
● Review Feature Vector
o Find which cluster each word belongs to and increment that cluster's count.
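The feature construction can be sketched as follows, assuming K-Means has already produced a word-to-cluster mapping. The mapping `word_to_cluster` below is made up for illustration, and `bag_of_centroids` is a hypothetical helper name.

```python
def bag_of_centroids(tokens, word_to_cluster, num_clusters):
    """Build a review feature vector with one count per word cluster."""
    features = [0] * num_clusters
    for tok in tokens:
        cluster = word_to_cluster.get(tok)
        if cluster is not None:  # words outside the clustered vocabulary are ignored
            features[cluster] += 1
    return features

# Hypothetical cluster assignments, standing in for K-Means output on word vectors.
word_to_cluster = {"great": 0, "awesome": 0, "boring": 1, "dull": 1, "actor": 2}
features = bag_of_centroids(
    ["great", "actor", "awesome", "plot", "dull"], word_to_cluster, num_clusters=3
)
print(features)  # [2, 1, 1]
```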
5. Approach 5: Clustering + Pretrained Vector + External Sentiment Dict.
● Pre-trained Data (using word2vec)
o Entity vectors trained on 100B words from various news articles: freebase-vectors-skipgram1000.bin.gz
o Pre-trained vectors trained on part of the Google News dataset (about 100 billion words)
● Word2Vec “distance”, “most_similar” to look up close words + find review tones
● Incorporating “SentiWordNet” information
o Positive, Negative Score for each word
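One possible way to turn per-word positive and negative scores into review-level features is sketched below. The slides do not say how the scores were combined with the clustering features, so this is only an assumption; the scores in `toy_lexicon` are made up and are not actual SentiWordNet values.

```python
# Hypothetical lexicon: word -> (positive score, negative score).
toy_lexicon = {
    "good": (0.75, 0.0),
    "bad": (0.0, 0.625),
    "fine": (0.5, 0.125),
}

def sentiment_features(tokens, lexicon):
    """Sum the positive and negative scores of all tokens found in the lexicon."""
    pos = sum(lexicon[t][0] for t in tokens if t in lexicon)
    neg = sum(lexicon[t][1] for t in tokens if t in lexicon)
    return pos, neg

pos, neg = sentiment_features(["good", "plot", "bad", "fine"], toy_lexicon)
print(pos, neg)  # 1.25 0.75
```

The two sums could then be appended to a review's existing feature vector before training the classifier.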
6. Result
Method                      Accuracy
Bag Of Words                0.84
TF-IDF                      0.74
Vector Averaging            0.63
Bag Of Centroids            0.81
PreTrain + Ext. Knowledge   0.87
7. Page Analysis Chrome Extension
● Important Word List
● Important Named Entities
● Tag Distribution
● Summarization of Text
● Sentiment Analysis
○ Comment Analysis
A useful tool everybody will be able to use to extract
meaningful information from a webpage.
8. Future Work
● Implementation of RNN, LSTM-RNN, Paragraph Vector
o Y Bengio, R Ducharme, P Vincent… - The Journal of Machine …, 2003 - dl.acm.org
o P Le, W Zuidema - COLING, 2012
o QV Le, T Mikolov, 2014
● Relational inference for wikification
o Disambiguation to Wikipedia: Pr(title|surface)
o Candidate title <- Compositional Semantics for candidate wiki page
● Extension: Reranking Google Search result using information visualization.
Editor's Notes
TF-IDF: a measure of how important a given word is to a document within a given set of documents
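A small sketch of one common TF-IDF variant (term frequency times the unsmoothed log inverse document frequency) is given below. Libraries differ in smoothing and normalization, so the exact weights used in the experiments may not match; the function name `tf_idf` and the toy corpus are assumptions.

```python
import math

def tf_idf(term, doc, corpus):
    """TF-IDF of `term` in `doc`, where `corpus` is a list of tokenized documents."""
    tf = doc.count(term) / len(doc)              # how frequent the term is in this document
    df = sum(1 for d in corpus if term in d)     # number of documents containing the term
    if df == 0:
        return 0.0
    idf = math.log(len(corpus) / df)             # rarer terms get a larger weight
    return tf * idf

corpus = [
    ["great", "movie"],
    ["bad", "movie"],
    ["great", "movie", "cast"],
]
w_movie = tf_idf("movie", corpus[0], corpus)  # appears in every document
w_bad = tf_idf("bad", corpus[1], corpus)      # appears in only one document
print(w_movie, w_bad)
```

A word that occurs in every document, such as "movie" here, gets weight zero, while a word confined to one document gets a positive weight.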