This document discusses sentiment analysis models for short text like SMS messages. It describes earlier models using word embeddings and character-level embeddings. The current model uses concatenated word and character embeddings, CNN for sentence embeddings, and additional techniques like attention layers and highway networks to improve accuracy. Preprocessing steps like removing tags and stopwords are also covered. The model is evaluated on the Stanford Twitter Sentiment dataset and achieves accuracy in the 75-79% range.
2. PROBLEM STATEMENT & MODEL
How do we extract sentiment from short text messages written in SMS
language?
Earlier model (diagram): word2vec and char-level embeddings form the
word embedding, which is then pooled into a sentence embedding.
4. MODEL – WORD EMBEDDING
•Concatenate the pretrained word2vec embedding and a char-level word
embedding to represent each word.
•Use Yoon Kim's (2014) CNN procedure to get the sentence embedding
and pass it to a fully connected layer to predict the sentiment.
•Achieved 76-79% accuracy on the 80K Stanford Twitter Sentiment (STS) dataset.
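The CNN step above can be sketched in numpy: each filter slides over windows of consecutive word vectors, and max-over-time pooling keeps one value per filter. The dimensions and random weights here are illustrative assumptions, not the model's actual hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv_max_pool(embeddings, filters):
    """Slide each filter over every window of `width` consecutive words,
    apply ReLU, and keep the max activation (max-over-time pooling)."""
    n_words, dim = embeddings.shape
    width = filters.shape[1] // dim
    windows = np.stack([embeddings[i:i + width].ravel()
                        for i in range(n_words - width + 1)])
    activations = np.maximum(windows @ filters.T, 0.0)  # ReLU
    return activations.max(axis=0)  # one value per filter

# Toy sentence: 6 words, 10-dim (hypothetical) word+char embeddings.
sent = rng.normal(size=(6, 10))
filters3 = rng.normal(size=(4, 3 * 10))  # 4 filters spanning 3 words each
sentence_embedding = conv_max_pool(sent, filters3)
print(sentence_embedding.shape)  # (4,)
```

In the full model, several filter widths would be used and their pooled outputs concatenated before the fully connected layer.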
5. WORD EMBEDDING
•Rather than using only pretrained word embeddings, we added a second,
trainable embedding layer that produces word vectors of size 5
(trainable set to true).
•This gave a 0.15-point accuracy improvement.
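A minimal sketch of the idea: each word's final vector is the frozen pretrained vector concatenated with a small trainable vector. The vocabulary, 300-dim pretrained size, and random initial values are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = {"happy": 0, "sad": 1}  # toy vocabulary

pretrained = rng.normal(size=(len(vocab), 300))      # frozen word2vec-style table
trainable = rng.normal(size=(len(vocab), 5)) * 0.01  # updated during training

def embed(word):
    """Concatenate the frozen pretrained vector with the trainable size-5 vector."""
    i = vocab[word]
    return np.concatenate([pretrained[i], trainable[i]])

print(embed("happy").shape)  # (305,)
```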
6. HIGHWAY NETWORK
•Modification to the char-level word embedding:
• We follow the highway layer used in the BI-DIRECTIONAL ATTENTION FLOW
FOR MACHINE COMPREHENSION (BiDAF) paper:
• f – tanh transform
• c – carry gate that controls how much of the original x (the
concatenated word representation) to keep
• All outputs have the same dimension as x
• On the small dataset, accuracy improved from 75.35% to 75.95%
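The highway layer above can be sketched as y = t * tanh(W_h x + b_h) + (1 - t) * x, where t is a sigmoid gate and (1 - t) is the carry gate c; all outputs keep the dimension of x. The weights and dimension below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 8  # dimension of x; every output keeps this dimension

W_h, b_h = rng.normal(size=(d, d)), np.zeros(d)
W_t, b_t = rng.normal(size=(d, d)), np.full(d, -1.0)  # negative bias favours carrying x

def highway(x):
    h = np.tanh(W_h @ x + b_h)                # transform: f = tanh
    t = 1.0 / (1.0 + np.exp(-(W_t @ x + b_t)))  # transform gate
    return t * h + (1.0 - t) * x              # carry gate c = 1 - t keeps part of x

x = rng.normal(size=d)
y = highway(x)
print(y.shape)  # (8,)
```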
7. ATTENTION & HIGHWAY
•To capture longer dependencies, we added an attention layer
(Bahdanau et al.).
•We concatenate the sentence representations from the CNN and the
attention layer and pass the result to a highway network layer.
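The attention step can be sketched as additive (Bahdanau-style) attention over the word vectors: score each word, softmax the scores, and return the weighted sum as a sentence vector. The dimensions and random weights are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
d, attn_d = 10, 6  # word-vector dim and attention hidden dim (assumed)

W = rng.normal(size=(attn_d, d))
v = rng.normal(size=attn_d)

def attend(H):
    """Additive attention: score each word vector, softmax the scores,
    and return the weighted sum of word vectors."""
    scores = np.tanh(H @ W.T) @ v          # one scalar score per word
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()               # softmax over words
    return weights @ H                     # sentence vector, same dim as words

H = rng.normal(size=(6, d))  # 6 word vectors
print(attend(H).shape)  # (10,)
```

This vector would then be concatenated with the CNN sentence embedding before the highway layer.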
9. PREPROCESSING
•Repeated Letters – GROUPING
• Every run of a repeated character is collapsed to one character.
• Exception set:
['d', 'e', 'f', 'g', 'l', 'm', 'n', 'o', 'p', 'r', 's', 't', 'z']
• If a character in the exception set occurs two or more times, it is
collapsed to two repetitions.
• Examples:
• "haapppppy" -> "happy"
• "huuuungrrrrryyyyyy" -> "hungry"
•Emoticons – English Text
• ":-)" -> "Happy face"
10. TRAINING DATA
•Stanford Twitter Sentiment (STS-2) or sentiment140 dataset
•Twitter messages - with emoticons used as noisy labels.
•To keep the vocabulary very small, words are lower-cased and stemmed
with the Porter stemmer.
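The normalization step can be sketched as lower-casing plus suffix stripping. The real pipeline would use a full Porter stemmer (e.g. NLTK's `PorterStemmer`); the toy rules below are assumptions that only illustrate the idea.

```python
def normalize(word):
    """Lower-case a word and strip a few common suffixes.
    A toy sketch; a real Porter stemmer applies many more rules."""
    w = word.lower()
    # (suffix, replacement) pairs, loosely Porter-style; illustrative only.
    for suffix, repl in [("ies", "i"), ("ing", ""), ("ed", "")]:
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: len(w) - len(suffix)] + repl
    return w

print(normalize("Eating"))  # eat
```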