This document discusses sentiment analysis models for short text like SMS messages. It describes earlier models using word embeddings and character-level embeddings. The current model uses concatenated word and character embeddings, CNN for sentence embeddings, and additional techniques like attention layers and highway networks to improve accuracy. Preprocessing steps like removing tags and stopwords are also covered. The model is evaluated on the Stanford Twitter Sentiment dataset and achieves accuracy in the 75-79% range.
2. PROBLEM STATEMENT & MODEL
How do we extract sentiment from short text messages written in SMS
language?
Earlier model (diagram): word2vec and char-level embeddings form the
word embedding, which is then pooled into a sentence embedding.
4. MODEL – WORD EMBEDDING
•Concatenate the pretrained word2vec embedding and a char-level word
embedding to represent each word.
•Use Yoon Kim's (2014) CNN procedure to get the sentence embedding
and pass it to a fully connected layer to predict the sentiment.
•Achieved 76-79% accuracy on the 80K Stanford Twitter Sentiment (STS) dataset.
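The CNN step above can be sketched in numpy: each filter slides over windows of consecutive word vectors, and max-over-time pooling keeps one value per filter. The dimensions and random weights here are illustrative assumptions, not the model's actual hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv_max_pool(embeddings, filters):
    """Slide each filter over every window of `width` consecutive words,
    apply ReLU, and keep the max activation (max-over-time pooling)."""
    n_words, dim = embeddings.shape
    width = filters.shape[1] // dim
    windows = np.stack([embeddings[i:i + width].ravel()
                        for i in range(n_words - width + 1)])
    activations = np.maximum(windows @ filters.T, 0.0)  # ReLU
    return activations.max(axis=0)  # one value per filter

# Toy sentence: 6 words, 10-dim (hypothetical) word+char embeddings.
sent = rng.normal(size=(6, 10))
filters3 = rng.normal(size=(4, 3 * 10))  # 4 filters spanning 3 words each
sentence_embedding = conv_max_pool(sent, filters3)
print(sentence_embedding.shape)  # (4,)
```

In the full model, several filter widths would be used and their pooled outputs concatenated before the fully connected layer.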
5. WORD EMBEDDING
•Rather than using only pretrained word embeddings, we added a second,
trainable embedding layer that produces word vectors of size 5
(trainable set to true).
•This gave a 0.15-point accuracy improvement.
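A minimal sketch of the idea: each word's final vector is the frozen pretrained vector concatenated with a small trainable vector. The vocabulary, 300-dim pretrained size, and random initial values are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = {"happy": 0, "sad": 1}  # toy vocabulary

pretrained = rng.normal(size=(len(vocab), 300))      # frozen word2vec-style table
trainable = rng.normal(size=(len(vocab), 5)) * 0.01  # updated during training

def embed(word):
    """Concatenate the frozen pretrained vector with the trainable size-5 vector."""
    i = vocab[word]
    return np.concatenate([pretrained[i], trainable[i]])

print(embed("happy").shape)  # (305,)
```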
6. HIGHWAY NETWORK
•Modification to the char-level word embedding:
• We follow the highway layer used in the BI-DIRECTIONAL ATTENTION FLOW
FOR MACHINE COMPREHENSION (BiDAF) paper:
• f – tanh transform
• c – carry gate that controls how much of the original x (the
concatenated word representation) to keep
• All outputs have the same dimension as x
• On the small dataset, accuracy improved from 75.35% to 75.95%
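The highway layer above can be sketched as y = t * tanh(W_h x + b_h) + (1 - t) * x, where t is a sigmoid gate and (1 - t) is the carry gate c; all outputs keep the dimension of x. The weights and dimension below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 8  # dimension of x; every output keeps this dimension

W_h, b_h = rng.normal(size=(d, d)), np.zeros(d)
W_t, b_t = rng.normal(size=(d, d)), np.full(d, -1.0)  # negative bias favours carrying x

def highway(x):
    h = np.tanh(W_h @ x + b_h)                # transform: f = tanh
    t = 1.0 / (1.0 + np.exp(-(W_t @ x + b_t)))  # transform gate
    return t * h + (1.0 - t) * x              # carry gate c = 1 - t keeps part of x

x = rng.normal(size=d)
y = highway(x)
print(y.shape)  # (8,)
```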
7. ATTENTION & HIGHWAY
•To capture longer dependencies, we added an attention layer
(Bahdanau et al.).
•We concatenate the sentence representations from the CNN and the
attention layer and pass the result to a highway network layer.
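The attention step can be sketched as additive (Bahdanau-style) attention over the word vectors: score each word, softmax the scores, and return the weighted sum as a sentence vector. The dimensions and random weights are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
d, attn_d = 10, 6  # word-vector dim and attention hidden dim (assumed)

W = rng.normal(size=(attn_d, d))
v = rng.normal(size=attn_d)

def attend(H):
    """Additive attention: score each word vector, softmax the scores,
    and return the weighted sum of word vectors."""
    scores = np.tanh(H @ W.T) @ v          # one scalar score per word
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()               # softmax over words
    return weights @ H                     # sentence vector, same dim as words

H = rng.normal(size=(6, d))  # 6 word vectors
print(attend(H).shape)  # (10,)
```

This vector would then be concatenated with the CNN sentence embedding before the highway layer.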
9. PREPROCESSING
•Repeated Letters – GROUPING
• Every run of a repeated character is collapsed to one character.
• Exception set:
['d', 'e', 'f', 'g', 'l', 'm', 'n', 'o', 'p', 'r', 's', 't', 'z']
• If a character in the exception set occurs two or more times, it is
collapsed to two repetitions.
• Examples:
• "haapppppy" -> "happy"
• "huuuungrrrrryyyyyy" -> "hungry"
•Emoticons – English Text
• ":-)" -> "Happy face"
10. TRAINING DATA
•Stanford Twitter Sentiment (STS-2) or sentiment140 dataset
•Twitter messages - with emoticons used as noisy labels.
•To keep the vocabulary very small, words are lower-cased and stemmed
with the Porter stemmer.
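The normalization step can be sketched as lower-casing plus suffix stripping. The real pipeline would use a full Porter stemmer (e.g. NLTK's `PorterStemmer`); the toy rules below are assumptions that only illustrate the idea.

```python
def normalize(word):
    """Lower-case a word and strip a few common suffixes.
    A toy sketch; a real Porter stemmer applies many more rules."""
    w = word.lower()
    # (suffix, replacement) pairs, loosely Porter-style; illustrative only.
    for suffix, repl in [("ies", "i"), ("ing", ""), ("ed", "")]:
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: len(w) - len(suffix)] + repl
    return w

print(normalize("Eating"))  # eat
```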