TE Project Based Seminar
On
Analyzing Text Preprocessing and Feature
Selection Methods for Sentiment Analysis
Student’s Name: Nirav Raje
Guide’s Name: Dr. Debajyoti Mukhopadhyay
 Definition: The task of automatically classifying a text written in a
natural language into a positive or negative feeling, opinion or
subjectivity.
 Subjectivity analysis of a text is the main task of Sentiment
Analysis (SA).
 Other tasks:
▪ Predicting the polarity of a given sentence.
▪ Identifying the emotional state of a sentence.
Sentiment Analysis - Introduction
Process of Sentiment Analysis
Data Gathering → Text Pre-processing → Feature Extraction → Feature Vector → Classifier → Evaluation
 Personal interpretation of individuals
 Noise and uninformative parts in text
 Words with no impact on SA of text
 Sarcasm
 Named Entity Recognition
 Anaphora Resolution (Pronoun/noun phrase resolution)
Challenges in SA
 Sentiment analysis is mainly a classification task.
 Pre-processing: the process of cleaning and preparing the text for
classification.
 Pre-processing operations can be broadly divided into 2 categories:
 Transformations:
Online text cleaning, white space removal, abbreviation expansion,
stemming, stop word removal, negation handling.
 Filtering:
Involves feature selection, the most challenging part of pre-processing.
Text Pre-processing
 An extended comparison of sentiment polarity
classification methods for Twitter text has not been
done.
 Effect on different data sets has not been analyzed.
 Hence, we present the role of text pre-processing in
sentiment analysis, and a report on experiment results
demonstrating that feature selection and representation
can affect the classification performance positively.
 3 different data sets have been used to examine classifier
accuracies.
Conclusion from Literature Review
 To provide an extended comparison of sentiment polarity
classification methods for Twitter text and to examine the role of
text pre-processing in sentiment analysis.
 To report experimental results which demonstrate that, with
appropriate feature selection and representation procedures, the
performance of SA classifiers improves.
Problem Statement
 Reducing the noise in the text should help improve the
performance of the classifier and speed up the
classification process, thus aiding real-time sentiment
analysis.
Hypothesis of Pre-processing
 Basic Operation and Cleaning
 Remove unimportant or disturbing elements.
 Normalize some misspelled words.
 Text should not contain URLs, hashtags (e.g. #happy) or
mentions (e.g. @BarackObama).
 Replace tabs and line breaks with a blank, and quotation marks
with apexes (single quotation marks).
 Remove vowels repeated in sequence at least three times.
 Replace laughs, which are normally sequences of “a” and “h”, with
a “laugh” tag.
 Convert text to lowercase.
Data Transformations
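The basic cleaning operations listed above can be sketched with a few regular expressions. This is only an illustration, not the pipeline used in the experiments: the function name `basic_clean`, the exact patterns, and the choice to collapse a run of repeated vowels to a single vowel are all assumptions made for the example.

```python
import re

# All patterns below are illustrative assumptions, not the authors' rules.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
TAG_RE = re.compile(r"[#@]\w+")                 # hashtags and mentions
VOWEL_RUN_RE = re.compile(r"([aeiou])\1{2,}")   # same vowel 3+ times in a row
LAUGH_RE = re.compile(r"\b(?:ha|ah){2,}[ah]?\b")


def basic_clean(text: str) -> str:
    """Apply the basic cleaning steps to one piece of text."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = TAG_RE.sub(" ", text)
    text = text.replace('"', "'")            # quotation marks -> single quotes
    text = VOWEL_RUN_RE.sub(r"\1", text)     # assumption: keep one vowel
    text = LAUGH_RE.sub(" laugh ", text)
    text = re.sub(r"\s+", " ", text)         # tabs, line breaks, extra blanks
    return text.strip()


print(basic_clean('Sooo happy\t"today" hahaha @BarackObama #happy http://t.co/x'))
```

Collapsing to a single vowel turns "sooo" into "so" but would also turn "cooool" into "col", so a real system would likely pair this step with the dictionary-based correction described later.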
 Emoticon Handling:
This module reduces the emoticons to only two categories,
smile positive and smile negative, as shown in the table below.

Smile Positive    Smile Negative
0:-)              >:(
:)                ;(
:D                >:)
:*                D:<
:o                :(
:P                :|
;)                >:/
Data Transformations
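One possible way to apply the emoticon table is a lookup built from the two columns. The tag strings `smile_positive` and `smile_negative` and the function name `replace_emoticons` are assumptions for this sketch; the slides only name the two categories.

```python
import re

# Emoticons copied from the table; tag spellings are an assumption.
POSITIVE = ["0:-)", ":)", ":D", ":*", ":o", ":P", ";)"]
NEGATIVE = [">:(", ";(", ">:)", "D:<", ":(", ":|", ">:/"]

TABLE = {e: "smile_positive" for e in POSITIVE}
TABLE.update({e: "smile_negative" for e in NEGATIVE})

# Try longer emoticons first so a longer one is preferred over a shorter one
# that could match at the same position.
EMOTICON_RE = re.compile(
    "|".join(re.escape(e) for e in sorted(TABLE, key=len, reverse=True))
)


def replace_emoticons(text: str) -> str:
    """Replace each known emoticon with its category tag."""
    tagged = EMOTICON_RE.sub(lambda m: f" {TABLE[m.group(0)]} ", text)
    return " ".join(tagged.split())


print(replace_emoticons("great movie :D"))
print(replace_emoticons("hmm >:)"))
```

Because emoticons such as :D and :P are case-sensitive, this step would have to run before the text is converted to lowercase.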
 Negation Handling:
 Deals with negations (like “not good”).
 All negative constructs (can't, don't, isn't, never, etc.) are
replaced with “not”.
 Dictionary:
 Detects and corrects misspelled words using a dictionary.
 Substitutes slang with its formal meaning (e.g. l8 → late), using a
list.
 Replaces insults with the tag “bad word”.
Data Transformations
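A minimal sketch of the negation and dictionary steps follows. Only the `l8 → late` mapping comes from the slides; the other slang entry, the insult list, the `bad_word` tag spelling, and the function name are hypothetical placeholders. Spelling correction is left out because it needs a real dictionary.

```python
import re

# Matches "never" and contractions ending in n't (can't, don't, isn't, ...).
NEGATION_RE = re.compile(r"\b(?:\w+n't|never)\b")

SLANG = {"l8": "late", "gr8": "great"}  # only "l8" is from the slides
INSULTS = {"idiot"}                     # placeholder list, an assumption


def normalize_tokens(text: str) -> str:
    """Replace negative constructs, slang, and insults token by token."""
    text = NEGATION_RE.sub("not", text.lower())
    out = []
    for token in text.split():
        if token in INSULTS:
            out.append("bad_word")
        else:
            out.append(SLANG.get(token, token))
    return " ".join(out)


print(normalize_tokens("I can't be l8"))
print(normalize_tokens("Never call me idiot"))
```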
 Stemming:
 Reduces words to their root form and groups them.
 Puts word variations like “great”, “greatly”, “greatest”, and
“greater” all into one bucket.
 Effectively decreases entropy and increases the relevance of the
concept of “great”.
 Stop Word Removal:
 Stop words are, for example, pronouns and articles.
 These could be words like: a, and, is, on, of, or, the, was, with.
 Left in the text, they can lead to a less accurate classification.
Data Transformations
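To make the two transformations concrete, here is a toy sketch. The stop word set is the list from the slide; the suffix-stripping rule is a deliberately naive stand-in for a real stemmer (for example the Porter algorithm) and is an assumption of this example, as are the function names.

```python
# Stop words taken from the slide above.
STOP_WORDS = {"a", "and", "is", "on", "of", "or", "the", "was", "with"}

# Toy suffix list: just enough to cover the "great" example.
SUFFIXES = ("est", "ly", "er")


def toy_stem(word: str) -> str:
    """Strip one known suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def transform(tokens: list[str]) -> list[str]:
    """Drop stop words, then stem what is left."""
    return [toy_stem(t) for t in tokens if t not in STOP_WORDS]


print([toy_stem(w) for w in ["great", "greatly", "greatest", "greater"]])
print(transform(["the", "movie", "was", "greatly", "acted"]))
```

All four variations of "great" end up in the same bucket, which is the grouping effect described above.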
Feature Selection
 Features - words, terms or phrases that strongly express the opinion
as positive or negative.
 Feature selection is the process of selecting those attributes in your
dataset that are most relevant to the predictive modeling problem
you are working on.
 Drawbacks of the extra features:
 They make document classification slower.
 They reduce accuracy.
 Benefits of removing them:
 The classifier can fit a model to the problem set more quickly.
 The classifier can classify items faster.
Filtering
 Feature Weighting Methods:
1. Feature Frequency (FF):
 The method uses the term frequency, i.e. the frequency that each
unigram occurs within a document, as the feature values for that
document.
2. Feature Presence (FP):
 Very similar to feature frequency.
 Difference: rather than using the frequency of a unigram, we simply
use a one to indicate its presence.
Filtering
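The difference between feature frequency and feature presence is easy to see on a small token list. The function names below are chosen for this sketch only.

```python
from collections import Counter


def feature_frequency(tokens: list[str]) -> dict[str, int]:
    """FF: each unigram maps to the number of times it occurs."""
    return dict(Counter(tokens))


def feature_presence(tokens: list[str]) -> dict[str, int]:
    """FP: each unigram that occurs maps to 1, whatever its count."""
    return {token: 1 for token in tokens}


tokens = "good good movie".split()
print(feature_frequency(tokens))
print(feature_presence(tokens))
```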
3. Term Frequency Inverse Document Frequency (TF-IDF):
 A numerical statistic that is intended to reflect how important a
word is to a document in a collection or corpus.
 Often used as a weighting factor in information retrieval, text
mining and user modeling.
 The TF-IDF value increases proportionally to the number of
times a word appears in the document, and is offset by the number
of documents in the collection that contain the word.
TF-IDF = FF × log(N / DF)
where
N is the total number of documents,
DF is the number of documents that contain this feature,
FF is the number of occurrences of the feature in the document.
Filtering
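The TF-IDF weighting can be computed directly from the definitions of FF, N and DF. The slide does not state the base of the logarithm, so the natural logarithm is assumed here; the function name `tf_idf` is also just for illustration.

```python
import math
from collections import Counter


def tf_idf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Return one {term: FF * log(N / DF)} mapping per document."""
    n = len(docs)
    # DF: number of documents containing each term (count each doc once).
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {term: ff * math.log(n / df[term]) for term, ff in Counter(doc).items()}
        for doc in docs
    ]


docs = [["good", "good", "movie"], ["bad", "movie"]]
for weights in tf_idf(docs):
    print(weights)
```

In this tiny corpus "movie" appears in every document, so log(N / DF) is zero and the term carries no weight, while "good" is weighted by its two occurrences.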
 To evaluate the role of pre-processing techniques on
classification problems.
 Hence, we examine the performance of several well-
known learning based classification algorithms using
various pre-processing options on three different subject
datasets.
Goal of Current Experiment
Performance Evaluation Process
Select Dataset → String to Word Vector → Attribute Selection → Classification → Evaluation
Proposed Pre-processing Techniques
Data Sets and Classifiers Used
Our evaluation results indicated:
 Selecting only attributes with information gain (IG) > 0
decreased the number of attributes appreciably.
 Overall, algorithms trained faster due to attribute selection.
 1-to-3-grams performed better than the other representations,
closely followed by unigrams.
 For the Naive Bayes (NB) classifier, the percentage of correctly
classified instances increased by over 7 points.
 The effect of pre-processing techniques on classifier accuracy was
the same regardless of the dataset.
Results of the Proposed Work
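Attribute selection with IG > 0 can be illustrated on a toy corpus. This sketch treats each attribute as a binary "term present / term absent" feature and computes information gain as the reduction in class entropy; the data, the function names, and the small tolerance used in place of an exact zero are assumptions of the example, not the setup used in the experiments.

```python
import math
from collections import Counter


def entropy(labels: list[str]) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(docs: list[set[str]], labels: list[str], term: str) -> float:
    """Class entropy minus the entropy remaining after splitting on the term."""
    present = [y for d, y in zip(docs, labels) if term in d]
    absent = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    remaining = sum(len(part) / n * entropy(part) for part in (present, absent) if part)
    return entropy(labels) - remaining


def select_attributes(docs: list[set[str]], labels: list[str]) -> list[str]:
    """Keep only the terms whose information gain is above zero."""
    vocab = {term for doc in docs for term in doc}
    return sorted(t for t in vocab if information_gain(docs, labels, t) > 1e-12)


docs = [{"good", "movie"}, {"bad", "movie"}, {"good", "film"}, {"bad", "film"}]
labels = ["pos", "neg", "pos", "neg"]
print(select_attributes(docs, labels))
```

Here "movie" and "film" occur equally often in both classes, so their information gain is zero and they are dropped, which mirrors the reduction in attribute count reported above.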
 Feature extraction improves the classification accuracy
in comparison with using all created attributes.
 Significant accuracy rates are obtained when applying
the attribute selection based on information gain.
 Unigram and 1-to-3-grams perform better than the other
representations of n-grams.
 Thus our experiments’ results illustrate that with
appropriate feature selection and representation,
sentiment analysis accuracies can be improved.
Conclusion
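For readers unfamiliar with the n-gram representations compared above, the following sketch shows what a 1-to-3-gram representation of a token sequence looks like. The function name and its defaults are chosen for this example.

```python
def ngrams(tokens: list[str], n_min: int = 1, n_max: int = 3) -> list[str]:
    """Return all n-grams of the token list for n_min <= n <= n_max."""
    return [
        " ".join(tokens[i : i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


print(ngrams(["not", "very", "good"]))
print(ngrams(["not", "very", "good"], n_max=1))  # unigrams only
```

Unlike unigrams, the 1-to-3-gram representation keeps short phrases such as "not very good" as single features.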
 To investigate further the available pre-processing
options in order to find the optimal settings.
 To focus on the choice of the best algorithm for attribute
selection strategies.
 To evaluate ranking methods such as information gain and
chi-square.
 To involve embedded methods, which carry out feature
selection and model tuning at the same time.
Future Work
Analyzing Text Preprocessing and Feature Selection Methods for Sentiment Analysis

More Related Content

What's hot

Lecture Notes-Finite State Automata for NLP.pdf
Lecture Notes-Finite State Automata for NLP.pdfLecture Notes-Finite State Automata for NLP.pdf
Lecture Notes-Finite State Automata for NLP.pdf
Deptii Chaudhari
 
Emotion mining in text
Emotion mining in textEmotion mining in text
Emotion mining in text
Lovepreet Singh
 
Text Classification/Categorization
Text Classification/CategorizationText Classification/Categorization
Text Classification/Categorization
Oswal Abhishek
 
Information retrieval 9 tf idf weights
Information retrieval 9 tf idf weightsInformation retrieval 9 tf idf weights
Information retrieval 9 tf idf weights
Vaibhav Khanna
 
Natural lanaguage processing
Natural lanaguage processingNatural lanaguage processing
Natural lanaguage processing
gulshan kumar
 
Natural language processing (NLP) introduction
Natural language processing (NLP) introductionNatural language processing (NLP) introduction
Natural language processing (NLP) introduction
Robert Lujo
 
Textmining Introduction
Textmining IntroductionTextmining Introduction
Textmining Introduction
Datamining Tools
 
Information Extraction
Information ExtractionInformation Extraction
Information Extraction
ssbd6985
 
bag-of-words models
bag-of-words models bag-of-words models
bag-of-words models
Xiaotao Zou
 
Text mining Pre-processing
Text mining Pre-processingText mining Pre-processing
Text mining Pre-processing
Creditas
 
Lecture: Word Sense Disambiguation
Lecture: Word Sense DisambiguationLecture: Word Sense Disambiguation
Lecture: Word Sense Disambiguation
Marina Santini
 
Sentiment Analysis
Sentiment AnalysisSentiment Analysis
Sentiment Analysis
Data Science Society
 
Natural language processing: feature extraction
Natural language processing: feature extractionNatural language processing: feature extraction
Natural language processing: feature extraction
Gabriel Hamilton
 
Twitter sentiment analysis ppt
Twitter sentiment analysis pptTwitter sentiment analysis ppt
Twitter sentiment analysis ppt
AntaraBhattacharya12
 
Text classification presentation
Text classification presentationText classification presentation
Text classification presentation
Marijn van Zelst
 
NLP - Sentiment Analysis
NLP - Sentiment AnalysisNLP - Sentiment Analysis
NLP - Sentiment Analysis
Rupak Roy
 
Sentiment analysis using naive bayes classifier
Sentiment analysis using naive bayes classifier Sentiment analysis using naive bayes classifier
Sentiment analysis using naive bayes classifier
Dev Sahu
 
NLP_KASHK:Text Normalization
NLP_KASHK:Text NormalizationNLP_KASHK:Text Normalization
NLP_KASHK:Text Normalization
Hemantha Kulathilake
 

What's hot (20)

Lecture Notes-Finite State Automata for NLP.pdf
Lecture Notes-Finite State Automata for NLP.pdfLecture Notes-Finite State Automata for NLP.pdf
Lecture Notes-Finite State Automata for NLP.pdf
 
Topdown parsing
Topdown parsingTopdown parsing
Topdown parsing
 
Emotion mining in text
Emotion mining in textEmotion mining in text
Emotion mining in text
 
Text Classification/Categorization
Text Classification/CategorizationText Classification/Categorization
Text Classification/Categorization
 
Information retrieval 9 tf idf weights
Information retrieval 9 tf idf weightsInformation retrieval 9 tf idf weights
Information retrieval 9 tf idf weights
 
Natural lanaguage processing
Natural lanaguage processingNatural lanaguage processing
Natural lanaguage processing
 
Natural language processing (NLP) introduction
Natural language processing (NLP) introductionNatural language processing (NLP) introduction
Natural language processing (NLP) introduction
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Textmining Introduction
Textmining IntroductionTextmining Introduction
Textmining Introduction
 
Information Extraction
Information ExtractionInformation Extraction
Information Extraction
 
bag-of-words models
bag-of-words models bag-of-words models
bag-of-words models
 
Text mining Pre-processing
Text mining Pre-processingText mining Pre-processing
Text mining Pre-processing
 
Lecture: Word Sense Disambiguation
Lecture: Word Sense DisambiguationLecture: Word Sense Disambiguation
Lecture: Word Sense Disambiguation
 
Sentiment Analysis
Sentiment AnalysisSentiment Analysis
Sentiment Analysis
 
Natural language processing: feature extraction
Natural language processing: feature extractionNatural language processing: feature extraction
Natural language processing: feature extraction
 
Twitter sentiment analysis ppt
Twitter sentiment analysis pptTwitter sentiment analysis ppt
Twitter sentiment analysis ppt
 
Text classification presentation
Text classification presentationText classification presentation
Text classification presentation
 
NLP - Sentiment Analysis
NLP - Sentiment AnalysisNLP - Sentiment Analysis
NLP - Sentiment Analysis
 
Sentiment analysis using naive bayes classifier
Sentiment analysis using naive bayes classifier Sentiment analysis using naive bayes classifier
Sentiment analysis using naive bayes classifier
 
NLP_KASHK:Text Normalization
NLP_KASHK:Text NormalizationNLP_KASHK:Text Normalization
NLP_KASHK:Text Normalization
 

Similar to Analyzing Text Preprocessing and Feature Selection Methods for Sentiment Analysis

Project prSentiment Analysis of Twitter Data Using Machine Learning Approach...
Project prSentiment Analysis  of Twitter Data Using Machine Learning Approach...Project prSentiment Analysis  of Twitter Data Using Machine Learning Approach...
Project prSentiment Analysis of Twitter Data Using Machine Learning Approach...
Geetika Gautam
 
Neural Network Based Context Sensitive Sentiment Analysis
Neural Network Based Context Sensitive Sentiment AnalysisNeural Network Based Context Sensitive Sentiment Analysis
Neural Network Based Context Sensitive Sentiment Analysis
Editor IJCATR
 
Aq35241246
Aq35241246Aq35241246
Aq35241246
IJERA Editor
 
Major presentation
Major presentationMajor presentation
Major presentation
PS241092
 
Issues in Sentiment analysis
Issues in Sentiment analysisIssues in Sentiment analysis
Issues in Sentiment analysis
IOSR Journals
 
INFORMATION RETRIEVAL FROM TEXT
INFORMATION RETRIEVAL FROM TEXTINFORMATION RETRIEVAL FROM TEXT
INFORMATION RETRIEVAL FROM TEXT
ijcseit
 
Improving Sentiment Analysis of Short Informal Indonesian Product Reviews usi...
Improving Sentiment Analysis of Short Informal Indonesian Product Reviews usi...Improving Sentiment Analysis of Short Informal Indonesian Product Reviews usi...
Improving Sentiment Analysis of Short Informal Indonesian Product Reviews usi...
TELKOMNIKA JOURNAL
 
Survey of Machine Learning Techniques in Textual Document Classification
Survey of Machine Learning Techniques in Textual Document ClassificationSurvey of Machine Learning Techniques in Textual Document Classification
Survey of Machine Learning Techniques in Textual Document Classification
IOSR Journals
 
Co-Extracting Opinions from Online Reviews
Co-Extracting Opinions from Online ReviewsCo-Extracting Opinions from Online Reviews
Co-Extracting Opinions from Online Reviews
Editor IJCATR
 
Estimating the overall sentiment score by inferring modus ponens law
Estimating the overall sentiment score by inferring modus ponens lawEstimating the overall sentiment score by inferring modus ponens law
Estimating the overall sentiment score by inferring modus ponens law
International Journal of Advance Research and Innovative Ideas in Education
 
A FILM SYNOPSIS GENRE CLASSIFIER BASED ON MAJORITY VOTE
A FILM SYNOPSIS GENRE CLASSIFIER BASED ON MAJORITY VOTEA FILM SYNOPSIS GENRE CLASSIFIER BASED ON MAJORITY VOTE
A FILM SYNOPSIS GENRE CLASSIFIER BASED ON MAJORITY VOTE
ijnlc
 
A FILM SYNOPSIS GENRE CLASSIFIER BASED ON MAJORITY VOTE
A FILM SYNOPSIS GENRE CLASSIFIER BASED ON MAJORITY VOTEA FILM SYNOPSIS GENRE CLASSIFIER BASED ON MAJORITY VOTE
A FILM SYNOPSIS GENRE CLASSIFIER BASED ON MAJORITY VOTE
kevig
 
Business recommendation based on collaborative filtering and feature engineer...
Business recommendation based on collaborative filtering and feature engineer...Business recommendation based on collaborative filtering and feature engineer...
Business recommendation based on collaborative filtering and feature engineer...
IJECEIAES
 
IRJET-Sentiment Analysis in Twitter
IRJET-Sentiment Analysis in TwitterIRJET-Sentiment Analysis in Twitter
IRJET-Sentiment Analysis in Twitter
IRJET Journal
 
Supervised Sentiment Classification using DTDP algorithm
Supervised Sentiment Classification using DTDP algorithmSupervised Sentiment Classification using DTDP algorithm
Supervised Sentiment Classification using DTDP algorithm
IJSRD
 
Camera ready sentiment analysis : quantification of real time brand advocacy ...
Camera ready sentiment analysis : quantification of real time brand advocacy ...Camera ready sentiment analysis : quantification of real time brand advocacy ...
Camera ready sentiment analysis : quantification of real time brand advocacy ...
Absolutdata Analytics
 
An experimental study of feature
An experimental study of featureAn experimental study of feature
An experimental study of feature
ijscai
 
An Experimental Study of Feature Extraction Techniques in Opinion Mining
An Experimental Study of Feature Extraction Techniques in Opinion MiningAn Experimental Study of Feature Extraction Techniques in Opinion Mining
An Experimental Study of Feature Extraction Techniques in Opinion Mining
IJSCAI Journal
 
Sentiment Analysis Using Hybrid Approach: A Survey
Sentiment Analysis Using Hybrid Approach: A SurveySentiment Analysis Using Hybrid Approach: A Survey
Sentiment Analysis Using Hybrid Approach: A Survey
IJERA Editor
 
Brm unit iv - cheet sheet
Brm   unit iv - cheet sheetBrm   unit iv - cheet sheet
Brm unit iv - cheet sheet
Hallmark B-school
 

Similar to Analyzing Text Preprocessing and Feature Selection Methods for Sentiment Analysis (20)

Project prSentiment Analysis of Twitter Data Using Machine Learning Approach...
Project prSentiment Analysis  of Twitter Data Using Machine Learning Approach...Project prSentiment Analysis  of Twitter Data Using Machine Learning Approach...
Project prSentiment Analysis of Twitter Data Using Machine Learning Approach...
 
Neural Network Based Context Sensitive Sentiment Analysis
Neural Network Based Context Sensitive Sentiment AnalysisNeural Network Based Context Sensitive Sentiment Analysis
Neural Network Based Context Sensitive Sentiment Analysis
 
Aq35241246
Aq35241246Aq35241246
Aq35241246
 
Major presentation
Major presentationMajor presentation
Major presentation
 
Issues in Sentiment analysis
Issues in Sentiment analysisIssues in Sentiment analysis
Issues in Sentiment analysis
 
INFORMATION RETRIEVAL FROM TEXT
INFORMATION RETRIEVAL FROM TEXTINFORMATION RETRIEVAL FROM TEXT
INFORMATION RETRIEVAL FROM TEXT
 
Improving Sentiment Analysis of Short Informal Indonesian Product Reviews usi...
Improving Sentiment Analysis of Short Informal Indonesian Product Reviews usi...Improving Sentiment Analysis of Short Informal Indonesian Product Reviews usi...
Improving Sentiment Analysis of Short Informal Indonesian Product Reviews usi...
 
Survey of Machine Learning Techniques in Textual Document Classification
Survey of Machine Learning Techniques in Textual Document ClassificationSurvey of Machine Learning Techniques in Textual Document Classification
Survey of Machine Learning Techniques in Textual Document Classification
 
Co-Extracting Opinions from Online Reviews
Co-Extracting Opinions from Online ReviewsCo-Extracting Opinions from Online Reviews
Co-Extracting Opinions from Online Reviews
 
Estimating the overall sentiment score by inferring modus ponens law
Estimating the overall sentiment score by inferring modus ponens lawEstimating the overall sentiment score by inferring modus ponens law
Estimating the overall sentiment score by inferring modus ponens law
 
A FILM SYNOPSIS GENRE CLASSIFIER BASED ON MAJORITY VOTE
A FILM SYNOPSIS GENRE CLASSIFIER BASED ON MAJORITY VOTEA FILM SYNOPSIS GENRE CLASSIFIER BASED ON MAJORITY VOTE
A FILM SYNOPSIS GENRE CLASSIFIER BASED ON MAJORITY VOTE
 
A FILM SYNOPSIS GENRE CLASSIFIER BASED ON MAJORITY VOTE
A FILM SYNOPSIS GENRE CLASSIFIER BASED ON MAJORITY VOTEA FILM SYNOPSIS GENRE CLASSIFIER BASED ON MAJORITY VOTE
A FILM SYNOPSIS GENRE CLASSIFIER BASED ON MAJORITY VOTE
 
Business recommendation based on collaborative filtering and feature engineer...
Business recommendation based on collaborative filtering and feature engineer...Business recommendation based on collaborative filtering and feature engineer...
Business recommendation based on collaborative filtering and feature engineer...
 
IRJET-Sentiment Analysis in Twitter
IRJET-Sentiment Analysis in TwitterIRJET-Sentiment Analysis in Twitter
IRJET-Sentiment Analysis in Twitter
 
Supervised Sentiment Classification using DTDP algorithm
Supervised Sentiment Classification using DTDP algorithmSupervised Sentiment Classification using DTDP algorithm
Supervised Sentiment Classification using DTDP algorithm
 
Camera ready sentiment analysis : quantification of real time brand advocacy ...
Camera ready sentiment analysis : quantification of real time brand advocacy ...Camera ready sentiment analysis : quantification of real time brand advocacy ...
Camera ready sentiment analysis : quantification of real time brand advocacy ...
 
An experimental study of feature
An experimental study of featureAn experimental study of feature
An experimental study of feature
 
An Experimental Study of Feature Extraction Techniques in Opinion Mining
An Experimental Study of Feature Extraction Techniques in Opinion MiningAn Experimental Study of Feature Extraction Techniques in Opinion Mining
An Experimental Study of Feature Extraction Techniques in Opinion Mining
 
Sentiment Analysis Using Hybrid Approach: A Survey
Sentiment Analysis Using Hybrid Approach: A SurveySentiment Analysis Using Hybrid Approach: A Survey
Sentiment Analysis Using Hybrid Approach: A Survey
 
Brm unit iv - cheet sheet
Brm   unit iv - cheet sheetBrm   unit iv - cheet sheet
Brm unit iv - cheet sheet
 

Recently uploaded

ML for identifying fraud using open blockchain data.pptx
ML for identifying fraud using open blockchain data.pptxML for identifying fraud using open blockchain data.pptx
ML for identifying fraud using open blockchain data.pptx
Vijay Dialani, PhD
 
weather web application report.pdf
weather web application report.pdfweather web application report.pdf
weather web application report.pdf
Pratik Pawar
 
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
AJAYKUMARPUND1
 
MCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdfMCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdf
Osamah Alsalih
 
WATER CRISIS and its solutions-pptx 1234
WATER CRISIS and its solutions-pptx 1234WATER CRISIS and its solutions-pptx 1234
WATER CRISIS and its solutions-pptx 1234
AafreenAbuthahir2
 
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdfWater Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation & Control
 
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
obonagu
 
The role of big data in decision making.
The role of big data in decision making.The role of big data in decision making.
The role of big data in decision making.
ankuprajapati0525
 
English lab ppt no titlespecENG PPTt.pdf
English lab ppt no titlespecENG PPTt.pdfEnglish lab ppt no titlespecENG PPTt.pdf
English lab ppt no titlespecENG PPTt.pdf
BrazilAccount1
 
road safety engineering r s e unit 3.pdf
road safety engineering  r s e unit 3.pdfroad safety engineering  r s e unit 3.pdf
road safety engineering r s e unit 3.pdf
VENKATESHvenky89705
 
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
Amil Baba Dawood bangali
 
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
bakpo1
 
H.Seo, ICLR 2024, MLILAB, KAIST AI.pdf
H.Seo,  ICLR 2024, MLILAB,  KAIST AI.pdfH.Seo,  ICLR 2024, MLILAB,  KAIST AI.pdf
H.Seo, ICLR 2024, MLILAB, KAIST AI.pdf
MLILAB
 
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&BDesign and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Sreedhar Chowdam
 
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
MdTanvirMahtab2
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
zwunae
 
space technology lecture notes on satellite
space technology lecture notes on satellitespace technology lecture notes on satellite
space technology lecture notes on satellite
ongomchris
 
Investor-Presentation-Q1FY2024 investor presentation document.pptx
Investor-Presentation-Q1FY2024 investor presentation document.pptxInvestor-Presentation-Q1FY2024 investor presentation document.pptx
Investor-Presentation-Q1FY2024 investor presentation document.pptx
AmarGB2
 
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Dr.Costas Sachpazis
 
Final project report on grocery store management system..pdf
Final project report on grocery store management system..pdfFinal project report on grocery store management system..pdf
Final project report on grocery store management system..pdf
Kamal Acharya
 

Recently uploaded (20)

ML for identifying fraud using open blockchain data.pptx
ML for identifying fraud using open blockchain data.pptxML for identifying fraud using open blockchain data.pptx
ML for identifying fraud using open blockchain data.pptx
 
weather web application report.pdf
weather web application report.pdfweather web application report.pdf
weather web application report.pdf
 
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
 
MCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdfMCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdf
 
WATER CRISIS and its solutions-pptx 1234
WATER CRISIS and its solutions-pptx 1234WATER CRISIS and its solutions-pptx 1234
WATER CRISIS and its solutions-pptx 1234
 
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdfWater Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdf
 
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
 
The role of big data in decision making.
The role of big data in decision making.The role of big data in decision making.
The role of big data in decision making.
 
English lab ppt no titlespecENG PPTt.pdf
English lab ppt no titlespecENG PPTt.pdfEnglish lab ppt no titlespecENG PPTt.pdf
English lab ppt no titlespecENG PPTt.pdf
 
road safety engineering r s e unit 3.pdf
road safety engineering  r s e unit 3.pdfroad safety engineering  r s e unit 3.pdf
road safety engineering r s e unit 3.pdf
 
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
 
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
 
H.Seo, ICLR 2024, MLILAB, KAIST AI.pdf
H.Seo,  ICLR 2024, MLILAB,  KAIST AI.pdfH.Seo,  ICLR 2024, MLILAB,  KAIST AI.pdf
H.Seo, ICLR 2024, MLILAB, KAIST AI.pdf
 
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&BDesign and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
 
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
 
space technology lecture notes on satellite
space technology lecture notes on satellitespace technology lecture notes on satellite
space technology lecture notes on satellite
 
Investor-Presentation-Q1FY2024 investor presentation document.pptx
Investor-Presentation-Q1FY2024 investor presentation document.pptxInvestor-Presentation-Q1FY2024 investor presentation document.pptx
Investor-Presentation-Q1FY2024 investor presentation document.pptx
 
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
 
Final project report on grocery store management system..pdf
Final project report on grocery store management system..pdfFinal project report on grocery store management system..pdf
Final project report on grocery store management system..pdf
 

Analyzing Text Preprocessing and Feature Selection Methods for Sentiment Analysis

  • 1. TE Project Based Seminar On Analyzing Text Preprocessing and Feature Selection Methods for Sentiment Analysis Student’s Name: Nirav Raje Guide’s Name: Dr. Debajyoti Mukhopadhyay
  • 2. Sentiment Analysis - Introduction
      ▪ Definition: the task of automatically classifying a text written in a natural language as expressing a positive or negative feeling, opinion or subjectivity.
      ▪ Analyzing the subjectivity of a text is the main task of Sentiment Analysis (SA).
      ▪ Other tasks:
          ▪ Predicting the polarity of a given sentence.
          ▪ Identifying the emotional status of a sentence.
  • 3. Process of Sentiment Analysis: Data Gathering → Text Pre-processing → Feature Extraction → Feature Vector → Classifier → Evaluation
  • 4. Challenges in SA
      ▪ Personal interpretation of individuals
      ▪ Noise and uninformative parts in text
      ▪ Words with no impact on the SA of a text
      ▪ Sarcasm
      ▪ Named Entity Recognition
      ▪ Anaphora Resolution (pronoun/noun phrase resolution)
  • 5. Text Pre-processing
      ▪ Sentiment analysis is mainly a classification task.
      ▪ Pre-processing: the process of cleaning and preparing the text for classification.
      ▪ Pre-processing operations can be broadly divided into 2 categories:
          ▪ Transformations: online text cleaning, white space removal, abbreviation expansion, stemming, stop word removal, negation handling.
          ▪ Filtering: involves the most challenging part, feature selection.
  • 6. Conclusion from Literature Review
      ▪ An extended comparison of sentiment polarity classification methods for Twitter text has not been done.
      ▪ The effect on different data sets has not been analyzed.
      ▪ Hence, we present the role of text pre-processing in sentiment analysis, and report experimental results demonstrating that feature selection and representation can affect classification performance positively.
      ▪ 3 different data sets have been used to examine classifier accuracies.
  • 7. Problem Statement
      ▪ To tackle the extended comparison of sentiment polarity classification methods for Twitter text and the role of text pre-processing in sentiment analysis.
      ▪ To provide a report on experimental results which demonstrate that appropriate feature selection and representation procedures positively affect the performance of SA classifiers.
  • 8. Hypothesis of Pre-processing
      ▪ Reducing the noise in the text should help improve the performance of the classifier and speed up the classification process, thus aiding real-time sentiment analysis.
  • 9. Data Transformations
      ▪ Basic Operation and Cleaning:
          ▪ Remove unimportant or disturbing elements.
          ▪ Normalize some misspelled words.
          ▪ Text should not contain URLs, hashtags (e.g. #happy) or mentions (e.g. @BarackObama).
          ▪ Tabs and line breaks are replaced with a blank, and quotation marks with apexes.
          ▪ Sequences of a vowel repeated three or more times are reduced.
          ▪ Laughs, which are normally sequences of "a" and "h", are replaced with a "laugh" tag.
          ▪ Convert text to lowercase.
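The cleaning steps on this slide can be sketched with a few regular expressions. This is a minimal illustration, not the authors' implementation: the function name `clean_tweet` is hypothetical, "apexes" is read here as apostrophes, repeated vowels are collapsed to two (the slide does not say how many to keep), and only strictly alternating "haha"-style laughs are recognized.

```python
import re

def clean_tweet(text: str) -> str:
    """Apply the basic cleaning steps to one tweet (illustrative sketch)."""
    # Drop URLs, then hashtags and mentions.
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    text = re.sub(r"[#@]\w+", " ", text)
    # Assumption: "apexes" means apostrophes, so double quotes become single.
    text = text.replace('"', "'")
    text = text.lower()
    # Replace laughs such as "haha", "hahaha", "ahahah" with a "laugh" tag.
    text = re.sub(r"\ba?(?:ha){2,}h?\b", "laugh", text)
    # Assumption: a vowel repeated three or more times is collapsed to two.
    text = re.sub(r"([aeiou])\1{2,}", r"\1\1", text)
    # Tabs and line breaks become single blanks.
    return re.sub(r"\s+", " ", text).strip()

print(clean_tweet('Hahaha  this is sooooo good!\n#happy @BarackObama http://t.co/xyz "wow"'))
```

Lowercasing before the laugh and vowel rules keeps the patterns short, but any case-sensitive step (such as emoticon matching) would have to run before this function.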
  • 10. Data Transformations
      ▪ Emoticon Handling: this module reduces the number of emoticons to only two categories, smile positive and smile negative, as shown in the table.
          ▪ Smile Positive: 0:-)  :)  :D  :*  :o  :P  ;)
          ▪ Smile Negative: >:(  ;(  >:)  D:<  :(  :|  >:/
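One way to realize this two-category reduction is a lookup table compiled into a single regular expression. The sketch below uses the emoticons listed on the slide; the tag spellings `smile_positive` and `smile_negative` and the function name `tag_emoticons` are assumptions made for illustration.

```python
import re

POSITIVE = ["0:-)", ":)", ":D", ":*", ":o", ":P", ";)"]
NEGATIVE = [">:(", ";(", ">:)", "D:<", ":(", ":|", ">:/"]

# Map every emoticon to its (hypothetical) category tag.
EMOTICON_TAG = {e: "smile_positive" for e in POSITIVE}
EMOTICON_TAG.update({e: "smile_negative" for e in NEGATIVE})

# Longest emoticons first, so ">:(" is tried before ":(".
_PATTERN = re.compile(
    "|".join(re.escape(e) for e in sorted(EMOTICON_TAG, key=len, reverse=True))
)

def tag_emoticons(text: str) -> str:
    """Replace each known emoticon with its category tag."""
    tagged = _PATTERN.sub(lambda m: f" {EMOTICON_TAG[m.group(0)]} ", text)
    return " ".join(tagged.split())

print(tag_emoticons("great day :) but traffic >:("))
```

Because emoticons such as ":D" and ":P" are case-sensitive, this step would need to run before lowercasing. The pattern does no boundary checking, so it can also fire inside ordinary text that happens to contain ":o".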
  • 11. Data Transformations
      ▪ Negation Handling:
          ▪ Deals with negations (like "not good").
          ▪ All negative constructs (can't, don't, isn't, never, etc.) are replaced with "not".
      ▪ Dictionary:
          ▪ Detect and correct misspelled words using a dictionary.
          ▪ Substitute slang with its formal meaning (e.g., l8 → late), using a list.
          ▪ Replace insults with the tag "bad word".
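The negation and list-based substitutions can be sketched as follows. The word lists are tiny placeholders rather than the resources used in the experiments: only "l8" comes from the slide, the other entries are illustrative, and the tag is written `bad_word` so it stays a single token. Dictionary-based spelling correction is left out because it needs an external word list.

```python
import re

# Negative constructs named on the slide; a real list would be longer.
NEGATION_RE = re.compile(r"\b(?:can't|don't|isn't|never)\b")
# Hypothetical slang list; only "l8" is taken from the slide.
SLANG = {"l8": "late", "gr8": "great"}
# Placeholder insult list.
INSULTS = {"idiot", "moron"}

def normalize_tokens(text: str) -> str:
    """Replace negations with 'not', expand slang, and tag insults."""
    # Normalize curly apostrophes so "isn’t" matches "isn't".
    text = text.lower().replace("\u2019", "'")
    text = NEGATION_RE.sub("not", text)
    out = []
    for tok in text.split():
        tok = SLANG.get(tok, tok)
        out.append("bad_word" if tok in INSULTS else tok)
    return " ".join(out)

print(normalize_tokens("I can't be l8 you idiot"))
```

Tokens are split on whitespace only, so a slang word followed directly by punctuation ("l8,") would not be matched; a real pipeline would tokenize first.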
  • 12. Data Transformations
      ▪ Stemming:
          ▪ Reduces words to their root form and groups them.
          ▪ Puts word variations like "great", "greatly", "greatest", and "greater" all into one bucket.
          ▪ Effectively decreases entropy and increases the relevance of the concept of "great".
      ▪ Stop Word Removal:
          ▪ Stop words are, for example, pronouns, articles, etc.
          ▪ These could be words like: a, and, is, on, of, or, the, was, with.
          ▪ They can lead to a less accurate classification.
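Both transformations can be illustrated with a short sketch. The stop word set is the example list from the slide. The stemmer is a deliberately naive suffix stripper written only to reproduce the "great" example; it is an assumption of this sketch, not a standard algorithm, and established stemmers use more careful rules that may not conflate exactly the same words.

```python
STOP_WORDS = {"a", "and", "is", "on", "of", "or", "the", "was", "with"}
# Toy suffix list chosen for the "great" example only.
SUFFIXES = ("est", "ly", "er")

def toy_stem(word: str) -> str:
    """Strip one known suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(tokens: list[str]) -> list[str]:
    """Drop stop words, then stem what is left."""
    return [toy_stem(t) for t in tokens if t not in STOP_WORDS]

print([toy_stem(w) for w in ["great", "greatly", "greatest", "greater"]])
print(preprocess("the food was greatly improved".split()))
```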
  • 13. Filtering: Feature Selection
      ▪ Features: words, terms or phrases that strongly express the opinion as positive or negative.
      ▪ Feature selection is the process of selecting the attributes in a dataset that are most relevant to the predictive modeling problem at hand.
      ▪ Drawbacks of extra features:
          ▪ They make document classification slower.
          ▪ They reduce accuracy.
      ▪ Benefits of feature selection:
          ▪ It allows the classifier to fit a model to the problem set more quickly.
          ▪ It allows the classifier to classify items faster.
  • 15. Filtering
      ▪ Feature Weighting Methods:
          1. Feature Frequency (FF):
              ▪ Uses the term frequency, i.e. the number of times each unigram occurs within a document, as the feature values for that document.
          2. Feature Presence (FP):
              ▪ Very similar to feature frequency.
              ▪ Difference: rather than using the frequency of a unigram, we simply use a 1 to indicate its presence.
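The difference between the two weightings is easiest to see on one document. In this minimal sketch, the function names and the fixed `vocabulary` list are assumptions for illustration; each document becomes a vector with one entry per vocabulary term.

```python
from collections import Counter

def feature_frequency(tokens: list[str], vocabulary: list[str]) -> list[int]:
    """FF: how many times each vocabulary term occurs in the document."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

def feature_presence(tokens: list[str], vocabulary: list[str]) -> list[int]:
    """FP: 1 if the vocabulary term occurs in the document, else 0."""
    present = set(tokens)
    return [1 if term in present else 0 for term in vocabulary]

vocab = ["bad", "good", "movie", "plot"]
doc = "good good movie not bad".split()
print(feature_frequency(doc, vocab))
print(feature_presence(doc, vocab))
```

Here "good" contributes 2 under FF but only 1 under FP, while the out-of-vocabulary token "not" is ignored by both.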
  • 16. Filtering
          3. Term Frequency Inverse Document Frequency (TF-IDF):
              ▪ A numerical statistic intended to reflect how important a word is to a document in a collection or corpus.
              ▪ Often used as a weighting factor in information retrieval, text mining and user modeling.
              ▪ The TF-IDF value increases proportionally to the number of times a word appears in the document, and is offset by the number of documents in the corpus that contain the word.
              ▪ TF-IDF = FF × log(N / DF), where:
                  ▪ N is the number of documents,
                  ▪ DF is the number of documents that contain the feature,
                  ▪ FF is the number of occurrences of the feature in the document.
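The formula TF-IDF = FF × log(N / DF) translates directly into code. The sketch below is a plain-Python illustration under two assumptions: the logarithm is taken as the natural log (the slide does not give a base), and each document is already a list of tokens. The function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Return one {term: FF * log(N / DF)} mapping per document."""
    n = len(docs)
    # DF: number of documents containing each term.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        ff = Counter(doc)  # FF: occurrences of each term in this document
        weights.append({t: ff[t] * math.log(n / df[t]) for t in ff})
    return weights

docs = [["good", "movie"], ["bad", "movie"], ["good", "good", "movie", "plot"]]
for w in tf_idf(docs):
    print(w)
```

A term that occurs in every document, such as "movie" here, gets weight 0 because log(N / DF) = log(1) = 0, which is how the measure discounts uninformative words.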
  • 17. Goal of Current Experiment
      ▪ To evaluate the role of pre-processing techniques in classification problems.
      ▪ Hence, we examine the performance of several well-known learning-based classification algorithms using various pre-processing options on three datasets covering different subjects.
  • 18. Performance Evaluation Process: Select Dataset → String to Word Vector → Attribute Selection → Classification → Evaluation
  • 20. Data Sets and Classifiers Used
  • 21. Results of the Proposed Work
      Our evaluation results indicated:
      ▪ Selecting only attributes with information gain (IG) > 0 reduced the number of attributes appreciably.
      ▪ Overall, the algorithms trained faster because of attribute selection.
      ▪ 1-to-3-grams performed better than the other representations, closely followed by unigrams.
      ▪ For the NB classifier, the percentage of correctly classified instances increased by over 7 points.
      ▪ The effect of pre-processing techniques on classifier accuracy was the same regardless of the dataset.
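The IG > 0 filter mentioned in these results can be sketched from first principles. This is a from-scratch illustration, not the tool used in the experiments. It assumes binary term presence, documents represented as sets of tokens, and a small tolerance in place of an exact zero to absorb floating-point error; all function names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels: list[str]) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(docs: list[set[str]], labels: list[str], term: str) -> float:
    """Entropy reduction from splitting documents on presence of `term`."""
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = sum(len(p) / n * entropy(p) for p in (with_term, without) if p)
    return entropy(labels) - conditional

def select_features(docs: list[set[str]], labels: list[str]) -> list[str]:
    """Keep only terms whose information gain is above zero."""
    vocab = sorted({t for d in docs for t in d})
    return [t for t in vocab if information_gain(docs, labels, t) > 1e-12]

docs = [{"good", "movie"}, {"great", "movie"}, {"bad", "movie"}, {"awful", "movie"}]
labels = ["pos", "pos", "neg", "neg"]
print(select_features(docs, labels))
```

In this toy corpus "movie" appears in every document, so it carries no information about the class and is dropped, which mirrors how the filter shrinks the attribute set.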
  • 22. Conclusion
      ▪ Feature selection improves classification accuracy compared with using all created attributes.
      ▪ Significant accuracy rates are obtained when applying attribute selection based on information gain.
      ▪ Unigrams and 1-to-3-grams perform better than the other n-gram representations.
      ▪ Thus, our experimental results illustrate that sentiment analysis accuracy can be improved with appropriate feature selection and representation.
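For readers unfamiliar with the representations being compared, a 1-to-3-gram representation contains every unigram, bigram and trigram of the token sequence. The following minimal sketch uses a hypothetical helper `ngrams`, and joining tokens with a space is an arbitrary choice.

```python
def ngrams(tokens: list[str], n_min: int = 1, n_max: int = 3) -> list[str]:
    """Return all n-grams of the token list for n_min <= n <= n_max."""
    return [
        " ".join(tokens[i : i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]

print(ngrams(["not", "very", "good"]))        # 1-to-3-grams
print(ngrams(["not", "very", "good"], 1, 1))  # unigrams only
```

Longer n-grams keep phrases such as "not very good" intact, which unigrams alone cannot represent.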
  • 23. Future Work
      ▪ To investigate the available pre-processing options further in order to find the optimal settings.
      ▪ To focus on choosing the best algorithm for attribute selection strategies.
      ▪ To evaluate ranking methods such as information gain, chi-square, etc.
      ▪ To involve embedded methods, which carry out feature selection and model tuning at the same time.