TE Project Based Seminar
On
Analyzing Text Preprocessing and Feature
Selection Methods for Sentiment Analysis
Student’s Name: Nirav Raje
Guide’s Name: Dr. Debajyoti Mukhopadhyay
• Definition: The task of automatically classifying a text written in a
natural language as expressing a positive or negative feeling, opinion or
subjectivity.
• The subjective analysis of a text is the main task of Sentiment
Analysis (SA).
• Other tasks:
▪ Predicting the polarity of a given sentence.
▪ Identifying the emotional status of a sentence.
Sentiment Analysis - Introduction
Process of Sentiment Analysis
Data Gathering → Text Pre-processing → Feature Extraction → Feature Vector → Classifier → Evaluation
• Personal interpretation of individuals
• Noise and uninformative parts in text
• Words with no impact on SA of text
• Sarcasm
• Named Entity Recognition
• Anaphora Resolution (Pronoun/noun phrase resolution)
Challenges in SA
• Sentiment analysis is mainly a classification task.
• Pre-processing: The process of cleaning and preparing the text for
classification.
• Pre-processing operations can be broadly divided into 2 categories:
▪ Transformations:
Online text cleaning, white space removal, abbreviation expansion,
stemming, stop word removal, negation handling.
▪ Filtering:
Involves feature selection, the most challenging part.
Text Pre-processing
• An extended comparison of sentiment polarity
classification methods for Twitter text has not been
done.
• The effect on different data sets has not been analyzed.
• Hence, we present the role of text pre-processing in
sentiment analysis, and report experimental results
demonstrating that feature selection and representation
can affect classification performance positively.
• Three different data sets have been used to examine classifier
accuracies.
Conclusion from Literature Review
• To carry out an extended comparison of sentiment polarity
classification methods for Twitter text and examine the role of
text pre-processing in sentiment analysis.
• To report experimental results which
demonstrate that, with appropriate feature
selection and representation procedures, the
performance of SA classifiers is positively affected.
Problem Statement
• Reducing the noise in the text should help improve the
performance of the classifier and speed up the
classification process, thus aiding real-time sentiment
analysis.
Hypothesis of Pre-processing
• Basic Operation and Cleaning
▪ Remove unimportant or disturbing elements.
▪ Normalize some misspelled words.
▪ Text should not contain URLs, hashtags (e.g. #happy) or
mentions (e.g. @BarackObama).
▪ Tabs and line breaks should be replaced with a blank, and
quotation marks with apostrophes.
▪ Remove vowels repeated in sequence at least three times.
▪ Laughs, which are normally sequences of “a” and “h”, are
replaced with a “laugh” tag.
▪ Convert text to lowercase.
Data Transformations
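The cleaning steps above can be sketched in Python with the standard `re` module. This is an illustrative sketch only, not the code used in the seminar: the regular expressions, the function name `basic_clean`, and the choice to collapse a run of three or more identical vowels down to a single vowel are all assumptions.

```python
import re

# Assumed patterns; the slides do not specify exact rules.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
TAG_RE = re.compile(r"[#@]\w+")  # hashtags and mentions
# A word made only of "a"/"h", at least 4 letters long, containing both letters.
LAUGH_RE = re.compile(r"\b(?=\w*h)(?=\w*a)[ah]{4,}\b", re.IGNORECASE)
# The same vowel three or more times in a row.
VOWEL_RUN_RE = re.compile(r"([aeiou])\1{2,}", re.IGNORECASE)


def basic_clean(text: str) -> str:
    """Apply the basic cleaning operations in one pass."""
    text = URL_RE.sub(" ", text)           # drop URLs
    text = TAG_RE.sub(" ", text)           # drop hashtags and mentions
    text = text.replace('"', "'")          # quotation marks -> apostrophes
    text = LAUGH_RE.sub(" laugh ", text)   # laughs -> "laugh" tag
    text = VOWEL_RUN_RE.sub(r"\1", text)   # assumption: collapse run to one vowel
    text = text.lower()
    # Tabs, line breaks and repeated blanks become a single blank.
    return re.sub(r"\s+", " ", text).strip()


print(basic_clean('Sooooo happy\t#happy @BarackObama "wow" hahaha http://t.co/x'))
```

Laughs are tagged before vowel runs are collapsed so that a stretched laugh is still recognised as a laugh.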
• Emoticon Handling:
This module reduces the number of emoticons to only two
categories: smile positive and smile negative, as shown in the table.

Smile Positive    Smile Negative
0:-)              >:(
:)                ;(
:D                >:)
:*                D:<
:o                :(
:P                :|
;)                >:/
Data Transformations
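A minimal sketch of the emoticon reduction, using the two columns of the table as given. The tag strings `smile_positive` and `smile_negative` and the function name `tag_emoticons` are assumptions made for illustration. Longer emoticons are matched first so that `>:)` is not mistaken for `:)`.

```python
import re

# Emoticon lists copied from the table on the slide.
POSITIVE = ["0:-)", ":)", ":D", ":*", ":o", ":P", ";)"]
NEGATIVE = [">:(", ";(", ">:)", "D:<", ":(", ":|", ">:/"]

# Assumed tag names for the two categories.
EMOTICON_TAGS = {
    **{e: "smile_positive" for e in POSITIVE},
    **{e: "smile_negative" for e in NEGATIVE},
}

# Try longer emoticons first so ">:(" wins over the embedded ":(".
_PATTERN = re.compile(
    "|".join(re.escape(e) for e in sorted(EMOTICON_TAGS, key=len, reverse=True))
)


def tag_emoticons(text: str) -> str:
    """Replace every known emoticon with its category tag."""
    tagged = _PATTERN.sub(lambda m: f" {EMOTICON_TAGS[m.group(0)]} ", text)
    return " ".join(tagged.split())


print(tag_emoticons("great day :) but traffic >:("))
```

This sketch assumes URLs have already been removed, since a sequence such as `:P` or `:o` can also occur inside other text.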
• Negation Handling:
▪ Deals with negations (like “not good”).
▪ All negative constructs (can't, don't, isn't, never, etc.) are
replaced with “not”.
• Dictionary:
▪ Detection and correction of misspelled words using a dictionary.
▪ Substitute slang with its formal meaning (e.g., l8 → late), using a
list.
▪ Replace insults with the tag “bad word”.
Data Transformations
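The negation and dictionary steps can be sketched as simple token lookups. The word lists below are tiny illustrative stand-ins: only "can't", "don't", "isn't", "never" and "l8 → late" come from the slides, while the remaining entries, the tag spelling `bad_word`, and the function name `normalize_tokens` are hypothetical. Spelling correction is left out because it needs a real dictionary.

```python
# Negative constructs named on the slide, plus two assumed extras.
NEGATIONS = {"can't", "don't", "isn't", "never", "won't", "doesn't"}
# "l8" is from the slide; "gr8" is a hypothetical extra entry.
SLANG = {"l8": "late", "gr8": "great"}
# Hypothetical placeholder list of insults.
INSULTS = {"idiot", "stupid"}


def normalize_tokens(text: str) -> str:
    """Map negations to 'not', slang to formal words, insults to a tag."""
    out = []
    for token in text.lower().split():
        word = token.strip(".,!?;:")  # drop surrounding punctuation
        if word in NEGATIONS:
            out.append("not")
        elif word in SLANG:
            out.append(SLANG[word])
        elif word in INSULTS:
            out.append("bad_word")
        elif word:
            out.append(word)
    return " ".join(out)


print(normalize_tokens("I can't be l8, you idiot!"))
```

The lookup expects straight apostrophes, so typographic apostrophes would need to be normalized beforehand.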
• Stemming:
▪ Reduces words to their root form and groups them.
▪ Puts word variations like “great”, “greatly”, “greatest”, and
“greater” all into one bucket.
▪ Effectively decreases entropy and increases the relevance of the
concept of “great”.
• Stop Word Removal:
▪ Stop words are, for example, pronouns, articles, etc.
▪ These could be words like: a, and, is, on, of, or, the, was, with.
▪ They can lead to a less accurate classification.
Data Transformations
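A toy illustration of both steps. The stop word set is the list from the slide. The stemmer is a deliberately naive suffix stripper written only to reproduce the “great” example; it is not the Porter algorithm or any other established stemmer, and the names `toy_stem` and `remove_stop_words_and_stem` are assumptions.

```python
# Stop words listed on the slide.
STOP_WORDS = {"a", "and", "is", "on", "of", "or", "the", "was", "with"}

# Assumed suffix list, chosen only to cover the "great" example.
SUFFIXES = ("est", "ly", "er")


def toy_stem(word: str) -> str:
    """Strip one known suffix if at least 4 characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word


def remove_stop_words_and_stem(tokens: list[str]) -> list[str]:
    """Drop stop words, then stem the remaining tokens."""
    return [toy_stem(t) for t in tokens if t not in STOP_WORDS]


print([toy_stem(w) for w in ["great", "greatly", "greatest", "greater"]])
print(remove_stop_words_and_stem("the movie was greatly acted".split()))
```

A stripper this simple will mangle many ordinary words, which is why a tested stemming library would normally be used instead.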
Feature Selection
• Features: words, terms or phrases that strongly express the opinion
as positive or negative.
• Feature selection is the process of selecting those attributes in your
dataset that are most relevant to the predictive modeling problem
you are working on.
• Drawbacks of extra features:
▪ They make document classification slower.
▪ They reduce accuracy.
• Benefits of feature selection:
▪ Allows the classifier to fit a model to the problem set more quickly.
▪ Allows it to classify items faster.
Filtering
• Feature Weighting Methods:
1. Feature Frequency (FF):
▪ Uses the term frequency, i.e. the number of times each
unigram occurs within a document, as the feature value for that
document.
2. Feature Presence (FP):
▪ Very similar to feature frequency.
▪ Difference: rather than the frequency of a unigram, we simply use a
1 to indicate its presence.
Filtering
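The difference between the two weightings is easy to see on a single tokenized document. This is a minimal sketch; the function names `feature_frequency` and `feature_presence` are assumptions.

```python
from collections import Counter


def feature_frequency(tokens: list[str]) -> dict[str, int]:
    """FF: each unigram is weighted by how often it occurs in the document."""
    return dict(Counter(tokens))


def feature_presence(tokens: list[str]) -> dict[str, int]:
    """FP: each unigram that occurs at all gets the value 1."""
    return {token: 1 for token in set(tokens)}


tokens = "good good movie".split()
print(feature_frequency(tokens))
print(feature_presence(tokens))
```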
3. Term Frequency Inverse Document Frequency (TF-IDF):
▪ A numerical statistic intended to reflect how important a
word is to a document in a collection or corpus.
▪ Often used as a weighting factor in information retrieval, text
mining and user modeling.
▪ The TF-IDF value increases proportionally to the number of
times a word appears in the document.
TF-IDF = FF × log(N / DF)
where
N is the number of documents,
DF is the number of documents that contain this feature,
FF is the number of occurrences of the feature in the document.
Filtering
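The TF-IDF formula on this slide can be computed directly. The sketch below assumes the natural logarithm, since the slide does not state a base, and assumes documents are already tokenized; the function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents: list[list[str]]) -> list[dict[str, float]]:
    """Weight every term of every document as FF * log(N / DF)."""
    n_docs = len(documents)
    # DF: number of documents that contain each term at least once.
    df = Counter()
    for doc in documents:
        df.update(set(doc))
    weights = []
    for doc in documents:
        ff = Counter(doc)  # FF: occurrences of each term in this document
        weights.append({t: ff[t] * math.log(n_docs / df[t]) for t in ff})
    return weights


docs = [["good", "good", "movie"], ["bad", "movie"]]
for row in tf_idf(docs):
    print(row)
```

In this example “movie” appears in both documents, so log(N / DF) = log(1) = 0 and its weight vanishes, while “good” keeps a positive weight.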
• To evaluate the role of pre-processing techniques in
classification problems.
• Hence, we examine the performance of several well-known
learning-based classification algorithms using
various pre-processing options on three different subject
datasets.
Goal of Current Experiment
Performance Evaluation Process
Select Dataset → String to Word Vector → Attribute Selection → Classification → Evaluation
Proposed Pre-processing Techniques
Data Sets and Classifiers Used
Our evaluation results indicated:
• Selecting only attributes with information gain (IG) > 0 decreased
the number of attributes appreciably.
• Overall, algorithms trained faster due to attribute selection.
• 1-to-3-grams performed better than the other representations,
closely followed by unigrams.
• For the Naive Bayes (NB) classifier, the percentage of correctly
classified instances increased by over 7 points.
• The effect of pre-processing techniques on classifier accuracy was
the same regardless of the dataset.
Results of the Proposed Work
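The IG > 0 criterion can be illustrated on a toy corpus. The sketch below treats each term as a binary presence feature and computes IG = H(class) − H(class | term present or absent). It is not the tooling used in the experiment; the data, the small tolerance `1e-12` guarding against floating-point noise, and the function names are assumptions.

```python
import math
from collections import Counter


def entropy(labels: list[str]) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(docs: list[set[str]], labels: list[str], term: str) -> float:
    """Entropy reduction from knowing whether `term` occurs in a document."""
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = sum(
        len(part) / n * entropy(part) for part in (with_term, without_term) if part
    )
    return entropy(labels) - conditional


def select_features(docs: list[set[str]], labels: list[str]) -> set[str]:
    """Keep only the terms whose information gain is above zero."""
    vocabulary = set().union(*docs)
    return {t for t in vocabulary if information_gain(docs, labels, t) > 1e-12}


docs = [{"good", "movie"}, {"great", "movie"}, {"bad", "movie"}, {"awful", "movie"}]
labels = ["pos", "pos", "neg", "neg"]
print(sorted(select_features(docs, labels)))
```

Here “movie” occurs in every document, so it carries no information about the class and is dropped, which is the kind of reduction in attribute count the first result describes.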
• Feature extraction improves classification accuracy
compared with using all created attributes.
• Significant accuracy rates are obtained when applying
attribute selection based on information gain.
• Unigrams and 1-to-3-grams perform better than the other
n-gram representations.
• Thus our experimental results illustrate that, with
appropriate feature selection and representation,
sentiment analysis accuracy can be improved.
Conclusion
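For readers unfamiliar with the representations being compared, a 1-to-3-gram representation contains every unigram, bigram and trigram of a token sequence. A minimal sketch, with the function name `ngrams` and its parameters chosen here for illustration:

```python
def ngrams(tokens: list[str], n_min: int = 1, n_max: int = 3) -> list[str]:
    """Return all n-grams of the token list for n_min <= n <= n_max."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i : i + n]))
    return grams


print(ngrams(["not", "very", "good"]))            # 1-to-3-grams
print(ngrams(["not", "very", "good"], n_max=1))   # unigrams only
```

Longer n-grams keep phrases such as “not very good” together as a single feature, which unigrams alone cannot represent.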
• To further investigate the available pre-processing
options in order to find the optimal settings.
• To focus on choosing the best algorithm for attribute
selection strategies.
• To evaluate ranking methods such as information gain,
chi-square, etc.
• To involve embedded methods, which carry out feature
selection and model tuning at the same time.
Future Work
  • 1. TE Project Based Seminar On Analyzing Text Preprocessing and Feature Selection Methods for Sentiment Analysis Student’s Name: Nirav Raje Guide’s Name: Dr. Debajyoti Mukhopadhyay
  • 2.  Definition: The task of automatically classifying a text written in a natural language into a positive or negative feeling, opinion or subjectivity.  The subjective analysis of a text is the main task of Sentiment Analysis (SA).  Other tasks: ▪ Predicting the polarity of a given sentence ▪ Identifying emotional status of a sentence. Sentiment Analysis - Introduction
  • 3. Process of Sentiment Analysis Data Gathering Text Pre- processing Feature Extraction Feature Vector ClassifierEvaluation
  • 4.  Personal interpretation of individuals  Noise and uninformative parts in text  Words with no impact on SA of text  Sarcasm  Named Entity Recognition  Anaphora Resolution (Pronoun/noun phrase resolution) Challenges in SA
  • 5. Text Pre-processing
      • Sentiment analysis is mainly a classification task.
      • Pre-processing: The process of cleaning and preparing the text for classification.
      • Pre-processing operations can be broadly divided into 2 categories:
          ▪ Transformations: Online text cleaning, white space removal, expanding abbreviations, stemming, stop word removal, negation handling.
          ▪ Filtering: Involves the most challenging part of feature selection.
  • 6. Conclusion from Literature Review
      • An extended comparison of sentiment polarity classification methods for Twitter text has not been done.
      • The effect of pre-processing on different data sets has not been analyzed.
      • Hence, we present the role of text pre-processing in sentiment analysis, and report experimental results demonstrating that feature selection and representation can affect classification performance positively.
      • Three different data sets have been used to examine classifier accuracies.
  • 7. Problem Statement
      • To carry out an extended comparison of sentiment polarity classification methods for Twitter text and examine the role of text pre-processing in sentiment analysis.
      • To report experimental results demonstrating that appropriate feature selection and representation procedures improve the performance of SA classifiers.
  • 8. Hypothesis of Pre-processing
      • Reducing the noise in the text should help improve the performance of the classifier and speed up the classification process, thus aiding real-time sentiment analysis.
  • 9. Data Transformations
      • Basic Operation and Cleaning:
          ▪ Remove unimportant or disturbing elements.
          ▪ Normalize some misspelled words.
          ▪ Text should not contain URLs, hashtags (e.g. #happy) or mentions (e.g. @BarackObama).
          ▪ Tabs and line breaks should be replaced with a blank, and quotation marks with apexes.
          ▪ Remove vowels repeated in sequence at least three times.
          ▪ Laughs, which are normally sequences of “a” and “h”, are replaced with a “laugh” tag.
          ▪ Convert text to lowercase.
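The cleaning steps on slide 9 can be sketched with a few regular expressions. This is a minimal illustration, not the pipeline used in the seminar: the function name `clean_tweet`, the exact patterns, and the choices to read "apexes" as single quotes and to collapse a repeated vowel down to one occurrence are all assumptions.

```python
import re

# All patterns below are illustrative assumptions, not the seminar's exact rules.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
# A "laugh" is taken to be a token of 4+ letters made only of "a" and "h",
# containing both letters (e.g. "haha", "ahahah").
LAUGH_RE = re.compile(r"\b(?=[ah]*a)(?=[ah]*h)[ah]{4,}\b", re.IGNORECASE)
# Assumption: a vowel repeated 3+ times is collapsed to a single vowel;
# the slide does not say how many occurrences to keep.
REPEATED_VOWEL_RE = re.compile(r"([aeiou])\1{2,}", re.IGNORECASE)


def clean_tweet(text: str) -> str:
    """Apply the basic cleaning operations to one tweet."""
    text = URL_RE.sub(" ", text)            # drop URLs
    text = MENTION_RE.sub(" ", text)        # drop @mentions
    text = HASHTAG_RE.sub(" ", text)        # drop #hashtags
    text = re.sub(r'["“”]', "'", text)      # quotation marks -> single quotes (assumed meaning of "apexes")
    text = LAUGH_RE.sub(" laugh ", text)    # laughs -> "laugh" tag
    text = REPEATED_VOWEL_RE.sub(r"\1", text)
    text = text.lower()
    # split() also turns tabs and line breaks into single blanks
    return " ".join(text.split())


if __name__ == "__main__":
    print(clean_tweet('Sooooo happy @BarackObama!!! hahaha\t"great"\nday #happy http://t.co/xyz'))
```

Laughs are replaced before vowels are collapsed so that the laugh pattern sees the original token.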
  • 10. Data Transformations
      • Emoticon Handling: This module reduces the number of emoticons to only two categories, smile positive and smile negative, as shown in the table.

          Smile Positive    Smile Negative
          0:-)              >:(
          :)                ;(
          :D                >:)
          :*                D:<
          :o                :(
          :P                :|
          ;)                >:/
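The two-category emoticon mapping on slide 10 could be implemented as a lookup table plus one compiled pattern. The sketch below uses the emoticons from the slide's table; the tag strings `smile_positive` and `smile_negative` and the function name are hypothetical choices made for illustration.

```python
import re

# Emoticon lists taken from the table on slide 10.
POSITIVE = ["0:-)", ":)", ":D", ":*", ":o", ":P", ";)"]
NEGATIVE = [">:(", ";(", ">:)", "D:<", ":(", ":|", ">:/"]

# Tag names are assumptions; the slide only names the two categories.
EMOTICON_TAG = {e: "smile_positive" for e in POSITIVE}
EMOTICON_TAG.update({e: "smile_negative" for e in NEGATIVE})

# Longest emoticons first, so ">:)" is never read as ":)".
EMOTICON_RE = re.compile(
    "|".join(re.escape(e) for e in sorted(EMOTICON_TAG, key=len, reverse=True))
)


def replace_emoticons(text: str) -> str:
    """Replace each known emoticon with its category tag."""
    tagged = EMOTICON_RE.sub(lambda m: f" {EMOTICON_TAG[m.group(0)]} ", text)
    return " ".join(tagged.split())


if __name__ == "__main__":
    print(replace_emoticons("great day :) but traffic >:("))
```

Because several emoticons are case-sensitive (":D", ":P", "D:<"), a step like this would have to run before the text is lowercased.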
  • 11. Data Transformations
      • Negation Handling:
          ▪ Deals with negations (like “not good”).
          ▪ All negative constructs (can't, don't, isn't, never, etc.) are replaced with “not”.
      • Dictionary:
          ▪ Detect and correct misspelled words using a dictionary.
          ▪ Substitute slang with its formal meaning (e.g., l8 → late), using a list.
          ▪ Replace insults with the tag “bad word”.
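A dictionary-driven version of the negation, slang and insult substitutions on slide 11 might look like the sketch below. Only `can't`, `don't`, `isn't`, `never` and `l8 → late` come from the slide; the remaining dictionary entries, the `bad_word` token spelling, and the function name are hypothetical. Spelling correction is not covered here.

```python
import re

# Negative constructs from the slide; a real list would be longer.
NEGATIONS = {"can't", "don't", "isn't", "never"}
# "l8" is the slide's example; "gr8" is a hypothetical extra entry.
SLANG = {"l8": "late", "gr8": "great"}
# Hypothetical insult list, for illustration only.
INSULTS = {"idiot", "moron"}


def normalize_tokens(text: str) -> list[str]:
    """Tokenize and apply negation, slang and insult substitutions."""
    text = text.lower().replace("’", "'")        # unify apostrophes
    tokens = re.findall(r"[a-z0-9']+", text)
    normalized = []
    for token in tokens:
        if token in NEGATIONS:
            normalized.append("not")             # every negative construct -> "not"
        elif token in INSULTS:
            normalized.append("bad_word")        # assumed single-token form of the tag
        else:
            normalized.append(SLANG.get(token, token))
    return normalized


if __name__ == "__main__":
    print(normalize_tokens("I can't believe you are l8, idiot"))
```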
  • 12. Data Transformations
      • Stemming:
          ▪ Reduces words to their root form and groups them.
          ▪ Puts word variations like “great”, “greatly”, “greatest”, and “greater” all into one bucket.
          ▪ Effectively decreases entropy and increases the relevance of the concept of “great”.
      • Stop Word Removal:
          ▪ Stop words are, for example, pronouns and articles.
          ▪ These could be words like: a, and, is, on, of, or, the, was, with.
          ▪ Keeping them can lead to a less accurate classification.
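The idea of slide 12 can be illustrated with a deliberately simplified sketch. The stop word list is the one given on the slide. The stemmer is a toy suffix stripper written only for this example (its suffix list and minimum stem length are assumptions); it is not the Porter algorithm or any other standard stemmer, and it will mis-stem many real words.

```python
# Stop words listed on the slide.
STOP_WORDS = {"a", "and", "is", "on", "of", "or", "the", "was", "with"}

# Toy suffix list, chosen only to reproduce the slide's "great" example.
SUFFIXES = ("est", "ly", "er")


def toy_stem(word: str, min_stem: int = 3) -> str:
    """Strip one known suffix if at least `min_stem` characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def remove_stop_words_and_stem(tokens: list[str]) -> list[str]:
    """Drop stop words, then stem what is left."""
    return [toy_stem(t) for t in tokens if t not in STOP_WORDS]


if __name__ == "__main__":
    print([toy_stem(w) for w in ["great", "greatly", "greatest", "greater"]])
    print(remove_stop_words_and_stem("the food was greater with the sauce".split()))
```

All four variants of "great" end up in the same bucket, which is the effect the slide describes.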
  • 13. Filtering: Feature Selection
      • Features: words, terms or phrases that strongly express the opinion as positive or negative.
      • Feature selection is the process of selecting those attributes in your dataset that are most relevant to the predictive modeling problem you are working on.
      • Drawbacks of the extra features:
          ▪ They make document classification slower.
          ▪ They reduce accuracy.
      • Benefits of feature selection:
          ▪ It allows the classifier to fit a model to the problem set more quickly.
          ▪ It allows the classifier to classify items faster.
  • 15. Filtering
      • Feature Weighting Methods:
          1. Feature Frequency (FF):
              ▪ Uses the term frequency, i.e. the number of times each unigram occurs within a document, as the feature values for that document.
          2. Feature Presence (FP):
              ▪ Very similar to feature frequency.
              ▪ Difference: rather than using the frequency of a unigram, we simply use a 1 to indicate its presence.
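The difference between the two weighting schemes on slide 15 is easy to see in code. The sketch below builds both vectors for one tokenized document over a fixed vocabulary; the function names and the example vocabulary are illustrative assumptions.

```python
from collections import Counter


def feature_frequency(tokens: list[str], vocabulary: list[str]) -> list[int]:
    """FF: how many times each vocabulary unigram occurs in the document."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


def feature_presence(tokens: list[str], vocabulary: list[str]) -> list[int]:
    """FP: 1 if the unigram occurs in the document at all, else 0."""
    present = set(tokens)
    return [1 if term in present else 0 for term in vocabulary]


if __name__ == "__main__":
    vocabulary = ["good", "bad", "movie"]
    tokens = "good good movie".split()
    print(feature_frequency(tokens, vocabulary))  # counts per vocabulary term
    print(feature_presence(tokens, vocabulary))   # 0/1 flags per vocabulary term
```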
  • 16. Filtering
          3. Term Frequency Inverse Document Frequency (TF-IDF):
              ▪ A numerical statistic intended to reflect how important a word is to a document in a collection or corpus.
              ▪ Often used as a weighting factor in information retrieval, text mining and user modeling.
              ▪ The TF-IDF value increases proportionally to the number of times a word appears in the document, and is offset by the number of documents that contain the word.
              ▪ TF-IDF = FF × log(N / DF), where:
                  N is the number of documents,
                  DF is the number of documents that contain the feature,
                  FF is the number of occurrences of the feature in the document.
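The formula TF-IDF = FF × log(N / DF) from slide 16 can be computed directly. In this sketch the natural logarithm is assumed, since the slide does not state the base, and the function name `tf_idf` is a hypothetical choice.

```python
import math
from collections import Counter


def tf_idf(documents: list[list[str]]) -> list[dict[str, float]]:
    """Return one {term: FF * log(N / DF)} mapping per tokenized document."""
    n_documents = len(documents)

    # DF: number of documents that contain each term at least once.
    document_frequency = Counter()
    for tokens in documents:
        document_frequency.update(set(tokens))

    weights = []
    for tokens in documents:
        feature_frequency = Counter(tokens)  # FF: occurrences in this document
        weights.append({
            term: ff * math.log(n_documents / document_frequency[term])  # natural log (assumed base)
            for term, ff in feature_frequency.items()
        })
    return weights


if __name__ == "__main__":
    docs = [["good", "good", "movie"], ["bad", "movie"]]
    for weight in tf_idf(docs):
        print(weight)
```

A term that appears in every document, such as "movie" in the example, gets weight 0 because log(N / DF) = log(1) = 0.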
  • 17. Goal of Current Experiment
      • To evaluate the role of pre-processing techniques in classification problems.
      • Hence, we examine the performance of several well-known learning-based classification algorithms using various pre-processing options on three different subject datasets.
  • 18. Performance Evaluation Process
      Select Dataset → String to Word Vector → Attribute Selection → Classification → Evaluation
  • 20. Data Sets and Classifiers Used
  • 21. Results of the Proposed Work
      Our evaluation results indicated:
      • When only attributes with information gain (IG) > 0 were selected, the number of attributes decreased appreciably.
      • Overall, the algorithms trained faster due to attribute selection.
      • 1-to-3-grams performed better than the other representations, in close competition with unigrams.
      • For the NB classifier, the percentage of correctly classified instances increased by over 7 points.
      • The effect of pre-processing techniques on classifier accuracy was the same regardless of the dataset.
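The "IG > 0" attribute selection mentioned on slide 21 can be illustrated as follows. This is a sketch under stated assumptions: features are treated as binary (term present or absent), entropy is measured in bits, and the function names and toy data are hypothetical. It is not the tool or configuration used in the experiment.

```python
import math
from collections import Counter


def entropy(labels: list[str]) -> float:
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(docs: list[list[str]], labels: list[str], term: str) -> float:
    """IG of a binary 'term present / absent' feature with respect to the class."""
    with_term = [y for tokens, y in zip(docs, labels) if term in tokens]
    without_term = [y for tokens, y in zip(docs, labels) if term not in tokens]
    total = len(labels)
    conditional = sum(
        len(part) / total * entropy(part)
        for part in (with_term, without_term)
        if part  # skip an empty split
    )
    return entropy(labels) - conditional


def select_features(docs: list[list[str]], labels: list[str], threshold: float = 0.0) -> list[str]:
    """Keep only the terms whose information gain exceeds the threshold."""
    vocabulary = sorted({term for tokens in docs for term in tokens})
    return [t for t in vocabulary if information_gain(docs, labels, t) > threshold]


if __name__ == "__main__":
    docs = [["good", "movie"], ["good", "plot"], ["bad", "movie"], ["bad", "plot"]]
    labels = ["pos", "pos", "neg", "neg"]
    print(select_features(docs, labels))
```

In the toy data, "movie" and "plot" occur equally often in both classes, so their information gain is zero and they are dropped. With real data, a small positive threshold may be safer than exactly 0 because of floating-point rounding.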
  • 22. Conclusion
      • Feature extraction improves the classification accuracy in comparison with using all created attributes.
      • Significant accuracy rates are obtained when applying attribute selection based on information gain.
      • Unigrams and 1-to-3-grams perform better than the other n-gram representations.
      • Thus our experimental results illustrate that with appropriate feature selection and representation, sentiment analysis accuracies can be improved.
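For readers unfamiliar with the "1-to-3-grams" representation named in the conclusion, the sketch below shows one way to generate it from a token list. The function name and parameters are illustrative assumptions.

```python
def ngrams(tokens: list[str], n_min: int = 1, n_max: int = 3) -> list[str]:
    """Return all n-grams of the token list for n_min <= n <= n_max."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


if __name__ == "__main__":
    print(ngrams(["not", "very", "good"]))          # unigrams, bigrams and the trigram
    print(ngrams(["not", "very", "good"], 1, 1))    # unigram representation only
```

Unlike unigrams alone, the 1-to-3-gram representation keeps phrases such as "not very good" as a single feature.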
  • 23. Future Work
      • To investigate the available pre-processing options further in order to find the optimal settings.
      • To focus on choosing the best algorithm for attribute selection strategies.
      • To evaluate ranking methods such as Infogain, Chi-square, etc.
      • To involve embedded methods, which carry out feature selection and model tuning at the same time.