RCOMM 2011 - Sentiment Classification with RapidMiner
Sentiment Classification with RapidMiner Bruno Ohana and Brendan Tierney DIT School of Computing June 2011
Our Talk: Introduction to Sentiment Analysis. Supervised Learning Approaches. Case Study with RapidMiner.
Motivation “81% of US internet users (60% of population) have used the internet to perform research on a product they intended to purchase, as of 2007.” “Over 30% of US internet users have at one time posted a comment or online review about a product or service they’ve purchased.” (Horrigan, 2008)
Motivation A lot of online content is subjective in nature. User Generated Content: Product reviews, blog posts, Twitter, etc. epinions.com, Amazon, RottenTomatoes.com. Sheer volume of opinion data calls for automated analytical methods.
Why Are Automated Methods Relevant? Search and Recommendation Engines. Show me only positive/negative/neutral. Market Research. What is being said about brand X on Twitter? Contextual Ad Placement. Mediation of online communities.
A Growing Industry: Opinion Mining offerings. Voice of Customer analytics. Social Media Monitoring. SaaS or embedded in data mining packages.
Opinion Mining – Sentiment Classification For a given Text Document, Determine Sentiment Orientation: Positive or Negative, Favorable or Unfavorable, etc. Binary or along a scale (e.g. 1-5 stars). Data is in unstructured text format. From sentence to document level. Ex: Positive or Negative? “This is by far the worst hotel experience ive ever had. the owner overbooked while i was staying there (even though i booked the room two months in advance) and made me move to another room, but that room wasnt even a hotel room!”
Supervised Learning for Text Train a classifier algorithm based on a training data set. Raw data will be text. Approach: Use term presence information as features. A plain text document becomes a word vector.
Supervised Learning for Text A word vector can be used to train a classifier. Building a Word Vector: Unit of tokenization: uni/bi/n-gram. Term presence metric: binary, tf-idf, frequency. Stemming. Stop words removal. Pipeline: IMDB Data Set (Plain Text) → Tokenize → Stemming → Word Vector → Train Classifier.
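The word vector construction above can be sketched in plain Python. This is only an illustration of the idea, not the RapidMiner Text Processing operators: the stop word list, the `crude_stem` suffix stripper and all function names are assumptions made for the example, and a real setup would use a full stop word list and a proper stemmer.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "is", "was", "and", "of", "to", "it", "this"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def crude_stem(token):
    """Very rough suffix stripping; a hypothetical stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text, n=1):
    """Tokenize, drop stop words, stem, and emit n-grams joined by '_'."""
    tokens = [crude_stem(t) for t in tokenize(text) if t not in STOP_WORDS]
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_word_vectors(docs, metric="binary", n=1):
    """Return (vocabulary, vectors) using binary, frequency or tf-idf weights."""
    term_lists = [preprocess(d, n) for d in docs]
    vocab = sorted({t for terms in term_lists for t in terms})
    # Number of documents each term appears in, needed for idf.
    doc_freq = Counter(t for terms in term_lists for t in set(terms))
    vectors = []
    for terms in term_lists:
        counts = Counter(terms)
        row = []
        for term in vocab:
            tf = counts[term]
            if metric == "binary":
                row.append(1 if tf else 0)
            elif metric == "frequency":
                row.append(tf)
            else:  # tf-idf, with idf = log(N / document frequency)
                row.append(tf * math.log(len(docs) / doc_freq[term]))
        vectors.append(row)
    return vocab, vectors

docs = ["This movie was funny and moving", "The plot was boring", "Funny funny plot"]
vocab, vectors = build_word_vectors(docs, metric="binary")
```

Each row of `vectors` lines up with `vocab`, so it can be handed to any learner that accepts numeric attributes. Passing `n=2` switches the unit of tokenization to bigrams.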
Opinion Mining – Sentiment Classification Challenges of Data Driven Approaches Domain dependence. “chuck norris” might be a good sentiment predictor, but on movies only. We lose discourse information. Ex: negation detection. “This comedy is not really funny.” NLP techniques might help.
RapidMiner Case Study Sentiment Classification based on Word Vectors. Convert Text data to Word Vectors Using RapidMiner’s Text Processing Extension. Use it to Train/Test a Learner Model. Using Cross-Validation. Using Correlation and Parameter Testing to pick better features. Our data set is a collection of Film reviews from IMDB presented in (Pang et al, 2004).
RapidMiner Case Study Selects document collection from a directory. From text to list of tokens. Convert word variations to their stem.
RapidMiner Case Study Parameter Testing - Filter “top K” most correlated attributes. - K is a macro iterated using Parameter Testing.
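As a rough illustration of the "top K" filter, the sketch below ranks attributes by the absolute Pearson correlation with the label and keeps the K best. It is a plain Python approximation of the idea, not the RapidMiner operators; the function names and the toy data are assumptions made for the example.

```python
import math

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def top_k_attributes(vectors, labels, k):
    """Indices of the k attributes with the largest |correlation| to the label."""
    n_attrs = len(vectors[0])
    weights = [abs(pearson([row[j] for row in vectors], labels))
               for j in range(n_attrs)]
    ranked = sorted(range(n_attrs), key=lambda j: weights[j], reverse=True)
    return sorted(ranked[:k])

def project(vectors, keep):
    """Keep only the selected attribute columns."""
    return [[row[j] for j in keep] for row in vectors]

# Toy data: attribute 0 tracks the label, attribute 1 is its inverse,
# attribute 2 is unrelated and attribute 3 is constant.
vectors = [[1, 0, 1, 1], [1, 0, 0, 1], [0, 1, 1, 1], [0, 1, 0, 1]]
labels = [1, 1, 0, 0]

# Iterating K, in the spirit of the Parameter Testing loop.
for k in (1, 2, 3):
    keep = top_k_attributes(vectors, labels, k)
    reduced = project(vectors, keep)
    print(k, keep)
```

In a full experiment each value of K would be scored by cross-validated accuracy, and the best performing K retained.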
RapidMiner Case Study Cross Validation - Training Step. Calculate Attribute Weights and Normalize. Pass models on “through port” to Testing. Select “top k” attributes by weight and train SVM.
RapidMiner Case Study Cross Validation – Testing Step
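The training/testing split inside cross-validation can be sketched as follows. To stay dependency-free the example uses a nearest-centroid learner as a stand-in; it is not an SVM, and the function names and toy data are assumptions. The point it tries to show is that everything fitted in the training step (here the centroids, in the case study also the attribute weights) is computed from training rows only and then handed to the testing step.

```python
def k_fold_indices(n_examples, n_folds):
    """Split range(n_examples) into n_folds interleaved test folds.

    Assumes n_examples >= n_folds so that no fold is empty.
    """
    return [list(range(fold, n_examples, n_folds)) for fold in range(n_folds)]

def train_centroids(rows, labels):
    """Stand-in learner (NOT an SVM): one mean vector per class."""
    centroids = {}
    for label in set(labels):
        members = [r for r, y in zip(rows, labels) if y == label]
        centroids[label] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def predict(centroids, row):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    def dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(row, centroid))
    return min(sorted(centroids), key=lambda label: dist(centroids[label]))

def cross_validate(rows, labels, n_folds):
    """Average accuracy over folds; the model only ever sees training rows."""
    accuracies = []
    for test_idx in k_fold_indices(len(rows), n_folds):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(rows)) if i not in held_out]
        # Training step: fit on the training rows only.
        model = train_centroids([rows[i] for i in train_idx],
                                [labels[i] for i in train_idx])
        # Testing step: apply the fitted model to the held-out rows.
        hits = sum(predict(model, rows[i]) == labels[i] for i in test_idx)
        accuracies.append(hits / len(test_idx))
    return sum(accuracies) / len(accuracies)

rows = [[1, 0], [0, 1], [1, 0], [0, 1], [1, 0], [0, 1]]
labels = [1, 0, 1, 0, 1, 0]
accuracy = cross_validate(rows, labels, n_folds=3)
```

Any attribute selection step would sit inside the loop, before `train_centroids`, so that held-out rows never influence which attributes are kept.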
Case Study – Adding More Features Pre-computed features based on text statistics: Document, Word and Sentence Sizes, Part-of-speech Presence, Stop words ratio, Syllable Count. Features based on scoring using a sentiment lexicon (Ohana & Tierney ‘09). Used SentiWordNet as the Lexicon (Esuli et al, 09). In RapidMiner we can merge those data sets using a known unique ID (File name in our case).
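Merging feature sets on a shared ID amounts to a join. The sketch below shows the idea in plain Python, assuming each feature set is a dictionary keyed by file name; the file names, feature names and values are made up for illustration and are not taken from the data set.

```python
def merge_by_id(*feature_sets):
    """Inner-join several {doc_id: {feature: value}} tables on doc_id."""
    shared_ids = set(feature_sets[0])
    for features in feature_sets[1:]:
        shared_ids &= set(features)
    merged = {}
    for doc_id in sorted(shared_ids):
        row = {}
        for features in feature_sets:
            row.update(features[doc_id])
        merged[doc_id] = row
    return merged

# Hypothetical example rows keyed by file name.
doc_stats = {
    "review_001.txt": {"word_count": 120, "stop_word_ratio": 0.4},
    "review_002.txt": {"word_count": 80, "stop_word_ratio": 0.5},
}
swn_scores = {
    "review_001.txt": {"pos_total": 3.5, "neg_total": 1.0},
    "review_003.txt": {"pos_total": 0.5, "neg_total": 2.0},
}

merged = merge_by_id(doc_stats, swn_scores)
```

Only documents present in every feature set survive the join, which avoids rows with missing attributes. Feature names are assumed to be unique across sets, since later sets would overwrite earlier ones on a clash.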
Opinion Lexicons A database of terms and the opinion information they carry. Some terms and expressions carry “a priori” opinion bias, relatively independent from context. Ex: good, excellent, bad, poor. To build the data set: Score document based on terms found. Total positive/negative scores. Per part-of-speech. Per document section.
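Lexicon-based scoring can be sketched as a lookup and a running sum. The lexicon below is a toy: its entries, scores and part-of-speech codes are invented for the example and are not SentiWordNet values, and the feature names are assumptions.

```python
# Toy lexicon: (term, part of speech) -> (positive score, negative score).
# Values are invented for illustration and are NOT taken from SentiWordNet.
TOY_LEXICON = {
    ("good", "a"): (0.75, 0.0),
    ("excellent", "a"): (1.0, 0.0),
    ("bad", "a"): (0.0, 0.625),
    ("poor", "a"): (0.0, 0.5),
    ("love", "v"): (0.5, 0.0),
}

def score_document(tagged_tokens, lexicon=TOY_LEXICON):
    """Sum positive/negative scores overall and per part of speech."""
    features = {"pos_total": 0.0, "neg_total": 0.0}
    for word, pos in tagged_tokens:
        # Terms missing from the lexicon contribute nothing.
        positive, negative = lexicon.get((word.lower(), pos), (0.0, 0.0))
        features["pos_total"] += positive
        features["neg_total"] += negative
        features[f"pos_{pos}"] = features.get(f"pos_{pos}", 0.0) + positive
        features[f"neg_{pos}"] = features.get(f"neg_{pos}", 0.0) + negative
    return features

# (word, part of speech) pairs; "a" = adjective, "n" = noun, "v" = verb.
tagged = [("acting", "n"), ("was", "v"), ("excellent", "a"),
          ("plot", "n"), ("was", "v"), ("poor", "a")]
features = score_document(tagged)
```

Scores per document section could be obtained the same way by calling `score_document` on each section's tokens separately.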
Lexicon Based Approach Pipeline: IMDB Data Set (Plain Text) → POS Tagger → Negation Detection → Scoring (using SentiWordNet) → Document Scores (SWN Features).
Part of Speech Tagging
Input: The computer-animated comedy " shrek " is designed to be enjoyed on different levels by different groups . for children , it offers imaginative visuals , appealing new characters mixed with a host of familiar faces , loads of action and a barrage of big laughs
Tagged: The/DT computer-animated/JJ comedy/NN / shrek/NN / is/VBZ designed/VBN to/TO be/VB enjoyed/VBN on/IN different/JJ levels/NNS by/IN different/JJ groups/NNS ./. for/IN children/NNS ,/, it/PRP offers/VBZ imaginative/JJ visuals/NNS ,/, appealing/VBG new/JJ characters/NNS mixed/VBN with/IN a/DT host/NN of/IN familiar/JJ faces/NNS ,/, loads/NNS of/IN action/NN and/CC a/DT barrage/NN of/IN big/JJ laughs/NNS
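Tagged output in the `word/TAG` form shown above is straightforward to turn into features. The sketch below parses such a string and counts tags, which is one way to derive part-of-speech presence attributes; the function names are assumptions, and the tagging itself is presumed to have been done by an external tagger.

```python
from collections import Counter

def parse_tagged(text):
    """Split 'word/TAG' tokens into (word, tag) pairs, splitting on the last '/'."""
    pairs = []
    for token in text.split():
        word, sep, tag = token.rpartition("/")
        # Skip tokens that lack either a word or a tag (e.g. a bare '/').
        if sep and word and tag:
            pairs.append((word, tag))
    return pairs

def tag_counts(pairs):
    """Count how often each part-of-speech tag occurs."""
    return Counter(tag for _, tag in pairs)

tagged = ("The/DT computer-animated/JJ comedy/NN is/VBZ designed/VBN "
          "to/TO be/VB enjoyed/VBN on/IN different/JJ levels/NNS ./.")
pairs = parse_tagged(tagged)
counts = tag_counts(pairs)
```

Dividing each count by `len(pairs)` would give tag ratios, which are less sensitive to document length than raw counts.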
Negation Detection NegEx (Chapman et al ’01). Look for negating expressions Pseudo-negations. “no wonder”, “no change”, “not only” Forward and Backward Scope. “don’t”, “not”, “without”, “unlikely to”, etc…
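A much simplified, NegEx-inspired sketch of negation scoping is given below. It is not the NegEx algorithm itself: it handles forward scope only, uses a fixed token window (an assumption), and its trigger lists contain just the examples named above. Tokens are assumed to be lowercased and split on whitespace.

```python
NEGATIONS = ("don't", "not", "without", "unlikely to")
PSEUDO_NEGATIONS = ("no wonder", "no change", "not only")
WINDOW = 3  # assumed forward scope, in tokens

def match_phrase(tokens, start, phrases):
    """Return the first phrase that begins at tokens[start], or None."""
    rest = " ".join(tokens[start:])
    for phrase in phrases:
        if rest == phrase or rest.startswith(phrase + " "):
            return phrase
    return None

def negated_positions(tokens, window=WINDOW):
    """Return the set of token indices that fall in a forward negation scope."""
    negated = set()
    i = 0
    while i < len(tokens):
        # Pseudo-negations are checked first so "not only" never triggers "not".
        pseudo = match_phrase(tokens, i, PSEUDO_NEGATIONS)
        if pseudo:
            i += len(pseudo.split())
            continue
        trigger = match_phrase(tokens, i, NEGATIONS)
        if trigger:
            start = i + len(trigger.split())
            negated.update(range(start, min(start + window, len(tokens))))
            i = start
            continue
        i += 1
    return negated

tokens = "this comedy is not really funny".split()
scope = negated_positions(tokens)
```

Terms whose index falls in `scope` could then have their lexicon polarity flipped or discounted before document scores are totalled.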
Case Study – Adding More Features Data Set Merging
Results - Accuracy
Average accuracy using 10-fold cross-validation.

Method                                              Accuracy %   Feature Count
Baseline word vector                                85.39        6739
Baseline less uncorrelated attributes               85.49        1800
Document Stats (S)                                  68.73        22
SentiWordNet features (SWN)                         67.40        39
Merging (S) + (SWN)                                 72.79        61
Merging Baseline + (S) + (SWN) and
  removing uncorrelated attributes                  86.39        1800
Opinion Mining – Sentiment Classification
Some results from the field (IMDB data set).

Method                                              Accuracy     Source
Support Vector Machines and Bigrams word vector     77.10%       (Pang et al, 2002)
Word Vector Naïve Bayes + Parts of Speech           77.50%       (Salvetti et al, 2004)
Support Vector Machines and Unigrams word vector    82.90%       (Pang et al, 2002)
Unigrams + Subjectivity Detection                   87.15%       (Pang et al, 2004)
SVM + stylistic features                            87.95%       (Abbasi et al, 2008)
SVM + GA feature selection                          95.55%       (Abbasi et al, 2008)