Combining Knowledge and Data Mining to Understand Sentiment
 

WHITE PAPER

Combining Knowledge and Data Mining to Understand Sentiment: A Practical Assessment of Approaches
Table of Contents

Abstract
Introduction
The Elements of Sentiment Analysis
  What Is Sentiment Analysis?
  When Is It Relevant?
  Elements of Sentiment Analysis
Sentiment Analysis Methods
  The Data
  Data Mining Approach
    Benefits of the data mining approach
    Drawback of the data mining approach
  Natural Language Processing Approach
    Step one: taxonomy identification
    Step two: defining objects and attributes
    Step three: defining polarity
    Benefits of the NLP approach
    Drawback of the NLP approach
The Best of Both Worlds
  Data Mining of the Text for the Rule Builder
  Hybrid Approaches
    Polarity scores as additional features
    Stacked models
Results
  Attribute-Level Results
  Overall Results
Other Applications
  Importing Models
  Creating Training Data
  Other Capabilities of SAS® Enterprise Miner™
Conclusions
References
Russell Albright is a Research Statistician Developer at SAS and has been working on SAS® Text Miner algorithms since its initial release more than 10 years ago. He holds a master’s and a doctorate in applied math from Clemson University. Albright has expertise in numerical matrix methods and Bayesian networks, and he has experience applying text mining to many Web-based sources, including Twitter, Yahoo and PubMed.

Praveen Lakkaraju is a Software Developer at SAS and is a member of the SAS Text Analytics research and development team. His areas of experience include sentiment analysis, information retrieval and content categorization. He was instrumental in the launch of the SAS Social Media Analytics solution, and is still actively involved in its development. Lakkaraju holds a master’s in computer science from the University of Kansas, where he specialized in the field of natural language processing.
Abstract

An important application of text analytics is to automatically characterize the sentiment of documents in a variety of domains, whether it is positive, negative or neither. In this paper we explore the benefits of combining domain-specific linguistic rules with data mining methods to improve both the effectiveness of your models and the efficiency of the model builder.

Introduction

Our world has changed drastically in the last 10 years. An individual’s opinions are no longer shared only with his or her immediate family and friends, but instead are capable of influencing the decisions of thousands or even millions of people the individual has never even met. The Internet has given the individual a platform to broadcast grievances and recommendations that can reach across the world. And the existence of social networks gives these opinions the potential to snowball into a viral frenzy that can make your company’s products or services a worldwide boon or a global catastrophe in just a matter of days.

The savvy marketer monitors and evaluates relevant Web content continually to understand consumer sentiment toward the company’s products or services, and toward those of its competitors. This attention to Web content allows the company to respond quickly to customer opinion.

The sheer volume of references related to your company’s products or services makes automating this task essential. Sources such as blogs, product reviews, forums and news articles can all be monitored, scored for relevance against your topics of interest, and then classified according to sentiment.

The Elements of Sentiment Analysis

What Is Sentiment Analysis?

Sentiment analysis is an automatic method that provides feedback to you regarding the opinions and attitudes of your customers. The analysis is based on customers’ electronic written commentaries regarding your products and services and those of your competitors. The feedback can be provided at a very high level with drill-down so that you can explore how opinions differ within groups, subgroups and even at the individual level.
More precisely, sentiment analysis is the process of classifying or rating the opinions or sentiment expressed in a document. The rating may assign the sentiment into one of three categories: positive, negative or neutral; or it may, instead, assign a numeric score. The rating that is assigned is termed polarity. The sentiment may be assessed for the entire document or for particular objects or attributes mentioned in the document.

When Is It Relevant?

Sentiment analysis is relevant in almost every context in which your customers or potential customers express themselves in written form (and possibly spoken form) via different communication channels. These comments may not have been intended for direct consumption by your company. They may have been posted in website forums, tweets, blogs or other Web pages and directed toward your potential customers. On the other hand, some content may have been intentionally directed at your company through e-mail, a company support website, a survey questionnaire, a call center desk, etc.

Automated sentiment analysis is important to implement when you are inundated with relevant, useful feedback through these channels. For many companies, it is impossible for individuals to monitor and understand all that is communicated in these sources due to their sheer volume. The information comes too quickly and from too many channels. Sentiment analysis provides you with an immediate interpretation, not just of every individual comment but also of the global opinions expressed.

Elements of Sentiment Analysis

You cannot implement a comprehensive sentiment analysis solution with a process that merely analyzes the sentiment of a document. Instead, you must coordinate several tasks to maximize the benefits.

1. Data acquisition phase. This phase involves setting up an automated process to obtain a clean set of documents to analyze. You can use SAS software to obtain the documents from the Internet and from local file systems or databases. SAS software can also be used to filter the documents by eliminating any “noise” that is common to Web documents (e.g., filtering spam).

2. Sentiment assignment phase. This phase involves creating a model that can calculate the polarity of the author’s sentiment or opinion toward your topics of interest and applying that model to naïve (not yet scored) documents. SAS technologies can help you derive accurate assessments of sentiment.

3. Summarization and reporting phase. Identifying sentiment within a particular document is interesting in itself, but frequently it will be of more interest to characterize representative populations within your collection. SAS provides techniques for such exploration, which entails answering questions such as:
   • Does the age of our customer tend to make a difference in his or her opinion about our service?
   • How do the cumulative opinions about our competitor’s product compare with the cumulative opinions about our product?
   • Did our customers perceive the changes we made to our outlet stores as beneficial, or not?

4. Repetition phase. The final step in your sentiment analysis project will be to set up a process to automate the entire analysis on a repeated basis. This allows you to monitor sentiment changes, identify important influencers and respond quickly to what you learn.

For this paper we will focus primarily on the sentiment assignment phase. Note that since text is written in natural language and not with a precise quantitative representation, there are many challenges to analyzing it effectively for sentiment. For one, natural language text is full of ambiguities, implicit meaning and subtle nuances. Normally a human reader has the necessary experience both to understand natural language expressions and to comprehend the meaning of the subject area along with the sentiment the author intended to communicate. But automating this process in a computer can be challenging. Such things as slang, pronoun resolution, sarcasm and idioms all make a direct interpretation of the text difficult.

Further, an automatic process will not function at the semantic level of the text at all unless there is a direct mapping of a linguistic rule to semantics. In many instances this can be captured with the rules we will discuss later; but the diversity of ways to express the same meaning can make it difficult to accurately capture all situations with a set of rules.

There are two primary approaches to building models for sentiment analysis. The first, natural language processing, uses a domain expert to build a set of linguistic rules to determine the sentiment polarity of the document’s content.
The second, machine learning, uses training data (documents that have the sentiment polarity already assigned to them) to build a predictive model. Predictive models such as decision trees, logistic regressions or neural networks will make this prediction on documents that are outside the training set.

Sentiment Analysis Methods

The Data

We will use two collections of movie review data to demonstrate the techniques presented in this paper. The first collection, created by Pang and Lee, contains 2,000 movie reviews. The collection is split evenly with 1,000 positive and 1,000 negative reviews.[1] The second collection was obtained by retrieving 6,631 movie reviews from Yahoo.[2] This collection has both overall ratings for the movie being discussed and also ratings for several attributes of each movie, including the story line, cast, direction and visuals.

Although your data is almost certainly not movie review data, the concepts and techniques demonstrated using this movie data are applicable to most other sentiment-related text data sets.

Data Mining Approach

A data mining approach to sentiment analysis translates an unstructured text problem to one that makes predictions on structured, quantitative data. The approach borrows several techniques from the computational linguistics and information retrieval communities to represent the text numerically, and then applies traditional data mining techniques to this numeric representation. In the end, a target variable is identified and a pattern is discovered from the training data for predicting sentiment polarity. This pattern can then be used to predict new observations.

The first step in creating the numeric representation is to convert the entire training collection into a document-by-term frequency matrix. Each document is parsed into individual terms, or term/part-of-speech pairs. Then the set of all terms becomes the variables in the data set, so that documents are now represented as vectors of length equal to the number of distinct terms in the collection. These vectors are very sparse, containing mostly zeroes, because any one document contains a very small percentage of the terms in the collection. Once the documents are represented as vectors, the frequencies in each cell can be weighted with a function that takes into account the distribution of the term across the collection and relative to the levels of the target variable.

After these document vectors are formed, a dimension reduction technique, such as the singular value decomposition (see Taming Text with the SVD, Albright, 2004), is typically used to represent each document in a reduced-dimensional space of maybe 50 to 100 variables, where each variable is a linear combination of the weighted terms that originally represented each document.

Finally, these reduced-dimensional vectors, together with the sentiment variable, can be supplied to a predictive model. The model will attempt to learn from the training data by utilizing patterns in the reduced-dimensional vector. This predictive model will then create a function that will predict the sentiment for any document.

[1] The Pang and Lee movie review data is available at: http://www.cs.cornell.edu/People/pabo/movie-review-data
[2] Yahoo movie reviews were obtained from: http://movies.yahoo.com
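The first steps of the data mining approach described in this section (parsing documents into terms, building the document-by-term frequency matrix, and weighting the cells) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the SAS Text Miner implementation: the function names are hypothetical, the tokenizer is deliberately naive, and a plain inverse-document-frequency weight stands in for the paper's weighting function, which also takes the target variable into account. The SVD and predictive modeling steps are left out.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Naive parser (an assumption): lowercase and keep alphabetic terms only."""
    return re.findall(r"[a-z]+", text.lower())


def document_term_matrix(docs):
    """Return (vocab, rows), where rows[i][j] is the count of vocab[j] in docs[i]."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    rows = [[c.get(term, 0) for term in vocab] for c in counts]
    return vocab, rows


def idf_weights(rows):
    """One weight per term: log(N / document frequency) + 1 (stand-in weighting)."""
    n_docs = len(rows)
    weights = []
    for j in range(len(rows[0])):
        doc_freq = sum(1 for row in rows if row[j] > 0)
        weights.append(math.log(n_docs / doc_freq) + 1.0)
    return weights


def apply_weights(rows, weights):
    """Scale every frequency by the weight of its term."""
    return [[freq * w for freq, w in zip(row, weights)] for row in rows]


# Hypothetical two-document collection, for illustration only.
docs = ["A great movie", "A dull movie"]
vocab, rows = document_term_matrix(docs)
weighted = apply_weights(rows, idf_weights(rows))
```

In a real collection the vocabulary runs to many thousands of terms, so a sparse representation would normally replace the dense lists used here.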
Benefits of the data mining approach

The data mining approach is appealing because it is based on learning patterns that are useful for making automated, efficient predictions. The algorithms are capable of discovering unimagined and complicated patterns that would be beyond what a human could anticipate. Frequently, a data mining approach can beat a rule-based approach in topic classification. Of course, this is dependent on having enough training data to build the model.

Drawback of the data mining approach

The vector-based representation of a document, which is required for data mining techniques, does not maintain information that is potentially important to sentiment classification. For example, the vector representation does not capture whether terms are close to one another in the document, whether one term precedes another, or any other contextual cues. The order of terms in a phrase can significantly affect meaning. Consider the phrases:

“… night for a great movie” and “… great night for a movie”

These two phrases convey two different meanings; yet in a vector representation, the phrases have an identical representation.

In addition, most predictive models provide little feedback to the user as to precisely why a particular document was classified as having positive or negative polarity. So when you attempt to understand what positive things people said in a particular document, you frequently have to read the entire document to discover the answer.

As a final drawback, forming the training and validation data is an essential component of learning a predictive model, but it can be very time-consuming and challenging. A rating needs to be provided for every document, and if there are attributes of documents that you wish to use to measure sentiment, you will need to provide a rating for each of these as well. Another complication is that two different reviewers frequently assign two different sentiment ratings to the same document. This can introduce unexpected errors in building and measuring the performance of your model.

Natural Language Processing Approach

Natural language processing (NLP) is a field of artificial intelligence that deals with automatically extracting meaning from natural language text. As discussed in the introduction of this paper, it’s very challenging to get machines to understand text at the same level as humans. Doing this with the specific goal of extracting sentiment is even more challenging. For example, consider the text snippet below:

“… with that out of the way, let me say this – this film is bad. This film is really, really bad. Yet somehow, it is strangely enjoyable. …”

If interpreted by a human, the above text would imply a positive sentiment from the author toward the movie. However, it can be very challenging to get the same output from a computer because of the dense presence of the strongly negative words.

The rule-based NLP methods use certain entities and syntactic patterns in the text to understand its meaning. SAS Sentiment Analysis provides all the tools needed for this kind of disambiguation. You can use a combination of language dictionaries, linguistic constructs like parts of speech, and noun phrases along with a range of operators.

The operators fall into a few different categories as shown below:
• Boolean operators. Used to include or exclude different entities (e.g., AND, OR, NOT).
• Frequency operators. Used to measure the specified number of occurrences of certain entities (e.g., MIN, MINOC, MAXOC).
• Context operators. Used to measure the context within which certain entities occur in the text (e.g., DIST, START, END, SENT, PARA).
• Sequence operators. Used to look for the entities in a specific sequence (e.g., ORD, ORDDIST).

The process of developing rule-based models for sentiment analysis involves a few different steps. These are explained below.

Step one: taxonomy identification

The initial step in the NLP approach is taxonomy identification. Taxonomy here refers to a simple, two-level hierarchy where you specify the different objects and attributes for which you want to extract sentiment. You can either use a predefined taxonomy or you can use text mining to learn the most prominent objects and their attributes in the corpus and then make them part of your taxonomy. Figure 1 shows the predefined taxonomy that we used for extracting sentiment from the movie review data. The discovery-based text mining methods are discussed later in this paper.
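As a rough illustration of the two-level hierarchy described in step one, a taxonomy can be held in a simple mapping from each object to its attributes. The sketch below is an assumption made for illustration: the attribute names are taken from the Yahoo rating categories mentioned earlier in this paper, not from the contents of Figure 1, and the helper function is hypothetical.

```python
# Hypothetical two-level taxonomy: each object maps to the attributes
# for which sentiment should be extracted (names are assumptions).
TAXONOMY = {
    "movie": ["story", "cast", "direction", "visuals"],
}


def attribute_paths(taxonomy):
    """Flatten the hierarchy into 'object/attribute' labels for reporting."""
    return [f"{obj}/{attr}" for obj, attrs in taxonomy.items() for attr in attrs]


paths = attribute_paths(TAXONOMY)
```

Keeping the taxonomy as data rather than code makes it straightforward to extend once text mining suggests additional objects or attributes.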
Figure 1: Taxonomy for movie reviews.

Step two: defining objects and attributes

The next step is to define the objects and their attributes. A basic approach to defining these is to identify their synonyms or the different ways they may be referred to in the text. Figure 2 shows an example.

Figure 2: Example of defining the visuals attribute.

While this approach captures many cases, in other situations the attribute might be referred to using its co-referent. Consider the example below:

“The movie starred Jennifer Aniston. The plot of the movie was very interesting. Aniston’s performance was commendable. She looks adorable.”

Here the full name of the actress was mentioned only in the first sentence. In the subsequent sentences, the actress was referred to using her last name and a pronoun. These three entities are said to be co-referent and the process of identifying them is called co-reference resolution. The rule-based methods allow you to write rules to handle such cases.

Step three: defining polarity

Polarity is determined by associating predefined positive or negative terms or expressions with the attributes that have been identified. Dictionaries of subjective expressions are available and can be customized to specific domains (see Figure 3).

Figure 3: Example of a generic dictionary of positive keywords.

You could also define multiple classes of subjective expressions to denote different levels of subjectivity.

“incredible,” “stunning” ➔ strong positive
“hate,” “disgust” ➔ strong negative

Assigning the appropriate polarity requires that negations are handled properly. To do this, you can use a combination of part-of-speech tags and dictionaries as shown in Figures 4 and 5.

Figure 4: Example of a class of negated adjectives.

In Figure 4, “NegClass” is a dictionary of expressions that denote a negation (for example, “not,” “will not,” “have not”), and “:Adv,” “:A” and “:V” represent any adverb, adjective and verb, respectively.

Figure 5: Example of a negation rule.

Finally, to extract the sentiment at attribute level, you can write context-based rules as shown in Figure 6, where we used a combination of operators.
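The ideas in step three (polarity dictionaries, a negation class, and a context rule that ties sentiment words to an attribute) can be approximated in plain Python. The sketch below is not SAS Sentiment Analysis rule syntax and does not reproduce Figures 4 to 6. The dictionaries, the `max_dist` window (a loose analogue of a distance operator) and the function names are all assumptions made for illustration.

```python
import re

# Hypothetical dictionaries; a real model would use far larger, domain-tuned lists.
POSITIVE = {"incredible", "stunning", "good", "great"}
NEGATIVE = {"hate", "disgust", "bad", "dull"}
NEGATORS = {"not", "never", "no"}           # stand-in for a negation class
VISUALS = {"visuals", "effects", "cinematography"}


def tokens(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def attribute_polarity(text, attribute_terms, max_dist=4):
    """Sum the polarity of sentiment words found within max_dist tokens of an
    attribute term, flipping the sign when a negator directly precedes the word."""
    toks = tokens(text)
    score = 0
    for i, tok in enumerate(toks):
        if tok not in attribute_terms:
            continue
        start = max(0, i - max_dist)
        stop = min(len(toks), i + max_dist + 1)
        for j in range(start, stop):
            polarity = (toks[j] in POSITIVE) - (toks[j] in NEGATIVE)
            if polarity == 0:
                continue
            if j > 0 and toks[j - 1] in NEGATORS:
                polarity = -polarity
            score += polarity
    return score


positive_example = attribute_polarity("The visuals were stunning", VISUALS)
negated_example = attribute_polarity("The effects were not good", VISUALS)
```

A positive return value suggests positive sentiment toward the attribute and a negative value the opposite; text that never mentions the attribute scores zero. Multiword negators such as “will not” and part-of-speech constraints are omitted to keep the sketch short.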
Figure 6: Example of an attribute-level sentiment rule.

Benefits of the NLP approach

The major advantage of rule-based methods is the amount of control they give rule developers over how the analysis will be performed. Developers can use their knowledge of the domain and the language within it to develop rules that have high precision.

Unlike statistical analysis, the results of rule-based analysis are easily interpretable. This is very important for real-life applications where the analysts need to know exactly why a document or an attribute within a document was tagged as positive or negative. In other words, analysts need to know exactly what sentences, keywords or context within the document triggered the positive or negative sentiment. Figure 7 shows an example of this.

I think they did a fantastic job this movie. I read the book, I loved the book, and I loved the movie! My only qualm was Javier bardem playing a Brazilian when he is SPANISH! Julia Roberts was perfect and beautfiul. Wonderful casting job (with the exception of Bardem)! Good acting. Some parters were a tad confusing for those who haven’t read the book. But I took my mom, who didn’t read the book, and she really liked it. <br/> <br/> It’s not just some sappy chick flick. It’s a powerful journey about finding yourself hen you let yourself GO!<br/> <br/> Empowering.<br/> Perfection. = EAT PRAY LOVE!<br/> Lovely

Figure 7: Example showing different entities that were used for rule-based analysis.

Rule-based methods are completely unsupervised; that is, they do not require any training data. This is a big advantage in real-life applications where training data is scarce. The non-availability of training data is more pronounced when it comes to granular sentiment analysis (sentiment derived at the objects and attributes level).
Another advantage of rule-based methods is that the rules can be refined over time based on feedback from analysts or subject-matter experts. The more time the rule developer spends on refining the rules, the better the results. Language evolves over time and people start using newer terms to express their sentiments. This is especially true for social media, where the language used changes all the time. In such cases, rule-based methods give you the flexibility needed to adjust your models accordingly.

Drawback of the NLP approach

The disadvantage of rule-based methods is that they require a lot of human involvement in developing the rules. These methods rely completely on the domain knowledge of rule developers. It might take a few weeks to come up with a strong rule-based model for a new domain. However, once you have a strong rule-based model for a domain, you can reuse that model with some minor modifications for different applications within the domain.

The importance of validation data is often underestimated while developing these models. The rules being written must be generic enough to handle all possible cases. Inexperienced rule developers tend to over-fit their rules to the sample data they are working with. Such rules might not work well when tested on different data sets. So, rule developers must make sure they validate the rules on different data sets before considering a model ready to deploy.

The Best of Both Worlds

As we discussed earlier, data mining learns relevant patterns from a numerical representation of the entire collection, and the patterns discovered are derived by analyzing the collection as a whole. The rule builder, on the other hand, relies only on personal experience and knowledge to formulate rules that will be useful for sentiment analysis.

Because they approach the problem so differently, data mining and rule-based systems can complement one another. They can do this in two ways. First, unsupervised data mining can be used as a tool for the rule builder; and second, the supervised data mining model can be combined with the rule-based model in such a way that the strengths of each model are combined, and any possible mistakes made by one model can be corrected by the other.

Data Mining of the Text for the Rule Builder

The challenge of the rule builder is to devise and formulate rules that capture the sentiment contained in the collection. To do this, the rule builder must have some understanding of the content of the documents that are being categorized. For instance, in our movie review collection, are all the reviews about a specific movie or are they about a specific genre of movies? If we know, we can save time by writing rules that are only directed to a particular movie or genre. On the other hand, if the reviews are about movies from many different genres, we must consider how that knowledge affects the rules we write. Otherwise, we might not capture the sentiment accurately.

For instance, when discussing a horror movie, the statement

“The scariest thing I have ever seen”

is typically an indicator that the reviewer enjoyed the movie. But it could be a negative indicator if the reviewer was discussing a children’s movie.

Unsupervised text mining allows you to quickly get a handle on the collection you are examining without spending time reading many individual documents. SAS Text Miner provides a node both for generating topics within a document and for clustering the documents. These approaches are useful for understanding the collection and for revealing significant aspects of the data. Table 1 shows that our collection is quite varied.

ID | Descriptive Terms | Freq. | Pct.
1 | + horror, + killer, + scary, + scream, horror, + reason, last, minutes | 155 | 8%
2 | + animation, adults, animated, disney, voice, children, kids, + feature | 73 | 4%
3 | coen, fargo, money, wife, different, pretty, sequences, guy | 37 | 2%
4 | + war, world, life, love, + sense, + fight, right, + father | 267 | 13%
5 | + comedy, jokes, + funny, funny, fun, script, back, cast | 213 | 11%
6 | earth, effects, special effects, special, star, + action, + people, interesting | 276 | 14%
7 | + action, + fight, sequences, bad, fun, guy, special effects, acting | 177 | 9%
8 | + comedy, mother, + father, woman, funny, love, + family, high | 400 | 20%
9 | performances, mother, performance, love, down, + point, last, different | 117 | 6%
10 | + thriller, case, + action, + killer, wife, + job, performance, script | 285 | 14%

Table 1: Ten clusters from the Pang and Lee data.

The clusters reveal several prominent categories of movies, reminding rule builders that they need to consider how people express sentiment in the following types of movies:
• Horror movies.
• Animation and children’s movies.
• Comedies.
• Science fiction movies.
• Action movies.
• Thrillers.

If you, as the rule builder, had not been thinking of how people express their opinions about movies from these different categories, it could be easy to incorrectly capture the sentiment contained in them.

Further discovery can be done to capture the sentiment of individual attributes within the document. For instance, since the SAS Text Miner filter node allows you to subset documents that contain the visual attribute synonyms displayed in Figure 2, you can subset the collection accordingly. In Figure 8, the search expression has been set to include only those documents that contain at least one of the visual attribute synonyms used in the rule building. The special character “*” implies a wildcard search is to occur, and the quoted input means that only the exact phrase, “special effects,” should match. The filter node can be followed with a clustering or topic node, and then any analysis of this subsetted collection provides you with some potential new ideas for rules.

Figure 8: A search expression to retrieve documents concerned with the visual sentiment attribute.

This particular subsetted collection revealed discussions around costumes and costume designs, as well as the reviewer’s reaction to the theater setting. Neither of these was an aspect of visual sentiment that we had considered prior to discovering these topics.

At an even finer level, the reports of important terms and phrases (particularly in relation to one another in the concept-linking diagram) provide sentence-level ideas for your rule generation. The diagram in Figure 9 was made in the process of exploring reviewers’ comments on their theater experience. The diagram suggests that the sentiment regarding the music or sound in the movie might be another attribute that could be added to the taxonomy and examined.
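The document subsetting step described earlier in this section (keeping only documents that match at least one attribute synonym, with “*” as a wildcard and quotation marks for an exact phrase) can be imitated with the Python standard library. This is a hedged sketch of the idea only: it does not reproduce the SAS Text Miner filter node or the actual search expression in Figure 8, and the search terms and function names are assumptions.

```python
import fnmatch
import re


def matches(document, search_terms):
    """Return True if the document satisfies at least one search term.

    A term wrapped in double quotes must appear as an exact phrase; any other
    term is matched word by word, with '*' acting as a wildcard.
    """
    text = document.lower()
    words = re.findall(r"[a-z]+", text)
    for term in search_terms:
        term = term.lower()
        if term.startswith('"') and term.endswith('"'):
            phrase = term.strip('"')
            if re.search(r"\b" + re.escape(phrase) + r"\b", text):
                return True
        elif any(fnmatch.fnmatchcase(word, term) for word in words):
            return True
    return False


def subset(documents, search_terms):
    """Keep only the documents that match the search expression."""
    return [doc for doc in documents if matches(doc, search_terms)]


# Hypothetical search expression and reviews, for illustration only.
terms = ["visual*", '"special effects"']
reviews = ["The visuals are lush", "Great special effects", "A dull plot"]
visual_reviews = subset(reviews, terms)
```

The resulting subset could then be passed to a clustering or topic step to look for rule ideas, as the text describes.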
    • COMBINING KNOWLEDGE AND DATA MINING TO UNDERSTAND SENTIMENTFigure 9: A concept link diagram of “music” and “loud.”Hybrid Approaches ■ Hybrid approaches involve usingHybrid approaches involve using a rule-based approach and a data mining approach a rule-based approach and a datain combination. In the next sections we will describe two alternative methods. The mining approach in combination.first method can be used to supplement the features from the traditional data miningmodel by adding features derived from the linguistic rules that are triggered. Thesecond method shows how to use an ensemble of the results of the two distinctapproaches to improve the prediction.Polarity scores as additional featuresOne advantage of SAS Text Miner is that it allows additional features associated withthe document to be combined with the term features or with the SVD dimensionsbefore training the predictive model. Polarity scores are simply a summary scorebased on a function of the number of times the positive and the negative rules triggerin a document, or in an attribute of a document. These values can be obtained fromSAS Sentiment Analysis.14
Once obtained, the logistic function can be applied to the ratio of the weighted positive and negative counts so that a document’s polarity score will be between 0 and 1, inclusive. A document with more positive sentiment weight will be assigned a score closer to 1, and a document with more negative sentiment weight will be assigned a score closer to 0. This score is then used in combination with the SVD dimensions. When the document has several attributes that receive a polarity score, each of these scores can be added as a feature to the text mining model. The hybrid model within SAS Sentiment Analysis software also makes use of this approach.

Stacked models

Another hybrid approach is to stack the models. This means that the rule-based and the data mining models are run separately in the first stage, but a second predictive model is “stacked” after these two models so that the output of the two (a predictive probability for each document from each model) becomes the input into a second-stage model.

Stacking is an ensemble method that can improve accuracy if the two first-stage models differ in their predictions. Stacking allows the two models to potentially correct one another where they differ.

In Figure 10, SAS Text Miner is used to build one sentiment model, while the model import node brings in a model from SAS Sentiment Analysis. The output of the two models is prepared with SAS code, and then goes into the second-stage regression for a final prediction.

Figure 10: Stacking models.
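The stacking idea can be illustrated with a small second-stage logistic regression that takes the two first-stage probabilities as its only inputs. This is a sketch under stated assumptions rather than the SAS Enterprise Miner flow of Figure 10: the first-stage probabilities and labels are made up, the second stage is fitted with plain batch gradient descent, and the function names and hyperparameters are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_second_stage(first_stage_probs, labels, lr=0.5, epochs=2000):
    """Fit a logistic regression on first-stage probabilities.

    first_stage_probs: list of (p_text_mining, p_rules) pairs, one per document.
    labels: list of 0/1 sentiment labels (1 = positive).
    Returns (bias, w_text_mining, w_rules), fitted by batch gradient descent.
    """
    bias, w1, w2 = 0.0, 0.0, 0.0
    n = len(labels)
    for _ in range(epochs):
        g0 = g1 = g2 = 0.0
        for (p1, p2), y in zip(first_stage_probs, labels):
            err = sigmoid(bias + w1 * p1 + w2 * p2) - y
            g0 += err
            g1 += err * p1
            g2 += err * p2
        bias -= lr * g0 / n
        w1 -= lr * g1 / n
        w2 -= lr * g2 / n
    return bias, w1, w2

def predict(model, p_text_mining, p_rules):
    """Second-stage probability that a document is positive."""
    bias, w1, w2 = model
    return sigmoid(bias + w1 * p_text_mining + w2 * p_rules)

# Hypothetical first-stage outputs: (text mining model, rule-based model).
train_probs = [(0.9, 0.7), (0.8, 0.4), (0.6, 0.9), (0.7, 0.8),
               (0.2, 0.3), (0.4, 0.1), (0.3, 0.6), (0.1, 0.2)]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]

model = fit_second_stage(train_probs, train_labels)
# A document on which the two first-stage models disagree.
print(predict(model, 0.8, 0.3))
```

In practice the second-stage model should be fitted on documents that were not used to train the first-stage models, so that it learns from honest first-stage predictions.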
Results

We experimented with the sentiment analysis approaches presented in this paper using the movie review data sets. The Yahoo movie data set was used to analyze sentiment at the attribute level, and the Pang and Lee data set was used for the overall sentiment predictions.

Attribute-Level Results

Table 2 shows the results for the attribute-level sentiment analysis on the Yahoo movie data. The Yahoo data had explicit user ratings for the different attributes, and we compared those ratings with the predictions made by the rule-based model developed with SAS Sentiment Analysis. We spent three days on the rule-development process. The Yahoo data included some reviews where a user rating was available for a particular attribute, but the attribute itself was not discussed in the text of the review. We did not include such reviews in the evaluation of the attribute. We also did not include the general attribute because no user ratings were available for it. A user rating of C+ or higher was considered positive, and C- or lower was considered negative.

             Num Reviews   Misclass Rate
Story            972            .23
Cast            1272            .14
Direction        243            .17
Visuals          459            .12
Aggregate       2946            .18

Table 2: Attribute-level results.

With just three days of effort on rule development, we were able to achieve an overall accuracy of 82 percent at the attribute level (an aggregate misclassification rate of .18). The misclassification rate for the story attribute was higher than for the other attributes. That is an indication to the rule developer to further refine the rules for that attribute. Rule refinement is an ongoing process, and accuracy can improve over time.

Overall Results

Table 3 shows the results of our comparisons of the Pang and Lee data. For the data mining approach, 1,800 random movie reviews were used for training a model,
and 200 reviews were held out to be scored. This process was repeated four times, and the misclassification scores were averaged. For each run, the same set of 200 reviews was analyzed in SAS Sentiment Analysis so that the comparisons were made on the same set of data.

    Approach                                             Misclass Rate
1   SAS Text Miner                                           .144
2   SAS Sentiment Analysis Attribute-Level Rules             .252
3   Add Polarity Scores as Features in SAS Text Miner        .132
4   Blended                                                  .139

Table 3: Overall sentiment misclassification results.

The results obtained with the text mining model were achieved by using a category-specific weighting and by having enough training data. The SAS Sentiment Analysis overall sentiment model was derived from the rules for the individual attributes. Under these conditions, the rule-based model did not perform as well as the SAS Text Miner model. However, combining the models, either by using the polarity scores as features in the SAS Text Miner model or by blending the two models, did improve results.

Other Applications

Importing Models

SAS Sentiment Analysis can build a hybrid model using rules combined with a Naïve Bayes algorithm. However, to leverage all the predictive analysis advantages of SAS® Enterprise Miner™ software, the models from SAS Sentiment Analysis must be imported into SAS Enterprise Miner. This can be done easily by using the SAS Enterprise Miner model import node. Once the output of SAS Sentiment Analysis is imported, models can be combined in various ways and then compared with the model assessment node. Figure 11 shows the receiver operating characteristic (ROC) plot from the model assessment node after a SAS Sentiment Analysis model was imported.
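For readers who want to reproduce this kind of comparison outside the model assessment node, the sketch below computes two common summaries from a model's predicted probabilities: the misclassification rate at a 0.5 cutoff and the area under the ROC curve. It is a generic illustration, not the node's implementation; the scores, labels and function names are hypothetical, and the paper itself reports only misclassification rates and the ROC plot.

```python
def misclassification_rate(scores, labels, cutoff=0.5):
    """Fraction of documents whose thresholded score disagrees with the label."""
    errors = sum((score >= cutoff) != (label == 1)
                 for score, label in zip(scores, labels))
    return errors / len(labels)

def auc(scores, labels):
    """Area under the ROC curve, computed by pairwise comparison.

    Equals the probability that a randomly chosen positive document is
    scored higher than a randomly chosen negative one (ties count as half).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical held-out labels and predicted probabilities from two models.
labels = [1, 1, 1, 0, 0, 0]
text_mining_scores = [0.90, 0.80, 0.40, 0.60, 0.30, 0.20]
rule_based_scores = [0.70, 0.45, 0.60, 0.65, 0.55, 0.10]

for name, scores in [("text mining", text_mining_scores),
                     ("rule-based", rule_based_scores)]:
    print(name, misclassification_rate(scores, labels), auc(scores, labels))
```

A model whose curve lies closer to the upper-left corner of the ROC chart has a larger area under the curve.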
Figure 11: ROC chart of SAS Enterprise Miner models with an imported SAS Sentiment Analysis model (denoted by model import). In this graph, “TM” denotes SAS Text Miner and “RuleIn” refers to using SAS Sentiment Analysis rules in conjunction with SAS Text Miner.

Creating Training Data

As discussed earlier, training data that has the “answers” is an essential part of a text mining approach. It is necessary to build a predictive model that can make accurate sentiment predictions. It is also important for a rule-based system because it validates how your rules are doing. The feedback lets you know if you need to add or remove specific rules, or if you must refine certain rules. Unfortunately, training data is not always available, and creating this data can be an expensive time commitment.

One approach to creating training data is to use very precise rules that will make a sentiment classification only on the documents you are most sure about. At the risk of not assigning a sentiment category to many of the documents, you do assign sentiment to a small subset of documents.
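The idea of seeding a training set with a handful of very precise rules can be sketched as follows. This is an illustration only, not SAS Sentiment Analysis rule syntax: the two positive phrases are the examples the paper gives, while the negative phrases, the sample reviews and the function names are hypothetical.

```python
import re

# Positive phrases are the paper's examples; negative phrases are hypothetical.
POSITIVE_RULES = [re.compile(p, re.IGNORECASE) for p in (
    r"\bI thoroughly enjoyed this movie\b",
    r"\bI totally loved the film\b",
)]
NEGATIVE_RULES = [re.compile(p, re.IGNORECASE) for p in (
    r"\bI thoroughly disliked this movie\b",
    r"\bI totally hated the film\b",
)]

def seed_label(text):
    """Return 'positive' or 'negative' only when exactly one polarity fires.

    Documents that trigger no rule, or rules of both polarities, get None
    and are left out of the seed training set.
    """
    pos = any(rule.search(text) for rule in POSITIVE_RULES)
    neg = any(rule.search(text) for rule in NEGATIVE_RULES)
    if pos and not neg:
        return "positive"
    if neg and not pos:
        return "negative"
    return None

def build_seed_set(documents):
    """Collect (document, label) pairs for the documents the rules are sure about."""
    seed = []
    for doc in documents:
        label = seed_label(doc)
        if label is not None:
            seed.append((doc, label))
    return seed

# Hypothetical reviews.
reviews = [
    "I thoroughly enjoyed this movie, start to finish.",
    "The pacing was uneven but the cast was strong.",
    "Honestly, I totally hated the film.",
    "A long, loud sequel.",
]

seed = build_seed_set(reviews)
print(len(seed), "of", len(reviews), "documents labeled")
```

Because such rules fire on only a small share of a collection, the labeled subset is small, and a manual check of the triggered documents remains a sensible safeguard.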
We applied this approach to the movie review data by choosing rules that captured complete phrases that seemed, in our opinion, to indicate the overall sentiment. For instance, we included a set of rules that would trigger a positive score for a review that contained phrases like:

“I thoroughly enjoyed this movie.” or “I totally loved the film.”

When these types of phrases occurred in the document, the polarity was rated positive. Similarly, corresponding precise rules were added for negative polarity.

When we applied this approach to our movie review collection, 103 of the 2,000 documents triggered our rules. (While 103 documents is too small for an effective set of training data, with a larger pool of 20,000 reviews we would likely have obtained roughly 1,000 documents in the training set.) We still confirmed the polarity by reviewing each of the 103 documents. Since SAS Sentiment Analysis highlights the rules in context, it was quick work to check the 103 documents to ensure that each was an appropriate trigger. Based on our manual review, it appeared that eight of the 103 documents were incorrect, so we corrected the polarity for those so that our training data would be free of errors.

Other Capabilities of SAS® Enterprise Miner™

This paper has primarily focused on combining the rule-based capabilities of SAS Sentiment Analysis with the text mining capabilities of SAS Text Miner, in conjunction with the predictive models available in SAS Enterprise Miner. There is much more functionality in SAS Enterprise Miner that can be used to help you understand the sentiment contained in a collection and to build on the rule models you have developed.
Such functionality as sequences and associations, decision trees, SOM-Kohonen self-organizing maps, variable clustering, transformations and sampling, and statistical exploration have all been used in various contexts to supplement textual understanding.

Conclusions

Independently, both the domain knowledge and the data mining approaches to sentiment analysis have their strengths and weaknesses; but hopefully you will not be forced to choose between using one or the other for your analysis. In this paper, we have shown that the two approaches complement one another. So, while the NLP approach leverages the rule builder’s domain knowledge, text mining can also be used by that person to improve, clarify or correct how that knowledge relates to the particular collection being analyzed. Text mining reveals important patterns in the specific collection that assist the rule builder.
On the other hand, the text mining approach allows you to quickly build a sentiment classifier with term frequencies alone. But without any semantic or syntactic indicators, mistakes that would seem elementary to a human can easily occur. We have shown that these linguistic indicators can be captured by a rule-based system and then leveraged in the statistical classifier as additional features, or as a blended model. The end result is a model that is better than either one individually.

References

1 Albright, Russ. Taming Text with the SVD. January 2004. SAS: Cary, NC. Web: http://ftp.sas.com/techsup/download/EMiner/TamingTextwiththeSVD.pdf.

2 Pang et al. “Thumbs Up? Sentiment Classification Using Machine Learning Techniques.” Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). 2002. 79-86.

The authors thank James Cox and Janardhana Punuru from the SAS Text Analytics Research and Development team for their helpful comments and suggestions. They also thank Fiona McNeill from SAS Marketing for encouraging them to work on this paper and providing valuable feedback.
SAS Institute Inc. World Headquarters   +1 919 677 8000
To contact your local SAS office, please visit: www.sas.com/offices
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies. Copyright © 2011, SAS Institute Inc. All rights reserved. 105008_S59083.0211