Twitter Text Mining Using SAS


Here we present a simple yet effective way of text mining Twitter using Excel VBA and SAS.

Published in: Business, Technology

  1. Text Mining Using SAS
  2. Text mining of Twitter data could provide unprecedented utility for businesses, political groups and curious Internet users alike

     Sections: Introduction, Extraction, Processing, Analyzing, Reporting

     Introduction
       • Twitter is a "micro-blogging" social networking website that has a large and rapidly growing user base.
       • The website provides a rich bank of data in the form of "tweets", which are short status updates and musings from Twitter's users that must be written in 140 characters or less.
       • As an increasingly popular platform for conveying opinions and thoughts, Twitter is a natural place to mine for potentially interesting trends regarding prominent topics in the news or popular culture.

     Problem Statement
       • How can one extract the rich text information available on Twitter, and how can it be used to draw meaningful insights?

     Approach
       • To achieve this, we first need to build an accurate sentiment analyzer for tweets, which is what this solution aims to do.
       • For a user-generated status update (which cannot exceed 140 characters), our classification model determines whether the given tweet reflects a positive or a negative opinion on the user's behalf.
       • For instance, the tweet "Im in forida with Jesse! i love vacations!" would be positive, whereas the tweet "Setting up an apartment is lame." would be negative.
       • Based on this, we can translate text (tweets) into numbers for the business.

     Impact
       • Businesses will be able to better understand the image of their brand.
       • Manufacturers can learn which features of a product are, according to users, not up to the mark and start working on improvements.
       • A political lobbyist can gauge the popular opinion of a politician by calculating the sentiment of all tweets containing the politician's name.
       • Businesses can also gauge the performance of their competitors.
  3. A combination of the Twitter API, MS Excel and SAS can be used to extract information from Twitter and create an input dataset for the analysis

     Fetch Data, then Export Data
       • Fetch Data: search for the keyword on websites like …, etc.
       • Export Data: export the data to Excel.
       • Advantages: returns the latest 10K tweets for a particular keyword; metrics like time, date, location, etc. can also be retrieved.
       • Disadvantages: manual process, difficult to automate.

     Run Macros, Access API, Export Data, Import Data into SAS
       • Run Macros: Excel VBA macros will access data from the APIs.
       • Access API: data will be fetched from the APIs and exported to CSV files.
       • Export Data: this data will be imported into SAS.
       • Import Data into SAS: the appended SAS datasets will form a master dataset.

     Data extraction from Twitter can be accomplished by following the process below.
       • Step 1: The entire process can be fully automated by scheduling the run of the VBA macros (the process can also be initiated through SAS macros). We can schedule the process to run periodically so that data is retrieved on a regular basis.
       • Step 2: Run an Excel macro to access the Twitter data. This macro creates a URL based on the user's input. The Excel file accesses the Twitter API through XML and fetches the data into one of its sheets. This data is then exported to a CSV file.
       • Step 3: The exported data is imported into SAS using SAS macros and appended to a master dataset.
       • Step 4: We now have a dataset that can be used for further processing and analysis.
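
The append step (Step 3) can be sketched outside SAS as follows. This is only an illustrative Python sketch, not the deck's SAS macro: the column names (`tweet_id`, `created_at`, `text`), the sample rows, and the rule of skipping tweet ids already present in the master dataset are all assumptions made for the example.

```python
import csv
import io

def append_to_master(master_rows, csv_text, key="tweet_id"):
    """Append rows from one CSV export to the master list.

    Rows whose key is already present are skipped (an assumed rule,
    so that overlapping scheduled runs do not create duplicates).
    """
    seen = {row[key] for row in master_rows}
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row[key] not in seen:
            master_rows.append(row)
            seen.add(row[key])
    return master_rows

# Two hypothetical exports from successive scheduled runs; the second overlaps the first.
run1 = (
    "tweet_id,created_at,text\n"
    "1,day1,I bought an ipad\n"
    "2,day1,Setting up an apartment is lame.\n"
)
run2 = (
    "tweet_id,created_at,text\n"
    "2,day1,Setting up an apartment is lame.\n"
    "3,day2,i love vacations!\n"
)

master = []
append_to_master(master, run1)
append_to_master(master, run2)
print([row["tweet_id"] for row in master])
```

Keying on a tweet identifier keeps the master dataset stable when a periodic run fetches tweets that an earlier run already stored.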
  4. Retrieved tweets will go through a series of scrubbing steps; these will simplify extraction of information from the tweets

     Example tweet: I bought an ipad…. It has a good touch screen..Luv it :)

       • Language filter: Tweets written in languages other than English will be filtered out. This filter is applied in the Excel macro itself; the API filters by language based on the URL.
       • Removal of special characters: Characters like !:@#$%^&*)(.,;" etc. are removed. The list of all special characters is kept in Excel, and the "Find and Replace" functionality replaces each of them with a blank.
         Ex: I bought an ipad It has a good touch screen luv it
       • Pronoun resolution: Replaces the pronouns in the tweet with the respective nouns.
         Ex: I bought an ipad. ipad has a good touch screen luv ipad
       • Spell check: A custom dictionary can be created in Excel. We can add the service of an online dictionary provider (Twittionary [1]) by changing the research options. This corrects wrongly spelled words in Excel.
         Ex: I bought an ipad ipad has a good touch screen love ipad
       • Synonym thesaurus: Replaces all words that share a meaning with a single word of that meaning. The built-in Excel thesaurus can be used for this.
         Ex: I bought an ipad ipad has a good touch screen love ipad (no change)
       • Removal of noise words: Words like a, an, is, the, etc. are removed.
         Ex: I bought ipad ipad good touch screen love ipad
       • Part-of-speech tagging: Markov model for POS tagging [2].
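
Three of the scrubbing steps above (special-character removal, spell check, noise-word removal) can be sketched in a few lines. This is a minimal Python illustration rather than the Excel workflow the slide describes; the `SPELLING` mapping is a hypothetical stand-in for the custom dictionary, the noise-word set contains only the four words named on the slide, and pronoun resolution, the thesaurus, and POS tagging are deliberately left out.

```python
import re

# Noise words named on the slide; a real list would be longer.
NOISE_WORDS = {"a", "an", "is", "the"}

# Hypothetical stand-in for the custom spelling dictionary.
SPELLING = {"luv": "love"}

def scrub(tweet):
    """Apply special-character removal, spell check, and noise-word removal."""
    # Replace every character that is not a letter, digit, or whitespace with a blank.
    text = re.sub(r"[^A-Za-z0-9\s]", " ", tweet)
    words = text.split()
    # Correct known misspellings (looked up case-insensitively).
    words = [SPELLING.get(word.lower(), word) for word in words]
    # Drop noise words.
    return " ".join(word for word in words if word.lower() not in NOISE_WORDS)

print(scrub("I bought an ipad…. It has a good touch screen..Luv it :)"))
```

Because pronoun resolution is skipped here, the result keeps "It" and "it" instead of "ipad", so it will not match the slide's final example word for word.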
  5. A logistic regression model can be developed on a sample of the data and then used to classify the sentiment of each tweet

       • Manual classification is done on a sample of tweets; the classes could be, say, positive or negative (opinion lexicons can also be used for classification [1]).
       • The manually classified tweets are divided into two parts: 80% of the data forms the model sample and 20% forms the validation sample (both samples should contain roughly equal numbers of positive and negative tweets).
       • A logistic regression is used to develop a model with the classification as the dependent variable and binary variables for the words as independent variables. The dependent variable is 0 for negative feedback and 1 for positive feedback [2].
       • The accuracy of the model is tested against the validation sample: the model equation obtained from the logistic regression is used to classify the tweets in the validation sample, and the results are compared with the manually assigned classification. If the accuracy is too low, the logistic regression should be developed again on a different set of tweets.
       • The validated model can then be used to classify any number of tweets into two groups: positive and negative.
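
The modelling steps above can be illustrated with a small, self-contained sketch. It is written in plain Python rather than SAS, fits the logistic regression with simple stochastic gradient descent, and uses ten made-up labeled tweets and a four-word vocabulary; the data, the vocabulary, the learning rate, and the number of epochs are all assumptions chosen only to show the 80/20 split, the binary word variables, and the accuracy check.

```python
import math

# Hypothetical vocabulary: one binary variable per word.
VOCAB = ["love", "good", "lame", "bad"]

# Hypothetical manually classified tweets (1 = positive, 0 = negative).
# The first 8 form the model sample, the last 2 the validation sample.
LABELED = [
    ("love ipad", 1), ("good touch screen", 1),
    ("love good graphics", 1), ("love vacations", 1),
    ("apartment lame", 0), ("bad speed", 0),
    ("lame bad screen", 0), ("bad graphics", 0),
    ("good ipad love", 1), ("lame speed", 0),
]

def featurize(tweet):
    """Return 1.0 for each vocabulary word present in the tweet, else 0.0."""
    words = set(tweet.lower().split())
    return [1.0 if word in words else 0.0 for word in VOCAB]

def probability(weights, bias, x):
    """Logistic model: P(positive) = 1 / (1 + exp(-(bias + w . x)))."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

def train(samples, epochs=500, learning_rate=0.5):
    """Fit weights and bias by stochastic gradient descent on the log loss."""
    weights, bias = [0.0] * len(VOCAB), 0.0
    for _ in range(epochs):
        for tweet, label in samples:
            x = featurize(tweet)
            error = probability(weights, bias, x) - label
            weights = [w - learning_rate * error * xi for w, xi in zip(weights, x)]
            bias -= learning_rate * error
    return weights, bias

split = int(0.8 * len(LABELED))            # 80% model sample, 20% validation sample
model_sample, validation_sample = LABELED[:split], LABELED[split:]

weights, bias = train(model_sample)
predictions = [
    1 if probability(weights, bias, featurize(tweet)) >= 0.5 else 0
    for tweet, _ in validation_sample
]
correct = sum(p == label for p, (_, label) in zip(predictions, validation_sample))
accuracy = correct / len(validation_sample)
print("validation accuracy:", accuracy)
```

With real data the vocabulary would be derived from the scrubbed tweets and the validation sample would be far larger; a two-tweet validation set is used here only to keep the example short.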
  6. Text can be converted to numbers in the form of different metrics and reports to better understand the sentiments of users

     Sentiments variation over time
       • Positive, negative and neutral feedback can be analyzed over time.
       • [Chart: number of tweets over time; series: Positive, Negative, Neutral]
       • Metrics: The variation in the number of tweets can be seen over time.
       • Insights: Trends in the graph give an idea of the popularity of the subject over time.

     Heat map showing sentiments
       • Positive, negative and neutral feedback can be represented by different colors.
       • Metrics: % of positive and negative tweets on the subject, by feature and by geography.
       • Insights: This highlights the cultural and regional acceptance of the product.

     Sentiments by features
       • [Chart: features Touch Screen, Speed, Graphics; series: Positive, Negative, Neutral]
       • Metrics: Positive, negative and neutral feedback for individual features can be shown.
       • Insights: This indicates which features need improvement.

     Effectiveness of a marketing activity
       • A business can gauge the effectiveness of a recent marketing campaign by aggregating user opinion on Twitter regarding its product.

     Mixed reactions report
       • The order of positive and negative feedback from members who have given mixed feedback. This order indicates how the reaction changes over usage time.

     Popularity report
       • A political lobbyist can gauge the popular opinion of a politician by calculating the sentiment of all tweets containing the politician's name.
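
The "sentiments by features" metric above amounts to a percentage breakdown per feature. A minimal Python sketch follows, assuming each classified tweet has already been reduced to a hypothetical (feature, sentiment) pair; the sample records are invented for illustration and are not results from the deck.

```python
from collections import Counter, defaultdict

SENTIMENTS = ("positive", "negative", "neutral")

def sentiment_share_by_feature(records):
    """Return, per feature, the percentage of tweets in each sentiment class."""
    counts = defaultdict(Counter)
    for feature, sentiment in records:
        counts[feature][sentiment] += 1
    report = {}
    for feature, counter in counts.items():
        total = sum(counter.values())
        report[feature] = {
            sentiment: round(100.0 * counter[sentiment] / total, 1)
            for sentiment in SENTIMENTS
        }
    return report

# Hypothetical classified tweets, already tagged with the feature they mention.
records = [
    ("Touch Screen", "positive"), ("Touch Screen", "positive"),
    ("Touch Screen", "positive"), ("Touch Screen", "negative"),
    ("Speed", "positive"), ("Speed", "negative"),
    ("Speed", "neutral"), ("Speed", "neutral"),
]

report = sentiment_share_by_feature(records)
for feature, shares in report.items():
    print(feature, shares)
```

The same grouping, keyed on a date or a region instead of a feature, would give the inputs for the time-variation and heat-map reports.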