Twitter analysis

Published in: Data & Analytics
Performing sentiment analysis on Twitter data
(2011 Norway attacks)

Team:
Aparna Dhanashri Jayaprakash – 50094768
Himanshu Yadav – 50093151
Inder Puneet Singh – 50094241
Sabah Abdul Mannan Khan – 50094894
Vidya Mulukutla – 50095830

Analysis of Twitter Data Set

Introduction

Big Data is increasingly pertinent in today’s digitalized world and is used in many different domains. With social media so pervasive, it makes sense to use it to generate data sets for analysis in areas ranging from politics to entertainment. We chose Twitter as our data source because it has a wide user base that includes ordinary people as well as prominent figures from media, movies, sports and politics. Many analytical results can be derived from a platform as widely used as Twitter, and we processed the data it generated with an implementation built on Apache Hadoop and Hive.

To gauge how different users reacted to significant events in July 2011, we performed a sentiment analysis. Sentiment analysis, also known as opinion mining, is the process of extracting subjective information from text through natural language processing, computational linguistics and text analysis. Two important and completely contrasting events took place in July 2011, and we prepared a comparison analysis of them. The events are described below.

The Norway attacks of 2011 were the deadliest attacks on the country. Two sequential attacks took place within a span of two hours on 22nd July 2011. The first was a car bomb explosion at the executive government headquarters that killed eight people and injured around 209. The second was a shooting on an island, at a summer camp organized by the youth division of the ruling party. A gunman gained access to the camp and opened fire on the participants. This attack claimed 69 lives and injured 110 people. The accused in the case, Anders Behring Breivik, was sentenced to 21 years in prison.

Amy Winehouse was a hugely popular British singer and songwriter. Her work was critically as well as commercially acclaimed, and she won multiple Grammy Awards for her songs. Her sudden death from alcohol poisoning on 23rd July 2011 shocked millions of her fans worldwide and sent online networking sites into a frenzy.

Hypothesis

We set out to evaluate how users from different geographical locations reacted to both stories on Twitter. We assumed that the Norway attacks would affect the public more than the death of Amy Winehouse and would garner more tweets, hashtags and retweets, since an attack in which many lives were lost and even more people were injured is the graver event. We compared the two events using sentiment analysis.

Technology

For our implementation, we used Apache Hadoop deployed on Amazon EC2 to process the data. The Hadoop master ran on an m1.large instance, and the Hadoop slaves ran on m1.small instances. We chose the M1 general-purpose instance types primarily for their very low cost; they offer moderate CPU performance. Apache Hive was used to analyze, summarize and query the data using a SQL-like language known as HiveQL.

Data Preparation

Data Selection

The extracted data was segregated into different tables for convenience of analysis. One of the tables from the Norway attacks event is shown below -

Hashtag            Count
Oslo               466
Norway             396
tcot               308
oslo               244
p2                 234
SAVEAMERICANOW     214
news               124
blamethemuslims    111
norway             110
breakingnews       93
isles              93
fb                 90
islanders          88
cnn                82
Utoya              74
teaparty           61
osloexpl           55
News               55
prayfornorway      55
tlot               36
Breivik            34
socialmedia        34
politics           32
NFL                32
utoya              27
PrayForNorway      27
Utøya              27
CNN                26
Islam              24
oslobomb           24

Data Cleaning:

Contrary to our expectation that the data set would be limited to one specific time period of, say, one year, the information in the data set spanned many years, so no single time period held a high concentration of data. This meant we first had to find events that occurred within a specific time period. Also, because the data set is acquired from a varied number of sources, it often contains a lot of redundant data, so duplicates must be deleted before any analysis can be conducted. Because we were dealing with huge data sets, we partitioned the data to make the analysis easier and to improve query performance.

Another important aspect of data cleaning is geotagging locations. The same address can be written in various forms. For example, Bangalore, Bangalore Karnataka and Bangalore Karnataka India are all different ways to write the same location. To perform an accurate analysis, the location needs to be normalized and converted into the same
format. The technique we used for this is Google’s Geocoding API. This API provides a straightforward way to convert an address into coordinates (latitude and longitude) that can be used for map positioning.

Challenges faced during Implementation:

Some of the hindrances that we encountered with the extracted data are:
• Duplicate files: The extracted data contained a huge number of repeated files with the same content. Files with unique content have to be filtered out through additional processing, which is very time consuming.
• Parsing data: Parsing is difficult and can fail for various reasons, for example when the Twitter data contains many languages, or when a JSON structure is closed incorrectly, which prevents the data beyond that point from being read.
• Complete data not recovered: The complete data was not always recovered when extracting through Apache Hive. As we were dealing with huge data sets, a lot of extra programming and debugging was required to repair the situation. Parsing exceptions were raised, which we patched by locating the erroneous files.

Analysis

After the data selection and data cleaning process, different tables were selected that were representative of various aspects of the analysis of the two events, Amy Winehouse and the Norway attacks: a comparison analysis for the two events along with a sentiment
analysis for each of the two events. The following aspects help us proceed with an analysis of the events at hand: Data Distribution, Hashtags count table, URLs count table, Tweet sentiment, and Famous tweeters.

Event 1: Amy Winehouse

[Chart: No of Tweets]
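
The slides do not show the queries behind the "No of Tweets" charts, which the team ran in Hive. As a rough illustration only, the Python sketch below shows one way such daily counts could be computed while also handling two of the challenges listed earlier: duplicate records and JSON that was closed incorrectly. The field names `id` and `created_at`, the one-tweet-per-line layout, and the timestamp format are assumptions based on the classic Twitter JSON layout, not details taken from the project.

```python
import json
from collections import Counter
from datetime import datetime

# Assumed timestamp layout, e.g. "Fri Jul 22 14:30:00 +0000 2011".
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def load_unique_tweets(lines):
    """Parse one JSON tweet per line; skip malformed records and duplicate ids.

    Returns (tweets, skipped), where skipped counts the unparseable lines.
    """
    seen_ids = set()
    tweets = []
    skipped = 0
    for line in lines:
        try:
            tweet = json.loads(line)
        except json.JSONDecodeError:
            skipped += 1  # e.g. a JSON structure that was closed incorrectly
            continue
        if not isinstance(tweet, dict) or "id" not in tweet:
            skipped += 1  # valid JSON, but not a tweet object
            continue
        if tweet["id"] in seen_ids:
            continue  # duplicate record: keep only the first copy
        seen_ids.add(tweet["id"])
        tweets.append(tweet)
    return tweets, skipped


def tweets_per_day(tweets):
    """Count tweets per calendar day, keyed by ISO date string."""
    counts = Counter()
    for tweet in tweets:
        created = datetime.strptime(tweet["created_at"], CREATED_AT_FORMAT)
        counts[created.date().isoformat()] += 1
    return dict(sorted(counts.items()))
```

Counting skipped lines rather than silently dropping them makes it easier to locate the erroneous files afterwards, which is how the parsing exceptions were handled in the project.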

[Chart: URL Share Count. URLs shown: http://t.co/0IGT940, http://t.co/kLYO5t5, http://huff.to/oDwgHC, http://t.co/BtIzsiW, http://t.co/CahfKYh, http://on.msnbc.com/4dpW6f, http://nyp.st/qYGM9L, http://bit.ly/oapSdd, http://t.co/TkKR8Qm, http://n.pr/nnu5XS]

[Chart: Hashtag Count]

[Chart: User Mention Count. Accounts shown: SkyNewsBreak, YouTube, BreakingNews, HuffingtonPost, Reuters, NewYorkPost, iamshortymack, RollingStone, HotNewHipHop, mashable]

Event 2: Norway attacks

[Chart: No of Tweets]

[Chart: URL Share Count. URLs shown: http://on.mash.to/nViorD, http://bisi.pl/31b, http://bit.ly, http://budurl.com/2tl2, http://t.co/dPHb33j, http://bit.ly/qd41UN, http://apne.ws/qvdeXV, http://bit, http://t.co/AyS26mV, http://twitpic.com/5tzsmx, http://t.co/dXABr5T, http://apne.ws/qi7CM5]

[Chart: Hashtag Count]
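
The hashtag, URL share and user mention counts charted for both events were produced with HiveQL queries that the slides do not include. As a minimal sketch of the same idea in Python, the code below tallies these three kinds of entities from raw tweet text. The regular expressions and the function name `count_entities` are illustrative assumptions, not the project's implementation. Counting is case-sensitive, which matches the hashtag table above, where Oslo and oslo appear as separate rows.

```python
import re
from collections import Counter

# Simplified patterns (assumptions): real tweets can contain edge cases
# such as trailing punctuation on URLs that these do not handle.
URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#(\w+)")
MENTION_RE = re.compile(r"@(\w+)")


def count_entities(texts):
    """Return (hashtags, mentions, urls) Counters over an iterable of tweet texts."""
    hashtags, mentions, urls = Counter(), Counter(), Counter()
    for text in texts:
        urls.update(URL_RE.findall(text))
        # Strip URLs first so that "#fragment" parts of links are not
        # mistaken for hashtags.
        without_urls = URL_RE.sub(" ", text)
        hashtags.update(HASHTAG_RE.findall(without_urls))
        mentions.update(MENTION_RE.findall(without_urls))
    return hashtags, mentions, urls
```

Calling `hashtags.most_common(10)` on the result gives the kind of ranked list shown in the charts. Lowercasing each hashtag before counting would merge variants such as Oslo and oslo if a case-insensitive ranking is preferred.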

[Chart: User Mention Count. Accounts shown: BreakingNews, Reuters, CBSNews, YouTube, HuffingtonPost, YahooNews, StateDept, mpoppel, ggreenwald, SenatorSanders]

Comparison Analysis

The Amy Winehouse event occurred on 23rd July 2011, whereas the Norway attacks event occurred on 22nd July 2011. As the charts show, the number of tweets for event 1 peaked on the day of the event and dropped steeply over the following week until they finally died down. The Norway attacks event, on the other hand, had its maximum number of tweets on the day of the event and over the next couple of days, and the drop in the number of tweets was gradual. It is interesting to note that event 1 garnered its maximum of over 20000 tweets on the day it occurred. Despite being of a more serious nature, event 2 saw far fewer tweets on the day of its occurrence.

Sentiment Analysis

The sentiments, in terms of positive, negative and neutral tweets, for the two events over the period from 07/22/2011 to 07/31/2011 are visualized. The graphs below depict them.

Event 1: Amy Winehouse

Overall, Event 1 garnered the most neutral tweets and the fewest positive tweets.

Event 2: Norway Attacks

Overall, Event 2 also garnered the most neutral tweets and the fewest positive tweets. Interestingly, the number of negative tweets exceeded the neutral and positive tweets during the days following the event.

[Charts, one per event: Tweet Count by date (20-Jul-11 to 1-Aug-11) for Positive, Negative and Neutral tweets]
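
The slides do not state how individual tweets were labelled as positive, negative or neutral. Purely as an illustration of one common approach, the sketch below scores each tweet against small word lists and then tallies the labels per day, which is the shape of data the sentiment charts plot. The word lists, the function names, and the (day, text) input format are all hypothetical and were not taken from the project.

```python
import re
from collections import Counter, defaultdict

# Hypothetical toy lexicons for illustration; a real analysis would use a
# much larger, curated sentiment lexicon.
POSITIVE_WORDS = {"love", "great", "brave", "hope", "beautiful"}
NEGATIVE_WORDS = {"sad", "tragic", "horrible", "terrible", "awful"}

TOKEN_RE = re.compile(r"[a-z']+")


def classify(text):
    """Label a tweet by comparing its positive and negative word counts."""
    tokens = TOKEN_RE.findall(text.lower())
    score = sum(t in POSITIVE_WORDS for t in tokens) - sum(
        t in NEGATIVE_WORDS for t in tokens
    )
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"  # no lexicon words, or positives and negatives cancel out


def sentiment_by_day(dated_texts):
    """Tally labels per day from an iterable of (day, text) pairs."""
    table = defaultdict(Counter)
    for day, text in dated_texts:
        table[day][classify(text)] += 1
    return {day: dict(counts) for day, counts in sorted(table.items())}
```

With a word-list approach like this, any tweet containing no lexicon words is labelled neutral, so a large share of neutral tweets can partly reflect the limits of the lexicon rather than the reserve of the people tweeting.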

Conclusion

Managing huge amounts of data is becoming more convenient with the advent of distributed file systems. They make it possible to manage and analyze huge volumes of data, which can help assess a particular event’s significance over a period of time. The analysis contradicts our initial hypothesis and leads us to conclude that the Amy Winehouse event was at least as popular on Twitter as an event as grave as the Norway attacks. The retweets that the events generated help determine the most discussed issues among Twitter users. It is surprising that a celebrity death can take precedence over an attack on a nation. One possible explanation is that people are cautious about commenting on sensitive issues and choose to refrain from expressing views. The sentiment analysis supports this: with the graphs showing mostly neutral tweets for both events, it can be interpreted that most people are reserved in their opinions and take a neutral stand on a public platform where most activity is scrutinized, especially on an issue as delicate as the Norway attacks.
