11/18/2015 Analyze Twitter Data
Intermediate Project Report
UNIVERSITY AT BUFFALO
Sentiment Analysis of Mr. Narendra Modi’s Brand Image using Twitter Data
Summary: - I am doing sentiment analysis of Mr. Narendra Modi’s brand image across
different nations using data from Twitter. To fetch the Twitter data, I am using Apache
Flume, which is open source and comes installed by default on the Hortonworks Sandbox platform.
After the data is fetched from Twitter, it is loaded directly into HDFS (Hadoop Distributed
File System). This avoids the extra overhead of transferring the data from the local
system to HDFS.
Data loaded into HDFS is still in an unstructured format and is not suitable for ad-hoc analysis, so I will
convert the JSON data to tabular format and store it in Hive. I will also provide
a graphical user interface so that end users can run their own ad-hoc analyses.
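The JSON-to-tabular step can be illustrated with a minimal Python sketch. In the project itself the conversion is done inside Hive; this only shows the idea of flattening nested tweet JSON into flat rows. The column list and the helper names (tweet_to_row, json_lines_to_tsv) are assumptions made for illustration, not the project's actual Hive schema.

```python
import csv
import io
import json

# Columns chosen for illustration only; the report does not list its Hive schema.
COLUMNS = ["id", "created_at", "user_screen_name", "user_location",
           "retweet_count", "text"]


def tweet_to_row(tweet):
    """Flatten one tweet dict into a flat row matching COLUMNS."""
    user = tweet.get("user") or {}
    return [
        tweet.get("id"),
        tweet.get("created_at"),
        user.get("screen_name"),
        user.get("location"),
        tweet.get("retweet_count", 0),
        # Collapse whitespace so tabs or newlines inside the tweet text
        # cannot break the delimited row.
        " ".join((tweet.get("text") or "").split()),
    ]


def json_lines_to_tsv(lines):
    """Convert newline-delimited tweet JSON into tab-separated text."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMNS)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            tweet = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed records instead of failing the whole load
        if isinstance(tweet, dict):
            writer.writerow(tweet_to_row(tweet))
    return out.getvalue()


# Made-up sample record, one JSON object per line.
sample = ['{"id": 1, "created_at": "Wed Nov 18 10:00:00 +0000 2015", '
          '"user": {"screen_name": "alice", "location": "Delhi"}, '
          '"retweet_count": 2, "text": "Great\\nspeech"}']
print(json_lines_to_tsv(sample))
```

Skipping malformed lines rather than raising is a design choice for this sketch; streamed tweet files often contain partial records at file boundaries.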
The next step uses a dictionary file to score the sentiment of each tweet by comparing the
number of positive words with the number of negative words, and then assigns a
positive, negative, or neutral sentiment value to each tweet. I downloaded the dictionary
file from the link below.
Click here for Dictionary
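The scoring rule described above can be sketched in a few lines of Python. The two word sets are tiny stand-ins for the downloaded dictionary file, and the function name score_tweet is hypothetical; the project applies the same comparison in Hive.

```python
import re

# Tiny stand-in word lists; the real project loads these from a dictionary file.
POSITIVE = {"good", "great", "love", "inspiring"}
NEGATIVE = {"bad", "poor", "hate", "disappointing"}


def score_tweet(text, positive=POSITIVE, negative=NEGATIVE):
    """Label a tweet by comparing its positive and negative word counts."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(1 for w in words if w in positive)
    neg = sum(1 for w in words if w in negative)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"  # equal counts, including tweets with no dictionary words


for text in ["Great speech, love it", "Bad and poor policy", "Good but bad"]:
    print(text, "->", score_tweet(text))
```

A tweet with no dictionary words at all is treated as neutral here, which is one reasonable reading of the rule.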
The last part of the project is to show the results of the sentiment analysis as visualizations, for which I
will use Tableau. I will connect Tableau to Hive using the Hortonworks ODBC
Driver that I downloaded from the Hortonworks website (link mentioned in the references section).
I will show the results of the analysis in the form of graphs and maps using Tableau’s built-in VizQL.
Data sets and Software:
Sentiment Data: - Sentiment data is unstructured data that represents opinions, emotions, and
attitudes contained in sources such as social media posts, online blogs, and product reviews.
Why use sentiment data: - Organizations use sentiment data to learn how people feel about
their product and what they can do to market it effectively.
How I fetched Twitter data: - I created a Twitter app, configured flume.conf with the app
credentials, and ran Flume. All the steps for fetching data from Twitter using Apache Flume
are covered in a YouTube video and a PowerPoint presentation, linked below. I have also uploaded
the video to the UBlearns discussion forum of DC.
YouTube: - https://youtu.be/E1w5SkE7Cco
Slide share: - http://www.slideshare.net/bharat3khanna/extracting-twitter-data-using-
Source code for Flume-Snapshot.jar: - I downloaded the source code of Flume-snapshot.jar from GitHub
and built the jar using Maven package on the Hadoop cluster.
Click here for Flume Source Code
Size of Data: - Although there is no limit on the amount of data I can get from Twitter, for this
project I am going to analyze approximately 100 MB of data.
Algorithms Used: - I am not using a MapReduce algorithm here, since I want to do analysis on the complete
data and I don’t want to use aggregated measures. If I had used MapReduce, a lot of my
data would have been aggregated by the reducer. My source data is in JSON format, and I am using
Hive-serde.jar (SerDe stands for serializer and deserializer), which helps parse the JSON data effectively.
Source code for Hive-serde.jar: - I downloaded the source code of Hive-serde.jar from GitHub and built the
jar using Maven package on the Hadoop cluster.
Click here for Hive-serde.jar source code
Analysis to be done on Twitter data: - I am going to do the following analyses using Hive and Tableau: -
a) Maximum tweets count per user.
b) Count of retweets.
c) Geographically mapping people’s sentiments towards Mr. Modi.
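The three analyses listed above can be sketched in Python on a handful of in-memory records. The project runs them in Hive and Tableau; this only illustrates the logic. The record fields (user, country, is_retweet, sentiment) and the sample values are assumptions, and "count of retweets" is read here as the number of tweets that are retweets.

```python
from collections import Counter, defaultdict

# Made-up records; field names are hypothetical, not the project's Hive columns.
tweets = [
    {"user": "alice", "country": "India", "is_retweet": False, "sentiment": "positive"},
    {"user": "alice", "country": "India", "is_retweet": True, "sentiment": "positive"},
    {"user": "bob", "country": "USA", "is_retweet": False, "sentiment": "negative"},
    {"user": "alice", "country": "India", "is_retweet": False, "sentiment": "neutral"},
    {"user": "carol", "country": "USA", "is_retweet": True, "sentiment": "positive"},
]


def top_tweeter(records):
    """a) Return (user, tweet_count) for the most active user, or None if empty."""
    counts = Counter(r["user"] for r in records)
    return counts.most_common(1)[0] if counts else None


def retweet_count(records):
    """b) Count how many of the records are retweets."""
    return sum(1 for r in records if r["is_retweet"])


def sentiment_by_country(records):
    """c) Tally sentiment labels per country, ready to plot on a map."""
    tally = defaultdict(Counter)
    for r in records:
        tally[r["country"]][r["sentiment"]] += 1
    return {country: dict(labels) for country, labels in tally.items()}


print(top_tweeter(tweets))
print(retweet_count(tweets))
print(sentiment_by_country(tweets))
```

The per-country tally corresponds to the kind of grouped result that a mapping tool would consume, with one row per country and sentiment label.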