Twitter Sentiment Analysis
  1. 1. Presented By: RANJAN KUMAR BAITHA
  3. 3. About Twitter • Social networking and microblogging service • Enables users to send and read messages • Messages are up to 140 characters long, known as "tweets" • Tweets contain rich information about people's preferences • People share their thoughts about matches and players' stats on Twitter
  4. 4. People's opinions about a match have a huge impact on its success. Our project includes prediction using Twitter data and analysis of the prediction results. A high volume of positive tweets may indicate the performance and result of a match and its players. But how do we quantify this?
  5. 5. • The core problem in Twitter analytics is classifying the polarity of a given text at the document, sentence, or feature/aspect level: whether the document, sentence, or entity is positive, negative, or neutral.
  6. 6. Using social media to predict the future has become very popular in recent years. • Predicting the Future with Social Media (Bernardo) shows that Twitter-based prediction of matches and players can reflect results and performance. • Predicting matches and players' performance using social media (Andrei Oghina, Mathias Breuss, Manos Tsagkias & Maarten de Rijke, 2012) uses Twitter and Facebook data to predict scores and results, as well as which players will perform in a given match. My project includes prediction using Twitter data and an investigation of two new topics based on the prediction results.
  7. 7. • Data Collection: existing Twitter data set and recent tweets via the Twitter API • Data Pre-processing: get "clean" data and transform it into the format we need • Analysis: train a classifier to label tweets as positive, negative, neutral, or irrelevant • Prediction: use the statistics of the tweets' labels to predict the match result (win/loss)
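The last two steps of this pipeline can be sketched in plain R: given per-tweet sentiment labels, tally them and use a simple positive-versus-negative comparison as a naive win/loss predictor. The labels follow the four classes named on the slide; the toy data and the decision rule are illustrative assumptions, not from the deck.

```r
# Toy sentiment labels, one per tweet, using the slide's four classes.
labels <- c("positive", "positive", "negative", "neutral",
            "positive", "irrelevant", "negative", "positive")

# Tally the labels, keeping all four classes even if a count is zero.
counts <- table(factor(labels,
                levels = c("positive", "negative", "neutral", "irrelevant")))

# Assumed decision rule: predict a win when positive tweets
# outnumber negative ones.
prediction <- if (counts[["positive"]] > counts[["negative"]]) "win" else "loss"
print(counts)
print(prediction)
```

In the real pipeline the labels would come from the trained classifier rather than a hand-written vector.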
  8. 8. MapReduce – Data Reduction The processing pillar of the Hadoop ecosystem is the MapReduce framework. The framework lets you specify an operation to apply to a huge data set, divides the problem and the data, and runs it in parallel. From an analyst's point of view, this can occur along multiple dimensions. For example, a very large dataset can be reduced into a smaller subset where analytics can be applied.
  9. 9. MapReduce – R Executing R code in the context of a MapReduce job elevates the kinds and sizes of analytics that can be applied to huge datasets. Problems that fit nicely into this model include "pleasingly parallel" scenarios. Here's a simple use case: scoring a dataset against a model built in R.
  10. 10. HDFS Architecture
  11. 11. Namenode • Manages the file system's namespace, metadata, and file blocks • Runs on one machine (up to several machines) Datanode • Stores and retrieves data blocks • Reports to the Namenode • Runs on many machines Secondary Namenode • Performs housekeeping work so the Namenode doesn't have to • Requires hardware similar to the Namenode machine • Not used for high availability; not a backup for the Namenode
  12. 12. • Imposes key-value input/output • Defines map and reduce functions: map: (K1,V1) → list(K2,V2) reduce: (K2,list(V2)) → list(K3,V3) • The map function is applied to every input key-value pair • The map function generates intermediate key-value pairs • Intermediate key-values are sorted and grouped by key • Reduce is applied to the sorted and grouped intermediate key-values • Reduce emits the result key-values
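The (K,V) contract above can be imitated in plain R with the classic word-count example: map emits a (word, 1) pair per word, the shuffle groups the pairs by key, and reduce sums each group. This is a local simulation of the contract, not actual Hadoop code.

```r
# Local simulation of the MapReduce contract: word count.
input <- c("good match", "good players")

# map: (docid, line) -> list of (word, 1) pairs
mapped <- unlist(strsplit(input, " "))     # emit one key per word
pairs  <- data.frame(key = mapped, value = 1)

# shuffle/sort: group the intermediate pairs by key
# reduce: (word, list(1, 1, ...)) -> (word, count)
result <- tapply(pairs$value, pairs$key, sum)
print(result)
```

Frameworks like Hadoop perform the same grouping step across machines; here `tapply` stands in for the shuffle-and-sort barrier.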
  13. 13. Takes care of distributed processing and coordination • Scheduling – jobs are broken down into smaller chunks called tasks, which are then scheduled • Task localization with data – the framework strives to place tasks on the nodes that host the segment of data to be processed by that task – code is moved to where the data is
  14. 14. • Error handling – failures are expected behavior, so tasks are automatically retried on other machines • Data synchronization – the shuffle-and-sort barrier rearranges and moves data between machines – input and output are coordinated by the framework
  15. 15. This involves pushing the model to the task nodes in the Hadoop cluster, running a MapReduce job that loads the model into R on a task node, scoring data row by row (or in aggregates), and writing the results back to HDFS. • In the simplest case this can be done with just a map task.
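A minimal local sketch of that scoring pattern: build a model in R, then apply it chunk by chunk the way a map task would. Plain R is used here; in a real job each chunk would arrive via the MapReduce framework and the scored rows would be written back to HDFS. The model and data are stand-ins from base R, not from the project.

```r
# Sketch: score a dataset against a prebuilt R model, map-task style.
model <- lm(dist ~ speed, data = cars)   # stand-in for a trained model

score_chunk <- function(chunk) {
  # Each map task loads the model and scores its slice of rows.
  cbind(chunk, predicted = predict(model, newdata = chunk))
}

# Simulate two input splits of the data set.
splits <- split(cars, rep(1:2, length.out = nrow(cars)))
scored <- do.call(rbind, lapply(splits, score_chunk))
nrow(scored)   # every row of the original data gets scored
```

Because each chunk is scored independently, this is exactly the "pleasingly parallel" shape the earlier slide describes.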
  16. 16. HDFS Overview To meet these challenges we have to start with some basics. First, we need to understand data storage in Hadoop, how it can be leveraged from R, and why it is important. The basic storage mechanism in Hadoop is HDFS (Hadoop Distributed File System). For an R programmer, being able to read/write files in HDFS from a standalone R session is the first step in working within HDFS.
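Reading and writing HDFS files from a standalone R session is typically done with the `rhdfs` package from the RHadoop project; the deck does not name the package, so this is an assumption. The calls below are a sketch and require a configured Hadoop client, so they will not run on a plain workstation.

```r
# Sketch: basic HDFS I/O from R via the rhdfs package (RHadoop).
# Assumes the HADOOP_CMD environment variable points at the hadoop binary.
library(rhdfs)
hdfs.init()                              # connect to the cluster

hdfs.ls("/user")                         # list a directory in HDFS
hdfs.put("tweets.csv", "/user/tweets")   # copy a local file into HDFS
hdfs.get("/user/tweets/tweets.csv",      # copy it back out again
         "tweets_copy.csv")
```

With this in place, the scoring job from the previous slide can read its input splits from and write its results to HDFS.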
  17. 17. • Avoid sampling/aggregation • Reduce data movement and replication • Bring the analytics as close as possible to the data • Optimize computation speed
  18. 18. Creating a Twitter Application The first step in performing Twitter analysis is to create a Twitter application. This application will allow you to perform analysis by connecting your R console to Twitter via the Twitter API. The steps for creating your Twitter application are: go to the Twitter developer site and log in with your Twitter account, then go to My Applications → Create a new application.
  19. 19. Give your application a name, describe your application in a few words, and provide your website's URL or your blog address (in case you don't have a website). Leave the Callback URL blank for now. Complete the other formalities and create your Twitter application. Once all the steps are done, the created application will be shown as below. Please note the Consumer Key and Consumer Secret, as they will be used in RStudio later.
  20. 20. This step is done. Next, I will work in RStudio.
  21. 21. Working on RStudio – Building the corpus In this section, I will first use some packages in R: twitteR, ROAuth, plyr, stringr, RJSONIO, RCurl, bitops and ggplot2. You can install these packages with the following commands:
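The installation commands themselves appear only as a screenshot in the original deck; they would look like the one-time setup below, using the package names listed on the slide.

```r
# Install the packages used in this analysis (one-time setup).
pkgs <- c("twitteR", "ROAuth", "plyr", "stringr",
          "RJSONIO", "RCurl", "bitops", "ggplot2")
install.packages(pkgs)

# Load them into the current session.
lapply(pkgs, library, character.only = TRUE)
```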
  22. 22. Now run the following R script snippet. After running this section of the script, the console will look like this:
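The snippet on this slide is shown only as a screenshot in the original deck. Judging from the next slide ("once this file is downloaded"), it most likely downloads the `cacert.pem` SSL certificate bundle that ROAuth uses for the OAuth handshake; a sketch under that assumption, using the certificate URL common in tutorials of this era:

```r
# Download the CA certificate bundle used by ROAuth for SSL
# verification (assumed content of the slide's screenshot).
download.file(url = "http://curl.haxx.se/ca/cacert.pem",
              destfile = "cacert.pem")
```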
  23. 23. • Once this file is downloaded, we move on to accessing the Twitter API. This step includes the script code to perform the handshake using the Consumer Key and Consumer Secret of your own application. • You have to replace these entries with the keys from your application. The following is the code you have to run to perform the handshake:
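The handshake code the slide describes (shown only as a screenshot) follows the standard ROAuth flow for the Twitter API. The keys below are placeholders for your own application's Consumer Key and Consumer Secret; the call will open a browser and ask for a PIN, so it cannot run unattended.

```r
library(ROAuth)
library(twitteR)

# Placeholders: substitute the keys from your own Twitter application.
cred <- OAuthFactory$new(
  consumerKey    = "YOUR_CONSUMER_KEY",
  consumerSecret = "YOUR_CONSUMER_SECRET",
  requestURL     = "https://api.twitter.com/oauth/request_token",
  accessURL      = "https://api.twitter.com/oauth/access_token",
  authURL        = "https://api.twitter.com/oauth/authorize")

# Performs the OAuth handshake, using the downloaded certificate bundle.
cred$handshake(cainfo = "cacert.pem")
registerTwitterOAuth(cred)
```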
  24. 24. Saving Tweets Once the handshake is done and authorized by Twitter, we can fetch the most recent tweets related to any keyword. I have used #Kejriwal, as Mr. Arvind Kejriwal is the most talked-about person in Delhi nowadays. The command gets 1000 tweets related to #Kejriwal; the function "searchTwitter" is used to download tweets from the timeline. We then need to convert this list of 1000 tweets into a data frame so that we can work on it, and finally convert the data frame into a .csv file.
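The fetching code this slide refers to appears only as a screenshot; with the `twitteR` package it would follow the pattern below. The keyword and tweet count come from the slide; the output filename is an illustrative assumption. This requires a completed handshake and network access.

```r
library(twitteR)

# Fetch the 1000 most recent tweets mentioning #Kejriwal.
tweets <- searchTwitter("#Kejriwal", n = 1000)

# Convert the list of status objects into a data frame ...
tweets_df <- twListToDF(tweets)

# ... and save it as a CSV file for later analysis
# (filename is an assumption, not from the slide).
write.csv(tweets_df, file = "kejriwal_tweets.csv", row.names = FALSE)
```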