Mike Davies sentiment_analysis_presentation_backup
Sentiment Analysis: 1. Discover a niche network of Twitter users. 2. Model their emotions on topics. 3. Use feelings to predict a time series more accurately, e.g. the stock market or box office success. 4. Are some users or networks more influential than others?
This Talk: The Design Decision. The Core Goals. The 3 parts of the project: 1. Classifying the SENTIMENT of tweets. 2. Building a NETWORK of Twitter users. 3. Finding a TIME SERIES of sentiment for each user.
Sentiment Analysis Used Already: Derwent Capital Markets, "the Twitter hedge fund". £25m fund. 10% of tweets predict the direction of Dow Jones movement with 87.6% accuracy. Returned 1.85% in its first month of trading. Johan Bollen, Indiana University, used a bag-of-words approach.
Sentiment Analysis Used Already: Product reviews / ratings
Sentiment Analysis Used Already: Social Media Analytics
Design Decision: Many paragraphs of text (product reviews): + better accuracy of prediction; - less data overall. Huge number of small pieces of text (Twitter): + opinions of a greater number of people, at high enough frequency to model as a signal; - classification of opinion is very poor. => TWITTER
2 Current Aims (will change later): 1. Project aims to be context independent (e.g. movies & products). 2. When context is given, use it to better classify tweets.
1: Sentiment Analysis of Tweets: Three-tier classification process: tweet => spam | not spam; not spam => objective | subjective; subjective => positive | negative.
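A minimal sketch of how such a three-tier cascade could be wired together. The keyword lists and the function names (`is_spam`, `is_subjective`, `polarity`, `classify_tweet`) are hypothetical stand-ins for trained classifiers, not the project's actual models; only the ordering of the three decisions comes from the slide.

```python
import re

# Hypothetical keyword lists standing in for trained classifiers.
SPAM_WORDS = {"free", "win", "click"}
POSITIVE_WORDS = {"good", "great", "fantastic", "love"}
NEGATIVE_WORDS = {"bad", "awful", "hate"}


def tokenise(text):
    """Lowercase the tweet and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def is_spam(tokens):
    """Tier 1: spam vs. not spam."""
    return any(token in SPAM_WORDS for token in tokens)


def is_subjective(tokens):
    """Tier 2: subjective if any opinion word is present."""
    return any(t in POSITIVE_WORDS or t in NEGATIVE_WORDS for t in tokens)


def polarity(tokens):
    """Tier 3: compare counts of positive and negative words."""
    positive = sum(t in POSITIVE_WORDS for t in tokens)
    negative = sum(t in NEGATIVE_WORDS for t in tokens)
    return "positive" if positive >= negative else "negative"


def classify_tweet(text):
    """Run the tiers in order; each tier only sees what the previous one passed on."""
    tokens = tokenise(text)
    if is_spam(tokens):
        return "spam"
    if not is_subjective(tokens):
        return "objective"
    return polarity(tokens)
```

Each tier filters the input for the next one, so the polarity step is only ever asked about tweets already judged to be non-spam and subjective.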
1: Sentiment Analysis of Tweets: Double Propagation algorithm (Computational Linguistics, the ACL journal, March 2011, MIT Press). Opinion Word Expansion & Target Extraction. 4 rules, e.g. "The phone has a good screen" => add "good" to list of adjectives => add "screen" to list of nouns, etc. Great for rating features of a product. Not great for tweets.
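The bootstrapping idea behind the example above can be sketched as follows. This is a heavy simplification: the published rules work on dependency relations, whereas this sketch uses a single hypothetical rule (an adjective directly followed by a noun) on hand-tagged sentences. The tags `A` and `N`, the sample sentences and the function name `double_propagation` are all assumptions made for illustration.

```python
def double_propagation(tagged_sentences, seed_opinion_words):
    """Grow opinion-word and target lists from each other until nothing changes.

    tagged_sentences: list of sentences, each a list of (token, tag) pairs,
    where tag "A" marks an adjective and "N" a noun (assumed tag names).
    """
    opinion_words = set(seed_opinion_words)
    targets = set()
    changed = True
    while changed:
        changed = False
        for sentence in tagged_sentences:
            # Simplified rule: look only at adjective + noun neighbours.
            for (word, tag), (next_word, next_tag) in zip(sentence, sentence[1:]):
                if tag != "A" or next_tag != "N":
                    continue
                adjective, noun = word.lower(), next_word.lower()
                # Known opinion word => the noun it modifies is a target.
                if adjective in opinion_words and noun not in targets:
                    targets.add(noun)
                    changed = True
                # Known target => the adjective modifying it is an opinion word.
                if noun in targets and adjective not in opinion_words:
                    opinion_words.add(adjective)
                    changed = True
    return opinion_words, targets


# Hand-tagged, made-up sentences.
sentences = [
    [("The", "D"), ("phone", "N"), ("has", "V"), ("a", "D"),
     ("good", "A"), ("screen", "N")],
    [("amazing", "A"), ("battery", "N"), ("life", "N")],
    [("An", "D"), ("amazing", "A"), ("screen", "N")],
]
opinions, targets = double_propagation(sentences, {"good"})
```

Starting from the single seed "good", the sketch finds "screen" as a target, then "amazing" as a new opinion word, then "battery" as a further target.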
1: Sentiment Analysis of Tweets: Twitter Part Of Speech (POS) tagger: www.ark.cs.cmu.edu/TweetNLP/ Written in Java. Max Ent. Example output as token/tag pairs: "/^ Drive/^ "/^ ,/, go/V and/& watch/V it/O !/, Fantastic/A movie/N ./,
2: Building a Network: Community detection. Paper 1: Near linear time algorithm to detect community structures in large-scale networks. Paper 2: An LDA-based Community Structure Discovery Approach for Large-Scale Social Networks (Haizheng Zhang).
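A minimal sketch of label propagation used for community detection, in the spirit of Paper 1. Two things are assumptions made to keep the example reproducible and are not taken from the paper: nodes are visited in sorted order rather than random order, and ties are broken by taking the largest label rather than a random one. The function name `detect_communities` and the toy graph are hypothetical.

```python
from collections import Counter


def detect_communities(adjacency, max_rounds=100):
    """Group nodes by repeatedly adopting the most common neighbour label.

    adjacency: dict mapping each node to a list of its neighbours.
    """
    # Every node starts in its own community.
    labels = {node: node for node in adjacency}
    for _ in range(max_rounds):
        changed = False
        for node in sorted(adjacency):  # assumption: fixed order, not random
            neighbours = adjacency[node]
            if not neighbours:
                continue
            counts = Counter(labels[n] for n in neighbours)
            best = max(counts.values())
            candidates = [label for label, c in counts.items() if c == best]
            # Keep the current label if it is already one of the most common.
            if labels[node] in candidates:
                continue
            labels[node] = max(candidates)  # assumption: deterministic tie-break
            changed = True
        if not changed:
            break
    communities = {}
    for node, label in labels.items():
        communities.setdefault(label, set()).add(node)
    return list(communities.values())


# Toy graph: two groups of four fully connected users, joined by one edge d-e.
graph = {
    "a": ["b", "c", "d"], "b": ["a", "c", "d"], "c": ["a", "b", "d"],
    "d": ["a", "b", "c", "e"],
    "e": ["d", "f", "g", "h"],
    "f": ["e", "g", "h"], "g": ["e", "f", "h"], "h": ["e", "f", "g"],
}
found = detect_communities(graph)
```

Each pass touches every edge once, which is where the near linear running time comes from.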
2: Building a Network: Like MapReduce, but instead of "map" and "reduce": Map = Update: modify overlapping sets of data. Reduce = Sync: perform reductions in the background while sync is running. Label Propagation & LDA.
3: Time series prediction: Will get time series from Python to R using the rpy2 module. R has a great package, "quantmod", for importing financial market data. Can also import other time series very easily, and R has many great libraries.
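Before any series can be passed on for analysis, the classified tweets have to be turned into one series per user. A minimal sketch of that aggregation step, assuming each tweet has already been scored +1 (positive) or -1 (negative); the function name `daily_sentiment`, the tuple layout and the sample data are hypothetical, and the hand-over to R is not shown.

```python
from collections import defaultdict
from datetime import date


def daily_sentiment(scored_tweets):
    """Average the sentiment score per user per day.

    scored_tweets: iterable of (user, day, score) tuples.
    Returns a dict: user -> list of (day, mean score), sorted by day.
    """
    buckets = defaultdict(lambda: defaultdict(list))
    for user, day, score in scored_tweets:
        buckets[user][day].append(score)
    return {
        user: [(day, sum(scores) / len(scores))
               for day, scores in sorted(days.items())]
        for user, days in buckets.items()
    }


# Made-up sample data.
tweets = [
    ("alice", date(2012, 1, 1), 1),
    ("alice", date(2012, 1, 1), -1),
    ("alice", date(2012, 1, 2), 1),
    ("bob", date(2012, 1, 1), -1),
]
series = daily_sentiment(tweets)
```

Daily buckets are an arbitrary choice here; any fixed interval that gives enough tweets per bucket would do.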
Built With: Python - for majority of code. Packages: numpy, scipy, matplotlib, networkx, graphviz, rpy2, django, twython, nltk. R - for time series analysis. PostgreSQL - SQL database. Java - Twitter POS tagger. C/C++ - GraphLab.
Thank You. Mike Davies. Documented at www.m1ked.com
Notes: Vowpal Wabbit LDA. Vowpal Wabbit is an open source library for fast online learning (mostly SGD), mainly developed by a guy at Yahoo. Optimised for speed: its LDA uses clever tricks like vectorisation and floating point representation to avoid using the pow() and exp() functions.
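To illustrate the kind of floating point representation trick meant here, the sketch below approximates log2 and 2**p by reading and writing the bit pattern of a 32-bit float, then combines them into a rough pow(). This is not Vowpal Wabbit's code (which is C++); the function names are hypothetical and the approximation is the crudest variant, accurate only to within about 0.09 for log2.

```python
import math
import struct


def fast_log2(x):
    """Rough log2(x) for positive x within float32 range.

    The exponent field of an IEEE 754 float already stores floor(log2(x)),
    and the mantissa bits act as a linear guess at the fractional part.
    """
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits / float(1 << 23) - 127.0


def fast_pow2(p):
    """Rough 2**p: build the float32 bit pattern directly from p."""
    bits = int((p + 127.0) * (1 << 23))
    # Clamp to the range of non-negative finite float32 bit patterns.
    bits = max(0, min(bits, 0x7F7FFFFF))
    (value,) = struct.unpack("<f", struct.pack("<I", bits))
    return value


def fast_pow(x, p):
    """Rough x**p for positive x, using x**p == 2**(p * log2(x))."""
    return fast_pow2(p * fast_log2(x))
```

Both helpers are exact at powers of two and drift in between; in compiled code the appeal is that they need only integer reinterpretation and one multiply, not a library call.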
Notes: Label Propagation. Label propagation has proven to be an effective semi-supervised learning approach in many applications. The key idea is to first construct a graph in which each node represents a data point and each edge is assigned a weight, often computed as the similarity between the two data points. The class labels of the labeled data are then propagated to neighbors in the constructed graph in order to make predictions.
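A minimal sketch of the semi-supervised variant described above, assuming a small weighted graph given as nested dicts. Labeled nodes stay fixed and every other node repeatedly takes the weighted average of its neighbors' class scores. The function name `propagate_labels`, the fixed number of rounds and the toy graph are assumptions for illustration.

```python
def propagate_labels(weights, seeds, rounds=50):
    """Spread class labels from labeled nodes across a weighted graph.

    weights: dict node -> dict neighbour -> similarity weight (symmetric).
    seeds:   dict node -> class label, for the labeled data points.
    Returns a dict node -> predicted class label.
    """
    classes = sorted(set(seeds.values()))
    # scores[node][cls] is the current belief that node belongs to cls.
    scores = {node: {cls: 0.0 for cls in classes} for node in weights}
    for node, cls in seeds.items():
        scores[node][cls] = 1.0
    for _ in range(rounds):
        updated = {}
        for node, neighbours in weights.items():
            if node in seeds:
                updated[node] = scores[node]  # labeled nodes are clamped
                continue
            total = sum(neighbours.values())
            updated[node] = {
                cls: (sum(w * scores[nb][cls] for nb, w in neighbours.items()) / total
                      if total else 0.0)
                for cls in classes
            }
        scores = updated
    return {node: max(classes, key=lambda cls: scores[node][cls])
            for node in weights}


# Toy chain u1 - u2 - u3 - u4: strong similarity at the ends, weak in the middle.
similarity = {
    "u1": {"u2": 0.9},
    "u2": {"u1": 0.9, "u3": 0.1},
    "u3": {"u2": 0.1, "u4": 0.9},
    "u4": {"u3": 0.9},
}
predicted = propagate_labels(similarity, {"u1": "positive", "u4": "negative"})
```

The unlabeled nodes end up with the label of the seed they are most strongly connected to, which is the behaviour the note describes.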