This document discusses using social media data, such as tweets, to predict commodity futures prices. It analyzes over 5 years of tweets related to corn, soybeans, and wheat to determine whether textual and sentiment analysis can provide a signal for predicting futures prices. The analysis was done using the Pivotal Greenplum Database and open source tools such as MADlib for machine learning and a Twitter part-of-speech tagger. The analysis found that tweet sentiment was negatively correlated with commodity prices and that a blended model combining text, sentiment, and other factors such as weather produced encouraging results for predicting commodity futures.
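As a rough illustration of the signal test described above, here is a minimal pandas sketch that correlates daily aggregate tweet sentiment with futures prices. The file and column names are invented for illustration; the original analysis ran in-database on Greenplum.

```python
# A rough pandas sketch of the signal test described above: correlate daily
# aggregate tweet sentiment with corn futures prices. File and column names
# are invented; the original analysis ran in-database on Greenplum.
import pandas as pd

sentiment = pd.read_csv("corn_tweet_sentiment.csv", parse_dates=["date"])
prices = pd.read_csv("corn_futures_close.csv", parse_dates=["date"])
daily = sentiment.merge(prices, on="date", how="inner")

# Contemporaneous correlation; a negative value matches the reported finding.
print(daily["mean_sentiment"].corr(daily["close"]))

# Correlation with next-day returns is closer to a true predictive signal.
daily["next_day_return"] = daily["close"].pct_change().shift(-1)
print(daily["mean_sentiment"].corr(daily["next_day_return"]))
```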
A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ... (Srivatsan Ramanujam)
Unstructured data is everywhere - in the form of posts, status updates, bloglets, or news feeds in social media, or in the form of customer interactions in call center CRM systems. While many organizations study and monitor social media for tracking brand value and targeting specific customer segments, in our experience, blending unstructured data with structured data to supplement data science models has been far more effective than working with either independently.
In this talk we will showcase an end-to-end topic and sentiment analysis pipeline we've built on the Pivotal Greenplum Database platform for Twitter feeds from GNIP, using open source tools like MADlib and PL/Python. We've used this pipeline to build regression models to predict commodity futures from tweets and to enhance churn models for telecom through topic and sentiment analysis of call center transcripts. All of this was possible because of the flexibility and extensibility of the platform we worked with.
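For a flavor of the sentiment stage of such a pipeline, here is a toy lexicon-based scorer showing the shape of the per-tweet step; in the pipeline above this logic would live in a PL/Python UDF and run in parallel inside the database. The word lists are placeholders, not the model from the talk.

```python
# A toy lexicon-based scorer showing the shape of the per-tweet sentiment
# step; in the pipeline above this logic would live in a PL/Python UDF and
# run in parallel inside Greenplum. Word lists are placeholders.
POSITIVE = {"good", "great", "strong", "bullish", "up"}
NEGATIVE = {"bad", "poor", "weak", "bearish", "down", "drought"}

def sentiment_score(tweet):
    """Return a score in [-1, 1] from matched positive/negative tokens."""
    tokens = tweet.lower().split()
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    matched = pos + neg
    return 0.0 if matched == 0 else float(pos - neg) / matched

print(sentiment_score("Drought fears and a weak corn crop this year"))  # -1.0
```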
Pivotal: Data Scientists on the Front Line: Examples of Data Science in Action (EMC)
A close-up examination of how data science is being used today to drive company and sector-level transformations. Reviewing architecture, business goals, data science methodology and tool-usage, and the path to operationalization. Multiple case-studies reveal how Data Science has delivered lasting value to the organization, while also paving the way for data to become a new source of competitive differentiation.
After this session you will be able to:
Objective 1: Learn how companies can become predictive rather than reacting to the past.
Objective 2: Understand why companies that employ data science strategies will be able to develop a competitive advantage.
Objective 3: Understand how companies can get started on their journey to become data-driven organizations.
PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn... (Srivatsan Ramanujam)
These are slides from my talk @ DataDay Texas, in Austin on 30 Mar 2013
(http://2013.datadaytexas.com/schedule)
Favorite and Fork PyMADlib on GitHub: https://github.com/gopivotal/pymadlib
MADlib: http://madlib.net
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal (Srivatsan Ramanujam)
These slides give an overview of the technology and the tools used by Data Scientists at Pivotal Data Labs. This includes Procedural Languages like PL/Python, PL/R, PL/Java, PL/Perl and the parallel, in-database machine learning library MADlib. The slides also highlight the power and flexibility of the Pivotal platform from embracing open source libraries in Python, R or Java to using new computing paradigms such as Spark on Pivotal HD.
Pivotal OSS meetup - MADlib and PivotalR (go-pivotal)
With the explosion of big data, the need for fast and inexpensive analytics solutions has become a key basis of competition in many industries. Extracting the value of big data with analytics can be complex, and requires advanced skills.
At Pivotal, we are building open-source solutions (MADlib, PivotalR, PyMadlib) to simplify this process for the user, while maintaining the efficiency necessary for big data analysis.
This talk will provide information about MADlib, an open source library of SQL-based algorithms for machine learning, data mining and statistics that run at large scale within a database engine, with no need for data import/export to other tools.
It provides an overview of the library’s architecture and compares various statistical methods with those available in Apache Mahout.
We also introduce PivotalR, an R-based wrapper for MADlib that gives data scientists and programmers access to the power of MADlib with the ease of use of R.
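Here is a minimal sketch of what the "no import/export" workflow looks like from Python. The table and column names (tweet_features, next_day_price, mean_sentiment, tweet_volume) are hypothetical; madlib.linregr_train is MADlib's linear regression entry point, and training runs entirely inside the database.

```python
# A sketch of MADlib's in-database linear regression called from Python.
# Table and column names are hypothetical; training runs inside the
# database, so no data is exported to the client.
import psycopg2

conn = psycopg2.connect("dbname=analytics")  # hypothetical database
with conn, conn.cursor() as cur:
    cur.execute("DROP TABLE IF EXISTS tweet_price_model, tweet_price_model_summary")
    cur.execute("""
        SELECT madlib.linregr_train(
            'tweet_features',                         -- source table
            'tweet_price_model',                      -- output model table
            'next_day_price',                         -- dependent variable
            'ARRAY[1, mean_sentiment, tweet_volume]'  -- intercept + features
        )
    """)
    cur.execute("SELECT coef, r2 FROM tweet_price_model")
    print(cur.fetchone())
```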
Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Scienc... (Sarah Aerni)
Slides from the Pivotal Open Source Hub Meetup
"Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Science!"
As the need for data science as a key differentiator grows in all industries, from large corporations to startups, the need to get to results quickly is enabled by sharing ideas and methods in the community. The data science team at Pivotal leverages and contributes to this community of publicly available and open source technologies as part of their practice. We will share the resources we use by highlighting specific toolkits for building models (e.g. MADlib, R) and visualization (e.g. Gephi and Circos) along with their benefits and limitations by sharing examples from Pivotal's data science engagements. At the end of this session we hope to have answered the questions: Where can I get started with Data Science? Which toolkit is most appropriate for building a model with my dataset? How can I visualize my results to have the greatest impact?
Bio: Sarah Aerni is a member of the Pivotal Data Science team with a focus on healthcare and life science. She has a background in the field of Bioinformatics, developing tools to help biomedical researchers understand their data. She holds a B.S. in Biology with a specialization in Bioinformatics and a minor in French Literature from UCSD, and an M.S. and Ph.D. in Biomedical Informatics from Stanford University. During her time as a researcher she focused on the interface between machine learning and biology, building computational models enabling research for a broad range of fields in biomedicine. She also co-founded a start-up providing informatics services to researchers and small companies. At Pivotal she works with customers in life science and healthcare building models to derive insight and business value from their data.
In this paper we introduce the MADlib project, including the background that led to its beginnings and the motivation for its open source nature. We provide an overview of the library's architecture and design patterns, and describe various statistical methods in that context.
Sangchul Song and Thu Kyaw discuss machine learning at AOL, and the challenges and solutions they encountered when trying to train a large number of machine learning models using Hadoop. Algorithms including SVM and packages like Mahout are discussed. Finally, they discuss their analytics pipeline, which includes some custom components used to interoperate with a range of machine learning libraries, as well as integration with the query language Pig.
Graph Databases and Machine Learning | November 2018 (TigerGraph)
Graph Databases and Machine Learning: Finding a Happy Marriage. Graph databases and machine learning both represent powerful tools for getting more value from data; learn how they can form a harmonious marriage to up-level machine learning.
Full Webinar: https://info.tigergraph.com/graph-gurus-21
In this Graph Gurus episode, we:
-Explain the architecture and technical implementation for a TigerGraph + Spark graph-enhanced Machine Learning pipeline
-Use TigerGraph both before training to extract (graph and non-graph) features and after training to apply the model on streaming data
-Use Spark to train and tune machine learning models at scale (a PySpark sketch follows this list)
-Present a solution in production at China Mobile that detects and prevents phone-based scams using machine learning with TigerGraph
-Demo the data flow between Spark and TigerGraph via TigerGraph's JDBC driver
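A minimal PySpark sketch of the "train and tune at scale" step referenced in the list above. The parquet path, feature columns, and is_scam label are hypothetical graph-derived features, not China Mobile's actual schema.

```python
# A PySpark sketch of training and tuning a classifier on graph-derived
# features. Path, columns, and label are hypothetical.
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

spark = SparkSession.builder.appName("graph-ml-sketch").getOrCreate()
df = spark.read.parquet("graph_features.parquet")  # features exported from the graph

assembler = VectorAssembler(
    inputCols=["pagerank", "degree", "community_size"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="is_scam")

# Cross-validated hyperparameter search over the regularization strength.
grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()
cv = CrossValidator(
    estimator=lr,
    estimatorParamMaps=grid,
    evaluator=BinaryClassificationEvaluator(labelCol="is_scam"),
    numFolds=3)
model = cv.fit(assembler.transform(df))
```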
Jean-François Puget, Distinguished Engineer, Machine Learning and Optimizatio... (MLconf)
Why Machine Learning Algorithms Fall Short (And What You Can Do About It): Many think that machine learning is all about the algorithms. Want a self-learning system? Get your data, start coding, or hire a PhD who will build you a model that will stand the test of time. Of course we know that this is not enough. Models degrade over time, algorithms that work great on yesterday's data may not be the best option, and new data sources and types are made available. In short, your self-learning system may not be learning anything at all. In this session, we will examine how to overcome challenges in creating self-learning systems that perform better and are built to stand the test of time. We will show how to apply mathematical optimization algorithms that often prove superior to the local optimization methods favored by typical machine learning applications and discuss why these methods can create better results. We will also examine the role of smart automation in the context of machine learning and how smart automation can create self-learning systems that are built to last.
Kaz Sato, Evangelist, Google at MLconf ATL 2016 (MLconf)
Machine Intelligence at Google Scale: TensorFlow and Cloud Machine Learning: The biggest challenge of deep learning technology is scalability. As long as you are using a single GPU server, you have to wait hours or days to get the results of your work. This doesn't scale for a production service, so you eventually need distributed training in the cloud. Google has been building infrastructure for training large-scale neural networks in the cloud for years, and has now started to share that technology with external developers. In this session, we will introduce new pre-trained ML services such as the Cloud Vision API and Speech API that work without any training. We will also look at how TensorFlow and Cloud Machine Learning can accelerate custom model training by 10x to 40x with Google's distributed training infrastructure.
Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016 (MLconf)
Big Data Processing Above and Beyond Hadoop: Data-intensive computing represents a new computing paradigm to address Big Data processing requirements using high-performance architectures supporting scalable parallel processing to allow government, commercial organizations, and research environments to process massive amounts of data and implement new applications previously thought to be impractical or infeasible. The fundamental challenges of data-intensive computing are managing and processing exponentially growing data volumes, significantly reducing associated data analysis cycles to support practical, timely applications, and developing new algorithms which can scale to search and process massive amounts of data. The open source HPCC (High-Performance Computing Cluster) Systems platform offers a unified approach to Big Data processing requirements: (1) a scalable, integrated computer systems hardware and software architecture designed for parallel processing of data-intensive computing applications, and (2) a new programming paradigm in the form of a high-level, declarative, data-centric programming language designed specifically for big data processing. This presentation explores the challenges of data-intensive computing from a programming perspective, and describes the ECL programming language and the HPCC architecture designed for data-intensive computing applications. HPCC is an alternative to the Hadoop platform, and ECL is compared to Pig Latin, a high-level language developed for the Hadoop MapReduce architecture.
Graph Gurus Episode 25: Unleash the Business Value of Your Data Lake with Gra... (TigerGraph)
Full Webinar: https://info.tigergraph.com/graph-gurus-25
A new weapon is available for businesses wanting to accomplish more with Hadoop: native parallel graphs can reveal the connections across multiple domains and datasets in data lakes and provide powerful insights to deliver superior outcomes. In this webinar we will explain how native parallel graphs can analyze the information in data lakes to enable the following outcomes:
-Recommending next best actions such as promoting a student loan to someone heading off to college, advocating life insurance to a newly married couple, and so on
-Improving network utilization by analyzing petabytes of data collected from millions of IoT devices across a smart grid
-Accelerating M&A activity by intelligently merging data lakes from multiple businesses
Analytics Zoo: Building Analytics and AI Pipeline for Apache Spark and BigDL ... (Databricks)
A long time ago, there was Caffe and Theano, then came Torch and CNTK and TensorFlow, Keras and MXNet and PyTorch and Caffe2… a sea of deep learning tools, but none for Spark developers to dip into. Finally, there was BigDL, a deep learning library for Apache Spark. While BigDL is integrated into Spark and extends its capabilities to address the challenges of Big Data developers, will a library alone be enough to simplify and accelerate the deployment of ML/DL workloads on production clusters? From high-level pipeline API support to feature transformers to pre-defined models and reference use cases, a rich repository of easy-to-use tools is now available with the 'Analytics Zoo'. We'll unpack the production challenges and opportunities with ML/DL on Spark and what the Zoo can do.
Data Science at Scale on MPP databases - Use Cases & Open Source Tools (Esther Vasiete)
Pivotal workshop slide deck for Structure Data 2016 held in San Francisco.
Abstract:
Learn how data scientists at Pivotal build machine learning models at massive scale on open source MPP databases like Greenplum and HAWQ (under Apache incubation) using in-database machine learning libraries like MADlib (under Apache incubation) and procedural languages like PL/Python and PL/R to take full advantage of the rich set of libraries in the open source community. This workshop will walk you through use cases in text analytics and image processing on MPP.
Data Science Amsterdam - Massively Parallel Processing with Procedural Languages (Ian Huston)
The goal of in-database analytics is to bring the calculations to the data, reducing transport costs and I/O bottlenecks. With Procedural Languages such as PL/Python and PL/R data parallel queries can be run across terabytes of data using not only pure SQL but also familiar Python and R packages. The Pivotal Data Science team have used this technique to create fraud behaviour models for each individual user in a large corporate network, to understand interception rates at customs checkpoints by accelerating natural language processing of package descriptions and to reduce customer churn by building a sentiment model using customer call centre records.
http://www.meetup.com/Data-Science-Amsterdam/events/178974942/
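Here is a minimal sketch of the bring-compute-to-data pattern described above: a PL/Python function using numpy is registered once, then applied in-database so only results travel to the client. The connection string, table, and data are assumptions for illustration.

```python
# A sketch of registering and calling a PL/Python function so numpy runs
# inside the database. Connection string and data are assumed.
import psycopg2

DDL = """
CREATE OR REPLACE FUNCTION zscore(vals float8[])
RETURNS float8[] AS $$
    import numpy as np
    a = np.array(vals, dtype=float)
    sd = a.std()
    return ((a - a.mean()) / sd).tolist() if sd > 0 else [0.0] * len(a)
$$ LANGUAGE plpythonu;
"""

conn = psycopg2.connect("dbname=analytics")  # hypothetical database
with conn, conn.cursor() as cur:
    cur.execute(DDL)
    # The standardization happens in-database; only results reach the client.
    cur.execute("SELECT zscore(ARRAY[1, 2, 3, 4, 100]::float8[])")
    print(cur.fetchone()[0])
```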
Objective of the Project
Tweet sentiment analysis gives businesses insights into customers and competitors. In this project, we combined several text preprocessing techniques with machine learning algorithms. Neural Network, Random Forest, and Logistic Regression models were trained on the Sentiment140 Twitter dataset, and we then predicted the sentiment of a held-out test set of tweets. We used both Python and PySpark (with a local Spark context) to program different parts of the preprocessing and modelling.
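A minimal non-Spark sketch of one modeling path described above, pairing TF-IDF features with logistic regression. The column layout follows the public Sentiment140 CSV (polarity first, tweet text last); the file path and preprocessing choices are illustrative.

```python
# A sketch of TF-IDF + logistic regression on Sentiment140-style data.
# The path and settings are illustrative.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

cols = ["polarity", "id", "date", "query", "user", "text"]
df = pd.read_csv("sentiment140.csv", names=cols, encoding="latin-1")

X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["polarity"], test_size=0.2, random_state=0)

vec = TfidfVectorizer(lowercase=True, stop_words="english", max_features=50000)
clf = LogisticRegression(max_iter=1000)
clf.fit(vec.fit_transform(X_train), y_train)
print(accuracy_score(y_test, clf.predict(vec.transform(X_test))))
```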
Graph Gurus Episode 37: Modeling for Kaggle COVID-19 Dataset (TigerGraph)
Full Webinar: https://info.tigergraph.com/graph-gurus-37
In this Graph Gurus Episode, we:
-Learn how to process text and extract entities (words and phrases) as well as classes linking the entities using SciSpacy, a Natural Language Processing (NLP) tool.
-Import the output of NLP and semantically link it in TigerGraph
-Run advanced analytics queries with TigerGraph to analyze the relationships and deliver insights
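A minimal sketch of the NLP extraction step, assuming scispacy and its en_core_sci_sm model are installed; the sample sentence is made up. Each extracted entity would become a vertex in TigerGraph, with co-occurrence links as the semantic edges mentioned above.

```python
# A sketch of the SciSpacy extraction step. Assumes the en_core_sci_sm
# model is installed; the sample sentence is illustrative.
import spacy

nlp = spacy.load("en_core_sci_sm")
doc = nlp("Chloroquine inhibits SARS-CoV-2 replication in Vero E6 cells.")

# Each entity can become a vertex in TigerGraph; co-occurrence within a
# sentence or paper can become the semantic links mentioned above.
for ent in doc.ents:
    print(ent.text, ent.label_)
```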
Details regarding the workings of ChatGPT and basic use cases can be found in this presentation. The presentation also covers other OpenAI products and their usability, along with ways in which ChatGPT can be integrated into existing apps and websites.
Kamanja: Driving Business Value through Real-Time Decisioning Solutions (Greg Makowski)
This is a first presentation of Kamanja, a new open-source real-time software product, which integrates with other big-data systems. See also links: http://www.meetup.com/SF-Bay-ACM/events/223615901/ and http://Kamanja.org to download, for docs or community support. For the YouTube video, see https://www.youtube.com/watch?v=g9d87rvcSNk (you may want to start at minute 33).
Massively Parallel Processing with Procedural Python by Ronert Obst PyData Be... (PyData)
The Python data ecosystem has grown beyond the confines of single machines to embrace scalability. Here we describe one of our approaches to scaling, which is already being used in production systems. The goal of in-database analytics is to bring the calculations to the data, reducing transport costs and I/O bottlenecks. Using PL/Python we can run parallel queries across terabytes of data using not only pure SQL but also familiar PyData packages such as scikit-learn and nltk. This approach can also be used with PL/R to make use of a wide variety of R packages. We look at examples on Postgres compatible systems such as the Greenplum Database and on Hadoop through Pivotal HAWQ. We will also introduce MADlib, Pivotal’s open source library for scalable in-database machine learning, which uses Python to glue SQL queries to low level C++ functions and is also usable through the PyMADlib package.
In the last few years, deep learning has achieved significant success in a wide range of domains, including computer vision, artificial intelligence, speech, NLP, and reinforcement learning. However, deep learning in recommender systems has, until recently, received relatively little attention. This talk explores recent advances in this area in both research and practice. I will explain how deep learning can be applied to recommendation settings, architectures for handling contextual data, side information, and time-based models.
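A minimal Keras sketch of the embedding-based recommender idea: user and item embeddings combined by a dot product to predict a rating. All sizes and the random training data are placeholders, not the architectures from the talk.

```python
# A Keras sketch of a dot-product embedding recommender. Sizes and the
# random training data are placeholders.
import numpy as np
import tensorflow as tf

n_users, n_items, dim = 1000, 500, 32

user_in = tf.keras.Input(shape=(1,), dtype="int32")
item_in = tf.keras.Input(shape=(1,), dtype="int32")
u = tf.keras.layers.Flatten()(tf.keras.layers.Embedding(n_users, dim)(user_in))
v = tf.keras.layers.Flatten()(tf.keras.layers.Embedding(n_items, dim)(item_in))
score = tf.keras.layers.Dot(axes=1)([u, v])  # affinity between user and item

model = tf.keras.Model([user_in, item_in], score)
model.compile(optimizer="adam", loss="mse")

users = np.random.randint(0, n_users, size=(1024, 1))
items = np.random.randint(0, n_items, size=(1024, 1))
ratings = np.random.rand(1024, 1) * 5  # placeholder ratings
model.fit([users, items], ratings, epochs=2, verbose=0)
```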
Big Data and BI Tools - BI Reporting for Bay Area Startups User Group (Scott Mitchell)
This presentation was given at the July 8th, 2014 user group meeting for BI Reporting for Bay Area Startups.
Content creation: Infocepts/DWApplications
Presented by: Scott Mitchell - DWApplications
Implementing a highly scalable stock prediction system with R, Geode, SpringX... (William Markito Oliveira)
Finance market prediction has always been one of the hottest topics in Data Science and Machine Learning. However, the prediction algorithm is just a small piece of the puzzle. Building a data stream pipeline that constantly combines the latest price info with high-volume historical data is extremely challenging on traditional platforms, requiring a lot of code and thinking about how to scale or move into the cloud. This session walks through the architecture and implementation details of an application built on top of open-source tools that demonstrates how to easily build a stock prediction solution with no source code, except a few lines of R and the web interface that consumes data through a RESTful endpoint in real time. The solution leverages in-memory data grid technology for high-speed ingestion, combining streaming of real-time data and distributed processing for stock indicator algorithms.
ChatGPT is a state-of-the-art language model developed by OpenAI, based on the GPT-3 architecture. It's designed to generate human-like text responses, making it incredibly versatile and useful for various applications.
Using Graph Algorithms for Advanced Analytics - Part 2 Centrality (TigerGraph)
What does finding the best location for a warehouse/office/retail store have in common with finding the most influential person in a referral network? Answer: they are both Centrality problems and can be solved with graph algorithms.
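A minimal networkx sketch of both framings on a toy graph: closeness centrality answers the "best location" question, and PageRank the "most influential person" question. The karate club graph stands in for a real road or referral network.

```python
# A networkx sketch of two centrality framings on a toy graph.
import networkx as nx

G = nx.karate_club_graph()  # stand-in for a road or referral network

best_location = max(nx.closeness_centrality(G).items(), key=lambda kv: kv[1])
most_influential = max(nx.pagerank(G).items(), key=lambda kv: kv[1])

print("most central node (closeness):", best_location)
print("most influential node (PageRank):", most_influential)
```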
Graph Gurus Episode 29: Using Graph Algorithms for Advanced Analytics Part 3 (TigerGraph)
Full Webinar: https://info.tigergraph.com/graph-gurus-29
In this webinar you will:
-Hear about use cases for community detection graph algorithms
-Learn how to select the right algorithm for your use case
-Be able to run and tailor GSQL graph algorithms
Deploying Enterprise Scale Deep Learning in Actuarial Modeling at Nationwide (Databricks)
The traditional approach to insurance pricing involves fitting a generalized linear model (GLM) to data collected on historical claims payments and premiums received. The explosive growth in data availability and increasing competitiveness in the marketplace are challenging actuaries to find new insights in their data and make predictions with more granularity, improved speed and efficiency, and with tighter integration among business units to support strategic decisions.
In this session we will share our experience implementing deep hierarchical neural networks using TensorFlow and PySpark on Databricks. We will discuss the benefits of the ML Runtime, our experience using the goofys mount, our process for hyperparameter tuning, specific considerations for the large dataset size and extreme volatility present in insurance data, among other topics.
Authors: Bryn Clark, Krish Rajaram
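A minimal Keras sketch of a feed-forward network for tabular insurance features, in the spirit of the GLM replacement described above. The layer sizes, Poisson loss, and random placeholder data are illustrative assumptions, not Nationwide's model.

```python
# A Keras sketch of a feed-forward network for tabular insurance features.
# Layer sizes, loss, and placeholder data are illustrative.
import numpy as np
import tensorflow as tf

n_features = 20
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(n_features,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="exponential"),  # keeps predictions positive
])
# Poisson/Tweedie-style losses are common for claims; Poisson is shown here.
model.compile(optimizer="adam", loss="poisson")

X = np.random.rand(256, n_features).astype("float32")  # placeholder features
y = np.random.rand(256).astype("float32") * 1000.0     # placeholder claim amounts
model.fit(X, y, epochs=2, batch_size=32, verbose=0)
```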
Lambda architecture for real time big data (Trieu Nguyen)
Lambda Architecture in Real-Time Big Data Projects
Concepts & Techniques: "Thinking with Lambda"
Case studies from real projects
Why is Lambda Architecture the correct solution for big data?
Similar to Analyzing Power of Tweets in Predicting Commodity Futures (20)
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Techniques to optimize the PageRank algorithm usually fall into two categories. One is to reduce the work per iteration, and the other is to reduce the number of iterations; these goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, i.e. those with the same in-links, helps reduce duplicate computations and thus could also reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance, since the final ranks of chain nodes can be easily calculated; this could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order, which could help reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in the PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
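A minimal sketch of the first optimization named above: skipping rank updates for vertices that have already converged, on top of plain power-iteration PageRank. The graph and tolerance are illustrative, the graph is assumed to have no dangling nodes, and the skip is a heuristic since a frozen vertex's in-neighbors may still drift slightly.

```python
# A sketch of power-iteration PageRank that skips updates for vertices
# whose rank has already converged. Assumes no dangling nodes.
def pagerank_skip_converged(graph, d=0.85, tol=1e-8, max_iter=100):
    n = len(graph)
    rank = {v: 1.0 / n for v in graph}
    out_deg = {v: len(nbrs) for v, nbrs in graph.items()}
    converged = set()
    for _ in range(max_iter):
        # Contributions still flow from converged vertices at their frozen rank.
        contrib = {v: 0.0 for v in graph}
        for v, nbrs in graph.items():
            for w in nbrs:
                contrib[w] += rank[v] / out_deg[v]
        changed = False
        for v in graph:
            if v in converged:
                continue  # skip work for already-converged vertices
            new = (1 - d) / n + d * contrib[v]
            if abs(new - rank[v]) < tol:
                converged.add(v)
            else:
                changed = True
            rank[v] = new
        if not changed:
            break
    return rank

g = {0: [1], 1: [2], 2: [0], 3: [0, 2]}  # toy graph
print(pagerank_skip_converged(g))
```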
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ... (Subhajit Sahu)
Abstract: Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components and processes them in topological order, one level at a time. This enables ranks to be calculated in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition: the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by a large submission of small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.