Big Data Pipeline for Topic and Sentiment Analysis, with Applications
Srivatsan Ramanujam (@being_bayesian)
Senior Data Scientist, Pivotal
A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal Greenplum Database


Unstructured data is everywhere: in posts, status updates, bloglets and news feeds on social media, and in customer interactions recorded in call center CRM systems. Many organizations study and monitor social media to track brand value and target specific customer segments. In our experience, however, blending unstructured data with structured data to supplement data science models has been far more effective than working with it independently.

In this talk we will showcase an end-to-end topic and sentiment analysis pipeline we've built on the Pivotal Greenplum Database platform for Twitter feeds from GNIP, using open source tools like MADlib and PL/Python. We've used this pipeline to build regression models that predict commodity futures from tweets, and to enhance churn models for telecom through topic and sentiment analysis of call center transcripts. All of this was possible because of the flexibility and extensibility of the platform we worked with.

Published in: Technology, Education
  • Why do we care about Apps as well as Data? By ‘apps’ we mean “enterprise and cloud applications” and how they are built. Pivotal has said a lot about data in public, but we care about apps just as much. Leveraging the strengths of vFabric and Spring, Pivotal will continue to enable customers to build the applications they need. Applications are how our customers offer many new products and services today. Apps can accelerate customer interactions and realize value from data by presenting it to users in a meaningful way. With tools like Spring and Cloud Foundry, we can make ‘big data’ comprehensible and ‘easy’ for developers and hence for enterprises. And of course: users generate data, sensors generate data, phones generate data, but much of this data comes from some sort of application.

    1. Big Data Pipeline for Topic and Sentiment Analysis, with Applications
       Srivatsan Ramanujam (@being_bayesian), Senior Data Scientist, Pivotal
       11 Jan 2014
       © Copyright 2013 Pivotal. All rights reserved.
    2. Agenda
       • Introduction
       • The Problem
       • The Platform
       • The Pipeline
       • Live Demo: Topic and Sentiment Analysis Engine
       • Applications in real-world customer engagements
    3. Pivotal: A New Platform for a New Era
       Data-Driven Application Development; Pivotal Data Science Labs
       • App Fabric: “The new Middleware”
       • Data Fabric: “The new Database”
       • Cloud Fabric: “The new OS”
       • ...ETC: “The new Hardware”
    4. The Problem
    5. The Problem
       Make sense of large volumes of unstructured text and integrate this with the structured sources of data to make better predictions.
       Approaches
       – Topic Analysis
       – Sentiment Analysis
    6. The Platform
    7. Pivotal Greenplum MPP DB
       • Think of it as multiple PostgreSQL servers: a Master plus Segments/Workers
       • Rows are distributed across segments by a particular field (or randomly)
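The idea of distributing rows across segments by a field can be illustrated with a toy sketch. This is not Greenplum's actual hashing scheme: the segment count, the CRC32 hash and the helper names `segment_for` and `distribute` are all assumptions made for illustration.

```python
import zlib
from collections import defaultdict

NUM_SEGMENTS = 4  # hypothetical cluster size


def segment_for(key: str, num_segments: int = NUM_SEGMENTS) -> int:
    """Map a distribution-key value to a segment index (toy hash, not Greenplum's)."""
    return zlib.crc32(key.encode("utf-8")) % num_segments


def distribute(rows, key_field):
    """Group rows by the segment that their distribution key hashes to."""
    segments = defaultdict(list)
    for row in rows:
        segments[segment_for(str(row[key_field]))].append(row)
    return dict(segments)


rows = [
    {"customer_id": 1, "state": "CA"},
    {"customer_id": 2, "state": "NY"},
    {"customer_id": 3, "state": "CA"},
    {"customer_id": 4, "state": "TX"},
]
placement = distribute(rows, "state")
```

Because the segment is a pure function of the key, every row with the same `state` lands on the same segment, which is what lets per-group work run without moving data.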
    8. Pivotal Hadoop
       • The pipeline in this talk can be run on Pivotal Hadoop + HAWQ
    9. Data Parallelism Vs. Task Parallelism
       • Data Parallelism: Little or no effort is required to break up the problem into a number of parallel tasks, and there is no dependency (or communication) between those parallel tasks.
         – Ex: Build one churn model for each state in the US simultaneously, when customer data is distributed by state code.
       • Task Parallelism: Split the problem into independent sub-tasks which can be executed in parallel.
         – Ex: Build one churn model in parallel for the entire US, though customer data is distributed by state code.
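The two examples on this slide can be contrasted in a small sketch. It assumes a deliberately trivial stand-in "model" (the observed churn rate) and made-up data; `fit_churn_rate` and `partial_counts` are hypothetical names, and a thread pool stands in for the database segments.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy data: churn labels (1 = churned), already grouped by state code.
data_by_state = {
    "CA": [1, 0, 0, 1],
    "NY": [0, 0, 1],
    "TX": [1, 1, 0, 0, 0],
}


def fit_churn_rate(labels):
    """Stand-in 'model': the observed churn rate of one group."""
    return sum(labels) / len(labels)


def partial_counts(labels):
    """Sub-task for the single national model: (churners, customers) for one group."""
    return sum(labels), len(labels)


with ThreadPoolExecutor() as pool:
    # Data parallelism: one independent model per state, no communication needed.
    per_state_models = dict(
        zip(data_by_state, pool.map(fit_churn_rate, data_by_state.values()))
    )

    # Task parallelism: one model for the whole country, built from sub-results
    # that are computed in parallel and then combined.
    partials = list(pool.map(partial_counts, data_by_state.values()))

churners = sum(c for c, _ in partials)
customers = sum(n for _, n in partials)
national_model = churners / customers
```

The per-state models never need to see each other's data, while the national model only works because its sub-results can be combined afterwards.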
    10. User-Defined Functions (UDFs)
        PostgreSQL/Greenplum provide lots of flexibility in defining your own functions. Simple UDFs are SQL queries with calling arguments and return types.
        Definition:
            CREATE FUNCTION times2(INT) RETURNS INT AS $$
                SELECT 2 * $1
            $$ LANGUAGE sql;
        Execution:
            SELECT times2(1);
             times2
            --------
                  2
            (1 row)
    11. PL/X : X in {pgsql, R, Python, Java, Perl, C etc.}
        • Allows users to write Greenplum/PostgreSQL functions in the R, Python, Java, Perl, pgsql or C languages
        • The interpreter/VM of the language ‘X’ is installed on each node of the Greenplum Database Cluster
        • Data Parallelism: PL/X piggybacks on Greenplum’s MPP architecture
        [Diagram: SQL arrives at the Master Host (with a Standby Master); an Interconnect links it to the Segment Hosts, each of which runs several Segments]
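As a rough illustration of what a PL/Python function body looks like, the sketch below writes the logic as an ordinary Python function so it runs anywhere. The function name `count_hashtags`, the sample rows and the choice of task are hypothetical; the `CREATE FUNCTION ... LANGUAGE plpythonu` wrapper is shown only in a comment.

```python
import re

# In PL/Python the body of count_hashtags would sit between the $$ markers of a
# definition such as
#   CREATE FUNCTION count_hashtags(tweet TEXT) RETURNS INT AS $$ ... $$ LANGUAGE plpythonu;
# (count_hashtags is a hypothetical name used only for this sketch).

HASHTAG = re.compile(r"#\w+")


def count_hashtags(tweet):
    """Return the number of hashtags in one tweet; NULL (None) input counts as zero."""
    return len(HASHTAG.findall(tweet or ""))


# Each segment would apply the function to its own rows; this loop imitates one segment.
segment_rows = ["Loving the new #phone! #happy", "no tags here", None]
counts = [count_hashtags(t) for t in segment_rows]
```

Since the function touches one row at a time and keeps no shared state, every segment can run it on its local rows independently, which is the sense in which PL/X piggybacks on the MPP architecture.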
    12. Going Beyond Data Parallelism
        • Data-parallel computation via PL/Python libraries only allows us to run ‘n’ models in parallel.
        • This works great when we are building one model for each value of the group-by column, but we need parallelized algorithms to build a single model on all the available data.
        • For this, we use MADlib, an open source library of parallel in-database machine learning algorithms.
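One common way to build a single model over data spread across segments is to compute small per-segment summaries and merge them. The sketch below does this for a one-variable least-squares line. It shows the general shape of a parallel aggregate and is not MADlib's code; the names `transition`, `merge` and `final` and the sample data are assumptions.

```python
from functools import reduce


def transition(state, row):
    """Fold one (x, y) row into a segment-local state of running sums."""
    n, sx, sy, sxx, sxy = state
    x, y = row
    return (n + 1, sx + x, sy + y, sxx + x * x, sxy + x * y)


def merge(a, b):
    """Combine the states of two segments; the order of merging does not matter."""
    return tuple(u + v for u, v in zip(a, b))


def final(state):
    """Turn the merged sums into (slope, intercept) of the least-squares line."""
    n, sx, sy, sxx, sxy = state
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


EMPTY = (0, 0.0, 0.0, 0.0, 0.0)

# Hypothetical (x, y) rows spread over three segments; they lie on y = 2x + 1.
segments = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0)],
    [(3.0, 7.0), (4.0, 9.0)],
]

# On a real cluster each segment would compute its state at the same time.
segment_states = [reduce(transition, rows, EMPTY) for rows in segments]
slope, intercept = final(reduce(merge, segment_states))
```

Only five numbers per segment cross the network, yet the result is the same line that a single pass over all rows would produce.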
    13. Scalable, in-database ML
        • Open source: https://github.com/madlib/madlib
        • Works on Greenplum DB and PostgreSQL
        • Active development by Pivotal
          – Latest release: 1.4 (Dec 2013)
        • Downloads and docs: http://madlib.net/
    14. MADlib In-Database Functions
        Predictive Modeling Library
        – Generalized Linear Models: Linear Regression; Logistic Regression; Multinomial Logistic Regression; Cox Proportional Hazards Regression; Elastic Net Regularization; Sandwich Estimators (Huber-White, clustered, marginal effects)
        – Matrix Factorization: Singular Value Decomposition (SVD); Low-Rank
        – Machine Learning Algorithms: Principal Component Analysis (PCA); Association Rules (Affinity Analysis, Market Basket); Topic Modeling (Parallel LDA); Decision Trees; Ensemble Learners (Random Forests); Support Vector Machines; Conditional Random Field (CRF); Clustering (K-means); Cross Validation
        – Linear Systems: Sparse and Dense Solvers
        Descriptive Statistics
        – Sketch-based Estimators: CountMin (Cormode-Muthukrishnan); FM (Flajolet-Martin); MFV (Most Frequent Values)
        – Correlation
        – Summary
        Support Modules
        – Array Operations
        – Sparse Vectors
        – Random Sampling
        – Probability Functions
    15. Architecture
        Layers, from top to bottom:
        – User Interface
        – “Driver” Functions (outer loops of iterative algorithms, optimizer invocations)
        – High-level Abstraction Layer (iteration controller, ...)
        – RDBMS Built-in Functions
        – Functions for Inner Loops (for streaming algorithms)
        – Low-level Abstraction Layer (matrix operations, C++ to RDBMS type bridge, …)
        – RDBMS Query Processing (Greenplum, PostgreSQL, …)
        Implementation labels on the diagram: SQL, generated from specification; Python with templated SQL; Python; C++
    16. MADlib on Hadoop
        • A subset of algorithms from MADlib on Pivotal Greenplum DB work out of the box on HAWQ.
        • Other functions are being ported.
        • With the general availability of and support for User Defined Functions in HAWQ, MADlib will attain full parity with GPDB.
    17. The Pipeline
    18. The Pipeline
        [Pipeline diagram]
        – Tweet Stream
        – Stored on HDFS
        – (gpfdist) Loaded as external tables into GPDB
        – Parallel Parsing of JSON and extraction of fields using PL/Python
        – Topic Analysis through MADlib pLDA
        – Sentiment Analysis through custom PL/Python functions
        – D3.js
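The JSON parsing step can be sketched as a per-row function of the kind that would be wrapped in PL/Python. The field names below (`id`, `created_at`, `text`, `user.screen_name`) are assumptions about the tweet payload, not taken from the slides, and `extract_fields` is a hypothetical helper.

```python
import json


def extract_fields(raw_line):
    """Parse one JSON-encoded tweet and return the fields of interest, or None if malformed."""
    try:
        tweet = json.loads(raw_line)
    except (TypeError, ValueError):
        return None
    if not isinstance(tweet, dict):
        return None
    user = tweet.get("user")
    if not isinstance(user, dict):
        user = {}
    return {
        "id": tweet.get("id"),
        "created_at": tweet.get("created_at"),
        "text": tweet.get("text", ""),
        "screen_name": user.get("screen_name"),
    }


# Hypothetical rows of an external table holding one JSON document per row.
raw_lines = [
    '{"id": 1, "created_at": "2014-01-11", "text": "corn prices up", "user": {"screen_name": "a"}}',
    "not json at all",
]
parsed = [extract_fields(line) for line in raw_lines]
```

Returning `None` for malformed rows, instead of raising, keeps one bad tweet from aborting the whole parallel scan.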
    19. Topic Analysis – MADlib pLDA
        [Workflow diagram]
        – Social Media
        – Filter relevant content
        – Align Data
        – Natural Language Processing (GPText): Tokenizer; Stemming, frequency filtering
        – Prepare dataset for Topic Modeling
        – MADlib Topic Model
        – Outputs: Topic Graph, Topic composition, Topic Clouds
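The preparation steps (tokenizing, stemming, frequency filtering, building a dataset for topic modeling) can be sketched in plain Python. The stemmer here is a deliberately crude stand-in, the document-frequency threshold is arbitrary, and the `(doc_id, word_id, count)` layout is an assumption about what a topic model would consume, not MADlib's documented input format.

```python
import re
from collections import Counter

TOKEN = re.compile(r"[a-z0-9#@']+")


def tokenize(text):
    """Lowercase a tweet and split it into word-like tokens."""
    return TOKEN.findall(text.lower())


def crude_stem(token):
    """Toy stemmer that strips a trailing 's'; a stand-in for a real stemming algorithm."""
    return token[:-1] if len(token) > 3 and token.endswith("s") else token


def build_corpus(docs, min_doc_freq=2):
    """Return (vocab, rows), where rows are (doc_id, word_id, count) triples."""
    stemmed = [[crude_stem(t) for t in tokenize(d)] for d in docs]
    # Frequency filtering: keep terms that occur in at least min_doc_freq documents.
    doc_freq = Counter(term for doc in stemmed for term in set(doc))
    kept = sorted(t for t, df in doc_freq.items() if df >= min_doc_freq)
    vocab = {term: word_id for word_id, term in enumerate(kept)}
    rows = []
    for doc_id, doc in enumerate(stemmed):
        for term, count in sorted(Counter(doc).items()):
            if term in vocab:
                rows.append((doc_id, vocab[term], count))
    return vocab, rows


docs = [
    "Corn prices rising again",
    "Prices of corn and wheat",
    "Wheat harvest delayed by rain",
]
vocab, rows = build_corpus(docs)
```

With these three toy documents only "corn", "price" and "wheat" survive the filter, so rare words never reach the topic model.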
    20. Sentiment Analysis
        • We don’t have labeled data for our problem (Tweets aren’t tagged with Sentiment).
        • Semi-Supervised Sentiment Prediction can be achieved by dictionary look-ups of tokens in a Tweet, but without Context, Sentiment Prediction is futile!
        • Example words: “Unpredictable”, “Breakthrough”
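A minimal sketch of token-level dictionary look-up shows why context matters. The lexicon entries, the example sentences and the helper `lookup_score` are made up for illustration.

```python
# Hypothetical polarity lexicon; real dictionaries are far larger.
LEXICON = {"great": 1.0, "love": 1.0, "terrible": -1.0, "unpredictable": -1.0}


def lookup_score(tokens, lexicon=LEXICON):
    """Average the polarity of the tokens found in the lexicon; 0.0 if none are found."""
    hits = [lexicon[t] for t in tokens if t in lexicon]
    return sum(hits) / len(hits) if hits else 0.0


movie = "the plot was unpredictable and i love it".split()
phone = "the battery life is unpredictable".split()

movie_score = lookup_score(movie)  # averages the entries for "unpredictable" and "love"
phone_score = lookup_score(phone)  # only the entry for "unpredictable" matches
```

The look-up treats "unpredictable" as negative in both sentences, although in the first it reads as praise. A token dictionary has no way to tell the two uses apart.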
    21. Sentiment Analysis – PL/X Functions
        – Break up Tweets into tokens and tag their parts-of-speech (Part-of-speech tagger [1])
        – Semi-Supervised Sentiment Classification: Phrase Extraction, then Phrasal Polarity Scoring
        – Use learned phrasal polarities to score sentiment of new tweets
        – Output: Sentiment Scored Tweets
        [1] Parts-of-speech Tagger: Gp-Ark-Tweet-NLP (http://vatsan.github.io/gp-ark-tweet-nlp/)
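The phrase extraction and scoring steps might look roughly like the sketch below. It assumes the tweets are already tagged, uses a simplified one-letter tag scheme and two made-up tag patterns, and takes the phrase polarities as given instead of learning them. None of these details come from the slides.

```python
# Simplified tags (assumption): A = adjective, N = noun, R = adverb, D = determiner, O = other.
PATTERNS = {("A", "N"), ("R", "A")}


def extract_phrases(tagged):
    """Return two-word phrases whose tag pair matches one of PATTERNS."""
    return [
        f"{w1} {w2}"
        for (w1, t1), (w2, t2) in zip(tagged, tagged[1:])
        if (t1, t2) in PATTERNS
    ]


def score_tweet(tagged, phrase_polarity):
    """Sum the polarities of the known phrases found in a tagged tweet."""
    return sum(phrase_polarity.get(p, 0.0) for p in extract_phrases(tagged))


# Hypothetical polarities, standing in for the output of the learning step.
phrase_polarity = {"unpredictable plot": 0.8, "unpredictable battery": -0.9}

tweet = [("what", "O"), ("an", "D"), ("unpredictable", "A"), ("plot", "N")]
phrases = extract_phrases(tweet)
score = score_tweet(tweet, phrase_polarity)
```

Scoring phrases instead of single tokens lets the same adjective carry a different polarity depending on the noun it modifies.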
    22. Live Demo
    23. Real World Applications
    24. Churn Models for Telecom Industry
        Goal
        – Identify customers who are likely to churn and prevent them from leaving.
        Challenges
        – Cost of acquiring new customers is high
        – The cost of customer acquisition cannot be recouped if the customer is not retained long enough
        – Low barrier for subscribers to switch
        – With mobile number portability, the barrier to switching is even lower
        Good News
        – Cost of retaining existing customers is lower!
    25. Structured Features for Churn Models
        The problem is extensively studied, with a rich set of approaches in the literature.
        Feature sources: Device, Texting Stats, Call Stats, Rate Plans, Customer Demographics
        These features are great, but the models soon hit a plateau with structured features!
    26. Blending the Unstructured with the Structured
        • What other sources of previously untapped data could we use?
        • Are our customers happy? Where? What segments?
        • What are the common topics in their conversations online?
    27. Sentiment Analysis and Topic Models
        [Diagram]
        – Unstructured Data (External, Internal) feeds the Sentiment Analysis Engine (Classifier) and the Topic Engine (LDA)
        – Topic Dashboard
        – Structured Data: EDW
        – Result: MORE ACCURATE LIKELIHOOD TO CHURN
    28. Predicting Commodity Futures through Twitter
        Customer
        – A major agri-business cooperative
        Business Problem
        – Predict price of commodity futures through Twitter
        Solution
        – Built Sentiment Analysis and Text Regression algorithms to predict commodity futures from Tweets
        – Established the foundation for blending the structured data (market fundamentals) with unstructured data (tweets)
        Challenges
        – Language on Twitter does not adhere to rules of grammar and has poor structure
        – No domain-specific labeled corpus of tweet sentiment; the problem is semi-supervised
    29. The Approach
        • Tweets alone had significant predictive power for the commodity of interest to us. When blended with structured features like weather data, we expect to see much better results.
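Blending tweet-derived features with structured features amounts to a join on a shared key. The sketch below joins per-day sentiment onto daily weather rows. The dates, the `rainfall_mm` column and the helper `blend` are hypothetical, and the regression step itself is left out.

```python
from collections import defaultdict
from statistics import mean


def blend(scored_tweets, structured_by_day):
    """Join per-day tweet sentiment onto structured rows, one feature row per day."""
    scores_by_day = defaultdict(list)
    for day, score in scored_tweets:
        scores_by_day[day].append(score)

    rows = []
    for day in sorted(structured_by_day):
        scores = scores_by_day.get(day, [])
        row = {"day": day}
        row.update(structured_by_day[day])
        row["tweet_count"] = len(scores)
        # Days without tweets get a neutral score so the row is still usable.
        row["mean_sentiment"] = mean(scores) if scores else 0.0
        rows.append(row)
    return rows


# Hypothetical inputs: (day, sentiment score) pairs and daily weather readings.
scored_tweets = [("2014-01-06", 0.5), ("2014-01-06", -0.25), ("2014-01-07", 1.0)]
structured_by_day = {
    "2014-01-06": {"rainfall_mm": 3.0},
    "2014-01-07": {"rainfall_mm": 0.0},
    "2014-01-08": {"rainfall_mm": 1.5},
}
feature_rows = blend(scored_tweets, structured_by_day)
```

Each resulting row carries both kinds of features, so a single regression model can be trained on the blended table.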
    30. What’s in it for me?
    31. Pivotal Open Source Contributions
        http://gopivotal.com/pivotal-products/open-source-software
        • PyMADlib: Python wrapper for MADlib (https://github.com/gopivotal/pymadlib)
        • PivotalR: R wrapper for MADlib (https://github.com/madlib-internal/PivotalR)
        • Part-of-speech tagger for Twitter via SQL (http://vatsan.github.io/gp-ark-tweet-nlp/)
        Questions? @being_bayesian
    32. BUILT FOR THE SPEED OF BUSINESS
