Big Data Pipeline for Topic and
Sentiment Analysis, with
Applications
Srivatsan Ramanujam (@being_bayesian)
Senior Data Scientist, Pivotal

11 Jan 2014

© Copyright 2013 Pivotal. All rights reserved.

Agenda
Introduction
The Problem
The Platform
The Pipeline
Live Demo: Topic and Sentiment Analysis Engine
Applications in real world customer engagements

Pivotal: A New Platform for a New Era
Data-Driven Application Development

Pivotal Data Science Labs

• App Fabric: “The new Middleware”
• Data Fabric: “The new Database”
• Cloud Fabric: “The new OS”
• ...ETC: “The new Hardware”

The Problem
Make sense of large volumes of unstructured text and integrate this with the
structured sources of data to make better predictions
Approaches
– Topic Analysis
– Sentiment Analysis

The Platform

Pivotal Greenplum MPP DB
Think of it as multiple PostgreSQL servers: one Master and many Segments (workers).
Rows are distributed across segments by a particular field (or randomly).

Pivotal Hadoop

• The pipeline in this talk can be run on Pivotal Hadoop + HAWQ

Data Parallelism Vs. Task Parallelism
Data Parallelism: Little or no effort is required to break up the problem
into a number of parallel tasks, and there exists no dependency (or
communication) between those parallel tasks.
– Ex: Build one Churn model for each state in the US simultaneously, when
customer data is distributed by state code.

Task Parallelism: Split the problem into independent sub-tasks which
can be executed in parallel.
– Ex: Build one Churn model in parallel for the entire US, though customer
data is distributed by state code.
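
The data-parallel case can be sketched in a few lines of plain Python. This is only an illustration of the idea, not code from the talk: the toy rows, the function names and the "model" (just the observed churn rate per state) are assumptions, and a thread pool stands in for the database segments.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy data: (state_code, churned) pairs standing in for customer rows.
CUSTOMERS = [
    ("CA", 1), ("CA", 0), ("CA", 0), ("CA", 1),
    ("NY", 0), ("NY", 0), ("NY", 1),
    ("TX", 1), ("TX", 1),
]


def partition_by_state(rows):
    """Group rows by state code, mimicking rows distributed across segments."""
    partitions = defaultdict(list)
    for state, churned in rows:
        partitions[state].append(churned)
    return dict(partitions)


def fit_churn_model(labels):
    """Stand-in 'model' (an assumption): the observed churn rate of one partition."""
    return sum(labels) / len(labels)


def fit_one_model_per_state(rows, max_workers=4):
    """Fit every per-state model independently; no worker talks to another."""
    partitions = partition_by_state(rows)
    states = sorted(partitions)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        rates = list(pool.map(fit_churn_model, (partitions[s] for s in states)))
    return dict(zip(states, rates))


models = fit_one_model_per_state(CUSTOMERS)
print(models)
```

Because each partition is processed without reference to any other, adding workers needs no coordination logic. A single model over all states would not decompose this way, which is the task-parallel case.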

User-Defined Functions (UDFs)
PostgreSQL/Greenplum provide lots of flexibility in defining your own functions.
Simple UDFs are SQL queries with calling arguments and return types.

Definition:

CREATE FUNCTION times2(INT)
RETURNS INT
AS $$
SELECT 2 * $1
$$ LANGUAGE sql;

Execution:

SELECT times2(1);
 times2
--------
      2
(1 row)

PL/X : X in {pgsql, R, Python, Java, Perl, C etc.}
• Allows users to write Greenplum/PostgreSQL functions in the R, Python, Java, Perl, pgsql or C languages
• The interpreter/VM of the language ‘X’ is installed on each node of the Greenplum Database Cluster
• Data Parallelism: PL/X piggybacks on Greenplum’s MPP architecture

[Diagram: SQL arrives at the Master Host (with a Standby Master), which is connected through the Interconnect to several Segment Hosts, each running multiple Segments]

Going Beyond Data Parallelism
Data-parallel computation via PL/Python libraries only allows
us to run ‘n’ models in parallel.
This works great when we are building one model for each
value of the group-by column, but we need parallelized
algorithms to build a single model on all the
available data.

For this, we use MADlib – an open source library of parallel
in-database machine learning algorithms.
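
One common way to build a single model over distributed rows is to have every segment reduce its local rows to a small partial state, merge the partial states, and solve once at the end. The sketch below shows that pattern for a one-variable least-squares fit. It is an assumption-laden illustration and not MADlib's source: the data, the three "segments" and the function names `transition`, `merge` and `final` are all made up for the example.

```python
from functools import reduce

# Hypothetical rows (x, y) as they might sit on three segments.
SEGMENTS = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0)],
    [(3.0, 7.0), (4.0, 9.0)],
]


def transition(rows):
    """Per-segment pass: reduce local rows to (n, sum_x, sum_y, sum_xy, sum_xx)."""
    n = len(rows)
    sx = sum(x for x, _ in rows)
    sy = sum(y for _, y in rows)
    sxy = sum(x * y for x, y in rows)
    sxx = sum(x * x for x, _ in rows)
    return (n, sx, sy, sxy, sxx)


def merge(a, b):
    """Combine two partial states; plain addition, so merge order does not matter."""
    return tuple(u + v for u, v in zip(a, b))


def final(state):
    """Solve the one-variable least-squares fit from the merged state."""
    n, sx, sy, sxy, sxx = state
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


# In a real cluster each transition() call would run on its own segment.
partials = [transition(rows) for rows in SEGMENTS]
slope, intercept = final(reduce(merge, partials))
print(slope, intercept)
```

Only the five-number state travels between workers, never the rows themselves, so the result is one model fitted on all the data regardless of how the rows were distributed.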

Scalable, in-database ML

• Open Source! https://github.com/madlib/madlib
• Works on Greenplum DB and PostgreSQL
• Active development by Pivotal
  - Latest Release: 1.4 (Dec 2013)
• Downloads and Docs: http://madlib.net/

MADlib In-Database Functions

Predictive Modeling Library

Generalized Linear Models
• Linear Regression
• Logistic Regression
• Multinomial Logistic Regression
• Cox Proportional Hazards Regression
• Elastic Net Regularization
• Sandwich Estimators (Huber-White, clustered, marginal effects)

Matrix Factorization
• Singular Value Decomposition (SVD)
• Low-Rank

Machine Learning Algorithms
• Principal Component Analysis (PCA)
• Association Rules (Affinity Analysis, Market Basket)
• Topic Modeling (Parallel LDA)
• Decision Trees
• Ensemble Learners (Random Forests)
• Support Vector Machines
• Conditional Random Field (CRF)
• Clustering (K-means)
• Cross Validation

Linear Systems
• Sparse and Dense Solvers

Descriptive Statistics

Sketch-based Estimators
• CountMin (Cormode-Muthukrishnan)
• FM (Flajolet-Martin)
• MFV (Most Frequent Values)

Correlation
Summary

Support Modules
• Array Operations
• Sparse Vectors
• Random Sampling
• Probability Functions

Architecture
Layers, from top to bottom:
• User Interface
• “Driver” Functions (outer loops of iterative algorithms, optimizer invocations)
• High-level Abstraction Layer (iteration controller, ...)
• RDBMS Built-in Functions
• Functions for Inner Loops (for streaming algorithms)
• Low-level Abstraction Layer (matrix operations, C++ to RDBMS type bridge, …)
• RDBMS Query Processing (Greenplum, PostgreSQL, …)

Implementation languages, from top to bottom: SQL, generated from specification; Python with templated SQL; Python; C++

MADlib on Hadoop

• A subset of the MADlib algorithms available on Pivotal Greenplum DB works out of
the box on HAWQ.
• Other functions are being ported.
• With the general availability of, and support for, User-Defined Functions in
HAWQ, MADlib will attain full parity with GPDB.

The Pipeline

• Tweet stream
• Stored on HDFS
• Loaded as external tables into GPDB (gpfdist)
• Parallel parsing of JSON and extraction of fields using PL/Python
• Topic Analysis through MADlib pLDA
• Sentiment Analysis through custom PL/Python functions
• D3.js
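
The JSON-parsing step can be sketched as an ordinary Python function of the kind that could form the body of a PL/Python UDF, so that every segment parses its own rows. This is a hedged illustration rather than the talk's code: the function name, the choice of fields and the sample tweet are assumptions, although `id_str`, `created_at`, `user.screen_name` and `text` are standard fields of a tweet object.

```python
import json


def extract_tweet_fields(raw_json):
    """Return (id, created_at, screen_name, text) for one raw tweet, or None if unusable."""
    try:
        tweet = json.loads(raw_json)
    except (TypeError, ValueError):
        # Malformed or missing JSON: skip the row instead of failing the whole query.
        return None
    if not isinstance(tweet, dict) or "text" not in tweet:
        # Stream messages without text (for example delete notices) carry no content.
        return None
    user = tweet.get("user")
    if not isinstance(user, dict):
        user = {}
    return (
        tweet.get("id_str"),
        tweet.get("created_at"),
        user.get("screen_name"),
        tweet["text"],
    )


# Hypothetical sample record, shaped like a streaming-API tweet.
SAMPLE = (
    '{"id_str": "1", "created_at": "Sat Jan 11 10:00:00 +0000 2014", '
    '"user": {"screen_name": "being_bayesian"}, "text": "MADlib rocks"}'
)

fields = extract_tweet_fields(SAMPLE)
print(fields)
print(extract_tweet_fields("not json"))
```

Returning `None` for bad rows keeps one malformed tweet from aborting a query that is scanning millions of them.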

Topic Analysis – MADlib pLDA
Natural Language Processing - GPText
• Filter relevant content
• Align data
• Social media tokenizer
• Stemming, frequency filtering
• Prepare dataset for topic modeling

MADlib Topic Model
• Topic graph
• Topic composition
• Topic clouds
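
The preparation steps before topic modeling can be sketched as follows: tokenize, drop stop words, filter rare terms, and emit (document id, word id, count) rows, which is the usual input shape for an LDA trainer. Everything here is an assumption made for illustration (the stop-word list, the regular expression, the threshold and the sample tweets); it does not use GPText, and stemming is left out to keep the sketch short.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "is", "on", "my", "and", "to", "of"}  # illustrative only
TOKEN_RE = re.compile(r"[#@]?\w+")  # keeps hashtags and mentions as single tokens


def tokenize(text):
    """Lower-case the text, split it into tokens and drop stop words."""
    return [t for t in TOKEN_RE.findall(text.lower()) if t not in STOP_WORDS]


def build_term_counts(docs, min_doc_freq=2):
    """Return (vocabulary, rows) where rows are (doc_id, word_id, count) triples."""
    tokenized = [tokenize(d) for d in docs]
    # Document frequency: in how many documents does each token appear?
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    # Frequency filtering: keep only tokens seen in at least min_doc_freq documents.
    vocab = sorted(tok for tok, df in doc_freq.items() if df >= min_doc_freq)
    word_id = {tok: i for i, tok in enumerate(vocab)}
    rows = []
    for doc_id, toks in enumerate(tokenized):
        counts = Counter(t for t in toks if t in word_id)
        for tok in sorted(counts):
            rows.append((doc_id, word_id[tok], counts[tok]))
    return vocab, rows


# Hypothetical tweets.
TWEETS = [
    "The network is down and my phone is useless",
    "Network outage again, phone support is slow",
    "Loving the new phone plan",
]

vocab, rows = build_term_counts(TWEETS)
print(vocab)
print(rows)
```

With real data the threshold would be tuned, and an upper frequency cut-off is often added as well so that near-universal words do not dominate every topic.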

Sentiment Analysis
We don’t have labeled data for our problem (Tweets
aren’t tagged with sentiment).

Semi-supervised sentiment prediction can be
achieved by dictionary look-ups of the tokens in a Tweet,
but without context, sentiment prediction is futile!

Example words: “Unpredictable”, “Breakthrough”
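
A minimal sketch of dictionary look-up scoring shows why context matters. The tiny lexicon and the two sentences are invented for illustration and are not taken from the talk.

```python
# Hypothetical polarity lexicon: +1 for positive words, -1 for negative words.
POLARITY = {"breakthrough": 1, "great": 1, "unpredictable": -1, "terrible": -1}


def lexicon_score(text):
    """Sum the dictionary polarity of every token, ignoring all context."""
    tokens = text.lower().split()
    return sum(POLARITY.get(tok.strip(".,!?"), 0) for tok in tokens)


praise = lexicon_score("The plot of this movie is unpredictable!")
complaint = lexicon_score("The steering of this car is unpredictable!")
print(praise, complaint)
```

Both sentences receive the same score even though the first is praise and the second is a complaint, because a bare look-up cannot tell what the word is describing.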

Sentiment Analysis – PL/X Functions
• Part-of-speech tagger [1]: break up Tweets into tokens and tag their parts of speech
• Semi-Supervised Sentiment Classification
  - Phrase extraction
  - Phrasal polarity scoring
• Use learned phrasal polarities to score sentiment of new tweets
• Output: sentiment-scored Tweets

[1] Parts-of-speech Tagger : Gp-Ark-Tweet-NLP (http://vatsan.github.io/gp-ark-tweet-nlp/)
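
The slide names the stages but not the scoring method, so the following is only one plausible reading, sketched under stated assumptions. Input tweets are assumed to be already tagged, with `A`, `N`, `R`, `V` and `D` used for adjective, noun, adverb, verb and determiner. The tag patterns, the seed words and the averaging rule are all hypothetical choices, not the author's algorithm.

```python
from collections import defaultdict

# Hypothetical choices: which tag pairs count as a phrase, and a few seed words.
PHRASE_PATTERNS = {("A", "N"), ("R", "A"), ("R", "V")}
SEEDS = {"love": 1, "great": 1, "hate": -1, "awful": -1}


def extract_phrases(tagged):
    """Return adjacent two-token phrases whose tag pair matches a pattern."""
    return [
        f"{w1} {w2}".lower()
        for (w1, t1), (w2, t2) in zip(tagged, tagged[1:])
        if (t1, t2) in PHRASE_PATTERNS
    ]


def seed_score(tagged):
    """Weak label for a tweet: the summed polarity of the seed words it contains."""
    return sum(SEEDS.get(word.lower(), 0) for word, _ in tagged)


def learn_phrase_polarity(corpus):
    """Average the weak label of every unlabeled tweet in which a phrase occurs."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for tagged in corpus:
        label = seed_score(tagged)
        for phrase in set(extract_phrases(tagged)):
            totals[phrase] += label
            counts[phrase] += 1
    return {phrase: totals[phrase] / counts[phrase] for phrase in counts}


def score_tweet(tagged, polarity):
    """Score a new tweet from the learned polarity of its phrases."""
    return sum(polarity.get(p, 0.0) for p in extract_phrases(tagged))


# Hypothetical tagged, unlabeled tweets.
CORPUS = [
    [("love", "V"), ("the", "D"), ("fast", "A"), ("network", "N")],
    [("hate", "V"), ("these", "D"), ("dropped", "A"), ("calls", "N")],
    [("great", "A"), ("fast", "A"), ("network", "N")],
]

polarity = learn_phrase_polarity(CORPUS)
new_tweet = [("dropped", "A"), ("calls", "N"), ("all", "D"), ("day", "N")]
new_score = score_tweet(new_tweet, polarity)
print(polarity)
print(new_score)
```

The new tweet contains no seed word at all, yet it is scored as negative because the phrase "dropped calls" was learned from tweets that did, which is the sense in which the approach is semi-supervised.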

Live Demo

Real World Applications

Churn Models for Telecom Industry
Goal
– Identify customers who are likely to churn and prevent them from leaving.

Challenges
– Cost of acquiring new customers is high
– Recouping the cost of customer acquisition is difficult if the customer is not retained long enough
– Lower barrier to switching for subscribers
– With mobile number portability, the barrier to switching is even lower

Good News
– Cost of retaining existing customers is lower!

Structured Features for Churn Models
The problem is extensively studied with a rich set of approaches in the literature

• Device
• Texting Stats
• Call Stats
• Rate Plans
• Customer Demographics

These features are great, but the models soon hit a plateau with structured
features!

Blending the Unstructured with the Structured
What other sources of previously untapped data could we use?

Are our customers happy? Where? In which segments?
What are the common topics in their conversations online?

Sentiment Analysis and Topic Models
Inputs
• Unstructured data (external and internal), processed by the Sentiment Analysis Engine (classifier) and the Topic Engine (LDA)
• Structured data: EDW

Outputs
• More accurate likelihood to churn
• Topic Dashboard
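
Blending can be as simple as appending the text-derived signals to each customer's structured feature vector before the churn model is trained. The sketch below assumes a sentiment score and a topic distribution have already been attributed to the customer; the column names and values are hypothetical, and the model itself is out of scope.

```python
# Hypothetical structured columns taken from the data warehouse.
STRUCTURED_COLUMNS = ["monthly_minutes", "texts_sent", "tenure_months"]


def blend_features(structured_row, sentiment_score, topic_proportions):
    """Concatenate structured features, a sentiment score and topic proportions."""
    features = [float(structured_row[col]) for col in STRUCTURED_COLUMNS]
    features.append(float(sentiment_score))
    features.extend(float(p) for p in topic_proportions)
    return features


# Hypothetical customer: EDW row plus outputs of the sentiment and topic engines.
row = {"monthly_minutes": 420, "texts_sent": 95, "tenure_months": 18}
vector = blend_features(row, sentiment_score=-0.6, topic_proportions=[0.7, 0.2, 0.1])
print(vector)
```

Any classifier that accepts a numeric vector can then be trained on the blended features in place of the structured features alone.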

Predicting Commodity Futures through Twitter
Customer
– A major agri-business cooperative

Business Problem
– Predict the price of commodity futures through Twitter

Solution
– Built Sentiment Analysis and Text Regression algorithms to predict commodity
futures from Tweets
– Established the foundation for blending the structured data (market fundamentals)
with unstructured data (tweets)

Challenges
– Language on Twitter does not adhere to rules of grammar and has poor structure
– No domain-specific labeled corpus of tweet sentiment; the problem is semi-supervised

The Approach

• Tweets alone had significant predictive power for the commodity of interest to
us. When blended with structured features like weather data, we expect to see
much better results.

What’s in it for me?

Pivotal Open Source Contributions
http://gopivotal.com/pivotal-products/open-source-software

• PyMADlib – Python wrapper for MADlib
  - https://github.com/gopivotal/pymadlib
• PivotalR – R wrapper for MADlib
  - https://github.com/madlib-internal/PivotalR
• Part-of-speech tagger for Twitter via SQL
  - http://vatsan.github.io/gp-ark-tweet-nlp/

Questions?
@being_bayesian

BUILT FOR THE SPEED OF BUSINESS

Data Driven Action : A Primer on Data Science
Srivatsan Ramanujam
 
Climate Data Lake: Empowering Citizen Scientists in Acadia National Park
Climate Data Lake: Empowering Citizen Scientists in Acadia National ParkClimate Data Lake: Empowering Citizen Scientists in Acadia National Park
Climate Data Lake: Empowering Citizen Scientists in Acadia National Park
Srivatsan Ramanujam
 
Analyzing Power of Tweets in Predicting Commodity Futures
Analyzing Power of Tweets in Predicting Commodity FuturesAnalyzing Power of Tweets in Predicting Commodity Futures
Analyzing Power of Tweets in Predicting Commodity Futures
Srivatsan Ramanujam
 
PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...
PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...
PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...
Srivatsan Ramanujam
 

More from Srivatsan Ramanujam (7)

Machine Learning Driven Sales and Marketing for Everyone
Machine Learning Driven Sales and Marketing for EveryoneMachine Learning Driven Sales and Marketing for Everyone
Machine Learning Driven Sales and Marketing for Everyone
 
Data Science for predictive maintenance in connected vehicles
Data Science for predictive maintenance in connected vehiclesData Science for predictive maintenance in connected vehicles
Data Science for predictive maintenance in connected vehicles
 
All thingspython@pivotal
All thingspython@pivotalAll thingspython@pivotal
All thingspython@pivotal
 
Data Driven Action : A Primer on Data Science
Data Driven Action : A Primer on Data ScienceData Driven Action : A Primer on Data Science
Data Driven Action : A Primer on Data Science
 
Climate Data Lake: Empowering Citizen Scientists in Acadia National Park
Climate Data Lake: Empowering Citizen Scientists in Acadia National ParkClimate Data Lake: Empowering Citizen Scientists in Acadia National Park
Climate Data Lake: Empowering Citizen Scientists in Acadia National Park
 
Analyzing Power of Tweets in Predicting Commodity Futures
Analyzing Power of Tweets in Predicting Commodity FuturesAnalyzing Power of Tweets in Predicting Commodity Futures
Analyzing Power of Tweets in Predicting Commodity Futures
 
PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...
PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...
PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...
 

Recently uploaded

GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Thierry Lestable
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
Product School
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
DianaGray10
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Product School
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
Alison B. Lowndes
 
Generating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using SmithyGenerating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using Smithy
g2nightmarescribd
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
Elena Simperl
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Jeffrey Haguewood
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 

Recently uploaded (20)

GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
Generating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using SmithyGenerating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using Smithy
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 

A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal Greenplum Database

  • 1. Big Data Pipeline for Topic and Sentiment Analysis, with Applications. Srivatsan Ramanujam (@being_bayesian), Senior Data Scientist, Pivotal. 11 Jan 2014. © Copyright 2013 Pivotal. All rights reserved.
  • 2. Agenda: Introduction; The Problem; The Platform; The Pipeline; Live Demo: Topic and Sentiment Analysis Engine; Applications in real-world customer engagements.
  • 3. Pivotal: A New Platform for a New Era. Data-Driven Application Development; Pivotal Data Science Labs. App Fabric (“The new Middleware”); Data Fabric (“The new Database”); Cloud Fabric (“The new OS”); ...ETC (“The new Hardware”).
  • 4. The Problem
  • 5. The Problem: Make sense of large volumes of unstructured text and integrate this with the structured sources of data to make better predictions. Approaches: Topic Analysis; Sentiment Analysis.
  • 6. The Platform
  • 7. Pivotal Greenplum MPP DB: Think of it as multiple PostgreSQL servers, with a master and segments (workers). Rows are distributed across segments by a particular field (or randomly).
  • 8. Pivotal Hadoop: The pipeline in this talk can be run on Pivotal Hadoop + HAWQ.
  • 9. Data Parallelism vs. Task Parallelism. Data Parallelism: Little or no effort is required to break up the problem into a number of parallel tasks, and there exists no dependency (or communication) between those parallel tasks. Ex: Build one churn model for each state in the US simultaneously, when customer data is distributed by state code. Task Parallelism: Split the problem into independent sub-tasks which can be executed in parallel. Ex: Build one churn model in parallel for the entire US, though customer data is distributed by state code.
  • 10. User-Defined Functions (UDFs): PostgreSQL/Greenplum provide lots of flexibility in defining your own functions. Simple UDFs are SQL queries with calling arguments and return types. Definition: CREATE FUNCTION times2(INT) RETURNS INT AS $$ SELECT 2 * $1 $$ LANGUAGE sql; Execution: SELECT times2(1); Result: one row, with times2 = 2.
  • 11. PL/X: X in {pgsql, R, Python, Java, Perl, C, etc.}. Allows users to write Greenplum/PostgreSQL functions in R, Python, Java, Perl, pgsql or C. The interpreter/VM of the language ‘X’ is installed on each node of the Greenplum Database cluster. Data parallelism: PL/X piggybacks on Greenplum’s MPP architecture. [Diagram labels: SQL; Master Host; Standby Master; Interconnect; Segment Hosts, each running several Segments.]
  • 12. Going Beyond Data Parallelism: Data-parallel computation via PL/Python libraries only allows us to run ‘n’ models in parallel. This works great when we are building one model for each value of the group-by column, but we need parallelized algorithms to be able to build a single model on all the available data. For this, we use MADlib, an open source library of parallel in-database machine learning algorithms.
  • 13. Scalable, in-database ML. Open source: https://github.com/madlib/madlib ; works on Greenplum DB and PostgreSQL; active development by Pivotal; latest release: 1.4 (Dec 2014); downloads and docs: http://madlib.net/
  • 14. MADlib In-Database Functions. Predictive Modeling Library. Generalized Linear Models: Linear Regression, Logistic Regression, Multinomial Logistic Regression, Cox Proportional Hazards Regression, Elastic Net Regularization, Sandwich Estimators (Huber-White, clustered, marginal effects). Matrix Factorization: Singular Value Decomposition (SVD), Low-Rank. Machine Learning Algorithms: Principal Component Analysis (PCA), Association Rules (Affinity Analysis, Market Basket), Topic Modeling (Parallel LDA), Decision Trees, Ensemble Learners (Random Forests), Support Vector Machines, Conditional Random Field (CRF), Clustering (K-means), Cross Validation. Linear Systems: Sparse and Dense Solvers. Descriptive Statistics. Sketch-based Estimators: CountMin (Cormode-Muthukrishnan), FM (Flajolet-Martin), MFV (Most Frequent Values); Correlation; Summary. Support Modules: Array Operations, Sparse Vectors, Random Sampling, Probability Functions.
  • 15. Architecture. Layers: User Interface; “Driver” Functions (outer loops of iterative algorithms, optimizer invocations); High-level Abstraction Layer (iteration controller, ...); RDBMS Built-in Functions; Functions for Inner Loops (for streaming algorithms); Low-level Abstraction Layer (matrix operations, C++ to RDBMS type bridge, …); RDBMS Query Processing (Greenplum, PostgreSQL, …). Implementation labels on the diagram: SQL, generated from specification; Python with templated SQL; Python; C++.
  • 16. MADlib on Hadoop: A subset of algorithms from MADlib on Pivotal Greenplum DB works out of the box on HAWQ. Other functions are being ported. With the general availability of, and support for, User Defined Functions in HAWQ, MADlib will attain full parity with GPDB.
  • 17. The Pipeline
  • 18. The Pipeline. Stages shown: Tweet stream; stored on HDFS; loaded as external tables into GPDB (gpfdist); parallel parsing of JSON and extraction of fields using PL/Python; topic analysis through MADlib pLDA; sentiment analysis through custom PL/Python functions; D3.js.
  • 19. Topic Analysis – MADlib pLDA. Natural Language Processing (GPText): filter relevant content; align data; social media tokenizer; stemming, frequency filtering; prepare dataset for topic modeling. MADlib Topic Model: topic graph; topic composition; topic clouds.
  • 20. Sentiment Analysis: We don’t have labeled data for our problem (Tweets aren’t tagged with sentiment). Semi-supervised sentiment prediction can be achieved by dictionary look-ups of tokens in a Tweet, but without context, sentiment prediction is futile! Example words shown on the slide: “Unpredictable”, “Breakthrough”.
  • 21. Sentiment Analysis – PL/X Functions. Part-of-speech tagger [1]: break up Tweets into tokens and tag their parts of speech. Semi-supervised sentiment classification: phrase extraction; phrasal polarity scoring. Use learned phrasal polarities to score sentiment of new tweets, giving sentiment-scored Tweets. [1] Parts-of-speech tagger: Gp-Ark-Tweet-NLP (http://vatsan.github.io/gp-ark-tweet-nlp/)
  • 22. Live Demo
  • 23. Real World Applications
  • 24. Churn Models for Telecom Industry. Goal: Identify customers who are likely to churn and prevent them from churning. Challenges: cost of acquiring new customers is high; recouping the cost of customer acquisition is hard if the customer is not retained long enough; lower barrier to switching subscribers; with mobile number portability, the barrier to switching is even lower. Good news: cost of retaining existing customers is lower!
  • 25. Structured Features for Churn Models: The problem is extensively studied, with a rich set of approaches in the literature. Features: Device; Texting Stats; Call Stats; Rate Plans; Customer Demographics. These features are great, but the models soon hit a plateau with structured features!
  • 26. Blending the Unstructured with the Structured: What other sources of previously untapped data could we use? Are our customers happy? Where? What segments? What are the common topics in their conversations online?
  • 27. Sentiment Analysis and Topic Models. Diagram elements: Unstructured Data (External, Internal); Sentiment Analysis Engine (Classifier); Topic Engine (LDA); Structured Data: EDW; Topic Dashboard; More accurate likelihood to churn.
  • 28. Predicting Commodity Futures through Twitter. Customer: A major agri-business cooperative. Business Problem: Predict price of commodity futures through Twitter. Solution: Built Sentiment Analysis and Text Regression algorithms to predict commodity futures from Tweets; established the foundation for blending the structured data (market fundamentals) with unstructured data (tweets). Challenges: Language on Twitter does not adhere to rules of grammar and has poor structure; there is no domain-specific labeled corpus of tweet sentiment, so the problem is semi-supervised.
  • 29. The Approach: Tweets alone had significant predictive power for the commodity of interest to us. When blended with structured features like weather data, we expect to see much better results.
  • 30. What’s in it for me?
  • 31. Pivotal Open Source Contributions (http://gopivotal.com/pivotal-products/open-source-software). PyMADlib, a Python wrapper for MADlib: https://github.com/gopivotal/pymadlib ; PivotalR, an R wrapper for MADlib: https://github.com/madlib-internal/PivotalR ; Part-of-speech tagger for Twitter via SQL: http://vatsan.github.io/gp-ark-tweet-nlp/ ; Questions? @being_bayesian
  • 32. BUILT FOR THE SPEED OF BUSINESS
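
To illustrate the data-parallel case from slides 9 and 11, the following Python sketch fits one stand-in "model" per state, with no communication between groups. It is not code from the talk: the record layout, the function names fit_group and fit_per_state, and the use of a plain churn rate in place of a real churn model are all assumptions made for this example, and a thread pool merely stands in for Greenplum segments.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def fit_group(rows):
    """Stand-in 'model' for one group: the fraction of rows that churned.

    Each row is a (tenure_months, did_churn) tuple (hypothetical layout).
    """
    churned = sum(1 for _, did_churn in rows if did_churn)
    return churned / len(rows)


def fit_per_state(records):
    """Fit one independent model per state code.

    records is an iterable of (state, tenure_months, did_churn) tuples.
    Groups share nothing, so each can be fitted on its own worker.
    """
    groups = defaultdict(list)
    for state, tenure, did_churn in records:
        groups[state].append((tenure, did_churn))

    # One task per group; the pool plays the role of the segments.
    with ThreadPoolExecutor() as pool:
        rates = pool.map(fit_group, groups.values())
        return dict(zip(groups.keys(), rates))


records = [("CA", 12, True), ("CA", 30, False), ("NY", 5, True)]
print(fit_per_state(records))  # {'CA': 0.5, 'NY': 1.0}
```

In the database setting, the grouping would come from a GROUP BY on the distribution column and the body of fit_group would live in a PL/Python function. Building a single model over all states at once is the case that slide 12 hands over to MADlib.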

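The dictionary look-up idea from slides 20 and 21 can be sketched in a few lines of Python. This is a hedged illustration rather than the pipeline from the talk: the lexicon POLARITY, the NEGATORS set and the function names are hypothetical, and flipping polarity after a negator is only a crude stand-in for the part-of-speech tagging and phrasal polarity scoring that the slides describe.

```python
import re

# Hypothetical toy lexicon; the talk does not list its dictionary.
POLARITY = {
    "breakthrough": 1.0,
    "great": 1.0,
    "unpredictable": -1.0,
    "poor": -1.0,
}
NEGATORS = {"not", "no", "never"}


def tokenize(tweet):
    """Lower-case the tweet and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", tweet.lower())


def score(tweet):
    """Sum lexicon polarities, flipping the next polar word after a negator."""
    total = 0.0
    negate = False
    for token in tokenize(tweet):
        if token in NEGATORS:
            negate = True
            continue
        polarity = POLARITY.get(token, 0.0)
        if polarity:
            total += -polarity if negate else polarity
            negate = False
    return total


print(score("not great"))                      # -1.0
print(score("the steering is unpredictable"))  # -1.0
print(score("an unpredictable plot twist"))    # -1.0
```

The last two calls return the same score even though "unpredictable" is a complaint about a car and arguably praise for a film, which is the slide's point that look-ups without context are not enough.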
Editor's Notes

  1. Why do we care about apps as well as data? By ‘apps’ we mean “enterprise and cloud applications” and how they are built. Pivotal has said a lot about data in public, but we care about apps just as much. Leveraging the strengths of vFabric and Spring, Pivotal will continue to enable customers to build the applications they need. Applications are how our customers offer many new products and services today. Apps can accelerate customer interactions and realize value from data by presenting it to users in a meaningful way. With tools like Spring and Cloud Foundry, we can make ‘big data’ comprehensible and ‘easy’ for developers and hence for enterprises. And of course: users generate data, sensors generate data, phones generate data, but much of this data comes from some sort of application.