Srivatsan Ramanujam
Senior Data Scientist
Greenplum

© Copyright 2011 EMC Corporation. All rights reserved.

Agenda
• Greenplum UAP overview
– Products: GPDB, GPHD, Chorus, Analytics Labs, Data Computing Appliance
– GPDB Architecture

• MADlib
– Overview
– Algorithms
– Working Mechanism
– Performance Comparison with Mahout

• PyMADlib
– Overview
– Demo in IPython Notebook

• Future Directions
– GPHD and HAWQ

Greenplum Overview

Products

Greenplum Database - Architecture
MPP (Massively Parallel Processing)
Shared-Nothing Architecture
• Master Servers: query planning & dispatch (accept SQL and MapReduce)
• Network Interconnect
• Segment Servers: query processing & data storage
• External Sources: loading, streaming, etc.

MADlib

MADlib: The Origin

UrbanDictionary.com:
mad (adj.): an adjective used to enhance a noun.
1- dude, you got skills.
2- dude, you got mad skills.

• First mention of MAD analytics was at VLDB’09
– MAD Skills: New Analysis Practices for Big Data
– Jeffrey Cohen, Brian Dolan, Mark Dunlap, Joseph Hellerstein, Caleb Welton
– http://db.cs.berkeley.edu/papers/vldb09-madskills.pdf

• MADlib project initiated in late 2010
– Maintained by Greenplum/EMC with significant contributions
from UW Madison, UFlorida and UC Berkeley.

Current Modules
Data Modeling

Supervised Learning
• Naive Bayes Classification
• Linear Regression
• Logistic Regression
• Multinomial Logistic Regression
• Decision Tree
• Random Forest
• Support Vector Machines
• Cox-Proportional Hazards Regression
• Conditional Random Field

Unsupervised Learning
• Association Rules
• k-Means Clustering
• Low-rank Matrix Factorization
• SVD Matrix Factorization
• Parallel Latent Dirichlet Allocation

Descriptive Statistics
• Sketch-based Estimators
– CountMin (Cormode-Muthukrishnan)
– FM (Flajolet-Martin)
– MFV (Most Frequent Values)
• Profile
• Quantile

Support
• Array Operations
• Conjugate Gradient
• Sparse Vectors
• Probability Functions
• Random Sampling

Inferential Statistics
• Hypothesis tests

MADlib – User Doc
• Check out the user guide with examples at: http://doc.madlib.net

How does it work ? : A Linear Regression Example
• Finding linear dependencies between variables
– y ≈ c0 + c1 · x1 + c2 · x2 ?
# select y, x1, x2 from unm limit 6;

   y   |  x1  | x2
-------+------+-----
 10.14 |    0 | 0.3
 11.93 | 0.69 | 0.6
 13.57 |  1.1 | 0.9
 14.17 | 1.39 | 1.2
 15.25 | 1.61 | 1.5
 16.15 | 1.79 | 1.8

• Column y: vector of dependent variables y
• Columns x1, x2: design matrix X

Reminder: Linear-Regression Model
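• Model: $y = Xc + \varepsilon$, where $\varepsilon$ is the vector of residuals
• If the residuals are i.i.d. Gaussian with standard deviation σ:
– max likelihood ⇔ min sum of squared residuals

• First-order conditions for the following quadratic objective (in c)

$f(c) = \lVert y - Xc \rVert^2 = \sum_i (y_i - x_i^T c)^2$

yield the minimizer

$\hat{c} = (X^T X)^{-1} X^T y$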
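
The step from the quadratic objective to the minimizer can be written out as a short worked derivation. The notation is an assumption where the slide's formulas did not survive extraction: $X$ is the design matrix (with a constant column for $c_0$), $y$ is the vector of dependent variables, and $c$ is the coefficient vector.

```latex
\begin{align*}
f(c) &= \lVert y - Xc \rVert^2 = (y - Xc)^T (y - Xc) \\
\nabla_c f(c) &= -2\, X^T (y - Xc) = 0 \\
X^T X\, c &= X^T y \\
\hat{c} &= (X^T X)^{-1} X^T y \quad \text{(when } X^T X \text{ is invertible)}
\end{align*}
```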

Linear Regression: Streaming Algorithm
• How to compute $\hat{c} = (X^T X)^{-1} X^T y$ with a single table scan?

[Figure: the two products that make up the formula, $X^T X$ (from $X^T$ and $X$) and $X^T y$ (from $X^T$ and $y$)]

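
A minimal sketch of the single-scan idea, in plain Python rather than in-database code. It assumes each row is `(y, x1, x2)` and that a constant 1 is prepended for the intercept `c0`. The names `accumulate` and `solve` are illustrative and are not MADlib functions. Each row adds its own contribution to the running sums for X^T X and X^T y, so the table is read only once, and the small system is solved at the end.

```python
def accumulate(rows):
    """One pass over (y, x1, ..., xk) rows; returns the sums (XtX, Xty)."""
    xtx, xty = None, None
    for y, *xs in rows:
        x = [1.0, *xs]                      # constant column for the intercept c0
        k = len(x)
        if xtx is None:                     # size the accumulators on the first row
            xtx = [[0.0] * k for _ in range(k)]
            xty = [0.0] * k
        for i in range(k):
            xty[i] += x[i] * y              # row's contribution to X^T y
            for j in range(k):
                xtx[i][j] += x[i] * x[j]    # row's contribution to X^T X
    return xtx, xty


def solve(a, b):
    """Solve a c = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):            # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


# The six example rows shown earlier in the deck: (y, x1, x2).
rows = [
    (10.14, 0.0, 0.3), (11.93, 0.69, 0.6), (13.57, 1.1, 0.9),
    (14.17, 1.39, 1.2), (15.25, 1.61, 1.5), (16.15, 1.79, 1.8),
]
xtx, xty = accumulate(rows)
coefficients = solve(xtx, xty)              # [c0, c1, c2]
print(coefficients)
```

The accumulators have a fixed size (k × k and k), independent of the number of rows, which is what makes a streaming scan possible.
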
Linear Regression: Parallel Computation
• Segment 1 computes $X_1^T y_1$
• Segment 2 computes $X_2^T y_2$
• Master combines the partial results: $X^T y = X_1^T y_1 + X_2^T y_2$

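
The segment/master split can be sketched the same way, again in plain Python and under stated assumptions: the rows are partitioned into non-empty "segments", each segment computes its partial sums independently, and a merge step standing in for the master adds them up. `partial_sums` and `merge` are hypothetical names, not MADlib or Greenplum APIs.

```python
from functools import reduce


def partial_sums(rows):
    """Per-segment work: sums for X^T X and X^T y over this segment's rows."""
    xtx, xty = None, None
    for y, *xs in rows:
        x = [1.0, *xs]                      # constant column for the intercept
        k = len(x)
        if xtx is None:
            xtx = [[0.0] * k for _ in range(k)]
            xty = [0.0] * k
        for i in range(k):
            xty[i] += x[i] * y
            for j in range(k):
                xtx[i][j] += x[i] * x[j]
    return xtx, xty


def merge(left, right):
    """Master-side work: element-wise addition of two partial results."""
    (a_xtx, a_xty), (b_xtx, b_xty) = left, right
    xtx = [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a_xtx, b_xtx)]
    xty = [p + q for p, q in zip(a_xty, b_xty)]
    return xtx, xty


rows = [
    (10.14, 0.0, 0.3), (11.93, 0.69, 0.6), (13.57, 1.1, 0.9),
    (14.17, 1.39, 1.2), (15.25, 1.61, 1.5), (16.15, 1.79, 1.8),
]
segments = [rows[:3], rows[3:]]             # two illustrative segments
merged_xtx, merged_xty = reduce(merge, (partial_sums(s) for s in segments))
```

Because addition is associative, the merged sums match a single scan over all rows up to floating-point rounding, however the rows are distributed.
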
Performance Comparison : Test Setup on AWB
• AWB (Analytics Workbench)
– 1000-node cluster located in Las Vegas
– Over 24,000 processors, 48 TB of Memory, and 24 PB of raw disk
storage
– 8000+ Map Task Capacity, 5000+ Reduce Task Capacity
– GPHD 1.1, GPDB 4.2.3

• Mahout v0.7
• MADlib v0.5
– With small LMF change to allow 4-byte integer values

• Test matrix
– Data size (# rows/records, # columns/features)
– Algorithms
– Algorithm parameters (e.g. convergence threshold, # iterations)
– GPDB segment / MR (Map-Reduce) task configurations

Performance & Scalability Results (summary)

• Whitepaper coming out shortly!

Logistic Regression
• Mahout only has sequential (i.e. single node) IGD implementation

[Figure: MADlib & Mahout Logistic Regression Scalability Across Number of Attributes. Series: Census data, 48 attributes [Mahout]; Census data, 48 attributes [MADlib]. x-axis: log(Number of Rows); y-axis: Time in Minutes.]

Logistic Regression
[Figure: MADlib Scalability Across Number of GPDB Segments. x-axis: Number of GPDB Segments; y-axis: Time in Minutes.]

K-Means Clustering
[Figure: MADlib & Mahout K-means Scalability Across Number of Rows. Series: Census data, 48 attributes [Mahout]; Census data, 48 attributes [MADlib]. x-axis: log(Number of Rows); y-axis: Time in Minutes.]

K-Means Clustering
[Figure: MADlib K-means Scalability Across Number of GPDB Segments. x-axis: Number of GPDB Segments; y-axis: Time in Minutes.]

PyMADlib : Python + MADlib = Awesome!

Motivation
• SQL is great for many things, but it’s not nearly enough

• Undeniably the most straightforward way to query data

• But not necessarily designed for data science

MADlib is a godsend!
• Empowers data scientists to run canned machine learning
routines – focus less on coding, more on science
• In-database, explicitly parallel.

• So why do we need anything else?
– UI is still all in SQL
– Need to tap into rich visualization libraries

Then which interface is favored by and familiar
to data scientists?

• Depends on who you ask
• Left survey is for “higher level languages,” and right survey is for “lower level languages”

Wait, don’t we already have this (PL/R,
PL/Python, SAS HPA)?
• PL/X’s are wonderful, but:
– It still requires non-trivial knowledge of SQL to use effectively
– Mostly limited to explicitly parallel jobs
– Primarily a SQL interface to the end user

• Need an interface that is:
– Less SQL, more R/Python/SAS
– Implicitly parallelized
– More scalable

• SAS HPA = $$$$$

The challenge
• MADlib
– Open source
– Extremely powerful/scalable
– Growing algorithm breadth
– SQL

• Python/R
– Open source
– Memory limited
– High algorithm breadth
– Language/interface purpose-designed for data science

• SAS
– High user loyalty
– Non-HPA is memory limited, HPA requires investment
– High algorithm breadth
– Language/interface purpose-designed for data science

• Want to leverage both the performance benefits of MADlib and the
usability of languages like Python, SAS, and R

Simple solution: Translate Python code into SQL

[Diagram: Python → SQL over ODBC/JDBC. SQL to execute MADlib goes to the database; model output comes back.]

• All data stays in the DB, and all model estimation and heavy lifting is done in the DB by
MADlib

• Only strings of SQL and model output are transferred across ODBC/JDBC
• Best of both worlds: the number-crunching power of MADlib along with the rich
visualizations of Matplotlib, NetworkX and all your other favorite Python
libraries. Let MADlib do all the heavy lifting on your Greenplum/PostgreSQL
database, while you program in your favorite language: Python.
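
The translation idea can be illustrated with a small, hypothetical helper that turns Python arguments into a SQL string. This is a sketch and not PyMADlib's actual API: the function name `linregr_sql` is invented here, and the `madlib.linregr(y, ARRAY[...])` call is an assumption about the MADlib linear-regression aggregate of that era, so check the MADlib user guide for the installed version. Only the generated string would cross the ODBC/JDBC connection; the data stays in the database.

```python
import re

# Plain SQL identifiers only; anything else is rejected rather than quoted.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def _checked(name):
    """Return name unchanged if it is a plain identifier, else raise."""
    if not _IDENTIFIER.fullmatch(name):
        raise ValueError("not a plain SQL identifier: %r" % (name,))
    return name


def linregr_sql(table, dependent, independents):
    """Build the SQL text for a linear regression (hypothetical helper).

    The leading 1 in the array is the constant term for the intercept.
    The madlib.linregr call shape is an assumption, not confirmed here.
    """
    columns = ", ".join(_checked(c) for c in independents)
    return "SELECT (madlib.linregr({y}, ARRAY[1, {x}])).* FROM {t};".format(
        y=_checked(dependent), x=columns, t=_checked(table)
    )


query = linregr_sql("unm", "y", ["x1", "x2"])
print(query)
```

Validating identifiers before formatting them into the string is a deliberate choice: table and column names cannot be passed as bound parameters, so the sketch rejects anything that is not a simple name.
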

Demo

PyMADlib Tutorial –
IPython Notebook Viewer Link

http://nbviewer.ipython.org/5275846

Where do I get it ?

$ pip install pymadlib

I don’t have GPDB or MADlib – What do I do ?
• Greenplum Database Community Edition is freely
available for single node installations on multiple
platforms
– Written permission may be requested from EMC/Greenplum
for research use for multi-node installations

• MADlib is free and open-source
– Downloadable for multiple platforms from
https://github.com/madlib/madlib

• PyMADlib is also free and open-source
– Downloadable from https://github.com/vatsan/pymadlib

Future Directions

Greenplum HD
• HAWQ – Parallel SQL query engine that combines the key
technological advantages of the industry-leading Greenplum
Database with the scalability and convenience of Hadoop

• SQL Standards Compliant
– Supports Correlated Sub-queries, Window Functions, Roll-ups, Cubes
+ range of scalar and aggregate functions

• ACID Compliant

HAWQ – Architecture

Performance: HAWQ [1] vs. Hive vs. Impala [2]

All experiments were run on a 60-node deployment with Analytics Workbench [3]

[1] http://www.greenplum.com/sites/default/files/2013_0301_hawq_sql_engine_hadoop_1.pdf
[2] https://github.com/cloudera/impala/
[3] http://www.analyticsworkbench.com/

HAWQ: Deep Scalable Analytics
What’s inside the box?

• Linear Regression
• Logistic Regression
• Multinomial Logistic Regression
• K-Means

• Association Rules
• Latent Dirichlet Allocation
• Users can connect to HAWQ via popular programming languages and it also
supports JDBC and ODBC.
• Most tools will work out of the box with HAWQ, including PyMADlib

Questions?
@being_bayesian
vatsan.cs@utexas.edu
https://github.com/vatsan/pymadlib

Appendix

Datasets
The following datasets were used in comparing the performance of
MADlib with Mahout
– KDD Cup 2009 Orange marketing churn data (16.5 MB)
• About 500,000 records and 15,000 numerical and categorical attributes
– Census 2000 data (1.7 GB)
• About 14 million records and 48 numerical and categorical attributes
– Enron data (1.9 GB)
• About 700,000 documents with a vocabulary size of 200,000
– KDD Cup 2011 Yahoo! Music Webscope data (4.16 GB)
• About 1 million users, 600,000 songs, and 250 million ratings
– Netflix Prize 2009 data (52.7 MB)
• About 400,000 users, 900 movies, and 4.5 million ratings


PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learning library.


Editor's Notes

  • #9 Special thanks to Grace Gee (Engineer, SOAR Program, Greenplum)