PyMADlib - A Python wrapper for MADlib: in-database, parallel, machine learning library.

Srivatsan Ramanujam
Senior Data Scientist
Greenplum

© Copyright 2011 EMC Corporation. All rights reserved.

1

Agenda
• Greenplum UAP overview
– Products: GPDB, GPHD, Chorus, Analytics Labs, Data Computing Appliance
– GPDB Architecture

• MADlib
– Overview
– Algorithms
– Working Mechanism
– Performance Comparison with Mahout

• PyMADlib
– Overview
– Demo in IPython Notebook

• Future Directions
– GPHD and HAWQ


2

Greenplum Overview


3

Products


4

Greenplum Database - Architecture
MPP (Massively Parallel Processing), Shared-Nothing Architecture
• SQL, MapReduce
• Master Servers: query planning & dispatch
• Network Interconnect
• Segment Servers: query processing & data storage
• External Sources: loading, streaming, etc.


5

MADlib


6

MADlib: The Origin

UrbanDictionary.com:
mad (adj.): an adjective used to enhance a noun.
1- dude, you got skills.
2- dude, you got mad skills.

• First mention of MAD analytics was at VLDB’09
– MAD Skills: New Analysis Practices for Big Data
– Jeffrey Cohen, Brian Dolan, Mark Dunlap, Joseph Hellerstein, Caleb Welton
– http://db.cs.berkeley.edu/papers/vldb09-madskills.pdf

• MADlib project initiated in late 2010
– Maintained by Greenplum/EMC with significant contributions
from UW Madison, UFlorida and UC Berkeley.


7

Current Modules
Data Modeling
Supervised Learning
• Naive Bayes Classification
• Linear Regression
• Logistic Regression
• Multinomial Logistic Regression
• Decision Tree
• Random Forest
• Support Vector Machines
• Cox-Proportional Hazards Regression
• Conditional Random Field

Unsupervised Learning
• Association Rules
• k-Means Clustering
• Low-rank Matrix Factorization
• SVD Matrix Factorization
• Parallel Latent Dirichlet Allocation

Descriptive Statistics
• Sketch-based Estimators
– CountMin (Cormode-Muthukrishnan)
– FM (Flajolet-Martin)
– MFV (Most Frequent Values)
• Profile
• Quantile

Support
• Array Operations
• Conjugate Gradient
• Sparse Vectors
• Probability Functions
• Random Sampling

Inferential Statistics
• Hypothesis tests


8

MADlib – User Doc
• Check out the user guide with examples at: http://doc.madlib.net


9

How does it work? A Linear Regression Example
• Finding linear dependencies between variables
– y ≈ c0 + c1 · x1 + c2 · x2 ?

# select y, x1, x2 from unm limit 6;

   y   |  x1  | x2
-------+------+-----
 10.14 |    0 | 0.3
 11.93 | 0.69 | 0.6
 13.57 |  1.1 | 0.9
 14.17 | 1.39 | 1.2
 15.25 | 1.61 | 1.5
 16.15 | 1.79 | 1.8

• y column: vector of dependent variables y
• x columns: design matrix X

10

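Reminder: Linear-Regression Model
• Model: $y = Xc + \varepsilon$, where $\varepsilon = y - Xc$ is the vector of residuals
• If residuals i.i.d. Gaussians with standard deviation σ:
– max likelihood ⇔ min sum of squared residuals

• First-order conditions for the following quadratic objective (in c)

$$f(c) = \lVert y - Xc \rVert^2 = \sum_i \left(y_i - x_i^T c\right)^2$$

yield the minimizer

$$\hat{c} = (X^T X)^{-1} X^T y$$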


11

Linear Regression: Streaming Algorithm
• How to compute $(X^T X)^{-1} X^T y$ with a single table scan?

[Diagram: the products $X^T X$ and $X^T y$]
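A minimal sketch of the single-scan idea, in plain Python rather than in-database code: each row is visited once and folded into running sums for X^T X and X^T y, and the small system is solved only at the end. The function names `accumulate` and `solve` are illustrative, the leading 1.0 in each feature vector is an assumed intercept term, and the data are the six sample rows shown earlier. This is not MADlib's implementation.

```python
# Sketch: one pass over the rows, accumulating X^T X and X^T y.
# Assumption: a constant 1.0 is prepended to each row as the intercept term.

def accumulate(rows, dim):
    """Fold rows of (y, x) into the running sums X^T X and X^T y."""
    xtx = [[0.0] * dim for _ in range(dim)]
    xty = [0.0] * dim
    for y, x in rows:  # each row is seen exactly once
        for i in range(dim):
            xty[i] += x[i] * y
            for j in range(dim):
                xtx[i][j] += x[i] * x[j]
    return xtx, xty


def solve(a, b):
    """Solve a c = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    c = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][k] * c[k] for k in range(i + 1, n))
        c[i] = (m[i][n] - s) / m[i][i]
    return c


# The six sample rows (y, x1, x2) from the example table.
table = [
    (10.14, 0.00, 0.3),
    (11.93, 0.69, 0.6),
    (13.57, 1.10, 0.9),
    (14.17, 1.39, 1.2),
    (15.25, 1.61, 1.5),
    (16.15, 1.79, 1.8),
]

# A generator stands in for the table scan: rows are never held all at once.
rows = ((y, (1.0, x1, x2)) for y, x1, x2 in table)
xtx, xty = accumulate(rows, dim=3)
coef = solve(xtx, xty)  # [c0, c1, c2]
print(coef)
```

The state kept between rows is only a 3x3 matrix and a length-3 vector, so its size depends on the number of features and not on the number of rows.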

12

Linear Regression: Parallel Computation
[Diagram: computing $X^T y$ in parallel]
• Segment 1: $X_1^T y_1$
• Segment 2: $X_2^T y_2$
• Master: $X^T y = X_1^T y_1 + X_2^T y_2$
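The same idea can be sketched in plain Python: each segment computes partial sums over its own rows and the master adds them element-wise. The names `partial_sums` and `merge` and the two-way split of the rows are assumptions made for illustration; in the database, the split is whatever the data distribution happens to be.

```python
# Sketch: per-segment partial sums merged on the master.
# Assumption: rows are (y, x) pairs with an intercept 1.0 already prepended.

def partial_sums(rows):
    """Return (X_k^T X_k, X_k^T y_k) for one segment's rows."""
    dim = len(rows[0][1])
    xtx = [[0.0] * dim for _ in range(dim)]
    xty = [0.0] * dim
    for y, x in rows:
        for i in range(dim):
            xty[i] += x[i] * y
            for j in range(dim):
                xtx[i][j] += x[i] * x[j]
    return xtx, xty


def merge(a, b):
    """Master-side merge: element-wise sum of two partial states."""
    xtx = [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a[0], b[0])]
    xty = [p + q for p, q in zip(a[1], b[1])]
    return xtx, xty


rows = [
    (10.14, (1.0, 0.00, 0.3)),
    (11.93, (1.0, 0.69, 0.6)),
    (13.57, (1.0, 1.10, 0.9)),
    (14.17, (1.0, 1.39, 1.2)),
    (15.25, (1.0, 1.61, 1.5)),
    (16.15, (1.0, 1.79, 1.8)),
]

# Hypothetical distribution of the table across two segments.
segment_1, segment_2 = rows[:3], rows[3:]

merged = merge(partial_sums(segment_1), partial_sums(segment_2))
single = partial_sums(rows)  # reference: everything on one node
```

Because the merge is a plain sum, it does not matter how the rows are divided among segments; `merged` and `single` agree up to floating-point rounding.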

13

Performance Comparison: Test Setup on AWB
• AWB (Analytics Workbench)
– 1000-node cluster located in Las Vegas
– Over 24,000 processors, 48 TB of memory, and 24 PB of raw disk storage
– 8000+ Map Task Capacity, 5000+ Reduce Task Capacity
– GPHD 1.1, GPDB 4.2.3

• Mahout v0.7
• MADlib v0.5
– With small LMF change to allow 4-byte integer values

• Test matrix
– Data size (# rows/records, # columns/features)
– Algorithms
– Algorithm parameters (e.g. convergence threshold, # iterations)
– GPDB segment / MR (Map-Reduce) task configurations


14

Performance & Scalability Results (summary)

• Whitepaper coming out shortly!


15

Logistic Regression
• Mahout only has sequential (i.e. single node) IGD implementation

[Figure: MADlib & Mahout Logistic Regression Scalability Across Number of Attributes. Series: Census data, 48 attributes [Mahout]; Census data, 48 attributes [MADlib]. x-axis: log(Number of Rows); y-axis: Time in Minutes.]


16

Logistic Regression
[Figure: MADlib Scalability Across Number of GPDB Segments. x-axis: Number of GPDB Segments; y-axis: Time in Minutes.]


17

K-Means Clustering
[Figure: MADlib & Mahout K-means Scalability Across Number of Rows. Series: Census data, 48 attributes [Mahout]; Census data, 48 attributes [MADlib]. x-axis: log(Number of Rows); y-axis: Time in Min.]


18

K-Means Clustering
[Figure: "MADlib K-means Scalability Across". y-axis: Time in Min.]



19

PyMADlib : Python + MADlib = Awesome!


20

Motivation
• SQL is great for many things, but it’s not nearly enough

• Undeniably the most straightforward way to query data

• But not necessarily designed for data science


21

MADlib is a godsend!
• Empowers data scientists to run canned machine learning
routines – focus less on coding, more on science
• In-database, explicitly parallel.

• So why do we need anything else?
– UI is still all in SQL
– Need to tap into rich visualization libraries


22

Then which interface is favored by and familiar
to data scientists?

• Depends on who you ask
• Left survey is for “higher level languages,” and right survey is for “lower level languages”


23

Wait, don’t we already have this (PL/R,
PL/Python, SAS HPA)?
• PL/X’s are wonderful, but:
– They still require non-trivial knowledge of SQL to use effectively
– Mostly limited to explicitly parallel jobs
– Primarily a SQL interface to the end user

• Need an interface that is:
– Less SQL, more R/Python/SAS
– Implicitly parallelized
– More scalable

• SAS HPA = $$$$$


24

The challenge
• MADlib
– Open source
– Extremely powerful/scalable
– Growing algorithm breadth
– SQL

• Python/R
– Open source
– Memory limited
– High algorithm breadth
– Language/interface purpose-designed for data science

• SAS
– High user loyalty
– Non-HPA is memory limited, HPA requires investment
– High algorithm breadth
– Language/interface purpose-designed for data science

• Want to leverage both the performance benefits of MADlib and the
usability of languages like Python, SAS, and R


25

Simple solution: Translate Python code into SQL
[Diagram: Python → SQL over ODBC/JDBC. SQL to execute MADlib goes to the database; model output comes back.]

• All data stays in DB and all model estimation and heavy lifting done in DB by
MADlib

• Only strings of SQL and model output transferred across ODBC/JDBC
• Best of both worlds: the number-crunching power of MADlib together with the
rich visualizations of Matplotlib, NetworkX and all your other favorite Python
libraries. Let MADlib do the heavy lifting on your Greenplum/PostgreSQL
database while you program in your favorite language, Python.
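A rough sketch of what "translate Python into SQL" can look like. The helper `build_linregr_sql` is hypothetical and is not PyMADlib's actual API; it only assembles a string. The generated statement assumes MADlib's `linregr` aggregate is installed under a `madlib` schema and reuses the `unm` table from the earlier example. Sending the string over ODBC/JDBC and reading back the model output is left out.

```python
# Sketch only: build_linregr_sql is a hypothetical helper, not PyMADlib's API.
# Assumption: MADlib's linregr aggregate is available as madlib.linregr.

import re

# Plain SQL identifiers only, so user input cannot smuggle in extra SQL.
_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def _check(name):
    """Return name unchanged if it is a plain identifier, else raise."""
    if not _IDENT.fullmatch(name):
        raise ValueError("not a plain identifier: %r" % (name,))
    return name


def build_linregr_sql(table, dependent, features):
    """Return SQL that fits dependent ~ 1 + features inside the database."""
    cols = ", ".join(_check(f) for f in features)
    return "SELECT (madlib.linregr({y}, ARRAY[1, {cols}])).* FROM {table};".format(
        y=_check(dependent), cols=cols, table=_check(table)
    )


# Only this short string crosses the wire; the data never leaves the database.
sql = build_linregr_sql("unm", "y", ["x1", "x2"])
print(sql)  # SELECT (madlib.linregr(y, ARRAY[1, x1, x2])).* FROM unm;
```

The Python side stays small on purpose: it validates names, builds one statement, and would then hand the returned coefficients to plotting or further analysis code.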


26

Demo

PyMADlib Tutorial –
IPython Notebook Viewer Link

http://nbviewer.ipython.org/5275846


27

Where do I get it?

$ pip install pymadlib


28

I don’t have GPDB or MADlib – What do I do?
• Greenplum Database Community Edition is freely
available for single node installations on multiple
platforms
– Written permission may be requested from EMC/Greenplum
for research use for multi-node installations

• MADlib is free and open-source
– Downloadable for multiple platforms from
https://github.com/madlib/madlib

• PyMADlib is also free and open-source
– Downloadable from https://github.com/vatsan/pymadlib


29

Future Directions


30

Greenplum HD
• HAWQ – Parallel SQL query engine that combines the key
technological advantages of industry-leading Greenplum
Database with scalability and convenience of Hadoop

• SQL Standards Compliant
– Supports Correlated Sub-queries, Window Functions, Roll-ups, Cubes
+ range of scalar and aggregate functions

• ACID Compliant


31

HAWQ – Architecture


32

Performance: HAWQ [1] vs. Hive vs. Impala [2]

All experiments were run on a 60 node deployment with Analytics Workbench [3]

[1] http://www.greenplum.com/sites/default/files/2013_0301_hawq_sql_engine_hadoop_1.pdf
[2] https://github.com/cloudera/impala/
[3] http://www.analyticsworkbench.com/


33

HAWQ: Deep Scalable Analytics
What’s inside the box?

• Linear Regression
• Logistic Regression
• Multinomial Logistic Regression
• K-Means

• Association Rules
• Latent Dirichlet Allocation
• Users can connect to HAWQ via popular programming languages and it also
supports JDBC and ODBC.
• Most tools will work out of the box with HAWQ, including PyMADlib


34

Questions?
@being_bayesian
vatsan.cs@utexas.edu
https://github.com/vatsan/pymadlib


35

Appendix


36

Datasets
The following datasets were used in comparing the performance of
MADlib with Mahout
– KDD Cup 2009 Orange marketing churn data (16.5 MB)
• About 500,000 records and 15,000 numerical and categorical attributes
– Census 2000 data (1.7 GB)
• About 14 million records and 48 numerical and categorical attributes
– Enron data (1.9 GB)
• About 700,000 documents with a vocabulary size of 200,000
– KDD Cup 2011 Yahoo! Music Webscope data (4.16 GB)
• About 1 million users, 600,000 songs, and 250 million ratings
– Netflix Prize 2009 data (52.7 MB)
• About 400,000 users, 900 movies, and 4.5 million ratings


37
