Pivotal OSS meetup - MADlib and PivotalR

1

Pivotal OSS Meetups

Big Data Analytics
MADlib and PivotalR: Scalable Machine Learning for Massively Parallel Databases
Rahul Iyer, Senior Software Developer, Predictive Analytics
March 4th, 2014


2

Agenda for the talk

•  Introduce MADlib, a distributed machine learning library for SQL users
•  How is scalability achieved by distributing the computation?
•  Performance metrics + comparisons with Mahout

•  A new R interface to access all of MADlib’s features
•  How does it get big-data results with small-data efforts?
•  Demo to showcase PivotalR

3

What is Big data?
•  Volumes of data …
•  In various formats …
•  From multiple sources …

and Analytics?
•  Generate insights …
•  for informed decision-making


4

Data → Information → Insights
Traditional analytics pipeline
Time-to-Insights

[Diagram labels: Data Prep, sample.csv, spec.docx, DB Extract, scores.csv, DB Import]


6

The MAD approach
Data → Information → Insights
Time-to-Insights

[Diagram labels: Data Prep, Model, Score; Enterprise Data; RDBMS]
Billions of rows in minutes
Reduced Data Movement

7

What is MADlib?
The MADlib project was initiated in 2011 by Greenplum architects and Joe
Hellerstein from the University of California, Berkeley.

•  MAD stands for:
•  lib stands for SQL library of:
•  advanced (mathematical, statistical, machine learning)
•  parallel & scalable in-database functions


8

What is MADlib?
UrbanDictionary.com:
mad (adj.): an adjective used to enhance a noun.
1- dude, you got skills.
2- dude, you got mad skills.


9

Which platforms does it run on?

(Partly ported)

Impala

HAWQ
HDFS


GPDB

PostgreSQL

10

Shared-Nothing Database Architecture
MPP (Massively Parallel Processing)

[Diagram labels]
•  SQL, MapReduce
•  Master Servers: query planning & dispatch
•  Network Interconnect
•  Segment Servers: query processing & data storage
•  External Sources: loading, streaming, etc.


11

Analytics Pipeline

Data Prep
•  Aggregation
•  Normalizing
•  Pivoting
•  Filtering

Data Exploration
•  Summary function
•  Sketch estimators
•  Percentiles
•  Correlation matrix

Predictive Modeling

Supervised Learning
•  Generalized Linear models
•  Linear Regression
•  Logistic Regression
•  Multinomial logit …
•  Decision Trees and Random Forest
•  Naive Bayes Classification
•  Support Vector Machines
•  Cox-Prop Hazards
and more …

Data mining

Unsupervised Learning
•  Association Rules
•  k-Means Clustering
•  Low-rank Matrix Factorization
•  PCA
•  SVD Matrix Factorization

Text analytics
•  CRF
•  LDA

Scoring
•  Linear Regression
•  Logistic Regression
•  Naïve Bayes
…

Model fitness
•  Sampling methods
•  Cross Validation

Statistical metrics
•  Descriptive statistics
•  Goodness of fit
•  Inferential statistics
•  ROC

Support modules
•  Array operations
•  Sparse Vectors
•  Probability functions

12

Example usage
Train a model

Predict for new data


13

How do we implement scalability?
Example: Linear Regression
•  Finding linear dependencies between variables

y ≈ c0 + c1 · x1 + c2 · x2 ?

   y   |  x1  | x2
-------+------+-----
 10.14 |    0 | 0.3
 11.93 | 0.69 | 0.6
 13.57 |  1.1 | 0.9
 14.17 | 1.39 | 1.2
 15.25 | 1.61 | 1.5
 16.15 | 1.79 | 1.8

The y column is the vector of dependent variables y; the x1 and x2 columns form the design matrix X.

[Scatter plot: y versus the predictor x1]


14

Challenges in computing OLS solution


15

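Challenges in computing OLS solution

X^T = \begin{pmatrix} a & c \\ b & d \end{pmatrix}, \qquad X = \begin{pmatrix} a & b \\ c & d \end{pmatrix}

The rows of X are stored on different segments (Segment 1 and Segment 2).

16

Challenges in computing OLS solution

X^T X = \begin{pmatrix} a & c \\ b & d \end{pmatrix} \begin{pmatrix} a & b \\ c & d \end{pmatrix} = \begin{pmatrix} a^2 + c^2 & \cdot \\ \cdot & \cdot \end{pmatrix}

Data across nodes are multiplied!

17

X^T X = \begin{pmatrix} a & c \\ b & d \end{pmatrix} \begin{pmatrix} a & b \\ c & d \end{pmatrix} = \begin{pmatrix} a^2 + c^2 & ab + cd \\ \cdot & \cdot \end{pmatrix}

Data across nodes are multiplied!

18

X^T X = \begin{pmatrix} a & c \\ b & d \end{pmatrix} \begin{pmatrix} a & b \\ c & d \end{pmatrix} = \begin{pmatrix} a^2 + c^2 & ab + cd \\ ba + dc & b^2 + d^2 \end{pmatrix}

Looks like the result can be decomposed

19

X^T X = \begin{pmatrix} a \\ b \end{pmatrix} \begin{pmatrix} a & b \end{pmatrix} + \begin{pmatrix} c \\ d \end{pmatrix} \begin{pmatrix} c & d \end{pmatrix} = \begin{pmatrix} a^2 + c^2 & ab + cd \\ ba + dc & b^2 + d^2 \end{pmatrix}

Let’s change perspective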
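
The decomposition of X^T X into one outer product per row can be checked numerically. The following is a minimal sketch in Python, not MADlib code: the values chosen for a, b, c, d are made up, and the helper names (outer, mat_add, gram_direct) are hypothetical. It shows that each segment could compute the outer product of its own row, so that only small 2×2 partial results need to be added together.

```python
def outer(u, v):
    """Outer product u v^T of two vectors, returned as a list of rows."""
    return [[ui * vj for vj in v] for ui in u]


def mat_add(p, q):
    """Element-wise sum of two matrices of the same shape."""
    return [[pij + qij for pij, qij in zip(pr, qr)] for pr, qr in zip(p, q)]


def gram_direct(X):
    """X^T X computed entry by entry; this needs every row of X in one place."""
    n_cols = len(X[0])
    return [[sum(row[i] * row[j] for row in X) for j in range(n_cols)]
            for i in range(n_cols)]


# Made-up values for the symbols a, b, c, d used on the slides.
a, b, c, d = 2.0, 3.0, 5.0, 7.0
X = [[a, b], [c, d]]

# Each partial result depends on a single row, so it can be computed
# on whichever segment stores that row.
partial_row_1 = outer(X[0], X[0])   # uses only (a, b)
partial_row_2 = outer(X[1], X[1])   # uses only (c, d)
combined = mat_add(partial_row_1, partial_row_2)
```

Here `combined` equals `gram_direct(X)`, that is, the matrix with entries a² + c², ab + cd, ba + dc and b² + d².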


20

Linear Regression: Streaming Algorithm
How to compute with a single table scan?

\hat{\beta} = (X^T X)^{-1} X^T y

Terms to compute: X^T X and X^T y

22

Linear Regression: Parallel Computation

Segment 1: X_1^T y_1
Segment 2: X_2^T y_2
Master: X^T y = X_1^T y_1 + X_2^T y_2

23

Basic Building Block: User-defined aggregate

Input rows:
x               y
(1,0,3,…,5)     3
(-2,4,5,…,2)    2
…               …

Output: (A,b)

Aggregation phase 1 on each node (map):
1.  Initialize: (A,b) = (0,0)
2.  Transition for all rows: (A,b) = (A,b) + (x ⋅ x^T, x ⋅ y)
3.  Send (A,b)

Aggregation phase 2 on master node (reduce):
1.  Merge: (A,b) = (A,b) + (A,b)
2.  Finalize: \hat{\beta} = solve(A,b) = A^{-1} ⋅ b
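
The initialize, transition, merge and finalize steps can be sketched outside the database. This is an illustrative Python version under stated assumptions, not MADlib's implementation: the class name LinregrState, the solve helper (plain Gaussian elimination) and the sample rows are all made up for the example, and the two "segments" are simply two slices of one list.

```python
class LinregrState:
    """Running sums A = sum(x x^T) and b = sum(x * y) for one segment."""

    def __init__(self, width):
        # Initialize: (A, b) = (0, 0)
        self.A = [[0.0] * width for _ in range(width)]
        self.b = [0.0] * width

    def transition(self, x, y):
        # Transition: (A, b) = (A, b) + (x x^T, x * y), one row at a time.
        for i, xi in enumerate(x):
            self.b[i] += xi * y
            for j, xj in enumerate(x):
                self.A[i][j] += xi * xj
        return self

    def merge(self, other):
        # Merge: add the state received from another segment.
        for i in range(len(self.b)):
            self.b[i] += other.b[i]
            for j in range(len(self.b)):
                self.A[i][j] += other.A[i][j]
        return self

    def finalize(self):
        # Finalize: solve A * beta = b.
        return solve(self.A, self.b)


def solve(A, b):
    """Solve A beta = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    beta = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][k] * beta[k] for k in range(r + 1, n))
        beta[r] = (M[r][n] - tail) / M[r][r]
    return beta


# Made-up rows (1, x1, x2) -> y, generated from y = 1 + 2*x1 + 3*x2.
points = [(0.0, 1.0), (1.0, 0.0), (2.0, 3.0), (4.0, 1.0), (3.0, 5.0), (5.0, 2.0)]
rows = [((1.0, x1, x2), 1.0 + 2.0 * x1 + 3.0 * x2) for x1, x2 in points]
segments = [rows[:3], rows[3:]]          # pretend each slice lives on one node

partials = []
for segment_rows in segments:            # phase 1: one scan per segment
    state = LinregrState(3)
    for x, y in segment_rows:
        state.transition(x, y)
    partials.append(state)

merged = partials[0].merge(partials[1])  # phase 2: merge on the master
beta = merged.finalize()                 # close to [1.0, 2.0, 3.0]
```

Only the small state (A, b) leaves each segment; its size depends on the number of variables, not on the number of rows.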


24

Problem solved? … Not Yet
•  Many ML solutions are iterative without analytical formulations

1.  Initialize problem
2.  Perform optimization step
3.  Has converged? If false, repeat step 2; if true, return results


25

Use a convex optimization framework
-  Each step has an analytical formulation that can be performed in parallel

[Table of execution times: row and column labels not recoverable]

Figure 6: The Archetypical Convex Function f(x) = x^2.

Application            Objective
Least Squares          \sum_{(u,y) \in \Omega} (x^T u - y)^2
Lasso [38]             \sum_{(u,y) \in \Omega} (x^T u - y)^2 + \mu \|x\|_1
Logistic Regression    \sum_{(u,y) \in \Omega} \log(1 + \exp(-y x^T u))
Classification (SVM)   \sum_{(u,y) \in \Omega} (1 - y x^T u)_+
Recommendation         \sum_{(i,j) \in \Omega} (L_i^T R_j - M_{ij})^2 + \mu \|L, R\|_F^2
Labeling (CRF) [40]    \sum_k \left[ \sum_j x_j F_j(y_k, z_k) - \log Z(z_k) \right]

1. Lack of portable multi-pass iterations
•  WITH RECURSIVE not reliable basis for portability
•  User-defined driver functions in Python
–  Outer loops not performance-critical
•  Compromise: Different user interface

CREATE TEMP TABLE temp
INSERT INTO temp SELECT step(...) FROM ...
SELECT converged(...) FROM temp, ...
(if false, repeat the INSERT step; if true, continue)
SELECT result(...) FROM temp
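
The driver pattern (a state table, a step query, a convergence query and a result query, all issued from a Python loop) can be sketched with Python's built-in sqlite3 module. This is a toy illustration under stated assumptions, not MADlib's driver: the tables points and iter_state, the function fit, the learning rate and the one-coefficient least-squares objective are all made up, and the step is written with plain SQL aggregates rather than a user-defined step(...) function.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE points (x REAL, y REAL)")
conn.executemany("INSERT INTO points VALUES (?, ?)",
                 [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)])   # y = 2 * x

# State table: one row per iteration, holding the current coefficient.
conn.execute("CREATE TEMP TABLE iter_state (iteration INTEGER, coef REAL)")
conn.execute("INSERT INTO iter_state VALUES (0, 0.0)")

# One gradient-descent step on mean((coef * x - y)^2), computed in one scan.
STEP_SQL = """
INSERT INTO iter_state
SELECT s.iteration + 1,
       s.coef - :rate * AVG(2.0 * p.x * (s.coef * p.x - p.y))
FROM points AS p, iter_state AS s
WHERE s.iteration = :it
GROUP BY s.iteration, s.coef
"""

# Converged when the coefficient barely moved between two iterations.
CONVERGED_SQL = """
SELECT ABS(cur.coef - prev.coef) < :tol
FROM iter_state AS cur, iter_state AS prev
WHERE cur.iteration = :it AND prev.iteration = :it - 1
"""

RESULT_SQL = "SELECT coef FROM iter_state WHERE iteration = :it"


def fit(conn, rate=0.1, tol=1e-9, max_iter=100):
    """Outer loop in Python; every pass over the data happens inside SQL."""
    it = 0
    while it < max_iter:
        conn.execute(STEP_SQL, {"rate": rate, "it": it})
        it += 1
        (converged,) = conn.execute(CONVERGED_SQL,
                                    {"tol": tol, "it": it}).fetchone()
        if converged:
            break
    (coef,) = conn.execute(RESULT_SQL, {"it": it}).fetchone()
    return coef, it


coef, iterations = fit(conn)   # coef approaches 2.0
```

The loop itself does very little work per iteration, which is consistent with the point that outer loops are not performance-critical: the scan over points is the expensive part, and it stays in the database.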

26

Architecture

Layers, top to bottom:
•  User Interface: SQL, generated per specification
•  High-level Abstraction Layer (iteration controller, ...): Python
•  Functions for Inner Loops (implements convex optimization): C++, RDBMS Built-in Functions
•  Low-level Abstraction Layer (matrix operations, C++ to DB type bridge, …): C++
•  RDBMS Query Processing (Greenplum, PostgreSQL, Hadoop with SQL)

The MADlib Vision
•  Academic and industry contributions
•  Think of “CRAN for databases”
–  Repository of open-source ML algorithms
–  This time with data parallelism in mind
•  Open-Source Framework: BSD License

3. Lack of language support for linear algebra
•  C++ Abstraction Layer uses Eigen
•  (Dense) Vectors and matrices: DOUBLE PRECISION[]
•  Example:

AnyType
solve::run(AnyType& args) {
    MappedMatrix A = args[0].getAs<MappedMatrix>();
    MappedColumnVector b = args[1].getAs<MappedColumnVector>();

    MutableMappedColumnVector x = allocateArray<double>(A.cols());
    x = A.colPivHouseholderQr().solve(b);
    return x;
}

Performance:
•  No unnecessary copying
•  No internal type conversion


27

Performance trends
•  Overhead for a single row is very low (fraction of a second)
•  Able to achieve close to linear speedup
•  Disk I/O is not always the bottleneck
•  Performance tuning is essential

[Chart: OLS on 10 million rows (in seconds). x-axis: # segments (6, 12, 18, 24). Series: # variables (20, 40, 80, 160). y-axis: 0 to 40.]
28

Performance Comparison with Apache Mahout
•  Analytics WorkBench (http://www.gopivotal.com/big-data/analytics-workbench)
–  1000-node cluster located in Las Vegas
–  Over 24,000 processors, 48 TB of memory, and 24 PB of raw disk storage
–  8000+ Map Task Capacity, 5000+ Reduce Task Capacity
–  Infrastructure: Pivotal HD 1.1
•  Mahout v0.7
•  Test matrix*
–  Data size
▪  KDD Cup 2009 Orange marketing churn data (16.5 GB)
▪  Enron data (1.9 GB)
▪  Census data 2000 (1.7 GB)
–  Algorithms: Logistic Regression and K-means
–  Algorithm parameters (e.g. convergence threshold, # iterations)
* Reporting a subset of results from whitepaper.
Courtesy Grace Gee (Engineer, SOAR Program, Pivotal)

29

Logistic Regression
[Chart: MADlib & Mahout Logistic Regression Scalability Across Number of Attributes. Series: Census data, 48 attributes [Mahout]; Census data, 48 attributes [MADlib]. x-axis: log(Number of Rows), 1,000,000 to 1E+09. y-axis: Time in Minutes, 0 to 700.]


30

Logistic Regression
[Chart: x-axis: log(Number of Rows), 1,000,000 to 1E+09. y-axis: Time in Minutes, 0 to 9.]


31

K-Means
[Chart: MADlib & Mahout K-means Scalability Across Number of Rows. x-axis: log(Number of Rows), 1,000,000 to 1E+09. y-axis: Time in Min, 0 to 350.]


32

Random Forest
[Chart: x-axis: log(Number of Rows), 1,000,000 to 1E+09. y-axis: Time in Min, 0 to 1600.]


33

Part 1 Summary

MADlib is an easy-to-use library that provides a SQL interface to fast,
scalable machine learning algorithms …


35

But not all Data Scientists
speak SQL …
Accessing Scalability through R


36

Why R?

From the report: “The preponderance of R and Python usage is more surprising …
two most commonly used individual tools, even above Excel. R and Python are likely
popular because they are easily accessible and effective open source tools.”
O’Reilly: 2013 Data Science Salary Survey

37

PivotalR Design Overview

•  Call MADlib’s in-DB machine learning functions directly from R
•  Syntax is analogous to native R function

[Diagram]
•  PivotalR (R → SQL), on top of RPostgreSQL: no data here
•  Sent to the database: SQL to execute MADlib
•  Returned to R: computation results, model output
•  Database w/ MADlib: data lives here, execution in database

•  Data doesn’t need to leave the database
•  All heavy lifting, including model estimation & computation, is done in the database

•  All data stays in DB: R objects merely point to DB objects
•  All model estimation and heavy lifting done in DB by MADlib
•  R → SQL translation done in the R client
•  Only strings of SQL and model output transferred across DBI

Woo Jung
http://gopivotal.github.io/PivotalR/

© Copyright 2014 Pivotal. All rights reserved.

Courtesy Woo Jung and Hai Qian
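
The idea that client-side objects only hold SQL text, and that data moves only when a result is requested, can be illustrated with a small Python analogue. This is a hypothetical sketch and not how PivotalR is implemented: the classes DbTable and DbColumnExpr, the sample table abalone and its rows are invented, and sqlite3 stands in for the remote database.

```python
import sqlite3


class DbColumnExpr:
    """Client-side query object: it stores SQL text, never table data."""

    def __init__(self, conn, table, expr):
        self.conn, self.table, self.expr = conn, table, expr

    def sql(self):
        return f"select {self.expr} from {self.table}"

    def mean(self):
        # Wraps the expression in an aggregate; still nothing is executed.
        return DbColumnExpr(self.conn, self.table, f"avg({self.expr})")

    def lookat(self, limit=10):
        # Only here is the SQL string sent and a small result pulled back.
        return self.conn.execute(f"{self.sql()} limit {int(limit)}").fetchall()


class DbTable:
    """Client-side handle that merely points at a table in the database."""

    def __init__(self, conn, table):
        self.conn, self.table = conn, table

    def __getitem__(self, column):
        return DbColumnExpr(self.conn, self.table, column)


# Stand-in database with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("create table abalone (id integer, rings integer)")
conn.executemany("insert into abalone values (?, ?)",
                 [(1, 8), (2, 10), (3, 12)])

x = DbTable(conn, "abalone")
avg_rings = x["rings"].mean()      # builds a query, runs nothing
query_text = avg_rings.sql()       # "select avg(rings) from abalone"
result = avg_rings.lookat()        # executes in the database
```

The sketch interpolates identifiers directly into SQL for brevity; a real client would need to quote or validate them.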

38

Some of the current features

A wrapper of MADlib
•  Linear regression
•  Logistic regression
•  Elastic Net
•  ARIMA
•  Table summary
•  Categorical variable: as.factor()
•  predict

And more ... (SQL wrapper)
•  + - * / ^ %% %/%
•  == != < > <= >=
•  & | !
•  $ [ [[ $<- [<- [[<-
•  dim names
•  by sort merge
•  is.na
•  preview content
•  db.data.frame as.db.data.frame
•  c mean sum sd var min max length colMeans colSums
•  db.connect db.disconnect db.list db.objects db.existsObject delete

40

Demonstration
library(PivotalR)

Load the Library

db.connect(port = 14526, dbname = "madlib")

Connect to the database “madlib” on port 14526

db.objects()

List all the tables in the active connection

x <- db.data.frame("madlibtestdata.dt_abalone")

Create an R object that references a table in the database

dim(x)

Report the number of rows and columns in the table

names(x)

Column names within the table

x$rings

Database query object representing “select rings from madlibtestdata.dt_abalone”

lookat(x, 10) # look at a sample of table

Pull 10 rows of data from the table back into the R environment

mean(x$rings)

Query object representing “select avg(rings) from madlibtestdata.dt_abalone”

lookat(mean(x$rings))

Execute the query and report back the result

fit <- madlib.lm(rings ~ . - id | sex, data = x)

Run a linear regression within the database and return a model object

predict(fit, x)

Create a query object representing scoring the model in the database

mean((x$rings - predict(fit, x))^2)

Query object calculating the mean square error of the model

x$sex <- as.factor(x$sex)

Add a calculated factor column to the database query object

m0 <- madlib.glm(resp ~ age, family="binomial", data=dbbank)

Calculate a logistic regression model

mstep <- step(m0, scope=list(lower=~age,
upper=~age + factor(marital) + factor(education) +
factor(housing) + factor(loan) + factor(job)))

Perform stepwise feature selection

43

We’re looking for contributors
•  Browse our help pages
–  Start page: madlib.net
–  GitHub pages
•  github.com/madlib/madlib (SQL)
•  github.com/gopivotal/pivotalr (R)
•  github.com/gopivotal/pymadlib (Python)

–  Use our product and report issues:
•  jira.madlib.net (Issue tracker)
•  user@madlib.net (User forum)

•  Can use PostgreSQL or Greenplum Database
Community Edition for installations on multiple
platforms

44

Credits

Leaders and contributors:
Gavin Sherry
Caleb Welton
Joseph Hellerstein
Christopher Ré
Zhe Wang
Florian Schoppmann
Hai Qian
Shengwen Yang
Aaron Feng
and many others …

45

Thank you for your attention

Important links:
Product email: madlib@gopivotal.com
Product site: madlib.net
Speaker email: riyer@gopivotal.com

