With the explosion of big data, the need for fast and inexpensive analytics solutions has become a key basis of competition in many industries. Extracting the value of big data with analytics can be complex, and requires advanced skills.
At Pivotal, we are building open-source solutions (MADlib, PivotalR, PyMadlib) to simplify this process for the user, while maintaining the efficiency necessary for big data analysis.
This talk introduces MADlib, an open-source library of SQL-based algorithms for machine learning, data mining and statistics that run at large scale within a database engine, with no need to import or export data to other tools.
It gives an overview of the library’s architecture and compares the performance of several of its methods with those available in Apache Mahout.
We also introduce PivotalR, an R-based wrapper for MADlib that gives data scientists and programmers access to the power of MADlib with the ease of use of R.
Pivotal OSS meetup - MADlib and PivotalR
1. BUILT FOR THE SPEED OF BUSINESS
Pivotal Confidential–Internal Use Only
2. Pivotal OSS Meetups
Big Data Analytics
MADlib and PivotalR:
Scalable Machine Learning for
Massively Parallel Databases
Rahul Iyer,
Senior Software Developer,
Predictive Analytics
March 4th, 2014
3. Agenda for the talk
• Introduce MADlib, a distributed machine learning library for SQL users
• How is scalability achieved by distributing the computation?
• Performance metrics + comparisons with Mahout
• A new R interface to access all of MADlib’s features
• How does it get big-data results with small-data efforts?
• Demo to showcase PivotalR
4. What is Big data?
• Volumes of data …
• In various formats …
• From multiple sources …
and Analytics?
• Generate insights …
• for informed decision-making
5. Data → Information → Insights
Traditional analytics pipeline
Time-to-Insights
Data Prep
sample.csv
spec.docx
DB Extract
scores.csv
DB Import
6. The MAD approach
Data → Information → Insights
Time-to-Insights
Data Prep
Model
Score
Billions of rows in minutes
Reduced Data Movement
Enterprise Data
RDBMS (×4)
8. What is MADlib?
The MADlib project was initiated in 2011 by Greenplum architects and Joe
Hellerstein of the University of California, Berkeley.
• MAD stands for:
  UrbanDictionary.com:
  mad (adj.): an adjective used to enhance a noun.
  1- dude, you got skills.
  2- dude, you got mad skills.
• lib stands for SQL library of:
  • advanced (mathematical, statistical, machine learning)
  • parallel & scalable in-database functions
9. Which platforms does it run on?
• PostgreSQL
• GPDB
• HAWQ (HDFS)
• Impala (partly ported)
10. Shared-Nothing Database Architecture
MPP (Massively Parallel Processing)
• SQL, MapReduce
• Master Servers: query planning & dispatch
• Network Interconnect
• Segment Servers: query processing & data storage
• External Sources: loading, streaming, etc.
11. Supervised Learning
Analytics Pipeline: Data Prep, Data Exploration, Predictive Modeling, Data mining, Model fitness, Scoring
Data Prep
• Aggregation
• Normalizing
• Pivoting
• Filtering
Data Exploration
• Summary function
• Sketch estimators
• Percentiles
• Correlation matrix
Supervised Learning
• Generalized Linear models
  • Linear Regression
  • Logistic Regression
  • Multinomial logit …
• Decision Trees and Random Forest
• Naive Bayes Classification
• Support Vector Machines
• Cox-Prop Hazards
and more …
Text analytics
• CRF
• LDA
Unsupervised Learning
• Association Rules
• k-Means Clustering
• Low-rank Matrix Factorization
• PCA
• SVD Matrix Factorization
Sampling methods
• Cross Validation
Statistical metrics
• Descriptive statistics
• Goodness of fit
• Inferential statistics
• ROC
Scoring
• Linear Regression
• Logistic Regression
• Naïve Bayes
…
Support modules
• Array operations
• Sparse Vectors
• Probability functions
12. Example usage
Train a model
Predict for new data
13. How do we implement scalability?
Example: Linear Regression
• Finding linear dependencies between variables
y ≈ c0 + c1 · x1 + c2 · x2 ?
The y column is the vector of dependent variables y; the x1 and x2 columns form the design matrix X.
   y   |  x1  | x2
-------+------+-----
 10.14 |    0 | 0.3
 11.93 | 0.69 | 0.6
 13.57 |  1.1 | 0.9
 14.17 | 1.39 | 1.2
 15.25 | 1.61 | 1.5
 16.15 | 1.79 | 1.8
[Plot: response (y) against predictor (x1)]
15. Challenges in computing OLS solution
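$$X^T = \begin{pmatrix} a & c \\ b & d \end{pmatrix}, \qquad X = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$$
Row (a b) of X, which is column (a b) of X^T, is stored on Segment 1; row (c d) is stored on Segment 2.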
16. Challenges in computing OLS solution
$$X^T X = \begin{pmatrix} a & c \\ b & d \end{pmatrix} \begin{pmatrix} a & b \\ c & d \end{pmatrix} = \begin{pmatrix} a^2 + c^2 & \cdot \\ \cdot & \cdot \end{pmatrix}$$
Data across nodes are multiplied!
17. Challenges in computing OLS solution
$$X^T X = \begin{pmatrix} a & c \\ b & d \end{pmatrix} \begin{pmatrix} a & b \\ c & d \end{pmatrix} = \begin{pmatrix} a^2 + c^2 & ab + cd \\ \cdot & \cdot \end{pmatrix}$$
Data across nodes are multiplied!
18. Challenges in computing OLS solution
$$X^T X = \begin{pmatrix} a & c \\ b & d \end{pmatrix} \begin{pmatrix} a & b \\ c & d \end{pmatrix} = \begin{pmatrix} a^2 + c^2 & ab + cd \\ ba + dc & b^2 + d^2 \end{pmatrix}$$
Looks like the result can be decomposed
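19. Challenges in computing OLS solution
Let’s change perspective
$$X^T X = \begin{pmatrix} a & c \\ b & d \end{pmatrix} \begin{pmatrix} a & b \\ c & d \end{pmatrix} = \begin{pmatrix} a \\ b \end{pmatrix} \begin{pmatrix} a & b \end{pmatrix} + \begin{pmatrix} c \\ d \end{pmatrix} \begin{pmatrix} c & d \end{pmatrix} = \begin{pmatrix} a^2 + c^2 & ab + cd \\ ba + dc & b^2 + d^2 \end{pmatrix}$$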
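The change of perspective on the preceding slides can be checked numerically. The sketch below is not MADlib code; the helper names and the values chosen for a, b, c, d are assumptions made for illustration. It computes one outer product per row, as each segment could do locally, and confirms that their sum equals X^T X.

```python
def outer(row):
    """Outer product row^T row for a single table row, as a nested list."""
    return [[p * q for q in row] for p in row]


def add(m1, m2):
    """Element-wise sum of two equally sized matrices."""
    return [[p + q for p, q in zip(r1, r2)] for r1, r2 in zip(m1, m2)]


def gram(rows):
    """X^T X computed directly: entry (i, j) is the sum over rows of X[k][i] * X[k][j]."""
    n = len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]


# Hypothetical values standing in for a, b, c, d on the slides.
a, b, c, d = 2.0, 3.0, 5.0, 7.0

segment_1 = outer([a, b])  # needs only the row stored on Segment 1
segment_2 = outer([c, d])  # needs only the row stored on Segment 2
combined = add(segment_1, segment_2)

print(combined)
print(combined == gram([[a, b], [c, d]]))
```

No segment needs another segment's rows; only the small per-segment matrices have to be added together.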
20. Linear Regression: Streaming Algorithm
How to compute with a single table scan?
$$(X^T X)^{-1} X^T y$$
Two quantities to accumulate: $X^T X$ and $X^T y$
21. Linear Regression: Parallel Computation
$$X^T y = X_1^T y_1 + X_2^T y_2$$
Segment 1: computes $X_1^T y_1$
Segment 2: computes $X_2^T y_2$
Master: adds the partial results to obtain $X^T y$
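The streaming and parallel slides together describe an aggregate with three parts: a per-row update, a merge of per-segment states, and a final solve. The following is a minimal sketch of that pattern, not MADlib's implementation. The function names (transition, merge, final, scan), the column of ones for the intercept, and the three-rows-per-segment split are assumptions; the data values come from the linear regression example slide.

```python
def empty_state(n):
    """Running sums: an n x n matrix for X^T X and a length-n vector for X^T y."""
    return [[0.0] * n for _ in range(n)], [0.0] * n


def transition(state, x, y):
    """Fold one row (x, y) into the running sums; this is all a single scan needs."""
    xtx, xty = state
    for i, xi in enumerate(x):
        xty[i] += xi * y
        for j, xj in enumerate(x):
            xtx[i][j] += xi * xj
    return state


def merge(s1, s2):
    """Combine the states of two segments by plain addition."""
    xtx = [[p + q for p, q in zip(r1, r2)] for r1, r2 in zip(s1[0], s2[0])]
    xty = [p + q for p, q in zip(s1[1], s2[1])]
    return xtx, xty


def final(state):
    """Solve (X^T X) c = X^T y by Gaussian elimination with partial pivoting."""
    xtx, xty = state
    n = len(xty)
    aug = [row[:] + [rhs] for row, rhs in zip(xtx, xty)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    coef = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(aug[r][k] * coef[k] for k in range(r + 1, n))
        coef[r] = (aug[r][n] - known) / aug[r][r]
    return coef


def scan(rows):
    """One pass over the rows held by a single segment."""
    state = empty_state(len(rows[0][0]))
    for x, y in rows:
        transition(state, x, y)
    return state


# Rows are ([1, x1, x2], y); the leading 1 is an assumed intercept column.
rows = [
    ([1.0, 0.0, 0.3], 10.14), ([1.0, 0.69, 0.6], 11.93), ([1.0, 1.1, 0.9], 13.57),
    ([1.0, 1.39, 1.2], 14.17), ([1.0, 1.61, 1.5], 15.25), ([1.0, 1.79, 1.8], 16.15),
]
segment_1, segment_2 = rows[:3], rows[3:]  # hypothetical data distribution

coef_parallel = final(merge(scan(segment_1), scan(segment_2)))
coef_single = final(scan(rows))
print(coef_parallel)
```

The state that travels to the master has a size that depends only on the number of variables, not on the number of rows.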
23. Problem solved? … Not Yet
• Many ML solutions are iterative without analytical formulations
1. Initialize problem
2. Perform optimization step
3. Has converged? If false, go back to step 2; if true, continue
4. Return results
24. Use a convex optimization framework
- Each step has an analytical formulation that can be performed in parallel

Lack of portable multi-pass iterations
• WITH RECURSIVE not reliable basis for portability
• User-defined driver functions in Python
  – Outer loops not performance-critical
• Compromise: different user interface

Driver loop:
  CREATE TEMP TABLE temp
  INSERT INTO temp SELECT step(...) FROM ...
  SELECT converged(...) FROM temp, ...    (false: run another step; true: continue)
  SELECT result(...) FROM temp

Figure 6: The Archetypical Convex Function f(x) = x^2.

Application          | Objective
---------------------+------------------------------------------------------------
Least Squares        | $\sum_{(u,y) \in \Omega} (x^T u - y)^2$
Lasso [38]           | $\sum_{(u,y) \in \Omega} (x^T u - y)^2 + \mu \|x\|_1$
Logistic Regression  | $\sum_{(u,y) \in \Omega} \log(1 + \exp(-y x^T u))$
Classification (SVM) | $\sum_{(u,y) \in \Omega} (1 - y x^T u)_+$
Recommendation       | $\sum_{(i,j) \in \Omega} (L_i^T R_j - M_{ij})^2 + \mu \|L, R\|_F^2$
Labeling (CRF) [40]  | $\sum_k \left[ \sum_j x_j F_j(y_k, z_k) - \log Z(z_k) \right]$

[Table of execution times, row and column labels lost in extraction: 1.90, 1.66, 60.58, 227.7, 1.197, 1.276, 1.698, 3.363, 8.840, 6.18, 2.383, 2.869, 4.475, 13.35, 45.48, 171.7, 17.14, 111.4, 0.3904, 0.4769, 1.151, 3.263, 13.10, 84.59]
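The initialize / step / converged / result loop can be sketched as a small driver. This is an illustration under stated assumptions, not MADlib's driver code: plain gradient descent stands in for the optimization step, the objective is the least-squares sum of (x^T u - y)^2 with model x and examples (u, y), and the function names, step size, tolerance and toy data are all hypothetical.

```python
def segment_gradient(model, rows):
    """Gradient of sum (x^T u - y)^2 over the rows of one segment (x is the model)."""
    grad = [0.0] * len(model)
    for u, y in rows:
        err = sum(m * ui for m, ui in zip(model, u)) - y
        for i, ui in enumerate(u):
            grad[i] += 2.0 * err * ui
    return grad


def step(model, segments, rate):
    """One optimization step: per-segment sums are independent, then merged by addition."""
    partials = [segment_gradient(model, seg) for seg in segments]
    grad = [sum(parts) for parts in zip(*partials)]
    return [m - rate * g for m, g in zip(model, grad)]


def converged(old, new, tol):
    """Stop when no coefficient moved by more than tol in the last step."""
    return max(abs(p - q) for p, q in zip(old, new)) < tol


def driver(segments, n, rate=0.01, tol=1e-10, max_iter=100000):
    """Outer loop: cheap control flow around the expensive, parallelizable step."""
    model = [0.0] * n  # initialize problem
    for _ in range(max_iter):
        new_model = step(model, segments, rate)
        done = converged(model, new_model, tol)
        model = new_model
        if done:
            break
    return model  # return results


# Toy data generated from y = 1 + 2 * t, split across two hypothetical segments.
segments = [
    [([1.0, 0.0], 1.0), ([1.0, 1.0], 3.0)],
    [([1.0, 2.0], 5.0), ([1.0, 3.0], 7.0)],
]
model = driver(segments, n=2)
print(model)
```

The outer loop does almost no work itself, which is consistent with the slide's point that it is not performance-critical and can live in a Python driver function.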
26. Performance trends
• Overhead for a single row is very low (fraction of a second)
• Able to achieve close to linear speedup
[Chart: OLS on 10 million rows (in seconds), 0 to 40, against # segments (6, 12, 18, 24); one series per # variables (20, 40, 80, 160)]
Partly hidden notes on the chart:
• Disk I/O is not always the bottleneck
• Performance tuning is essential
• Overhead for single query very low (fraction of a second)
• Greenplum achieves nearly perfect speedup
27. Performance Comparison with Apache Mahout
• Analytics WorkBench (http://www.gopivotal.com/big-data/analytics-workbench)
– 1000-node cluster located in Las Vegas
– Over 24,000 processors, 48 TB of Memory, and 24 PB of raw disk storage
– 8000+ Map Task Capacity, 5000+ Reduce Task Capacity
– Infrastructure: Pivotal HD 1.1
• Mahout v0.7
• Test matrix*
– Data size
▪ KDD Cup 2009 Orange marketing churn data (16.5 GB)
▪ Enron data (1.9 GB)
▪ Census data 2000 (1.7 GB)
– Algorithms: Logistic Regression and K-means
– Algorithm parameters (e.g. convergence threshold, # iterations)
* Reporting a subset of results from whitepaper.
Courtesy Grace Gee (Engineer, SOAR Program, Pivotal)
28. Logistic Regression
[Chart: MADlib & Mahout Logistic Regression Scalability Across Number of Attributes. Time in minutes (0 to 700) against log(Number of Rows) (1,000,000 to 1E+09). Series: Census data, 48 attributes [Mahout]; Census data, 48 attributes [MADlib]]
Courtesy Grace Gee (Engineer, SOAR Program, Pivotal)
29. Logistic Regression
[Chart: Time in minutes (0 to 9) against log(Number of Rows) (1,000,000 to 1E+09)]
Courtesy Grace Gee (Engineer, SOAR Program, Pivotal)
30. K-Means
[Chart: MADlib & Mahout K-means Scalability Across Number of Rows. Time in minutes (0 to 350) against log(Number of Rows) (1,000,000 to 1E+09). Series: Census data, 48 attributes [Mahout]; Census data, 48 attributes [MADlib]]
Courtesy Grace Gee (Engineer, SOAR Program, Pivotal)
31. Random Forest
[Chart: Time in minutes (0 to 1600) against log(Number of Rows) (1,000,000 to 1E+09). Series: Census data, 46 attributes [Mahout]; Census data, 46 attributes [MADlib]]
Courtesy Grace Gee (Engineer, SOAR Program, Pivotal)
32. Part 1 Summary
MADlib is an easy-to-use library that provides a SQL interface to fast, scalable machine learning algorithms …
33. But not all Data Scientists
speak SQL …
Accessing Scalability through R
34. Why R?
From the report: “The preponderance of R and Python usage is more surprising …
two most commonly used individual tools, even above Excel. R and Python are likely
popular because they are easily accessible and effective open source tools.”
O’Reilly: 2013 Data Science Salary Survey
36. Some current features
A wrapper of MADlib
• Linear regression
• Logistic regression
• Elastic Net
• ARIMA
• Table summary
• Categorical variable: as.factor()
• predict
And more ... (SQL wrapper)
• + - * / ^ %% %/%
• == != < > <= >= & | !
• $ [ [[ $<- [<- [[<-
• dim names by sort merge
• is.na preview content
• c mean sum sd var min max length colMeans colSums
• db.data.frame as.db.data.frame
• db.connect db.disconnect db.list db.objects db.existsObject delete
37. Demonstration
# Load the library
library(PivotalR)
# Connect to the database "madlib" on port 14526
db.connect(port = 14526, dbname = "madlib")
# List all the tables in the active connection
db.objects()
# Create an R object that references a table in the database
x <- db.data.frame("madlibtestdata.dt_abalone")
# Report the number of rows and columns in the table
dim(x)
# Column names within the table
names(x)
# Database query object representing "select rings from madlibtestdata.dt_abalone"
x$rings
# Pull 10 rows of data from the table back into the R environment
lookat(x, 10)
# Query object representing "select avg(rings) from madlibtestdata.dt_abalone"
mean(x$rings)
# Execute the query and report back the result
lookat(mean(x$rings))
# Run a linear regression within the database and return a model object
fit <- madlib.lm(rings ~ . - id | sex, data = x)
# Create a query object representing scoring the model in the database
predict(fit, x)
# Query object calculating the mean square error of the model
mean((x$rings - predict(fit, x))^2)
# Add a calculated factor column to the database query object
x$sex <- as.factor(x$sex)
# Calculate a logistic regression model
# (dbbank is a db.data.frame that is not defined on this slide)
m0 <- madlib.glm(resp ~ age, family="binomial", data=dbbank)
# Perform stepwise feature selection
mstep <- step(m0, scope=list(lower=~age,
    upper=~age + factor(marital) + factor(education) +
    factor(housing) + factor(loan) + factor(job)))
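The demo distinguishes query objects, which only describe SQL, from lookat, which executes them. The sketch below illustrates that lazy-evaluation idea only; it is not how PivotalR is implemented, and the Query class and its methods are hypothetical names. It builds the two SQL strings quoted in the demo without touching any database.

```python
class Query:
    """Hypothetical stand-in for a database-backed object: it holds SQL text, not data."""

    def __init__(self, table, expr="*"):
        self.table = table
        self.expr = expr

    def column(self, name):
        """Analogue of x$name: still no data is moved, only the expression changes."""
        return Query(self.table, name)

    def mean(self):
        """Analogue of mean(...): wraps the expression in an SQL aggregate."""
        return Query(self.table, "avg(" + self.expr + ")")

    def sql(self, limit=None):
        """The statement a lookat-style call would send to the database."""
        text = "select " + self.expr + " from " + self.table
        if limit is not None:
            text += " limit " + str(limit)
        return text


x = Query("madlibtestdata.dt_abalone")
rings = x.column("rings")

print(rings.sql())
print(rings.mean().sql())
print(x.sql(limit=10))
```

Only the final statement would be sent to the database, so the client handles a short string and a small result while the table itself stays where it is.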
38. We’re looking for contributors
• Browse our help pages
– Start page: madlib.net
– Github pages
• github.com/madlib/madlib (SQL)
• github.com/gopivotal/pivotalr (R)
• github.com/gopivotal/pymadlib (Python)
– Use our product and report issues:
• jira.madlib.net (Issue tracker)
• user@madlib.net (User forum)
• Can use PostgreSQL or Greenplum Database Community Edition for installations on multiple platforms
39. Credits
Leaders and contributors:
Gavin Sherry
Caleb Welton
Joseph Hellerstein
Christopher Ré
Zhe Wang
Florian Schoppmann
Hai Qian
Shengwen Yang
Aaron Feng
and many others …
The MADlib Vision
• Academic and industry contributions
• Think of “CRAN for databases”
– Repository of open-source ML algorithms
– This time with data parallelism in mind
• Open-Source Framework
– BSD License
– Eigen
40. Thank you for your attention
Important links:
Product email: madlib@gopivotal.com
Product site: madlib.net
Speaker email: riyer@gopivotal.com