BUILT FOR THE SPEED OF BUSINESS

Pivotal Confidential–Internal Use Only

1
Pivotal OSS Meetups

Big Data Analytics
MADlib and PivotalR:
Scalable Machine Learning for
Massively Parallel Databases
Rahul Iyer,
Senior Software Developer,
Predictive Analytics
March 4th, 2014


2
Agenda for the talk

•  Introduce MADlib, a distributed machine learning library for SQL users
•  How scalability is achieved by distributing the computation
•  Performance metrics + comparisons with Mahout

•  A new R interface to access all of MADlib’s features
•  How it gets big-data results with small-data effort
•  Demo to showcase PivotalR

3
What is Big data?
•  Volumes of data …
•  In various formats …
•  From multiple sources …

and Analytics?
•  Generate insights …
•  for informed decision-making


4
Data → Information → Insights
Traditional analytics pipeline
Time-to-Insights

[Diagram labels: Data Prep, sample.csv, spec.docx, DB Extract, scores.csv, DB Import]

6
The MAD approach
Data → Information → Insights
Time-to-Insights

[Diagram labels: Data Prep, Model, Score; Enterprise Data; RDBMS. Callouts: Billions of rows in minutes; Reduced Data Movement]

7
What is MADlib?
The MADlib project was initiated in 2011 by Greenplum architects and Joe
Hellerstein from the University of California, Berkeley.

•  MAD stands for:
•  lib stands for SQL library of:
•  advanced (mathematical, statistical, machine learning)
•  parallel & scalable in-database functions

UrbanDictionary.com:
mad (adj.): an adjective used to enhance a noun.
1- dude, you got skills.
2- dude, you got mad skills.


9
Which platforms does it run on?

•  PostgreSQL
•  GPDB
•  HAWQ (HDFS)
•  Impala (partly ported)

10
Shared-Nothing Database Architecture
MPP (Massively Parallel Processing)

•  Master servers: query planning & dispatch (SQL, MapReduce)
•  Network interconnect
•  Segment servers: query processing & data storage
•  External sources: loading, streaming, etc.

11
Analytics Pipeline

Data Prep
•  Aggregation
•  Normalizing
•  Pivoting
•  Filtering

Data Exploration
•  Summary function
•  Sketch estimators
•  Percentiles
•  Correlation matrix

Predictive Modeling

Supervised Learning
•  Generalized Linear models
•  Linear Regression
•  Logistic Regression
•  Multinomial logit …
•  Decision Trees and Random Forest
•  Naive Bayes Classification
•  Support Vector Machines
•  Cox-Prop Hazards
and more …

Unsupervised Learning (Data mining)
•  Association Rules
•  k-Means Clustering
•  Low-rank Matrix Factorization
•  PCA
•  SVD Matrix Factorization

Text analytics
•  CRF
•  LDA

Scoring
•  Linear Regression
•  Logistic Regression
•  Naïve Bayes
…

Model fitness
•  Cross Validation
•  Sampling methods

Statistical metrics
•  Descriptive statistics
•  Goodness of fit
•  Inferential statistics
•  ROC

Support modules
•  Array operations
•  Sparse Vectors
•  Probability functions

12
Example usage
Train a model

Predict for new data


13
How do we implement scalability?
Example: Linear Regression
•  Finding linear dependencies between variables

y ≈ c0 + c1 · x1 + c2 · x2 ?

   y   |  x1  | x2
-------+------+-----
 10.14 |    0 | 0.3
 11.93 | 0.69 | 0.6
 13.57 |  1.1 | 0.9
 14.17 | 1.39 | 1.2
 15.25 | 1.61 | 1.5
 16.15 | 1.79 | 1.8

The y column is the vector of dependent variables y; the x1 and x2 columns form the design matrix X.

[Plot: regressor (y) against predictor (x1)]

14
Challenges in computing OLS solution


15
Challenges in computing OLS solution

The rows of X are spread across segments; here the row (a, b) is taken to sit on Segment 1 and the row (c, d) on Segment 2.

$$
X^T X =
\begin{pmatrix} a & c \\ b & d \end{pmatrix}
\begin{pmatrix} a & b \\ c & d \end{pmatrix}
=
\begin{pmatrix} a^2 + c^2 & ab + cd \\ ba + dc & b^2 + d^2 \end{pmatrix}
$$

Data across nodes are multiplied!

Looks like the result can be decomposed. Let’s change perspective:

$$
X^T X =
\begin{pmatrix} a \\ b \end{pmatrix}
\begin{pmatrix} a & b \end{pmatrix}
+
\begin{pmatrix} c \\ d \end{pmatrix}
\begin{pmatrix} c & d \end{pmatrix}
$$

20
Linear Regression: Streaming Algorithm
How to compute with a single table scan?

$$
\hat{\beta} = (X^T X)^{-1} X^T y
$$

Terms to accumulate during the scan: $X^T X$ and $X^T y$
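A short derivation, not on the slide, of why one scan is enough. It assumes that $x_i$ denotes the i-th row of X written as a column vector and $y_i$ the matching response; both symbols are introduced here for illustration only.

$$
X^T X = \sum_{i=1}^{n} x_i x_i^T, \qquad
X^T y = \sum_{i=1}^{n} x_i y_i, \qquad
\hat{\beta} = \Big( \sum_{i=1}^{n} x_i x_i^T \Big)^{-1} \sum_{i=1}^{n} x_i y_i
$$

Each row contributes one term to each sum, so the rows can be visited in any order, and partial sums computed on different segments can simply be added together.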

22
Linear Regression: Parallel Computation

•  Segment 1 computes $X_1^T y_1$
•  Segment 2 computes $X_2^T y_2$
•  Master combines the partial results: $X^T y = X_1^T y_1 + X_2^T y_2$

23
Basic Building Block: User-defined aggregate

Input rows (x, y): x = (1, 0, 3, …, 5), y = 3; x = (-2, 4, 5, …, 2), y = 2; …
Aggregate state: (A, b)

Aggregation phase 1 on each node (map):
1.  Initialize: $(A, b) = (0, 0)$
2.  Transition for all rows: $(A, b) = (A, b) + (x x^T, x \cdot y)$
3.  Send $(A, b)$

Aggregation phase 2 on master node (reduce):
1.  Merge: $(A, b) = (A, b) + (A, b)$, adding up the states received from the nodes
2.  Finalize: $\hat{\beta} = \mathrm{solve}(A, b) = A^{-1} \cdot b$
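The initialize / transition / merge / finalize steps above can be sketched in plain Python. This is an illustration only, not MADlib's implementation: the function names, the list-of-lists state and the toy data are assumptions made for this sketch, and the final solve uses textbook Gaussian elimination rather than whatever solver the library uses.

```python
from functools import reduce


def init_state(k):
    """Initialize: (A, b) = (0, 0) for k regressors."""
    return [[0.0] * k for _ in range(k)], [0.0] * k


def transition(state, row):
    """Fold one row (x, y) into the state: A += x x^T, b += x * y."""
    A, b = state
    x, y = row
    for i, xi in enumerate(x):
        b[i] += xi * y
        for j, xj in enumerate(x):
            A[i][j] += xi * xj
    return A, b


def merge(s1, s2):
    """Merge two per-segment states by element-wise addition."""
    (A1, b1), (A2, b2) = s1, s2
    A = [[p + q for p, q in zip(r1, r2)] for r1, r2 in zip(A1, A2)]
    b = [p + q for p, q in zip(b1, b2)]
    return A, b


def finalize(state):
    """Solve A beta = b by Gaussian elimination with partial pivoting."""
    A, b = state
    k = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, k):
            f = M[r][col] / M[col][col]
            for c in range(col, k + 1):
                M[r][c] -= f * M[col][c]
    beta = [0.0] * k
    for i in reversed(range(k)):  # back substitution
        s = sum(M[i][j] * beta[j] for j in range(i + 1, k))
        beta[i] = (M[i][k] - s) / M[i][i]
    return beta


def ols(segments, k):
    """Phase 1 on every segment, then merge and finalize on the 'master'."""
    states = [reduce(transition, rows, init_state(k)) for rows in segments]
    return finalize(reduce(merge, states))


# Hypothetical toy data generated from y = 1 + 2*x1 + 3*x2, split over two
# "segments". Each x starts with a constant 1 for the intercept.
segments = [
    [([1.0, 0.0, 0.0], 1.0), ([1.0, 1.0, 0.0], 3.0)],
    [([1.0, 0.0, 1.0], 4.0), ([1.0, 1.0, 1.0], 6.0), ([1.0, 2.0, 1.0], 8.0)],
]
beta = ols(segments, 3)  # close to [1, 2, 3]
```

Because merge is plain addition, the result does not depend on how the rows are divided among segments, which is the property the slide relies on.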

24
Problem solved? … Not Yet
•  Many ML solutions are iterative without analytical formulations

Initialize problem → Perform optimization step → Has converged?
   false: perform another optimization step
   true: return results

25
Use a convex optimization framework
-  Each step has an analytical formulation that can be performed in parallel

Figure 6: The Archetypical Convex Function f(x) = x^2.

Application            Objective
Least Squares          $\sum_{(u,y)\in\Omega} (x^T u - y)^2$
Lasso [38]             $\sum_{(u,y)\in\Omega} (x^T u - y)^2 + \mu \|x\|_1$
Logistic Regression    $\sum_{(u,y)\in\Omega} \log(1 + \exp(-y\, x^T u))$
Classification (SVM)   $\sum_{(u,y)\in\Omega} (1 - y\, x^T u)_+$
Recommendation         $\sum_{(i,j)\in\Omega} (L_i^T R_j - M_{ij})^2 + \mu \|L, R\|_F^2$
Labeling (CRF) [40]    $\sum_k \big[ \sum_j x_j F_j(y_k, z_k) - \log Z(z_k) \big]$

1. Lack of portable multi-pass iterations
•  WITH RECURSIVE is not a reliable basis for portability
•  User-defined driver functions in Python
   –  Outer loops not performance-critical
•  Compromise: Different user interface

Driver loop:
    CREATE TEMP TABLE temp
    INSERT INTO temp SELECT step(...) FROM ...
    SELECT converged(...) FROM temp, ...     (false: run the step again; true: continue)
    SELECT result(...) FROM temp

Execution times (row and column labels lost in extraction): 1.90, 1.66, 60.58, 227.7, 1.197, 1.276, 1.698, 3.363, 8.840, 6.18, 2.383, 2.869, 4.475, 13.35, 45.48, 171.7, 17.14, 111.4, 0.3904, 0.4769, 1.151, 3.263, 13.10, 84.59
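The step / converged / result pattern can be sketched as a small driver in Python, the language the slide names for driver functions. Everything here is hypothetical and for illustration: `drive`, `gradient_step` and `close_enough` are made-up names, the state is a single number held in memory rather than a temp table, and the objective is the f(x) = x^2 example from Figure 6 rather than a real model.

```python
def drive(state, step, converged, max_iter=1000):
    """Outer loop: apply `step` until `converged` says stop (or max_iter)."""
    for iteration in range(1, max_iter + 1):
        new_state = step(state)          # plays the role of SELECT step(...)
        if converged(state, new_state):  # plays the role of SELECT converged(...)
            return new_state, iteration  # plays the role of SELECT result(...)
        state = new_state
    return state, max_iter


def gradient_step(x, rate=0.25):
    """One gradient-descent step on f(x) = x^2, whose gradient is 2x."""
    return x - rate * 2 * x


def close_enough(old, new, tol=1e-6):
    """Stop once successive iterates differ by less than tol."""
    return abs(new - old) < tol


x_min, n_iter = drive(8.0, gradient_step, close_enough)
```

The outer loop runs only once per iteration, so it does little work itself; in the in-database setting the expensive part is the step, which is the piece executed in parallel over the data.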

26
Architecture

Layers, top to bottom:
•  User Interface
•  High-level Abstraction Layer (iteration controller, ...): Python
•  Functions for Inner Loops (implements convex optimization), next to RDBMS Built-in Functions
•  Low-level Abstraction Layer (matrix operations, C++ to DB type bridge, …): C++
•  RDBMS Query Processing (Greenplum, PostgreSQL, Hadoop with SQL)
Annotation: SQL, generated per specification

The MADlib Vision
•  Academic and industry contributions
•  Think of “CRAN for databases”
   –  Repository of open-source ML algorithms
   –  This time with data parallelism in mind
•  Open-Source Framework
•  BSD License
•  Eigen

3. Lack of language support for linear algebra
•  C++ Abstraction Layer uses Eigen
•  (Dense) Vectors and matrices: DOUBLE PRECISION[]
•  Example:

    AnyType
    solve::run(AnyType& args) {
        MappedMatrix A = args[0].getAs<MappedMatrix>();
        MappedColumnVector b = args[1].getAs<MappedColumnVector>();

        MutableMappedColumnVector x = allocateArray<double>(A.cols());
        x = A.colPivHouseholderQr().solve(b);
        return x;
    }

Performance:
•  No unnecessary copying
•  No internal type conversion

27
Performance trends

•  Overhead for a single row is very low (fraction of a second)
•  Able to achieve close to linear speedup

Background slide text (partly cut off in the extraction):
•  Disk I/O is not always the bottleneck
•  Performance tuning is essential
•  Overhead for single query very low (fraction of a second)
•  Greenplum achieves nearly perfect speedup

[Chart: OLS on 10 million rows (in seconds); series: # variables = 20, 40, 80, 160; x-axis: # segments (6, 12, 18, 24); y-axis: 0 to 40]
28
Performance Comparison with Apache Mahout
•  Analytics WorkBench (http://www.gopivotal.com/big-data/analytics-workbench)
   –  1000-node cluster located in Las Vegas
   –  Over 24,000 processors, 48 TB of memory, and 24 PB of raw disk storage
   –  8000+ Map Task Capacity, 5000+ Reduce Task Capacity
   –  Infrastructure: Pivotal HD 1.1
•  Mahout v0.7
•  Test matrix*
   –  Data size
      ▪  KDD Cup 2009 Orange marketing churn data (16.5 GB)
      ▪  Enron data (1.9 GB)
      ▪  Census data 2000 (1.7 GB)
   –  Algorithms: Logistic Regression and K-means
   –  Algorithm parameters (e.g. convergence threshold, # iterations)
* Reporting a subset of results from whitepaper.
Courtesy Grace Gee (Engineer, SOAR Program, Pivotal)

29
Logistic Regression
MADlib & Mahout Logistic Regression Scalability Across Number of Attributes

[Chart: y-axis Time in Minutes (0 to 700); x-axis log(Number of Rows) (1000000 to 1E+09); series: Census data, 48 attributes [Mahout] and Census data, 48 attributes [MADlib]]

Courtesy Grace Gee (Engineer, SOAR Program, Pivotal)

30
Logistic Regression

[Chart: y-axis Time in Minutes (0 to 9); x-axis log(Number of Rows) (1000000 to 1E+09)]

Courtesy Grace Gee (Engineer, SOAR Program, Pivotal)

31
K-Means
MADlib & Mahout K-means Scalability Across Number of Rows

[Chart: y-axis Time in Min (0 to 350); x-axis log(Number of Rows) (1000000 to 1E+09); series: Census data, 48 attributes [Mahout] and Census data, 48 attributes [MADlib]]

Courtesy Grace Gee (Engineer, SOAR Program, Pivotal)

32
Random Forest

[Chart: y-axis Time in Min (0 to 1600); x-axis log(Number of Rows) (1000000 to 1E+09); series: Census data, 46 attributes [Mahout] and Census data, 46 attributes [MADlib]]

Courtesy Grace Gee (Engineer, SOAR Program, Pivotal)

33
Part 1 Summary

MADlib is an easy-to-use library that
provides a SQL interface to fast,
scalable machine learning
algorithms …

35
But not all Data Scientists
speak SQL …
Accessing Scalability through R


36
Why R?

From the report: “The preponderance of R and Python usage is more surprising …
two most commonly used individual tools, even above Excel. R and Python are likely
popular because they are easily accessible and effective open source tools.”
O’Reilly: 2013 Data Science Salary Survey

37
PivotalR Design Overview

Execution in Database
•  Call MADlib’s in-DB machine learning functions directly from R
•  Syntax is analogous to native R function
•  Data doesn’t need to leave the database
•  All heavy lifting, including model estimation & computation, is done in the database

[Diagram: PivotalR (R → SQL; no data here) connects through RPostgreSQL to the Database w/ MADlib (data lives here). SQL to execute is sent to the database; computation results and model output come back.]

•  All data stays in DB: R objects merely point to DB objects
•  All model estimation and heavy lifting done in DB by MADlib
•  R → SQL translation done in the R client
•  Only strings of SQL and model output transferred across DBI

http://gopivotal.github.io/PivotalR/
© Copyright 2014 Pivotal. All rights reserved.
Courtesy Woo Jung and Hai Qian

38
Some current features

A wrapper of MADlib
•  Linear regression
•  Logistic regression
•  Elastic Net
•  ARIMA
•  Table summary
•  Categorical variable: as.factor()
•  predict

And more ... (SQL wrapper)
•  + - * / %% %/% ^
•  == != > < >= <= & | !
•  $ [ [[ $<- [<- [[<-
•  dim names
•  by sort merge
•  is.na
•  c mean sum sd var min max length colMeans colSums
•  db.data.frame as.db.data.frame
•  preview content
•  db.connect db.disconnect db.list db.objects db.existsObject delete

40
Demonstration
library(PivotalR)

Load the Library

db.connect(port = 14526, dbname = "madlib")

Connect to the database “madlib” on port 14526

db.objects()

List all the tables in the active connection

x <- db.data.frame("madlibtestdata.dt_abalone")

Create an R object that references a table in the database

dim(x)

Report #/rows and #/columns in the table

names(x)

Column names within the table

x$rings

Database query object representing “select rings from madlibtestdata.dt_abalone”

lookat(x, 10) # look at a sample of table

Pull 10 rows of data from the table back into the R environment

mean(x$rings)

Query object representing “select avg(rings) from madlibtestdata.dt_abalone”

lookat(mean(x$rings))

Execute the query and report back the result

fit <- madlib.lm(rings ~ . - id | sex, data = x)

Run a linear regression within the database and return a model object

predict(fit, x)

Create a query object representing scoring the model in the database

mean((x$rings - predict(fit, x))^2)

Query object calculating the mean square error of the model

x$sex <- as.factor(x$sex)

Add a calculated factor column to the database query object

m0 <- madlib.glm(resp ~ age, family = "binomial", data = dbbank)

Calculate a logistic regression model

mstep <- step(m0, scope = list(lower = ~ age,
    upper = ~ age + factor(marital) + factor(education) +
        factor(housing) + factor(loan) + factor(job)))

Perform stepwise feature selection

43
We’re looking for contributors
•  Browse our help pages
   –  Start page: madlib.net
   –  Github pages
      •  github.com/madlib/madlib (SQL)
      •  github.com/gopivotal/pivotalr (R)
      •  github.com/gopivotal/pymadlib (Python)
   –  Use our product and report issues:
      •  jira.madlib.net (Issue tracker)
      •  user@madlib.net (User forum)
•  Can use PostgreSQL or Greenplum Database Community Edition for installations on multiple platforms

44
Credits

Leaders and contributors:
Gavin Sherry
Caleb Welton
Joseph Hellerstein
Christopher Ré
Zhe Wang
Florian Schoppmann
Hai Qian
Shengwen Yang
Aaron Feng
and many others …

45
Thank you for your attention

Important links:
Product email: madlib@gopivotal.com
Product site: madlib.net
Speaker email: riyer@gopivotal.com

Pivotal Confidential–Internal Use Only

46

More Related Content

What's hot

The Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop EcosystemThe Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop EcosystemCloudera, Inc.
 
Simple, Modular and Extensible Big Data Platform Concept
Simple, Modular and Extensible Big Data Platform ConceptSimple, Modular and Extensible Big Data Platform Concept
Simple, Modular and Extensible Big Data Platform ConceptSatish Mohan
 
SQL and Machine Learning on Hadoop
SQL and Machine Learning on HadoopSQL and Machine Learning on Hadoop
SQL and Machine Learning on HadoopMukund Babbar
 
Summary machine learning and model deployment
Summary machine learning and model deploymentSummary machine learning and model deployment
Summary machine learning and model deploymentNovita Sari
 
Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...
Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...
Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...Spark Summit
 
Machine Learning and Hadoop
Machine Learning and HadoopMachine Learning and Hadoop
Machine Learning and HadoopJosh Patterson
 
Introduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystemIntroduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystemShivaji Dutta
 
Large Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkLarge Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkCloudera, Inc.
 
Machine Learning by Example - Apache Spark
Machine Learning by Example - Apache SparkMachine Learning by Example - Apache Spark
Machine Learning by Example - Apache SparkMeeraj Kunnumpurath
 
Practical Distributed Machine Learning Pipelines on Hadoop
Practical Distributed Machine Learning Pipelines on HadoopPractical Distributed Machine Learning Pipelines on Hadoop
Practical Distributed Machine Learning Pipelines on HadoopDataWorks Summit
 
The Zoo Expands: Labrador *Loves* Elephant, Thanks to Hamster
The Zoo Expands: Labrador *Loves* Elephant, Thanks to HamsterThe Zoo Expands: Labrador *Loves* Elephant, Thanks to Hamster
The Zoo Expands: Labrador *Loves* Elephant, Thanks to HamsterMilind Bhandarkar
 
MLlib and Machine Learning on Spark
MLlib and Machine Learning on SparkMLlib and Machine Learning on Spark
MLlib and Machine Learning on SparkPetr Zapletal
 
Integrating Existing C++ Libraries into PySpark with Esther Kundin
Integrating Existing C++ Libraries into PySpark with Esther KundinIntegrating Existing C++ Libraries into PySpark with Esther Kundin
Integrating Existing C++ Libraries into PySpark with Esther KundinDatabricks
 
Seattle Scalability Meetup - Ted Dunning - MapR
Seattle Scalability Meetup - Ted Dunning - MapRSeattle Scalability Meetup - Ted Dunning - MapR
Seattle Scalability Meetup - Ted Dunning - MapRclive boulton
 
Extending Hadoop for Fun & Profit
Extending Hadoop for Fun & ProfitExtending Hadoop for Fun & Profit
Extending Hadoop for Fun & ProfitMilind Bhandarkar
 
Unlocking Your Hadoop Data with Apache Spark and CDH5
Unlocking Your Hadoop Data with Apache Spark and CDH5Unlocking Your Hadoop Data with Apache Spark and CDH5
Unlocking Your Hadoop Data with Apache Spark and CDH5SAP Concur
 
Large Scale Machine learning with Spark
Large Scale Machine learning with SparkLarge Scale Machine learning with Spark
Large Scale Machine learning with SparkMd. Mahedi Kaysar
 
Introduction to the Hadoop Ecosystem (FrOSCon Edition)
Introduction to the Hadoop Ecosystem (FrOSCon Edition)Introduction to the Hadoop Ecosystem (FrOSCon Edition)
Introduction to the Hadoop Ecosystem (FrOSCon Edition)Uwe Printz
 

What's hot (20)

The Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop EcosystemThe Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop Ecosystem
 
Simple, Modular and Extensible Big Data Platform Concept
Simple, Modular and Extensible Big Data Platform ConceptSimple, Modular and Extensible Big Data Platform Concept
Simple, Modular and Extensible Big Data Platform Concept
 
Spark meetup TCHUG
Spark meetup TCHUGSpark meetup TCHUG
Spark meetup TCHUG
 
SQL and Machine Learning on Hadoop
SQL and Machine Learning on HadoopSQL and Machine Learning on Hadoop
SQL and Machine Learning on Hadoop
 
Summary machine learning and model deployment
Summary machine learning and model deploymentSummary machine learning and model deployment
Summary machine learning and model deployment
 
Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...
Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...
Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...
 
Machine Learning and Hadoop
Machine Learning and HadoopMachine Learning and Hadoop
Machine Learning and Hadoop
 
Introduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystemIntroduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystem
 
Hadoop Ecosystem
Hadoop EcosystemHadoop Ecosystem
Hadoop Ecosystem
 
Large Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkLarge Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache Spark
 
Machine Learning by Example - Apache Spark
Machine Learning by Example - Apache SparkMachine Learning by Example - Apache Spark
Machine Learning by Example - Apache Spark
 
Practical Distributed Machine Learning Pipelines on Hadoop
Practical Distributed Machine Learning Pipelines on HadoopPractical Distributed Machine Learning Pipelines on Hadoop
Practical Distributed Machine Learning Pipelines on Hadoop
 
The Zoo Expands: Labrador *Loves* Elephant, Thanks to Hamster
The Zoo Expands: Labrador *Loves* Elephant, Thanks to HamsterThe Zoo Expands: Labrador *Loves* Elephant, Thanks to Hamster
The Zoo Expands: Labrador *Loves* Elephant, Thanks to Hamster
 
MLlib and Machine Learning on Spark
MLlib and Machine Learning on SparkMLlib and Machine Learning on Spark
MLlib and Machine Learning on Spark
 
Integrating Existing C++ Libraries into PySpark with Esther Kundin
Integrating Existing C++ Libraries into PySpark with Esther KundinIntegrating Existing C++ Libraries into PySpark with Esther Kundin
Integrating Existing C++ Libraries into PySpark with Esther Kundin
 
Seattle Scalability Meetup - Ted Dunning - MapR
Seattle Scalability Meetup - Ted Dunning - MapRSeattle Scalability Meetup - Ted Dunning - MapR
Seattle Scalability Meetup - Ted Dunning - MapR
 
Extending Hadoop for Fun & Profit
Extending Hadoop for Fun & ProfitExtending Hadoop for Fun & Profit
Extending Hadoop for Fun & Profit
 
Unlocking Your Hadoop Data with Apache Spark and CDH5
Unlocking Your Hadoop Data with Apache Spark and CDH5Unlocking Your Hadoop Data with Apache Spark and CDH5
Unlocking Your Hadoop Data with Apache Spark and CDH5
 
Large Scale Machine learning with Spark
Large Scale Machine learning with SparkLarge Scale Machine learning with Spark
Large Scale Machine learning with Spark
 
Introduction to the Hadoop Ecosystem (FrOSCon Edition)
Introduction to the Hadoop Ecosystem (FrOSCon Edition)Introduction to the Hadoop Ecosystem (FrOSCon Edition)
Introduction to the Hadoop Ecosystem (FrOSCon Edition)
 

Viewers also liked

Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Scienc...
Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Scienc...Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Scienc...
Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Scienc...Sarah Aerni
 
Analytics Environment
Analytics EnvironmentAnalytics Environment
Analytics EnvironmentYuu Kimy
 
[D22] Pivotal HD 2.0 -業界最高レベルSQL on Hadoop技術「HAWQ」解説- by Masayuki Matsushita
[D22] Pivotal HD 2.0 -業界最高レベルSQL on Hadoop技術「HAWQ」解説- by Masayuki Matsushita[D22] Pivotal HD 2.0 -業界最高レベルSQL on Hadoop技術「HAWQ」解説- by Masayuki Matsushita
[D22] Pivotal HD 2.0 -業界最高レベルSQL on Hadoop技術「HAWQ」解説- by Masayuki MatsushitaInsight Technology, Inc.
 
About alteryx
About alteryxAbout alteryx
About alteryxYuu Kimy
 
[db tech showcase Sapporo 2015] C15:商用RDBをOSSへ Oracle to Postgres 徹底解説 by 株式会...
[db tech showcase Sapporo 2015] C15:商用RDBをOSSへ Oracle to Postgres 徹底解説 by 株式会...[db tech showcase Sapporo 2015] C15:商用RDBをOSSへ Oracle to Postgres 徹底解説 by 株式会...
[db tech showcase Sapporo 2015] C15:商用RDBをOSSへ Oracle to Postgres 徹底解説 by 株式会...Insight Technology, Inc.
 
データからインサイト そして、アイデアの発想へ(CJM/POV/HMW)
データからインサイト そして、アイデアの発想へ(CJM/POV/HMW)データからインサイト そして、アイデアの発想へ(CJM/POV/HMW)
データからインサイト そして、アイデアの発想へ(CJM/POV/HMW)Masanori Kado
 
はじパタ2章
はじパタ2章はじパタ2章
はじパタ2章tetsuro ito
 
ベイジアンモデリングによるマーケティングサイエンス〜状態空間モデルを用いたモデリング
ベイジアンモデリングによるマーケティングサイエンス〜状態空間モデルを用いたモデリングベイジアンモデリングによるマーケティングサイエンス〜状態空間モデルを用いたモデリング
ベイジアンモデリングによるマーケティングサイエンス〜状態空間モデルを用いたモデリング宏喜 佐野
 
10分でわかる主成分分析(PCA)
10分でわかる主成分分析(PCA)10分でわかる主成分分析(PCA)
10分でわかる主成分分析(PCA)Takanori Ogata
 
はじめよう多変量解析~主成分分析編~
はじめよう多変量解析~主成分分析編~はじめよう多変量解析~主成分分析編~
はじめよう多変量解析~主成分分析編~宏喜 佐野
 
Cloud Foundry Technical Overview
Cloud Foundry Technical OverviewCloud Foundry Technical Overview
Cloud Foundry Technical Overviewcornelia davis
 

Viewers also liked (14)

Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Scienc...
Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Scienc...Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Scienc...
Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Scienc...
 
Analytics Environment
Analytics EnvironmentAnalytics Environment
Analytics Environment
 
[D22] Pivotal HD 2.0 -業界最高レベルSQL on Hadoop技術「HAWQ」解説- by Masayuki Matsushita
[D22] Pivotal HD 2.0 -業界最高レベルSQL on Hadoop技術「HAWQ」解説- by Masayuki Matsushita[D22] Pivotal HD 2.0 -業界最高レベルSQL on Hadoop技術「HAWQ」解説- by Masayuki Matsushita
[D22] Pivotal HD 2.0 -業界最高レベルSQL on Hadoop技術「HAWQ」解説- by Masayuki Matsushita
 
About alteryx
About alteryxAbout alteryx
About alteryx
 
[db tech showcase Sapporo 2015] C15:商用RDBをOSSへ Oracle to Postgres 徹底解説 by 株式会...
[db tech showcase Sapporo 2015] C15:商用RDBをOSSへ Oracle to Postgres 徹底解説 by 株式会...[db tech showcase Sapporo 2015] C15:商用RDBをOSSへ Oracle to Postgres 徹底解説 by 株式会...
[db tech showcase Sapporo 2015] C15:商用RDBをOSSへ Oracle to Postgres 徹底解説 by 株式会...
 
データからインサイト そして、アイデアの発想へ(CJM/POV/HMW)
データからインサイト そして、アイデアの発想へ(CJM/POV/HMW)データからインサイト そして、アイデアの発想へ(CJM/POV/HMW)
データからインサイト そして、アイデアの発想へ(CJM/POV/HMW)
 
はじパタ2章
はじパタ2章はじパタ2章
はじパタ2章
 
ベイジアンモデリングによるマーケティングサイエンス〜状態空間モデルを用いたモデリング
ベイジアンモデリングによるマーケティングサイエンス〜状態空間モデルを用いたモデリングベイジアンモデリングによるマーケティングサイエンス〜状態空間モデルを用いたモデリング
ベイジアンモデリングによるマーケティングサイエンス〜状態空間モデルを用いたモデリング
 
10分でわかる主成分分析(PCA)
10分でわかる主成分分析(PCA)10分でわかる主成分分析(PCA)
10分でわかる主成分分析(PCA)
 
はじめよう多変量解析~主成分分析編~
はじめよう多変量解析~主成分分析編~はじめよう多変量解析~主成分分析編~
はじめよう多変量解析~主成分分析編~
 
Cloud Foundry Technical Overview
Cloud Foundry Technical OverviewCloud Foundry Technical Overview
Cloud Foundry Technical Overview
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
What is Big Data?
What is Big Data?What is Big Data?
What is Big Data?
 
Big data ppt
Big  data pptBig  data ppt
Big data ppt
 

Similar to Pivotal OSS meetup - MADlib and PivotalR

Challenges of Implementing an Advanced SQL Engine on Hadoop
Challenges of Implementing an Advanced SQL Engine on HadoopChallenges of Implementing an Advanced SQL Engine on Hadoop
Challenges of Implementing an Advanced SQL Engine on HadoopDataWorks Summit
 
Challenges of Building a First Class SQL-on-Hadoop Engine
Challenges of Building a First Class SQL-on-Hadoop EngineChallenges of Building a First Class SQL-on-Hadoop Engine
Challenges of Building a First Class SQL-on-Hadoop EngineNicolas Morales
 
MongoDB Administration 101
MongoDB Administration 101MongoDB Administration 101
MongoDB Administration 101MongoDB
 
DataMass Summit - Machine Learning for Big Data in SQL Server
DataMass Summit - Machine Learning for Big Data  in SQL ServerDataMass Summit - Machine Learning for Big Data  in SQL Server
DataMass Summit - Machine Learning for Big Data in SQL ServerŁukasz Grala
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
 
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in PythonThe Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in PythonMiklos Christine
 
Jss 2015 in memory and operational analytics
Jss 2015   in memory and operational analyticsJss 2015   in memory and operational analytics
Jss 2015 in memory and operational analyticsDavid Barbarin
 
[JSS2015] In memory and operational analytics
[JSS2015] In memory and operational analytics[JSS2015] In memory and operational analytics
[JSS2015] In memory and operational analyticsGUSS
 
A Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.pptA Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.pptSanket Shikhar
 
Big Data Ecosystem at LinkedIn. Keynote talk at Big Data Innovators Gathering...
Big Data Ecosystem at LinkedIn. Keynote talk at Big Data Innovators Gathering...Big Data Ecosystem at LinkedIn. Keynote talk at Big Data Innovators Gathering...
Big Data Ecosystem at LinkedIn. Keynote talk at Big Data Innovators Gathering...Mitul Tiwari
 
Hadoop & no sql new generation database systems
Hadoop & no sql   new generation database systemsHadoop & no sql   new generation database systems
Hadoop & no sql new generation database systemsramazan fırın
 
Sql Performance Tuning For Developers
Sql Performance Tuning For DevelopersSql Performance Tuning For Developers
Sql Performance Tuning For Developerssqlserver.co.il
 
Advanced Analytics in Hadoop
Advanced Analytics in HadoopAdvanced Analytics in Hadoop
Advanced Analytics in HadoopAnalyticsWeek
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...Jose Quesada (hiring)
 
Modern Big Data Analytics Tools: An Overview
Modern Big Data Analytics Tools: An OverviewModern Big Data Analytics Tools: An Overview
Modern Big Data Analytics Tools: An OverviewGreat Wide Open
 
Apache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
Apache Kylin: OLAP Engine on Hadoop - Tech Deep DiveApache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
Apache Kylin: OLAP Engine on Hadoop - Tech Deep DiveXu Jiang
 
MySQL performance monitoring using Statsd and Graphite (PLUK2013)
MySQL performance monitoring using Statsd and Graphite (PLUK2013)MySQL performance monitoring using Statsd and Graphite (PLUK2013)
MySQL performance monitoring using Statsd and Graphite (PLUK2013)spil-engineering
 
Structuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Structuring Spark: DataFrames, Datasets, and Streaming by Michael ArmbrustStructuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Structuring Spark: DataFrames, Datasets, and Streaming by Michael ArmbrustSpark Summit
 
EM12c: Capacity Planning with OEM Metrics
EM12c: Capacity Planning with OEM MetricsEM12c: Capacity Planning with OEM Metrics
EM12c: Capacity Planning with OEM MetricsMaaz Anjum
 

Pivotal OSS meetup - MADlib and PivotalR

  • 1. BUILT FOR THE SPEED OF BUSINESS Pivotal Confidential–Internal Use Only 1
  • 2. Pivotal OSS Meetups: Big Data Analytics. MADlib and PivotalR: Scalable Machine Learning for Massively Parallel Databases. Rahul Iyer, Senior Software Developer, Predictive Analytics. March 4th, 2014
  • 3. Agenda for the talk •  Introduce MADlib, a distributed machine learning library for SQL users •  How is scalability achieved by distributing the computation? •  Performance metrics + comparisons with Mahout •  A new R interface to access all of MADlib’s features •  How does it get big-data results with small-data efforts? •  Demo to showcase PivotalR
  • 4. What is Big data? •  Volumes of data … •  In various formats … •  From multiple sources … and Analytics? •  Generate insights … •  for informed decision-making
  • 5. Data → Information → Insights. Traditional analytics pipeline. [Diagram, time-to-insights: Data Prep, sample.csv, spec.docx, DB Extract, scores.csv, DB Import]
  • 6. The MAD approach. Data → Information → Insights. [Diagram, time-to-insights: Data Prep, Model, Score; billions of rows in minutes; reduced data movement; enterprise data in RDBMS]
  • 7. What is MADlib? The MADlib project was initiated in 2011 by Greenplum architects and Joe Hellerstein from Univ. of California, Berkeley. •  MAD stands for: •  lib stands for SQL library of: •  advanced (mathematical, statistical, machine learning) •  parallel & scalable in-database functions
  • 8. What is MADlib? The MADlib project was initiated in 2011 by Greenplum architects and Joe Hellerstein from Univ. of California, Berkeley. •  MAD stands for: •  lib stands for SQL library of: •  advanced (mathematical, statistical, machine learning) •  parallel & scalable in-database functions UrbanDictionary.com: mad (adj.): an adjective used to enhance a noun. 1- dude, you got skills. 2- dude, you got mad skills.
  • 9. Which platforms does it run on? Impala (partly ported), HAWQ, HDFS, GPDB, PostgreSQL
  • 10. Shared-Nothing Database Architecture: MPP (Massively Parallel Processing) •  Master servers (SQL, MapReduce): query planning & dispatch •  Network interconnect •  Segment servers: query processing & data storage •  External sources: loading, streaming, etc.
  • 11. Analytics Pipeline
       •  Data Prep: Aggregation, Normalizing, Pivoting, Filtering
       •  Data Exploration: Summary function, Sketch estimators, Percentiles, Correlation matrix
       •  Predictive Modeling / Data mining:
          –  Supervised Learning: Generalized Linear models (Linear Regression, Logistic Regression, Multinomial logit …), Decision Trees and Random Forest, Naive Bayes Classification, Support Vector Machines, Cox-Prop Hazards and more …
          –  Unsupervised Learning: Association Rules, k-Means Clustering, Low-rank Matrix Factorization, PCA, SVD Matrix Factorization
          –  Text analytics: CRF, LDA
       •  Scoring: Linear Regression, Logistic Regression, Naïve Bayes …
       •  Model fitness: Sampling methods (Cross Validation); Statistical metrics (Descriptive statistics, Goodness of fit, Inferential statistics, ROC)
       •  Support modules: Array operations, Sparse Vectors, Probability functions
  • 12. Example usage: Train a model; Predict for new data
  • 13. How do we implement scalability? Example: Linear Regression •  Finding linear dependencies between variables: y ≈ c0 + c1 · x1 + c2 · x2 ? [Plot: regressor (y) against predictor (x1)] •  Vector of dependent variables y and design matrix X:
          y     | x1   | x2
         -------+------+-----
          10.14 | 0    | 0.3
          11.93 | 0.69 | 0.6
          13.57 | 1.1  | 0.9
          14.17 | 1.39 | 1.2
          15.25 | 1.61 | 1.5
          16.15 | 1.79 | 1.8
  • 14. Challenges in computing OLS solution
  • 15. Challenges in computing OLS solution: $X^T = \begin{pmatrix} a & c \\ b & d \end{pmatrix}$, $X = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$. The rows of $X$, $(a, b)$ and $(c, d)$, are stored on different segments (Segment 1 and Segment 2).
  • 16. Challenges to compute OLS solution: the first entry of $X^T X$ is $a^2 + c^2$. Data across nodes are multiplied!
  • 17. Challenges to compute OLS solution: the first row of $X^T X$ is $(a^2 + c^2, \; ab + cd)$. Data across nodes are multiplied!
  • 18. Challenges to compute OLS solution: $X^T X = \begin{pmatrix} a^2 + c^2 & ab + cd \\ ba + dc & b^2 + d^2 \end{pmatrix}$. Looks like the result can be decomposed.
  • 19. Challenges to compute OLS solution: $X^T X = \begin{pmatrix} a \\ b \end{pmatrix} \begin{pmatrix} a & b \end{pmatrix} + \begin{pmatrix} c \\ d \end{pmatrix} \begin{pmatrix} c & d \end{pmatrix}$. Let’s change perspective.
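The change of perspective on slide 19 generalizes beyond the 2×2 case. As a sketch, writing $x_i$ for the $i$-th row of $X$ as a column vector and $y_i$ for the matching entry of $y$ (symbols assumed here, not taken from the slides), both products become sums of per-row terms, so each segment can add up its own rows independently:

```latex
X^T X = \sum_{i=1}^{n} x_i x_i^T
      = \underbrace{\sum_{i \in \text{Segment 1}} x_i x_i^T}_{\text{computed on Segment 1}}
      + \underbrace{\sum_{i \in \text{Segment 2}} x_i x_i^T}_{\text{computed on Segment 2}},
\qquad
X^T y = \sum_{i=1}^{n} x_i y_i
```

With $x_1 = (a, b)^T$ and $x_2 = (c, d)^T$ this reproduces the two outer products shown on slide 19.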
  • 20. Linear Regression: Streaming Algorithm. How to compute $\hat{\beta} = (X^T X)^{-1} X^T y$ with a single table scan? The two terms to accumulate are $X^T X$ and $X^T y$.
  • 21. Linear Regression: Parallel Computation. $X^T y = X_1^T y_1 + X_2^T y_2$: Segment 1 computes $X_1^T y_1$, Segment 2 computes $X_2^T y_2$, and the master combines them into $X^T y$.
  • 22. Basic Building Block: User-Defined Aggregates
       Input rows (x, y): x = (1,0,3,…,5), y = 3; x = (-2,4,5,…,2), y = 2; …
       Aggregation phase 1 on each node (map):
         1. Initialize: $(A, b) = (0, 0)$
         2. Transition for all rows: $(A, b) = (A, b) + (x x^T, x \cdot y)$
         3. Send $(A, b)$
       Aggregation phase 2 on master node (reduce):
         1. Merge: $(A, b) = (A, b) + (A, b)$
         2. Finalize: $\hat{\beta} = \mathrm{solve}(A, b) = A^{-1} \cdot b$
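The two aggregation phases on slide 22 can be sketched in plain Python. This is an illustrative stand-in, not MADlib's implementation: the function names (`init_state`, `transition`, `merge`, `finalize`, `aggregate_segment`), the toy rows, and the use of Gauss-Jordan elimination for `solve(A, b)` are all assumptions made for the example.

```python
from functools import reduce


def init_state(k):
    """Initialize: (A, b) = (0, 0) for k features."""
    return [[0.0] * k for _ in range(k)], [0.0] * k


def transition(state, row):
    """Transition: (A, b) += (x x^T, x * y) for one row (x, y)."""
    A, b = state
    x, y = row
    for i, xi in enumerate(x):
        b[i] += xi * y
        for j, xj in enumerate(x):
            A[i][j] += xi * xj
    return A, b


def merge(s1, s2):
    """Merge: add the partial states produced by two segments."""
    (A1, b1), (A2, b2) = s1, s2
    A = [[p + q for p, q in zip(r1, r2)] for r1, r2 in zip(A1, A2)]
    b = [p + q for p, q in zip(b1, b2)]
    return A, b


def finalize(state):
    """Finalize: solve A * beta = b by Gauss-Jordan elimination with partial pivoting."""
    A, b = state
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [v - factor * p for v, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def aggregate_segment(rows, k):
    """Phase 1: one scan over the rows held by a single segment."""
    return reduce(transition, rows, init_state(k))


# Hypothetical toy data following y = 1 + 2 * x1; each x carries a leading 1 for the intercept.
segment_1 = [((1.0, 0.0), 1.0), ((1.0, 1.0), 3.0)]
segment_2 = [((1.0, 2.0), 5.0), ((1.0, 3.0), 7.0)]

partials = [aggregate_segment(seg, 2) for seg in (segment_1, segment_2)]
beta = finalize(reduce(merge, partials))  # Phase 2 on the "master"
print(beta)
```

Because the state is only ever added to, the number of segments and the order of the merges do not change the result, which is what lets the database run phase 1 on every segment at once.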
  • 23. Problem solved? … Not yet •  Many ML solutions are iterative without analytical formulations [Flowchart: Initialize problem → Perform optimization step → Has converged? (false: perform another optimization step; true: return results)]
  • 24. Use a convex optimization framework
       -  Each step has an analytical formulation that can be performed in parallel
       [Figure 6: The Archetypical Convex Function $f(x) = x^2$.]
       [Table of execution times; row and column labels not recoverable]
       Application | Objective
       Least Squares | $\sum_{(u,y) \in \Omega} (x^T u - y)^2$
       Lasso [38] | $\sum_{(u,y) \in \Omega} (x^T u - y)^2 + \mu \|x\|_1$
       Logistic Regression | $\sum_{(u,y) \in \Omega} \log(1 + \exp(-y x^T u))$
       Classification (SVM) | $\sum_{(u,y) \in \Omega} (1 - y x^T u)_+$
       Recommendation | $\sum_{(i,j) \in \Omega} (L_i^T R_j - M_{ij})^2 + \mu \|L, R\|_F^2$
       Labeling (CRF) [40] | $\sum_k \left[ \sum_j x_j F_j(y_k, z_k) - \log Z(z_k) \right]$
       Overlaid earlier slide: 1. Lack of portable multi-pass iterations
       •  WITH RECURSIVE not reliable basis for portability
       •  User-defined driver functions in Python
          –  Outer loops not performance-critical
       •  Compromise: different user interface
       Driver loop:
         CREATE TEMP TABLE temp
         INSERT INTO temp SELECT step(...) FROM ...
         SELECT converged(...) FROM temp, ...   (false: run step again; true: continue)
         SELECT result(...) FROM temp
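The driver pattern on slide 24 (an outer loop in Python that repeatedly calls a parallelizable step until a convergence check passes) might look roughly like the sketch below. The slide does not say which optimizer is used, so plain gradient descent on the least-squares objective is swapped in here as an assumption; `step`, `converged`, and `driver` mirror the names in the slide's SQL, but their signatures, the step size, and the toy rows are hypothetical. In the database, `step` would be an aggregate over a table rather than a loop over a Python list.

```python
def step(state, rows, step_size):
    """One optimization step for sum((x . u - y)^2).

    The gradient is a sum over rows, so each segment could compute
    its share independently before the shares are added together.
    """
    grad = [0.0] * len(state)
    for u, y in rows:
        residual = sum(xi * ui for xi, ui in zip(state, u)) - y
        for i, ui in enumerate(u):
            grad[i] += 2.0 * residual * ui
    return [xi - step_size * g for xi, g in zip(state, grad)]


def converged(previous, current, tolerance):
    """Convergence check: the largest coordinate change is below the tolerance."""
    return max(abs(p - c) for p, c in zip(previous, current)) < tolerance


def driver(rows, dim, step_size=0.02, tolerance=1e-10, max_iterations=10000):
    """Outer loop: cheap control flow around the expensive step."""
    state = [0.0] * dim
    for iteration in range(1, max_iterations + 1):
        new_state = step(state, rows, step_size)
        if converged(state, new_state, tolerance):
            return new_state, iteration
        state = new_state
    return state, max_iterations


# Hypothetical toy rows (u, y) following y = 1 + 2 * u1, with a leading 1 for the intercept.
rows = [((1.0, 0.0), 1.0), ((1.0, 1.0), 3.0), ((1.0, 2.0), 5.0), ((1.0, 3.0), 7.0)]
solution, iterations = driver(rows, dim=2)
print(solution, iterations)
```

The outer loop does almost no work per iteration, which matches the slide's point that outer loops are not performance-critical and can live in a Python driver.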
  • 25. Architecture (layers, top to bottom)
       •  User Interface (SQL, generated per specification)
       •  High-level Abstraction Layer (iteration controller, ...), in Python
       •  Functions for Inner Loops (implements convex optimization); RDBMS Built-in Functions
       •  Low-level Abstraction Layer (matrix operations, C++ to DB type bridge, …), in C++
       •  RDBMS Query Processing (Greenplum, PostgreSQL, Hadoop with SQL)
       Overlaid earlier slide: 3. Lack of language support for linear algebra
       •  C++ Abstraction Layer uses Eigen
       •  (Dense) Vectors and matrices: DOUBLE PRECISION[]
       •  Example:
            AnyType solve::run(AnyType& args) {
                // Map the two database arrays onto Eigen types
                MappedMatrix A = args[0].getAs<MappedMatrix>();
                MappedColumnVector b = args[1].getAs<MappedColumnVector>();

                // Allocate the result and solve A x = b with a QR decomposition
                MutableMappedColumnVector x = allocateArray<double>(A.cols());
                x = A.colPivHouseholderQr().solve(b);
                return x;
            }
       •  Performance: no unnecessary copying; no internal type conversion
       Overlaid background slide: The MADlib Vision •  Academic and industry contributions •  Think of “CRAN for databases” –  Repository of open-source ML algorithms –  This time with data parallelism in mind •  Open-Source Framework: BSD License; Eigen
  • 26. Performance trends •  Overhead for a single row is very low (fraction of a second) •  Able to achieve close to linear speedup [Chart: OLS on 10 million rows (in seconds) against # segments (6, 12, 18, 24), one series per # variables (20, 40, 80, 160); y-axis from 0 to 40] [Overlaid earlier slide, partly cut off: Disk I/O is not always the bottleneck; performance tuning is essential; overhead for single query very low (fraction of a second); Greenplum achieves nearly perfect speedup]
  • 27. Performance Comparison with Apache Mahout
       •  Analytics WorkBench (http://www.gopivotal.com/big-data/analytics-workbench)
          –  1000-node cluster located in Las Vegas
          –  Over 24,000 processors, 48 TB of memory, and 24 PB of raw disk storage
          –  8000+ Map Task Capacity, 5000+ Reduce Task Capacity
          –  Infrastructure: Pivotal HD 1.1
       •  Mahout v0.7
       •  Test matrix*
          –  Data size: KDD Cup 2009 Orange marketing churn data (16.5 GB); Enron data (1.9 GB); Census data 2000 (1.7 GB)
          –  Algorithms: Logistic Regression and K-means
          –  Algorithm parameters (e.g. convergence threshold, # iterations)
       * Reporting a subset of results from whitepaper. Courtesy Grace Gee (Engineer, SOAR Program, Pivotal)
  • 28. Logistic Regression [Chart: MADlib & Mahout Logistic Regression Scalability Across Number of Attributes; time in minutes against log(number of rows), 1,000,000 to 1E+09; series: Census data, 48 attributes (Mahout) and Census data, 48 attributes (MADlib)] Courtesy Grace Gee (Engineer, SOAR Program, Pivotal)
  • 29. Logistic Regression [Chart: time in minutes (0 to 9) against log(number of rows), 1,000,000 to 1E+09] Courtesy Grace Gee (Engineer, SOAR Program, Pivotal)
  • 30. K-Means [Chart: MADlib & Mahout K-means Scalability Across Number of Rows; time in minutes against log(number of rows), 1,000,000 to 1E+09; series: Census data, 48 attributes (Mahout) and Census data, 48 attributes (MADlib)] Courtesy Grace Gee (Engineer, SOAR Program, Pivotal)
  • 31. Random Forest [Chart: time in minutes against log(number of rows), 1,000,000 to 1E+09; series: Census data, 46 attributes (Mahout) and Census data, 46 attributes (MADlib)] Courtesy Grace Gee (Engineer, SOAR Program, Pivotal)
  • 32. Part 1 Summary: MADlib is an easy-to-use library that provides a SQL interface to fast, scalable machine learning algorithms …
  • 33. But not all Data Scientists speak SQL … Accessing Scalability through R
  • 34. Why R? From the report: “The preponderance of R and Python usage is more surprising … two most commonly used individual tools, even above Excel. R and Python are likely popular because they are easily accessible and effective open source tools.” O’Reilly: 2013 Data Science Salary Survey
  • 35. PivotalR Design Overview
       •  Call MADlib’s in-DB machine learning functions directly from R
       •  Syntax is analogous to native R functions
       •  Data doesn’t need to leave the database
       •  All heavy lifting, including model estimation & computation, is done in the database
       [Diagram: PivotalR (R → SQL, no data here) sends SQL to execute through RPostgreSQL to the database w/ MADlib (data lives here), which returns computation results and model output]
       •  All data stays in DB: R objects merely point to DB objects
       •  All model estimation and heavy lifting done in DB by MADlib
       •  R → SQL translation done in the R client
       •  Only strings of SQL and model output transferred across DBI
       http://gopivotal.github.io/PivotalR/ © Copyright 2014 Pivotal. All rights reserved. Courtesy Woo Jung and Hai Qian
  • 36. Some of current features
       •  A wrapper of MADlib: Linear regression, Logistic regression, Elastic Net, ARIMA, Table summary
       •  Categorical variable: as.factor()
       •  SQL wrapper: + - * / %% %/% ^, == != > < >= <=, & | !, $ [ [[ $<- [<- [[<-, dim, names, by, sort, merge, is.na, c, mean, sum, sd, var, min, max, length, colMeans, colSums
       •  Other functions: db.connect, db.disconnect, db.list, db.objects, db.existsObject, delete, db.data.frame, as.db.data.frame, preview, content, predict
       •  And more ...
  • 37. Demonstration
       # Load the library
       library(PivotalR)
       # Connect to the database "madlib" on port 14526
       db.connect(port = 14526, dbname = "madlib")
       # List all the tables in the active connection
       db.objects()
       # Create an R object that references a table in the database
       x <- db.data.frame("madlibtestdata.dt_abalone")
       # Report the number of rows and columns in the table
       dim(x)
       # Column names within the table
       names(x)
       # Database query object representing "select rings from madlibtestdata.dt_abalone"
       x$rings
       # Pull 10 rows of data from the table back into the R environment
       lookat(x, 10)
       # Query object representing "select avg(rings) from madlibtestdata.dt_abalone"
       mean(x$rings)
       # Execute the query and report back the result
       lookat(mean(x$rings))
       # Run a linear regression within the database and return a model object
       fit <- madlib.lm(rings ~ . - id | sex, data = x)
       # Create a query object representing scoring the model in the database
       predict(fit, x)
       # Query object calculating the mean square error of the model
       mean((x$rings - predict(fit, x))^2)
       # Add a calculated factor column to the database query object
       x$sex <- as.factor(x$sex)
       # Calculate a logistic regression model (dbbank is a table object not defined on this slide)
       m0 <- madlib.glm(resp ~ age, family = "binomial", data = dbbank)
       # Perform stepwise feature selection
       mstep <- step(m0, scope = list(lower = ~ age, upper = ~ age + factor(marital) + factor(education) + factor(housing) + factor(loan) + factor(job)))
  • 38. We’re looking for contributors
       •  Browse our help pages
          –  Start page: madlib.net
          –  GitHub pages: github.com/madlib/madlib (SQL), github.com/gopivotal/pivotalr (R), github.com/gopivotal/pymadlib (Python)
          –  Use our product and report issues: jira.madlib.net (issue tracker), user@madlib.net (user forum)
       •  Can use PostgreSQL or Greenplum Database Community Edition for installations on multiple platforms
  • 39. Credits. Leaders and contributors: Gavin Sherry, Caleb Welton, Joseph Hellerstein, Christopher Ré, Zhe Wang, Florian Schoppmann, Hai Qian, Shengwen Yang, Aaron Feng, and many others … [Background slide: The MADlib Vision •  Academic and industry contributions •  Think of “CRAN for databases” –  Repository of open-source ML algorithms –  This time with data parallelism in mind •  Open-Source Framework: BSD License; Eigen]
  • 40. Thank you for your attention. Important links: Product email: madlib@gopivotal.com; Product site: madlib.net; Speaker email: riyer@gopivotal.com