Pivotal OSS meetup - MADlib and PivotalR

With the explosion of big data, the need for fast and inexpensive analytics solutions has become a key basis of competition in many industries. Extracting the value of big data with analytics can be complex, and requires advanced skills.

At Pivotal, we are building open-source solutions (MADlib, PivotalR, PyMadlib) to simplify this process for the user, while maintaining the efficiency necessary for big data analysis.

This talk introduces MADlib, an open-source library of SQL-based algorithms for machine learning, data mining, and statistics that run at large scale within a database engine, with no need to import or export data to other tools.

It provides an overview of the library’s architecture and compares various statistical methods with those available in Apache Mahout.

We also introduce PivotalR, an R-based wrapper for MADlib that gives data scientists and programmers access to the power of MADlib along with the ease of use of R.

Pivotal OSS meetup - MADlib and PivotalR

  1. BUILT FOR THE SPEED OF BUSINESS Pivotal Confidential–Internal Use Only
  2. Pivotal OSS Meetups: Big Data Analytics. MADlib and PivotalR: Scalable Machine Learning for Massively Parallel Databases. Rahul Iyer, Senior Software Developer, Predictive Analytics. March 4th, 2014
  3. Agenda for the talk •  Introduce MADlib, a distributed machine learning library for SQL users •  How is scalability achieved by distributing the computation? •  Performance metrics and comparisons with Mahout •  A new R interface to access all of MADlib’s features •  How does it get big-data results with small-data effort? •  Demo to showcase PivotalR
  4. What is Big Data? •  Volumes of data … •  In various formats … •  From multiple sources … and Analytics? •  Generate insights … •  for informed decision-making
  5. Traditional analytics pipeline: Data → Information → Insights. [Figure: time-to-insights diagram with the stages DB Extract, Data Prep, and DB Import, and the file artifacts sample.csv, spec.docx, and scores.csv]
  6. The MAD approach: Data → Information → Insights. [Figure: time-to-insights diagram with the stages Data Prep, Model, and Score, all running on enterprise data inside the RDBMS; labels: billions of rows in minutes, reduced data movement]
  7. What is MADlib? The MADlib project was initiated in 2011 by Greenplum architects and Joe Hellerstein of the University of California, Berkeley. •  MAD stands for: •  lib stands for SQL library of: –  advanced (mathematical, statistical, machine learning) –  parallel & scalable in-database functions
  8. What is MADlib? The MADlib project was initiated in 2011 by Greenplum architects and Joe Hellerstein of the University of California, Berkeley. •  MAD stands for: •  lib stands for SQL library of: –  advanced (mathematical, statistical, machine learning) –  parallel & scalable in-database functions •  UrbanDictionary.com: mad (adj.): an adjective used to enhance a noun. 1- dude, you got skills. 2- dude, you got mad skills.
  9. Which platforms does it run on? •  PostgreSQL •  GPDB •  HAWQ (on HDFS) •  Impala (partly ported)
  10. Shared-Nothing Database Architecture: MPP (Massively Parallel Processing). [Figure: SQL and MapReduce requests arrive at the master servers (query planning & dispatch), which connect through the network interconnect to the segment servers (query processing & data storage); external sources feed loading, streaming, etc.]
  11. MADlib functions along the analytics pipeline (Data Prep, Data Exploration, Predictive Modeling, Scoring) •  Data Prep: Aggregation, Normalizing, Pivoting, Filtering •  Data Exploration: Summary function, Sketch estimators, Percentiles, Correlation matrix •  Supervised Learning: Generalized Linear models (Linear Regression, Logistic Regression, Multinomial logit …), Decision Trees and Random Forest, Naive Bayes Classification, Support Vector Machines, Cox-Prop Hazards and more … •  Unsupervised Learning: Association Rules, k-Means Clustering, Low-rank Matrix Factorization, PCA, SVD Matrix Factorization •  Text analytics: CRF, LDA •  Scoring: Linear Regression, Logistic Regression, Naïve Bayes … •  Sampling methods: Cross Validation •  Statistical metrics (model fitness): Descriptive statistics, Goodness of fit, Inferential statistics, ROC •  Support modules: Array operations, Sparse Vectors, Probability functions
  12. Example usage •  Train a model •  Predict for new data
  13. How do we implement scalability? Example: Linear Regression •  Finding linear dependencies between variables: y ≈ c0 + c1 · x1 + c2 · x2 •  y is the vector of dependent variables; the columns x1 and x2 form the design matrix X. [Figure: plot of Regressor (y) against Predictor (x1)]
        y    |  x1  | x2
      -------+------+-----
       10.14 | 0    | 0.3
       11.93 | 0.69 | 0.6
       13.57 | 1.1  | 0.9
       14.17 | 1.39 | 1.2
       15.25 | 1.61 | 1.5
       16.15 | 1.79 | 1.8
  14. Challenges in computing the OLS solution
  15. Challenges in computing the OLS solution: $X^T X = \begin{pmatrix} a & c \\ b & d \end{pmatrix} \begin{pmatrix} a & b \\ c & d \end{pmatrix}$, where row $(a \; b)$ of $X$ is stored on Segment 1 and row $(c \; d)$ on Segment 2
  16. Challenges in computing the OLS solution: $X^T X = \begin{pmatrix} a & c \\ b & d \end{pmatrix} \begin{pmatrix} a & b \\ c & d \end{pmatrix} = \begin{pmatrix} a^2 + c^2 & \cdot \\ \cdot & \cdot \end{pmatrix}$ Data across nodes are multiplied!
  17. Challenges in computing the OLS solution: $X^T X = \begin{pmatrix} a & c \\ b & d \end{pmatrix} \begin{pmatrix} a & b \\ c & d \end{pmatrix} = \begin{pmatrix} a^2 + c^2 & ab + cd \\ \cdot & \cdot \end{pmatrix}$ Data across nodes are multiplied!
  18. Challenges in computing the OLS solution: $X^T X = \begin{pmatrix} a & c \\ b & d \end{pmatrix} \begin{pmatrix} a & b \\ c & d \end{pmatrix} = \begin{pmatrix} a^2 + c^2 & ab + cd \\ ba + dc & b^2 + d^2 \end{pmatrix}$ Looks like the result can be decomposed
  19. Challenges in computing the OLS solution: $X^T X = \begin{pmatrix} a^2 + c^2 & ab + cd \\ ba + dc & b^2 + d^2 \end{pmatrix} = \begin{pmatrix} a \\ b \end{pmatrix} \begin{pmatrix} a & b \end{pmatrix} + \begin{pmatrix} c \\ d \end{pmatrix} \begin{pmatrix} c & d \end{pmatrix}$ Let’s change perspective
  20. Linear Regression: Streaming Algorithm •  How to compute $(X^T X)^{-1} X^T y$ with a single table scan? •  Quantities needed: $X^T X$ and $X^T y$
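The single-scan question can be answered with a short identity. The block below is a sketch in standard notation, not taken from the slides: the row index $i$, the row count $n$, and the symbol $x_i$ (the $i$-th row of $X$ written as a column vector) are assumptions introduced here. Both matrices in the OLS solution are sums of per-row terms, so one pass over the table is enough to accumulate them, and rows stored on different segments can be summed independently.

```latex
\hat{\beta} = (X^T X)^{-1} X^T y,
\qquad
X^T X = \sum_{i=1}^{n} x_i x_i^T,
\qquad
X^T y = \sum_{i=1}^{n} x_i y_i
```

For the two-row example on the preceding slides, the first sum is exactly the decomposition into the outer products of the rows $(a \; b)$ and $(c \; d)$.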
  21. Linear Regression: Parallel Computation •  Segment 1 computes $X_1^T y_1$ •  Segment 2 computes $X_2^T y_2$ •  Master combines them: $X^T y = X_1^T y_1 + X_2^T y_2$
  22. Basic Building Block: User-Defined Aggregates
      Input rows: x = (1, 0, 3, …, 5), (−2, 4, 5, …, 2), …; y = 3, 2, …; aggregate state: (A, b)
      Aggregation phase 1 on each node (map):
      1. Initialize: $(A, b) = (0, 0)$
      2. Transition for all rows: $(A, b) = (A, b) + (x \cdot x^T, x \cdot y)$
      3. Send $(A, b)$
      Aggregation phase 2 on master node (reduce):
      1. Merge: $(A, b) = (A, b) + (A_k, b_k)$ for the state $(A_k, b_k)$ sent by each node
      2. Finalize: $\hat{\beta} = \mathrm{solve}(A, b) = A^{-1} \cdot b$
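The two aggregation phases can be sketched in plain Python. This is an illustration of the initialize / transition / merge / finalize pattern only, not MADlib's implementation: the function names (`init`, `transition`, `merge`, `finalize`, `linregr`) are hypothetical, segments are simulated as Python lists, and the solver is a small Gaussian elimination that assumes A is non-singular. The sample rows are the six rows from the linear regression slide, with a leading 1 added for the intercept.

```python
from functools import reduce


def init(k):
    """Initialize: (A, b) = (0, 0) for k features."""
    return [[0.0] * k for _ in range(k)], [0.0] * k


def transition(state, x, y):
    """Transition: (A, b) = (A, b) + (x x^T, x y) for one row."""
    A, b = state
    for i, xi in enumerate(x):
        b[i] += xi * y
        for j, xj in enumerate(x):
            A[i][j] += xi * xj
    return A, b


def merge(s1, s2):
    """Merge: add two partial states element-wise."""
    (A1, b1), (A2, b2) = s1, s2
    A = [[p + q for p, q in zip(r1, r2)] for r1, r2 in zip(A1, A2)]
    b = [p + q for p, q in zip(b1, b2)]
    return A, b


def finalize(state):
    """Finalize: solve A beta = b (assumes A is non-singular)."""
    A, b = state
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: bring the largest entry of this column up.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    beta = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        known = sum(M[i][j] * beta[j] for j in range(i + 1, n))
        beta[i] = (M[i][n] - known) / M[i][i]
    return beta


def aggregate_segment(rows, k):
    """Phase 1: one pass over the rows held by a single segment."""
    state = init(k)
    for x, y in rows:
        state = transition(state, x, y)
    return state


def linregr(segments, k):
    """Phase 2: merge the per-segment states, then finalize."""
    partials = [aggregate_segment(rows, k) for rows in segments]
    return finalize(reduce(merge, partials))


# Rows from the linear regression slide as ((1, x1, x2), y).
rows = [
    ((1.0, 0.0, 0.3), 10.14), ((1.0, 0.69, 0.6), 11.93),
    ((1.0, 1.1, 0.9), 13.57), ((1.0, 1.39, 1.2), 14.17),
    ((1.0, 1.61, 1.5), 15.25), ((1.0, 1.79, 1.8), 16.15),
]
print(linregr([rows[:3], rows[3:]], k=3))  # estimates of c0, c1, c2
```

Because the state is a plain sum, the result does not depend on how the rows are split across segments; only the small (A, b) state travels to the master, never the rows themselves.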
  23. Problem solved? … Not yet •  Many ML solutions are iterative, without analytical formulations. [Flowchart: Initialize problem → Perform optimization step → Has converged? (false: perform another optimization step; true: Return results)]
  24. Use a convex optimization framework
      -  Each step has an analytical formulation that can be performed in parallel
      [Figure 6: The Archetypical Convex Function $f(x) = x^2$]
      Application | Objective
      Least Squares | $\sum_{(u,y) \in \Omega} (x^T u - y)^2$
      Lasso [38] | $\sum_{(u,y) \in \Omega} (x^T u - y)^2 + \mu \|x\|_1$
      Logistic Regression | $\sum_{(u,y) \in \Omega} \log(1 + \exp(-y x^T u))$
      Classification (SVM) | $\sum_{(u,y) \in \Omega} (1 - y x^T u)_+$
      Recommendation | $\sum_{(i,j) \in \Omega} (L_i^T R_j - M_{ij})^2 + \mu \|L, R\|_F^2$
      Labeling (CRF) [40] | $\sum_k \left[ \sum_j x_j F_j(y_k, z_k) - \log Z(z_k) \right]$
      1. Lack of portable multi-pass iterations
      •  WITH RECURSIVE is not a reliable basis for portability
      •  User-defined driver functions in Python
         –  Outer loops are not performance-critical
      •  Compromise: different user interface
      Driver loop in SQL: CREATE TEMP TABLE temp → INSERT INTO temp SELECT step(...) FROM ... → SELECT converged(...) FROM temp, ... (false: repeat the INSERT step; true: continue) → SELECT result(...) FROM temp
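The step / converged / result loop can be sketched as a small Python driver. This is a hedged illustration, not MADlib's driver code: it minimizes the scalar least-squares objective $\sum (x u - y)^2$ by gradient descent, the segments are simulated as Python lists, and the names `partial_gradient`, `step`, `converged`, and `driver`, as well as the learning rate and tolerance, are assumptions chosen for the example.

```python
def partial_gradient(x, rows):
    """Per-segment aggregate: gradient of sum (x*u - y)^2 over local rows."""
    return sum(2.0 * u * (x * u - y) for u, y in rows)


def step(x, segments, learning_rate):
    """One optimization step: sum the per-segment gradients, update the model."""
    gradient = sum(partial_gradient(x, rows) for rows in segments)
    return x - learning_rate * gradient


def converged(previous, current, tolerance):
    """Convergence test on the change in the model between iterations."""
    return abs(current - previous) < tolerance


def driver(segments, learning_rate=0.01, tolerance=1e-9, max_iterations=100):
    """Outer loop: cheap control flow around the parallel inner step."""
    x = 0.0  # initialize problem
    for iteration in range(1, max_iterations + 1):
        new_x = step(x, segments, learning_rate)
        done = converged(x, new_x, tolerance)
        x = new_x
        if done:
            return x, iteration, True
    return x, max_iterations, False  # iteration cap reached


# Two simulated segments holding (u, y) pairs generated by y = 3 * u.
segments = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
model, iterations, ok = driver(segments)
print(model, iterations, ok)
```

The outer loop runs only a handful of times, which is why the slide treats it as not performance-critical; the work that scales with the data sits inside `step`, where each segment contributes an independent partial sum.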
  25. Architecture (layers, top to bottom)
      •  User Interface: SQL, generated per specification
      •  High-level Abstraction Layer (iteration controller, ...): Python
      •  Functions for Inner Loops (implements convex optimization) and RDBMS Built-in Functions: C++
      •  Low-level Abstraction Layer (matrix operations, C++ to DB type bridge, …): C++
      •  RDBMS Query Processing (Greenplum, PostgreSQL, Hadoop with SQL)
      3. Lack of language support for linear algebra
      •  C++ Abstraction Layer uses Eigen
      •  (Dense) vectors and matrices: DOUBLE PRECISION[]
      •  Example:
         AnyType solve::run(AnyType& args) {
             MappedMatrix A = args[0].getAs<MappedMatrix>();
             MappedColumnVector b = args[1].getAs<MappedColumnVector>();
             MutableMappedColumnVector x = allocateArray<double>(A.cols());
             x = A.colPivHouseholderQr().solve(b);
             return x;
         }
      •  Performance:
         –  No unnecessary copying
         –  No internal type conversion
  26. Performance trends •  Disk I/O is not always the bottleneck •  Performance tuning is essential •  Overhead for a single row is very low (fraction of a second) •  Able to achieve close to linear speedup. [Figure: OLS on 10 million rows (in seconds); x-axis: # segments (6, 12, 18, 24); y-axis: 0 to 40 seconds; series: # variables = 20, 40, 80, 160]
  27. Performance Comparison with Apache Mahout •  Analytics WorkBench (http://www.gopivotal.com/big-data/analytics-workbench) –  1000-node cluster located in Las Vegas –  Over 24,000 processors, 48 TB of memory, and 24 PB of raw disk storage –  8000+ map task capacity, 5000+ reduce task capacity –  Infrastructure: Pivotal HD 1.1 •  Mahout v0.7 •  Test matrix* –  Data size ▪  KDD Cup 2009 Orange marketing churn data (16.5 GB) ▪  Enron data (1.9 GB) ▪  Census data 2000 (1.7 GB) –  Algorithms: Logistic Regression and K-means –  Algorithm parameters (e.g. convergence threshold, # iterations) * Reporting a subset of results from whitepaper. Courtesy Grace Gee (Engineer, SOAR Program, Pivotal)
  28. Logistic Regression. [Figure: MADlib & Mahout Logistic Regression Scalability Across Number of Attributes; series: Census data, 48 attributes [Mahout] and Census data, 48 attributes [MADlib]; x-axis: log(Number of Rows), 1,000,000 to 1E+09; y-axis: Time in Minutes, 0 to 700] Courtesy Grace Gee (Engineer, SOAR Program, Pivotal)
  29. Logistic Regression. [Figure: x-axis: log(Number of Rows), 1,000,000 to 1E+09; y-axis: Time in Minutes, 0 to 9] Courtesy Grace Gee (Engineer, SOAR Program, Pivotal)
  30. K-Means. [Figure: MADlib & Mahout K-means Scalability Across Number of Rows; series: Census data, 48 attributes [Mahout] and Census data, 48 attributes [MADlib]; x-axis: log(Number of Rows), 1,000,000 to 1E+09; y-axis: Time in Min, 0 to 350] Courtesy Grace Gee (Engineer, SOAR Program, Pivotal)
  31. Random Forest. [Figure: series: Census data, 46 attributes [Mahout] and Census data, 46 attributes [MADlib]; x-axis: log(Number of Rows), 1,000,000 to 1E+09; y-axis: Time in Min, 0 to 1600] Courtesy Grace Gee (Engineer, SOAR Program, Pivotal)
  32. Part 1 Summary: MADlib is an easy-to-use library that provides a SQL interface to fast, scalable machine learning algorithms …
  33. But not all Data Scientists speak SQL … Accessing Scalability through R
  34. Why R? From the report: “The preponderance of R and Python usage is more surprising … two most commonly used individual tools, even above Excel. R and Python are likely popular because they are easily accessible and effective open source tools.” O’Reilly: 2013 Data Science Salary Survey
  35. PivotalR Design Overview •  Call MADlib’s in-DB machine learning functions directly from R •  Syntax is analogous to native R functions •  Data doesn’t need to leave the database •  All heavy lifting, including model estimation & computation, is done in the database. [Figure: the R client runs PivotalR (R → SQL; no data here) and sends SQL to execute through RPostgreSQL to the database w/ MADlib (data lives here), which returns computation results and model output] •  All data stays in DB: R objects merely point to DB objects •  All model estimation and heavy lifting done in DB by MADlib •  R → SQL translation done in the R client •  Only strings of SQL and model output transferred across DBI. http://gopivotal.github.io/PivotalR/ Courtesy Woo Jung and Hai Qian
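The design point that client-side objects "merely point to DB objects" and that only SQL strings cross the connection can be illustrated with a small Python sketch. It is not PivotalR's implementation (which is written in R); the names `TableRef`, `Expr`, and `lookat`, and the `run_sql` callback standing in for a database connection, are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Expr:
    """A lazily built SQL expression over one table; holds text, never rows."""
    table: str
    sql_expr: str

    def mean(self) -> "Expr":
        # Wrap the expression in an aggregate; still nothing is executed.
        return Expr(self.table, f"avg({self.sql_expr})")

    def to_sql(self) -> str:
        return f"select {self.sql_expr} from {self.table}"


@dataclass(frozen=True)
class TableRef:
    """Client-side handle that only remembers the table's name."""
    name: str

    def column(self, column: str) -> Expr:
        return Expr(self.name, column)


def lookat(expr: Expr, run_sql: Callable[[str], object]) -> object:
    """Send the SQL string to the database and return whatever comes back."""
    return run_sql(expr.to_sql())


x = TableRef("madlibtestdata.dt_abalone")
query = x.column("rings").mean()
print(query.to_sql())  # select avg(rings) from madlibtestdata.dt_abalone
```

Nothing touches the database until `lookat` is called with a real executor, which mirrors the split described on the slide: translation happens in the client, computation happens where the data lives.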
  36. Some of the current features •  A wrapper of MADlib: Linear regression, Logistic regression, Elastic Net, ARIMA, Table summary •  Categorical variable: as.factor() •  predict •  SQL wrapper: –  + - * / %% %/% ^ –  == != > < >= <= & | ! –  $ [ [[ $<- [<- [[<- –  dim names by sort merge c is.na preview content –  mean sum sd var min max length colMeans colSums –  db.data.frame as.db.data.frame •  db.connect db.disconnect db.list db.objects db.existsObject delete •  And more ...
  37. Demonstration
      library(PivotalR)  # Load the library
      db.connect(port = 14526, dbname = "madlib")  # Connect to the database “madlib” on port 14526
      db.objects()  # List all the tables in the active connection
      x <- db.data.frame("madlibtestdata.dt_abalone")  # Create an R object that references a table in the database
      dim(x)  # Report the number of rows and columns in the table
      names(x)  # Column names within the table
      x$rings  # Database query object representing “select rings from madlibtestdata.dt_abalone”
      lookat(x, 10)  # Look at a sample of the table: pull 10 rows of data back into the R environment
      mean(x$rings)  # Query object representing “select avg(rings) from madlibtestdata.dt_abalone”
      lookat(mean(x$rings))  # Execute the query and report back the result
      fit <- madlib.lm(rings ~ . - id | sex, data = x)  # Run a linear regression within the database and return a model object
      predict(fit, x)  # Create a query object representing scoring the model in the database
      mean((x$rings - predict(fit, x))^2)  # Query object calculating the mean square error of the model
      x$sex <- as.factor(x$sex)  # Add a calculated factor column to the database query object
      m0 <- madlib.glm(resp ~ age, family = "binomial", data = dbbank)  # Calculate a logistic regression model (dbbank is a table reference not defined on this slide)
      mstep <- step(m0, scope = list(lower = ~age, upper = ~age + factor(marital) + factor(education) + factor(housing) + factor(loan) + factor(job)))  # Perform stepwise feature selection
  38. We’re looking for contributors •  Browse our help pages –  Start page: madlib.net –  GitHub pages: github.com/madlib/madlib (SQL), github.com/gopivotal/pivotalr (R), github.com/gopivotal/pymadlib (Python) •  Use our product and report issues: –  jira.madlib.net (Issue tracker) –  user@madlib.net (User forum) •  Can use PostgreSQL or Greenplum Database Community Edition for installations on multiple platforms
  39. Credits. The MADlib Vision •  Academic and industry contributions •  Think of “CRAN for databases” –  Repository of open-source ML algorithms –  This time with data parallelism in mind •  Open-source framework (BSD License, Eigen). Leaders and contributors: Gavin Sherry, Caleb Welton, Joseph Hellerstein, Christopher Ré, Zhe Wang, Florian Schoppmann, Hai Qian, Shengwen Yang, Aaron Feng, and many others …
  40. Thank you for your attention. Important links: •  Product email: madlib@gopivotal.com •  Product site: madlib.net •  Speaker email: riyer@gopivotal.com
