PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learning library.

These are slides from my talk @ DataDay Texas, in Austin on 30 Mar 2013
(http://2013.datadaytexas.com/schedule)
Favorite and Fork PyMADlib on GitHub: https://github.com/gopivotal/pymadlib
MADlib: http://madlib.net

  • Special thanks to Grace Gee (Engineer, SOAR Program, Greenplum)

PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learning library. Presentation Transcript

  • 1. Srivatsan Ramanujam Senior Data Scientist Greenplum © Copyright 2011 EMC Corporation. All rights reserved. 1
  • 2. Agenda • Greenplum UAP overview – Products: GPDB, GPHD, Chorus, Analytics Labs, Data Computing Appliance – GPDB Architecture • MADlib – Overview – Algorithms – Working Mechanism – Performance Comparison with Mahout • PyMADlib – Overview – Demo in IPython Notebook • Future Directions – GPHD and HAWQ
  • 3. Greenplum Overview
  • 4. Products
  • 5. Greenplum Database - Architecture: MPP (Massively Parallel Processing) Shared-Nothing Architecture • Master Servers (SQL, MapReduce): query planning & dispatch • Network Interconnect • Segment Servers: query processing & data storage • External Sources: loading, streaming, etc.
  • 6. MADlib
  • 7. MADlib: The Origin UrbanDictionary.com: mad (adj.): an adjective used to enhance a noun. 1- dude, you got skills. 2- dude, you got mad skills. • First mention of MAD analytics was at VLDB’09 – MAD Skills: New Analysis Practices for Big Data – Jeffrey Cohen, Brian Dolan, Mark Dunlap, Joseph Hellerstein, Caleb Welton http://db.cs.berkeley.edu/papers/vldb09-madskills.pdf • MADlib project initiated in late 2010 – Maintained by Greenplum/EMC with significant contributions from UW Madison, UFlorida and UC Berkeley.
  • 8. Current Modules • Data Modeling – Supervised Learning: Naive Bayes Classification, Linear Regression, Logistic Regression, Multinomial Logistic Regression, Decision Tree, Random Forest, Support Vector Machines, Cox-Proportional Hazards Regression, Conditional Random Field – Unsupervised Learning: Association Rules, k-Means Clustering, Low-rank Matrix Factorization, SVD Matrix Factorization, Parallel Latent Dirichlet Allocation • Descriptive Statistics – Sketch-based Estimators: CountMin (Cormode-Muthukrishnan), FM (Flajolet-Martin), MFV (Most Frequent Values) – Profile – Quantile • Support – Array Operations – Conjugate Gradient – Sparse Vectors – Probability Functions – Random Sampling • Inferential Statistics – Hypothesis tests
  • 9. MADlib – User Doc • Check out the user guide with examples at: http://doc.madlib.net
  • 10. How does it work? A Linear Regression Example • Finding linear dependencies between variables – y ≈ c0 + c1 · x1 + c2 · x2 ?
    # select y, x1, x2 from unm limit 6;
       y   |  x1  | x2
    -------+------+-----
     10.14 |    0 | 0.3
     11.93 | 0.69 | 0.6
     13.57 |  1.1 | 0.9
     14.17 | 1.39 | 1.2
     15.25 | 1.61 | 1.5
     16.15 | 1.79 | 1.8
    Column y is the vector of dependent variables y; columns x1 and x2 form the design matrix X.
  • 11. Reminder: Linear-Regression Model • If the residuals are i.i.d. Gaussian with standard deviation σ: – max likelihood ⇔ min sum of squared residuals • First-order conditions for the quadratic least-squares objective (in c) yield the minimizer
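The formulas on this slide did not survive the transcript. What follows is a standard least-squares reconstruction, not the slide's own text. The notation is an assumption chosen to agree with slides 10 and 12: each row x_i of the design matrix X is taken to include a constant 1, so that c0 acts as the intercept.

```latex
\begin{align}
y_i &= c^T x_i + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2) \ \text{i.i.d.} \\
f(c) &= \sum_{i=1}^{n} \left( y_i - c^T x_i \right)^2 = \lVert y - Xc \rVert_2^2 \\
\nabla f(c) &= -2\, X^T (y - Xc) = 0 \;\Longrightarrow\; X^T X \, c = X^T y \\
\hat{c} &= (X^T X)^{-1} X^T y
\end{align}
```

Under the Gaussian assumption the log-likelihood is a constant minus f(c) / (2σ²), which is why maximizing the likelihood and minimizing the sum of squared residuals pick the same c.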
  • 12. Linear Regression: Streaming Algorithm • How to compute $\hat{c} = (X^T X)^{-1} X^T y$ with a single table scan? [Diagram: the products $X^T X$ and $X^T y$]
  • 13. Linear Regression: Parallel Computation of $X^T y$ [Diagram: Segment 1 computes $X_1^T y_1$; Segment 2 computes $X_2^T y_2$; the Master combines them into $X^T y$]
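The streaming and parallel slides can be illustrated with a small pure-Python sketch. It is not MADlib's implementation: the function names (`transition`, `merge`, `solve`, `scan`) and the two-way split into "segments" are hypothetical, and the six rows are the sample shown on slide 10. Each segment folds its rows into running sums for X^T X and X^T y in one pass, the master adds the partial sums, and the small normal-equation system is solved once at the end.

```python
# Illustrative sketch only; names are hypothetical, not MADlib internals.

ROWS = [  # (y, x1, x2) rows from the slide-10 sample of table "unm"
    (10.14, 0.0, 0.3),
    (11.93, 0.69, 0.6),
    (13.57, 1.1, 0.9),
    (14.17, 1.39, 1.2),
    (15.25, 1.61, 1.5),
    (16.15, 1.79, 1.8),
]


def empty_state(k):
    """Zeroed accumulators: a k x k matrix for X^T X and a k-vector for X^T y."""
    return [[0.0] * k for _ in range(k)], [0.0] * k


def transition(state, row):
    """Fold one row into the running sums; this is all a single scan needs."""
    xtx, xty = state
    y = row[0]
    x = (1.0,) + tuple(row[1:])  # prepend 1 so c0 is the intercept
    for i in range(len(x)):
        xty[i] += x[i] * y
        for j in range(len(x)):
            xtx[i][j] += x[i] * x[j]
    return xtx, xty


def merge(a, b):
    """Master step: partial sums from two segments simply add."""
    k = len(a[1])
    xtx = [[a[0][i][j] + b[0][i][j] for j in range(k)] for i in range(k)]
    xty = [a[1][i] + b[1][i] for i in range(k)]
    return xtx, xty


def solve(xtx, xty):
    """Solve (X^T X) c = X^T y by Gaussian elimination with partial pivoting."""
    k = len(xty)
    m = [row[:] + [rhs] for row, rhs in zip(xtx, xty)]  # augmented matrix
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, k):
            factor = m[r][col] / m[col][col]
            for c in range(col, k + 1):
                m[r][c] -= factor * m[col][c]
    coef = [0.0] * k
    for r in range(k - 1, -1, -1):  # back substitution
        known = sum(m[r][c] * coef[c] for c in range(r + 1, k))
        coef[r] = (m[r][k] - known) / m[r][r]
    return coef


def scan(rows, k=3):
    """One pass over a segment's rows."""
    state = empty_state(k)
    for row in rows:
        state = transition(state, row)
    return state


segment_1 = scan(ROWS[:3])  # rows held by a hypothetical first segment
segment_2 = scan(ROWS[3:])  # rows held by a hypothetical second segment
xtx, xty = merge(segment_1, segment_2)
coef = solve(xtx, xty)      # [c0, c1, c2]
```

The accumulators have a fixed size (k × k and k) regardless of the number of rows, which is why only these small states, and not the data, need to travel from the segments to the master.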
  • 14. Performance Comparison: Test Setup on AWB • AWB – 1000-node cluster located in Las Vegas – Over 24,000 processors, 48 TB of memory, and 24 PB of raw disk storage – 8000+ Map Task Capacity, 5000+ Reduce Task Capacity – GPHD 1.1, GPDB 4.2.3 • Mahout v0.7 • MADlib v0.5 – With small LMF change to allow 4-byte integer values • Test matrix – Data size (# rows/records, # columns/features) – Algorithms – Algorithm parameters (e.g. convergence threshold, # iterations) – GPDB segment / MR (Map-Reduce) task configurations
  • 15. Performance & Scalability Results (summary) • Whitepaper coming out shortly!
  • 16. Logistic Regression • Mahout only has a sequential (i.e. single-node) IGD implementation [Chart: "MADlib & Mahout Logistic Regression Scalability Across Number of Attributes"; x-axis: log(Number of Rows); y-axis: Time in Minutes; series: Census data, 48 attributes, for Mahout and for MADlib]
  • 17. Logistic Regression [Chart: "MADlib Scalability Across Number of GPDB Segments"; x-axis: Number of GPDB Segments; y-axis: Time in Minutes]
  • 18. K-Means Clustering [Chart: "MADlib & Mahout K-means Scalability Across Number of Rows"; x-axis: log(Number of Rows); y-axis: Time in Minutes; series: Census data, 48 attributes, for Mahout and for MADlib]
  • 19. K-Means Clustering [Chart: "MADlib K-means Scalability Across Number of GPDB Segments"; x-axis: Number of GPDB Segments; y-axis: Time in Minutes]
  • 20. PyMADlib: Python + MADlib = Awesome!
  • 21. Motivation • SQL is great for many things, but it’s not nearly enough • Undeniably the most straightforward way to query data • But not necessarily designed for data science
  • 22. MADlib is a godsend! • Empowers data scientists to run canned machine learning routines – focus less on coding, more on science • In-database, explicitly parallel. • So why do we need anything else? – UI is still all in SQL – Need to tap into rich visualization libraries
  • 23. Then which interface is favored by and familiar to data scientists? • Depends on who you ask • Left survey is for “higher level languages,” and right survey is for “lower level languages”
  • 24. Wait, don’t we already have this (PL/R, PL/Python, SAS HPA)? • PL/X’s are wonderful, but: – It still requires non-trivial knowledge of SQL to use effectively – Mostly limited to explicitly parallel jobs – Primarily a SQL interface to the end user • Need an interface that is: – Less SQL, more R/Python/SAS – Implicitly parallelized – More scalable • SAS HPA = $$$$$
  • 25. The challenge • MADlib – Open source – Extremely powerful/scalable – Growing algorithm breadth – SQL • Python/R – Open source – Memory limited – High algorithm breadth – Language/interface purpose-designed for data science • SAS – High user loyalty – Non-HPA is memory limited, HPA requires investment – High algorithm breadth – Language/interface purpose-designed for data science • Want to leverage both the performance benefits of MADlib and the usability of languages like Python, SAS, and R
  • 26. Simple solution: Translate Python code into SQL [Diagram: Python → SQL over ODBC/JDBC; SQL to execute MADlib; model output returned] • All data stays in the DB, and all model estimation and heavy lifting is done in the DB by MADlib • Only strings of SQL and model output are transferred across ODBC/JDBC • Best of both worlds: the number-crunching power of MADlib along with the rich set of visualizations of Matplotlib, NetworkX and all your other favorite Python libraries. Let MADlib do all the heavy lifting on your Greenplum/PostgreSQL database, while you program in your favorite language: Python.
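The translation idea can be sketched in a few lines. This is not PyMADlib's actual API: `linregr_sql` and its parameters are hypothetical names for illustration. The generated statement assumes the `madlib.linregr(y, x)` aggregate form documented for MADlib releases of that period; later releases changed the interface, so check the user guide for the installed version. Executing the string over a database connection is left out on purpose.

```python
import re

# Hypothetical helper, not part of PyMADlib.
_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)?")


def _check(name):
    """Reject anything that is not a plain (optionally schema-qualified) identifier."""
    if not _IDENT.fullmatch(name):
        raise ValueError(f"not a plain SQL identifier: {name!r}")
    return name


def linregr_sql(table, dependent, independents, schema="madlib"):
    """Build the SQL text for a linear regression; only this string leaves Python."""
    columns = [_check(col) for col in independents]
    features = ", ".join(["1"] + columns)  # leading 1 gives the intercept term
    return (
        f"SELECT ({_check(schema)}.linregr({_check(dependent)}, "
        f"ARRAY[{features}])).* FROM {_check(table)}"
    )


sql = linregr_sql("unm", "y", ["x1", "x2"])
print(sql)
```

The string would then be sent through whatever driver connects to the database, and only the handful of fitted values in the result row would come back to Python for plotting or further analysis.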
  • 27. Demo PyMADlib Tutorial – IPython Notebook Viewer Link http://nbviewer.ipython.org/5275846
  • 28. Where do I get it? $ pip install pymadlib
  • 29. I don’t have GPDB or MADlib – What do I do? • Greenplum Database Community Edition is freely available for single node installations on multiple platforms – Written permission may be requested from EMC/Greenplum for research use for multi-node installations • MADlib is free and open-source – Downloadable for multiple platforms from https://github.com/madlib/madlib • PyMADlib is also free and open-source – Downloadable from https://github.com/vatsan/pymadlib
  • 30. Future Directions
  • 31. Greenplum HD • HAWQ – Parallel SQL query engine that combines the key technological advantages of industry-leading Greenplum Database with scalability and convenience of Hadoop • SQL Standards Compliant – Supports Correlated Sub-queries, Window Functions, Roll-ups, Cubes + range of scalar and aggregate functions • ACID Compliant
  • 32. HAWQ – Architecture
  • 33. Performance: HAWQ [1] vs. Hive vs. Impala [2]. All experiments were run on a 60-node deployment with Analytics Workbench [3]. [1] http://www.greenplum.com/sites/default/files/2013_0301_hawq_sql_engine_hadoop_1.pdf [2] https://github.com/cloudera/impala/ [3] http://www.analyticsworkbench.com/
  • 34. HAWQ: Deep Scalable Analytics What’s inside the box? • Linear Regression • Logistic Regression • Multinomial Logistic Regression • K-Means • Association Rules • Latent Dirichlet Allocation • Users can connect to HAWQ via popular programming languages and it also supports JDBC and ODBC. • Most tools will work out of the box with HAWQ, including PyMADlib
  • 35. Questions? @being_bayesian vatsan.cs@utexas.edu https://github.com/vatsan/pymadlib
  • 36. Appendix
  • 37. Datasets. The following datasets were used in comparing the performance of MADlib with Mahout: – KDD Cup 2009 Orange marketing churn data (16.5 MB): about 500,000 records and 15,000 numerical and categorical attributes – Census 2000 data (1.7 GB): about 14 million records and 48 numerical and categorical attributes – Enron data (1.9 GB): about 700,000 documents with a vocabulary size of 200,000 – KDD Cup 2011 Yahoo! Music Webscope data (4.16 GB): about 1 million users, 600,000 songs, and 250 million ratings – Netflix Prize 2009 data (52.7 MB): about 400,000 users, 900 movies, and 4.5 million ratings