BUILT FOR THE SPEED OF BUSINESS
@ianhuston
2© Copyright 2014 Pivotal. All rights reserved. 2© Copyright 2013 Pivotal. All rights reserved. 2© Copyright 20...
@ianhuston
3© Copyright 2014 Pivotal. All rights reserved.
Some Links for this talk
Ÿ  Simple code examples:
https://gith...
@ianhuston
4© Copyright 2014 Pivotal. All rights reserved.
About Pivotal
Cloud Storage
Virtualization
Data &
Analytics
Pla...
@ianhuston
5© Copyright 2014 Pivotal. All rights reserved.
What do our customers look like?
Ÿ  Large enterprises with lot...
@ianhuston
6© Copyright 2014 Pivotal. All rights reserved.
Open Source is Pivotal
@ianhuston
7© Copyright 2014 Pivotal. All rights reserved.
Pivotal’s Open Source Contributions
Lots more interesting small...
@ianhuston
8© Copyright 2014 Pivotal. All rights reserved.
Typical Engagement Tech Setup
Ÿ  Platform:
–  Greenplum Analyt...
@ianhuston
9© Copyright 2014 Pivotal. All rights reserved.
1 Find Data
Platforms
•  Greenplum DB
•  Pivotal HD
•  Hadoop (...
@ianhuston
10© Copyright 2014 Pivotal. All rights reserved. 10© Copyright 2013 Pivotal. All rights reserved. 10© Copyright...
@ianhuston
11© Copyright 2014 Pivotal. All rights reserved.
MPP Architectural Overview
Think of it as multiple
PostGreSQL ...
@ianhuston
12© Copyright 2014 Pivotal. All rights reserved.
Data Parallelism
Ÿ  Little or no effort is required to break ...
@ianhuston
13© Copyright 2014 Pivotal. All rights reserved.
User-Defined Functions (UDFs)
Ÿ  PostgreSQL/Greenplum provide...
@ianhuston
14© Copyright 2014 Pivotal. All rights reserved.
Ÿ  The interpreter/VM of the language ‘X’ is
installed on eac...
@ianhuston
15© Copyright 2014 Pivotal. All rights reserved.
Intro to PL/Python
Ÿ  Procedural languages need to be install...
@ianhuston
16© Copyright 2014 Pivotal. All rights reserved. 16© Copyright 2013 Pivotal. All rights reserved. 16© Copyright...
@ianhuston
17© Copyright 2014 Pivotal. All rights reserved.
Returning Results
Ÿ  Postgres primitive types (int, bigint, t...
@ianhuston
18© Copyright 2014 Pivotal. All rights reserved.
Returning more results
You can return multiple results by wrap...
@ianhuston
19© Copyright 2014 Pivotal. All rights reserved.
Accessing Packages
Ÿ  On Greenplum DB: To be available packag...
@ianhuston
20© Copyright 2014 Pivotal. All rights reserved.
Benefits of PL/Python
Ÿ  Easy to bring your code to the data....
@ianhuston
21© Copyright 2014 Pivotal. All rights reserved. 21© Copyright 2013 Pivotal. All rights reserved. 21© Copyright...
@ianhuston
22© Copyright 2014 Pivotal. All rights reserved.
Going Beyond Data Parallelism
Ÿ  Data Parallel computation vi...
@ianhuston
23© Copyright 2014 Pivotal. All rights reserved.
MADlib: The Origin
•  First mention of MAD analytics was at VL...
@ianhuston
24© Copyright 2014 Pivotal. All rights reserved.
MADlib Executes Algorithms In-Place
MADlib Advantages
Ø  No D...
@ianhuston
25© Copyright 2014 Pivotal. All rights reserved.
MADlib In-Database
Functions
Predictive Modeling Library
Linea...
@ianhuston
26© Copyright 2014 Pivotal. All rights reserved.
Architecture
RDBMS Query Processing
(Greenplum, PostgreSQL, …)...
@ianhuston
27© Copyright 2014 Pivotal. All rights reserved.
How does it work ? : A Linear Regression Example
Ÿ  Finding l...
@ianhuston
28© Copyright 2014 Pivotal. All rights reserved.
Reminder: Linear-Regression Model
• 
•  If residuals i.i.d. Ga...
@ianhuston
29© Copyright 2014 Pivotal. All rights reserved.
Linear Regression: Streaming Algorithm
•  How to compute with ...
@ianhuston
30© Copyright 2014 Pivotal. All rights reserved.
Linear Regression: Parallel Computation
XT
y
XT
1 y1 XT
2 y2
S...
@ianhuston
31© Copyright 2014 Pivotal. All rights reserved.
Demos
Ÿ  We built demos to showcase our technology pipeline, ...
@ianhuston
32© Copyright 2014 Pivotal. All rights reserved.
Topic and Sentiment Analysis Pipeline
Stored on
HDFS
Tweet
Str...
@ianhuston
33© Copyright 2014 Pivotal. All rights reserved.
Transport for London
Traffic Disruption feed
Pivotal Greenplum...
@ianhuston
34© Copyright 2014 Pivotal. All rights reserved.
Get in touch
Feel free to contact me about PL/Python, or more ...
BUILT FOR THE SPEED OF BUSINESS
Upcoming SlideShare
Loading in …5
×

Massively Parallel Process with Prodedural Python by Ian Huston

1,030 views

Published on

Massively Parallel Process with Prodedural Python by Ian Huston
https://github.com/ihuston/plpython_examples

Published in: Technology
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
1,030
On SlideShare
0
From Embeds
0
Number of Embeds
19
Actions
Shares
0
Downloads
17
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

Massively Parallel Process with Prodedural Python by Ian Huston

  1. 1. BUILT FOR THE SPEED OF BUSINESS
  2. 2. @ianhuston 2© Copyright 2014 Pivotal. All rights reserved. 2© Copyright 2013 Pivotal. All rights reserved. 2© Copyright 2013 Pivotal. All rights reserved. 2© Copyright 2013 Pivotal. All rights reserved. 2© Copyright 2014 Pivotal. All rights reserved. Massively Parallel Processing with Procedural Python How do we use the PyData stack in data science engagements at Pivotal? Ian Huston, @ianhuston Data Scientist, Pivotal
  3. 3. @ianhuston 3© Copyright 2014 Pivotal. All rights reserved. Some Links for this talk Ÿ  Simple code examples: https://github.com/ihuston/plpython_examples Ÿ  IPython notebook rendered with nbviewer: http://tinyurl.com/ih-plpython Ÿ  More info (written for PL/R but applies to PL/Python): http://gopivotal.github.io/gp-r/ Ÿ  Traffic Disruption demo (if we have time) http://ds-demo-transport.cfapps.io
  4. 4. @ianhuston 4© Copyright 2014 Pivotal. All rights reserved. About Pivotal Cloud Storage Virtualization Data & Analytics Platform Cloud Application Platform Data-Driven Application Development Pivotal Data Science Labs
  5. 5. @ianhuston 5© Copyright 2014 Pivotal. All rights reserved. What do our customers look like? Ÿ  Large enterprises with lots of data collected –  Work with 10s of TBs to PBs of data, structured & unstructured Ÿ  Not able to get what they want out of their data –  Old Legacy systems with high cost and no flexibility –  Response times are too slow for interactive data analysis –  Can only deal with small samples of data locally Ÿ  They want to transform into data driven enterprises
  6. 6. @ianhuston 6© Copyright 2014 Pivotal. All rights reserved. Open Source is Pivotal
  7. 7. @ianhuston 7© Copyright 2014 Pivotal. All rights reserved. Pivotal’s Open Source Contributions Lots more interesting small projects: •  PyMADlib – Python Wrapper for MADlib https://github.com/gopivotal/pymadlib •  PivotalR – R wrapper for MADlib http://github.com/madlib-internal/PivotalR •  Part-of-speech tagger for Twitter via SQL http://vatsan.github.io/gp-ark-tweet-nlp/ •  Pandas via psql (interactive PostgreSQL terminal) https://github.com/vatsan/pandas_via_psql
  8. 8. @ianhuston 8© Copyright 2014 Pivotal. All rights reserved. Typical Engagement Tech Setup Ÿ  Platform: –  Greenplum Analytics Database (GPDB) –  Pivotal HD Hadoop Distribution + HAWQ (SQL DB on Hadoop) Ÿ  Open Source Options (http://gopivotal.com): –  Greenplum Community Edition –  Pivotal HD Community Edition (HAWQ not included) –  MADlib in-database machine learning library (http://madlib.net) Ÿ  Where Python fits in: –  PL/Python running in-database, with nltk, scikit-learn etc –  IPython for exploratory analysis –  Pandas, Matplotlib etc.
  9. 9. @ianhuston 9© Copyright 2014 Pivotal. All rights reserved. 1 Find Data Platforms •  Greenplum DB •  Pivotal HD •  Hadoop (other) •  SAS HPA •  AWS 2 Write Code Editing Tools •  Vi/Vim •  Emacs •  Smultron •  TextWrangler •  Eclipse •  Notepad++ •  IPython •  Sublime Languages •  SQL •  Bash scripting •  C •  C++ •  C# •  Java •  Python •  R 3 Run Code Interfaces •  pgAdminIII •  psql •  psycopg2 •  Terminal •  Cygwin •  Putty •  Winscp 4 Write Code for Big Data In-Database •  SQL •  PL/Python •  PL/Java •  PL/R •  PL/pgSQL Hadoop •  HAWQ •  Pig •  Hive •  Java 5 Implement Algorithms Libraries •  MADlib Java •  Mahout R •  (Too many to list!) Text •  OpenNLP •  NLTK •  GPText C++ •  opencv Python •  NumPy •  SciPy •  scikit-learn •  Pandas Programs •  Alpine Miner •  Rstudio •  MATLAB •  SAS •  Stata 6 Show Results Visualization •  python-matplotlib •  python-networkx •  D3.js •  Tableau •  GraphViz •  Gephi •  R (ggplot2, lattice, shiny) •  Excel 7 Collaborate Sharing Tools •  Chorus •  Confluence •  Socialcast •  Github •  Google Drive & Hangouts PIVOTAL DATA SCIENCE TOOLKIT A large and varied tool box!
  10. 10. @ianhuston 10© Copyright 2014 Pivotal. All rights reserved. 10© Copyright 2013 Pivotal. All rights reserved. 10© Copyright 2013 Pivotal. All rights reserved. 10© Copyright 2013 Pivotal. All rights reserved. 10© Copyright 2014 Pivotal. All rights reserved. PL/Python
  11. 11. @ianhuston 11© Copyright 2014 Pivotal. All rights reserved. MPP Architectural Overview Think of it as multiple PostGreSQL servers Workers Master
  12. 12. @ianhuston 12© Copyright 2014 Pivotal. All rights reserved. Data Parallelism Ÿ  Little or no effort is required to break up the problem into a number of parallel tasks, and there exists no dependency (or communication) between those parallel tasks. Ÿ  Examples: –  Measure the height of each student in a classroom (explicitly parallelizable by student) –  MapReduce –  map() function in Python
  13. 13. @ianhuston 13© Copyright 2014 Pivotal. All rights reserved. User-Defined Functions (UDFs) Ÿ  PostgreSQL/Greenplum provide lots of flexibility in defining your own functions. Ÿ  Simple UDFs are SQL queries with calling arguments and return types. Definition: CREATE  FUNCTION  times2(INT)   RETURNS  INT   AS  $$          SELECT  2  *  $1   $$  LANGUAGE  sql;   Execution: SELECT  times2(1);    times2     -­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐              2   (1  row)  
  14. 14. @ianhuston 14© Copyright 2014 Pivotal. All rights reserved. Ÿ  The interpreter/VM of the language ‘X’ is installed on each node of the Greenplum Database Cluster •  Data Parallelism: -  PL/X piggybacks on Greenplum’s MPP architecture •  Allows users to write Greenplum/ PostgreSQL functions in the R/Python/ Java, Perl, pgsql or C languages Standby Master … Master Host SQL Interconnect Segment Host Segment Segment Segment Host Segment Segment Segment Host Segment Segment Segment Host Segment Segment PL/X : X in {pgsql, R, Python, Java, Perl, C etc.}
  15. 15. @ianhuston 15© Copyright 2014 Pivotal. All rights reserved. Intro to PL/Python Ÿ  Procedural languages need to be installed on each database used. Ÿ  Name in SQL is plpythonu, ‘u’ means untrusted so need to be superuser to install. Ÿ  Syntax is like normal Python function with function definition line replaced by SQL wrapper. Alternatively like a SQL User Defined Function with Python inside. CREATE  FUNCTION  pymax  (a  integer,  b  integer)      RETURNS  integer   AS  $$      if  a  >  b:          return  a      return  b   $$  LANGUAGE  plpythonu;   SQL wrapper SQL wrapper Normal Python
  16. 16. @ianhuston 16© Copyright 2014 Pivotal. All rights reserved. 16© Copyright 2013 Pivotal. All rights reserved. 16© Copyright 2013 Pivotal. All rights reserved. 16© Copyright 2013 Pivotal. All rights reserved. 16© Copyright 2014 Pivotal. All rights reserved. Examples
  17. 17. @ianhuston 17© Copyright 2014 Pivotal. All rights reserved. Returning Results Ÿ  Postgres primitive types (int, bigint, text, float8, double precision, date, NULL etc.) Ÿ  Composite types can be returned by creating a composite type in the database:   CREATE  TYPE  named_value  AS  (      name    text,      value    integer   );   Ÿ  Then you can return a list, tuple or dict (not sets) which reference the same structure as the table: CREATE  FUNCTION  make_pair  (name  text,  value  integer)      RETURNS  named_value   AS  $$      return  [  name,  value  ]      #  or  alternatively,  as  tuple:  return  (  name,  value  )      #  or  as  dict:  return  {  "name":  name,  "value":  value  }      #  or  as  an  object  with  attributes  .name  and  .value   $$  LANGUAGE  plpythonu;   Ÿ  For functions which return multiple rows, prefix “setof” before the return type
  18. 18. @ianhuston 18© Copyright 2014 Pivotal. All rights reserved. Returning more results You can return multiple results by wrapping them in a sequence (tuple, list or set), an iterator or a generator: CREATE  FUNCTION  make_pair  (name  text)      RETURNS  SETOF  named_value   AS  $$      return  ([  name,  1  ],  [  name,  2  ],  [  name,  3])     $$  LANGUAGE  plpythonu;   Sequence Generator CREATE  FUNCTION  make_pair  (name  text)      RETURNS  SETOF  named_value    AS  $$      for  i  in  range(3):              yield  (name,  i)     $$  LANGUAGE  plpythonu;  
  19. 19. @ianhuston 19© Copyright 2014 Pivotal. All rights reserved. Accessing Packages Ÿ  On Greenplum DB: To be available packages must be installed on the individual segment nodes. –  Can use “parallel ssh” tool gpssh to conda/pip install –  Currently Greenplum DB ships with Python 2.6 (!) Ÿ  Then just import as usual inside function:     CREATE  FUNCTION  make_pair  (name  text)      RETURNS  named_value   AS  $$      import  numpy  as  np      return  ((name,i)  for  i  in  np.arange(3))   $$  LANGUAGE  plpythonu;  
  20. 20. @ianhuston 20© Copyright 2014 Pivotal. All rights reserved. Benefits of PL/Python Ÿ  Easy to bring your code to the data. Ÿ  When SQL falls short leverage your Python (or R/Java/C) experience quickly. Ÿ  Apply Python across terabytes of data with minimal overhead or additional requirements. Ÿ  Results are already in the database system, ready for further analysis or storage.
  21. 21. @ianhuston 21© Copyright 2014 Pivotal. All rights reserved. 21© Copyright 2013 Pivotal. All rights reserved. 21© Copyright 2013 Pivotal. All rights reserved. 21© Copyright 2013 Pivotal. All rights reserved. 21© Copyright 2014 Pivotal. All rights reserved. MADlib
  22. 22. @ianhuston 22© Copyright 2014 Pivotal. All rights reserved. Going Beyond Data Parallelism Ÿ  Data Parallel computation via PL/Python libraries only allow us to run ‘n’ models in parallel. Ÿ  This works great when we are building one model for each value of the group by column, but we need parallelized algorithms to be able to build a single model on all the available data Ÿ  For this, we use MADlib – an open source library of parallel in-database machine learning algorithms.
  23. 23. @ianhuston 23© Copyright 2014 Pivotal. All rights reserved. MADlib: The Origin •  First mention of MAD analytics was at VLDB 2009 MAD Skills: New Analysis Practices for Big Data J. Hellerstein, J. Cohen, B. Dolan, M. Dunlap, C. Welton (with help from: Noelle Sio, David Hubbard, James Marca) http://db.cs.berkeley.edu/papers/vldb09-madskills.pdf •  MADlib project initiated in late 2010: Greenplum Analytics team and Prof. Joe Hellerstein UrbanDictionary mad (adj.): an adjective used to enhance a noun. 1- dude, you got skills. 2- dude, you got mad skills •  Open Source! https://github.com/madlib/madlib •  Works on Greenplum DB, PostgreSQL and also HAWQ & Impala •  Active development by Pivotal -  Latest Release: v1.4 (Nov 2013) •  Downloads and Docs: http://madlib.net/
  24. 24. @ianhuston 24© Copyright 2014 Pivotal. All rights reserved. MADlib Executes Algorithms In-Place MADlib Advantages Ø  No Data Movement Ø  Use MPP architecture’s full compute power Ø  Use MPP architecture’s entire memory to process data sets SQL M SQL M SQL M Segment Processors Master Processor M MADlib User
  25. 25. @ianhuston 25© Copyright 2014 Pivotal. All rights reserved. MADlib In-Database Functions Predictive Modeling Library Linear Systems •  Sparse and Dense Solvers Matrix Factorization •  Single Value Decomposition (SVD) •  Low-Rank Generalized Linear Models •  Linear Regression •  Logistic Regression •  Multinomial Logistic Regression •  Cox Proportional Hazards •  Regression •  Elastic Net Regularization •  Sandwich Estimators (Huber white, clustered, marginal effects) Machine Learning Algorithms •  Principal Component Analysis (PCA) •  Association Rules (Affinity Analysis, Market Basket) •  Topic Modeling (Parallel LDA) •  Decision Trees •  Ensemble Learners (Random Forests) •  Support Vector Machines •  Conditional Random Field (CRF) •  Clustering (K-means) •  Cross Validation Descriptive Statistics Sketch-based Estimators •  CountMin (Cormode- Muthukrishnan) •  FM (Flajolet-Martin) •  MFV (Most Frequent Values) Correlation Summary Support Modules Array Operations Sparse Vectors Random Sampling Probability Functions
  26. 26. @ianhuston 26© Copyright 2014 Pivotal. All rights reserved. Architecture RDBMS Query Processing (Greenplum, PostgreSQL, …) Low-level Abstraction Layer (matrix operations, C++ to RDBMS type bridge, …) RDBMS Built-in Functions User Interface High-level Abstraction Layer (iteration controller, ...) Functions for Inner Loops (for streaming algorithms) “Driver” Functions (outer loops of iterative algorithms, optimizer invocations) Python Python with templated SQL SQL, generated from specification C++
  27. 27. @ianhuston 27© Copyright 2014 Pivotal. All rights reserved. How does it work ? : A Linear Regression Example Ÿ  Finding linear dependencies between variables –  y ≈ c0 + c1 · x1 + c2 · x2 ? # select y, x1, x2 from unm limit 6; y | x1 | x2 -------+------+----- 10.14 | 0 | 0.3 11.93 | 0.69 | 0.6 13.57 | 1.1 | 0.9 14.17 | 1.39 | 1.2 15.25 | 1.61 | 1.5 16.15 | 1.79 | 1.8 Design Matrix X Vector of dependent variables y
  28. 28. @ianhuston 28© Copyright 2014 Pivotal. All rights reserved. Reminder: Linear-Regression Model •  •  If residuals i.i.d. Gaussians with standard deviation σ: –  max likelihood ⇔ min sum of squared residuals •  First-order conditions for the following quadratic objective (in c) yield the minimizer f(y | x) ∝ exp − 1 2σ2 · (y − xT c)2
  29. 29. @ianhuston 29© Copyright 2014 Pivotal. All rights reserved. Linear Regression: Streaming Algorithm •  How to compute with a single table scan? XT X XT y -1 XTX XTy
  30. 30. @ianhuston 30© Copyright 2014 Pivotal. All rights reserved. Linear Regression: Parallel Computation XT y XT 1 y1 XT 2 y2 Segment 1 Segment 2 XT y Master XT y = XT 1 XT 2 y1 y2 = XT i yi
  31. 31. @ianhuston 31© Copyright 2014 Pivotal. All rights reserved. Demos Ÿ  We built demos to showcase our technology pipeline, using Python technology. Ÿ  Two use cases: –  Topic and Sentiment Analysis of Tweets –  London Road Traffic Disruption prediction
  32. 32. @ianhuston 32© Copyright 2014 Pivotal. All rights reserved. Topic and Sentiment Analysis Pipeline Stored on HDFS Tweet Stream (gpfdist) Loaded as external tables into GPDB Parallel Parsing of JSON and extraction of fields using PL/ Python Topic Analysis through MADlib pLDA Sentiment Analysis through custom PL/Python functions D3.js
  33. 33. @ianhuston 33© Copyright 2014 Pivotal. All rights reserved. Transport for London Traffic Disruption feed Pivotal Greenplum Database Deduplication Feature Creation Modelling & Machine Learning d3.js  &  NVD3   Interactive SVG figures Transport Disruption Prediction Pipeline
  34. 34. @ianhuston 34© Copyright 2014 Pivotal. All rights reserved. Get in touch Feel free to contact me about PL/Python, or more generally about Data Science and opportunities available. @ianhuston ihuston @ gopivotal.com http://www.ianhuston.net
  35. 35. BUILT FOR THE SPEED OF BUSINESS

×