Your SlideShare is downloading. ×
0
BUILT FOR THE SPEED OF BUSINESS
@ronert_obst
2© Copyright 2014 Pivotal. All rights reserved. 2© Copyright 2013 Pivotal. All rights reserved. 2© Copyright ...
@ronert_obst
3© Copyright 2014 Pivotal. All rights reserved.
Some Links for this talk
Ÿ  Simple code examples:
https://gi...
@ronert_obst
4© Copyright 2014 Pivotal. All rights reserved.
About Pivotal
Agile PaaS Big Data
Pivotal™
HD
with GemFire XD...
@ronert_obst
5© Copyright 2014 Pivotal. All rights reserved.
What do our customers look like?
Ÿ  Large enterprises with l...
@ronert_obst
6© Copyright 2014 Pivotal. All rights reserved.
Open Source is Pivotal
@ronert_obst
7© Copyright 2014 Pivotal. All rights reserved.
Pivotal’s Open Source Contributions
Lots more interesting sma...
@ronert_obst
8© Copyright 2014 Pivotal. All rights reserved.
Typical Engagement Tech Setup
Ÿ  Platform:
–  Greenplum Anal...
@ronert_obst
9© Copyright 2014 Pivotal. All rights reserved. 9© Copyright 2013 Pivotal. All rights reserved. 9© Copyright ...
@ronert_obst
10© Copyright 2014 Pivotal. All rights reserved.
Intro to PL/Python
Ÿ  Procedural languages need to be insta...
@ronert_obst
11© Copyright 2014 Pivotal. All rights reserved.
Returning Results
Ÿ  Postgres primitive types (int, bigint,...
@ronert_obst
12© Copyright 2014 Pivotal. All rights reserved.
Returning more results
You can return multiple results by wr...
@ronert_obst
13© Copyright 2014 Pivotal. All rights reserved.
Accessing Packages
Ÿ  On Greenplum DB: To be available pack...
@ronert_obst
14© Copyright 2014 Pivotal. All rights reserved.
MPP Architectural Overview
Think of it as multiple
PostgreSQ...
@ronert_obst
15© Copyright 2014 Pivotal. All rights reserved.
Benefits of PL/Python
Ÿ  Easy to bring your code to the dat...
@ronert_obst
16© Copyright 2014 Pivotal. All rights reserved. 16© Copyright 2013 Pivotal. All rights reserved. 16© Copyrig...
@ronert_obst
17© Copyright 2014 Pivotal. All rights reserved.
Going Beyond Data Parallelism
Ÿ  PL/Python only allows us t...
@ronert_obst
18© Copyright 2014 Pivotal. All rights reserved.
MADlib In-Database
Functions
Predictive Modeling Library
Lin...
@ronert_obst
19© Copyright 2014 Pivotal. All rights reserved.
Architecture
RDBMS Query Processing
(Greenplum, PostgreSQL, ...
@ronert_obst
20© Copyright 2014 Pivotal. All rights reserved. 20© Copyright 2013 Pivotal. All rights reserved. 20© Copyrig...
@ronert_obst
21© Copyright 2014 Pivotal. All rights reserved.
In Database Image Processing
Ÿ  Many use cases
–  Defect de...
@ronert_obst
22© Copyright 2014 Pivotal. All rights reserved.
Image Smoothing
create	
  or	
  replace	
  function	
  smoot...
@ronert_obst
23© Copyright 2014 Pivotal. All rights reserved.
Pivotal GNIP Decahose Pipeline
Parallel Parsing
of JSON
(PL/...
@ronert_obst
24© Copyright 2014 Pivotal. All rights reserved.
Sentiment Analysis – PL/Python Functions
Sentiment Scored
Tw...
@ronert_obst
25© Copyright 2014 Pivotal. All rights reserved.
Transport for London
Traffic Disruption feed
Pivotal Greenpl...
@ronert_obst
26© Copyright 2014 Pivotal. All rights reserved.
Get in touch
Feel free to contact me about PL/Python, or mor...
BUILT FOR THE SPEED OF BUSINESS
Upcoming SlideShare
Loading in...5
×

Massively Parallel Processing with Procedural Python by Ronert Obst PyData Berlin 2014

654

Published on

The Python data ecosystem has grown beyond the confines of single machines to embrace scalability. Here we describe one of our approaches to scaling, which is already being used in production systems. The goal of in-database analytics is to bring the calculations to the data, reducing transport costs and I/O bottlenecks. Using PL/Python we can run parallel queries across terabytes of data using not only pure SQL but also familiar PyData packages such as scikit-learn and nltk. This approach can also be used with PL/R to make use of a wide variety of R packages. We look at examples on Postgres compatible systems such as the Greenplum Database and on Hadoop through Pivotal HAWQ. We will also introduce MADlib, Pivotal’s open source library for scalable in-database machine learning, which uses Python to glue SQL queries to low level C++ functions and is also usable through the PyMADlib package.

Published in: Data & Analytics
0 Comments
1 Like
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
654
On Slideshare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
16
Comments
0
Likes
1
Embeds 0
No embeds

No notes for slide

Transcript of "Massively Parallel Processing with Procedural Python by Ronert Obst PyData Berlin 2014"

  1. 1. BUILT FOR THE SPEED OF BUSINESS
  2. 2. @ronert_obst 2© Copyright 2014 Pivotal. All rights reserved. 2© Copyright 2013 Pivotal. All rights reserved. 2© Copyright 2013 Pivotal. All rights reserved. 2© Copyright 2013 Pivotal. All rights reserved. 2© Copyright 2014 Pivotal. All rights reserved. Massively Parallel Processing with Procedural Python How do we use the PyData stack in data science engagements at Pivotal? Ian Huston, Ronert Obst Data Scientists, Pivotal
  3. 3. @ronert_obst 3© Copyright 2014 Pivotal. All rights reserved. Some Links for this talk Ÿ  Simple code examples: https://github.com/ronert/plpython_examples Ÿ  IPython notebook rendered with nbviewer: http://bit.ly/1u5t1Mj
  4. 4. @ronert_obst 4© Copyright 2014 Pivotal. All rights reserved. About Pivotal Agile PaaS Big Data Pivotal™ HD with GemFire XD® Pivotal™ Greenplum Database Pivotal™ GemFire®
  5. 5. @ronert_obst 5© Copyright 2014 Pivotal. All rights reserved. What do our customers look like? Ÿ  Large enterprises with lots of data collected –  Work with PBs of data, structured & unstructured Ÿ  Not able to get what they want out of their data –  Legacy systems –  Slow response times –  Small Samples Ÿ  Transform into data driven enterprises
  6. 6. @ronert_obst 6© Copyright 2014 Pivotal. All rights reserved. Open Source is Pivotal
  7. 7. @ronert_obst 7© Copyright 2014 Pivotal. All rights reserved. Pivotal’s Open Source Contributions Lots more interesting small projects: •  PyMADlib – Python Wrapper for MADlib https://github.com/gopivotal/pymadlib •  PivotalR – R wrapper for MADlib http://github.com/madlib-internal/PivotalR •  Part-of-speech tagger for Twitter via SQL http://vatsan.github.io/gp-ark-tweet-nlp/ •  Pandas via psql (interactive PostgreSQL terminal) https://github.com/vatsan/pandas_via_psql
  8. 8. @ronert_obst 8© Copyright 2014 Pivotal. All rights reserved. Typical Engagement Tech Setup Ÿ  Platform: –  Greenplum Analytics Database (GPDB) –  Pivotal HD Hadoop Distribution + HAWQ (SQL on Hadoop) Ÿ  Open Source Options (http://gopivotal.com): –  Greenplum Community Edition –  Pivotal HD Community Edition (HAWQ not included) –  MADlib in-database machine learning library (http://madlib.net) Ÿ  Where Python fits in: –  IPython Notebook for exploratory analysis –  PL/Python running in-database, with nltk, scikit-learn etc –  Pandas, Matplotlib etc. Pivotal™ HD with GemFire XD® Pivotal™ Greenplum Database
  9. 9. @ronert_obst 9© Copyright 2014 Pivotal. All rights reserved. 9© Copyright 2013 Pivotal. All rights reserved. 9© Copyright 2013 Pivotal. All rights reserved. 9© Copyright 2013 Pivotal. All rights reserved. 9© Copyright 2014 Pivotal. All rights reserved. PL/Python
  10. 10. @ronert_obst 10© Copyright 2014 Pivotal. All rights reserved. Intro to PL/Python Ÿ  Procedural languages need to be installed on each database used Ÿ  Name in SQL is plpythonu, ‘u’ means untrusted so need to be super user to install Ÿ  Syntax is like normal Python function with function definition line replaced by SQL wrapper CREATE  FUNCTION  pymax  (a  integer,  b  integer)      RETURNS  integer   AS  $$      if  a  >  b:          return  a      return  b   $$  LANGUAGE  plpythonu;   SQL wrapper SQL wrapper Normal Python
  11. 11. @ronert_obst 11© Copyright 2014 Pivotal. All rights reserved. Returning Results Ÿ  Postgres primitive types (int, bigint, text, float8, double precision, date, NULL etc.) Ÿ  Composite types can be returned by creating a composite type in the database:     CREATE  TYPE  named_value  AS  (      name    text,      value    integer);   Ÿ  Then you can return a list, tuple or dict (not sets) which reference the same structure as the table:   CREATE  FUNCTION  make_pair  (name  text,  value  integer)      RETURNS  named_value   AS  $$      return  [  name,  value  ]      #  or  alternatively,  as  tuple:  return  (  name,  value  )      #  or  as  dict:  return  {  "name":  name,  "value":  value  }      #  or  as  an  object  with  attributes  .name  and  .value   $$  LANGUAGE  plpythonu;   Ÿ  For functions which return multiple rows, prefix “setof” before the return type
  12. 12. @ronert_obst 12© Copyright 2014 Pivotal. All rights reserved. Returning more results You can return multiple results by wrapping them in a sequence (tuple, list or set), an iterator or a generator: CREATE  FUNCTION  make_pair  (name  text)      RETURNS  SETOF  named_value   AS  $$      return  ([  name,  1  ],  [  name,  2  ],  [  name,  3])     $$  LANGUAGE  plpythonu;   Sequence Generator CREATE  FUNCTION  make_pair  (name  text)      RETURNS  SETOF  named_value    AS  $$      for  i  in  range(3):              yield  (name,  i)     $$  LANGUAGE  plpythonu;  
  13. 13. @ronert_obst 13© Copyright 2014 Pivotal. All rights reserved. Accessing Packages Ÿ  On Greenplum DB: To be available packages must be installed on the individual segment nodes. –  Can use “parallel ssh” tool gpssh to conda/pip install –  Currently Greenplum DB ships with Python 2.6 (!) Ÿ  Then just import as usual inside function:     CREATE  FUNCTION  make_pair  (name  text)      RETURNS  named_value   AS  $$      import  numpy  as  np      return  ((name,i)  for  i  in  np.arange(3))   $$  LANGUAGE  plpythonu;  
  14. 14. @ronert_obst 14© Copyright 2014 Pivotal. All rights reserved. MPP Architectural Overview Think of it as multiple PostgreSQL servers Workers Master
  15. 15. @ronert_obst 15© Copyright 2014 Pivotal. All rights reserved. Benefits of PL/Python Ÿ  Easy to bring your code to the data Ÿ  When SQL falls short, leverage your Python experience Ÿ  Apply Python across petabytes of data with minimal overhead or additional requirements Ÿ  Results are already in the database system, ready for further analysis or storage
  16. 16. @ronert_obst 16© Copyright 2014 Pivotal. All rights reserved. 16© Copyright 2013 Pivotal. All rights reserved. 16© Copyright 2013 Pivotal. All rights reserved. 16© Copyright 2013 Pivotal. All rights reserved. 16© Copyright 2014 Pivotal. All rights reserved. MADlib
  17. 17. @ronert_obst 17© Copyright 2014 Pivotal. All rights reserved. Going Beyond Data Parallelism Ÿ  PL/Python only allows us to run ‘n’ models in parallel Ÿ  No global model in PL/Python. Solution: •  Open Source! https://github.com/madlib/madlib •  Works on Greenplum DB, PostgreSQL and HAWQ •  Active development by Pivotal -  Latest Release: v1.6 (July 2014) •  Downloads and Docs: http://madlib.net/
  18. 18. @ronert_obst 18© Copyright 2014 Pivotal. All rights reserved. MADlib In-Database Functions Predictive Modeling Library Linear Systems •  Sparse and Dense Solvers Matrix Factorization •  Single Value Decomposition (SVD) •  Low-Rank Generalized Linear Models •  Linear Regression •  Logistic Regression •  Multinomial Logistic Regression •  Cox Proportional Hazards •  Regression •  Elastic Net Regularization •  Sandwich Estimators (Huber white, clustered, marginal effects) Machine Learning Algorithms •  Principal Component Analysis (PCA) •  Association Rules (Affinity Analysis, Market Basket) •  Topic Modeling (Parallel LDA) •  Decision Trees •  Ensemble Learners (Random Forests) •  Support Vector Machines •  Conditional Random Field (CRF) •  Clustering (K-means) •  Cross Validation Descriptive Statistics Sketch-based Estimators •  CountMin (Cormode- Muthukrishnan) •  FM (Flajolet-Martin) •  MFV (Most Frequent Values) Correlation Summary Support Modules Array Operations Sparse Vectors Random Sampling Probability Functions
  19. 19. @ronert_obst 19© Copyright 2014 Pivotal. All rights reserved. Architecture RDBMS Query Processing (Greenplum, PostgreSQL, HAWQ…) Low-level Abstraction Layer (matrix operations, C++) RDBMS Built-In Functions User Interface High-level Abstraction Layer (iteration controller, ...) Functions for Inner Loops (for streaming algorithms) “Driver” Functions (outer loops of iterative algorithms, optimizer invocations) Python Python with templated SQL SQL, generated from specification C++
  20. 20. @ronert_obst 20© Copyright 2014 Pivotal. All rights reserved. 20© Copyright 2013 Pivotal. All rights reserved. 20© Copyright 2013 Pivotal. All rights reserved. 20© Copyright 2013 Pivotal. All rights reserved. 20© Copyright 2014 Pivotal. All rights reserved. Examples
  21. 21. @ronert_obst 21© Copyright 2014 Pivotal. All rights reserved. In Database Image Processing Ÿ  Many use cases –  Defect detection in manufacturing –  Tumor detection in medical images Ÿ  Challenge: Size of main memory Beck et al. Sci Transl Med 2011. Name Row Col R G B img.jpg 331 188 250 249 255 img.jpg 332 188 248 250 255 img.jpg 331 189 249 249 255
  22. 22. @ronert_obst 22© Copyright 2014 Pivotal. All rights reserved. Image Smoothing create  or  replace  function  smooth(maxr  integer,  maxc  integer,  im_array  integer[])   returns  integer[]   as  $$      import  numpy  as  np      import  scipy  as  sc      import  scipy.ndimage  as  ndi      smooth  =  np.reshape(im_array,  (maxr+1,  maxc+1))      smooth  =  ndi.uniform_filter(smooth,  size=3)   return  np.reshape(smooth.astype(int),  (maxr+1)*(maxc+1))   $$  language  plpythonu;     select  im_id,  smooth(max(row),  max(col),      array_agg(blue_intensity  order  by  row,col))        from  (select  im_id,row,col,blue_intensity            from  images_table  where  im_id=1878            order  by  im_id,  row,  col)  t      group  by  im_id  distributed  by  (im_id);  
  23. 23. @ronert_obst 23© Copyright 2014 Pivotal. All rights reserved. Pivotal GNIP Decahose Pipeline Parallel Parsing of JSON (PL/Python) Topic Analysis through MADlib LDA Unsupervised Sentiment Analysis (PL/Python) D3.js Twitter Decahose (~55 million tweets/day) Source: http Sink: hdfs HDFS External Tables PXF Nightly Cron Jobs
  24. 24. @ronert_obst 24© Copyright 2014 Pivotal. All rights reserved. Sentiment Analysis – PL/Python Functions Sentiment Scored Tweets Use learned phrasal polarities to score sentiment of new tweets 1: Parts-of-speech Tagger : Gp-Ark-Tweet-NLP (http://vatsan.github.io/gp-ark-tweet-nlp/) Part-of-speech tagger1 Break-up Tweets into tokens and tag their parts-of-speech Phrase Extraction Semi-Supervised Sentiment Classification Phrasal Polarity Scoring PL/Python
  25. 25. @ronert_obst 25© Copyright 2014 Pivotal. All rights reserved. Transport for London Traffic Disruption feed Pivotal Greenplum Database Deduplication Feature Creation Modelling & Machine Learning d3.js  &  NVD3   Interactive SVG figures Transport Disruption Prediction Pipeline
  26. 26. @ronert_obst 26© Copyright 2014 Pivotal. All rights reserved. Get in touch Feel free to contact me about PL/Python, or more generally about Data Science. @ronert_obst robst@gopivotal.com www.ronert-obst.com
  27. 27. BUILT FOR THE SPEED OF BUSINESS
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×