SlideShare a Scribd company logo
1 of 35
Download to read offline
BUILT FOR THE SPEED OF BUSINESS
Massively Parallel Processing
with Procedural Python
How do we use the PyData stack in data science
engagements at Pivotal?
Ian Huston, @ianhuston
Data Scientist, Pivotal

@ianhuston
© Copyright 2014 Pivotal. All rights reserved.
2013

2
Some Links for this talk
Ÿ  Simple code examples:
https://github.com/ihuston/plpython_examples
Ÿ  IPython notebook rendered with nbviewer:
http://tinyurl.com/ih-plpython
Ÿ  More info (written for PL/R but applies to PL/Python):
http://gopivotal.github.io/gp-r/
Ÿ  Traffic Disruption demo (if we have time)
http://ds-demo-transport.cfapps.io
@ianhuston
© Copyright 2014 Pivotal. All rights reserved.

3
About Pivotal
Data-Driven
Application
Development

Pivotal Data
Science Labs

Cloud
Application
Platform
Data &
Analytics
Platform

Virtualization

Cloud Storage

@ianhuston
© Copyright 2014 Pivotal. All rights reserved.

4
What do our customers look like?
Ÿ  Large enterprises with lots of data collected
–  Work with 10s of TBs to PBs of data, structured & unstructured

Ÿ  Not able to get what they want out of their data
–  Old Legacy systems with high cost and no flexibility
–  Response times are too slow for interactive data analysis
–  Can only deal with small samples of data locally

Ÿ  They want to transform into data driven enterprises

@ianhuston
© Copyright 2014 Pivotal. All rights reserved.

5
Open Source is Pivotal

@ianhuston
© Copyright 2014 Pivotal. All rights reserved.

6
Pivotal’s Open Source Contributions
Lots more interesting small projects:

•  PyMADlib – Python Wrapper for MADlib
https://github.com/gopivotal/pymadlib

•  PivotalR – R wrapper for MADlib
http://github.com/madlib-internal/PivotalR

•  Part-of-speech tagger for Twitter via SQL
http://vatsan.github.io/gp-ark-tweet-nlp/

•  Pandas via psql
(interactive PostgreSQL terminal)
https://github.com/vatsan/pandas_via_psql

@ianhuston
© Copyright 2014 Pivotal. All rights reserved.

7
Typical Engagement Tech Setup
Ÿ  Platform:

–  Greenplum Analytics Database (GPDB)
–  Pivotal HD Hadoop Distribution + HAWQ (SQL DB on Hadoop)

Ÿ  Open Source Options (http://gopivotal.com):
–  Greenplum Community Edition
–  Pivotal HD Community Edition (HAWQ not included)
–  MADlib in-database machine learning library (http://madlib.net)

Ÿ  Where Python fits in:
–  PL/Python running in-database, with nltk, scikit-learn etc
–  IPython for exploratory analysis
–  Pandas, Matplotlib etc.
@ianhuston
© Copyright 2014 Pivotal. All rights reserved.

8
PIVOTAL DATA SCIENCE
TOOLKIT
1

Find Data

Platforms
•  Greenplum DB
•  Pivotal HD
•  Hadoop (other)
•  SAS HPA
•  AWS

2

3

Run Code

Interfaces
•  pgAdminIII
•  psql
•  psycopg2
•  Terminal
•  Cygwin
•  Putty
•  Winscp

Write Code

Editing Tools
•  Vi/Vim
•  Emacs
•  Smultron
•  TextWrangler
•  Eclipse
•  Notepad++
•  IPython
•  Sublime

Languages
•  SQL
•  Bash scripting
•  C
•  C++
•  C#
•  Java
•  Python
•  R

4

Write Code for Big Data

In-Database
•  SQL
•  PL/Python
•  PL/Java
•  PL/R
•  PL/pgSQL
5

Hadoop
•  HAWQ
•  Pig
•  Hive
•  Java

6

Visualization
•  python-matplotlib
•  python-networkx
•  D3.js
•  Tableau

Implement Algorithms

Libraries
•  MADlib
Java
•  Mahout
R
•  (Too many to list!)
Text
•  OpenNLP
•  NLTK
•  GPText
C++
•  opencv

Show Results

Python
•  NumPy
•  SciPy
•  scikit-learn
•  Pandas
Programs
•  Alpine Miner
•  Rstudio
•  MATLAB
•  SAS
•  Stata

•  GraphViz
•  Gephi
•  R (ggplot2, lattice,
shiny)
•  Excel
7

Collaborate

Sharing Tools
•  Chorus
•  Confluence
•  Socialcast
•  Github
•  Google Drive &
Hangouts

A large and
varied tool box!

@ianhuston
© Copyright 2014 Pivotal. All rights reserved.

9
PL/Python

@ianhuston
© Copyright 2014 Pivotal. All rights reserved.
2013

10
MPP Architectural Overview
Think of it as multiple
PostGreSQL servers
Master

Workers
@ianhuston
© Copyright 2014 Pivotal. All rights reserved.

11
Data Parallelism
Ÿ  Little or no effort is required to break up the problem into a
number of parallel tasks, and there exists no dependency (or
communication) between those parallel tasks.
Ÿ  Examples:
–  Measure the height of each student in a classroom (explicitly
parallelizable by student)
–  MapReduce
–  map() function in Python

@ianhuston
© Copyright 2014 Pivotal. All rights reserved.

12
User-Defined Functions (UDFs)
Ÿ  PostgreSQL/Greenplum provide lots of flexibility in defining your own functions.
Ÿ  Simple UDFs are SQL queries with calling arguments and return types.

Definition:

Execution:

CREATE	
  FUNCTION	
  times2(INT)	
  
RETURNS	
  INT	
  
AS	
  $$	
  
	
  	
  	
  	
  SELECT	
  2	
  *	
  $1	
  
$$	
  LANGUAGE	
  sql;	
  

SELECT	
  times2(1);	
  
	
  times2	
  	
  
-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐	
  
	
  	
  	
  	
  	
  	
  2	
  
(1	
  row)	
  

@ianhuston
© Copyright 2014 Pivotal. All rights reserved.

13
PL/X : X in {pgsql, R, Python, Java, Perl, C etc.}
•  Allows users to write Greenplum/
PostgreSQL functions in the R/Python/
Java, Perl, pgsql or C languages

SQL
Master
Host

Ÿ  The interpreter/VM of the language ‘X’ is
installed on each node of the Greenplum
Database Cluster
•  Data Parallelism:
-  PL/X piggybacks on
Greenplum’s MPP architecture

Standby
Master

Interconnect

Segment Host
Segment
Segment

Segment Host
Segment
Segment

Segment Host
Segment
Segment

Segment Host
Segment
Segment

…

@ianhuston
© Copyright 2014 Pivotal. All rights reserved.

14
Intro to PL/Python
Ÿ  Procedural languages need to be installed on each database used.
Ÿ  Name in SQL is plpythonu, ‘u’ means untrusted so need to be superuser to install.
Ÿ  Syntax is like normal Python function with function definition line replaced by SQL wrapper.
Alternatively like a SQL User Defined Function with Python inside.

SQL wrapper
Normal Python
SQL wrapper

CREATE	
  FUNCTION	
  pymax	
  (a	
  integer,	
  b	
  integer)	
  
	
  	
  RETURNS	
  integer	
  
AS	
  $$	
  
	
  	
  if	
  a	
  >	
  b:	
  
	
  	
  	
  	
  return	
  a	
  
	
  	
  return	
  b	
  
$$	
  LANGUAGE	
  plpythonu;	
  

@ianhuston
© Copyright 2014 Pivotal. All rights reserved.

15
Examples

@ianhuston
© Copyright 2014 Pivotal. All rights reserved.
2013

16
Returning Results
Ÿ  Postgres primitive types (int, bigint, text, float8, double precision, date, NULL etc.)
Ÿ  Composite types can be returned by creating a composite type in the database:	
  
CREATE	
  TYPE	
  named_value	
  AS	
  (	
  
	
  	
  name	
  	
  text,	
  
	
  	
  value	
  	
  integer	
  
);	
  

Ÿ  Then you can return a list, tuple or dict (not sets) which reference the same structure as the table:
CREATE	
  FUNCTION	
  make_pair	
  (name	
  text,	
  value	
  integer)	
  
	
  	
  RETURNS	
  named_value	
  
AS	
  $$	
  
	
  	
  return	
  [	
  name,	
  value	
  ]	
  
	
  	
  #	
  or	
  alternatively,	
  as	
  tuple:	
  return	
  (	
  name,	
  value	
  )	
  
	
  	
  #	
  or	
  as	
  dict:	
  return	
  {	
  "name":	
  name,	
  "value":	
  value	
  }	
  
	
  	
  #	
  or	
  as	
  an	
  object	
  with	
  attributes	
  .name	
  and	
  .value	
  
$$	
  LANGUAGE	
  plpythonu;	
  

Ÿ  For functions which return multiple rows, prefix “setof” before the return type

@ianhuston
© Copyright 2014 Pivotal. All rights reserved.

17
Returning more results
You can return multiple results by wrapping them in a sequence (tuple, list or set),
an iterator or a generator:

Sequence

Generator

CREATE	
  FUNCTION	
  make_pair	
  (name	
  text)	
  
	
  	
  RETURNS	
  SETOF	
  named_value	
  
AS	
  $$	
  
	
  	
  return	
  ([	
  name,	
  1	
  ],	
  [	
  name,	
  2	
  ],	
  [	
  name,	
  3])	
  	
  
$$	
  LANGUAGE	
  plpythonu;	
  
CREATE	
  FUNCTION	
  make_pair	
  (name	
  text)	
  
	
  	
  RETURNS	
  SETOF	
  named_value	
  	
  AS	
  $$	
  
	
  	
  for	
  i	
  in	
  range(3):	
  
	
  	
  	
  	
  	
  	
  yield	
  (name,	
  i)	
  	
  
$$	
  LANGUAGE	
  plpythonu;	
  

@ianhuston
© Copyright 2014 Pivotal. All rights reserved.

18
Accessing Packages
Ÿ  On Greenplum DB: To be available packages must be installed on the
individual segment nodes.
–  Can use “parallel ssh” tool gpssh to conda/pip install
–  Currently Greenplum DB ships with Python 2.6 (!)

Ÿ  Then just import as usual inside function:

	
  	
  

CREATE	
  FUNCTION	
  make_pair	
  (name	
  text)	
  
	
  	
  RETURNS	
  named_value	
  
AS	
  $$	
  
	
  	
  import	
  numpy	
  as	
  np	
  
	
  	
  return	
  ((name,i)	
  for	
  i	
  in	
  np.arange(3))	
  
$$	
  LANGUAGE	
  plpythonu;	
  

@ianhuston
© Copyright 2014 Pivotal. All rights reserved.

19
Benefits of PL/Python
Ÿ  Easy to bring your code to the data.
Ÿ  When SQL falls short leverage your Python (or R/Java/C)
experience quickly.
Ÿ  Apply Python across terabytes of data with minimal
overhead or additional requirements.
Ÿ  Results are already in the database system, ready for further
analysis or storage.
@ianhuston
© Copyright 2014 Pivotal. All rights reserved.

20
MADlib

@ianhuston
© Copyright 2014 Pivotal. All rights reserved.
2013

21
Going Beyond Data Parallelism
Ÿ  Data Parallel computation via PL/Python libraries only allow
us to run ‘n’ models in parallel.
Ÿ  This works great when we are building one model for each
value of the group by column, but we need parallelized
algorithms to be able to build a single model on all the
available data
Ÿ  For this, we use MADlib – an open source library of parallel
in-database machine learning algorithms.
@ianhuston
© Copyright 2014 Pivotal. All rights reserved.

22
MADlib: The Origin
UrbanDictionary
mad (adj.): an adjective used to enhance a noun.
1- dude, you got skills.
2- dude, you got mad skills

•  First mention of MAD analytics was at VLDB 2009
MAD Skills: New Analysis Practices for Big Data
J. Hellerstein, J. Cohen, B. Dolan, M. Dunlap, C. Welton
(with help from: Noelle Sio, David Hubbard, James Marca)
http://db.cs.berkeley.edu/papers/vldb09-madskills.pdf

• 

Open Source!
https://github.com/madlib/madlib

• 
• 

Works on Greenplum DB,
PostgreSQL and also HAWQ &
Impala
Active development by Pivotal
- 

• 

Latest Release: v1.4 (Nov 2013)

Downloads and Docs:
http://madlib.net/

•  MADlib project initiated in late 2010:
Greenplum Analytics team and Prof. Joe Hellerstein

@ianhuston
© Copyright 2014 Pivotal. All rights reserved.

23
MADlib Executes Algorithms In-Place
MADlib User

MADlib Advantages
Master
Processor

Ø 

SQL

SQL

SQL

M

M

M

No Data Movement

Ø 

M

Use MPP architecture’s
full compute power

Ø 

Use MPP architecture’s
entire memory to
process data sets

Segment
Processors

@ianhuston
© Copyright 2014 Pivotal. All rights reserved.

24
MADlib In-Database
Functions
Descriptive Statistics

Predictive Modeling Library
Generalized Linear Models
•  Linear Regression
•  Logistic Regression
•  Multinomial Logistic Regression
•  Cox Proportional Hazards
•  Regression
•  Elastic Net Regularization
•  Sandwich Estimators (Huber white,
clustered, marginal effects)

Matrix Factorization
•  Single Value Decomposition (SVD)
•  Low-Rank

Machine Learning Algorithms
•  Principal Component Analysis (PCA)
•  Association Rules (Affinity Analysis, Market
Basket)
•  Topic Modeling (Parallel LDA)
•  Decision Trees
•  Ensemble Learners (Random Forests)
•  Support Vector Machines
•  Conditional Random Field (CRF)
•  Clustering (K-means)
•  Cross Validation
Linear Systems
•  Sparse and Dense Solvers

Sketch-based Estimators
•  CountMin (CormodeMuthukrishnan)
•  FM (Flajolet-Martin)
•  MFV (Most Frequent
Values)
Correlation
Summary
Support Modules
Array Operations
Sparse Vectors
Random Sampling
Probability Functions

@ianhuston
© Copyright 2014 Pivotal. All rights reserved.

25
Architecture
User Interface
“Driver” Functions
(outer loops of iterative algorithms, optimizer invocations)
High-level Abstraction Layer
(iteration controller, ...)

RDBMS
Built-in
Functions

SQL, generated from
specification

Python with
templated SQL
Python

Functions for Inner Loops
(for streaming algorithms)
Low-level Abstraction Layer
(matrix operations, C++ to RDBMS
type bridge, …)

C++

RDBMS Query Processing
(Greenplum, PostgreSQL, …)

@ianhuston
© Copyright 2014 Pivotal. All rights reserved.

26
How does it work ? : A Linear Regression Example
Ÿ  Finding linear dependencies between variables
–  y ≈ c0 + c1 · x1 + c2 · x2 ?
# select y, x1, x2

Vector of dependent
variables y

y
| x1 | x2
-------+------+----10.14 |
0 | 0.3
11.93 | 0.69 | 0.6
13.57 | 1.1 | 0.9
14.17 | 1.39 | 1.2
15.25 | 1.61 | 1.5
16.15 | 1.79 | 1.8

from unm limit 6;

Design Matrix X

@ianhuston
© Copyright 2014 Pivotal. All rights reserved.

27
Reminder: Linear-Regression Model
• 
•  If residuals i.i.d. Gaussians with standard deviation σ:
–  max likelihood ⇔ min sum of squared residuals

f (y | x) ∝ exp

−

1
· (y − xT c)2
2σ 2

•  First-order conditions for the following quadratic objective (in c)
yield the minimizer

@ianhuston
© Copyright 2014 Pivotal. All rights reserved.

28
Linear Regression: Streaming Algorithm
•  How to compute with a single table scan?
-1

XT

XT

X

XTX

y

XTy

@ianhuston
© Copyright 2014 Pivotal. All rights reserved.

29
Linear Regression: Parallel Computation
XT
y

Segment 1

T
X1

y1

Segment 2

T
T
X T y = X 1 X2

Master

T
X2 y2

y1
y2

=

XTy

T
Xi y i

@ianhuston
© Copyright 2014 Pivotal. All rights reserved.

30
Demos
Ÿ  We built demos to showcase our technology pipeline, using
Python technology.
Ÿ  Two use cases:
–  Topic and Sentiment Analysis of Tweets
–  London Road Traffic Disruption prediction

@ianhuston
© Copyright 2014 Pivotal. All rights reserved.

31
Topic and Sentiment Analysis Pipeline

Tweet
Stream

D3.js
Stored on
HDFS
Topic Analysis through
MADlib pLDA
(gpfdist)
Loaded as
external tables
into GPDB

Parallel Parsing of
JSON and extraction
of fields using PL/
Python

Sentiment Analysis
through custom
PL/Python functions

@ianhuston
© Copyright 2014 Pivotal. All rights reserved.

32
Transport Disruption Prediction Pipeline

Transport for London
Traffic Disruption feed

Pivotal Greenplum
Database

d3.js	
  &	
  NVD3	
  
Interactive SVG figures

Deduplication

Feature Creation

Modelling & Machine Learning

@ianhuston
© Copyright 2014 Pivotal. All rights reserved.

33
Get in touch
Feel free to contact me about PL/Python, or more generally
about Data Science and opportunities available.

@ianhuston
ihuston @ gopivotal.com
http://www.ianhuston.net
@ianhuston
© Copyright 2014 Pivotal. All rights reserved.

34
BUILT FOR THE SPEED OF BUSINESS

More Related Content

What's hot

HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019
HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019
HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019Jim Dowling
 
Big Data Hadoop Training
Big Data Hadoop TrainingBig Data Hadoop Training
Big Data Hadoop Trainingstratapps
 
Pig on Tez - Low Latency ETL with Big Data
Pig on Tez - Low Latency ETL with Big DataPig on Tez - Low Latency ETL with Big Data
Pig on Tez - Low Latency ETL with Big DataDataWorks Summit
 
Hadoop Design and k -Means Clustering
Hadoop Design and k -Means ClusteringHadoop Design and k -Means Clustering
Hadoop Design and k -Means ClusteringGeorge Ang
 
Scalable Hadoop with succinct Python: the best of both worlds
Scalable Hadoop with succinct Python: the best of both worldsScalable Hadoop with succinct Python: the best of both worlds
Scalable Hadoop with succinct Python: the best of both worldsDataWorks Summit
 
Scale up and Scale Out Anaconda and PyData
Scale up and Scale Out Anaconda and PyDataScale up and Scale Out Anaconda and PyData
Scale up and Scale Out Anaconda and PyDataTravis Oliphant
 
Nov HUG 2009: Hadoop Record Reader In Python
Nov HUG 2009: Hadoop Record Reader In PythonNov HUG 2009: Hadoop Record Reader In Python
Nov HUG 2009: Hadoop Record Reader In PythonYahoo Developer Network
 
Map reduce and Hadoop on windows
Map reduce and Hadoop on windowsMap reduce and Hadoop on windows
Map reduce and Hadoop on windowsMuhammad Shahid
 
Pig programming is more fun: New features in Pig
Pig programming is more fun: New features in PigPig programming is more fun: New features in Pig
Pig programming is more fun: New features in Pigdaijy
 
Introduction To Apache Pig at WHUG
Introduction To Apache Pig at WHUGIntroduction To Apache Pig at WHUG
Introduction To Apache Pig at WHUGAdam Kawa
 
MapReduce Design Patterns
MapReduce Design PatternsMapReduce Design Patterns
MapReduce Design PatternsDonald Miner
 
Parallel Linear Regression in Interative Reduce and YARN
Parallel Linear Regression in Interative Reduce and YARNParallel Linear Regression in Interative Reduce and YARN
Parallel Linear Regression in Interative Reduce and YARNDataWorks Summit
 
HCatalog: Table Management for Hadoop - CHUG - 20120917
HCatalog: Table Management for Hadoop - CHUG - 20120917HCatalog: Table Management for Hadoop - CHUG - 20120917
HCatalog: Table Management for Hadoop - CHUG - 20120917Chicago Hadoop Users Group
 
Hadoop, Pig, and Python (PyData NYC 2012)
Hadoop, Pig, and Python (PyData NYC 2012)Hadoop, Pig, and Python (PyData NYC 2012)
Hadoop, Pig, and Python (PyData NYC 2012)mortardata
 
Introduction To Elastic MapReduce at WHUG
Introduction To Elastic MapReduce at WHUGIntroduction To Elastic MapReduce at WHUG
Introduction To Elastic MapReduce at WHUGAdam Kawa
 

What's hot (20)

HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019
HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019
HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019
 
Big Data Hadoop Training
Big Data Hadoop TrainingBig Data Hadoop Training
Big Data Hadoop Training
 
Apache pig
Apache pigApache pig
Apache pig
 
Running R on Hadoop - CHUG - 20120815
Running R on Hadoop - CHUG - 20120815Running R on Hadoop - CHUG - 20120815
Running R on Hadoop - CHUG - 20120815
 
Map Reduce introduction
Map Reduce introductionMap Reduce introduction
Map Reduce introduction
 
Pig on Tez - Low Latency ETL with Big Data
Pig on Tez - Low Latency ETL with Big DataPig on Tez - Low Latency ETL with Big Data
Pig on Tez - Low Latency ETL with Big Data
 
Hadoop Design and k -Means Clustering
Hadoop Design and k -Means ClusteringHadoop Design and k -Means Clustering
Hadoop Design and k -Means Clustering
 
Scalable Hadoop with succinct Python: the best of both worlds
Scalable Hadoop with succinct Python: the best of both worldsScalable Hadoop with succinct Python: the best of both worlds
Scalable Hadoop with succinct Python: the best of both worlds
 
Scale up and Scale Out Anaconda and PyData
Scale up and Scale Out Anaconda and PyDataScale up and Scale Out Anaconda and PyData
Scale up and Scale Out Anaconda and PyData
 
Nov HUG 2009: Hadoop Record Reader In Python
Nov HUG 2009: Hadoop Record Reader In PythonNov HUG 2009: Hadoop Record Reader In Python
Nov HUG 2009: Hadoop Record Reader In Python
 
Map reduce and Hadoop on windows
Map reduce and Hadoop on windowsMap reduce and Hadoop on windows
Map reduce and Hadoop on windows
 
myHadoop 0.30
myHadoop 0.30myHadoop 0.30
myHadoop 0.30
 
Apache Pig
Apache PigApache Pig
Apache Pig
 
Pig programming is more fun: New features in Pig
Pig programming is more fun: New features in PigPig programming is more fun: New features in Pig
Pig programming is more fun: New features in Pig
 
Introduction To Apache Pig at WHUG
Introduction To Apache Pig at WHUGIntroduction To Apache Pig at WHUG
Introduction To Apache Pig at WHUG
 
MapReduce Design Patterns
MapReduce Design PatternsMapReduce Design Patterns
MapReduce Design Patterns
 
Parallel Linear Regression in Interative Reduce and YARN
Parallel Linear Regression in Interative Reduce and YARNParallel Linear Regression in Interative Reduce and YARN
Parallel Linear Regression in Interative Reduce and YARN
 
HCatalog: Table Management for Hadoop - CHUG - 20120917
HCatalog: Table Management for Hadoop - CHUG - 20120917HCatalog: Table Management for Hadoop - CHUG - 20120917
HCatalog: Table Management for Hadoop - CHUG - 20120917
 
Hadoop, Pig, and Python (PyData NYC 2012)
Hadoop, Pig, and Python (PyData NYC 2012)Hadoop, Pig, and Python (PyData NYC 2012)
Hadoop, Pig, and Python (PyData NYC 2012)
 
Introduction To Elastic MapReduce at WHUG
Introduction To Elastic MapReduce at WHUGIntroduction To Elastic MapReduce at WHUG
Introduction To Elastic MapReduce at WHUG
 

Viewers also liked

The DSP/BIOS Bridge - OMAP3
The DSP/BIOS Bridge - OMAP3The DSP/BIOS Bridge - OMAP3
The DSP/BIOS Bridge - OMAP3vjaquez
 
Introduction to Parallel Processing Algorithms in Shared Nothing Databases
Introduction to Parallel Processing Algorithms in Shared Nothing DatabasesIntroduction to Parallel Processing Algorithms in Shared Nothing Databases
Introduction to Parallel Processing Algorithms in Shared Nothing DatabasesOfir Manor
 
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMText Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMlucenerevolution
 
A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...
A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...
A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...Srivatsan Ramanujam
 
Cloud Foundry for Data Science
Cloud Foundry for Data ScienceCloud Foundry for Data Science
Cloud Foundry for Data ScienceIan Huston
 
Programming with Python and PostgreSQL
Programming with Python and PostgreSQLProgramming with Python and PostgreSQL
Programming with Python and PostgreSQLPeter Eisentraut
 
Big Data & Sentiment Analysis
Big Data & Sentiment AnalysisBig Data & Sentiment Analysis
Big Data & Sentiment AnalysisMichel Bruley
 

Viewers also liked (7)

The DSP/BIOS Bridge - OMAP3
The DSP/BIOS Bridge - OMAP3The DSP/BIOS Bridge - OMAP3
The DSP/BIOS Bridge - OMAP3
 
Introduction to Parallel Processing Algorithms in Shared Nothing Databases
Introduction to Parallel Processing Algorithms in Shared Nothing DatabasesIntroduction to Parallel Processing Algorithms in Shared Nothing Databases
Introduction to Parallel Processing Algorithms in Shared Nothing Databases
 
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMText Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
 
A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...
A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...
A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...
 
Cloud Foundry for Data Science
Cloud Foundry for Data ScienceCloud Foundry for Data Science
Cloud Foundry for Data Science
 
Programming with Python and PostgreSQL
Programming with Python and PostgreSQLProgramming with Python and PostgreSQL
Programming with Python and PostgreSQL
 
Big Data & Sentiment Analysis
Big Data & Sentiment AnalysisBig Data & Sentiment Analysis
Big Data & Sentiment Analysis
 

Similar to Massively Parallel Processing with Procedural Python (PyData London 2014)

Massively Parallel Process with Prodedural Python by Ian Huston
Massively Parallel Process with Prodedural Python by Ian HustonMassively Parallel Process with Prodedural Python by Ian Huston
Massively Parallel Process with Prodedural Python by Ian HustonPyData
 
Python Powered Data Science at Pivotal (PyData 2013)
Python Powered Data Science at Pivotal (PyData 2013)Python Powered Data Science at Pivotal (PyData 2013)
Python Powered Data Science at Pivotal (PyData 2013)Srivatsan Ramanujam
 
Data Science Amsterdam - Massively Parallel Processing with Procedural Languages
Data Science Amsterdam - Massively Parallel Processing with Procedural LanguagesData Science Amsterdam - Massively Parallel Processing with Procedural Languages
Data Science Amsterdam - Massively Parallel Processing with Procedural LanguagesIan Huston
 
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal Srivatsan Ramanujam
 
Docopt, beautiful command-line options for R, user2014
Docopt, beautiful command-line options for R,  user2014Docopt, beautiful command-line options for R,  user2014
Docopt, beautiful command-line options for R, user2014Edwin de Jonge
 
Enabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data CitizenEnabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data CitizenWes McKinney
 
Sql saturday pig session (wes floyd) v2
Sql saturday   pig session (wes floyd) v2Sql saturday   pig session (wes floyd) v2
Sql saturday pig session (wes floyd) v2Wes Floyd
 
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labsApache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labsViswanath Gangavaram
 
Hopsworks at Google AI Huddle, Sunnyvale
Hopsworks at Google AI Huddle, SunnyvaleHopsworks at Google AI Huddle, Sunnyvale
Hopsworks at Google AI Huddle, SunnyvaleJim Dowling
 
High-Performance Python On Spark
High-Performance Python On SparkHigh-Performance Python On Spark
High-Performance Python On SparkJen Aman
 
High Performance Python on Apache Spark
High Performance Python on Apache SparkHigh Performance Python on Apache Spark
High Performance Python on Apache SparkWes McKinney
 
Business logic with PostgreSQL and Python
Business logic with PostgreSQL and PythonBusiness logic with PostgreSQL and Python
Business logic with PostgreSQL and PythonHubert Piotrowski
 
Programming Under Linux In Python
Programming Under Linux In PythonProgramming Under Linux In Python
Programming Under Linux In PythonMarwan Osman
 
Introduction To Python
Introduction To PythonIntroduction To Python
Introduction To PythonVanessa Rene
 
Government Polytechnic Arvi-1.pptx
Government Polytechnic Arvi-1.pptxGovernment Polytechnic Arvi-1.pptx
Government Polytechnic Arvi-1.pptxShivamDenge
 
What we can learn from Rebol?
What we can learn from Rebol?What we can learn from Rebol?
What we can learn from Rebol?lichtkind
 
Researh toolbox-data-analysis-with-python
Researh toolbox-data-analysis-with-pythonResearh toolbox-data-analysis-with-python
Researh toolbox-data-analysis-with-pythonWaternomics
 

Similar to Massively Parallel Processing with Procedural Python (PyData London 2014) (20)

Massively Parallel Process with Prodedural Python by Ian Huston
Massively Parallel Process with Prodedural Python by Ian HustonMassively Parallel Process with Prodedural Python by Ian Huston
Massively Parallel Process with Prodedural Python by Ian Huston
 
Python Powered Data Science at Pivotal (PyData 2013)
Python Powered Data Science at Pivotal (PyData 2013)Python Powered Data Science at Pivotal (PyData 2013)
Python Powered Data Science at Pivotal (PyData 2013)
 
Data Science Amsterdam - Massively Parallel Processing with Procedural Languages
Data Science Amsterdam - Massively Parallel Processing with Procedural LanguagesData Science Amsterdam - Massively Parallel Processing with Procedural Languages
Data Science Amsterdam - Massively Parallel Processing with Procedural Languages
 
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
 
Docopt, beautiful command-line options for R, user2014
Docopt, beautiful command-line options for R,  user2014Docopt, beautiful command-line options for R,  user2014
Docopt, beautiful command-line options for R, user2014
 
Enabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data CitizenEnabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data Citizen
 
Sql saturday pig session (wes floyd) v2
Sql saturday   pig session (wes floyd) v2Sql saturday   pig session (wes floyd) v2
Sql saturday pig session (wes floyd) v2
 
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labsApache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
 
Hopsworks at Google AI Huddle, Sunnyvale
Hopsworks at Google AI Huddle, SunnyvaleHopsworks at Google AI Huddle, Sunnyvale
Hopsworks at Google AI Huddle, Sunnyvale
 
High-Performance Python On Spark
High-Performance Python On SparkHigh-Performance Python On Spark
High-Performance Python On Spark
 
High Performance Python on Apache Spark
High Performance Python on Apache SparkHigh Performance Python on Apache Spark
High Performance Python on Apache Spark
 
Business logic with PostgreSQL and Python
Business logic with PostgreSQL and PythonBusiness logic with PostgreSQL and Python
Business logic with PostgreSQL and Python
 
Programming Under Linux In Python
Programming Under Linux In PythonProgramming Under Linux In Python
Programming Under Linux In Python
 
Introduction To Python
Introduction To PythonIntroduction To Python
Introduction To Python
 
Pyhton-1a-Basics.pdf
Pyhton-1a-Basics.pdfPyhton-1a-Basics.pdf
Pyhton-1a-Basics.pdf
 
Government Polytechnic Arvi-1.pptx
Government Polytechnic Arvi-1.pptxGovernment Polytechnic Arvi-1.pptx
Government Polytechnic Arvi-1.pptx
 
What we can learn from Rebol?
What we can learn from Rebol?What we can learn from Rebol?
What we can learn from Rebol?
 
Shivam PPT.pptx
Shivam PPT.pptxShivam PPT.pptx
Shivam PPT.pptx
 
biopython, doctest and makefiles
biopython, doctest and makefilesbiopython, doctest and makefiles
biopython, doctest and makefiles
 
Researh toolbox-data-analysis-with-python
Researh toolbox-data-analysis-with-pythonResearh toolbox-data-analysis-with-python
Researh toolbox-data-analysis-with-python
 

More from Ian Huston

CFSummit: Data Science on Cloud Foundry
CFSummit: Data Science on Cloud FoundryCFSummit: Data Science on Cloud Foundry
CFSummit: Data Science on Cloud FoundryIan Huston
 
Driving the Future of Smart Cities - How to Beat the Traffic (Pivotal talk at...
Driving the Future of Smart Cities - How to Beat the Traffic (Pivotal talk at...Driving the Future of Smart Cities - How to Beat the Traffic (Pivotal talk at...
Driving the Future of Smart Cities - How to Beat the Traffic (Pivotal talk at...Ian Huston
 
Calculating Non-adiabatic Pressure Perturbations during Multi-field Inflation
Calculating Non-adiabatic Pressure Perturbations during Multi-field InflationCalculating Non-adiabatic Pressure Perturbations during Multi-field Inflation
Calculating Non-adiabatic Pressure Perturbations during Multi-field InflationIan Huston
 
Second Order Perturbations - National Astronomy Meeting 2011
Second Order Perturbations - National Astronomy Meeting 2011Second Order Perturbations - National Astronomy Meeting 2011
Second Order Perturbations - National Astronomy Meeting 2011Ian Huston
 
Second Order Perturbations During Inflation Beyond Slow-roll
Second Order Perturbations During Inflation Beyond Slow-rollSecond Order Perturbations During Inflation Beyond Slow-roll
Second Order Perturbations During Inflation Beyond Slow-rollIan Huston
 
Inflation as a solution to the problems of the Big Bang
Inflation as a solution to the problems of the Big BangInflation as a solution to the problems of the Big Bang
Inflation as a solution to the problems of the Big BangIan Huston
 
Cosmological Perturbations and Numerical Simulations
Cosmological Perturbations and Numerical SimulationsCosmological Perturbations and Numerical Simulations
Cosmological Perturbations and Numerical SimulationsIan Huston
 
Cosmo09 presentation
Cosmo09 presentationCosmo09 presentation
Cosmo09 presentationIan Huston
 

More from Ian Huston (8)

CFSummit: Data Science on Cloud Foundry
CFSummit: Data Science on Cloud FoundryCFSummit: Data Science on Cloud Foundry
CFSummit: Data Science on Cloud Foundry
 
Driving the Future of Smart Cities - How to Beat the Traffic (Pivotal talk at...
Driving the Future of Smart Cities - How to Beat the Traffic (Pivotal talk at...Driving the Future of Smart Cities - How to Beat the Traffic (Pivotal talk at...
Driving the Future of Smart Cities - How to Beat the Traffic (Pivotal talk at...
 
Calculating Non-adiabatic Pressure Perturbations during Multi-field Inflation
Calculating Non-adiabatic Pressure Perturbations during Multi-field InflationCalculating Non-adiabatic Pressure Perturbations during Multi-field Inflation
Calculating Non-adiabatic Pressure Perturbations during Multi-field Inflation
 
Second Order Perturbations - National Astronomy Meeting 2011
Second Order Perturbations - National Astronomy Meeting 2011Second Order Perturbations - National Astronomy Meeting 2011
Second Order Perturbations - National Astronomy Meeting 2011
 
Second Order Perturbations During Inflation Beyond Slow-roll
Second Order Perturbations During Inflation Beyond Slow-rollSecond Order Perturbations During Inflation Beyond Slow-roll
Second Order Perturbations During Inflation Beyond Slow-roll
 
Inflation as a solution to the problems of the Big Bang
Inflation as a solution to the problems of the Big BangInflation as a solution to the problems of the Big Bang
Inflation as a solution to the problems of the Big Bang
 
Cosmological Perturbations and Numerical Simulations
Cosmological Perturbations and Numerical SimulationsCosmological Perturbations and Numerical Simulations
Cosmological Perturbations and Numerical Simulations
 
Cosmo09 presentation
Cosmo09 presentationCosmo09 presentation
Cosmo09 presentation
 

Recently uploaded

Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentationphoebematthew05
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfngoud9212
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDGMarianaLemus7
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfjimielynbastida
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 

Recently uploaded (20)

Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentation
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdf
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDG
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdf
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
 

Massively Parallel Processing with Procedural Python (PyData London 2014)

  • 1. BUILT FOR THE SPEED OF BUSINESS
  • 2. Massively Parallel Processing with Procedural Python How do we use the PyData stack in data science engagements at Pivotal? Ian Huston, @ianhuston Data Scientist, Pivotal @ianhuston © Copyright 2014 Pivotal. All rights reserved. 2013 2
  • 3. Some Links for this talk Ÿ  Simple code examples: https://github.com/ihuston/plpython_examples Ÿ  IPython notebook rendered with nbviewer: http://tinyurl.com/ih-plpython Ÿ  More info (written for PL/R but applies to PL/Python): http://gopivotal.github.io/gp-r/ Ÿ  Traffic Disruption demo (if we have time) http://ds-demo-transport.cfapps.io @ianhuston © Copyright 2014 Pivotal. All rights reserved. 3
  • 4. About Pivotal Data-Driven Application Development Pivotal Data Science Labs Cloud Application Platform Data & Analytics Platform Virtualization Cloud Storage @ianhuston © Copyright 2014 Pivotal. All rights reserved. 4
  • 5. What do our customers look like? Ÿ  Large enterprises with lots of data collected –  Work with 10s of TBs to PBs of data, structured & unstructured Ÿ  Not able to get what they want out of their data –  Old Legacy systems with high cost and no flexibility –  Response times are too slow for interactive data analysis –  Can only deal with small samples of data locally Ÿ  They want to transform into data driven enterprises @ianhuston © Copyright 2014 Pivotal. All rights reserved. 5
  • 6. Open Source is Pivotal @ianhuston © Copyright 2014 Pivotal. All rights reserved. 6
  • 7. Pivotal’s Open Source Contributions Lots more interesting small projects: •  PyMADlib – Python Wrapper for MADlib https://github.com/gopivotal/pymadlib •  PivotalR – R wrapper for MADlib http://github.com/madlib-internal/PivotalR •  Part-of-speech tagger for Twitter via SQL http://vatsan.github.io/gp-ark-tweet-nlp/ •  Pandas via psql (interactive PostgreSQL terminal) https://github.com/vatsan/pandas_via_psql @ianhuston © Copyright 2014 Pivotal. All rights reserved. 7
  • 8. Typical Engagement Tech Setup Ÿ  Platform: –  Greenplum Analytics Database (GPDB) –  Pivotal HD Hadoop Distribution + HAWQ (SQL DB on Hadoop) Ÿ  Open Source Options (http://gopivotal.com): –  Greenplum Community Edition –  Pivotal HD Community Edition (HAWQ not included) –  MADlib in-database machine learning library (http://madlib.net) Ÿ  Where Python fits in: –  PL/Python running in-database, with nltk, scikit-learn etc –  IPython for exploratory analysis –  Pandas, Matplotlib etc. @ianhuston © Copyright 2014 Pivotal. All rights reserved. 8
  • 9. PIVOTAL DATA SCIENCE TOOLKIT 1 Find Data Platforms •  Greenplum DB •  Pivotal HD •  Hadoop (other) •  SAS HPA •  AWS 2 3 Run Code Interfaces •  pgAdminIII •  psql •  psycopg2 •  Terminal •  Cygwin •  Putty •  Winscp Write Code Editing Tools •  Vi/Vim •  Emacs •  Smultron •  TextWrangler •  Eclipse •  Notepad++ •  IPython •  Sublime Languages •  SQL •  Bash scripting •  C •  C++ •  C# •  Java •  Python •  R 4 Write Code for Big Data In-Database •  SQL •  PL/Python •  PL/Java •  PL/R •  PL/pgSQL 5 Hadoop •  HAWQ •  Pig •  Hive •  Java 6 Visualization •  python-matplotlib •  python-networkx •  D3.js •  Tableau Implement Algorithms Libraries •  MADlib Java •  Mahout R •  (Too many to list!) Text •  OpenNLP •  NLTK •  GPText C++ •  opencv Show Results Python •  NumPy •  SciPy •  scikit-learn •  Pandas Programs •  Alpine Miner •  Rstudio •  MATLAB •  SAS •  Stata •  GraphViz •  Gephi •  R (ggplot2, lattice, shiny) •  Excel 7 Collaborate Sharing Tools •  Chorus •  Confluence •  Socialcast •  Github •  Google Drive & Hangouts A large and varied tool box! @ianhuston © Copyright 2014 Pivotal. All rights reserved. 9
  • 10. PL/Python @ianhuston © Copyright 2014 Pivotal. All rights reserved. 2013 10
  • 11. MPP Architectural Overview Think of it as multiple PostGreSQL servers Master Workers @ianhuston © Copyright 2014 Pivotal. All rights reserved. 11
  • 12. Data Parallelism Ÿ  Little or no effort is required to break up the problem into a number of parallel tasks, and there exists no dependency (or communication) between those parallel tasks. Ÿ  Examples: –  Measure the height of each student in a classroom (explicitly parallelizable by student) –  MapReduce –  map() function in Python @ianhuston © Copyright 2014 Pivotal. All rights reserved. 12
  • 13. User-Defined Functions (UDFs) Ÿ  PostgreSQL/Greenplum provide lots of flexibility in defining your own functions. Ÿ  Simple UDFs are SQL queries with calling arguments and return types. Definition: Execution: CREATE  FUNCTION  times2(INT)   RETURNS  INT   AS  $$          SELECT  2  *  $1   $$  LANGUAGE  sql;   SELECT  times2(1);    times2     -­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐              2   (1  row)   @ianhuston © Copyright 2014 Pivotal. All rights reserved. 13
  • 14. PL/X : X in {pgsql, R, Python, Java, Perl, C etc.} •  Allows users to write Greenplum/ PostgreSQL functions in the R/Python/ Java, Perl, pgsql or C languages SQL Master Host Ÿ  The interpreter/VM of the language ‘X’ is installed on each node of the Greenplum Database Cluster •  Data Parallelism: -  PL/X piggybacks on Greenplum’s MPP architecture Standby Master Interconnect Segment Host Segment Segment Segment Host Segment Segment Segment Host Segment Segment Segment Host Segment Segment … @ianhuston © Copyright 2014 Pivotal. All rights reserved. 14
  • 15. Intro to PL/Python Ÿ  Procedural languages need to be installed on each database used. Ÿ  Name in SQL is plpythonu, ‘u’ means untrusted so need to be superuser to install. Ÿ  Syntax is like normal Python function with function definition line replaced by SQL wrapper. Alternatively like a SQL User Defined Function with Python inside. SQL wrapper Normal Python SQL wrapper CREATE  FUNCTION  pymax  (a  integer,  b  integer)      RETURNS  integer   AS  $$      if  a  >  b:          return  a      return  b   $$  LANGUAGE  plpythonu;   @ianhuston © Copyright 2014 Pivotal. All rights reserved. 15
  • 16. Examples @ianhuston © Copyright 2014 Pivotal. All rights reserved. 2013 16
  • 17. Returning Results Ÿ  Postgres primitive types (int, bigint, text, float8, double precision, date, NULL etc.) Ÿ  Composite types can be returned by creating a composite type in the database:   CREATE  TYPE  named_value  AS  (      name    text,      value    integer   );   Ÿ  Then you can return a list, tuple or dict (not sets) which reference the same structure as the table: CREATE  FUNCTION  make_pair  (name  text,  value  integer)      RETURNS  named_value   AS  $$      return  [  name,  value  ]      #  or  alternatively,  as  tuple:  return  (  name,  value  )      #  or  as  dict:  return  {  "name":  name,  "value":  value  }      #  or  as  an  object  with  attributes  .name  and  .value   $$  LANGUAGE  plpythonu;   Ÿ  For functions which return multiple rows, prefix “setof” before the return type @ianhuston © Copyright 2014 Pivotal. All rights reserved. 17
  • 18. Returning more results You can return multiple results by wrapping them in a sequence (tuple, list or set), an iterator or a generator: Sequence Generator CREATE  FUNCTION  make_pair  (name  text)      RETURNS  SETOF  named_value   AS  $$      return  ([  name,  1  ],  [  name,  2  ],  [  name,  3])     $$  LANGUAGE  plpythonu;   CREATE  FUNCTION  make_pair  (name  text)      RETURNS  SETOF  named_value    AS  $$      for  i  in  range(3):              yield  (name,  i)     $$  LANGUAGE  plpythonu;   @ianhuston © Copyright 2014 Pivotal. All rights reserved. 18
  • 19. Accessing Packages Ÿ  On Greenplum DB: To be available packages must be installed on the individual segment nodes. –  Can use “parallel ssh” tool gpssh to conda/pip install –  Currently Greenplum DB ships with Python 2.6 (!) Ÿ  Then just import as usual inside function:     CREATE  FUNCTION  make_pair  (name  text)      RETURNS  named_value   AS  $$      import  numpy  as  np      return  ((name,i)  for  i  in  np.arange(3))   $$  LANGUAGE  plpythonu;   @ianhuston © Copyright 2014 Pivotal. All rights reserved. 19
  • 20. Benefits of PL/Python Ÿ  Easy to bring your code to the data. Ÿ  When SQL falls short leverage your Python (or R/Java/C) experience quickly. Ÿ  Apply Python across terabytes of data with minimal overhead or additional requirements. Ÿ  Results are already in the database system, ready for further analysis or storage. @ianhuston © Copyright 2014 Pivotal. All rights reserved. 20
  • 21. MADlib @ianhuston © Copyright 2014 Pivotal. All rights reserved. 2013 21
  • 22. Going Beyond Data Parallelism Ÿ  Data Parallel computation via PL/Python libraries only allow us to run ‘n’ models in parallel. Ÿ  This works great when we are building one model for each value of the group by column, but we need parallelized algorithms to be able to build a single model on all the available data Ÿ  For this, we use MADlib – an open source library of parallel in-database machine learning algorithms. @ianhuston © Copyright 2014 Pivotal. All rights reserved. 22
  • 23. MADlib: The Origin UrbanDictionary mad (adj.): an adjective used to enhance a noun. 1- dude, you got skills. 2- dude, you got mad skills •  First mention of MAD analytics was at VLDB 2009 MAD Skills: New Analysis Practices for Big Data J. Hellerstein, J. Cohen, B. Dolan, M. Dunlap, C. Welton (with help from: Noelle Sio, David Hubbard, James Marca) http://db.cs.berkeley.edu/papers/vldb09-madskills.pdf •  Open Source! https://github.com/madlib/madlib •  •  Works on Greenplum DB, PostgreSQL and also HAWQ & Impala Active development by Pivotal -  •  Latest Release: v1.4 (Nov 2013) Downloads and Docs: http://madlib.net/ •  MADlib project initiated in late 2010: Greenplum Analytics team and Prof. Joe Hellerstein @ianhuston © Copyright 2014 Pivotal. All rights reserved. 23
  • 24. MADlib Executes Algorithms In-Place MADlib User MADlib Advantages Master Processor Ø  SQL SQL SQL M M M No Data Movement Ø  M Use MPP architecture’s full compute power Ø  Use MPP architecture’s entire memory to process data sets Segment Processors @ianhuston © Copyright 2014 Pivotal. All rights reserved. 24
  • 25. MADlib In-Database Functions Descriptive Statistics Predictive Modeling Library Generalized Linear Models •  Linear Regression •  Logistic Regression •  Multinomial Logistic Regression •  Cox Proportional Hazards •  Regression •  Elastic Net Regularization •  Sandwich Estimators (Huber white, clustered, marginal effects) Matrix Factorization •  Single Value Decomposition (SVD) •  Low-Rank Machine Learning Algorithms •  Principal Component Analysis (PCA) •  Association Rules (Affinity Analysis, Market Basket) •  Topic Modeling (Parallel LDA) •  Decision Trees •  Ensemble Learners (Random Forests) •  Support Vector Machines •  Conditional Random Field (CRF) •  Clustering (K-means) •  Cross Validation Linear Systems •  Sparse and Dense Solvers Sketch-based Estimators •  CountMin (CormodeMuthukrishnan) •  FM (Flajolet-Martin) •  MFV (Most Frequent Values) Correlation Summary Support Modules Array Operations Sparse Vectors Random Sampling Probability Functions @ianhuston © Copyright 2014 Pivotal. All rights reserved. 25
  • 26. Architecture User Interface “Driver” Functions (outer loops of iterative algorithms, optimizer invocations) High-level Abstraction Layer (iteration controller, ...) RDBMS Built-in Functions SQL, generated from specification Python with templated SQL Python Functions for Inner Loops (for streaming algorithms) Low-level Abstraction Layer (matrix operations, C++ to RDBMS type bridge, …) C++ RDBMS Query Processing (Greenplum, PostgreSQL, …) @ianhuston © Copyright 2014 Pivotal. All rights reserved. 26
  • 27. How does it work ? : A Linear Regression Example Ÿ  Finding linear dependencies between variables –  y ≈ c0 + c1 · x1 + c2 · x2 ? # select y, x1, x2 Vector of dependent variables y y | x1 | x2 -------+------+----10.14 | 0 | 0.3 11.93 | 0.69 | 0.6 13.57 | 1.1 | 0.9 14.17 | 1.39 | 1.2 15.25 | 1.61 | 1.5 16.15 | 1.79 | 1.8 from unm limit 6; Design Matrix X @ianhuston © Copyright 2014 Pivotal. All rights reserved. 27
  • 28. Reminder: Linear-Regression Model •  •  If residuals i.i.d. Gaussians with standard deviation σ: –  max likelihood ⇔ min sum of squared residuals f (y | x) ∝ exp − 1 · (y − xT c)2 2σ 2 •  First-order conditions for the following quadratic objective (in c) yield the minimizer @ianhuston © Copyright 2014 Pivotal. All rights reserved. 28
  • 29. Linear Regression: Streaming Algorithm •  How to compute with a single table scan? -1 XT XT X XTX y XTy @ianhuston © Copyright 2014 Pivotal. All rights reserved. 29
  • 30. Linear Regression: Parallel Computation XT y Segment 1 T X1 y1 Segment 2 T T X T y = X 1 X2 Master T X2 y2 y1 y2 = XTy T Xi y i @ianhuston © Copyright 2014 Pivotal. All rights reserved. 30
  • 31. Demos Ÿ  We built demos to showcase our technology pipeline, using Python technology. Ÿ  Two use cases: –  Topic and Sentiment Analysis of Tweets –  London Road Traffic Disruption prediction @ianhuston © Copyright 2014 Pivotal. All rights reserved. 31
  • 32. Topic and Sentiment Analysis Pipeline Tweet Stream D3.js Stored on HDFS Topic Analysis through MADlib pLDA (gpfdist) Loaded as external tables into GPDB Parallel Parsing of JSON and extraction of fields using PL/ Python Sentiment Analysis through custom PL/Python functions @ianhuston © Copyright 2014 Pivotal. All rights reserved. 32
  • 33. Transport Disruption Prediction Pipeline Transport for London Traffic Disruption feed Pivotal Greenplum Database d3.js  &  NVD3   Interactive SVG figures Deduplication Feature Creation Modelling & Machine Learning @ianhuston © Copyright 2014 Pivotal. All rights reserved. 33
  • 34. Get in touch Feel free to contact me about PL/Python, or more generally about Data Science and opportunities available. @ianhuston ihuston @ gopivotal.com http://www.ianhuston.net @ianhuston © Copyright 2014 Pivotal. All rights reserved. 34
  • 35. BUILT FOR THE SPEED OF BUSINESS