SlideShare a Scribd company logo
BUILT FOR THE SPEED OF BUSINESS
@ianhuston
2© Copyright 2014 Pivotal. All rights reserved. 2© Copyright 2013 Pivotal. All rights reserved. 2© Copyright 2013 Pivotal. All rights reserved. 2© Copyright 2013 Pivotal. All rights reserved. 2© Copyright 2014 Pivotal. All rights reserved.
Massively Parallel Processing
with Procedural Languages
PL/R and PL/Python
Ian Huston, @ianhuston
Data Scientist, Pivotal
@ianhuston
3© Copyright 2014 Pivotal. All rights reserved.
Some Links for this talk
Ÿ  Simple code examples:
https://github.com/ihuston/plpython_examples
Ÿ  IPython notebook rendered with nbviewer:
http://tinyurl.com/ih-plpython
Ÿ  More info (especially for PL/R):
http://gopivotal.github.io/gp-r/
@ianhuston
4© Copyright 2014 Pivotal. All rights reserved.
The Pivotal Message
Agile: data driven apps and
rapid time to value
Data Lake: store everything,
analyze anything
Enterprise PaaS: revolutionary
software development and
speed; build the right thing
@ianhuston
5© Copyright 2014 Pivotal. All rights reserved.
Open Source is Pivotal
@ianhuston
6© Copyright 2014 Pivotal. All rights reserved.
Introducing Pivotal Data Labs
Our Charter:
Pivotal Data Labs is Pivotal’s differentiated and highly opinionated data-centric
service delivery organization.
Our Goals:
Expedite customer time-to-value and ROI, by driving business-aligned innovation
and solutions assurance within Pivotal’s Data Fabric technologies.
Drive customer adoption and autonomy across the full spectrum of Pivotal Data
technologies through best-in-class data science and data engineering services,
with a deep emphasis on knowledge transfer.
@ianhuston
7© Copyright 2014 Pivotal. All rights reserved. 7© Copyright 2013 Pivotal. All rights reserved. 7© Copyright 2013 Pivotal. All rights reserved. 7© Copyright 2013 Pivotal. All rights reserved. 7© Copyright 2014 Pivotal. All rights reserved.
Procedural Languages
@ianhuston
8© Copyright 2014 Pivotal. All rights reserved.
MPP Architectural Overview
Think of it as multiple
PostGreSQL servers
Workers
Master
@ianhuston
9© Copyright 2014 Pivotal. All rights reserved.
User-Defined Functions (UDFs)
Ÿ  PostgreSQL/Greenplum provide lots of flexibility in defining your own functions.
Ÿ  Simple UDFs are SQL queries with calling arguments and return types.
Definition:
CREATE	
  FUNCTION	
  times2(INT)	
  
RETURNS	
  INT	
  
AS	
  $$	
  
	
  	
  	
  	
  SELECT	
  2	
  *	
  $1	
  
$$	
  LANGUAGE	
  sql;	
  
Execution:
SELECT	
  times2(1);	
  
	
  times2	
  	
  
-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐	
  
	
  	
  	
  	
  	
  	
  2	
  
(1	
  row)	
  
@ianhuston
10© Copyright 2014 Pivotal. All rights reserved.
Ÿ  The interpreter/VM of the language ‘X’ is
installed on each node of the Greenplum
Database Cluster
•  Data Parallelism:
-  PL/X piggybacks on
Greenplum’s MPP architecture
•  Allows users to write Greenplum/
PostgreSQL functions in the R/Python/
Java, Perl, pgsql or C languages Standby
Master
…
Master
Host
SQL
Interconnect
Segment Host
Segment
Segment
Segment Host
Segment
Segment
Segment Host
Segment
Segment
Segment Host
Segment
Segment
PL/X : X in {pgsql, R, Python, Java, Perl, C etc.}
@ianhuston
11© Copyright 2014 Pivotal. All rights reserved.
Data Parallelism
Ÿ  Little or no effort is required to break up the problem into a
number of parallel tasks, and there exists no dependency (or
communication) between those parallel tasks.
Ÿ  Examples:
–  Measure the height of each student in a classroom (explicitly
parallelizable by student)
–  MapReduce
–  Individual machine learning models per customer/region/etc.
@ianhuston
12© Copyright 2014 Pivotal. All rights reserved.
Intro to PL/Python
Ÿ  Procedural languages need to be installed on each database used.
Ÿ  Syntax is like normal Python function inside a SQL wrapper
Ÿ  Need to specify return type
CREATE	
  FUNCTION	
  pymax	
  (a	
  integer,	
  b	
  integer)	
  
	
  	
  RETURNS	
  integer	
  
AS	
  $$	
  
	
  	
  if	
  a	
  >	
  b:	
  
	
  	
  	
  	
  return	
  a	
  
	
  	
  return	
  b	
  
$$	
  LANGUAGE	
  plpythonu;	
  
SQL wrapper
SQL wrapper
Normal Python
@ianhuston
13© Copyright 2014 Pivotal. All rights reserved.
Returning Results
Ÿ  Postgres primitive types (int, bigint, text, float8, double precision, date, NULL etc.)
Ÿ  Composite types can be returned by creating a composite type in the database:	
  
CREATE	
  TYPE	
  named_value	
  AS	
  (	
  
	
  	
  name	
  	
  text,	
  
	
  	
  value	
  	
  integer	
  
);	
  
Ÿ  Then you can return a list, tuple or dict (not sets) which reference the same structure as the table:
CREATE	
  FUNCTION	
  make_pair	
  (name	
  text,	
  value	
  integer)	
  
	
  	
  RETURNS	
  named_value	
  
AS	
  $$	
  
	
  	
  return	
  [	
  name,	
  value	
  ]	
  
	
  	
  #	
  or	
  alternatively,	
  as	
  tuple:	
  return	
  (	
  name,	
  value	
  )	
  
	
  	
  #	
  or	
  as	
  dict:	
  return	
  {	
  "name":	
  name,	
  "value":	
  value	
  }	
  
	
  	
  #	
  or	
  as	
  an	
  object	
  with	
  attributes	
  .name	
  and	
  .value	
  
$$	
  LANGUAGE	
  plpythonu;	
  
Ÿ  For functions which return multiple rows, prefix “setof” before the return type
@ianhuston
14© Copyright 2014 Pivotal. All rights reserved.
Parallelisation
Separate data into units using GROUP BY
Aggregate input for function using array_agg
SELECT
name
, plp.linregr(array_agg(x), array_agg(y))
FROM plp.test_data
GROUP BY name ORDER BY name;
@ianhuston
15© Copyright 2014 Pivotal. All rights reserved.
Benefits of Procedural Languages
Ÿ  Easy to bring your code to the data.
Ÿ  When SQL falls short leverage your Python/R/Java/C
experience quickly.
Ÿ  Apply Python/R packages across terabytes of data with
minimal overhead or additional requirements.
Ÿ  Results are already in the database system, ready for further
analysis or storage.
@ianhuston
16© Copyright 2014 Pivotal. All rights reserved.
Text Analytics for Churn Prediction
Customer
A major telecom company
Business Problem
Reducing churn through more accurate
models
Challenges
Ÿ  Existing models only used structured
features
Ÿ  Call center memos had poor structure and
had lots of typos
Solution
Ÿ  Built sentiment analysis models to predict
churn and topic models to understand topics
of conversation in call center memos
Ÿ  Achieved 16% improvement in ROC curve
for Churn Prediction
@ianhuston
17© Copyright 2014 Pivotal. All rights reserved.
Predicting Customs Clearance Intercepts
Customer
A major courier delivery services company
Business Problem
7% of shipments into the US are intercepted
by Customs based on package description.
Challenge
Ÿ  Data Volume: Previously not able to build
models with rich features for billions of
packages.
Ÿ  Data Variety: Customer’s prior technology
did not allow them to extract features from
unstructured shipment documents.
Solution
•  Processed shipment description documents
using PL/Python. Tokenized, stemmed, removed
stop words, etc.
•  Built features to summarize the shippers’ prior
shipment and interception history
•  Built a model which predicts the propensity of a
package being intercepted (AUC =0.86)
Intercepted Not Intercepted
@ianhuston
18© Copyright 2014 Pivotal. All rights reserved.
Network User Behavior Anomaly Detection
Customer
A major global financial service enterprise
Business Problem
Detect anomalous user behaviour in the
global enterprise computer network
Challenges
10 Billions events in 6 months; 15K+
network devices; No existing SIEM solutions
can model user behavioral resource access
baseline and enable anomaly detection in an
adaptive and scalable architecture.
Solution
An innovative Graph Mining based algorithmic
framework with advanced Machine Learning.
Network topology and temporal behaviors are both
modeled. (Patent pending). Implemented in MPP
and PL/R, enabling parallel model training and
behavior risk scoring. Successfully identified DLP
violating anomalous users.
@ianhuston
19© Copyright 2014 Pivotal. All rights reserved. 19© Copyright 2013 Pivotal. All rights reserved. 19© Copyright 2013 Pivotal. All rights reserved. 19© Copyright 2013 Pivotal. All rights reserved. 19© Copyright 2014 Pivotal. All rights reserved.
MADlib
@ianhuston
20© Copyright 2014 Pivotal. All rights reserved.
MADlib: The Origin
•  First mention of MAD analytics was at VLDB 2009
MAD Skills: New Analysis Practices for Big Data
J. Hellerstein, J. Cohen, B. Dolan, M. Dunlap, C. Welton
(with help from: Noelle Sio, David Hubbard, James Marca)
http://db.cs.berkeley.edu/papers/vldb09-madskills.pdf
•  MADlib project initiated in late 2010:
Greenplum Analytics team and Prof. Joe Hellerstein
UrbanDictionary
mad (adj.): an adjective used to enhance a noun.
1- dude, you got skills.
2- dude, you got mad skills
•  Open Source!
https://github.com/madlib/madlib
•  Works on Greenplum DB,
PostgreSQL and also HAWQ &
Impala
•  Active development by Pivotal
-  Latest Release: v1.5 (Apr 2014)
•  Downloads and Docs:
http://madlib.net/
@ianhuston
21© Copyright 2014 Pivotal. All rights reserved.
MADlib In-Database
Functions
Predictive Modeling Library
Linear Systems
•  Sparse and Dense Solvers
Matrix Factorization
•  Single Value Decomposition (SVD)
•  Low-Rank
Generalized Linear Models
•  Linear Regression
•  Logistic Regression
•  Multinomial Logistic Regression
•  Cox Proportional Hazards
•  Regression
•  Elastic Net Regularization
•  Sandwich Estimators (Huber white,
clustered, marginal effects)
Machine Learning Algorithms
•  Principal Component Analysis (PCA)
•  Association Rules (Affinity Analysis, Market
Basket)
•  Topic Modeling (Parallel LDA)
•  Decision Trees
•  Ensemble Learners (Random Forests)
•  Support Vector Machines
•  Conditional Random Field (CRF)
•  Clustering (K-means)
•  Cross Validation
Descriptive Statistics
Sketch-based Estimators
•  CountMin (Cormode-
Muthukrishnan)
•  FM (Flajolet-Martin)
•  MFV (Most Frequent
Values)
Correlation
Summary
Support Modules
Array Operations
Sparse Vectors
Random Sampling
Probability Functions
@ianhuston
22© Copyright 2014 Pivotal. All rights reserved.
Get in touch
Feel free to contact me about PL/Python & PL/R, or more
generally about Data Science and opportunities available.
@ianhuston
ihuston @ gopivotal.com
http://www.ianhuston.net
BUILT FOR THE SPEED OF BUSINESS

More Related Content

What's hot

Apache Pig for Data Scientists
Apache Pig for Data ScientistsApache Pig for Data Scientists
Apache Pig for Data Scientists
DataWorks Summit
 
R tutorial
R tutorialR tutorial
R tutorial
Richard Vidgen
 
MapReduce Design Patterns
MapReduce Design PatternsMapReduce Design Patterns
MapReduce Design Patterns
Donald Miner
 
Pig programming is more fun: New features in Pig
Pig programming is more fun: New features in PigPig programming is more fun: New features in Pig
Pig programming is more fun: New features in Pig
daijy
 
inteSearch: An Intelligent Linked Data Information Access Framework
inteSearch: An Intelligent Linked Data Information Access FrameworkinteSearch: An Intelligent Linked Data Information Access Framework
inteSearch: An Intelligent Linked Data Information Access Framework
National Inistitute of Informatics (NII), Tokyo, Japann
 
Democratizing Big Semantic Data management
Democratizing Big Semantic Data managementDemocratizing Big Semantic Data management
Democratizing Big Semantic Data management
WU (Vienna University of Economics and Business)
 
Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...
Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...
Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...
Simplilearn
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)
IJERD Editor
 
A short tutorial on r
A short tutorial on rA short tutorial on r
A short tutorial on r
Ashraf Uddin
 
Python and HDF5: Overview
Python and HDF5: OverviewPython and HDF5: Overview
Python and HDF5: Overview
andrewcollette
 
ffbase, statistical functions for large datasets
ffbase, statistical functions for large datasetsffbase, statistical functions for large datasets
ffbase, statistical functions for large datasets
Edwin de Jonge
 
Python and R for quantitative finance
Python and R for quantitative financePython and R for quantitative finance
Python and R for quantitative finance
Luca Sbardella
 
R, Hadoop and Amazon Web Services
R, Hadoop and Amazon Web ServicesR, Hadoop and Amazon Web Services
R, Hadoop and Amazon Web Services
Portland R User Group
 
Twinkle: A SPARQL Query Tool
Twinkle: A SPARQL Query ToolTwinkle: A SPARQL Query Tool
Twinkle: A SPARQL Query Tool
Leigh Dodds
 
MapReduce and Its Discontents
MapReduce and Its DiscontentsMapReduce and Its Discontents
MapReduce and Its Discontents
Dean Wampler
 
SPARQL Cheat Sheet
SPARQL Cheat SheetSPARQL Cheat Sheet
SPARQL Cheat Sheet
LeeFeigenbaum
 
hadoop&zing
hadoop&zinghadoop&zing
hadoop&zing
zingopen
 
A Map of the PyData Stack
A Map of the PyData StackA Map of the PyData Stack
A Map of the PyData Stack
Peadar Coyle
 
Introduction to data analysis using R
Introduction to data analysis using RIntroduction to data analysis using R
Introduction to data analysis using R
Victoria López
 
Linked Data and Services
Linked Data and ServicesLinked Data and Services
Linked Data and Services
Barry Norton
 

What's hot (20)

Apache Pig for Data Scientists
Apache Pig for Data ScientistsApache Pig for Data Scientists
Apache Pig for Data Scientists
 
R tutorial
R tutorialR tutorial
R tutorial
 
MapReduce Design Patterns
MapReduce Design PatternsMapReduce Design Patterns
MapReduce Design Patterns
 
Pig programming is more fun: New features in Pig
Pig programming is more fun: New features in PigPig programming is more fun: New features in Pig
Pig programming is more fun: New features in Pig
 
inteSearch: An Intelligent Linked Data Information Access Framework
inteSearch: An Intelligent Linked Data Information Access FrameworkinteSearch: An Intelligent Linked Data Information Access Framework
inteSearch: An Intelligent Linked Data Information Access Framework
 
Democratizing Big Semantic Data management
Democratizing Big Semantic Data managementDemocratizing Big Semantic Data management
Democratizing Big Semantic Data management
 
Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...
Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...
Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)
 
A short tutorial on r
A short tutorial on rA short tutorial on r
A short tutorial on r
 
Python and HDF5: Overview
Python and HDF5: OverviewPython and HDF5: Overview
Python and HDF5: Overview
 
ffbase, statistical functions for large datasets
ffbase, statistical functions for large datasetsffbase, statistical functions for large datasets
ffbase, statistical functions for large datasets
 
Python and R for quantitative finance
Python and R for quantitative financePython and R for quantitative finance
Python and R for quantitative finance
 
R, Hadoop and Amazon Web Services
R, Hadoop and Amazon Web ServicesR, Hadoop and Amazon Web Services
R, Hadoop and Amazon Web Services
 
Twinkle: A SPARQL Query Tool
Twinkle: A SPARQL Query ToolTwinkle: A SPARQL Query Tool
Twinkle: A SPARQL Query Tool
 
MapReduce and Its Discontents
MapReduce and Its DiscontentsMapReduce and Its Discontents
MapReduce and Its Discontents
 
SPARQL Cheat Sheet
SPARQL Cheat SheetSPARQL Cheat Sheet
SPARQL Cheat Sheet
 
hadoop&zing
hadoop&zinghadoop&zing
hadoop&zing
 
A Map of the PyData Stack
A Map of the PyData StackA Map of the PyData Stack
A Map of the PyData Stack
 
Introduction to data analysis using R
Introduction to data analysis using RIntroduction to data analysis using R
Introduction to data analysis using R
 
Linked Data and Services
Linked Data and ServicesLinked Data and Services
Linked Data and Services
 

Similar to Data Science Amsterdam - Massively Parallel Processing with Procedural Languages

Massively Parallel Process with Prodedural Python by Ian Huston
Massively Parallel Process with Prodedural Python by Ian HustonMassively Parallel Process with Prodedural Python by Ian Huston
Massively Parallel Process with Prodedural Python by Ian Huston
PyData
 
Massively Parallel Processing with Procedural Python (PyData London 2014)
Massively Parallel Processing with Procedural Python (PyData London 2014)Massively Parallel Processing with Procedural Python (PyData London 2014)
Massively Parallel Processing with Procedural Python (PyData London 2014)
Ian Huston
 
Massively Parallel Processing with Procedural Python by Ronert Obst PyData Be...
Massively Parallel Processing with Procedural Python by Ronert Obst PyData Be...Massively Parallel Processing with Procedural Python by Ronert Obst PyData Be...
Massively Parallel Processing with Procedural Python by Ronert Obst PyData Be...
PyData
 
A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...
A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...
A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...
Srivatsan Ramanujam
 
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
Srivatsan Ramanujam
 
Docopt, beautiful command-line options for R, user2014
Docopt, beautiful command-line options for R,  user2014Docopt, beautiful command-line options for R,  user2014
Docopt, beautiful command-line options for R, user2014
Edwin de Jonge
 
Enabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data CitizenEnabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data Citizen
Wes McKinney
 
Data Science at Scale on MPP databases - Use Cases & Open Source Tools
Data Science at Scale on MPP databases - Use Cases & Open Source ToolsData Science at Scale on MPP databases - Use Cases & Open Source Tools
Data Science at Scale on MPP databases - Use Cases & Open Source Tools
Esther Vasiete
 
Hadoop 2.0 - Solving the Data Quality Challenge
Hadoop 2.0 - Solving the Data Quality ChallengeHadoop 2.0 - Solving the Data Quality Challenge
Hadoop 2.0 - Solving the Data Quality Challenge
Inside Analysis
 
Toolboxes for data scientists
Toolboxes for data scientistsToolboxes for data scientists
Toolboxes for data scientists
Sudipto Krishna Dutta
 
Big data Big Analytics
Big data Big AnalyticsBig data Big Analytics
Big data Big Analytics
Ajay Ohri
 
Db tech show - hivemall
Db tech show - hivemallDb tech show - hivemall
Db tech show - hivemall
Makoto Yui
 
biopython, doctest and makefiles
biopython, doctest and makefilesbiopython, doctest and makefiles
biopython, doctest and makefiles
Giovanni Marco Dall'Olio
 
Big Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-onBig Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-on
Dony Riyanto
 
R_L1-Aug-2022.pptx
R_L1-Aug-2022.pptxR_L1-Aug-2022.pptx
R_L1-Aug-2022.pptx
ShantilalBhayal1
 
BigData_Krishna Kumar Sharma
BigData_Krishna Kumar SharmaBigData_Krishna Kumar Sharma
BigData_Krishna Kumar Sharma
Krishna Kumar Sharma
 
Talk about Hivemall at Data Scientist Organization on 2015/09/17
Talk about Hivemall at Data Scientist Organization on 2015/09/17Talk about Hivemall at Data Scientist Organization on 2015/09/17
Talk about Hivemall at Data Scientist Organization on 2015/09/17
Makoto Yui
 
Running Cognos on Hadoop
Running Cognos on HadoopRunning Cognos on Hadoop
Running Cognos on Hadoop
Senturus
 
Sql saturday pig session (wes floyd) v2
Sql saturday   pig session (wes floyd) v2Sql saturday   pig session (wes floyd) v2
Sql saturday pig session (wes floyd) v2
Wes Floyd
 
Hadoop for Data Science: Moving from BI dashboards to R models, using Hive st...
Hadoop for Data Science: Moving from BI dashboards to R models, using Hive st...Hadoop for Data Science: Moving from BI dashboards to R models, using Hive st...
Hadoop for Data Science: Moving from BI dashboards to R models, using Hive st...
huguk
 

Similar to Data Science Amsterdam - Massively Parallel Processing with Procedural Languages (20)

Massively Parallel Process with Prodedural Python by Ian Huston
Massively Parallel Process with Prodedural Python by Ian HustonMassively Parallel Process with Prodedural Python by Ian Huston
Massively Parallel Process with Prodedural Python by Ian Huston
 
Massively Parallel Processing with Procedural Python (PyData London 2014)
Massively Parallel Processing with Procedural Python (PyData London 2014)Massively Parallel Processing with Procedural Python (PyData London 2014)
Massively Parallel Processing with Procedural Python (PyData London 2014)
 
Massively Parallel Processing with Procedural Python by Ronert Obst PyData Be...
Massively Parallel Processing with Procedural Python by Ronert Obst PyData Be...Massively Parallel Processing with Procedural Python by Ronert Obst PyData Be...
Massively Parallel Processing with Procedural Python by Ronert Obst PyData Be...
 
A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...
A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...
A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...
 
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
 
Docopt, beautiful command-line options for R, user2014
Docopt, beautiful command-line options for R,  user2014Docopt, beautiful command-line options for R,  user2014
Docopt, beautiful command-line options for R, user2014
 
Enabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data CitizenEnabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data Citizen
 
Data Science at Scale on MPP databases - Use Cases & Open Source Tools
Data Science at Scale on MPP databases - Use Cases & Open Source ToolsData Science at Scale on MPP databases - Use Cases & Open Source Tools
Data Science at Scale on MPP databases - Use Cases & Open Source Tools
 
Hadoop 2.0 - Solving the Data Quality Challenge
Hadoop 2.0 - Solving the Data Quality ChallengeHadoop 2.0 - Solving the Data Quality Challenge
Hadoop 2.0 - Solving the Data Quality Challenge
 
Toolboxes for data scientists
Toolboxes for data scientistsToolboxes for data scientists
Toolboxes for data scientists
 
Big data Big Analytics
Big data Big AnalyticsBig data Big Analytics
Big data Big Analytics
 
Db tech show - hivemall
Db tech show - hivemallDb tech show - hivemall
Db tech show - hivemall
 
biopython, doctest and makefiles
biopython, doctest and makefilesbiopython, doctest and makefiles
biopython, doctest and makefiles
 
Big Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-onBig Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-on
 
R_L1-Aug-2022.pptx
R_L1-Aug-2022.pptxR_L1-Aug-2022.pptx
R_L1-Aug-2022.pptx
 
BigData_Krishna Kumar Sharma
BigData_Krishna Kumar SharmaBigData_Krishna Kumar Sharma
BigData_Krishna Kumar Sharma
 
Talk about Hivemall at Data Scientist Organization on 2015/09/17
Talk about Hivemall at Data Scientist Organization on 2015/09/17Talk about Hivemall at Data Scientist Organization on 2015/09/17
Talk about Hivemall at Data Scientist Organization on 2015/09/17
 
Running Cognos on Hadoop
Running Cognos on HadoopRunning Cognos on Hadoop
Running Cognos on Hadoop
 
Sql saturday pig session (wes floyd) v2
Sql saturday   pig session (wes floyd) v2Sql saturday   pig session (wes floyd) v2
Sql saturday pig session (wes floyd) v2
 
Hadoop for Data Science: Moving from BI dashboards to R models, using Hive st...
Hadoop for Data Science: Moving from BI dashboards to R models, using Hive st...Hadoop for Data Science: Moving from BI dashboards to R models, using Hive st...
Hadoop for Data Science: Moving from BI dashboards to R models, using Hive st...
 

More from Ian Huston

CFSummit: Data Science on Cloud Foundry
CFSummit: Data Science on Cloud FoundryCFSummit: Data Science on Cloud Foundry
CFSummit: Data Science on Cloud Foundry
Ian Huston
 
Cloud Foundry for Data Science
Cloud Foundry for Data ScienceCloud Foundry for Data Science
Cloud Foundry for Data Science
Ian Huston
 
Python on Cloud Foundry
Python on Cloud FoundryPython on Cloud Foundry
Python on Cloud Foundry
Ian Huston
 
Driving the Future of Smart Cities - How to Beat the Traffic (Pivotal talk at...
Driving the Future of Smart Cities - How to Beat the Traffic (Pivotal talk at...Driving the Future of Smart Cities - How to Beat the Traffic (Pivotal talk at...
Driving the Future of Smart Cities - How to Beat the Traffic (Pivotal talk at...
Ian Huston
 
Calculating Non-adiabatic Pressure Perturbations during Multi-field Inflation
Calculating Non-adiabatic Pressure Perturbations during Multi-field InflationCalculating Non-adiabatic Pressure Perturbations during Multi-field Inflation
Calculating Non-adiabatic Pressure Perturbations during Multi-field Inflation
Ian Huston
 
Second Order Perturbations - National Astronomy Meeting 2011
Second Order Perturbations - National Astronomy Meeting 2011Second Order Perturbations - National Astronomy Meeting 2011
Second Order Perturbations - National Astronomy Meeting 2011
Ian Huston
 
Second Order Perturbations During Inflation Beyond Slow-roll
Second Order Perturbations During Inflation Beyond Slow-rollSecond Order Perturbations During Inflation Beyond Slow-roll
Second Order Perturbations During Inflation Beyond Slow-roll
Ian Huston
 
Inflation as a solution to the problems of the Big Bang
Inflation as a solution to the problems of the Big BangInflation as a solution to the problems of the Big Bang
Inflation as a solution to the problems of the Big Bang
Ian Huston
 
Cosmological Perturbations and Numerical Simulations
Cosmological Perturbations and Numerical SimulationsCosmological Perturbations and Numerical Simulations
Cosmological Perturbations and Numerical Simulations
Ian Huston
 
Cosmo09 presentation
Cosmo09 presentationCosmo09 presentation
Cosmo09 presentation
Ian Huston
 

More from Ian Huston (10)

CFSummit: Data Science on Cloud Foundry
CFSummit: Data Science on Cloud FoundryCFSummit: Data Science on Cloud Foundry
CFSummit: Data Science on Cloud Foundry
 
Cloud Foundry for Data Science
Cloud Foundry for Data ScienceCloud Foundry for Data Science
Cloud Foundry for Data Science
 
Python on Cloud Foundry
Python on Cloud FoundryPython on Cloud Foundry
Python on Cloud Foundry
 
Driving the Future of Smart Cities - How to Beat the Traffic (Pivotal talk at...
Driving the Future of Smart Cities - How to Beat the Traffic (Pivotal talk at...Driving the Future of Smart Cities - How to Beat the Traffic (Pivotal talk at...
Driving the Future of Smart Cities - How to Beat the Traffic (Pivotal talk at...
 
Calculating Non-adiabatic Pressure Perturbations during Multi-field Inflation
Calculating Non-adiabatic Pressure Perturbations during Multi-field InflationCalculating Non-adiabatic Pressure Perturbations during Multi-field Inflation
Calculating Non-adiabatic Pressure Perturbations during Multi-field Inflation
 
Second Order Perturbations - National Astronomy Meeting 2011
Second Order Perturbations - National Astronomy Meeting 2011Second Order Perturbations - National Astronomy Meeting 2011
Second Order Perturbations - National Astronomy Meeting 2011
 
Second Order Perturbations During Inflation Beyond Slow-roll
Second Order Perturbations During Inflation Beyond Slow-rollSecond Order Perturbations During Inflation Beyond Slow-roll
Second Order Perturbations During Inflation Beyond Slow-roll
 
Inflation as a solution to the problems of the Big Bang
Inflation as a solution to the problems of the Big BangInflation as a solution to the problems of the Big Bang
Inflation as a solution to the problems of the Big Bang
 
Cosmological Perturbations and Numerical Simulations
Cosmological Perturbations and Numerical SimulationsCosmological Perturbations and Numerical Simulations
Cosmological Perturbations and Numerical Simulations
 
Cosmo09 presentation
Cosmo09 presentationCosmo09 presentation
Cosmo09 presentation
 

Recently uploaded

Feature sql server terbaru performance.pptx
Feature sql server terbaru performance.pptxFeature sql server terbaru performance.pptx
Feature sql server terbaru performance.pptx
ssuser1915fe1
 
kk vathada _digital transformation frameworks_2024.pdf
kk vathada _digital transformation frameworks_2024.pdfkk vathada _digital transformation frameworks_2024.pdf
kk vathada _digital transformation frameworks_2024.pdf
KIRAN KV
 
Generative AI Reasoning Tech Talk - July 2024
Generative AI Reasoning Tech Talk - July 2024Generative AI Reasoning Tech Talk - July 2024
Generative AI Reasoning Tech Talk - July 2024
siddu769252
 
Computer HARDWARE presenattion by CWD students class 10
Computer HARDWARE presenattion by CWD students class 10Computer HARDWARE presenattion by CWD students class 10
Computer HARDWARE presenattion by CWD students class 10
ankush9927
 
BLOCKCHAIN TECHNOLOGY - Advantages and Disadvantages
BLOCKCHAIN TECHNOLOGY - Advantages and DisadvantagesBLOCKCHAIN TECHNOLOGY - Advantages and Disadvantages
BLOCKCHAIN TECHNOLOGY - Advantages and Disadvantages
SAI KAILASH R
 
Types of Weaving loom machine & it's technology
Types of Weaving loom machine & it's technologyTypes of Weaving loom machine & it's technology
Types of Weaving loom machine & it's technology
ldtexsolbl
 
Integrating Kafka with MuleSoft 4 and usecase
Integrating Kafka with MuleSoft 4 and usecaseIntegrating Kafka with MuleSoft 4 and usecase
Integrating Kafka with MuleSoft 4 and usecase
shyamraj55
 
EuroPython 2024 - Streamlining Testing in a Large Python Codebase
EuroPython 2024 - Streamlining Testing in a Large Python CodebaseEuroPython 2024 - Streamlining Testing in a Large Python Codebase
EuroPython 2024 - Streamlining Testing in a Large Python Codebase
Jimmy Lai
 
Mule Experience Hub and Release Channel with Java 17
Mule Experience Hub and Release Channel with Java 17Mule Experience Hub and Release Channel with Java 17
Mule Experience Hub and Release Channel with Java 17
Bhajan Mehta
 
Redefining Cybersecurity with AI Capabilities
Redefining Cybersecurity with AI CapabilitiesRedefining Cybersecurity with AI Capabilities
Redefining Cybersecurity with AI Capabilities
Priyanka Aash
 
UX Webinar Series: Drive Revenue and Decrease Costs with Passkeys for Consume...
UX Webinar Series: Drive Revenue and Decrease Costs with Passkeys for Consume...UX Webinar Series: Drive Revenue and Decrease Costs with Passkeys for Consume...
UX Webinar Series: Drive Revenue and Decrease Costs with Passkeys for Consume...
FIDO Alliance
 
Improving Learning Content Efficiency with Reusable Learning Content
Improving Learning Content Efficiency with Reusable Learning ContentImproving Learning Content Efficiency with Reusable Learning Content
Improving Learning Content Efficiency with Reusable Learning Content
Enterprise Knowledge
 
Russian Girls Call Navi Mumbai 🎈🔥9920725232 🔥💋🎈 Provide Best And Top Girl Ser...
Russian Girls Call Navi Mumbai 🎈🔥9920725232 🔥💋🎈 Provide Best And Top Girl Ser...Russian Girls Call Navi Mumbai 🎈🔥9920725232 🔥💋🎈 Provide Best And Top Girl Ser...
Russian Girls Call Navi Mumbai 🎈🔥9920725232 🔥💋🎈 Provide Best And Top Girl Ser...
bellared2
 
Use Cases & Benefits of RPA in Manufacturing in 2024.pptx
Use Cases & Benefits of RPA in Manufacturing in 2024.pptxUse Cases & Benefits of RPA in Manufacturing in 2024.pptx
Use Cases & Benefits of RPA in Manufacturing in 2024.pptx
SynapseIndia
 
Communications Mining Series - Zero to Hero - Session 3
Communications Mining Series - Zero to Hero - Session 3Communications Mining Series - Zero to Hero - Session 3
Communications Mining Series - Zero to Hero - Session 3
DianaGray10
 
Girls Call Churchgate 9910780858 Provide Best And Top Girl Service And No1 in...
Girls Call Churchgate 9910780858 Provide Best And Top Girl Service And No1 in...Girls Call Churchgate 9910780858 Provide Best And Top Girl Service And No1 in...
Girls Call Churchgate 9910780858 Provide Best And Top Girl Service And No1 in...
maigasapphire
 
Acumatica vs. Sage Intacct _Construction_July (1).pptx
Acumatica vs. Sage Intacct _Construction_July (1).pptxAcumatica vs. Sage Intacct _Construction_July (1).pptx
Acumatica vs. Sage Intacct _Construction_July (1).pptx
BrainSell Technologies
 
LeadMagnet IQ Review: Unlock the Secret to Effortless Traffic and Leads.pdf
LeadMagnet IQ Review:  Unlock the Secret to Effortless Traffic and Leads.pdfLeadMagnet IQ Review:  Unlock the Secret to Effortless Traffic and Leads.pdf
LeadMagnet IQ Review: Unlock the Secret to Effortless Traffic and Leads.pdf
SelfMade bd
 
Girls call Kolkata 👀 XXXXXXXXXXX 👀 Rs.9.5 K Cash Payment With Room Delivery
Girls call Kolkata 👀 XXXXXXXXXXX 👀 Rs.9.5 K Cash Payment With Room Delivery Girls call Kolkata 👀 XXXXXXXXXXX 👀 Rs.9.5 K Cash Payment With Room Delivery
Girls call Kolkata 👀 XXXXXXXXXXX 👀 Rs.9.5 K Cash Payment With Room Delivery
sunilverma7884
 
Vertex AI Agent Builder - GDG Alicante - Julio 2024
Vertex AI Agent Builder - GDG Alicante - Julio 2024Vertex AI Agent Builder - GDG Alicante - Julio 2024
Vertex AI Agent Builder - GDG Alicante - Julio 2024
Nicolás Lopéz
 

Recently uploaded (20)

Feature sql server terbaru performance.pptx
Feature sql server terbaru performance.pptxFeature sql server terbaru performance.pptx
Feature sql server terbaru performance.pptx
 
kk vathada _digital transformation frameworks_2024.pdf
kk vathada _digital transformation frameworks_2024.pdfkk vathada _digital transformation frameworks_2024.pdf
kk vathada _digital transformation frameworks_2024.pdf
 
Generative AI Reasoning Tech Talk - July 2024
Generative AI Reasoning Tech Talk - July 2024Generative AI Reasoning Tech Talk - July 2024
Generative AI Reasoning Tech Talk - July 2024
 
Computer HARDWARE presenattion by CWD students class 10
Computer HARDWARE presenattion by CWD students class 10Computer HARDWARE presenattion by CWD students class 10
Computer HARDWARE presenattion by CWD students class 10
 
BLOCKCHAIN TECHNOLOGY - Advantages and Disadvantages
BLOCKCHAIN TECHNOLOGY - Advantages and DisadvantagesBLOCKCHAIN TECHNOLOGY - Advantages and Disadvantages
BLOCKCHAIN TECHNOLOGY - Advantages and Disadvantages
 
Types of Weaving loom machine & it's technology
Types of Weaving loom machine & it's technologyTypes of Weaving loom machine & it's technology
Types of Weaving loom machine & it's technology
 
Integrating Kafka with MuleSoft 4 and usecase
Integrating Kafka with MuleSoft 4 and usecaseIntegrating Kafka with MuleSoft 4 and usecase
Integrating Kafka with MuleSoft 4 and usecase
 
EuroPython 2024 - Streamlining Testing in a Large Python Codebase
EuroPython 2024 - Streamlining Testing in a Large Python CodebaseEuroPython 2024 - Streamlining Testing in a Large Python Codebase
EuroPython 2024 - Streamlining Testing in a Large Python Codebase
 
Mule Experience Hub and Release Channel with Java 17
Mule Experience Hub and Release Channel with Java 17Mule Experience Hub and Release Channel with Java 17
Mule Experience Hub and Release Channel with Java 17
 
Redefining Cybersecurity with AI Capabilities
Redefining Cybersecurity with AI CapabilitiesRedefining Cybersecurity with AI Capabilities
Redefining Cybersecurity with AI Capabilities
 
UX Webinar Series: Drive Revenue and Decrease Costs with Passkeys for Consume...
UX Webinar Series: Drive Revenue and Decrease Costs with Passkeys for Consume...UX Webinar Series: Drive Revenue and Decrease Costs with Passkeys for Consume...
UX Webinar Series: Drive Revenue and Decrease Costs with Passkeys for Consume...
 
Improving Learning Content Efficiency with Reusable Learning Content
Improving Learning Content Efficiency with Reusable Learning ContentImproving Learning Content Efficiency with Reusable Learning Content
Improving Learning Content Efficiency with Reusable Learning Content
 
Russian Girls Call Navi Mumbai 🎈🔥9920725232 🔥💋🎈 Provide Best And Top Girl Ser...
Russian Girls Call Navi Mumbai 🎈🔥9920725232 🔥💋🎈 Provide Best And Top Girl Ser...Russian Girls Call Navi Mumbai 🎈🔥9920725232 🔥💋🎈 Provide Best And Top Girl Ser...
Russian Girls Call Navi Mumbai 🎈🔥9920725232 🔥💋🎈 Provide Best And Top Girl Ser...
 
Use Cases & Benefits of RPA in Manufacturing in 2024.pptx
Use Cases & Benefits of RPA in Manufacturing in 2024.pptxUse Cases & Benefits of RPA in Manufacturing in 2024.pptx
Use Cases & Benefits of RPA in Manufacturing in 2024.pptx
 
Communications Mining Series - Zero to Hero - Session 3
Communications Mining Series - Zero to Hero - Session 3Communications Mining Series - Zero to Hero - Session 3
Communications Mining Series - Zero to Hero - Session 3
 
Girls Call Churchgate 9910780858 Provide Best And Top Girl Service And No1 in...
Girls Call Churchgate 9910780858 Provide Best And Top Girl Service And No1 in...Girls Call Churchgate 9910780858 Provide Best And Top Girl Service And No1 in...
Girls Call Churchgate 9910780858 Provide Best And Top Girl Service And No1 in...
 
Acumatica vs. Sage Intacct _Construction_July (1).pptx
Acumatica vs. Sage Intacct _Construction_July (1).pptxAcumatica vs. Sage Intacct _Construction_July (1).pptx
Acumatica vs. Sage Intacct _Construction_July (1).pptx
 
LeadMagnet IQ Review: Unlock the Secret to Effortless Traffic and Leads.pdf
LeadMagnet IQ Review:  Unlock the Secret to Effortless Traffic and Leads.pdfLeadMagnet IQ Review:  Unlock the Secret to Effortless Traffic and Leads.pdf
LeadMagnet IQ Review: Unlock the Secret to Effortless Traffic and Leads.pdf
 
Girls call Kolkata 👀 XXXXXXXXXXX 👀 Rs.9.5 K Cash Payment With Room Delivery
Girls call Kolkata 👀 XXXXXXXXXXX 👀 Rs.9.5 K Cash Payment With Room Delivery Girls call Kolkata 👀 XXXXXXXXXXX 👀 Rs.9.5 K Cash Payment With Room Delivery
Girls call Kolkata 👀 XXXXXXXXXXX 👀 Rs.9.5 K Cash Payment With Room Delivery
 
Vertex AI Agent Builder - GDG Alicante - Julio 2024
Vertex AI Agent Builder - GDG Alicante - Julio 2024Vertex AI Agent Builder - GDG Alicante - Julio 2024
Vertex AI Agent Builder - GDG Alicante - Julio 2024
 

Data Science Amsterdam - Massively Parallel Processing with Procedural Languages

  • 1. BUILT FOR THE SPEED OF BUSINESS
  • 2. @ianhuston 2© Copyright 2014 Pivotal. All rights reserved. 2© Copyright 2013 Pivotal. All rights reserved. 2© Copyright 2013 Pivotal. All rights reserved. 2© Copyright 2013 Pivotal. All rights reserved. 2© Copyright 2014 Pivotal. All rights reserved. Massively Parallel Processing with Procedural Languages PL/R and PL/Python Ian Huston, @ianhuston Data Scientist, Pivotal
  • 3. @ianhuston 3© Copyright 2014 Pivotal. All rights reserved. Some Links for this talk Ÿ  Simple code examples: https://github.com/ihuston/plpython_examples Ÿ  IPython notebook rendered with nbviewer: http://tinyurl.com/ih-plpython Ÿ  More info (especially for PL/R): http://gopivotal.github.io/gp-r/
  • 4. @ianhuston 4© Copyright 2014 Pivotal. All rights reserved. The Pivotal Message Agile: data driven apps and rapid time to value Data Lake: store everything, analyze anything Enterprise PaaS: revolutionary software development and speed; build the right thing
  • 5. @ianhuston 5© Copyright 2014 Pivotal. All rights reserved. Open Source is Pivotal
  • 6. @ianhuston 6© Copyright 2014 Pivotal. All rights reserved. Introducing Pivotal Data Labs Our Charter: Pivotal Data Labs is Pivotal’s differentiated and highly opinionated data-centric service delivery organization. Our Goals: Expedite customer time-to-value and ROI, by driving business-aligned innovation and solutions assurance within Pivotal’s Data Fabric technologies. Drive customer adoption and autonomy across the full spectrum of Pivotal Data technologies through best-in-class data science and data engineering services, with a deep emphasis on knowledge transfer.
  • 7. @ianhuston 7© Copyright 2014 Pivotal. All rights reserved. 7© Copyright 2013 Pivotal. All rights reserved. 7© Copyright 2013 Pivotal. All rights reserved. 7© Copyright 2013 Pivotal. All rights reserved. 7© Copyright 2014 Pivotal. All rights reserved. Procedural Languages
  • 8. @ianhuston 8© Copyright 2014 Pivotal. All rights reserved. MPP Architectural Overview Think of it as multiple PostGreSQL servers Workers Master
  • 9. @ianhuston 9© Copyright 2014 Pivotal. All rights reserved. User-Defined Functions (UDFs) Ÿ  PostgreSQL/Greenplum provide lots of flexibility in defining your own functions. Ÿ  Simple UDFs are SQL queries with calling arguments and return types. Definition: CREATE  FUNCTION  times2(INT)   RETURNS  INT   AS  $$          SELECT  2  *  $1   $$  LANGUAGE  sql;   Execution: SELECT  times2(1);    times2     -­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐              2   (1  row)  
  • 10. @ianhuston 10© Copyright 2014 Pivotal. All rights reserved. Ÿ  The interpreter/VM of the language ‘X’ is installed on each node of the Greenplum Database Cluster •  Data Parallelism: -  PL/X piggybacks on Greenplum’s MPP architecture •  Allows users to write Greenplum/ PostgreSQL functions in the R/Python/ Java, Perl, pgsql or C languages Standby Master … Master Host SQL Interconnect Segment Host Segment Segment Segment Host Segment Segment Segment Host Segment Segment Segment Host Segment Segment PL/X : X in {pgsql, R, Python, Java, Perl, C etc.}
  • 11. @ianhuston 11© Copyright 2014 Pivotal. All rights reserved. Data Parallelism Ÿ  Little or no effort is required to break up the problem into a number of parallel tasks, and there exists no dependency (or communication) between those parallel tasks. Ÿ  Examples: –  Measure the height of each student in a classroom (explicitly parallelizable by student) –  MapReduce –  Individual machine learning models per customer/region/etc.
  • 12. @ianhuston 12© Copyright 2014 Pivotal. All rights reserved. Intro to PL/Python Ÿ  Procedural languages need to be installed on each database used. Ÿ  Syntax is like normal Python function inside a SQL wrapper Ÿ  Need to specify return type CREATE  FUNCTION  pymax  (a  integer,  b  integer)      RETURNS  integer   AS  $$      if  a  >  b:          return  a      return  b   $$  LANGUAGE  plpythonu;   SQL wrapper SQL wrapper Normal Python
  • 13. @ianhuston 13© Copyright 2014 Pivotal. All rights reserved. Returning Results Ÿ  Postgres primitive types (int, bigint, text, float8, double precision, date, NULL etc.) Ÿ  Composite types can be returned by creating a composite type in the database:   CREATE  TYPE  named_value  AS  (      name    text,      value    integer   );   Ÿ  Then you can return a list, tuple or dict (not sets) which reference the same structure as the table: CREATE  FUNCTION  make_pair  (name  text,  value  integer)      RETURNS  named_value   AS  $$      return  [  name,  value  ]      #  or  alternatively,  as  tuple:  return  (  name,  value  )      #  or  as  dict:  return  {  "name":  name,  "value":  value  }      #  or  as  an  object  with  attributes  .name  and  .value   $$  LANGUAGE  plpythonu;   Ÿ  For functions which return multiple rows, prefix “setof” before the return type
  • 14. @ianhuston 14© Copyright 2014 Pivotal. All rights reserved. Parallelisation Separate data into units using GROUP BY Aggregate input for function using array_agg SELECT name , plp.linregr(array_agg(x), array_agg(y)) FROM plp.test_data GROUP BY name ORDER BY name;
  • 15. @ianhuston 15© Copyright 2014 Pivotal. All rights reserved. Benefits of Procedural Languages Ÿ  Easy to bring your code to the data. Ÿ  When SQL falls short leverage your Python/R/Java/C experience quickly. Ÿ  Apply Python/R packages across terabytes of data with minimal overhead or additional requirements. Ÿ  Results are already in the database system, ready for further analysis or storage.
  • 16. @ianhuston 16© Copyright 2014 Pivotal. All rights reserved. Text Analytics for Churn Prediction Customer A major telecom company Business Problem Reducing churn through more accurate models Challenges Ÿ  Existing models only used structured features Ÿ  Call center memos had poor structure and had lots of typos Solution Ÿ  Built sentiment analysis models to predict churn and topic models to understand topics of conversation in call center memos Ÿ  Achieved 16% improvement in ROC curve for Churn Prediction
  • 17. @ianhuston 17© Copyright 2014 Pivotal. All rights reserved. Predicting Customs Clearance Intercepts Customer A major courier delivery services company Business Problem 7% of shipments into the US are intercepted by Customs based on package description. Challenge Ÿ  Data Volume: Previously not able to build models with rich features for billions of packages. Ÿ  Data Variety: Customer’s prior technology did not allow them to extract features from unstructured shipment documents. Solution •  Processed shipment description documents using PL/Python. Tokenized, stemmed, removed stop words, etc. •  Built features to summarize the shippers’ prior shipment and interception history •  Built a model which predicts the propensity of a package being intercepted (AUC =0.86) Intercepted Not Intercepted
  • 18. @ianhuston 18© Copyright 2014 Pivotal. All rights reserved. Network User Behavior Anomaly Detection Customer A major global financial service enterprise Business Problem Detect anomalous user behaviour in the global enterprise computer network Challenges 10 Billions events in 6 months; 15K+ network devices; No existing SIEM solutions can model user behavioral resource access baseline and enable anomaly detection in an adaptive and scalable architecture. Solution An innovative Graph Mining based algorithmic framework with advanced Machine Learning. Network topology and temporal behaviors are both modeled. (Patent pending). Implemented in MPP and PL/R, enabling parallel model training and behavior risk scoring. Successfully identified DLP violating anomalous users.
  • 19. @ianhuston 19© Copyright 2014 Pivotal. All rights reserved. 19© Copyright 2013 Pivotal. All rights reserved. 19© Copyright 2013 Pivotal. All rights reserved. 19© Copyright 2013 Pivotal. All rights reserved. 19© Copyright 2014 Pivotal. All rights reserved. MADlib
  • 20. @ianhuston 20© Copyright 2014 Pivotal. All rights reserved. MADlib: The Origin •  First mention of MAD analytics was at VLDB 2009 MAD Skills: New Analysis Practices for Big Data J. Hellerstein, J. Cohen, B. Dolan, M. Dunlap, C. Welton (with help from: Noelle Sio, David Hubbard, James Marca) http://db.cs.berkeley.edu/papers/vldb09-madskills.pdf •  MADlib project initiated in late 2010: Greenplum Analytics team and Prof. Joe Hellerstein UrbanDictionary mad (adj.): an adjective used to enhance a noun. 1- dude, you got skills. 2- dude, you got mad skills •  Open Source! https://github.com/madlib/madlib •  Works on Greenplum DB, PostgreSQL and also HAWQ & Impala •  Active development by Pivotal -  Latest Release: v1.5 (Apr 2014) •  Downloads and Docs: http://madlib.net/
  • 21. @ianhuston 21© Copyright 2014 Pivotal. All rights reserved. MADlib In-Database Functions Predictive Modeling Library Linear Systems •  Sparse and Dense Solvers Matrix Factorization •  Single Value Decomposition (SVD) •  Low-Rank Generalized Linear Models •  Linear Regression •  Logistic Regression •  Multinomial Logistic Regression •  Cox Proportional Hazards •  Regression •  Elastic Net Regularization •  Sandwich Estimators (Huber white, clustered, marginal effects) Machine Learning Algorithms •  Principal Component Analysis (PCA) •  Association Rules (Affinity Analysis, Market Basket) •  Topic Modeling (Parallel LDA) •  Decision Trees •  Ensemble Learners (Random Forests) •  Support Vector Machines •  Conditional Random Field (CRF) •  Clustering (K-means) •  Cross Validation Descriptive Statistics Sketch-based Estimators •  CountMin (Cormode- Muthukrishnan) •  FM (Flajolet-Martin) •  MFV (Most Frequent Values) Correlation Summary Support Modules Array Operations Sparse Vectors Random Sampling Probability Functions
  • 22. @ianhuston 22© Copyright 2014 Pivotal. All rights reserved. Get in touch Feel free to contact me about PL/Python & PL/R, or more generally about Data Science and opportunities available. @ianhuston ihuston @ gopivotal.com http://www.ianhuston.net
  • 23. BUILT FOR THE SPEED OF BUSINESS