© Copyright 2019 Pivotal Software, Inc. All rights Reserved.
Scott Hajek
@jscotthajek
A Modern Interface for Data Science on Postgres/Greenplum
Executive Summary
■ Who am I?
■ Pure SQL: not the preferred interface for sophisticated data scientists
■ Maturity and scale of SQL-based systems are what enterprise DS demands
■ Data Frames: A better abstraction that can be layered on top of SQL systems
○ Ibis: Python implementation of Data Frames for SQL and large data platforms
Who is Scott Hajek?
● Data Science consulting for 5+ years
● Senior Data Scientist for Pivotal
● Specialty in Linguistics and Natural Language Processing
● Many industries:
○ Banking, Telecom, Manufacturing
● Problems tackled:
○ Entity resolution, info extraction, optimization, anomaly detection, e-comm surveillance
Personas
Different kinds of users of a database system
● DBA
● Application developer
● Business Analyst
● Data Scientist
○ Operates on large data sets interactively
○ Uses advanced statistics and machine learning techniques
○ Sophisticated programmer
○ Appreciates good abstractions
Abstraction: Python vs R vs COBOL
● Today's popular languages have abstractions like DataFrames
● Not just about fewer lines of code
● Easier to reason about
COBOL:
IDENTIFICATION DIVISION.
PROGRAM-ID. BINSRCH1.
* The binary search reads every input record;
* after looking up the employee’s month of hire on a table
* by a sequential search, it writes it out to an output file.
ENVIRONMENT DIVISION.
INPUT-OUTPUT SECTION.
FILE-CONTROL.
...
R:
which(sapply(dataframe, function(x) any(month == "January")))
Python:
dataframe[dataframe.month == "January"]
Source: Rob Story, bit.ly/python-vs-R-vs-cobol
Good Abstractions in Math
Matrix notation makes multiple linear regression digestible
● Simple linear regression
● Multiple regression without matrix notation
● Multiple regression with matrix notation
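The slide presents these three cases as equations. A sketch in standard textbook notation follows; the symbols are assumed here and are not necessarily the ones used on the slide:

```latex
% Simple linear regression: one predictor
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i

% Multiple regression without matrix notation:
% p predictors, one equation per observation i = 1, \dots, n
y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \varepsilon_i

% Multiple regression with matrix notation:
% X is n \times (p+1), with a leading column of ones for the intercept
\mathbf{y} = X \boldsymbol{\beta} + \boldsymbol{\varepsilon}
```

The matrix form states all n equations at once: the same model, but far easier to read and manipulate.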
Good Abstractions for Tabular Data
SQL
● Pros
○ Well-defined standards, very familiar in enterprise
○ Declarative language
○ Analytic operations available
○ Abstractions for tables, columns, windows
● Cons
○ Verbose
○ Difficult to compose complex queries and transformations by hand
○ Difficult to represent subqueries and intermediate result sets
Good Abstractions for Tabular Data
Data Frames (df)
● Tabular data structures with named columns
● Different types allowed in different columns
● Easy to select subset of columns
● Analytic operations available:
○ Arithmetic, joins, maps, filters, aggregate & window functions
● Popular with Data Scientists:
○ Easily hooks into programming languages
○ Flexible for interactive data exploration
○ Represent sub-queries as variables → clear data flow through pipelines
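A minimal sketch of that last point, assuming pandas is installed. The table contents and the variable names (`sales`, `january`, `by_region`) are made up for illustration:

```python
import pandas as pd

# A small data frame: named columns, with different types per column.
sales = pd.DataFrame({
    "region": ["east", "east", "west", "west"],
    "month": ["January", "January", "January", "February"],
    "amount": [10.0, 5.0, 7.0, 3.0],
})

# Each intermediate result (what SQL would express as a subquery)
# is an ordinary variable that can be inspected or reused.
january = sales[sales.month == "January"]               # filter rows
by_region = january.groupby("region")["amount"].sum()   # aggregate per region

print(by_region)
```

Because `january` is a named value, the data flow reads top to bottom instead of inside-out, as it would with nested subqueries.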
Good Abstractions in Data Science Packages
Abstractions for training
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()
clf.fit(X, y)  # X: feature matrix, y: target labels
Good Abstractions in Data Science Packages
Abstractions for predicting
clf.predict(new_data)  # predicted labels for previously unseen rows
Challenges with Writing Complex SQL Directly
App developers use frameworks to avoid writing SQL directly
● Tedious, error prone
● Insecure (SQL injection)
● Significant rewrite needed if you switch database flavors, especially if you rely on advanced database features
● Instead they use Object-Relational Mappers (ORMs) to generate SQL
○ e.g. Spring or ActiveRecord from Ruby on Rails
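The injection risk can be sketched with Python's built-in sqlite3 module. The `employees` table and the hostile `user_input` value are hypothetical, chosen only to show the difference between splicing a value into SQL text and binding it as a parameter:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, month TEXT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("Ada", "January"), ("Grace", "March")],
)

# Hypothetical hostile input: the quote characters break out of the literal.
user_input = "January' OR '1'='1"

# Unsafe: the value is spliced into the SQL text and changes the query logic.
unsafe_sql = f"SELECT name FROM employees WHERE month = '{user_input}'"
unsafe_rows = conn.execute(unsafe_sql).fetchall()

# Safer: the value is bound as a parameter and treated purely as data.
safe_rows = conn.execute(
    "SELECT name FROM employees WHERE month = ?", (user_input,)
).fetchall()

print(len(unsafe_rows), len(safe_rows))
```

The hand-built query returns every row, while the parameterized one matches nothing. Frameworks that generate SQL typically bind values this way on the developer's behalf.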
Reconsidering SQL-based Platforms
Relational DB Management Systems have a lot to offer
● Stability
● SQL is the most common and familiar language in the enterprise
● Analytical capabilities
● MPP variants offer massive scale in storage and processing
What is Ibis?
“A [Python] pandas-like deferred expression system, with
first-class SQL support”
● Pandas: Python package with DataFrame abstraction, staple for data scientists
● Same code can work on multiple data platforms
● Deferred expression → lazy evaluation
○ Define complex pipeline of transformations, represented as an object
○ Can inspect properties of the end result without evaluating
○ Allows type/error checking client side before sending job to server/cluster
○ Make bad code fail fast!
○ Gives query/execution optimizer the full picture → better plans
● Developed by Wes McKinney, Phillip Cloud, and community
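To make deferred evaluation concrete, here is a toy sketch using only Python's standard library and SQLite. It is not how Ibis is implemented and uses none of Ibis's API: the `Expr` class, its methods, and the `sales` table are all hypothetical. It only illustrates the idea of building a pipeline as an object, checking it client-side, and running it in one final step:

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class Expr:
    """Toy deferred expression (hypothetical, not Ibis): SQL text plus the
    schema of its result. Building one never touches the database."""
    sql: str
    params: tuple
    schema: dict  # column name -> SQL type name

    def _require(self, column):
        # Client-side check: fail before anything is sent to the server.
        if column not in self.schema:
            raise KeyError(f"unknown column: {column!r}")

    def where_equals(self, column, value):
        self._require(column)
        sql = f"SELECT * FROM ({self.sql}) WHERE {column} = ?"
        return Expr(sql, self.params + (value,), self.schema)

    def sum_by(self, key, column):
        self._require(key)
        self._require(column)
        sql = (f"SELECT {key}, SUM({column}) AS total "
               f"FROM ({self.sql}) GROUP BY {key}")
        schema = {key: self.schema[key], "total": self.schema[column]}
        return Expr(sql, self.params, schema)

    def execute(self, conn):
        # The only place a query is actually run.
        return conn.execute(self.sql, self.params).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, month TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("east", "January", 10.0), ("east", "January", 5.0),
    ("west", "January", 7.0), ("west", "February", 3.0),
])

sales = Expr("SELECT * FROM sales", (),
             {"region": "TEXT", "month": "TEXT", "amount": "REAL"})

# Build the whole pipeline first; nothing has been executed yet.
expr = sales.where_equals("month", "January").sum_by("region", "amount")
print(expr.schema)  # result columns are known before evaluation

# A misspelled column fails immediately, with no round trip to the database.
try:
    sales.where_equals("mnth", "January")
    failed_fast = False
except KeyError:
    failed_fast = True

rows = sorted(expr.execute(conn))  # evaluation happens only here
print(rows)
```

Because the database receives the full pipeline as a single query, its planner can optimize the whole thing rather than one step at a time.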
Ibis in Action
● Establish a connection object
● Create an object that refers to a table
● Table object contains information about the schema
Ibis in Action
Columns and aggregation
● Column selection looks like pandas
● Methods for defining aggregation and computation (e.g. sum)
● Computation deferred until final step when execution is called
Ibis in Action
Joins
● Define join in object-oriented fashion
● Resulting columns and types are known before actual evaluation
Making Ibis More Versatile
To cover the full range of DS tasks in Postgres/Greenplum, Ibis needs some further development
● Ability to create and use user-defined functions (UDFs)
● Ability to create a table and save the results to it
● Data science modeling abstractions that use Ibis table objects as input
Let’s round out Ibis and give Postgres/Greenplum a modern data science interface
References
● Ibis project: docs.ibis-project.org
● Greenplum: greenplum.org
● pandas: pandas.pydata.org
● scikit-learn: scikit-learn.org
● Pivotal Data Science: pivotal.io/pivotal-data-science

A Modern Interface for Data Science on Postgres/Greenplum - Greenplum Summit 2019
