SlideShare a Scribd company logo
© Copyright 2019 Pivotal Software, Inc. All rights Reserved.
Scott Hajek
@jscotthajek
A Modern Interface for Data
Science on Postgres/Greenplum
Cover w/ Image
Executive Summary
■ Who am I?
■ Pure SQL: not the preferred interface
for sophisticated data scientists
■ Maturity and scale of SQL-based
systems are what enterprise DS
demands
■ Data Frames: A better abstraction that
can be layered on top of SQL systems
○ Ibis: Python implementation of
Data Frames for SQL and large
data platforms
Who is Scott Hajek?
● Data Science consulting for 5+ years
● Senior Data Scientist for Pivotal
● Specialty in Linguistics and Natural
Language Processing
● Many industries:
○ Banking, Telecom, Manufacturing
● Problems tackled
○ Entity resolution, info extraction,
optimization, anomaly detection,
e-comm surveillance
Personas
● DBA
● Application developer
● Business Analyst
● Data Scientist
○ Operates on large data sets interactively
○ Uses advanced statistics and machine
learning techniques
○ Sophisticated programmer
○ Appreciates good abstractions
Different kinds of users of a
database system
COBOL:
IDENTIFICATION DIVISION.
PROGRAM-ID. BINSRCH1.
The binary search
reads every input record
after looking up the employee’s month of hire on a
table,
by a sequential search, it writes it out to an
output file
ENVIRONMENT DIVISION.
INPUT-OUTPUT SECTION.
FILE-CONTROL.
...
Python vs R
Abstraction
● Today popular
languages have
abstractions like
DataFrames
● Not just about fewer
lines of code
● Easier to reason about
Rob Story, bit.ly/python-vs-R-vs-cobol
vs COBOL
which(sapply(dataframe,
function(x) any(month == "January")))
dataframe[dataframe.month == "January"]
Good Abstractions in Math
Matrix notation makes multiple linear regression digestible
● Simple linear regression
● Multiple regression
without matrix
notation
● Multiple regression
with matrix notation
Good Abstractions for Tabular Data
SQL
● Pros
○ Well-defined standards, very familiar in enterprise
○ Declarative language
○ Analytic operations available
○ Abstractions for tables, columns, windows
● Cons
○ Verbose
○ Difficult to compose complex queries and
transformations by hand
○ Difficult to represent subqueries and intermediate
result sets
PL /
Good Abstractions for Tabular Data
Data Frames (df)
● Tabular data structures with named columns
● Different types allowed in different columns
● Easy to select subset of columns
● Analytic operations available:
○ Arithmetic, joins, maps, filters, aggregate & window
functions
● Popular with Data Scientists:
○ Easily hooks into programming languages
○ Flexible for interactive data exploration
○ Represent sub-queries as variables → clear data
flow through pipelines
Good Abstractions in Data Science Packages
Model
Tra mo
from sklearn.ensemble import
RandomForestClassifier
clf = RandomForestClassifier()
clf.fit(X, y)
10
Abstractions for training
Good Abstractions in Data Science Packages
Pre t al
clf.predict(new_data)
Regression
11
Model
Abstractions for predicting
Challenges with Writing Complex SQL Directly
App developers use frameworks to avoid writing SQL directly
● Tedious, error prone
● Insecure (SQL injection)
● Significant rewrite needed if you switch the flavor of database, especially if you’re
utilizing advanced features of databases.
● Instead they use Object-Relational Mapping (ORMs) to generate SQL
○ Spring or ActiveRecord from Ruby on Rails
Reconsidering SQL-based Platforms
Relational DB Management Systems have a lot to offer
● Stability
● SQL is the most common and familiar language in the enterprise
● Analytical capabilities
● MPP variants offer massive scale in storage and processing
What is Ibis?
“A [Python] pandas-like deferred expression system, with
first-class SQL support”
● Pandas: Python package with DataFrame abstraction, staple for data scientists
● Same code can work on multiple data platforms
● Deferred expression → lazy evaluation
○ Define complex pipeline of transformations, represented as an object
○ Can inspect properties of the end result without evaluating
○ Allows type/error checking client side before sending job to server/cluster
○ Make bad code fail fast!
○ Gives query/execution optimizer the full picture → better plans
● Developed by Wes McKinney, Phillip Cloud, and community
Ibis in Action
Establish a connection object
Create an object that refers to a
table
Table object contains information
about the schema
Ibis in Action
Columns and aggregation
● Column selection looks like pandas
● Methods for defining aggregation and
computation (e.g. sum)
● Computation deferred until final step
when execution is called
Ibis in Action
Joins
● Define join in object-oriented fashion
● Potential columns and types are
known before actually evaluation
Ibis in Action
Making Ibis More Versatile
To cover the full range of DS tasks in Postgres/Greenplum, ibis needs
some further development
● Ability to create and use user-defined functions (UDFs)
● Ability to create a table and save the results to it
● Data science modeling abstractions that use ibis table objects as input
Let’s round out ibis and give
Postgres/Greenplum a modern
data science interface
References
● Ibis project: docs.ibis-project.org
● Greenplum.org
● Pandas.pydata.org
● Scikit-learn.org
● pivotal.io/pivotal-data-science

More Related Content

What's hot

Distributing big astronomical catalogues with Greenplum - Greenplum Summit 2019
Distributing big astronomical catalogues with Greenplum - Greenplum Summit 2019Distributing big astronomical catalogues with Greenplum - Greenplum Summit 2019
Distributing big astronomical catalogues with Greenplum - Greenplum Summit 2019
VMware Tanzu
 
Learn How Dell Improved Postgres/Greenplum Performance 20x with a Database Pr...
Learn How Dell Improved Postgres/Greenplum Performance 20x with a Database Pr...Learn How Dell Improved Postgres/Greenplum Performance 20x with a Database Pr...
Learn How Dell Improved Postgres/Greenplum Performance 20x with a Database Pr...
VMware Tanzu
 
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019
VMware Tanzu
 
Machine Learning, Graph, Text and Geospatial on Postgres and Greenplum - Gree...
Machine Learning, Graph, Text and Geospatial on Postgres and Greenplum - Gree...Machine Learning, Graph, Text and Geospatial on Postgres and Greenplum - Gree...
Machine Learning, Graph, Text and Geospatial on Postgres and Greenplum - Gree...
VMware Tanzu
 
An End-to-End Spark-Based Machine Learning Stack in the Hybrid Cloud with Far...
An End-to-End Spark-Based Machine Learning Stack in the Hybrid Cloud with Far...An End-to-End Spark-Based Machine Learning Stack in the Hybrid Cloud with Far...
An End-to-End Spark-Based Machine Learning Stack in the Hybrid Cloud with Far...
Databricks
 
Pivotal Greenplum Cloud Marketplaces - Greenplum Summit 2019
Pivotal Greenplum Cloud Marketplaces - Greenplum Summit 2019Pivotal Greenplum Cloud Marketplaces - Greenplum Summit 2019
Pivotal Greenplum Cloud Marketplaces - Greenplum Summit 2019
VMware Tanzu
 
Democratizing Data
Democratizing DataDemocratizing Data
Democratizing Data
Databricks
 
Greenplum-PXF November 2018
Greenplum-PXF November 2018Greenplum-PXF November 2018
Greenplum-PXF November 2018
KongYew Chan, MBA
 
Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN
Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARNHadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN
Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN
Josh Patterson
 
Accelerating Deep Learning Training with BigDL and Drizzle on Apache Spark wi...
Accelerating Deep Learning Training with BigDL and Drizzle on Apache Spark wi...Accelerating Deep Learning Training with BigDL and Drizzle on Apache Spark wi...
Accelerating Deep Learning Training with BigDL and Drizzle on Apache Spark wi...
Databricks
 
Distributed Heterogeneous Mixture Learning On Spark
Distributed Heterogeneous Mixture Learning On SparkDistributed Heterogeneous Mixture Learning On Spark
Distributed Heterogeneous Mixture Learning On Spark
Spark Summit
 
Ai platform at scale
Ai platform at scaleAi platform at scale
Ai platform at scale
Henry Saputra
 
Presto Summit 2018 - 03 - Starburst CBO
Presto Summit 2018  - 03 - Starburst CBOPresto Summit 2018  - 03 - Starburst CBO
Presto Summit 2018 - 03 - Starburst CBO
kbajda
 
Managing the Complete Machine Learning Lifecycle with MLflow
Managing the Complete Machine Learning Lifecycle with MLflowManaging the Complete Machine Learning Lifecycle with MLflow
Managing the Complete Machine Learning Lifecycle with MLflow
Databricks
 
Personalization Journey: From Single Node to Cloud Streaming
Personalization Journey: From Single Node to Cloud StreamingPersonalization Journey: From Single Node to Cloud Streaming
Personalization Journey: From Single Node to Cloud Streaming
Databricks
 
Optimize the Large Scale Graph Applications by using Apache Spark with 4-5x P...
Optimize the Large Scale Graph Applications by using Apache Spark with 4-5x P...Optimize the Large Scale Graph Applications by using Apache Spark with 4-5x P...
Optimize the Large Scale Graph Applications by using Apache Spark with 4-5x P...
Databricks
 
Bighead: Airbnb’s End-to-End Machine Learning Platform with Krishna Puttaswa...
 Bighead: Airbnb’s End-to-End Machine Learning Platform with Krishna Puttaswa... Bighead: Airbnb’s End-to-End Machine Learning Platform with Krishna Puttaswa...
Bighead: Airbnb’s End-to-End Machine Learning Platform with Krishna Puttaswa...
Databricks
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
DataWorks Summit
 
Real-Time Forecasting at Scale using Delta Lake and Delta Caching
Real-Time Forecasting at Scale using Delta Lake and Delta CachingReal-Time Forecasting at Scale using Delta Lake and Delta Caching
Real-Time Forecasting at Scale using Delta Lake and Delta Caching
Databricks
 
Building Robust Production Data Pipelines with Databricks Delta
Building Robust Production Data Pipelines with Databricks DeltaBuilding Robust Production Data Pipelines with Databricks Delta
Building Robust Production Data Pipelines with Databricks Delta
Databricks
 

What's hot (20)

Distributing big astronomical catalogues with Greenplum - Greenplum Summit 2019
Distributing big astronomical catalogues with Greenplum - Greenplum Summit 2019Distributing big astronomical catalogues with Greenplum - Greenplum Summit 2019
Distributing big astronomical catalogues with Greenplum - Greenplum Summit 2019
 
Learn How Dell Improved Postgres/Greenplum Performance 20x with a Database Pr...
Learn How Dell Improved Postgres/Greenplum Performance 20x with a Database Pr...Learn How Dell Improved Postgres/Greenplum Performance 20x with a Database Pr...
Learn How Dell Improved Postgres/Greenplum Performance 20x with a Database Pr...
 
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019
AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019
 
Machine Learning, Graph, Text and Geospatial on Postgres and Greenplum - Gree...
Machine Learning, Graph, Text and Geospatial on Postgres and Greenplum - Gree...Machine Learning, Graph, Text and Geospatial on Postgres and Greenplum - Gree...
Machine Learning, Graph, Text and Geospatial on Postgres and Greenplum - Gree...
 
An End-to-End Spark-Based Machine Learning Stack in the Hybrid Cloud with Far...
An End-to-End Spark-Based Machine Learning Stack in the Hybrid Cloud with Far...An End-to-End Spark-Based Machine Learning Stack in the Hybrid Cloud with Far...
An End-to-End Spark-Based Machine Learning Stack in the Hybrid Cloud with Far...
 
Pivotal Greenplum Cloud Marketplaces - Greenplum Summit 2019
Pivotal Greenplum Cloud Marketplaces - Greenplum Summit 2019Pivotal Greenplum Cloud Marketplaces - Greenplum Summit 2019
Pivotal Greenplum Cloud Marketplaces - Greenplum Summit 2019
 
Democratizing Data
Democratizing DataDemocratizing Data
Democratizing Data
 
Greenplum-PXF November 2018
Greenplum-PXF November 2018Greenplum-PXF November 2018
Greenplum-PXF November 2018
 
Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN
Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARNHadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN
Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN
 
Accelerating Deep Learning Training with BigDL and Drizzle on Apache Spark wi...
Accelerating Deep Learning Training with BigDL and Drizzle on Apache Spark wi...Accelerating Deep Learning Training with BigDL and Drizzle on Apache Spark wi...
Accelerating Deep Learning Training with BigDL and Drizzle on Apache Spark wi...
 
Distributed Heterogeneous Mixture Learning On Spark
Distributed Heterogeneous Mixture Learning On SparkDistributed Heterogeneous Mixture Learning On Spark
Distributed Heterogeneous Mixture Learning On Spark
 
Ai platform at scale
Ai platform at scaleAi platform at scale
Ai platform at scale
 
Presto Summit 2018 - 03 - Starburst CBO
Presto Summit 2018  - 03 - Starburst CBOPresto Summit 2018  - 03 - Starburst CBO
Presto Summit 2018 - 03 - Starburst CBO
 
Managing the Complete Machine Learning Lifecycle with MLflow
Managing the Complete Machine Learning Lifecycle with MLflowManaging the Complete Machine Learning Lifecycle with MLflow
Managing the Complete Machine Learning Lifecycle with MLflow
 
Personalization Journey: From Single Node to Cloud Streaming
Personalization Journey: From Single Node to Cloud StreamingPersonalization Journey: From Single Node to Cloud Streaming
Personalization Journey: From Single Node to Cloud Streaming
 
Optimize the Large Scale Graph Applications by using Apache Spark with 4-5x P...
Optimize the Large Scale Graph Applications by using Apache Spark with 4-5x P...Optimize the Large Scale Graph Applications by using Apache Spark with 4-5x P...
Optimize the Large Scale Graph Applications by using Apache Spark with 4-5x P...
 
Bighead: Airbnb’s End-to-End Machine Learning Platform with Krishna Puttaswa...
 Bighead: Airbnb’s End-to-End Machine Learning Platform with Krishna Puttaswa... Bighead: Airbnb’s End-to-End Machine Learning Platform with Krishna Puttaswa...
Bighead: Airbnb’s End-to-End Machine Learning Platform with Krishna Puttaswa...
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Real-Time Forecasting at Scale using Delta Lake and Delta Caching
Real-Time Forecasting at Scale using Delta Lake and Delta CachingReal-Time Forecasting at Scale using Delta Lake and Delta Caching
Real-Time Forecasting at Scale using Delta Lake and Delta Caching
 
Building Robust Production Data Pipelines with Databricks Delta
Building Robust Production Data Pipelines with Databricks DeltaBuilding Robust Production Data Pipelines with Databricks Delta
Building Robust Production Data Pipelines with Databricks Delta
 

Similar to A Modern Interface for Data Science on Postgres/Greenplum - Greenplum Summit 2019

Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dan Lynn
 
Pullareddy_tavva_resume.doc
Pullareddy_tavva_resume.docPullareddy_tavva_resume.doc
Pullareddy_tavva_resume.doc
T Pulla Reddy
 
Resume 11 2015
Resume 11 2015Resume 11 2015
Resume 11 2015
Sukanta Saha
 
Dirty data? Clean it up! - Datapalooza Denver 2016
Dirty data? Clean it up! - Datapalooza Denver 2016Dirty data? Clean it up! - Datapalooza Denver 2016
Dirty data? Clean it up! - Datapalooza Denver 2016
Dan Lynn
 
Resume april updated
Resume april updatedResume april updated
Resume april updated
Sukanta Saha
 
Resume
ResumeResume
Resume
Sukanta Saha
 
Resume_sukanta_updated
Resume_sukanta_updatedResume_sukanta_updated
Resume_sukanta_updated
Sukanta Saha
 
Resume_APRIL_updated
Resume_APRIL_updatedResume_APRIL_updated
Resume_APRIL_updated
Sukanta Saha
 
Resume (1)
Resume (1)Resume (1)
Resume (1)
Rajat Goswami
 
SURENDRANATH GANDLA4
SURENDRANATH GANDLA4SURENDRANATH GANDLA4
SURENDRANATH GANDLA4
Surendranath Gandla
 
Uday Job Profile
Uday Job ProfileUday Job Profile
Uday Job Profile
udaypathipati
 
Sujit lead plsql
Sujit lead plsqlSujit lead plsql
Sujit lead plsql
Sujit Jha
 
Ravi_Shrivas_CV
Ravi_Shrivas_CVRavi_Shrivas_CV
Ravi_Shrivas_CV
Ravi Shrivas
 
Enabling Your Data Science Team with Modern Data Engineering
Enabling Your Data Science Team with Modern Data EngineeringEnabling Your Data Science Team with Modern Data Engineering
Enabling Your Data Science Team with Modern Data Engineering
James Densmore
 
ML Platform Q1 Meetup: Airbnb's End-to-End Machine Learning Infrastructure
ML Platform Q1 Meetup: Airbnb's End-to-End Machine Learning InfrastructureML Platform Q1 Meetup: Airbnb's End-to-End Machine Learning Infrastructure
ML Platform Q1 Meetup: Airbnb's End-to-End Machine Learning Infrastructure
Fei Chen
 
Using Databricks as an Analysis Platform
Using Databricks as an Analysis PlatformUsing Databricks as an Analysis Platform
Using Databricks as an Analysis Platform
Databricks
 
Parfenov Vladimir br
Parfenov Vladimir brParfenov Vladimir br
Parfenov Vladimir br
Vlad Parfenov
 
Shaik Niyas Ahamed M Resume
Shaik Niyas Ahamed M ResumeShaik Niyas Ahamed M Resume
Shaik Niyas Ahamed M Resume
Shaik Niyas Ahamed M
 
Oop lec 2(introduction to object oriented technology)
Oop lec 2(introduction to object oriented technology)Oop lec 2(introduction to object oriented technology)
Oop lec 2(introduction to object oriented technology)
Asfand Hassan
 
My C.V
My C.VMy C.V

Similar to A Modern Interface for Data Science on Postgres/Greenplum - Greenplum Summit 2019 (20)

Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
 
Pullareddy_tavva_resume.doc
Pullareddy_tavva_resume.docPullareddy_tavva_resume.doc
Pullareddy_tavva_resume.doc
 
Resume 11 2015
Resume 11 2015Resume 11 2015
Resume 11 2015
 
Dirty data? Clean it up! - Datapalooza Denver 2016
Dirty data? Clean it up! - Datapalooza Denver 2016Dirty data? Clean it up! - Datapalooza Denver 2016
Dirty data? Clean it up! - Datapalooza Denver 2016
 
Resume april updated
Resume april updatedResume april updated
Resume april updated
 
Resume
ResumeResume
Resume
 
Resume_sukanta_updated
Resume_sukanta_updatedResume_sukanta_updated
Resume_sukanta_updated
 
Resume_APRIL_updated
Resume_APRIL_updatedResume_APRIL_updated
Resume_APRIL_updated
 
Resume (1)
Resume (1)Resume (1)
Resume (1)
 
SURENDRANATH GANDLA4
SURENDRANATH GANDLA4SURENDRANATH GANDLA4
SURENDRANATH GANDLA4
 
Uday Job Profile
Uday Job ProfileUday Job Profile
Uday Job Profile
 
Sujit lead plsql
Sujit lead plsqlSujit lead plsql
Sujit lead plsql
 
Ravi_Shrivas_CV
Ravi_Shrivas_CVRavi_Shrivas_CV
Ravi_Shrivas_CV
 
Enabling Your Data Science Team with Modern Data Engineering
Enabling Your Data Science Team with Modern Data EngineeringEnabling Your Data Science Team with Modern Data Engineering
Enabling Your Data Science Team with Modern Data Engineering
 
ML Platform Q1 Meetup: Airbnb's End-to-End Machine Learning Infrastructure
ML Platform Q1 Meetup: Airbnb's End-to-End Machine Learning InfrastructureML Platform Q1 Meetup: Airbnb's End-to-End Machine Learning Infrastructure
ML Platform Q1 Meetup: Airbnb's End-to-End Machine Learning Infrastructure
 
Using Databricks as an Analysis Platform
Using Databricks as an Analysis PlatformUsing Databricks as an Analysis Platform
Using Databricks as an Analysis Platform
 
Parfenov Vladimir br
Parfenov Vladimir brParfenov Vladimir br
Parfenov Vladimir br
 
Shaik Niyas Ahamed M Resume
Shaik Niyas Ahamed M ResumeShaik Niyas Ahamed M Resume
Shaik Niyas Ahamed M Resume
 
Oop lec 2(introduction to object oriented technology)
Oop lec 2(introduction to object oriented technology)Oop lec 2(introduction to object oriented technology)
Oop lec 2(introduction to object oriented technology)
 
My C.V
My C.VMy C.V
My C.V
 

More from VMware Tanzu

Spring into AI presented by Dan Vega 5/14
Spring into AI presented by Dan Vega 5/14Spring into AI presented by Dan Vega 5/14
Spring into AI presented by Dan Vega 5/14
VMware Tanzu
 
What AI Means For Your Product Strategy And What To Do About It
What AI Means For Your Product Strategy And What To Do About ItWhat AI Means For Your Product Strategy And What To Do About It
What AI Means For Your Product Strategy And What To Do About It
VMware Tanzu
 
Make the Right Thing the Obvious Thing at Cardinal Health 2023
Make the Right Thing the Obvious Thing at Cardinal Health 2023Make the Right Thing the Obvious Thing at Cardinal Health 2023
Make the Right Thing the Obvious Thing at Cardinal Health 2023
VMware Tanzu
 
Enhancing DevEx and Simplifying Operations at Scale
Enhancing DevEx and Simplifying Operations at ScaleEnhancing DevEx and Simplifying Operations at Scale
Enhancing DevEx and Simplifying Operations at Scale
VMware Tanzu
 
Spring Update | July 2023
Spring Update | July 2023Spring Update | July 2023
Spring Update | July 2023
VMware Tanzu
 
Platforms, Platform Engineering, & Platform as a Product
Platforms, Platform Engineering, & Platform as a ProductPlatforms, Platform Engineering, & Platform as a Product
Platforms, Platform Engineering, & Platform as a Product
VMware Tanzu
 
Building Cloud Ready Apps
Building Cloud Ready AppsBuilding Cloud Ready Apps
Building Cloud Ready Apps
VMware Tanzu
 
Spring Boot 3 And Beyond
Spring Boot 3 And BeyondSpring Boot 3 And Beyond
Spring Boot 3 And Beyond
VMware Tanzu
 
Spring Cloud Gateway - SpringOne Tour 2023 Charles Schwab.pdf
Spring Cloud Gateway - SpringOne Tour 2023 Charles Schwab.pdfSpring Cloud Gateway - SpringOne Tour 2023 Charles Schwab.pdf
Spring Cloud Gateway - SpringOne Tour 2023 Charles Schwab.pdf
VMware Tanzu
 
Simplify and Scale Enterprise Apps in the Cloud | Boston 2023
Simplify and Scale Enterprise Apps in the Cloud | Boston 2023Simplify and Scale Enterprise Apps in the Cloud | Boston 2023
Simplify and Scale Enterprise Apps in the Cloud | Boston 2023
VMware Tanzu
 
Simplify and Scale Enterprise Apps in the Cloud | Seattle 2023
Simplify and Scale Enterprise Apps in the Cloud | Seattle 2023Simplify and Scale Enterprise Apps in the Cloud | Seattle 2023
Simplify and Scale Enterprise Apps in the Cloud | Seattle 2023
VMware Tanzu
 
tanzu_developer_connect.pptx
tanzu_developer_connect.pptxtanzu_developer_connect.pptx
tanzu_developer_connect.pptx
VMware Tanzu
 
Tanzu Virtual Developer Connect Workshop - French
Tanzu Virtual Developer Connect Workshop - FrenchTanzu Virtual Developer Connect Workshop - French
Tanzu Virtual Developer Connect Workshop - French
VMware Tanzu
 
Tanzu Developer Connect Workshop - English
Tanzu Developer Connect Workshop - EnglishTanzu Developer Connect Workshop - English
Tanzu Developer Connect Workshop - English
VMware Tanzu
 
Virtual Developer Connect Workshop - English
Virtual Developer Connect Workshop - EnglishVirtual Developer Connect Workshop - English
Virtual Developer Connect Workshop - English
VMware Tanzu
 
Tanzu Developer Connect - French
Tanzu Developer Connect - FrenchTanzu Developer Connect - French
Tanzu Developer Connect - French
VMware Tanzu
 
Simplify and Scale Enterprise Apps in the Cloud | Dallas 2023
Simplify and Scale Enterprise Apps in the Cloud | Dallas 2023Simplify and Scale Enterprise Apps in the Cloud | Dallas 2023
Simplify and Scale Enterprise Apps in the Cloud | Dallas 2023
VMware Tanzu
 
SpringOne Tour: Deliver 15-Factor Applications on Kubernetes with Spring Boot
SpringOne Tour: Deliver 15-Factor Applications on Kubernetes with Spring BootSpringOne Tour: Deliver 15-Factor Applications on Kubernetes with Spring Boot
SpringOne Tour: Deliver 15-Factor Applications on Kubernetes with Spring Boot
VMware Tanzu
 
SpringOne Tour: The Influential Software Engineer
SpringOne Tour: The Influential Software EngineerSpringOne Tour: The Influential Software Engineer
SpringOne Tour: The Influential Software Engineer
VMware Tanzu
 
SpringOne Tour: Domain-Driven Design: Theory vs Practice
SpringOne Tour: Domain-Driven Design: Theory vs PracticeSpringOne Tour: Domain-Driven Design: Theory vs Practice
SpringOne Tour: Domain-Driven Design: Theory vs Practice
VMware Tanzu
 

More from VMware Tanzu (20)

Spring into AI presented by Dan Vega 5/14
Spring into AI presented by Dan Vega 5/14Spring into AI presented by Dan Vega 5/14
Spring into AI presented by Dan Vega 5/14
 
What AI Means For Your Product Strategy And What To Do About It
What AI Means For Your Product Strategy And What To Do About ItWhat AI Means For Your Product Strategy And What To Do About It
What AI Means For Your Product Strategy And What To Do About It
 
Make the Right Thing the Obvious Thing at Cardinal Health 2023
Make the Right Thing the Obvious Thing at Cardinal Health 2023Make the Right Thing the Obvious Thing at Cardinal Health 2023
Make the Right Thing the Obvious Thing at Cardinal Health 2023
 
Enhancing DevEx and Simplifying Operations at Scale
Enhancing DevEx and Simplifying Operations at ScaleEnhancing DevEx and Simplifying Operations at Scale
Enhancing DevEx and Simplifying Operations at Scale
 
Spring Update | July 2023
Spring Update | July 2023Spring Update | July 2023
Spring Update | July 2023
 
Platforms, Platform Engineering, & Platform as a Product
Platforms, Platform Engineering, & Platform as a ProductPlatforms, Platform Engineering, & Platform as a Product
Platforms, Platform Engineering, & Platform as a Product
 
Building Cloud Ready Apps
Building Cloud Ready AppsBuilding Cloud Ready Apps
Building Cloud Ready Apps
 
Spring Boot 3 And Beyond
Spring Boot 3 And BeyondSpring Boot 3 And Beyond
Spring Boot 3 And Beyond
 
Spring Cloud Gateway - SpringOne Tour 2023 Charles Schwab.pdf
Spring Cloud Gateway - SpringOne Tour 2023 Charles Schwab.pdfSpring Cloud Gateway - SpringOne Tour 2023 Charles Schwab.pdf
Spring Cloud Gateway - SpringOne Tour 2023 Charles Schwab.pdf
 
Simplify and Scale Enterprise Apps in the Cloud | Boston 2023
Simplify and Scale Enterprise Apps in the Cloud | Boston 2023Simplify and Scale Enterprise Apps in the Cloud | Boston 2023
Simplify and Scale Enterprise Apps in the Cloud | Boston 2023
 
Simplify and Scale Enterprise Apps in the Cloud | Seattle 2023
Simplify and Scale Enterprise Apps in the Cloud | Seattle 2023Simplify and Scale Enterprise Apps in the Cloud | Seattle 2023
Simplify and Scale Enterprise Apps in the Cloud | Seattle 2023
 
tanzu_developer_connect.pptx
tanzu_developer_connect.pptxtanzu_developer_connect.pptx
tanzu_developer_connect.pptx
 
Tanzu Virtual Developer Connect Workshop - French
Tanzu Virtual Developer Connect Workshop - FrenchTanzu Virtual Developer Connect Workshop - French
Tanzu Virtual Developer Connect Workshop - French
 
Tanzu Developer Connect Workshop - English
Tanzu Developer Connect Workshop - EnglishTanzu Developer Connect Workshop - English
Tanzu Developer Connect Workshop - English
 
Virtual Developer Connect Workshop - English
Virtual Developer Connect Workshop - EnglishVirtual Developer Connect Workshop - English
Virtual Developer Connect Workshop - English
 
Tanzu Developer Connect - French
Tanzu Developer Connect - FrenchTanzu Developer Connect - French
Tanzu Developer Connect - French
 
Simplify and Scale Enterprise Apps in the Cloud | Dallas 2023
Simplify and Scale Enterprise Apps in the Cloud | Dallas 2023Simplify and Scale Enterprise Apps in the Cloud | Dallas 2023
Simplify and Scale Enterprise Apps in the Cloud | Dallas 2023
 
SpringOne Tour: Deliver 15-Factor Applications on Kubernetes with Spring Boot
SpringOne Tour: Deliver 15-Factor Applications on Kubernetes with Spring BootSpringOne Tour: Deliver 15-Factor Applications on Kubernetes with Spring Boot
SpringOne Tour: Deliver 15-Factor Applications on Kubernetes with Spring Boot
 
SpringOne Tour: The Influential Software Engineer
SpringOne Tour: The Influential Software EngineerSpringOne Tour: The Influential Software Engineer
SpringOne Tour: The Influential Software Engineer
 
SpringOne Tour: Domain-Driven Design: Theory vs Practice
SpringOne Tour: Domain-Driven Design: Theory vs PracticeSpringOne Tour: Domain-Driven Design: Theory vs Practice
SpringOne Tour: Domain-Driven Design: Theory vs Practice
 

Recently uploaded

Odoo ERP Vs. Traditional ERP Systems – A Comparative Analysis
Odoo ERP Vs. Traditional ERP Systems – A Comparative AnalysisOdoo ERP Vs. Traditional ERP Systems – A Comparative Analysis
Odoo ERP Vs. Traditional ERP Systems – A Comparative Analysis
Envertis Software Solutions
 
Energy consumption of Database Management - Florina Jonuzi
Energy consumption of Database Management - Florina JonuziEnergy consumption of Database Management - Florina Jonuzi
Energy consumption of Database Management - Florina Jonuzi
Green Software Development
 
Transform Your Communication with Cloud-Based IVR Solutions
Transform Your Communication with Cloud-Based IVR SolutionsTransform Your Communication with Cloud-Based IVR Solutions
Transform Your Communication with Cloud-Based IVR Solutions
TheSMSPoint
 
Top 9 Trends in Cybersecurity for 2024.pptx
Top 9 Trends in Cybersecurity for 2024.pptxTop 9 Trends in Cybersecurity for 2024.pptx
Top 9 Trends in Cybersecurity for 2024.pptx
devvsandy
 
Using Xen Hypervisor for Functional Safety
Using Xen Hypervisor for Functional SafetyUsing Xen Hypervisor for Functional Safety
Using Xen Hypervisor for Functional Safety
Ayan Halder
 
UI5con 2024 - Bring Your Own Design System
UI5con 2024 - Bring Your Own Design SystemUI5con 2024 - Bring Your Own Design System
UI5con 2024 - Bring Your Own Design System
Peter Muessig
 
All you need to know about Spring Boot and GraalVM
All you need to know about Spring Boot and GraalVMAll you need to know about Spring Boot and GraalVM
All you need to know about Spring Boot and GraalVM
Alina Yurenko
 
在线购买加拿大英属哥伦比亚大学毕业证本科学位证书原版一模一样
在线购买加拿大英属哥伦比亚大学毕业证本科学位证书原版一模一样在线购买加拿大英属哥伦比亚大学毕业证本科学位证书原版一模一样
在线购买加拿大英属哥伦比亚大学毕业证本科学位证书原版一模一样
mz5nrf0n
 
SMS API Integration in Saudi Arabia| Best SMS API Service
SMS API Integration in Saudi Arabia| Best SMS API ServiceSMS API Integration in Saudi Arabia| Best SMS API Service
SMS API Integration in Saudi Arabia| Best SMS API Service
Yara Milbes
 
Oracle 23c New Features For DBAs and Developers.pptx
Oracle 23c New Features For DBAs and Developers.pptxOracle 23c New Features For DBAs and Developers.pptx
Oracle 23c New Features For DBAs and Developers.pptx
Remote DBA Services
 
WWDC 2024 Keynote Review: For CocoaCoders Austin
WWDC 2024 Keynote Review: For CocoaCoders AustinWWDC 2024 Keynote Review: For CocoaCoders Austin
WWDC 2024 Keynote Review: For CocoaCoders Austin
Patrick Weigel
 
Modelling Up - DDDEurope 2024 - Amsterdam
Modelling Up - DDDEurope 2024 - AmsterdamModelling Up - DDDEurope 2024 - Amsterdam
Modelling Up - DDDEurope 2024 - Amsterdam
Alberto Brandolini
 
Enums On Steroids - let's look at sealed classes !
Enums On Steroids - let's look at sealed classes !Enums On Steroids - let's look at sealed classes !
Enums On Steroids - let's look at sealed classes !
Marcin Chrost
 
UI5con 2024 - Boost Your Development Experience with UI5 Tooling Extensions
UI5con 2024 - Boost Your Development Experience with UI5 Tooling ExtensionsUI5con 2024 - Boost Your Development Experience with UI5 Tooling Extensions
UI5con 2024 - Boost Your Development Experience with UI5 Tooling Extensions
Peter Muessig
 
一比一原版(USF毕业证)旧金山大学毕业证如何办理
一比一原版(USF毕业证)旧金山大学毕业证如何办理一比一原版(USF毕业证)旧金山大学毕业证如何办理
一比一原版(USF毕业证)旧金山大学毕业证如何办理
dakas1
 
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
mz5nrf0n
 
Lecture 2 - software testing SE 412.pptx
Lecture 2 - software testing SE 412.pptxLecture 2 - software testing SE 412.pptx
Lecture 2 - software testing SE 412.pptx
TaghreedAltamimi
 
Oracle Database 19c New Features for DBAs and Developers.pptx
Oracle Database 19c New Features for DBAs and Developers.pptxOracle Database 19c New Features for DBAs and Developers.pptx
Oracle Database 19c New Features for DBAs and Developers.pptx
Remote DBA Services
 
What next after learning python programming basics
What next after learning python programming basicsWhat next after learning python programming basics
What next after learning python programming basics
Rakesh Kumar R
 
SQL Accounting Software Brochure Malaysia
SQL Accounting Software Brochure MalaysiaSQL Accounting Software Brochure Malaysia
SQL Accounting Software Brochure Malaysia
GohKiangHock
 

Recently uploaded (20)

Odoo ERP Vs. Traditional ERP Systems – A Comparative Analysis
Odoo ERP Vs. Traditional ERP Systems – A Comparative AnalysisOdoo ERP Vs. Traditional ERP Systems – A Comparative Analysis
Odoo ERP Vs. Traditional ERP Systems – A Comparative Analysis
 
Energy consumption of Database Management - Florina Jonuzi
Energy consumption of Database Management - Florina JonuziEnergy consumption of Database Management - Florina Jonuzi
Energy consumption of Database Management - Florina Jonuzi
 
Transform Your Communication with Cloud-Based IVR Solutions
Transform Your Communication with Cloud-Based IVR SolutionsTransform Your Communication with Cloud-Based IVR Solutions
Transform Your Communication with Cloud-Based IVR Solutions
 
Top 9 Trends in Cybersecurity for 2024.pptx
Top 9 Trends in Cybersecurity for 2024.pptxTop 9 Trends in Cybersecurity for 2024.pptx
Top 9 Trends in Cybersecurity for 2024.pptx
 
Using Xen Hypervisor for Functional Safety
Using Xen Hypervisor for Functional SafetyUsing Xen Hypervisor for Functional Safety
Using Xen Hypervisor for Functional Safety
 
UI5con 2024 - Bring Your Own Design System
UI5con 2024 - Bring Your Own Design SystemUI5con 2024 - Bring Your Own Design System
UI5con 2024 - Bring Your Own Design System
 
All you need to know about Spring Boot and GraalVM
All you need to know about Spring Boot and GraalVMAll you need to know about Spring Boot and GraalVM
All you need to know about Spring Boot and GraalVM
 
在线购买加拿大英属哥伦比亚大学毕业证本科学位证书原版一模一样
在线购买加拿大英属哥伦比亚大学毕业证本科学位证书原版一模一样在线购买加拿大英属哥伦比亚大学毕业证本科学位证书原版一模一样
在线购买加拿大英属哥伦比亚大学毕业证本科学位证书原版一模一样
 
SMS API Integration in Saudi Arabia| Best SMS API Service
SMS API Integration in Saudi Arabia| Best SMS API ServiceSMS API Integration in Saudi Arabia| Best SMS API Service
SMS API Integration in Saudi Arabia| Best SMS API Service
 
Oracle 23c New Features For DBAs and Developers.pptx
Oracle 23c New Features For DBAs and Developers.pptxOracle 23c New Features For DBAs and Developers.pptx
Oracle 23c New Features For DBAs and Developers.pptx
 
WWDC 2024 Keynote Review: For CocoaCoders Austin
WWDC 2024 Keynote Review: For CocoaCoders AustinWWDC 2024 Keynote Review: For CocoaCoders Austin
WWDC 2024 Keynote Review: For CocoaCoders Austin
 
Modelling Up - DDDEurope 2024 - Amsterdam
Modelling Up - DDDEurope 2024 - AmsterdamModelling Up - DDDEurope 2024 - Amsterdam
Modelling Up - DDDEurope 2024 - Amsterdam
 
Enums On Steroids - let's look at sealed classes !
Enums On Steroids - let's look at sealed classes !Enums On Steroids - let's look at sealed classes !
Enums On Steroids - let's look at sealed classes !
 
UI5con 2024 - Boost Your Development Experience with UI5 Tooling Extensions
UI5con 2024 - Boost Your Development Experience with UI5 Tooling ExtensionsUI5con 2024 - Boost Your Development Experience with UI5 Tooling Extensions
UI5con 2024 - Boost Your Development Experience with UI5 Tooling Extensions
 
一比一原版(USF毕业证)旧金山大学毕业证如何办理
一比一原版(USF毕业证)旧金山大学毕业证如何办理一比一原版(USF毕业证)旧金山大学毕业证如何办理
一比一原版(USF毕业证)旧金山大学毕业证如何办理
 
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
 
Lecture 2 - software testing SE 412.pptx
Lecture 2 - software testing SE 412.pptxLecture 2 - software testing SE 412.pptx
Lecture 2 - software testing SE 412.pptx
 
Oracle Database 19c New Features for DBAs and Developers.pptx
Oracle Database 19c New Features for DBAs and Developers.pptxOracle Database 19c New Features for DBAs and Developers.pptx
Oracle Database 19c New Features for DBAs and Developers.pptx
 
What next after learning python programming basics
What next after learning python programming basicsWhat next after learning python programming basics
What next after learning python programming basics
 
SQL Accounting Software Brochure Malaysia
SQL Accounting Software Brochure MalaysiaSQL Accounting Software Brochure Malaysia
SQL Accounting Software Brochure Malaysia
 

A Modern Interface for Data Science on Postgres/Greenplum - Greenplum Summit 2019

  • 1.
  • 2. © Copyright 2019 Pivotal Software, Inc. All rights Reserved. Scott Hajek @jscotthajek A Modern Interface for Data Science on Postgres/Greenplum
  • 3. Cover w/ Image Executive Summary ■ Who am I? ■ Pure SQL: not the preferred interface for sophisticated data scientists ■ Maturity and scale of SQL-based systems are what enterprise DS demands ■ Data Frames: A better abstraction that can be layered on top of SQL systems ○ Ibis: Python implementation of Data Frames for SQL and large data platforms
  • 4. Who is Scott Hajek? ● Data Science consulting for 5+ years ● Senior Data Scientist for Pivotal ● Specialty in Linguistics and Natural Language Processing ● Many industries: ○ Banking, Telecom, Manufacturing ● Problems tackled ○ Entity resolution, info extraction, optimization, anomaly detection, e-comm surveillance
  • 5. Personas ● DBA ● Application developer ● Business Analyst ● Data Scientist ○ Operates on large data sets interactively ○ Uses advanced statistics and machine learning techniques ○ Sophisticated programmer ○ Appreciates good abstractions Different kinds of users of a database system
  • 6. COBOL: IDENTIFICATION DIVISION. PROGRAM-ID. BINSRCH1. The binary search reads every input record after looking up the employee’s month of hire on a table, by a sequential search, it writes it out to an output file ENVIRONMENT DIVISION. INPUT-OUTPUT SECTION. FILE-CONTROL. ... Python vs R Abstraction ● Today popular languages have abstractions like DataFrames ● Not just about fewer lines of code ● Easier to reason about Rob Story, bit.ly/python-vs-R-vs-cobol vs COBOL which(sapply(dataframe, function(x) any(month == "January"))) dataframe[dataframe.month == "January"]
  • 7. Good Abstractions in Math Matrix notation makes multiple linear regression digestible ● Simple linear regression ● Multiple regression without matrix notation ● Multiple regression with matrix notation
  • 8. Good Abstractions for Tabular Data SQL ● Pros ○ Well-defined standards, very familiar in enterprise ○ Declarative language ○ Analytic operations available ○ Abstractions for tables, columns, windows ● Cons ○ Verbose ○ Difficult to compose complex queries and transformations by hand ○ Difficult to represent subqueries and intermediate result sets PL /
  • 9. Good Abstractions for Tabular Data Data Frames (df) ● Tabular data structures with named columns ● Different types allowed in different columns ● Easy to select subset of columns ● Analytic operations available: ○ Arithmetic, joins, maps, filters, aggregate & window functions ● Popular with Data Scientists: ○ Easily hooks into programming languages ○ Flexible for interactive data exploration ○ Represent sub-queries as variables → clear data flow through pipelines
  • 10. Good Abstractions in Data Science Packages Model Tra mo from sklearn.ensemble import RandomForestClassifier clf = RandomForestClassifier() clf.fit(X, y) 10 Abstractions for training
  • 11. Good Abstractions in Data Science Packages Pre t al clf.predict(new_data) Regression 11 Model Abstractions for predicting
  • 12. Challenges with Writing Complex SQL Directly App developers use frameworks to avoid writing SQL directly ● Tedious, error prone ● Insecure (SQL injection) ● Significant rewrite needed if you switch the flavor of database, especially if you’re utilizing advanced features of databases. ● Instead they use Object-Relational Mapping (ORMs) to generate SQL ○ Spring or ActiveRecord from Ruby on Rails
  • 13. Reconsidering SQL-based Platforms Relational DB Management Systems have a lot to offer ● Stability ● SQL is the most common and familiar language in the enterprise ● Analytical capabilities ● MPP variants offer massive scale in storage and processing
  • 14. What is Ibis? “A [Python] pandas-like deferred expression system, with first-class SQL support” ● Pandas: Python package with DataFrame abstraction, staple for data scientists ● Same code can work on multiple data platforms ● Deferred expression → lazy evaluation ○ Define complex pipeline of transformations, represented as an object ○ Can inspect properties of the end result without evaluating ○ Allows type/error checking client side before sending job to server/cluster ○ Make bad code fail fast! ○ Gives query/execution optimizer the full picture → better plans ● Developed by Wes McKinney, Phillip Cloud, and community
  • 15. Ibis in Action Establish a connection object Create an object that refers to a table Table object contains information about the schema
  • 16. Ibis in Action Columns and aggregation ● Column selection looks like pandas ● Methods for defining aggregation and computation (e.g. sum) ● Computation deferred until final step when execution is called
  • 17. Ibis in Action Joins ● Define join in object-oriented fashion ● Potential columns and types are known before actually evaluation
  • 19. Making Ibis More Versatile To cover the full range of DS tasks in Postgres/Greenplum, ibis needs some further development ● Ability to create and use user-defined functions (UDFs) ● Ability to create a table and save the results to it ● Data science modeling abstractions that use ibis table objects as input
  • 20. Let’s round out ibis and give Postgres/Greenplum a modern data science interface
  • 21. References ● Ibis project: docs.ibis-project.org ● Greenplum.org ● Pandas.pydata.org ● Scikit-learn.org ● pivotal.io/pivotal-data-science