A Modern Interface for Data Science on Postgres/Greenplum - Greenplum Summit 2019



Greenplum Summit 2019
Scott Hajek

Published in: Software


  1. 1. A Modern Interface for Data Science on Postgres/Greenplum. Scott Hajek (@jscotthajek). © Copyright 2019 Pivotal Software, Inc. All Rights Reserved.
  2. 2. Executive Summary ■ Who am I? ■ Pure SQL is not the preferred interface for sophisticated data scientists ■ Enterprise data science demands the maturity and scale of SQL-based systems ■ Data frames are a better abstraction that can be layered on top of SQL systems ○ Ibis: a Python implementation of data frames for SQL and large data platforms
  3. 3. Who is Scott Hajek? ● Data Science consulting for 5+ years ● Senior Data Scientist for Pivotal ● Specialty in Linguistics and Natural Language Processing ● Many industries: ○ Banking, Telecom, Manufacturing ● Problems tackled ○ Entity resolution, info extraction, optimization, anomaly detection, e-comm surveillance
  4. 4. Personas: different kinds of users of a database system ● DBA ● Application developer ● Business Analyst ● Data Scientist ○ Operates on large data sets interactively ○ Uses advanced statistics and machine learning techniques ○ Sophisticated programmer ○ Appreciates good abstractions
  5. 5. Abstraction: COBOL vs R vs Python ● Today’s popular languages have abstractions like DataFrames ● Not just about fewer lines of code ● Easier to reason about ● COBOL: IDENTIFICATION DIVISION. PROGRAM-ID. BINSRCH1. (Program comment: the binary search reads every input record and, after looking up the employee’s month of hire in a table by a sequential search, writes it out to an output file.) ENVIRONMENT DIVISION. INPUT-OUTPUT SECTION. FILE-CONTROL. ... ● R: which(sapply(dataframe, function(x) any(month == "January"))) ● Python: dataframe[dataframe.month == "January"] ● Credit: Rob Story
  6. 6. Good Abstractions in Math Matrix notation makes multiple linear regression digestible ● Simple linear regression ● Multiple regression without matrix notation ● Multiple regression with matrix notation
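The equations from this slide did not survive in the transcript. As a stand-in, the standard textbook forms of the three cases are sketched below; the symbols ($\beta$, $\varepsilon$, $n$ observations, $p$ predictors) are assumptions and may differ from the notation on the original slide.

```latex
% Simple linear regression: one predictor
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i = 1, \dots, n

% Multiple regression without matrix notation: p predictors, written out per observation
y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \varepsilon_i,
\qquad i = 1, \dots, n

% Multiple regression with matrix notation: X is n x (p+1) with a leading column of ones
\mathbf{y} = X \boldsymbol{\beta} + \boldsymbol{\varepsilon},
\qquad
\hat{\boldsymbol{\beta}} = (X^{\top} X)^{-1} X^{\top} \mathbf{y}
```

The matrix form states the same model as the per-observation form, but the whole system and its least-squares solution each fit on one line.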
  7. 7. Good Abstractions for Tabular Data SQL ● Pros ○ Well-defined standards, very familiar in enterprise ○ Declarative language ○ Analytic operations available ○ Abstractions for tables, columns, windows ● Cons ○ Verbose ○ Difficult to compose complex queries and transformations by hand ○ Difficult to represent subqueries and intermediate result sets PL /
  8. 8. Good Abstractions for Tabular Data Data Frames (df) ● Tabular data structures with named columns ● Different types allowed in different columns ● Easy to select subset of columns ● Analytic operations available: ○ Arithmetic, joins, maps, filters, aggregate & window functions ● Popular with Data Scientists: ○ Easily hooks into programming languages ○ Flexible for interactive data exploration ○ Represent sub-queries as variables → clear data flow through pipelines
  9. 9. Good Abstractions in Data Science Packages: abstractions for training ● from sklearn.ensemble import RandomForestClassifier; clf = RandomForestClassifier(); clf.fit(X, y)
  10. 10. Good Abstractions in Data Science Packages: abstractions for predicting ● clf.predict(new_data)
  11. 11. Challenges with Writing Complex SQL Directly App developers use frameworks to avoid writing SQL directly ● Tedious, error prone ● Insecure (SQL injection) ● Significant rewrite needed if you switch database flavors, especially if you rely on advanced, database-specific features ● Instead they use Object-Relational Mappers (ORMs) to generate SQL ○ Spring or ActiveRecord from Ruby on Rails
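The SQL injection risk mentioned on this slide can be illustrated with a minimal sketch. It uses Python's standard-library `sqlite3` as a stand-in for a Postgres/Greenplum driver; the `users` table and the helper names `find_unsafe` and `find_safe` are hypothetical and not from the talk.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (name TEXT, role TEXT)")
con.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("alice", "admin"), ("bob", "analyst")],
)

def find_unsafe(name):
    # Hand-built SQL: the input is pasted into the query text,
    # so it can change the structure of the statement.
    query = f"SELECT name FROM users WHERE name = '{name}'"
    return con.execute(query).fetchall()

def find_safe(name):
    # Parameter binding: the driver passes the input as a value,
    # never as SQL text.
    return con.execute("SELECT name FROM users WHERE name = ?", (name,)).fetchall()

malicious = "nobody' OR '1'='1"
unsafe_rows = find_unsafe(malicious)  # the OR clause matches every row
safe_rows = find_safe(malicious)      # no user has this literal name
```

Frameworks and ORMs generate the parameterized form for you, which is one reason application developers rarely concatenate SQL by hand.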
  12. 12. Reconsidering SQL-based Platforms Relational DB Management Systems have a lot to offer ● Stability ● SQL is the most common and familiar language in the enterprise ● Analytical capabilities ● MPP variants offer massive scale in storage and processing
  13. 13. What is Ibis? “A [Python] pandas-like deferred expression system, with first-class SQL support” ● Pandas: Python package with DataFrame abstraction, staple for data scientists ● Same code can work on multiple data platforms ● Deferred expression → lazy evaluation ○ Define complex pipeline of transformations, represented as an object ○ Can inspect properties of the end result without evaluating ○ Allows type/error checking client side before sending job to server/cluster ○ Make bad code fail fast! ○ Gives query/execution optimizer the full picture → better plans ● Developed by Wes McKinney, Phillip Cloud, and community
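The deferred-expression idea can be sketched with a toy class. This is not Ibis's API: the `Deferred` class, its methods, and the `sales` table are hypothetical, and standard-library `sqlite3` stands in for the database. It only illustrates building a pipeline as an object, inspecting the result's columns without running anything, and failing fast on a bad column name.

```python
import sqlite3

class Deferred:
    """Toy deferred table expression: holds SQL text and a schema, runs nothing until execute()."""

    def __init__(self, con, sql, columns, params=()):
        self.con = con
        self.sql = sql
        self.columns = list(columns)
        self.params = tuple(params)

    def _check(self, cols):
        # Client-side validation against the known schema, before any SQL is sent.
        missing = [c for c in cols if c not in self.columns]
        if missing:
            raise KeyError(f"unknown column(s): {missing}")

    def select(self, *cols):
        self._check(cols)
        sql = f"SELECT {', '.join(cols)} FROM ({self.sql}) AS t"
        return Deferred(self.con, sql, cols, self.params)

    def where_equals(self, col, value):
        self._check([col])
        sql = f"SELECT * FROM ({self.sql}) AS t WHERE {col} = ?"
        return Deferred(self.con, sql, self.columns, self.params + (value,))

    def sum_by(self, key, value):
        self._check([key, value])
        sql = f"SELECT {key}, SUM({value}) AS total FROM ({self.sql}) AS t GROUP BY {key}"
        return Deferred(self.con, sql, [key, "total"], self.params)

    def execute(self):
        # The only place where the database is actually queried.
        return self.con.execute(self.sql, self.params).fetchall()

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, month TEXT, amount INTEGER)")
con.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("east", "January", 100), ("east", "February", 50),
     ("west", "January", 70), ("east", "January", 30)],
)

sales = Deferred(con, "SELECT * FROM sales", ["region", "month", "amount"])
january = sales.where_equals("month", "January")  # intermediate step held as a variable
totals = january.sum_by("region", "amount")       # still nothing has been executed

early_error = None
try:
    sales.select("regoin")  # typo is caught before reaching the database
except KeyError as err:
    early_error = str(err)

result = totals.execute()
```

Because `totals` carries the full nested query, the database sees the whole pipeline at once and can plan it as a single statement.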
  14. 14. Ibis in Action Establish a connection object Create an object that refers to a table Table object contains information about the schema
  15. 15. Ibis in Action Columns and aggregation ● Column selection looks like pandas ● Methods for defining aggregation and computation (e.g. sum) ● Computation deferred until final step when execution is called
  16. 16. Ibis in Action Joins ● Define join in object-oriented fashion ● Potential columns and types are known before actual evaluation
  17. 17. Ibis in Action
  18. 18. Making Ibis More Versatile To cover the full range of DS tasks in Postgres/Greenplum, ibis needs some further development ● Ability to create and use user-defined functions (UDFs) ● Ability to create a table and save the results to it ● Data science modeling abstractions that use ibis table objects as input
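To make the first two wishlist items concrete, here is a minimal sketch of what a user-defined function and a save-results-to-table step look like at the database level. It uses standard-library `sqlite3` purely as a stand-in: in Postgres/Greenplum the function would be defined with `CREATE FUNCTION` instead, and the `reviews` table and `word_count` function are hypothetical. It does not show how Ibis does or would expose these features.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE reviews (id INTEGER, body TEXT)")
con.executemany(
    "INSERT INTO reviews VALUES (?, ?)",
    [(1, "great product"), (2, "slow and buggy release")],
)

def word_count(text):
    # Plain Python logic that SQL queries will be able to call.
    return len(text.split())

# Register the Python function as a one-argument SQL function (the UDF).
con.create_function("word_count", 1, word_count)

# Save the result of a query that uses the UDF into a new table.
con.execute(
    "CREATE TABLE review_stats AS "
    "SELECT id, word_count(body) AS n_words FROM reviews"
)

saved = con.execute("SELECT id, n_words FROM review_stats ORDER BY id").fetchall()
```

A data frame interface would ideally wrap both steps, so that a Python function and a destination table name are all the user has to supply.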
  19. 19. Let’s round out ibis and give Postgres/Greenplum a modern data science interface
  20. 20. References ● Ibis project