Splitgraph
"Docker for Data"
Artjoms Iškovs, Miles Richardson
"B.D." Building Packages Before Docker
The Dark Ages
• Sourcing packages
• Rebuilding, reconfiguring, rebuilding...
• Googling, rage inducing
Data preparation accounts for about 80% of the work of data scientists.
Why is it so hard to build and maintain data sets?
• Sourcing data is not composable
• Why can’t I query multiple data sets at once?
• Wrangling and cleaning data is not maintainable
• Why can’t I keep my data sets up to date?
• Running ad-hoc queries is not reproducible
• Why can’t I share my data sets?
What do we mean by data?
Sources
• Open Data
• Internal Data
• Licensed Data
Types
• SQL Databases
• NoSQL Databases
• CSV Files...
The journey of a dataset: Scenario
• Two publishers:
• NOAA publishes climate data
• USDA publishes corn yields
• Consumer wants to merge both data sets
• Let’s follow the climate data...
The journey of a dataset: Introduction
The journey of a dataset 1: Creation
Ingesting data from another DB via CLI
$ sgr mount -t mongo_fdw me:pwd@my_db:27017 '
{ "rainfall": {
    "db": "observations",
    "coll": "rainfall",
    "schema": {
      "timestamp": "timestamp",
      "state": "varchar",
      "rainfall": "numeric"
} } }' staging
$ sgr import staging \
  'SELECT timestamp, state, rainfall FROM rainfall' \
  noaa/climate rainfall
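Hand-written JSON inside a shell argument is easy to get wrong (a missing quote is enough to break the mount). One way around that is to generate the argument, as in the minimal sketch below. The helper `mount_command` is hypothetical and not part of sgr; it only reproduces the argument order shown on the slide, which may differ in other sgr versions.

```python
import json
import shlex

# Table definition copied from the slide: one foreign table, "rainfall",
# backed by the "rainfall" collection in the "observations" database.
tables = {
    "rainfall": {
        "db": "observations",
        "coll": "rainfall",
        "schema": {
            "timestamp": "timestamp",
            "state": "varchar",
            "rainfall": "numeric",
        },
    }
}


def mount_command(connection, tables, schema):
    """Hypothetical helper: build the slide's command line with safely quoted JSON."""
    spec = json.dumps(tables)  # always produces well-formed JSON
    # Argument order is copied from the slide, not from the sgr documentation.
    parts = ["sgr", "mount", "-t", "mongo_fdw", connection, shlex.quote(spec), schema]
    return " ".join(parts)


command = mount_command("me:pwd@my_db:27017", tables, "staging")
print(command)
```

Because the JSON comes from `json.dumps` and the quoting from `shlex.quote`, the shell receives the table definition as a single, valid argument.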
The journey of a dataset 2: Publication
Committing and Publishing Data via CLI
$ sgr publish noaa/climate data.splitgraph.com
The journey of a dataset 3: Usage
SGFiles: Dockerfiles for data
• Like Dockerfiles.
• Image: state of a database schema
• Layers with deterministic hashes; the cache is invalidated if:
• Previous layer changes
• Command changes
• Commands:
• FROM – base the image on something else
• IMPORT – import tables from another image
• SQL – run SQL against the image
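The two invalidation rules above can be sketched by hashing each layer from its parent's hash plus its own command. This is an illustration of the idea only, not Splitgraph's actual hashing scheme: `layer_hash` and `build` are hypothetical names, and a real builder would presumably also hash the contents of the images it imports from.

```python
import hashlib


def layer_hash(parent_hash, command):
    """Derive a layer's hash from its parent's hash and its own command."""
    payload = (parent_hash + "\n" + command).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def build(commands, cache):
    """Return each layer's hash and whether it was already in the cache."""
    parent = ""
    hashes, hits = [], []
    for command in commands:
        current = layer_hash(parent, command)
        hits.append(current in cache)  # a hit means the layer can be reused
        cache.add(current)
        hashes.append(current)
        parent = current
    return hashes, hits


commands = [
    "FROM usda/yields IMPORT crop_yields",
    "FROM noaa/climate:latest IMPORT rainfall",
    "SQL CREATE TABLE rainfall_yields AS SELECT ...",
]
cache = set()
_, first_hits = build(commands, cache)   # nothing cached yet
_, second_hits = build(commands, cache)  # identical build, fully cached

# Changing the second command (the tag "v2" is made up for this example)
# invalidates that layer and every layer after it.
changed = [commands[0], "FROM noaa/climate:v2 IMPORT rainfall", commands[2]]
_, third_hits = build(changed, cache)
```

The third command is unchanged in the last build, yet it still misses the cache because its parent's hash changed.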
Consumption: Demo
FROM usda/yields IMPORT crop_yields
FROM noaa/climate:latest IMPORT rainfall
SQL CREATE TABLE rainfall_yields AS
SELECT * FROM rainfall JOIN crop_yields ...
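To show what the final SQL step could produce, here is a small stand-in using SQLite in memory. Everything beyond the table names is an assumption: the slide elides the join condition, so the join on state and year, the columns of `crop_yields`, and all sample values are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- rainfall columns follow the import on the earlier slide;
-- crop_yields columns are assumed for this example.
CREATE TABLE rainfall (timestamp TEXT, state TEXT, rainfall REAL);
CREATE TABLE crop_yields (year INTEGER, state TEXT, yield REAL);
""")

# Made-up sample rows, not real NOAA or USDA figures.
conn.executemany(
    "INSERT INTO rainfall VALUES (?, ?, ?)",
    [("2018-06-01", "IA", 4.0), ("2018-07-01", "IA", 3.5), ("2018-06-01", "NE", 2.0)],
)
conn.executemany(
    "INSERT INTO crop_yields VALUES (?, ?, ?)",
    [(2018, "IA", 100.0), (2018, "NE", 200.0)],
)

# Assumed join: match on state and on the year of each rainfall observation.
conn.execute("""
CREATE TABLE rainfall_yields AS
SELECT r.state, c.year, SUM(r.rainfall) AS total_rainfall, c.yield
FROM rainfall AS r
JOIN crop_yields AS c
  ON r.state = c.state
 AND CAST(strftime('%Y', r.timestamp) AS INTEGER) = c.year
GROUP BY r.state, c.year, c.yield
""")

rows = conn.execute("SELECT * FROM rainfall_yields ORDER BY state").fetchall()
print(rows)
```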
The journey of a dataset 4: Updating
• Puerto Rico is now a US state
• NOAA wants to revise its climate data
• Can the consumer get just the changes?
Delta compression
• Only care about changes
• Need to efficiently:
• Create diffs (→ commit, push)
• Apply diffs (→ checkout, pull)
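Creating and applying row-level diffs can be sketched with tables held as dictionaries keyed by primary key. This is a simplified illustration rather than Splitgraph's storage format; `create_diff` and `apply_diff` are hypothetical names, and the sample values are made up.

```python
def create_diff(old, new):
    """Compare two snapshots (primary key -> row) and list the changes."""
    diff = []
    for key in old.keys() - new.keys():
        diff.append(("delete", key, None))
    for key, row in new.items():
        if key not in old:
            diff.append(("insert", key, row))
        elif old[key] != row:
            diff.append(("update", key, row))
    return diff


def apply_diff(table, diff):
    """Return a new snapshot with the diff applied; the input is not modified."""
    result = dict(table)
    for action, key, row in diff:
        if action == "delete":
            del result[key]
        else:  # insert and update both set the row
            result[key] = row
    return result


# Primary key: (timestamp, state); value: rainfall. Values are made up.
old = {("2018-06-01", "IA"): 4.0, ("2018-06-01", "NE"): 2.0}
new = {("2018-06-01", "IA"): 4.0, ("2018-06-01", "NE"): 2.5, ("2018-06-01", "PR"): 5.0}

diff = create_diff(old, new)       # one update (NE) and one insert (PR)
restored = apply_diff(old, diff)   # same content as `new`
```

Only the two changed rows travel in the diff; the unchanged IA row does not. Comparing full snapshots like this scans every row, which is the cost that recording changes as they happen avoids.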
Delta compression
• Docker: files, via a custom FS
• Git: lines, via diff
• Splitgraph: rows, via audit triggers
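An audit trigger records each row change as it happens, so a commit can read the log instead of comparing full tables. The sketch below uses SQLite as a stand-in so it runs anywhere. The `audit_log` table and the trigger names are assumptions for illustration and do not reflect Splitgraph's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE rainfall (
    timestamp TEXT, state TEXT, rainfall REAL,
    PRIMARY KEY (timestamp, state)
);
-- Hypothetical change log, filled only by the triggers below.
CREATE TABLE audit_log (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    action TEXT, timestamp TEXT, state TEXT, rainfall REAL
);

CREATE TRIGGER rainfall_insert AFTER INSERT ON rainfall BEGIN
    INSERT INTO audit_log (action, timestamp, state, rainfall)
    VALUES ('insert', NEW.timestamp, NEW.state, NEW.rainfall);
END;
CREATE TRIGGER rainfall_update AFTER UPDATE ON rainfall BEGIN
    INSERT INTO audit_log (action, timestamp, state, rainfall)
    VALUES ('update', NEW.timestamp, NEW.state, NEW.rainfall);
END;
CREATE TRIGGER rainfall_delete AFTER DELETE ON rainfall BEGIN
    INSERT INTO audit_log (action, timestamp, state, rainfall)
    VALUES ('delete', OLD.timestamp, OLD.state, OLD.rainfall);
END;
""")

# Ordinary writes; the application never touches audit_log directly.
conn.execute("INSERT INTO rainfall VALUES ('2018-06-01', 'IA', 4.0)")
conn.execute("INSERT INTO rainfall VALUES ('2018-06-01', 'PR', 5.0)")
conn.execute("UPDATE rainfall SET rainfall = 4.5 WHERE state = 'IA'")
conn.execute("DELETE FROM rainfall WHERE state = 'PR'")

changes = conn.execute(
    "SELECT action, state, rainfall FROM audit_log ORDER BY id"
).fetchall()
print(changes)
```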
Updating: Demo
The journey of a dataset 5: Maintenance
• Can we update it?
• Where did this dataset come from?
• Build context fully encapsulated within the metadata
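If the build context lives in the image metadata, both questions above reduce to reading that metadata. A minimal sketch, assuming the metadata records the commands and the hash of every source image used at build time; the field names and hash values are hypothetical placeholders.

```python
# Hypothetical metadata stored with the derived image.
metadata = {
    "commands": [
        "FROM usda/yields IMPORT crop_yields",
        "FROM noaa/climate:latest IMPORT rainfall",
        "SQL CREATE TABLE rainfall_yields AS SELECT ...",
    ],
    # Source image -> hash it had when this image was built (placeholders).
    "sources": {"usda/yields": "hash-a", "noaa/climate": "hash-b"},
}


def provenance(metadata):
    """Where did this dataset come from? List the source images."""
    return sorted(metadata["sources"])


def stale_sources(metadata, current_hashes):
    """Can we update it? List sources whose hash differs from the recorded one."""
    return sorted(
        name
        for name, built_from in metadata["sources"].items()
        if current_hashes.get(name) != built_from
    )


# Suppose NOAA has published a revised image since the build.
current = {"usda/yields": "hash-a", "noaa/climate": "hash-c"}
origins = provenance(metadata)
outdated = stale_sources(metadata, current)
```

Since the commands are stored alongside the source hashes, replaying them against the newer source is enough to rebuild the derived image.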
Provenance and rebasing demo
Q&A
twitter.com/splitgraph · splitgraph.com
