
Splitgraph: Docker for Data


Slides from our talk at the Oct 10, 2018 Docker Cambridge meetup discussing our product, Splitgraph, demoing an early prototype and talking about the challenges of applying the Docker model for data.

https://splitgraph.com, twitter.com/@splitgraph

Artjoms Iskovs: twitter.com/@mildbyte, mildbyte.xyz
Miles Richardson: milesrichardson.com

Published in: Technology


  1. Splitgraph: "Docker for Data". Artjoms Iškovs, Miles Richardson
  2. "B.D." Building Packages Before Docker The Dark Ages
     • Sourcing packages
     • Rebuilding, reconfiguring, rebuilding...
     • Googling, rage inducing
  3. Data preparation accounts for about 80% of the work of data scientists.
  4. Why so hard to build and maintain data sets?
     • Sourcing data is not composable
       • Why can't I query multiple data sets at once?
     • Wrangling and cleaning data is not maintainable
       • Why can't I keep my data sets up to date?
     • Running ad-hoc queries is not reproducible
       • Why can't I share my data sets?
  5. What do we mean by data?
     • Sources
       • Open Data
       • Internal Data
       • Licensed Data
     • Types
       • SQL Databases
       • NoSQL Databases
       • CSV Files...
  6. The journey of a dataset: Scenario
     • Two publishers:
       • NOAA publishes climate data
       • USDA publishes corn yields
     • Consumer wants to merge both data sets
     • Let's follow the climate data...
  7. The journey of a dataset: Introduction
  8. The journey of a dataset: Introduction
  9. The journey of a dataset 1: Creation
  10. Ingesting data from another DB via CLI
      $ sgr mount -t mongo_fdw me:pwd@my_db:27017 '{ "rainfall": { "db": "observations", "coll": "rainfall", "schema": { "timestamp": "timestamp", "state": "varchar", "rainfall": "numeric" } } }' staging
      $ sgr import staging 'SELECT timestamp, state, rainfall FROM rainfall' noaa/climate rainfall
  11. The journey of a dataset 2: Publication
  12. Committing and Publishing Data via CLI
      $ sgr publish noaa/climate data.splitgraph.com
  13. The journey of a dataset 3: Usage
  14. SGFiles: Dockerfiles for data
      • Like Dockerfiles.
      • Image: state of a database schema
      • Layers with deterministic hashes and cache invalidation if:
        • Previous layer changes
        • Command changes
      • Commands:
        • FROM – base the image on something else
        • IMPORT – import tables from another image
        • SQL – run SQL against the image
  15. Consumption: Demo
      FROM usda/yields IMPORT crop_yields
      FROM noaa/climate:latest IMPORT rainfall
      SQL CREATE TABLE rainfall_yields AS SELECT * FROM rainfall JOIN crop_yields ...
  16. The journey of a dataset 4: Updating
  17. The journey of a dataset 4: Updating
      • Puerto Rico is now a US state
      • NOAA wants to revise its climate data
      • Can the consumer get just the changes?
  18. Delta compression
      • Only care about changes
      • Need to efficiently:
        • Create diffs (→ commit, push)
        • Apply diffs (→ checkout, pull)
  19. Delta compression
      • Docker
        • Files
        • Custom FS
      • Git
        • Lines
        • diff
      • Splitgraph
        • Rows
        • Audit triggers
  20. Updating: Demo
  21. The journey of a dataset 5: Maintenance
  22. The journey of a dataset 5: Maintenance
      • Can we update it?
      • Where did this dataset come from?
      • Build context fully encapsulated within the metadata
  23. Provenance and rebasing demo
  24. Q&A twitter.com/splitgraph · splitgraph.com
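
Slide 14 says SGFile layers have deterministic hashes and that the cache is invalidated when either the previous layer or the command changes. One way to get both properties is to hash each command together with its parent's hash. The sketch below is only an illustration of that idea, not Splitgraph's actual hashing scheme; the names layer_hash and build, and the use of SHA-256, are assumptions made for the example.

```python
import hashlib


def layer_hash(parent_hash: str, command: str) -> str:
    """Hypothetical helper: derive a layer hash from the parent hash and the command text."""
    return hashlib.sha256(f"{parent_hash}\n{command}".encode("utf-8")).hexdigest()


def build(commands, cache):
    """Walk the commands in order and return (layer hashes, commands that had to run).

    `cache` is a set of layer hashes that already exist. A command is
    re-executed only when its layer hash is missing from the cache, which
    happens if the command itself or any earlier command has changed.
    """
    parent = ""  # assumed starting value for the first layer
    hashes, executed = [], []
    for command in commands:
        current = layer_hash(parent, command)
        if current not in cache:
            executed.append(command)  # cache miss: this layer must be rebuilt
            cache.add(current)
        hashes.append(current)
        parent = current
    return hashes, executed
```

Because every hash folds in its parent, changing one command changes the hash of every later layer, so those layers miss the cache and are rebuilt, while the unchanged layers before it are reused.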

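Slides 18 and 19 describe delta compression at the level of rows: create a diff on commit or push, apply it on checkout or pull. According to the slides, Splitgraph records the changes with audit triggers; the sketch below instead compares two snapshots of a table, which is a simpler technique to show and is not how the product does it. Tables are modelled as dicts from primary key to row, and create_diff, apply_diff and the rainfall figures are hypothetical.

```python
def create_diff(old, new):
    """Return a list of (operation, primary_key, row) turning `old` into `new`."""
    diff = []
    for pk in old.keys() - new.keys():
        diff.append(("delete", pk, None))  # row disappeared
    for pk, row in new.items():
        if pk not in old:
            diff.append(("insert", pk, row))  # brand new row
        elif old[pk] != row:
            diff.append(("update", pk, row))  # same key, different contents
    return diff


def apply_diff(table, diff):
    """Return a copy of `table` with the row-level changes applied."""
    result = dict(table)
    for operation, pk, row in diff:
        if operation == "delete":
            result.pop(pk, None)
        else:  # insert and update both store the new row
            result[pk] = row
    return result


# Made-up example data: the publisher revises one row and adds Puerto Rico.
old_rainfall = {("2018-01", "IA"): 1.2, ("2018-01", "NE"): 0.8}
new_rainfall = {("2018-01", "IA"): 1.2, ("2018-01", "NE"): 0.9, ("2018-01", "PR"): 3.4}
changes = create_diff(old_rainfall, new_rainfall)
```

Here the consumer only needs the two entries in changes, not the whole table, and apply_diff(old_rainfall, changes) reproduces new_rainfall.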