Splitgraph: Docker for Data


Slides from our talk at the Oct 10, 2018 Docker Cambridge meetup discussing our product, Splitgraph, demoing an early prototype and talking about the challenges of applying the Docker model for data.

https://splitgraph.com, twitter.com/@splitgraph

Artjoms Iskovs: twitter.com/@mildbyte, mildbyte.xyz
Miles Richardson: milesrichardson.com


  1. Splitgraph: "Docker for Data". Artjoms Iškovs, Miles Richardson
  2. "B.D.": Building Packages Before Docker (The Dark Ages)
     • Sourcing packages
     • Rebuilding, reconfiguring, rebuilding...
     • Googling, rage inducing
  3. Data preparation accounts for about 80% of the work of data scientists.
  4. Why so hard to build and maintain data sets?
     • Sourcing data is not composable
       • Why can’t I query multiple data sets at once?
     • Wrangling and cleaning data is not maintainable
       • Why can’t I keep my data sets up to date?
     • Running ad-hoc queries is not reproducible
       • Why can’t I share my data sets?
  5. What do we mean by data?
     • Sources: Open Data, Internal Data, Licensed Data
     • Types: SQL Databases, NoSQL Databases, CSV Files...
  6. The journey of a dataset: Scenario
     • Two publishers:
       • NOAA publishes climate data
       • USDA publishes corn yields
     • Consumer wants to merge both data sets
     • Let’s follow the climate data...
  7. The journey of a dataset: Introduction
  8. The journey of a dataset: Introduction
  9. The journey of a dataset 1: Creation
  10. Ingesting data from another DB via CLI
        $ sgr mount -t mongo_fdw me:pwd@my_db:27017 '{
            "rainfall": {
              "db": "observations",
              "coll": "rainfall",
              "schema": {
                "timestamp": "timestamp",
                "state": "varchar",
                "rainfall": "numeric"
              }
            }
          }' staging
        $ sgr import staging 'SELECT timestamp, state, rainfall FROM rainfall' noaa/climate rainfall
  11. The journey of a dataset 2: Publication
  12. Committing and Publishing Data via CLI
        $ sgr publish noaa/climate data.splitgraph.com
  13. The journey of a dataset 3: Usage
  14. SGFiles: Dockerfiles for data
     • Like Dockerfiles.
     • Image: state of a database schema
     • Layers w/ deterministic hashes and cache invalidation if:
       • Previous layer changes
       • Command changes
     • Commands:
       • FROM: base the image on something else
       • IMPORT: import tables from another image
       • SQL: run SQL against the image
  15. Consumption: Demo
        FROM usda/yields IMPORT crop_yields
        FROM noaa/climate:latest IMPORT rainfall
        SQL CREATE TABLE rainfall_yields AS SELECT * FROM rainfall JOIN crop_yields ...
  16. The journey of a dataset 4: Updating
  17. The journey of a dataset 4: Updating
     • Puerto Rico is now a US state
     • NOAA wants to revise its climate data
     • Can the consumer get just the changes?
  18. Delta compression
     • Only care about changes
     • Need to efficiently:
       • Create diffs (→ commit, push)
       • Apply diffs (→ checkout, pull)
  19. Delta compression
     • Docker: Files; Custom FS
     • Git: Lines; diff
     • Splitgraph: Rows; Audit triggers
  20. Updating: Demo
  21. The journey of a dataset 5: Maintenance
  22. The journey of a dataset 5: Maintenance
     • Can we update it?
     • Where did this dataset come from?
     • Build context fully encapsulated within the metadata
  23. Provenance and rebasing demo
  24. Q&A: twitter.com/splitgraph · splitgraph.com
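
The SGFiles slide describes layers with deterministic hashes that are invalidated when the previous layer or the command changes. The slides do not show how Splitgraph computes those hashes, so the following is only a minimal Python sketch of the general idea, assuming a layer ID is the SHA-256 of the parent hash plus the command text. The names `layer_hash` and `build`, the in-memory `cache` set, and the command strings in the usage note are all hypothetical.

```python
import hashlib


def layer_hash(parent_hash: str, command: str) -> str:
    """Derive a layer ID from the previous layer's hash and the command text.

    Assumption for illustration: the real hashing scheme is not shown in the slides.
    """
    payload = f"{parent_hash}\n{command}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def build(commands, cache, base_hash=""):
    """Walk the commands in order and report which layers could be reused.

    Returns a list of (layer_hash, reused) pairs. A layer is reused only if
    its hash is already in the cache, which requires that neither its
    command nor any earlier layer has changed.
    """
    results = []
    parent = base_hash
    for command in commands:
        current = layer_hash(parent, command)
        reused = current in cache
        if not reused:
            # A real builder would execute the command here before caching.
            cache.add(current)
        results.append((current, reused))
        parent = current
    return results
```

Building the hypothetical list `["IMPORT table_a", "IMPORT table_b", "SQL join step"]` twice against the same cache reuses every layer on the second run. Editing the second command keeps the first layer cached but forces the second and third to rebuild, because the third layer's hash depends on its parent.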

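The delta compression slides say Splitgraph works at the level of rows and needs to create diffs (commit, push) and apply them (checkout, pull). According to the slides, changes are captured with audit triggers. The Python sketch below does not do that: it swaps in a simpler technique, comparing two full snapshots of a table, purely to illustrate what a row-level diff contains. The table representation (a dict from primary key to row tuple), the function names, and the rainfall values in the usage note are made up for illustration.

```python
def make_diff(old, new):
    """Compare two table snapshots, each a dict of primary key -> row tuple.

    Returns the row-level changes needed to turn `old` into `new`.
    """
    return {
        "deleted": sorted(key for key in old if key not in new),
        "inserted": {key: new[key] for key in new if key not in old},
        "updated": {
            key: new[key]
            for key in new
            if key in old and old[key] != new[key]
        },
    }


def apply_diff(table, diff):
    """Return a new snapshot with the diff applied; the input is not modified."""
    result = dict(table)
    for key in diff["deleted"]:
        result.pop(key, None)
    result.update(diff["inserted"])
    result.update(diff["updated"])
    return result
```

With a made-up rainfall table keyed by `(state, month)`, revising one existing row and adding a `("PR", "2018-09")` row gives a diff with one updated and one inserted entry. A consumer holding the old snapshot can then call `apply_diff` with just that diff instead of downloading the whole table again.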