Splitgraph: AHL talk

Slides from an internal tech talk we gave at Man AHL (twitter.com/manahltech) about Splitgraph and how it can be used to create and share datasets in a composable, maintainable and reproducible way. Includes demos of basic functionality: version control, mounting and a Docker-like language for defining datasets.

Splitgraph: www.splitgraph.com
Github: github.com/splitgraph
Twitter: twitter.com/splitgraph

Published in: Technology

  1. 1. Splitgraph Artjoms Iškovs, Miles Richardson
  2. 2. Data preparation accounts for about 80% of the work of data scientists. 1 / 6 Problem Statement
  3. 3. Why is it so hard to build and maintain data sets? • Sourcing data is not composable • Why can’t I query multiple data sets at once? • Wrangling and cleaning data is not maintainable • Why can’t I keep my data sets up to date? • Running ad-hoc queries is not reproducible • Why can’t I share my data sets? 1 / 6 Problem Statement
  4. 4. • Problem Statement • Design Goals • Versioning and Sharing Data • Mounting External Data as SQL • Layered Querying • Efficient, Reproducible Data Set Builds 2 / 6 Design Goals
  5. 5. Good tools enhance existing abstractions without breaking them. 2 / 6 Design Goals
  6. 6. The filesystem as a core abstraction • Universal, shared abstraction with a standard API • Tools that use it all benefit from each other • e.g. git enhances the filesystem for other tools: • IDE, Editor, Compiler, etc... 2 / 6 Design Goals
  7. 7. Good tools that leverage the filesystem (tool: benefit) • git: version source code • FUSE: mount anything as a filesystem • Docker: efficient, reproducible builds 2 / 6 Design Goals
  8. 8. Can we find a common abstraction for databases? 2 / 6 Design Goals
  9. 9. SQL as a common abstraction Just like the filesystem... • Universal, shared abstraction with a standard API • Tools that use it all benefit from each other • splitgraph enhances the database for other tools: • Data Analysis tools, ETL tools, anything that speaks SQL... 2 / 6 Design Goals
  10. 10. Benefits of leveraging SQL Benefits analogous to git, FUSE, Docker... Splitgraph benefits: • Like git: version and share data • Like FUSE: mount external data as SQL • Like Docker: efficient, reproducible data set builds 2 / 6 Design Goals
  11. 11. • Problem Statement • Design Goals • Versioning and Sharing Data • Mounting External Data as SQL • Layered Querying • Efficient, Reproducible Data Set Builds 3 / 6 Versioning and Sharing Data
  12. 12. Basic Architecture: Client and Driver 3 / 6 Versioning and Sharing Data
  13. 13. Versioning: How does it work? • Git: .git • Stores every revision of every file • Splitgraph: splitgraph_meta • Stores every revision of every table • Space? 3 / 6 Versioning and Sharing Data
  14. 14. Versioning: Delta compression • Only care about changes • Need to efficiently: • Create diffs (→ commit, push) • Apply diffs (→ checkout, pull) 3 / 6 Versioning and Sharing Data
  15. 15. Versioning: Delta compression (tool: unit of change, diff mechanism) • Git: lines, diff • Docker: files, custom FS • Splitgraph: rows, audit triggers 3 / 6 Versioning and Sharing Data
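The row-level delta idea on the slide above can be sketched in a few lines of Python. This is not Splitgraph's implementation: the slides say changes are captured with audit triggers, whereas this sketch simply compares two snapshots. The helper names `make_diff` and `apply_diff` and the sample rows are made up for illustration, and a table is modelled as a dict keyed by primary key.

```python
# Toy row-level delta compression: a table is a dict {primary_key: row_tuple}.
# Assumption: snapshots are compared directly; the approach in the talk
# records changes with audit triggers instead.

def make_diff(old, new):
    """Return the changes that turn `old` into `new` (cf. commit, push)."""
    return {
        "deleted": sorted(k for k in old if k not in new),
        "upserted": {k: v for k, v in new.items() if old.get(k) != v},
    }


def apply_diff(table, diff):
    """Return a new table with `diff` applied (cf. checkout, pull)."""
    result = {k: v for k, v in table.items() if k not in diff["deleted"]}
    result.update(diff["upserted"])
    return result


# Made-up sample rows: (crop, yield).
v1 = {1: ("wheat", 3.1), 2: ("corn", 9.8), 3: ("rice", 4.4)}
v2 = {1: ("wheat", 3.1), 2: ("corn", 10.2), 4: ("soy", 2.9)}

diff = make_diff(v1, v2)          # only rows 2, 3 and 4 appear in the diff
restored = apply_diff(v1, diff)   # rebuilds v2 from v1 plus the diff
```

Only the changed rows are stored, so the unchanged row 1 costs nothing in the second version.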
  16. 16. Versioning → Sharing • Just like git • Immutable versions (commits) easily shared • git remotes: • Local FS .git → SSH → Remote FS .git • splitgraph remotes: • Local sg_meta → SQL → Remote sg_meta 3 / 6 Versioning and Sharing Data
  17. 17. Sharing Architecture: Splitgraph Remotes 3 / 6 Versioning and Sharing Data
  18. 18. Sharing Architecture: Splitgraph Remotes 3 / 6 Versioning and Sharing Data
  19. 19. Demo: NOAA Changes data 3 / 6 Versioning and Sharing Data
  20. 20. • Problem Statement • Design Goals • Versioning and Sharing Data • Mounting External Data as SQL • Layered Querying • Efficient, Reproducible Data Set Builds 4 / 6 Mounting External Data as SQL
  21. 21. Meanwhile at the USDA • USDA stores their crop yields data in MongoDB • Want to add Puerto Rico and create a new repository 4 / 6 Mounting External Data as SQL
  22. 22. Mounting: Architecture 4 / 6 Mounting External Data as SQL
  23. 23. Mounting • Many filesystems, many relative advantages • How to copy some files between two filesystems? • mount /dev/sda1 /mnt/fs1 • mount /dev/sdb1 /mnt/fs2 • cp -r /mnt/fs1/dir1 /mnt/fs2 4 / 6 Mounting External Data as SQL
  24. 24. Mounting • Many filesystems, many relative advantages • How to copy some files between two filesystems? • mount /dev/sda1 /mnt/fs1 • mount /dev/sdb1 /mnt/fs2 • cd /mnt/fs1/dir1 && gcc 4 / 6 Mounting External Data as SQL
  25. 25. FS mounting: FUSE • Implement filesystem primitives • fuse_operations.read() → HTTP GET • Can mount anything as a filesystem 4 / 6 Mounting External Data as SQL
  26. 26. Database mounting: Postgres FDW • Implement SQL primitives • ForeignScan → coll.find() • Can mount anything as an SQL table 4 / 6 Mounting External Data as SQL
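The mapping from a foreign scan to a collection lookup can be sketched in plain Python. Nothing here uses the real Postgres FDW or MongoDB APIs: `ToyCollection`, `ToyForeignTable` and `scan` are hypothetical stand-ins that only mirror the shape of `ForeignScan → coll.find()`, qualifiers are restricted to equality, and the sample documents are made up.

```python
# Hypothetical stand-ins: no real Postgres FDW or MongoDB API is used here.

class ToyCollection:
    """In-memory document store with a find() loosely modelled on a document DB."""

    def __init__(self, docs):
        self.docs = list(docs)

    def find(self, query):
        # Yield documents whose fields equal every key/value pair in `query`.
        for doc in self.docs:
            if all(doc.get(field) == value for field, value in query.items()):
                yield doc


class ToyForeignTable:
    """Presents a collection as a table with a fixed list of columns."""

    def __init__(self, collection, columns):
        self.collection = collection
        self.columns = columns

    def scan(self, quals):
        # `quals` is a list of (column, value) equality qualifiers, i.e. the
        # part of the WHERE clause handed down to the data source.
        query = dict(quals)
        for doc in self.collection.find(query):
            # Fields missing from a document become None, like SQL NULL.
            yield tuple(doc.get(col) for col in self.columns)


# Made-up sample documents.
yields = ToyCollection([
    {"state": "IA", "crop": "corn", "yield": 198},
    {"state": "PR", "crop": "coffee"},
    {"state": "IA", "crop": "soy", "yield": 58},
])
table = ToyForeignTable(yields, ["state", "crop", "yield"])

# Roughly: SELECT state, crop, yield FROM table WHERE state = 'IA'
rows = list(table.scan([("state", "IA")]))
```

The point of the design is that the caller only ever sees rows and columns; the schemaless source stays hidden behind `scan`.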
  27. 27. Demo 4 / 6 Mounting External Data as SQL
  28. 28. • Problem Statement • Design Goals • Versioning and Sharing Data • Mounting External Data as SQL • Layered Querying • Efficient, Reproducible Data Set Builds 5 / 6 Layered Querying
  29. 29. Docker: union mount • Every layer is a set of files • Use OverlayFS (AUFS) to mount as a filesystem • How to read a file? • Check top layer • Check next layer... 5 / 6 Layered Querying
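The top-down lookup on the slide above can be sketched as follows. It is a toy model only: real OverlayFS does this inside the kernel. The `WHITEOUT` marker and `read_file` helper are hypothetical names; the marker stands in for the whiteout entries a union filesystem uses to hide a file that exists in a lower layer.

```python
# Toy union mount: each layer maps path -> file contents.
# Assumption: layers[0] is the bottom layer and layers[-1] is the top.

WHITEOUT = object()  # hypothetical marker meaning "deleted in this layer"


def read_file(layers, path):
    """Look up `path` starting at the top layer and working down."""
    for layer in reversed(layers):
        if path in layer:
            contents = layer[path]
            if contents is WHITEOUT:
                # A higher layer deleted the file, so stop searching.
                raise FileNotFoundError(path)
            return contents
    raise FileNotFoundError(path)


base = {"/etc/motd": "hello", "/bin/tool": "v1"}
patch = {"/bin/tool": "v2"}          # overrides a file from the base layer
cleanup = {"/etc/motd": WHITEOUT}    # hides a file from the base layer
layers = [base, patch, cleanup]

tool = read_file(layers, "/bin/tool")
```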
  30. 30. Splitgraph: layered querying • Every layer is a set of added/deleted/updated rows • Use Postgres FDW to mount as a set of tables • How to query a table? • Apply qualifiers to the snapshot • Apply first diff, apply qualifiers again... • Tradeoff: slower than querying a materialized table. 5 / 6 Layered Querying
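A minimal sketch of the "snapshot, then diffs, re-checking qualifiers each time" loop described on the slide above. It is an illustration under stated assumptions, not Splitgraph's code: rows are dicts keyed by primary key, each diff has the hypothetical shape `{"deleted": ..., "upserted": ...}`, a qualifier is any Python predicate over a row, and the sample data is made up.

```python
# Toy layered query over a base snapshot plus an ordered list of row diffs.

def layered_query(snapshot, diffs, qualifier):
    """Return {key: row} for rows of the latest version matching `qualifier`."""
    # Apply the qualifier to the snapshot first, keeping only matching rows.
    result = {k: row for k, row in snapshot.items() if qualifier(row)}
    for diff in diffs:
        # Rows deleted by this layer disappear from the result.
        for key in diff["deleted"]:
            result.pop(key, None)
        # Added or updated rows are re-checked against the qualifier, since
        # an update can move a row into or out of the result set.
        for key, row in diff["upserted"].items():
            if qualifier(row):
                result[key] = row
            else:
                result.pop(key, None)
    return result


# Made-up sample data.
snapshot = {
    1: {"crop": "corn", "rain": 30},
    2: {"crop": "rice", "rain": 80},
    3: {"crop": "soy", "rain": 55},
}
diffs = [
    {"deleted": {3}, "upserted": {1: {"crop": "corn", "rain": 60}}},
    {"deleted": set(), "upserted": {2: {"crop": "rice", "rain": 40},
                                    4: {"crop": "oats", "rain": 70}}},
]

# Roughly: SELECT * FROM table WHERE rain >= 50, without materializing it.
wet = layered_query(snapshot, diffs, lambda row: row["rain"] >= 50)
```

Every query walks all the diffs, which is the cost behind the tradeoff noted on the slide: a materialized table answers the same query in one pass.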
  31. 31. • Problem Statement • Design Goals • Versioning and Sharing Data • Mounting External Data as SQL • Layered Querying • Efficient, Reproducible Data Set Builds 6 / 6 Efficient, Reproducible Data Set Builds
  32. 32. Dockerfiles • Every command defines a layer • Every layer has a deterministic hash • Cache invalidation if: • Previous layer changes • Command changes 6 / 6 Efficient, Reproducible Data Set Builds
  33. 33. SGFiles: Dockerfiles for data • Same idea. • Commands: • FROM – base the image on something else • IMPORT – import tables from another image • SQL – run SQL against the image 6 / 6 Efficient, Reproducible Data Set Builds
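The cache-invalidation rule shared by Dockerfiles and SGFiles can be sketched with chained hashes. The scheme below (SHA-256 over the parent hash plus the command text) is an assumption chosen for illustration; neither Docker nor Splitgraph is claimed to hash layers exactly this way. The function names and the placeholder command strings are hypothetical.

```python
import hashlib

# Toy build cache: each layer's hash depends on the previous layer's hash and
# on the command text, so changing either one invalidates every later layer.

def layer_hashes(base_hash, commands):
    """Return one chained hash per command, starting from `base_hash`."""
    hashes = []
    parent = base_hash
    for command in commands:
        parent = hashlib.sha256(f"{parent}\n{command}".encode()).hexdigest()
        hashes.append(parent)
    return hashes


def cached_prefix(old_hashes, new_hashes):
    """Count how many leading layers can be reused from a previous build."""
    count = 0
    for old, new in zip(old_hashes, new_hashes):
        if old != new:
            break
        count += 1
    return count


# Placeholder commands, not real SGFile syntax.
build_1 = layer_hashes("base", ["step A", "step B", "step C"])
build_2 = layer_hashes("base", ["step A", "step B (edited)", "step C"])

reusable = cached_prefix(build_1, build_2)  # only "step A" is reusable
```

Note that "step C" is rebuilt even though its text is unchanged, because the layer beneath it changed.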
  34. 34. Consumption: Architecture 6 / 6 Efficient, Reproducible Data Set Builds
  35. 35. SGFile Demo
      FROM usda/yields IMPORT crop_yields
      FROM noaa/climate:original_estimate IMPORT rainfall
      SQL CREATE TABLE rainfall_yields AS SELECT * FROM rainfall JOIN crop_yields ...
      6 / 6 Efficient, Reproducible Data Set Builds
  36. 36. Q&A splitgraph.com twitter.com/splitgraph github.com/splitgraph
  37. 37. Registry Architecture / Storage Optimization 7 / 6 Appendix
  38. 38. Future Work: Optimizing Schema Changes Splitgraph can handle schema changes, but... • Could be more space efficient • Stores schema change as new table (snapshot) Future optimizations... • Detect column changes via schema introspection • Store schema change like data (as diff) 7 / 6 Appendix
  39. 39. Future Work: More FDW, better mounting UX • Ship more FDW extensions with driver • Common data types: CSV, MySQL, HDFS, etc. • SaaS sources: Mixpanel, Segment, Salesforce, etc. • Improve UX layer over Postgres FDW • Reduce cognitive overhead of mounting 7 / 6 Appendix
  40. 40. Future Work: Reverse FDW • Now, sync is pull-based • → Requires manual updates or “snapper” • Future: “reverse FDW” for each data source? • Push changes from mount to driver • e.g. for MongoDB, push oplog into Postgres 7 / 6 Appendix
