Splitgraph: AHL talk

  1. Splitgraph Artjoms Iškovs, Miles Richardson
  2. Data preparation accounts for about 80% of the work of data scientists. 1 / 6 Problem Statement
  3. Why is it so hard to build and maintain data sets? • Sourcing data is not composable: why can’t I query multiple data sets at once? • Wrangling and cleaning data is not maintainable: why can’t I keep my data sets up to date? • Running ad-hoc queries is not reproducible: why can’t I share my data sets?
  4. • Problem Statement • Design Goals • Versioning and Sharing Data • Mounting External Data as SQL • Layered Querying • Efficient, Reproducible Data Set Builds 2 / 6 Design Goals
  5. Good tools enhance existing abstractions without breaking them.
  6. The filesystem as a core abstraction • Universal, shared abstraction with a standard API • Tools that use it all benefit from each other • e.g. git enhances the filesystem for other tools: • IDE, Editor, Compiler, etc...
  7. Good tools that leverage the filesystem (tool: benefit) • git: version source code • FUSE: mount anything as a filesystem • Docker: efficient, reproducible builds
  8. Can we find a common abstraction for databases?
  9. SQL as a common abstraction. Just like the filesystem... • Universal, shared abstraction with a standard API • Tools that use it all benefit from each other • splitgraph enhances the database for other tools: • Data Analysis tools, ETL tools, anything that speaks SQL...
  10. Benefits of leveraging SQL. Benefits analogous to git, FUSE, Docker... (tool: benefit) • splitgraph: version and share data • splitgraph: mount external data as SQL • splitgraph: efficient, reproducible data set builds
  11. • Problem Statement • Design Goals • Versioning and Sharing Data • Mounting External Data as SQL • Layered Querying • Efficient, Reproducible Data Set Builds 3 / 6 Versioning and Sharing Data
  12. Basic Architecture: Client and Driver
  13. Versioning: How does it work? • Git: .git • Stores every revision of every file • Splitgraph: splitgraph_meta • Stores every revision of every table • Space?
  14. Versioning: Delta compression • Only care about changes • Need to efficiently: • Create diffs (→ commit, push) • Apply diffs (→ checkout, pull)
  15. Versioning: Delta compression • Git: lines, tracked with diff • Docker: files, tracked with a custom FS • Splitgraph: rows, tracked with audit triggers
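
The two operations named on the delta compression slides, creating a diff and applying a diff, can be sketched at row granularity as follows. This is a minimal illustration, not Splitgraph's implementation: it compares two full snapshots keyed by primary key, whereas the slides say Splitgraph records changes with audit triggers. The names `create_diff` and `apply_diff` and the sample rows are hypothetical.

```python
from typing import Dict, Optional, Tuple

Row = Tuple                      # non-key column values of one row
Table = Dict[int, Row]           # primary key -> row
Diff = Dict[int, Optional[Row]]  # primary key -> new row, or None for a delete


def create_diff(old: Table, new: Table) -> Diff:
    """Record only the rows that differ between two versions of a table."""
    diff: Diff = {}
    for pk, row in new.items():
        if old.get(pk) != row:
            diff[pk] = row       # inserted or updated row
    for pk in old:
        if pk not in new:
            diff[pk] = None      # deleted row
    return diff


def apply_diff(table: Table, diff: Diff) -> Table:
    """Return a new table version with the diff applied (as in checkout/pull)."""
    result = dict(table)
    for pk, row in diff.items():
        if row is None:
            result.pop(pk, None)
        else:
            result[pk] = row
    return result


# Hypothetical data: one update, one delete, one insert.
v1 = {1: ("London", 12.0), 2: ("Paris", 14.5)}
v2 = {1: ("London", 12.5), 3: ("Berlin", 9.0)}
diff = create_diff(v1, v2)
```

Only the three changed rows are stored in `diff`, and `apply_diff(v1, diff)` reproduces `v2`, which is the property that makes commits cheap to store and to ship to a remote.
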
  16. Versioning → Sharing • Just like git • Immutable versions (commits) easily shared • git remotes: • Local FS .git → SSH → Remote FS .git • splitgraph remotes: • Local sg_meta → SQL → Remote sg_meta
  17. Sharing Architecture: Splitgraph Remotes
  18. Sharing Architecture: Splitgraph Remotes
  19. Demo: NOAA Changes data
  20. • Problem Statement • Design Goals • Versioning and Sharing Data • Mounting External Data as SQL • Layered Querying • Efficient, Reproducible Data Set Builds 4 / 6 Mounting External Data as SQL
  21. Meanwhile at the USDA • USDA stores their crop yields data in Mongo • Want to add Puerto Rico and create a new repository
  22. Mounting: Architecture
  23. Mounting • Many filesystems, many relative advantages • How to copy some files between two filesystems? • mount /dev/sda1 /mnt/fs1 • mount /dev/sdb1 /mnt/fs2 • cp -r /mnt/fs1/dir1 /mnt/fs2
  24. Mounting • Many filesystems, many relative advantages • How to copy some files between two filesystems? • mount /dev/sda1 /mnt/fs1 • mount /dev/sdb1 /mnt/fs2 • cd /mnt/fs1/dir1 && gcc
  25. FS mounting: FUSE • Implement filesystem primitives • fuse_operations.read() → HTTP GET • Can mount anything as a filesystem
  26. Database mounting: Postgres FDW • Implement SQL primitives • ForeignScan → coll.find() • Can mount anything as an SQL table
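
To make the "ForeignScan → coll.find()" mapping concrete, here is a rough sketch of how a wrapper might turn simple WHERE-clause qualifiers into a MongoDB-style filter document. It is an assumption-laden illustration, not the code of any real FDW: the function name `quals_to_filter`, the qualifier tuple shape, and the column names are all hypothetical, and only plain comparison operators are handled.

```python
from typing import Any, Dict, List, Tuple

Qual = Tuple[str, str, Any]  # (column, SQL operator, value): a hypothetical shape

# SQL comparison operators and their MongoDB query-operator counterparts.
_MONGO_OPS = {
    "=": "$eq", "<>": "$ne",
    "<": "$lt", "<=": "$lte",
    ">": "$gt", ">=": "$gte",
}


def quals_to_filter(quals: List[Qual]) -> Tuple[Dict[str, Dict[str, Any]], List[Qual]]:
    """Split qualifiers into a Mongo-style filter and the ones left unpushed."""
    query: Dict[str, Dict[str, Any]] = {}
    not_pushed: List[Qual] = []
    for column, op, value in quals:
        mongo_op = _MONGO_OPS.get(op)
        if mongo_op is None:
            # Unsupported operator: leave it for the database to evaluate locally.
            not_pushed.append((column, op, value))
        else:
            query.setdefault(column, {})[mongo_op] = value
    return query, not_pushed


# Hypothetical query: WHERE year >= 2010 AND year < 2020 AND state = 'PR' AND crop LIKE 'c%'
query, rest = quals_to_filter([
    ("year", ">=", 2010), ("year", "<", 2020),
    ("state", "=", "PR"), ("crop", "LIKE", "c%"),
])
```

The `query` dictionary has the shape a `find()` call takes as its filter argument, while `rest` holds the qualifiers that would still need checking on the SQL side. The sketch keeps only the last value when the same operator appears twice on one column.
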
  27. Demo
  28. • Problem Statement • Design Goals • Versioning and Sharing Data • Mounting External Data as SQL • Layered Querying • Efficient, Reproducible Data Set Builds 5 / 6 Layered Querying
  29. Docker: union mount • Every layer is a set of files • Use OverlayFS (or AUFS) to mount the layers as one filesystem • How to read a file? • Check top layer • Check next layer...
  30. Splitgraph: layered querying • Every layer is a set of added/deleted/updated rows • Use Postgres FDW to mount as a set of tables • How to query a table? • Apply qualifiers to the snapshot • Apply first diff, apply qualifiers again... • Tradeoff: slower than querying a materialized table.
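
The query procedure on the layered querying slide (apply qualifiers to the snapshot, then apply each diff and apply the qualifiers again) might look like the sketch below. It assumes rows keyed by primary key and a qualifier expressed as a Python predicate; `layered_query` and the sample data are hypothetical and do not reflect how Splitgraph actually stores or scans layers.

```python
from typing import Callable, Dict, List, Optional, Tuple

Row = Tuple
Layer = Dict[int, Optional[Row]]  # primary key -> new row, or None for a delete


def layered_query(snapshot: Dict[int, Row], diffs: List[Layer],
                  qual: Callable[[Row], bool]) -> Dict[int, Row]:
    """Answer a filtered query without materializing the full table."""
    # Step 1: apply the qualifier to the base snapshot.
    result = {pk: row for pk, row in snapshot.items() if qual(row)}
    # Step 2: replay each diff, oldest first, applying the qualifier again.
    for diff in diffs:
        for pk, row in diff.items():
            if row is not None and qual(row):
                result[pk] = row      # added or updated row that matches
            else:
                result.pop(pk, None)  # deleted, or updated so it no longer matches
    return result


# Hypothetical crop yields: (crop, yield); query is "yield > 50".
snapshot = {1: ("wheat", 30), 2: ("corn", 55), 3: ("rice", 70)}
diffs = [
    {2: ("corn", 45), 4: ("soy", 60)},  # first diff: one update, one insert
    {3: None, 1: ("wheat", 52)},        # second diff: one delete, one update
]
matches = layered_query(snapshot, diffs, lambda row: row[1] > 50)
```

Each layer is scanned once per query, which illustrates the tradeoff the slide mentions: the more diffs sit on top of the snapshot, the more work a query does compared with reading a materialized table.
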
  31. • Problem Statement • Design Goals • Versioning and Sharing Data • Mounting External Data as SQL • Layered Querying • Efficient, Reproducible Data Set Builds 6 / 6 Efficient, Reproducible Data Set Builds
  32. Dockerfiles • Every command defines a layer • Every layer has a deterministic hash • Cache invalidation if: • Previous layer changes • Command changes
  33. SGFiles: Dockerfiles for data • Same idea. • Commands: • FROM – base the image on something else • IMPORT – import tables from another image • SQL – run SQL against the image
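
The caching rule shared by Dockerfiles and SGFiles (a layer is rebuilt if the previous layer changes or its command changes) follows naturally if each layer's hash is derived from its parent's hash plus its own command text. The sketch below shows that idea only; `layer_hashes`, the use of SHA-256, and the sample commands are assumptions, not the actual hashing scheme of Docker or Splitgraph.

```python
import hashlib
from typing import List


def layer_hashes(base_hash: str, commands: List[str]) -> List[str]:
    """Derive one deterministic hash per command, chained to the previous layer."""
    hashes = []
    parent = base_hash
    for command in commands:
        # The hash depends on both the parent layer and the command text,
        # so changing either one changes this layer and every layer after it.
        parent = hashlib.sha256(f"{parent}\n{command}".encode("utf-8")).hexdigest()
        hashes.append(parent)
    return hashes


# Hypothetical build steps; only the second command differs between the two runs.
first = layer_hashes("base", ["IMPORT rainfall", "SQL step one", "SQL step two"])
second = layer_hashes("base", ["IMPORT rainfall", "SQL step one, edited", "SQL step two"])

# A layer can be reused from cache when its hash is unchanged.
reusable = [a == b for a, b in zip(first, second)]
```

In this example the first layer is reusable, while the edited second command invalidates both its own layer and the untouched third one, because the third layer's parent hash has changed.
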
  34. Consumption: Architecture
  35. SGFile Demo
      FROM usda/yields IMPORT crop_yields
      FROM noaa/climate:original_estimate IMPORT rainfall
      SQL CREATE TABLE rainfall_yields AS SELECT * FROM rainfall JOIN crop_yields ...
  36. Q&A splitgraph.com twitter.com/splitgraph github.com/splitgraph
  37. Registry Architecture / Storage Optimization 7 / 6 Appendix
  38. Future Work: Optimizing Schema Changes. Splitgraph can handle schema changes, but... • Could be more space efficient • Stores schema change as new table (snapshot). Future optimizations... • Detect column changes via schema introspection • Store schema change like data (as diff)
  39. Future Work: More FDW, better mounting UX • Ship more FDW extensions with driver • Common data types: CSV, MySQL, HDFS, etc. • SaaS sources: Mixpanel, Segment, Salesforce, etc. • Improve UX layer over Postgres FDW • Reduce cognitive overhead of mounting
  40. Future Work: Reverse FDW • Now, sync is pull-based • → Requires manual updates or “snapper” • Future: “reverse FDW” for each data source? • Push changes from mount to driver • e.g. for Mongo, push oplog into Postgres