Splitgraph: AHL talk

Slides from an internal tech talk we gave at Man AHL (twitter.com/manahltech). The talk introduces Splitgraph and shows how it can be used to create and share datasets in a composable, maintainable and reproducible way. It includes demos of basic functionality: version control, mounting, and a Docker-like language for defining datasets.

Splitgraph: www.splitgraph.com
GitHub: github.com/splitgraph
Twitter: twitter.com/splitgraph

  1. Splitgraph. Artjoms Iškovs, Miles Richardson
  2. Data preparation accounts for about 80% of the work of data scientists. (Section 1 / 6: Problem Statement)
  3. Why is it so hard to build and maintain data sets?
     • Sourcing data is not composable
       • Why can’t I query multiple data sets at once?
     • Wrangling and cleaning data is not maintainable
       • Why can’t I keep my data sets up to date?
     • Running ad-hoc queries is not reproducible
       • Why can’t I share my data sets?
  4. Outline (Section 2 / 6: Design Goals)
     • Problem Statement
     • Design Goals
     • Versioning and Sharing Data
     • Mounting External Data as SQL
     • Layered Querying
     • Efficient, Reproducible Data Set Builds
  5. Good tools enhance existing abstractions without breaking them.
  6. The filesystem as a core abstraction
     • Universal, shared abstraction with a standard API
     • Tools that use it all benefit from each other
     • e.g. git enhances the filesystem for other tools: IDE, editor, compiler, etc.
  7. Good tools that leverage the filesystem (tool: benefit)
     • git: Version source code
     • FUSE: Mount anything as a filesystem
     • Docker: Efficient, reproducible builds
  8. Can we find a common abstraction for databases?
  9. SQL as a common abstraction. Just like the filesystem...
     • Universal, shared abstraction with a standard API
     • Tools that use it all benefit from each other
     • splitgraph enhances the database for other tools: data analysis tools, ETL tools, anything that speaks SQL...
  10. Benefits of leveraging SQL. Benefits analogous to git, FUSE, Docker... With splitgraph:
     • Version and share data
     • Mount external data as SQL
     • Efficient, reproducible data set builds
  11. Outline (repeated). Section 3 / 6: Versioning and Sharing Data
  12. Basic Architecture: Client and Driver
  13. Versioning: How does it work?
     • Git: .git
       • Stores every revision of every file
     • Splitgraph: splitgraph_meta
       • Stores every revision of every table
     • Space?
  14. Versioning: Delta compression
     • Only care about changes
     • Need to efficiently:
       • Create diffs (→ commit, push)
       • Apply diffs (→ checkout, pull)
  15. Versioning: Delta compression (system: unit of change, diff mechanism)
     • Git: lines, diff
     • Docker: files, custom FS
     • Splitgraph: rows, audit triggers
  16. Versioning → Sharing
     • Just like git
     • Immutable versions (commits) easily shared
     • git remotes: Local FS .git → SSH → Remote FS .git
     • splitgraph remotes: Local sg_meta → SQL → Remote sg_meta
  17. Sharing Architecture: Splitgraph Remotes
  18. Sharing Architecture: Splitgraph Remotes
  19. Demo: NOAA Changes data
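The row-level delta compression described in slides 14 and 15 can be illustrated with a small in-memory sketch. This is not Splitgraph's implementation: the `AuditedTable` class, the `(kind, pk, row)` change format and the sample rows are all hypothetical, and a Python list stands in for whatever the audit triggers write to inside the database. The point is only that recording changes as they happen makes creating a diff cheap (commit, push), and that replaying the diff on an older snapshot reconstructs the newer version (checkout, pull).

```python
from typing import Dict, List, Optional, Tuple

Row = Tuple                                    # a full row of column values
Change = Tuple[str, object, Optional[Row]]     # (kind, primary key, new row or None)


class AuditedTable:
    """Hypothetical table that logs every row change, like an audit trigger would."""

    def __init__(self, rows: Optional[Dict[object, Row]] = None) -> None:
        self.rows: Dict[object, Row] = dict(rows or {})
        self.pending: List[Change] = []

    def upsert(self, pk: object, row: Row) -> None:
        kind = "update" if pk in self.rows else "insert"
        self.rows[pk] = row
        self.pending.append((kind, pk, row))

    def delete(self, pk: object) -> None:
        del self.rows[pk]
        self.pending.append(("delete", pk, None))

    def commit(self) -> List[Change]:
        """Return the changes recorded since the last commit and reset the log."""
        diff, self.pending = self.pending, []
        return diff


def apply_diff(snapshot: Dict[object, Row], diff: List[Change]) -> Dict[object, Row]:
    """Replay a diff on top of a snapshot without modifying the snapshot."""
    rows = dict(snapshot)
    for kind, pk, row in diff:
        if kind == "delete":
            rows.pop(pk, None)
        else:
            rows[pk] = row
    return rows


if __name__ == "__main__":
    # Made-up sample rows, for illustration only.
    base = {1: ("station_a", 12.5), 2: ("station_b", 8.0)}
    table = AuditedTable(base)
    table.upsert(2, ("station_b", 9.1))
    table.delete(1)
    diff = table.commit()
    print(diff)
    print(apply_diff(base, diff) == table.rows)
```

Only the diff needs to be stored or sent to a remote; the full table can be rebuilt from the base snapshot plus the chain of diffs.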
  20. Outline (repeated). Section 4 / 6: Mounting External Data as SQL
  21. Meanwhile at the USDA
     • USDA stores their crop yields data in Mongo
     • Want to add Puerto Rico and create a new repository
  22. Mounting: Architecture
  23. Mounting
     • Many filesystems, many relative advantages
     • How to copy some files between two filesystems?
       • mount /dev/sda1 /mnt/fs1
       • mount /dev/sdb1 /mnt/fs2
       • cp -r /mnt/fs1/dir1 /mnt/fs2
  24. Mounting
     • Many filesystems, many relative advantages
     • How to copy some files between two filesystems?
       • mount /dev/sda1 /mnt/fs1
       • mount /dev/sdb1 /mnt/fs2
       • cd /mnt/fs1/dir1 && gcc
  25. FS mounting: FUSE
     • Implement filesystem primitives
     • fuse_operations.read() → HTTP GET
     • Can mount anything as a filesystem
  26. Database mounting: Postgres FDW
     • Implement SQL primitives
     • ForeignScan → coll.find()
     • Can mount anything as an SQL table
  27. Demo
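Slide 26 maps a Postgres `ForeignScan` onto a Mongo `coll.find()` call. One part of that mapping, turning the `WHERE` qualifiers of a query into a filter document, can be sketched as follows. The `Qual` tuple, the `quals_to_mongo_filter` function and the column names are hypothetical and are not taken from Splitgraph or from any particular FDW library; only the `$eq`, `$ne`, `$gt`, `$gte`, `$lt` and `$lte` operators are real MongoDB query operators. The sketch assumes the database re-checks every qualifier on the rows that come back, so skipping an unsupported qualifier only costs performance, not correctness.

```python
from typing import Any, Dict, List, NamedTuple


class Qual(NamedTuple):
    """Hypothetical representation of one WHERE-clause qualifier."""
    column: str
    operator: str
    value: Any


# SQL comparison operators that have a direct MongoDB query operator.
_MONGO_OPS = {
    "=": "$eq",
    "<>": "$ne",
    ">": "$gt",
    ">=": "$gte",
    "<": "$lt",
    "<=": "$lte",
}


def quals_to_mongo_filter(quals: List[Qual]) -> Dict[str, Dict[str, Any]]:
    """Build a filter document that could be passed to a find() call.

    Qualifiers with no direct equivalent are skipped (assumption: the
    database filters the returned rows again). If two qualifiers use the
    same operator on the same column, the later one wins in this sketch.
    """
    mongo_filter: Dict[str, Dict[str, Any]] = {}
    for qual in quals:
        mongo_op = _MONGO_OPS.get(qual.operator)
        if mongo_op is None:
            continue
        mongo_filter.setdefault(qual.column, {})[mongo_op] = qual.value
    return mongo_filter


if __name__ == "__main__":
    # Made-up qualifiers, roughly "WHERE state = 'PR' AND year >= 2010 AND year < 2018".
    quals = [Qual("state", "=", "PR"), Qual("year", ">=", 2010), Qual("year", "<", 2018)]
    print(quals_to_mongo_filter(quals))
```

Pushing qualifiers down like this lets the remote store do the filtering, so fewer documents have to cross the wire before they appear as SQL rows.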
  28. Outline (repeated). Section 5 / 6: Layered Querying
  29. Docker: union mount
     • Every layer is a set of files
     • Use OverlayFS (AUFS) to mount as a filesystem
     • How to read a file?
       • Check top layer
       • Check next layer...
  30. Splitgraph: layered querying
     • Every layer is a set of added/deleted/updated rows
     • Use Postgres FDW to mount as a set of tables
     • How to query a table?
       • Apply qualifiers to the snapshot
       • Apply first diff, apply qualifiers again...
     • Tradeoff: slower than querying a materialized table.
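The query procedure on slide 30 (filter the snapshot, then apply each diff and filter again) can be sketched in a few lines. This is an in-memory illustration under stated assumptions, not Splitgraph's code: tables are dictionaries keyed by primary key, a diff is a list of hypothetical `(kind, pk, row)` changes, and the qualifiers are reduced to a single Python predicate. `materialize_then_filter` is included only as a reference to check that both approaches return the same rows.

```python
from typing import Callable, Dict, List, Optional, Tuple

Row = Tuple
Change = Tuple[str, object, Optional[Row]]   # (kind, primary key, new row or None)


def layered_query(
    snapshot: Dict[object, Row],
    diffs: List[List[Change]],
    predicate: Callable[[Row], bool],
) -> Dict[object, Row]:
    """Answer a filtered query without materializing the full table."""
    # Apply the qualifiers to the base snapshot.
    result = {pk: row for pk, row in snapshot.items() if predicate(row)}
    # Apply each diff in order, re-checking the qualifiers on changed rows.
    for diff in diffs:
        for kind, pk, row in diff:
            if kind == "delete" or not predicate(row):
                result.pop(pk, None)   # row is gone or no longer matches
            else:
                result[pk] = row       # new or changed row that matches
    return result


def materialize_then_filter(
    snapshot: Dict[object, Row],
    diffs: List[List[Change]],
    predicate: Callable[[Row], bool],
) -> Dict[object, Row]:
    """Reference approach: rebuild the whole table first, then filter it."""
    table = dict(snapshot)
    for diff in diffs:
        for kind, pk, row in diff:
            if kind == "delete":
                table.pop(pk, None)
            else:
                table[pk] = row
    return {pk: row for pk, row in table.items() if predicate(row)}


if __name__ == "__main__":
    # Made-up rows of (crop, yield), for illustration only.
    snapshot = {1: ("wheat", 10), 2: ("corn", 50), 3: ("rice", 70)}
    diffs = [[("update", 1, ("wheat", 60)), ("delete", 3, None)]]
    high_yield = lambda row: row[1] >= 40
    print(layered_query(snapshot, diffs, high_yield))
```

The layered version only ever holds matching rows, which is where the saving comes from; the price, as the slide notes, is that every diff has to be walked at query time.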
  31. Outline (repeated). Section 6 / 6: Efficient, Reproducible Data Set Builds
  32. Dockerfiles
     • Every command defines a layer
     • Every layer has a deterministic hash
     • Cache invalidation if:
       • Previous layer changes
       • Command changes
  33. SGFiles: Dockerfiles for data
     • Same idea.
     • Commands:
       • FROM – base the image on something else
       • IMPORT – import tables from another image
       • SQL – run SQL against the image
  34. Consumption: Architecture
  35. SGFile Demo
       FROM usda/yields IMPORT crop_yields
       FROM noaa/climate:original_estimate IMPORT rainfall
       SQL CREATE TABLE rainfall_yields AS SELECT * FROM rainfall JOIN crop_yields ...
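The caching rule from slides 32 and 33 (a layer is rebuilt if its command changes or if the layer before it changes) follows naturally from chaining hashes. The sketch below is an assumption about how such a scheme could look, not the actual hashing used by Docker or Splitgraph: `layer_hashes`, `build`, the use of SHA-256 and the dictionary cache are all hypothetical, and real build caches take more inputs into account than the command text.

```python
import hashlib
from typing import Callable, Dict, List, Tuple


def layer_hashes(commands: List[str], base_hash: str = "") -> List[str]:
    """Give every command a hash that depends on it and on all earlier layers."""
    hashes = []
    parent = base_hash
    for command in commands:
        payload = (parent + "\n" + command).encode("utf-8")
        parent = hashlib.sha256(payload).hexdigest()
        hashes.append(parent)
    return hashes


def build(
    commands: List[str],
    cache: Dict[str, object],
    execute: Callable[[str], object],
) -> List[Tuple[str, bool]]:
    """Run only the commands whose layer hash is not cached yet.

    Returns one (layer_hash, was_cached) pair per command.
    """
    results = []
    for command, layer_hash in zip(commands, layer_hashes(commands)):
        if layer_hash in cache:
            results.append((layer_hash, True))
        else:
            cache[layer_hash] = execute(command)
            results.append((layer_hash, False))
    return results


if __name__ == "__main__":
    # Command strings are placeholders in the style of the SGFile demo.
    commands = [
        "FROM usda/yields IMPORT crop_yields",
        "FROM noaa/climate:original_estimate IMPORT rainfall",
    ]
    cache: Dict[str, object] = {}
    run = lambda command: "built: " + command
    print([cached for _, cached in build(commands, cache, run)])
    print([cached for _, cached in build(commands, cache, run)])
```

Because each hash folds in the previous one, editing a command invalidates that layer and everything after it, while the layers before it are still served from the cache.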
  36. Q&A: splitgraph.com, twitter.com/splitgraph, github.com/splitgraph
  37. Registry Architecture / Storage Optimization (Section 7 / 6: Appendix)
  38. Future Work: Optimizing Schema Changes
     • Splitgraph can handle schema changes, but...
       • Could be more space efficient
       • Stores schema change as new table (snapshot)
     • Future optimizations...
       • Detect column changes via schema introspection
       • Store schema change like data (as diff)
  39. Future Work: More FDW, better mounting UX
     • Ship more FDW extensions with driver
       • Common data types: CSV, MySQL, HDFS, etc.
       • SaaS sources: Mixpanel, Segment, Salesforce, etc.
     • Improve UX layer over Postgres FDW
       • Reduce cognitive overhead of mounting
  40. Future Work: Reverse FDW
     • Now, sync is pull-based
       • → Requires manual updates or “snapper”
     • Future: “reverse FDW” for each data source?
       • Push changes from mount to driver
       • e.g. for Mongo, push oplog into Postgres
  • MilesRichardson

    Nov. 7, 2018
