3. My Background
Worked on a variety of projects in backend and distributed systems.
Passionate about large-scale data problems:
1. ETLs for data warehouses
2. Pipelines for machine learning applications
3. Distributed Simulation Environments
Quilt Data May 2019
4. Problems Everyone Encounters
● Missing or incorrect data
● Type issues and schema mismatches
● Failed pipelines, buggy pipelines, delayed pipelines
● Differing formats for dates, timezones, and units of measurement
● Ensemble models trained on inconsistent data sets
● Training-serving skew
● Data quality regressions over time
5. Growing Understanding of Cost of Problems
Hidden Technical Debt in Machine Learning Systems
NIPS 2015
6. Need for Developing Process Around Data
Managing machine learning and data-intensive systems requires adopting
good practices and processes to address the complex behaviours that data
changes can cause.
“If data replaces code in ML systems, and code should be tested, then it seems clear
that some amount of testing of input data is critical to a well-functioning system. Basic
sanity checks are useful, as are more sophisticated tests that monitor changes in input
distributions.” - Sculley, et al.
7. Where Does Quilt Fit In?
I found that everywhere I went, people were building very similar internal tooling to
manage these issues.
This is an opportunity to bring these techniques and tools to the broader community
as open source.
8. Software Engineering Principles
● Versioned code
● Clean and explicit interfaces between modules and services
● Unit and integration tests
● Observability (metrics, logging, tracing)
Data Engineering equivalents
● Versioned data
● Clean and explicit interfaces between code and data
● Data quality and schema checks
● Dataset regression checks (changes in range, std, etc.)
What if data were managed like code?
9. Example Pipeline
To motivate some of these ideas, let’s look at an example pipeline built for a civic
tech hackathon with Aleksey Bilogur.
It is a great example of pipeline jungles and the types of problems that we encounter
when code and data interact.
13. Data Problems Encountered
● Incorrect Schema:
○ Some malformed GTFS Files (missing stops.txt)
● Misaligned/Bad timezones
○ Some local timestamps, some UTC. Required careful handling of datetime.datetime(tzinfo)
● Timestamp mismatch
○ Schedule for a different date range than realtime feed
● Missing Data
○ Endpoint scrapers experienced occasional failures or backoffs, resulting in missing data
● Underlying data from different pipelines may be out of sync
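The timezone problem above is usually solved by normalizing every timestamp to UTC at ingest. A minimal sketch, assuming a fixed UTC-5 offset for naive local timestamps (a real pipeline would use a proper tz database and handle DST):

```python
from datetime import datetime, timezone, timedelta

# Assumed fixed local offset for naive timestamps (illustrative only;
# use zoneinfo/pytz in practice to handle DST correctly).
LOCAL = timezone(timedelta(hours=-5))

def to_utc(ts: datetime) -> datetime:
    """Normalize both naive (assumed local) and aware timestamps to UTC."""
    if ts.tzinfo is None:
        ts = ts.replace(tzinfo=LOCAL)
    return ts.astimezone(timezone.utc)
```

Running every feed through one such function at the boundary means downstream code never has to guess which convention a timestamp uses.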
15. Data Consistency Problems
DRY For Data Dependencies
1. GTFS schedules used in two different parts of the pipeline
2. Could be different dates/ranges, need to ensure consistency
Training-Serving Skew:
1. The features need to be consistent between training and serving
2. Code consistency for ETL transform on GTFS-RT feed
3. Hyperparameter consistency
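One way to enforce the feature-consistency point above is to define the transform once and import it from both the training ETL and the serving path. A hedged sketch; the function name and feature set are illustrative, not the hackathon's actual features:

```python
def extract_features(scheduled_ts: float, actual_ts: float) -> dict:
    """Single shared transform, imported by both the training ETL and the
    serving code, so the feature definitions cannot drift apart."""
    delay = actual_ts - scheduled_ts
    return {
        "delay_seconds": delay,
        "is_late": delay > 60,  # illustrative threshold
    }
```

Because both paths call the same function, a change to a feature definition automatically applies to training and serving together, eliminating one common source of skew.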
18. Package Definition
Need to enforce consistent underlying data across microservices and models
● Global Singleton for Data
● Immutable, Reproducible
● Versioned
● Language agnostic
21. Package Lifecycle
1. Build your package
2. Publish your package
3. Run your pipeline
Well suited for batch-style jobs coming out of a data lake.
Build, release, run
import quilt3

pkg = quilt3.Package().set(...)
pkg.push("usr/pkg")
quilt3.Package.install("usr/pkg")
from quilt3.data.usr import pkg
23. Datasets Don’t Play Well with Source Control
First impulse: Throw it into GitHub!
Doesn’t handle datasets well:
1. Poor support for large files (>100 MB)
2. Repo size blows up
3. Diff algorithms are based on newlines
We are going to need some different tooling, but we want to capture the same ideas
and principles.
24. S3 Blob Storage
Properties:
1. High availability
2. High throughput
3. Object-level versioning
4. Not designed to be a user-friendly web app
Quilt3 is a data collaboration layer built on top of S3 that makes it more user-friendly
and interactive.
25. Versioned Packages Help Reproducibility
Using the package versions for the input to a pipeline allows you to:
1. Tracking data lineage
2. Cross run comparisons
3. Skip already computed phases
Version the trained model by tagging its metadata with:
1. Data Package
2. Trained Model Artifact + Config (i.e. hyperparameter info in metadata)
3. Git hashes of the code used in pipelines
code + data + model
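Assembling that code + data + model record can be as simple as one metadata dict. A sketch with a hypothetical helper and illustrative field names; with quilt3 such metadata could be attached via `pkg.set_meta(...)` before pushing:

```python
def model_metadata(data_pkg: str, data_top_hash: str,
                   git_hash: str, hyperparams: dict) -> dict:
    """Bundle code + data + model provenance into one metadata dict
    (hypothetical helper illustrating the three points above)."""
    return {
        "data_package": data_pkg,        # which package version fed the run
        "data_top_hash": data_top_hash,  # immutable hash of the input data
        "git_hash": git_hash,            # code used in the pipeline
        "hyperparameters": hyperparams,  # model config
    }

meta = model_metadata("usr/transit-delays", "2a46b028...",
                      "deadbeef", {"learning_rate": 0.01})
```

With this attached to the trained-model package, any result can be traced back to the exact data, code, and configuration that produced it.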
37. Adding a schema to package manifest
{"version": "v0", "message": "Package of NYC MTA Buses Delays", "implements": ["s3://quilt-transit/gtfs_city_schema.jsonl?versionId=o35eue5yTFhZzqyiZfNle2E9oCJL.I.O"]}
{"logical_key": "quilt_summarize.json", "physical_keys": ["s3://quilt-transit/gtfs/hackathon/quilt_summarize.json?versionId=o35eue5yTFhZzqyiZfNle2E9oCJL.I.O"], "size": 20953, "hash": {"type": "SHA256", "value": "80f7dec8c14945f3565dbf528d13b6cae3422e7bc65dc7b569213c519e9146dd"}, "meta": {}}
{"logical_key": "MTA_BUS_COMP/gtfs.zip", "physical_keys": ["s3://quilt-transit/gtfs/US/NY/NEW_YORK/MTA/MTA_BUS_COMP/gtfs.zip?version=vxKXRmxT138ihJRStIy4_9cs1E0ukcUq"], "size": 697643, "hash": {"type": "SHA256", "value": "2a46b0288a22471d8737f38abc4ff36866e6522edc8391bf633bace5e2952a8e"}, "meta": {}}
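Because the manifest is newline-delimited JSON, it is easy to consume programmatically. A sketch, using an abbreviated inline manifest in place of the real one:

```python
import json

# Abbreviated stand-in for a real package manifest (JSON Lines:
# a header line followed by one entry per logical key).
MANIFEST = "\n".join([
    '{"version": "v0", "message": "Package of NYC MTA Buses Delays"}',
    '{"logical_key": "quilt_summarize.json", "size": 20953}',
    '{"logical_key": "MTA_BUS_COMP/gtfs.zip", "size": 697643}',
])

header, *entries = [json.loads(line) for line in MANIFEST.splitlines()]
logical_keys = {e["logical_key"] for e in entries}
```

The set of logical keys is what a schema check would compare against the fields declared in the `implements` target.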
In-development feature (PRs welcome :) )
38. Inheritance and Data Schema Integrity
A Package specification can optionally specify a schema that it implements.
1. Any field defined in the inherited specification becomes available if a default is
defined
2. Any field without a default value must be present as a logical key in the
package manifest
This ensures that any Package implementing a schema is guaranteed to have the
data elements a pipeline needs to access programmatically.
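The two rules above can be sketched as a small validation function. The schema representation and names here are hypothetical; the real feature was still in development:

```python
def check_implements(schema_fields: dict, manifest_keys: set) -> list:
    """Apply the two inheritance rules: a field with a default is optional,
    a field without a default must appear as a logical key in the manifest.
    Returns the sorted list of missing required fields (empty means valid)."""
    return sorted(
        name for name, spec in schema_fields.items()
        if "default" not in spec and name not in manifest_keys
    )

# Illustrative schema: one required field, one with a default.
schema = {
    "stops.txt": {},                            # required: no default
    "quilt_summarize.json": {"default": "{}"},  # optional: has a default
}
missing = check_implements(schema, {"gtfs.zip"})
```

A pipeline can run this check before touching any data and refuse packages that do not satisfy the schema they claim to implement.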
40. Fail Fast
Catch schema and data quality issues before running an expensive pipeline.
Decrease time spent debugging data quality issues.
Decrease time spent getting fixes from data providers/contractors (e.g. broken
files/schemas, wrong timezones, wrong date ranges, unit issues, etc.).
41. Testing Data like Code
Unit tests - schema, types, ranges
Integration tests - consistency across objects
Wraps pytest
Staging run with 0.1% of the data, for instance
Package only data that has passed tests
Publish a new version of the Package with a git
hash of the tests’ code
import zipfile

from quilt3.data import validation

class TestGTFSFile(validation.TestCase):
    @validation.TestCase.test_elements('gtfs/US/NY/NEW_YORK/*/*/*')
    def validateFileStructure(self, gtfs_key):
        self.assertTrue(gtfs_key.endswith('.zip'),
                        'GTFS should be a .zip file')
        with zipfile.ZipFile(gtfs_key) as z:
            filenames = z.namelist()
        self.assertIn('stops.txt', filenames,
                      'GTFS zip missing stops.txt')
42. Validation is slow
Too large to run on CircleCI, and too slow to run as part of any CI without adequate
parallelization.
It is hard to balance the frequency of continuous integration runs against resource
utilization.
Instead, submit to a validation pipeline when publishing a package.
The validation pipeline runs the tests stored in a GitHub repo. On success, it attaches
a PASS with the git hash to the package metadata.
43. Best Practices for Dataset Management
VERSIONED
IMMUTABLE
REPRODUCIBLE
DOCUMENTED
INTERCHANGEABLE
EXPLICIT DATA DEPENDENCIES
PORTABLE
EXTENSIBLE
BUILD, RELEASE, RUN