3. My Background
Worked on a variety of projects in backend and distributed systems.
Passionate about large-scale data problems:
1. ETLs for data warehouses
2. Pipelines for machine learning applications
3. Distributed Simulation Environments
Quilt Data May 2019
4. Problems Everyone Encounters
● Missing or incorrect data
● Type issues and schema mismatches
● Failed pipelines, buggy pipelines, delayed pipelines
● Differing formats for dates, timezones, and units of measurement
● Ensemble models trained on inconsistent data sets
● Training-serving skew
● Data quality regressions over time
5. Growing Understanding of Cost of Problems
Hidden Technical Debt in Machine Learning Systems
NIPS 2015
6. Need for Developing Process Around Data
Managing machine learning and data-intensive systems requires adopting
good practices and processes to address the complex behaviours that data
changes can cause.
“If data replaces code in ML systems, and code should be tested, then it seems clear
that some amount of testing of input data is critical to a well-functioning system. Basic
sanity checks are useful, as are more sophisticated tests that monitor changes in input
distributions.” - Sculley, et al.
7. Where Does Quilt Fit In?
I found that everywhere I went, people were building very similar internal tooling to
manage these issues.
This is an opportunity to bring these techniques and tools to the broader community
as open source.
8. Software Engineering Principles
● Versioned code
● Clean and explicit interfaces between modules and services
● Unit and integration tests
● Observability (metrics, logging, tracing)
Data Engineering equivalents
● Versioned data
● Clean and explicit interfaces between code and data
● Data quality and schema checks
● Dataset regression checks (changes in range, std, etc.)
What if data were managed like code?
9. Example Pipeline
To motivate some of these ideas, let’s look at an example pipeline built for a civic
tech hackathon with Aleksey Bilogur.
It is a great example of pipeline jungles and the types of problems that we encounter
when code and data interact.
13. Data Problems Encountered
● Incorrect Schema:
○ Some malformed GTFS Files (missing stops.txt)
● Misaligned/Bad timezones
○ Some local timestamps, some UTC. Required careful handling of datetime.datetime(tzinfo)
● Timestamp mismatch
○ Schedule for a different date range than realtime feed
● Missing Data
○ Endpoint scrapers experienced occasional failures or backoffs, resulting in missing data
● Underlying data from different pipelines may be out of sync
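The timezone problem above is usually solved by normalizing every timestamp to UTC at ingest. A minimal sketch, assuming a fixed UTC-5 offset for naive local timestamps (a real pipeline would use a proper tz database and handle DST):

```python
from datetime import datetime, timezone, timedelta

# Assumed fixed local offset for naive timestamps (illustrative only;
# use zoneinfo/pytz in practice to handle DST correctly).
LOCAL = timezone(timedelta(hours=-5))

def to_utc(ts: datetime) -> datetime:
    """Normalize both naive (assumed local) and aware timestamps to UTC."""
    if ts.tzinfo is None:
        ts = ts.replace(tzinfo=LOCAL)
    return ts.astimezone(timezone.utc)
```

Running every feed through one such function at the boundary means downstream code never has to guess which convention a timestamp uses.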
15. Data Consistency Problems
DRY For Data Dependencies
1. GTFS schedules used in two different parts of the pipeline
2. Could be different dates/ranges, need to ensure consistency
Training-Serving Skew:
1. The features need to be consistent between training and serving
2. Code consistency for ETL transform on GTFS-RT feed
3. Hyperparameter consistency
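One way to enforce the feature-consistency point above is to define the transform once and import it from both the training ETL and the serving path. A hedged sketch; the function name and feature set are illustrative, not the hackathon's actual features:

```python
def extract_features(scheduled_ts: float, actual_ts: float) -> dict:
    """Single shared transform, imported by both the training ETL and the
    serving code, so the feature definitions cannot drift apart."""
    delay = actual_ts - scheduled_ts
    return {
        "delay_seconds": delay,
        "is_late": delay > 60,  # illustrative threshold
    }
```

Because both paths call the same function, a change to a feature definition automatically applies to training and serving together, eliminating one common source of skew.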
18. Package Definition
Need to enforce consistent underlying data across microservices and models
● Global Singleton for Data
● Immutable, Reproducible
● Versioned
● Language agnostic
21. Package Lifecycle
1. Build your package
2. Publish your package
3. Run your pipeline
Well suited for batch-style jobs coming out of a data lake.
Build, release, run
import quilt3

pkg = quilt3.Package().set(...)
pkg.push("usr/pkg")
quilt3.Package.install("usr/pkg")
from quilt3.data.usr import pkg
23. Datasets Don’t Play Well with Source Control
First impulse: Throw it into GitHub!
Doesn’t handle datasets well:
1. Poor support for large files (>100 MB)
2. Repo size blows up
3. Diff algorithms are based on newlines
We are going to need some different tooling, but we want to capture the same ideas
and principles.
24. S3 Blob Storage
Properties:
1. High availability
2. High throughput
3. Object-level versioning
4. Not designed to be a user-friendly web app
Quilt3 is a data collaboration layer built on top of S3 that makes it more user-friendly
and interactive.
25. Versioned Packages Help Reproducibility
Using the package versions for the input to a pipeline allows you to:
1. Tracking data lineage
2. Cross run comparisons
3. Skip already computed phases
Version the trained model by tagging its metadata with:
1. Data Package
2. Trained Model Artifact + Config (i.e. hyperparameter info in metadata)
3. Git hashes of the code used in pipelines
code + data + model
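Assembling that code + data + model record can be as simple as one metadata dict. A sketch with a hypothetical helper and illustrative field names; with quilt3 such metadata could be attached via `pkg.set_meta(...)` before pushing:

```python
def model_metadata(data_pkg: str, data_top_hash: str,
                   git_hash: str, hyperparams: dict) -> dict:
    """Bundle code + data + model provenance into one metadata dict
    (hypothetical helper illustrating the three points above)."""
    return {
        "data_package": data_pkg,        # which package version fed the run
        "data_top_hash": data_top_hash,  # immutable hash of the input data
        "git_hash": git_hash,            # code used in the pipeline
        "hyperparameters": hyperparams,  # model config
    }

meta = model_metadata("usr/transit-delays", "2a46b028...",
                      "deadbeef", {"learning_rate": 0.01})
```

With this attached to the trained-model package, any result can be traced back to the exact data, code, and configuration that produced it.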
37. Adding a schema to package manifest
{"version": "v0", "message": "Package of NYC MTA Buses Delays", "implements": ["s3://quilt-transit/gtfs_city_schema.jsonl?versionId=o35eue5yTFhZzqyiZfNle2E9oCJL.I.O"]}
{"logical_key": "quilt_summarize.json", "physical_keys": ["s3://quilt-transit/gtfs/hackathon/quilt_summarize.json?versionId=o35eue5yTFhZzqyiZfNle2E9oCJL.I.O"], "size": 20953, "hash": {"type": "SHA256", "value": "80f7dec8c14945f3565dbf528d13b6cae3422e7bc65dc7b569213c519e9146dd"}, "meta": {}}
{"logical_key": "MTA_BUS_COMP/gtfs.zip", "physical_keys": ["s3://quilt-transit/gtfs/US/NY/NEW_YORK/MTA/MTA_BUS_COMP/gtfs.zip?version=vxKXRmxT138ihJRStIy4_9cs1E0ukcUq"], "size": 697643, "hash": {"type": "SHA256", "value": "2a46b0288a22471d8737f38abc4ff36866e6522edc8391bf633bace5e2952a8e"}, "meta": {}}
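Because the manifest is newline-delimited JSON, it is easy to consume programmatically. A sketch, using an abbreviated inline manifest in place of the real one:

```python
import json

# Abbreviated stand-in for a real package manifest (JSON Lines:
# a header line followed by one entry per logical key).
MANIFEST = "\n".join([
    '{"version": "v0", "message": "Package of NYC MTA Buses Delays"}',
    '{"logical_key": "quilt_summarize.json", "size": 20953}',
    '{"logical_key": "MTA_BUS_COMP/gtfs.zip", "size": 697643}',
])

header, *entries = [json.loads(line) for line in MANIFEST.splitlines()]
logical_keys = {e["logical_key"] for e in entries}
```

The set of logical keys is what a schema check would compare against the fields declared in the `implements` target.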
In-development feature (PRs welcome :) )
38. Inheritance and Data Schema Integrity
A Package specification can optionally specify a schema that it implements.
1. Any field defined in the inherited specification becomes available if a default is
defined
2. Any field without a default value must be present as a logical key in the
package manifest
This ensures that any Package implementing a schema is guaranteed to have the
data elements a pipeline needs to access programmatically.
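The two rules above can be sketched as a small validation function. The schema representation and names here are hypothetical; the real feature was still in development:

```python
def check_implements(schema_fields: dict, manifest_keys: set) -> list:
    """Apply the two inheritance rules: a field with a default is optional,
    a field without a default must appear as a logical key in the manifest.
    Returns the sorted list of missing required fields (empty means valid)."""
    return sorted(
        name for name, spec in schema_fields.items()
        if "default" not in spec and name not in manifest_keys
    )

# Illustrative schema: one required field, one with a default.
schema = {
    "stops.txt": {},                            # required: no default
    "quilt_summarize.json": {"default": "{}"},  # optional: has a default
}
missing = check_implements(schema, {"gtfs.zip"})
```

A pipeline can run this check before touching any data and refuse packages that do not satisfy the schema they claim to implement.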
40. Fail Fast
Catch schema and data quality issues before running an expensive pipeline.
Decrease time spent debugging data quality issues.
Decrease time spent getting fixes from data providers/contractors (e.g. broken
files/schemas, wrong timezones, wrong date ranges, unit issues, etc.).
41. Testing Data like Code
Unit tests - schema, types, ranges
Integration tests - consistency across objects
Wraps pytest
Staging run with 0.1% of the data, for instance
Package only data that has passed tests
Publish a new version of the Package with a git
hash of the tests’ code
import zipfile

from quilt3.data import validation

class TestGTFSFile(validation.TestCase):
    @validation.TestCase.test_elements('gtfs/US/NY/NEW_YORK/*/*/*')
    def validateFileStructure(self, gtfs_key):
        self.assertTrue(gtfs_key.endswith('.zip'),
                        'GTFS should be a .zip file')
        with zipfile.ZipFile(gtfs_key) as z:
            filenames = z.namelist()
        self.assertIn('stops.txt', filenames,
                      'GTFS zip missing stops.txt')
42. Validation is slow
Too large to run on CircleCI, and too slow to run as part of any CI without adequate
parallelization.
It is hard to balance the frequency of continuous integration runs against resource
utilization.
Instead, submit to a validation pipeline when publishing a package.
The validation pipeline runs the tests stored in a GitHub repo. On success, it attaches
a PASS with the git hash to the package metadata.
43. Best Practices for Dataset Management
VERSIONED
IMMUTABLE
REPRODUCIBLE
DOCUMENTED
INTERCHANGEABLE
EXPLICIT DATA DEPENDENCIES
PORTABLE
EXTENSIBLE
BUILD, RELEASE, RUN