Making Big Data Easy
Data Discovery Lineage Data Like Code
Manage Data Like Code
Quilt Data May 2019
My Background
Worked on a variety of projects in backend and distributed systems at:
Passionate about large scale data problems:
1. ETLs for data warehouses
2. Pipelines for machine learning applications
3. Distributed Simulation Environments
Problems Everyone Encounters
● Missing or incorrect data
● Type issues and Schema mismatches
● Failed pipelines, buggy pipelines, delayed pipelines
● Differing formats for dates, time zones, and units of measurement
● Ensemble models trained on inconsistent data sets
● Training-serving skew
● Data quality regressions over time
Growing Understanding of Cost of Problems
Hidden Technical Debt in Machine Learning Systems
NIPS 2015
Need for Developing Process Around Data
Managing machine learning and data-intensive systems requires adopting
good practices and processes to address the complex behaviours
that data changes can cause.
“If data replaces code in ML systems, and code should be tested, then it seems clear
that some amount of testing of input data is critical to a well-functioning system. Basic
sanity checks are useful, as more sophisticated tests that monitor changes in input
distributions.” - Sculley, et al.
Where Does Quilt Fit In?
I found that everywhere I went, people were building very similar internal tooling to
manage these issues.
An opportunity to bring these techniques and tools to the broader community as
open source.
Software Engineering Principles
Versioned Code
Clean and explicit interfaces between modules
and services
Unit and integration tests
Observability (Metrics, logging, tracing)
Data Engineering equivalents
Versioned Data
Clean and explicit interfaces between code and data
Data quality and schema checks
Dataset regressions (changes in range, std, etc)
What if data were managed like code?
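The "dataset regressions" idea above can be sketched as a comparison of summary statistics between two versions of a dataset. This is a minimal illustration rather than a Quilt feature: the function names, the 25% tolerance, and the sample delay values are all assumptions made for the example.

```python
import statistics


def summarize(values):
    """Return the summary statistics tracked between dataset versions."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.fmean(values),
        "std": statistics.pstdev(values),
    }


def find_regressions(baseline, candidate, rel_tol=0.25):
    """List the statistics of `candidate` that drift more than rel_tol from `baseline`."""
    old, new = summarize(baseline), summarize(candidate)
    drifted = []
    for name, old_value in old.items():
        scale = max(abs(old_value), 1e-9)  # guard against dividing by zero
        if abs(new[name] - old_value) / scale > rel_tol:
            drifted.append(name)
    return drifted


# Hypothetical bus delays in minutes for two versions of the same dataset.
delays_v1 = [2.0, 3.5, 4.0, 5.5, 6.0]
delays_v2 = [2.0, 3.5, 4.0, 5.5, 60.0]  # one value far outside the old range

print(find_regressions(delays_v1, delays_v2))
```

A check like this plays the role that a regression test plays for code: it does not say the new data is wrong, only that it changed enough to deserve a look before a pipeline consumes it.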
Example Pipeline
To motivate some of these ideas, let’s look at an example pipeline built for a civic
tech hackathon with Aleksey Bilogur.
It is a great example of pipeline jungles and the types of problems that we encounter
when code and data interact.
Building a Pipeline to Predict Bus Delays
Adding Features to the Dataset
Adding Data Sources
Data Problems Encountered
● Incorrect Schema:
○ Some malformed GTFS Files (missing stops.txt)
● Misaligned/Bad timezones
○ Some local timestamps, some UTC. Required careful handling of datetime.datetime(tzinfo)
● Timestamp mismatch
○ Schedule for a different date range than realtime feed
● Missing Data
○ Endpoint scrapers experience occasional failures or backoffs, resulting in missing data
● Underlying data from different pipelines may be out of sync
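The mixed local/UTC timestamp problem above is usually handled by converting everything to timezone-aware UTC before comparing. The sketch below shows one way to do that; the `to_utc` helper, the fixed `EDT` offset, and the sample timestamps are assumptions for illustration, not code from the hackathon pipeline.

```python
from datetime import datetime, timedelta, timezone


def to_utc(ts, local_tz):
    """Return `ts` as an aware UTC datetime.

    Naive timestamps are assumed to be wall-clock times in `local_tz`;
    aware timestamps are simply converted.
    """
    if ts.tzinfo is None:
        ts = ts.replace(tzinfo=local_tz)
    return ts.astimezone(timezone.utc)


# Fixed offset standing in for New York daylight time (an assumption that
# keeps the sketch dependency-free; a zoneinfo zone would also handle DST).
EDT = timezone(timedelta(hours=-4), "EDT")

scheduled = datetime(2019, 5, 1, 8, 0)                       # naive, local time
observed = datetime(2019, 5, 1, 12, 3, tzinfo=timezone.utc)  # aware, UTC

delay = to_utc(observed, EDT) - to_utc(scheduled, EDT)
print(delay)
```

Subtracting a naive datetime from an aware one raises a `TypeError` in Python, so normalizing first also turns a silent four-hour error into an explicit conversion step.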
Enforcing Data Consistency
Data Consistency Problems
DRY For Data Dependencies
1. GTFS schedules used in two different parts of the pipeline
2. Could be different dates/ranges, need to ensure consistency
Training-Serving Skew:
1. The features need to be consistent between training and serving
2. Code consistency for ETL transform on GTFS-RT feed
3. Hyperparameter consistency
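One of the consistency checks above, that the schedule and the realtime feed cover the same dates, can be expressed as a small overlap test. The function name and the date ranges below are hypothetical and chosen only to show the shape of the check.

```python
from datetime import date


def overlap(range_a, range_b):
    """Return the (start, end) dates shared by two inclusive ranges, or None."""
    start = max(range_a[0], range_b[0])
    end = min(range_a[1], range_b[1])
    return (start, end) if start <= end else None


# Hypothetical coverage of a GTFS schedule and two GTFS-RT captures.
schedule = (date(2019, 4, 1), date(2019, 4, 30))
realtime_ok = (date(2019, 4, 20), date(2019, 5, 7))
realtime_stale = (date(2019, 5, 1), date(2019, 5, 7))

print(overlap(schedule, realtime_ok))     # shared window to train on
print(overlap(schedule, realtime_stale))  # None: inputs do not line up
```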
Ensure consistency across inputs
Capture Inputs as Packages
Package Definition
Need to enforce consistent underlying data across microservices and models
● Global Singleton for Data
● Immutable, Reproducible
● Versioned
● Language agnostic
Package manifest
{"version": "v0", "message": "Package of NYC MTA Buses Delays"}
{"logical_key": "quilt_summarize.json", "physical_keys": ["s3://quilt-transit/gtfs/hackathon/quilt_summarize.json?versionId=o35eue5yTFhZzqyiZfNle2E9oCJL.I.O"], "size": 20953, "hash": {"type": "SHA256", "value": "80f7dec8c14945f3565dbf528d13b6cae3422e7bc65dc7b569213c519e9146dd"}, "meta": {}}
{"logical_key": "MTA_BUS_COMP/gtfs.zip", "physical_keys": ["s3://quilt-transit/gtfs/US/NY/NEW_YORK/MTA/MTA_BUS_COMP/gtfs.zip?version=vxKXRmxT138ihJRStIy4_9cs1E0ukcUq"], "size": 697643, "hash": {"type": "SHA256", "value": "2a46b0288a22471d8737f38abc4ff36866e6522edc8391bf633bace5e2952a8e"}, "meta": {}}
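Since the manifest is one JSON object per line, it can be read with the standard library alone. The sketch below parses a manifest, splits a physical key into bucket, object key, and version, and checks a payload against its recorded SHA256. The helper names, the `example-bucket` URI, and the tiny `stops.txt` payload are assumptions for illustration; this is not how quilt3 itself is implemented.

```python
import hashlib
import json
from urllib.parse import parse_qs, urlsplit


def parse_manifest(text):
    """Split a JSON-lines manifest into its header and its entries."""
    rows = [json.loads(line) for line in text.splitlines() if line.strip()]
    return rows[0], rows[1:]


def split_physical_key(uri):
    """Return (bucket, key, version_id) for an s3:// URI."""
    parts = urlsplit(uri)
    version = parse_qs(parts.query).get("versionId", [None])[0]
    return parts.netloc, parts.path.lstrip("/"), version


def matches_hash(entry, payload):
    """Check that `payload` has the SHA256 recorded in the manifest entry."""
    return hashlib.sha256(payload).hexdigest() == entry["hash"]["value"]


# Hypothetical one-entry manifest built around a small in-memory payload.
payload = b"stop_id,stop_name\n1,Main St\n"
manifest_text = "\n".join([
    json.dumps({"version": "v0", "message": "example package"}),
    json.dumps({
        "logical_key": "stops.txt",
        "physical_keys": ["s3://example-bucket/gtfs/stops.txt?versionId=abc123"],
        "size": len(payload),
        "hash": {"type": "SHA256", "value": hashlib.sha256(payload).hexdigest()},
        "meta": {},
    }),
])

header, entries = parse_manifest(manifest_text)
print(split_physical_key(entries[0]["physical_keys"][0]))
print(matches_hash(entries[0], payload))
```

Because each entry pins both an object version and a content hash, two services that load the same manifest read the same bytes, which is what makes the package behave like an immutable, versioned unit.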
A More Consistent Pipeline
Package Lifecycle
1. Build your package
2. Publish your package
3. Run your pipeline
Well suited for batch-style jobs coming out of a
data lake.
Build, release, run
pkg = quilt3.Package().set(...)
pkg.push("usr/pkg")
quilt3.Package.install("usr/pkg")
from quilt3.data.usr import pkg
Versioning Data
Datasets Don’t Play Well with Source Control
First impulse: Throw it into GitHub!
Doesn’t handle datasets well:
1. Poor support for Large files (>100MB)
2. Repo size blows up
3. Diff algorithms are line-based
Going to need some different tooling, but want to capture the same ideas and
principles
S3 Blob Storage
Properties
1. High availability
2. High throughput
3. Object-level versioning
4. Not designed to be a user friendly web app
Quilt3 is a Data Collaboration layer built on top of S3, to be more user friendly and
interactive
Versioned Packages Help Reproducibility
Using the package versions for the input to a pipeline allows
1. Tracking data lineage
2. Cross run comparisons
3. Skip already computed phases
Version the trained model, by tagging metadata with
1. Data Package
2. Trained Model Artifact + Config (i.e. hyperparameter info in metadata)
3. Git hashes of the code used in pipelines
code + data + model
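The "code + data + model" idea above can be sketched as a metadata record that is hashed into a run key. The field names, the hash placeholders, and the `run_key` helper are assumptions for illustration and not a Quilt schema.

```python
import hashlib
import json


def model_metadata(package_name, package_hash, git_hash, hyperparameters):
    """Bundle the identifiers needed to reproduce a trained model."""
    return {
        "data_package": {"name": package_name, "hash": package_hash},
        "code": {"git_hash": git_hash},
        "config": {"hyperparameters": hyperparameters},
    }


def run_key(metadata):
    """Derive a stable key from the metadata, independent of dict ordering."""
    canonical = json.dumps(metadata, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical identifiers; real values would come from the package and git.
meta_a = model_metadata("usr/pkg", "data-hash-1", "git-hash-1",
                        {"learning_rate": 0.1, "depth": 6})
meta_b = model_metadata("usr/pkg", "data-hash-1", "git-hash-1",
                        {"depth": 6, "learning_rate": 0.1})
meta_c = model_metadata("usr/pkg", "data-hash-2", "git-hash-1",
                        {"learning_rate": 0.1, "depth": 6})

completed_runs = {run_key(meta_a)}
print(run_key(meta_b) in completed_runs)  # same inputs: phase can be skipped
print(run_key(meta_c) in completed_runs)  # new data version: must recompute
```

Keying runs this way gives all three benefits listed above from one record: the metadata is the lineage, equal keys mark comparable runs, and a key that already exists marks a phase that can be skipped.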
Documented Datasets
Quick Example
https://allencell.quiltdata.com/b/quilt-example/tree/robnewman/us_county_smoking_vs_poverty/
Searchable
https://allencell.quiltdata.com/b/quilt-aics/tree/aics/pipeline_integrated_cell/fov/f_00103cfcd0a7452c9ec7cb32398ce463.ome.tiff
https://allencell.quiltdata.com/b/quilt-aics/search?q=user_meta.microscopy_info.colony_position%3A%20%22center%22
Searchable Data Provenance
Discoverability
Discoverable Schema
https://allencell.quiltdata.com/b/quilt-example/tree/akarve/previews/storms.parquet
Schema Mismatches
Moving Cities
Missing Data Provider
Crashes Subsequent Phases
Errors Propagate Through Pipeline
Adding a schema to package manifest
{"version": "v0", "message": "Package of NYC MTA Buses Delays", "implements": ["s3://quilt-transit/gtfs_city_schema.jsonl?versionId=o35eue5yTFhZzqyiZfNle2E9oCJL.I.O"]}
{"logical_key": "quilt_summarize.json", "physical_keys": ["s3://quilt-transit/gtfs/hackathon/quilt_summarize.json?versionId=o35eue5yTFhZzqyiZfNle2E9oCJL.I.O"], "size": 20953, "hash": {"type": "SHA256", "value": "80f7dec8c14945f3565dbf528d13b6cae3422e7bc65dc7b569213c519e9146dd"}, "meta": {}}
{"logical_key": "MTA_BUS_COMP/gtfs.zip", "physical_keys": ["s3://quilt-transit/gtfs/US/NY/NEW_YORK/MTA/MTA_BUS_COMP/gtfs.zip?version=vxKXRmxT138ihJRStIy4_9cs1E0ukcUq"], "size": 697643, "hash": {"type": "SHA256", "value": "2a46b0288a22471d8737f38abc4ff36866e6522edc8391bf633bace5e2952a8e"}, "meta": {}}
In-development feature (PRs welcome :))
Inheritance and Data Schema Integrity
A Package specification can optionally specify a schema that it implements.
1. Any field defined in the inherited specification becomes available if a default is
defined
2. Any field without a default value must be present as a logical key in the
package manifest
This ensures that any Package implementing a schema is guaranteed to contain the
data elements a pipeline needs to access programmatically.
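The two rules above can be sketched as a small resolution step. The schema layout used here (a dict mapping each field to an optional `default`) is purely an assumption for illustration, since the slides do not show the schema file format.

```python
def resolve_package(schema, logical_keys):
    """Apply the two inheritance rules to a package.

    Returns (resolved, missing): `resolved` maps every satisfiable field to
    its value, `missing` lists required fields the package does not provide.
    """
    resolved, missing = {}, []
    for field, spec in schema.items():
        if field in logical_keys:
            resolved[field] = logical_keys[field]  # provided by the package
        elif "default" in spec:
            resolved[field] = spec["default"]      # rule 1: inherited default
        else:
            missing.append(field)                  # rule 2: required, absent
    return resolved, missing


# Hypothetical schema: gtfs.zip is required, the summary file has a default.
schema = {
    "gtfs.zip": {},
    "quilt_summarize.json": {"default": "s3://example-bucket/default_summary.json"},
}

good = {"gtfs.zip": "s3://example-bucket/MTA/gtfs.zip"}
bad = {"quilt_summarize.json": "s3://example-bucket/custom_summary.json"}

print(resolve_package(schema, good))
print(resolve_package(schema, bad))
```

A package with a non-empty `missing` list would be rejected at publish time, so a city whose provider drops a file fails before any pipeline phase depends on it.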
Catch Bad Data Early
Fail Fast
Catch schema and data quality issues before running an expensive pipeline.
Decrease time spent debugging data quality issues.
Decrease time getting fixes from data providers/contractors (e.g. broken
files/schemas, wrong time zones, wrong date ranges, units issues, etc.).
Testing Data like Code
Unit tests - Schema, types, ranges
Integration tests - consistency across objects,
consistency across
Wraps pytest
Staging, for instance with a run on 0.1% of the data
Package only data that has passed tests
Publish a new version of the Package with a git
hash of the tests’ code
import zipfile

# Validation interface as presented in this talk.
from quilt3.data import validation

class TestGTFSFile(validation.TestCase):
    # Run once for every object whose logical key matches the glob.
    @validation.TestCase.test_elements('gtfs/US/NY/NEW_YORK/*/*/*')
    def validateFileStructure(self, gtfs_key):
        self.assertTrue(gtfs_key.endswith('.zip'),
                        'GTFS should be a .zip file')
        # Every usable GTFS archive must contain stops.txt.
        with zipfile.ZipFile(gtfs_key) as z:
            filenames = z.namelist()
        self.assertIn('stops.txt', filenames,
                      'GTFS zip missing stops.txt')
Validation is slow
Too large to run on CircleCI, too slow to run as part of any CI without adequate
parallelization
Hard to balance the frequency of running continuous integration jobs and
resource utilization
Submit to a validation pipeline when publishing a package.
The validation pipeline runs tests stored in a GitHub repo. On success, it attaches a PASS
with the git hash to the package metadata.
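The publish-time flow above can be sketched as a function that runs every check over the package entries and produces the metadata stamp. The check functions, the entry fields used, and the shape of the stamp are assumptions for illustration; in the flow described on the slide the checks would live in the tests repository and run in a separate pipeline.

```python
def check_gtfs_is_zip(entry):
    """Return a problem description, or None if the entry looks fine."""
    key = entry["logical_key"]
    if "gtfs" in key and not key.endswith(".zip"):
        return f"{key}: GTFS feed should be a .zip file"
    return None


def check_nonempty(entry):
    """Flag objects that were recorded with no content."""
    if entry["size"] <= 0:
        return f"{entry['logical_key']}: empty object"
    return None


def validate_package(entries, checks, git_hash):
    """Run all checks and build the validation stamp for the package metadata."""
    failures = []
    for entry in entries:
        for check in checks:
            problem = check(entry)
            if problem:
                failures.append(problem)
    status = "PASS" if not failures else "FAIL"
    return {"validation": {"status": status, "git_hash": git_hash,
                           "failures": failures}}


# Hypothetical package entries and test-repo commit.
CHECKS = [check_gtfs_is_zip, check_nonempty]
good_entries = [{"logical_key": "MTA_BUS_COMP/gtfs.zip", "size": 697643}]
bad_entries = [{"logical_key": "MTA_BUS_COMP/gtfs.tar", "size": 0}]

print(validate_package(good_entries, CHECKS, "abc1234"))
print(validate_package(bad_entries, CHECKS, "abc1234"))
```

Recording the git hash next to the status means a consumer can tell not just that a package passed, but which version of the tests it passed.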
Quilt Data May 2019
Best Practices for
Dataset Management
VERSIONED
IMMUTABLE
REPRODUCIBLE
DOCUMENTED
INTERCHANGEABLE
EXPLICIT DATA DEPENDENCIES
PORTABLE
EXTENSIBLE
BUILD, RELEASE, RUN
Questions
@quiltdata on twitter
sindelar@quiltdata.io
quiltdata/quilt on GitHub
We're hiring a Cloud Engineer
in Downtown SF
Manage Data Like Code (sf analytics meetup) (1)

  • 1. Making Big Data Easy Data Discovery Lineage Data Like Code
  • 2. Manage Data Like Code Quilt Data May 2019
  • 3. My Background Worked on a variety of projects in backend and distributed systems at: Passionate about large scale data problems: 1. ETLs for data warehouses 2. Pipelines for machine learning applications 3. Distributed Simulation Environments Quilt Data May 2019
  • 4. Problems Everyone Encounters ● Missing or incorrect data ● Type issues and Schema mismatches ● Failed pipelines, buggy pipelines, delayed pipelines ● Differing Format for Dates and Timezones, units of measurement ● Ensemble models trained on inconsistent data sets ● Training-serving skew ● Data quality regressions over time Quilt Data May 2019
  • 5. Growing Understanding of Cost of Problems Quilt Data May 2019 Hidden Technical Debt in Machine Learning Systems NIPS 2015
  • 6. Need for Developing Process Around Data Managing machine learning and data intensive systems require adopting good practices and processes to address the complexities of behaviours that data changes can cause. “If data replaces code in ML systems, and code should be tested, then it seems clear that some amount of testing of input data is critical to a well-functioning system. Basic sanity checks are useful, as more sophisticated tests that monitor changes in input distributions.” - Sculley, et al. Quilt Data May 2019
  • 7. Where Does Quilt Fit In? I found that everywhere I went, people were building very similar internal tooling to manage these issues. Opportunity to bring techniques and tools into the broader community and Open-source Quilt Data May 2019
  • 8. Software Engineering Principles Versioned Code Clean and explicit interfaces between modules and services Unit and integration tests Observability (Metrics, logging, tracing) Data Engineering equivalents Versioned Data Clean and explicit interfaces between code and data Data quality and schema checks Dataset regressions (changes in range, std, etc) What if data were managed like code? Quilt Data May 2019
  • 9. Example Pipeline To motivate some of these ideas, let’s look at an example pipelines built for a civic tech hackathon with Aleksey Bilogur Great example of pipeline jungles and the types of problems that we encounters when code and data interact. Quilt Data May 2019
  • 10. Building a Pipeline to Predict Bus Delays Quilt Data May 2019
  • 11. Adding Features to the Dataset Quilt Data May 2019
  • 12. Adding Data Sources Quilt Data May 2019
  • 13. Data Problems Encountered ● Incorrect Schema: ○ Some malformed GTFS Files (missing stops.txt) ● Misaligned/Bad timezones ○ Some local timestamps, some UTC. Required careful handling of datetime.datetime(tzinfo) ● Timestamp mismatch ○ Schedule for a different date range than realtime feed ● Missing Data ○ Endpoint scrapers experiences occasional failures or backoffs resulting in missing data ● Underlying data from different pipelines may be out of sync Quilt Data May 2019
  • 14. Quilt Data May 2019 Enforcing Data Consistency
  • 15. Data Consistency Problems DRY For Data Dependencies 1. GTFS schedules used in two different parts of the pipeline 2. Could be different dates/ranges, need to ensure consistency Training-Serving Skew: 1. The features need to be consistent between training and serving 2. Code consistency for ETL transform on GTFS-RT feed 3. Hyperparameter consistency Quilt Data May 2019
  • 16. Ensure consistency across inputs Quilt Data May 2019
  • 17. Capture Inputs as Packages Quilt Data May 2019
  • 18. Package Definition Need to enforce consistent underlying data across microservices and models ● Global Singleton for Data ● Immutable, Reproducible ● Versioned ● Language agnostic Quilt Data May 2019
  • 19. Package manifest
      {"version": "v0", "message": "Package of NYC MTA Buses Delays"}
      {"logical_key": "quilt_summarize.json", "physical_keys": ["s3://quilt-transit/gtfs/hackathon/quilt_summarize.json?versionId=o35eue5yTFhZzqyiZfNle2E9oCJL.I.O"], "size": 20953, "hash": {"type": "SHA256", "value": "80f7dec8c14945f3565dbf528d13b6cae3422e7bc65dc7b569213c519e9146dd"}, "meta": {}}
      {"logical_key": "MTA_BUS_COMP/gtfs.zip", "physical_keys": ["s3://quilt-transit/gtfs/US/NY/NEW_YORK/MTA/MTA_BUS_COMP/gtfs.zip?version=vxKXRmxT138ihJRStIy4_9cs1E0ukcUq"], "size": 697643, "hash": {"type": "SHA256", "value": "2a46b0288a22471d8737f38abc4ff36866e6522edc8391bf633bace5e2952a8e"}, "meta": {}}
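A manifest of this shape can be read and checked with only the standard library. The following is a minimal sketch, assuming the manifest is JSON Lines with a header record first and that the recorded hash is a plain SHA-256 of the object bytes; `parse_manifest`, `verify_entry`, and the toy bucket name are hypothetical and not the quilt3 API.

```python
import hashlib
import json


def parse_manifest(text):
    """Split a JSON Lines manifest into (header, entries)."""
    records = [json.loads(line) for line in text.splitlines() if line.strip()]
    return records[0], records[1:]


def verify_entry(entry, data):
    """Compare object bytes with the size and SHA256 stored in the entry."""
    digest = hashlib.sha256(data).hexdigest()
    return entry["size"] == len(data) and entry["hash"]["value"] == digest


# Toy manifest built in memory so the sketch needs no S3 access.
payload = b"stop_id,stop_name\n1,Main St\n"
manifest_text = "\n".join([
    json.dumps({"version": "v0", "message": "toy package"}),
    json.dumps({
        "logical_key": "stops.txt",
        "physical_keys": ["s3://example-bucket/stops.txt"],  # hypothetical
        "size": len(payload),
        "hash": {"type": "SHA256", "value": hashlib.sha256(payload).hexdigest()},
        "meta": {},
    }),
])

header, entries = parse_manifest(manifest_text)
ok = verify_entry(entries[0], payload)               # unchanged bytes
tampered = verify_entry(entries[0], payload + b"x")  # modified bytes
print(ok, tampered)
```

Because every consumer resolves the same logical key to the same hashed object, two services reading this manifest cannot silently drift onto different GTFS files.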
  • 20. A More Consistent Pipeline
  • 21. Package Lifecycle (build, release, run) 1. Build your package: pkg = quilt3.Package().set(...) 2. Publish your package: pkg.push("usr/pkg") 3. Run your pipeline: quilt3.Package.install("usr/pkg"), then from quilt3.data.usr import pkg Well suited for batch-style jobs coming out of a data lake.
  • 23. Datasets Don’t Play Well with Source Control First impulse: Throw it into GitHub! Doesn’t handle datasets well: 1. Poor support for Large files (>100MB) 2. Repo size blows up 3. Diff algorithms based on new lines Going to need some different tooling, but want to capture the same ideas and principles
  • 24. S3 Blob Storage Properties 1. High availability 2. High throughput 3. Object level versioned 4. Not designed to be a user friendly web app Quilt3 is a Data Collaboration layer built on top of S3, to be more user friendly and interactive
  • 25. Versioned Packages Help Reproducibility Using the package versions for the input to a pipeline allows 1. Tracking data lineage 2. Cross run comparisons 3. Skip already computed phases Version the trained model, by tagging metadata with 1. Data Package 2. Trained Model Artifact + Config (i.e. hyperparameter info in metadata) 3. Git hashes of the code used in pipelines code + data + model
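One way to picture "skip already computed phases" is to key each phase on the triple of data version, code version, and configuration. This is a minimal sketch under that assumption; `run_key`, `PhaseCache`, and the sample hashes are hypothetical names for illustration, not part of quilt3.

```python
import hashlib
import json


def run_key(package_hash, git_commit, hyperparams):
    """Derive a deterministic key from the data, code, and config of a run."""
    record = {"data": package_hash, "code": git_commit, "params": hyperparams}
    canonical = json.dumps(record, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class PhaseCache:
    """Remember phase results so identical inputs are not recomputed."""

    def __init__(self):
        self._results = {}

    def run(self, package_hash, git_commit, hyperparams, compute):
        key = run_key(package_hash, git_commit, hyperparams)
        if key not in self._results:
            self._results[key] = compute()
        return self._results[key]


calls = []


def train():
    calls.append(1)          # count how often training really runs
    return "model-artifact"


cache = PhaseCache()
first = cache.run("pkg-hash-a", "abc123", {"lr": 0.1}, train)
second = cache.run("pkg-hash-a", "abc123", {"lr": 0.1}, train)  # reused
third = cache.run("pkg-hash-b", "abc123", {"lr": 0.1}, train)   # new data
print(len(calls))  # 2
```

The same key doubles as lineage metadata: storing it with the trained model records which data, code, and hyperparameters produced it.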
  • 26. Documented Datasets
  • 31. Discoverable Schema https://allencell.quiltdata.com/b/quilt-example/tree/akarve/previews/storms.parquet
  • 32. Schema Mismatches
  • 34. Missing Data Provider
  • 35. Crashes Subsequent Phases
  • 36. Errors Propagate Through Pipeline
  • 37. Adding a schema to package manifest (In Development Feature, PRs Welcome :) )
      {"version": "v0", "message": "Package of NYC MTA Buses Delays", "implements": ["s3://quilt-transit/gtfs_city_schema.jsonl?versionId=o35eue5yTFhZzqyiZfNle2E9oCJL.I.O"]}
      {"logical_key": "quilt_summarize.json", "physical_keys": ["s3://quilt-transit/gtfs/hackathon/quilt_summarize.json?versionId=o35eue5yTFhZzqyiZfNle2E9oCJL.I.O"], "size": 20953, "hash": {"type": "SHA256", "value": "80f7dec8c14945f3565dbf528d13b6cae3422e7bc65dc7b569213c519e9146dd"}, "meta": {}}
      {"logical_key": "MTA_BUS_COMP/gtfs.zip", "physical_keys": ["s3://quilt-transit/gtfs/US/NY/NEW_YORK/MTA/MTA_BUS_COMP/gtfs.zip?version=vxKXRmxT138ihJRStIy4_9cs1E0ukcUq"], "size": 697643, "hash": {"type": "SHA256", "value": "2a46b0288a22471d8737f38abc4ff36866e6522edc8391bf633bace5e2952a8e"}, "meta": {}}
  • 38. Inheritance and Data Schema Integrity A Package specification can optionally specify a schema that it implements. 1. Any field defined in the inherited specification becomes available if a default is defined 2. Any field without a default value must be present as a logical key in the package manifest This guarantees that any Package implementing a schema has the data elements a pipeline needs to access programmatically
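The two inheritance rules can be sketched as a small resolution step. Since the schema feature is described as in development, everything here is an assumption made for illustration: the schema is modeled as a dict from logical key to a default physical key (or None when there is no default), and `resolve_package` and the sample keys are hypothetical.

```python
def resolve_package(schema_fields, package_entries):
    """Apply schema defaults and reject packages missing required keys.

    schema_fields: logical key -> default physical key, or None if required.
    package_entries: logical key -> physical key set by the package itself.
    """
    resolved = dict(package_entries)
    missing = []
    for logical_key, default in schema_fields.items():
        if logical_key in resolved:
            continue                         # package provides its own value
        if default is not None:
            resolved[logical_key] = default  # rule 1: inherit the default
        else:
            missing.append(logical_key)      # rule 2: required, no default
    if missing:
        raise ValueError(f"package is missing required keys: {sorted(missing)}")
    return resolved


# Hypothetical schema: gtfs.zip is required, the summary has a default.
schema = {
    "gtfs.zip": None,
    "quilt_summarize.json": "s3://example-bucket/default_summarize.json",
}

resolved = resolve_package(schema, {"gtfs.zip": "s3://example-bucket/gtfs.zip"})
print(sorted(resolved))
```

A package that omits `gtfs.zip` fails at resolution time, before any pipeline phase touches it.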
  • 39. Catch Bad Data Early
  • 40. Fail Fast Catch schema and data quality issues before running an expensive pipeline. Decrease time spent debugging data quality issues. Decrease time getting fixes from data providers/contractors (e.g. broken files/schemas, wrong time zones, wrong date ranges, unit issues).
  • 41. Testing Data like Code ● Unit tests: schema, types, ranges ● Integration tests: consistency across objects, consistency across ● Wraps pytest ● Staging with a .1% run, for instance ● Package only data that has passed tests ● Publish a new version of the Package with a git hash of the tests’ code
      import zipfile
      from quilt3.data import validation
      class TestGTFSFile(validation.TestCase):
          # Run the check once for every object matching the key pattern.
          @validation.TestCase.test_elements('gtfs/US/NY/NEW_YORK/*/*/*')
          def validateFileStructure(self, gtfs_key):
              self.assertTrue(gtfs_key.endswith('.zip'), 'GTFS should be a .zip file')
              # List the archive members and require the stops table.
              with zipfile.ZipFile(gtfs_key) as z:
                  filenames = z.namelist()
              self.assertIn('stops.txt', filenames, 'GTFS zip missing stops.txt')
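The same structural check can be tried without any Quilt-specific test harness. This is a minimal sketch using only the standard library's unittest and zipfile modules; `make_zip`, `missing_gtfs_files`, and the in-memory archives are hypothetical stand-ins for real GTFS feeds.

```python
import io
import unittest
import zipfile

REQUIRED_FILES = {"stops.txt"}


def make_zip(names):
    """Build an in-memory zip with empty members, standing in for a feed."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        for name in names:
            z.writestr(name, "")
    return buf.getvalue()


def missing_gtfs_files(zip_bytes, required=REQUIRED_FILES):
    """Return the required member names that the archive does not contain."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as z:
        return set(required) - set(z.namelist())


class TestGTFSArchive(unittest.TestCase):
    def test_complete_feed_passes(self):
        feed = make_zip(["stops.txt", "routes.txt"])
        self.assertEqual(missing_gtfs_files(feed), set())

    def test_malformed_feed_is_flagged(self):
        feed = make_zip(["routes.txt"])
        self.assertEqual(missing_gtfs_files(feed), {"stops.txt"})


# Run the suite programmatically so the result can gate a publish step.
suite = unittest.TestLoader().loadTestsFromTestCase(TestGTFSArchive)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())
```

Returning the set of missing files, instead of a bare boolean, gives the data provider an actionable message when a feed is rejected.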
  • 42. Validation is slow Too large to run on CircleCI, too slow to run as part of any CI without adequate parallelization. Hard to balance the frequency of running continuous integration jobs against resource utilization. Submit to a validation pipeline when publishing a package. The validation pipeline runs tests stored in a GitHub repo. On success, it attaches a PASS with the git hash to the package metadata.
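The final step, recording the validation outcome on the package, might look like the sketch below. The metadata layout (a `validation` entry holding `status` and `tests_git_hash`) and both function names are assumptions made for illustration; the slides do not specify the actual field names.

```python
import copy


def attach_validation(manifest_header, passed, tests_git_hash):
    """Return a copy of the header with the validation outcome recorded."""
    updated = copy.deepcopy(manifest_header)   # packages are immutable
    meta = updated.setdefault("meta", {})
    meta["validation"] = {
        "status": "PASS" if passed else "FAIL",
        "tests_git_hash": tests_git_hash,      # which test code was run
    }
    return updated


def is_publishable(manifest_header):
    """Only headers carrying a PASS result may be released downstream."""
    validation = manifest_header.get("meta", {}).get("validation", {})
    return validation.get("status") == "PASS"


header = {"version": "v0", "message": "Package of NYC MTA Buses Delays"}
validated = attach_validation(header, passed=True, tests_git_hash="abc123")
print(is_publishable(header), is_publishable(validated))  # False True
```

Copying the header instead of mutating it mirrors the idea that a validated package is a new version, not an edit of the old one.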
  • 43. Best Practices for Dataset Management ● VERSIONED ● IMMUTABLE ● REPRODUCIBLE ● DOCUMENTED ● INTERCHANGEABLE ● EXPLICIT DATA DEPENDENCIES ● PORTABLE ● EXTENSIBLE ● BUILD, RELEASE, RUN
  • 44. Questions @quiltdata on twitter sindelar@quiltdata.io quiltdata/quilt on GitHub We're hiring a Cloud Engineer in Downtown SF