DataOps: The Key to
Exponential Data Impact
Luis Vaquero
87% of data science projects never make it into production
https://venturebeat.com/2019/07/19/why-do-87-of-data-science-projects-never-make-it-into-production/
Luis Vaquero
Director of Data at Goco
Reader in Computer Science at the University of Bristol
Previously: Dyson, HPE, HP Labs, Telefonica/O2
“20% of my time building my model, and 80% of my time cleaning data”
-- Every Data Scientist in 2020
“20% of my time building my app, and 80% of my time fighting the browser/the infra/Spring config...”
-- Every Web Developer in 2000
https://towardsdatascience.com/the-rise-of-dataops-from-the-ashes-of-data-governance-da3e0c3ac2c4
Pipeline Compiler
“Data pipelines are simply compilers that use data as the source code.”
-- Ryan Gross
Each Data Team Writes Their Own “Compiler”
- Lexical and Syntactic Analysis: data quality, metadata, raw or slightly modelled data
- Semantic Analysis: business understanding, model creation and experiments
- Code Generation
- Optimisation: further tests, robustness
https://towardsdatascience.com/the-rise-of-dataops-from-the-ashes-of-data-governance-da3e0c3ac2c4
The Data Pipeline “Compiler”
The ML code itself is only one component; the full system also includes:
- Feature Extraction
- Data Verification
- Analysis Tools
- Configuration
- Resource Management & Monitoring
- Model Monitoring
- Serving Infrastructure
- Process Management Tools
http://cidrdb.org/cidr2020/papers/p8-agrawal-cidr20.pdf
Providing complete and usable third-party solutions is
non-trivial
https://www.eecs.tufts.edu/~dsculley/papers/ml_test_score.pdf
Where do we test here?
Raw Data → Pipeline → Training/serving Data → ML Model → Model Output
DataOps Tests
Raw Data → Pipeline → Training/serving Data → ML Model → Model Output
A Data Validator checks the training/serving data before it reaches the model.
What Kind of Tests?
1. Data validator: schema conformance and evolution
   a. Also a way to document new features used in the pipeline
   b. Also covers trends and anomalies
https://blog.acolyer.org/2019/06/05/data-validation-for-machine-learning/
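A minimal sketch of what the data validator's schema-conformance check amounts to, in plain Python with a hypothetical three-field schema (the linked post describes Google's far richer production system):

```python
# Minimal data validator: check rows against a schema and flag unknown
# fields (schema evolution), missing required fields, and type mismatches.
SCHEMA = {
    "user_id": int,
    "country": str,
    "clicks": int,
}

def validate(row: dict, schema: dict = SCHEMA) -> list:
    errors = []
    for field in row:
        if field not in schema:
            errors.append(f"unknown field: {field}")  # a new feature to document
    for field, ftype in schema.items():
        if field not in row:
            errors.append(f"missing field: {field}")
        elif not isinstance(row[field], ftype):
            errors.append(f"type mismatch: {field}")
    return errors

print(validate({"user_id": 1, "country": "UK", "clicks": "7"}))
# -> ['type mismatch: clicks']
```

Unknown fields are deliberately reported rather than rejected, which is what makes the validator double as documentation of new features entering the pipeline.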
New Feature Engineering:
“Instead of deriving the math before feeding the model, we ensure our features comply with certain properties so that the NN can do the math effectively by itself”
-- Airbnb
https://arxiv.org/pdf/1810.09591.pdf
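A model unit tester looks for errors in training code by feeding it synthetic data generated from the schema (schema-led fuzzing). A minimal sketch, with a hypothetical toy schema and a stand-in `train` function (a real tester would drive the actual training pipeline):

```python
# Schema-led fuzzing: generate synthetic rows that conform to the schema,
# then run the training code on them to surface crashes and bad maths
# before real data does.
import random
import string

# Hypothetical toy schema: field -> (type, range constraints)
SCHEMA = {"age": ("int", 0, 120), "income": ("float", 0.0, 1e6), "city": ("str",)}

def synth_row(schema: dict) -> dict:
    """Draw one random row that satisfies the schema."""
    row = {}
    for field, spec in schema.items():
        if spec[0] == "int":
            row[field] = random.randint(spec[1], spec[2])
        elif spec[0] == "float":
            row[field] = random.uniform(spec[1], spec[2])
        else:
            row[field] = "".join(random.choices(string.ascii_lowercase, k=8))
    return row

def train(rows):
    """Stand-in for real training code."""
    return sum(r["age"] for r in rows) / len(rows)

rows = [synth_row(SCHEMA) for _ in range(100)]
assert 0 <= train(rows) <= 120  # model unit test on purely synthetic data
```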
DataOps Tests
Raw Data → Pipeline → Training/serving Data → ML Model → Model Output
Attached tests: Data Validator, Data Analyser, Model Unit Tester, Model Analyser
Model Analysis
“Smooth distributions coming from the lower layers ensure that the upper layers can correctly ‘interpolate’ the behavior for unseen values…. ensure the input features have a smooth distribution”
-- Airbnb
https://arxiv.org/pdf/1810.09591.pdf
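A toy illustration of the smoothing idea, using a hypothetical long-tailed price feature; a `log1p` transform is one common way to smooth such a feature, not necessarily Airbnb's exact recipe:

```python
# A long-tailed feature has a very skewed distribution; log1p makes it
# roughly normal, i.e. "smooth" for the layers above to interpolate on.
import math
import random
import statistics

random.seed(0)
prices = [random.lognormvariate(4, 1.0) for _ in range(10_000)]  # heavy right tail
smoothed = [math.log1p(p) for p in prices]

def skewness(xs):
    """Sample skewness: third standardised moment."""
    m, s = statistics.fmean(xs), statistics.pstdev(xs)
    return statistics.fmean(((x - m) / s) ** 3 for x in xs)

print(f"raw skew: {skewness(prices):.2f}, log1p skew: {skewness(smoothed):.2f}")
```

A model analyser can run exactly this kind of check on the features the lower layers emit, alerting when a distribution stops being smooth.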
What Kind of Tests?
1. Data validator: schema conformance and evolution
   a. Also a way to document new features used in the pipeline
2. Data analyser: basic statistics
   a. Bias
   b. Feature/distribution skew
   c. ...
3. Model unit tester: looks for errors in the training code using synthetic data (schema-led fuzzing)
4. Monitoring tests: check the output of the model to trigger alerts
https://blog.acolyer.org/2019/06/05/data-validation-for-machine-learning/
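The data analyser's feature/distribution-skew check (item 2b) can be sketched as comparing per-feature statistics of training and serving data; a toy sketch with a hypothetical tolerance, not a production analyser:

```python
# Minimal data analyser: compare per-feature statistics of training vs
# serving data and flag features whose means drift beyond a tolerance.
import statistics

def summarize(rows, features):
    """Per-feature (mean, stdev) over a batch of row dicts."""
    return {f: (statistics.fmean(r[f] for r in rows),
                statistics.pstdev([r[f] for r in rows])) for f in features}

def skew_report(train_rows, serve_rows, features, tol=3.0):
    train_stats = summarize(train_rows, features)
    serve_stats = summarize(serve_rows, features)
    flagged = []
    for f in features:
        mean_t, std_t = train_stats[f]
        mean_s, _ = serve_stats[f]
        if std_t and abs(mean_s - mean_t) / std_t > tol:
            flagged.append(f)  # training/serving distribution skew
    return flagged

train_rows = [{"clicks": c} for c in [1, 2, 3, 2, 1, 2, 3]]
serve_rows = [{"clicks": c} for c in [40, 41, 39, 42]]
print(skew_report(train_rows, serve_rows, ["clicks"]))  # -> ['clicks']
```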
Monitoring
Model performance dashboard:
1. Model output metrics through training, validation, testing, and deployment
2. Data input metrics
3. Operational telemetry
Image from: https://www.parallelm.com/
https://eng.uber.com/backtesting-at-scale/
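A toy version of an output-metric alert, assuming a binary classifier and a hypothetical drift tolerance:

```python
# Toy monitoring check: alert when the live positive-prediction rate
# drifts away from the rate observed at validation time.
def alert_on_drift(live_preds, baseline_rate, tol=0.10):
    live_rate = sum(live_preds) / len(live_preds)
    drift = abs(live_rate - baseline_rate)
    return {"live_rate": live_rate, "drift": drift, "alert": drift > tol}

print(alert_on_drift([1, 1, 1, 0, 1, 1, 1, 1], baseline_rate=0.5))
# live rate 0.875, drift 0.375 -> alert fires
```

In practice the same comparison runs over many metrics (accuracy proxies, input statistics, latency) rather than one rate.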
Pipeline Compiler
Data Version
Control
What is Special about ML?
1. New Artefacts to Manage
a. Data
b. Metadata: Hyperparameters
c. Code: architecture
d. Model: executable software “built from the data”
e. Experiment: Data + metadata + Hyperparams + Code -> Model
2. Different Process
a. Trial and error: Scientific method
b. Reproducibility - traceability
c. Explainability
https://medium.com/thelaunchpad/retracing-your-steps-in-machine-learning-ml-versioning-74d19a66bd08
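The Experiment artefact (item 1e) can be sketched as a lineage record naming the exact data, code, and hyperparameters behind a model; a toy illustration only, with hypothetical payloads (tools like MLflow or DVC keep far richer records):

```python
# A toy lineage record tying a model to the exact data, code, and
# hyperparameters that produced it: reproducibility and traceability.
import hashlib
import json
from dataclasses import asdict, dataclass

def digest(payload: bytes) -> str:
    """Short content hash used to name an artefact."""
    return hashlib.sha256(payload).hexdigest()[:12]

@dataclass(frozen=True)
class Experiment:
    data_hash: str
    code_hash: str
    hyperparams: dict
    model_hash: str

data = b"user_id,country,clicks\n1,UK,7\n"
code = b"def train(rows): ..."
model = b"<serialized model bytes>"

exp = Experiment(digest(data), digest(code), {"lr": 0.01, "epochs": 10}, digest(model))
print(json.dumps(asdict(exp), indent=2))  # the full recipe behind the model
```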
Still using Excel to track your versions of data, metadata, hyperparameters, and architecture?
The Data Version Control Tipping Point
● Datasets can be versioned, branched, and acted upon by versioned code to create new datasets
● Test and file bugs against data
● Enable quality control for compiler steps
● Automated lineage and schema change detection
● Make guarantees about system components
https://towardsdatascience.com/the-rise-of-dataops-from-the-ashes-of-data-governance-da3e0c3ac2c4
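The core idea, content-addressed storage so datasets can be versioned and branched like code, can be sketched in a few lines (a toy illustration of the mechanism, not how DVC or similar tools actually store data):

```python
# Toy content-addressed data versioning: snapshots live under their
# content hash, so branches are just named pointers to hashes.
import hashlib

store = {}     # content hash -> dataset bytes
branches = {}  # branch name -> content hash

def commit(branch: str, data: bytes) -> str:
    """Store a dataset snapshot and point the branch at it."""
    h = hashlib.sha256(data).hexdigest()
    store[h] = data
    branches[branch] = h
    return h

v1 = commit("main", b"user_id,clicks\n1,7\n")
v2 = commit("main", b"user_id,clicks\n1,7\n2,3\n")  # main moves to the new version
commit("experiment", store[v1])                     # branch from the old version

assert branches["experiment"] == v1  # every version stays addressable
```

Because the hash names the exact bytes, a pipeline run can record precisely which dataset version it consumed, which is what enables filing bugs and making guarantees against data.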
Pipeline Compiler
Data Version Control
Process & Org Changes
Tools, Relationships, Processes
https://blog.dominodatalab.com/introducing-the-data-science-maturity-model/
Level 1: Ad Hoc Exploration
- Processes: isolated efforts, no repeatability
- Tools: siloed data, local dev
- Relationships: no business buy-in (transactional), ivory tower
Level 2: Reproducible, but Limited
- Processes: repeatability is patchy, poor governance
- Tools: early centralisation, static reports
- Relationships: heavily transactional rapport, team-management support only
Level 3: Defined, Controlled
- Processes: formal but manually enforced
- Tools: good centralisation (metadata, access), live retrospective reports
- Relationships: incipient experimentation, empathy
Level 4: Automated
- Processes: automated and searchable, best practice
- Tools: broad accessibility, event-driven
- Relationships: wide experimentation, analytics is the business
https://dl.acm.org/citation.cfm?id=3098021
https://engineering.fb.com/core-data/introducing-fblearner-flow-facebook-s-ai-backbone/

DataOps - Production ML