DataOps: The Key to
Exponential Data Impact
Luis Vaquero
87% of data science projects never make it into production
https://venturebeat.com/2019/07/19/why-do-87-of-data-science-projects-never-make-it-into-production/
Luis Vaquero
Director of Data at Goco
Reader in Computer Science at the University of Bristol
Previously: Dyson, HPE, HP Labs, Telefonica/O2
“20% of my time building my model, and 80% of my time cleaning data”
-- Every Data Scientist in 2020
“20% of my time building my app, and 80% of my time fighting the browser/the infra/Spring config...”
-- Every Web Developer in 2000
https://towardsdatascience.com/the-rise-of-dataops-from-the-ashes-of-data-governance-da3e0c3ac2c4
Pipeline Compiler
“Data pipelines are simply compilers that use data as the source code.”
-- Ryan Gross
Each Data Team Writes Their Own “Compiler”
- Lexical and Syntactic Analysis: data quality, metadata, raw or slightly modelled data
- Semantic Analysis: business understanding, model creation and experiments
- Code Generation
- Optimisation: further tests, robustness
https://towardsdatascience.com/the-rise-of-dataops-from-the-ashes-of-data-governance-da3e0c3ac2c4
The Data Pipeline “Compiler”
The ML code itself is only one component; the full system also includes:
- Feature Extraction
- Data Verification
- Analysis Tools
- Configuration
- Resource Management & Monitoring
- Model Monitoring
- Serving Infrastructure
- Process Management Tools
http://cidrdb.org/cidr2020/papers/p8-agrawal-cidr20.pdf
Providing complete and usable third-party solutions is
non-trivial
https://www.eecs.tufts.edu/~dsculley/papers/ml_test_score.pdf
Where do we test here?
Raw Data → Pipeline → Training/serving Data → ML Model → Model Output
DataOps Tests
Raw Data → Pipeline → Training/serving Data → ML Model → Model Output
A Data Validator checks the training/serving data before it reaches the model.
What Kind of Tests?
1. Data validator: schema conformance and evolution
   a. Also a way to document new features used in the pipeline
   b. Also covers trends and anomalies
https://blog.acolyer.org/2019/06/05/data-validation-for-machine-learning/
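A minimal sketch of what the data validator's schema-conformance check amounts to, in plain Python with a hypothetical three-field schema (the linked post describes Google's far richer production system):

```python
# Minimal data validator: check rows against a schema and flag unknown
# fields (schema evolution), missing required fields, and type mismatches.
SCHEMA = {
    "user_id": int,
    "country": str,
    "clicks": int,
}

def validate(row: dict, schema: dict = SCHEMA) -> list:
    errors = []
    for field in row:
        if field not in schema:
            errors.append(f"unknown field: {field}")  # a new feature to document
    for field, ftype in schema.items():
        if field not in row:
            errors.append(f"missing field: {field}")
        elif not isinstance(row[field], ftype):
            errors.append(f"type mismatch: {field}")
    return errors

print(validate({"user_id": 1, "country": "UK", "clicks": "7"}))
# -> ['type mismatch: clicks']
```

Unknown fields are deliberately reported rather than rejected, which is what makes the validator double as documentation of new features entering the pipeline.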
New Feature Engineering:
“Instead of deriving the math before feeding the model, we ensure our features comply with certain properties so that the NN can do the math effectively by itself”
-- Airbnb
https://arxiv.org/pdf/1810.09591.pdf
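A model unit tester looks for errors in training code by feeding it synthetic data generated from the schema (schema-led fuzzing). A minimal sketch, with a hypothetical toy schema and a stand-in `train` function (a real tester would drive the actual training pipeline):

```python
# Schema-led fuzzing: generate synthetic rows that conform to the schema,
# then run the training code on them to surface crashes and bad maths
# before real data does.
import random
import string

# Hypothetical toy schema: field -> (type, range constraints)
SCHEMA = {"age": ("int", 0, 120), "income": ("float", 0.0, 1e6), "city": ("str",)}

def synth_row(schema: dict) -> dict:
    """Draw one random row that satisfies the schema."""
    row = {}
    for field, spec in schema.items():
        if spec[0] == "int":
            row[field] = random.randint(spec[1], spec[2])
        elif spec[0] == "float":
            row[field] = random.uniform(spec[1], spec[2])
        else:
            row[field] = "".join(random.choices(string.ascii_lowercase, k=8))
    return row

def train(rows):
    """Stand-in for real training code."""
    return sum(r["age"] for r in rows) / len(rows)

rows = [synth_row(SCHEMA) for _ in range(100)]
assert 0 <= train(rows) <= 120  # model unit test on purely synthetic data
```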
DataOps Tests
Raw Data → Pipeline → Training/serving Data → ML Model → Model Output
Attached tests: Data Validator, Data Analyser, Model Unit Tester, Model Analyser
Model Analysis
“Smooth distributions coming from the lower layers ensure that the upper layers can correctly ‘interpolate’ the behavior for unseen values…. ensure the input features have a smooth distribution”
-- Airbnb
https://arxiv.org/pdf/1810.09591.pdf
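A toy illustration of the smoothing idea, using a hypothetical long-tailed price feature; a `log1p` transform is one common way to smooth such a feature, not necessarily Airbnb's exact recipe:

```python
# A long-tailed feature has a very skewed distribution; log1p makes it
# roughly normal, i.e. "smooth" for the layers above to interpolate on.
import math
import random
import statistics

random.seed(0)
prices = [random.lognormvariate(4, 1.0) for _ in range(10_000)]  # heavy right tail
smoothed = [math.log1p(p) for p in prices]

def skewness(xs):
    """Sample skewness: third standardised moment."""
    m, s = statistics.fmean(xs), statistics.pstdev(xs)
    return statistics.fmean(((x - m) / s) ** 3 for x in xs)

print(f"raw skew: {skewness(prices):.2f}, log1p skew: {skewness(smoothed):.2f}")
```

A model analyser can run exactly this kind of check on the features the lower layers emit, alerting when a distribution stops being smooth.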
What Kind of Tests?
1. Data validator: schema conformance and evolution
   a. Also a way to document new features used in the pipeline
2. Data analyser: basic statistics
   a. Bias
   b. Feature/distribution skew
   c. ...
3. Model unit tester: looks for errors in the training code using synthetic data (schema-led fuzzing)
4. Monitoring tests: check the output of the model to trigger alerts
https://blog.acolyer.org/2019/06/05/data-validation-for-machine-learning/
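The data analyser's feature/distribution-skew check (item 2b) can be sketched as comparing per-feature statistics of training and serving data; a toy sketch with a hypothetical tolerance, not a production analyser:

```python
# Minimal data analyser: compare per-feature statistics of training vs
# serving data and flag features whose means drift beyond a tolerance.
import statistics

def summarize(rows, features):
    """Per-feature (mean, stdev) over a batch of row dicts."""
    return {f: (statistics.fmean(r[f] for r in rows),
                statistics.pstdev([r[f] for r in rows])) for f in features}

def skew_report(train_rows, serve_rows, features, tol=3.0):
    train_stats = summarize(train_rows, features)
    serve_stats = summarize(serve_rows, features)
    flagged = []
    for f in features:
        mean_t, std_t = train_stats[f]
        mean_s, _ = serve_stats[f]
        if std_t and abs(mean_s - mean_t) / std_t > tol:
            flagged.append(f)  # training/serving distribution skew
    return flagged

train_rows = [{"clicks": c} for c in [1, 2, 3, 2, 1, 2, 3]]
serve_rows = [{"clicks": c} for c in [40, 41, 39, 42]]
print(skew_report(train_rows, serve_rows, ["clicks"]))  # -> ['clicks']
```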
Monitoring
Model performance dashboard:
1. Model output metrics through training, validation, testing, and deployment
2. Data input metrics
3. Operational telemetry
Image from: https://www.parallelm.com/
https://eng.uber.com/backtesting-at-scale/
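A toy version of an output-metric alert, assuming a binary classifier and a hypothetical drift tolerance:

```python
# Toy monitoring check: alert when the live positive-prediction rate
# drifts away from the rate observed at validation time.
def alert_on_drift(live_preds, baseline_rate, tol=0.10):
    live_rate = sum(live_preds) / len(live_preds)
    drift = abs(live_rate - baseline_rate)
    return {"live_rate": live_rate, "drift": drift, "alert": drift > tol}

print(alert_on_drift([1, 1, 1, 0, 1, 1, 1, 1], baseline_rate=0.5))
# live rate 0.875, drift 0.375 -> alert fires
```

In practice the same comparison runs over many metrics (accuracy proxies, input statistics, latency) rather than one rate.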
Pipeline Compiler
Data Version
Control
What is Special about ML?
1. New Artefacts to Manage
a. Data
b. Metadata: Hyperparameters
c. Code: architecture
d. Model: executable software “built from the data”
e. Experiment: Data + metadata + Hyperparams + Code -> Model
2. Different Process
a. Trial and error: Scientific method
b. Reproducibility - traceability
c. Explainability
https://medium.com/thelaunchpad/retracing-your-steps-in-machine-learning-ml-versioning-74d19a66bd08
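The Experiment artefact (item 1e) can be sketched as a lineage record naming the exact data, code, and hyperparameters behind a model; a toy illustration only, with hypothetical payloads (tools like MLflow or DVC keep far richer records):

```python
# A toy lineage record tying a model to the exact data, code, and
# hyperparameters that produced it: reproducibility and traceability.
import hashlib
import json
from dataclasses import asdict, dataclass

def digest(payload: bytes) -> str:
    """Short content hash used to name an artefact."""
    return hashlib.sha256(payload).hexdigest()[:12]

@dataclass(frozen=True)
class Experiment:
    data_hash: str
    code_hash: str
    hyperparams: dict
    model_hash: str

data = b"user_id,country,clicks\n1,UK,7\n"
code = b"def train(rows): ..."
model = b"<serialized model bytes>"

exp = Experiment(digest(data), digest(code), {"lr": 0.01, "epochs": 10}, digest(model))
print(json.dumps(asdict(exp), indent=2))  # the full recipe behind the model
```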
Still using Excel to track your versions of data, metadata, hyperparameters, and architecture?
The Data Version Control Tipping Point
● Datasets can be versioned, branched, and acted upon by versioned code to create new datasets
● Test and file bugs against data
● Enable quality control for compiler steps
● Automated lineage and schema change detection
● Make guarantees about system components
https://towardsdatascience.com/the-rise-of-dataops-from-the-ashes-of-data-governance-da3e0c3ac2c4
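The core idea, content-addressed storage so datasets can be versioned and branched like code, can be sketched in a few lines (a toy illustration of the mechanism, not how DVC or similar tools actually store data):

```python
# Toy content-addressed data versioning: snapshots live under their
# content hash, so branches are just named pointers to hashes.
import hashlib

store = {}     # content hash -> dataset bytes
branches = {}  # branch name -> content hash

def commit(branch: str, data: bytes) -> str:
    """Store a dataset snapshot and point the branch at it."""
    h = hashlib.sha256(data).hexdigest()
    store[h] = data
    branches[branch] = h
    return h

v1 = commit("main", b"user_id,clicks\n1,7\n")
v2 = commit("main", b"user_id,clicks\n1,7\n2,3\n")  # main moves to the new version
commit("experiment", store[v1])                     # branch from the old version

assert branches["experiment"] == v1  # every version stays addressable
```

Because the hash names the exact bytes, a pipeline run can record precisely which dataset version it consumed, which is what enables filing bugs and making guarantees against data.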
Pipeline Compiler
Data Version Control
Process & Org Changes
Tools, Relationships, Processes
https://blog.dominodatalab.com/introducing-the-data-science-maturity-model/
Level 1: Ad Hoc Exploration
- Processes: isolated efforts, no repeatability
- Tools: siloed data, local dev
- Relationships: no business buy-in (transactional), ivory tower
Level 2: Reproducible, but Limited
- Processes: repeatability is patchy, poor governance
- Tools: early centralisation, static reports
- Relationships: heavily transactional rapport, team-management support only
Level 3: Defined, Controlled
- Processes: formal but manually enforced
- Tools: good centralisation (metadata, access), live retrospective reports
- Relationships: incipient experimentation, empathy
Level 4: Automated
- Processes: automated and searchable, best practice
- Tools: broad accessibility, event-driven
- Relationships: wide experimentation, analytics is the business
https://dl.acm.org/citation.cfm?id=3098021
https://engineering.fb.com/core-data/introducing-fblearner-flow-facebook-s-ai-backbone/

DataOps - Production ML