The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix

The Function, the Context, and the Data
Building an Abstraction for Simpler ML Ops at Stitch Fix
Elijah ben Izzy
Data Platform Engineer - Model Lifecycle
@elijahbenizzy
linkedin.com/in/elijahbenizzy
Try out Stitch Fix → goo.gl/Q3tCQ3

Agenda
- Stitch Fix/Data Science (DS) @ Stitch Fix
- Common Workflows/Motivation
- Representing a Model
- Unlocked Capabilities
- Future Musings

Take Home
The right abstraction enables separation of concerns between DS and Platforms

DAIS 2021 6
Stitch Fix is a Personal Styling Service
Shop at your personal curated store. Check out what you like.

Data Science is Behind Everything We Do
algorithms-tour.stitchfix.com
Algorithms Org.
- 145+ Data Scientists and Platform Engineers
- 3 main verticals + platform
Data Platform

Common Approaches to Data Science
Typical organization:
● Horizontal teams
● Handoffs between functions
● Coordination required
[Diagram: Data Science/Research teams, ETL teams, Engineering teams]

Data Scientists (DS) are Full Stack
At Stitch Fix:
● Single organization
● No handoffs
● End-to-end ownership
● Lots of DS!
● Built on top of data platform tools & abstractions
See https://cultivating-algos.stitchfix.com/
[Diagram: Data Science, ETL, Engineering]

The Problem with Verticals
“DS are full stack” != “DS builds stack from the ground up”
Goal: scale without
-> more complex infrastructure
-> more cognitive burden on DS
DS should always be full stack... ...but can we shorten the stack?
ML platform

Examining Workflows
Training (run at a regular cadence): etl.py -> save on S3 -> copy to production
Inference: model microservice, predictions in batch, streaming predictions
Analysis: track metrics, share with other teams

Optimizing the Workflow
Goal: Build abstraction to give DS all these capabilities for free
Caveat: Largely uniform workflows with independent technologies
[Diagram: ??????? -> model microservice, predictions in batch, streaming predictions, track metrics, share with other teams, ...]

Build or Buy?
We built our own
- Seamless integration with current infrastructure -> leverage
- Model tracking/management data model was not standard
- We have lots of segments/varying ways to slice and dice our models
- Custom build allows for pivoting as needed
- Invest in interface design to allow for plug/play with open-source options
Called it the Model Envelope
Hats off to MLflow, TFX, ModelDB!

What we Built
DS only writes training script -- the rest is configuration-driven
import model_envelope as me
from sklearn import linear_model, metrics
df_X, df_y = load_data_somehow()
model = linear_model.LogisticRegression(multi_class='auto')
model.fit(df_X, df_y)
my_envelope = me.save_model(instance_name='my_model_instance_name',
                            instance_description='my_model_instance_description',
                            model=model,
                            query_function='predict',
                            api_input=df_X, api_output=df_y,
                            tags={'canonical_name': 'foo-bar'})
# log_loss expects (true labels, predicted probabilities), in that order
my_envelope.log_metrics(validation_loss=metrics.log_loss(df_y, model.predict_proba(df_X)))

Model Envelope (ctd.)
[Diagram: model envelope registry -> model microservice, predictions in batch, streaming predictions, track metrics, share with other teams, ...]

Writing a Recipe
The instructions
The cookware
The ingredients

Representing a Model
The function: what the model does
The context: where/how to run the model
The data: data the model needs to run
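To make the three parts concrete, here is a minimal sketch of how an envelope record could be laid out. Every class and field name below is hypothetical and chosen only for illustration; none of it is taken from the Model Envelope library itself.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Function:
    """What the model does: artifact + shape."""
    artifact: bytes                # serialized model, including state
    input_shape: Dict[str, str]    # column name -> type name
    output_shape: Dict[str, str]


@dataclass
class Context:
    """Where/how to run the model: environment + index."""
    python_version: str
    packages: Dict[str, str]       # package name -> pinned version
    tags: Dict[str, str]           # key-value index into the registry
    custom_modules: List[str] = field(default_factory=list)


@dataclass
class Data:
    """Data the model needs to run: training data + metrics."""
    feature_pointers: List[str] = field(default_factory=list)
    summary_stats: Dict[str, Dict[str, float]] = field(default_factory=dict)
    metrics: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Envelope:
    function: Function
    context: Context
    data: Data


# Hypothetical example values, for illustration only.
example = Envelope(
    function=Function(artifact=b"...", input_shape={"x": "float"}, output_shape={"y": "int"}),
    context=Context(python_version="3.8", packages={"scikit-learn": "0.24.2"},
                    tags={"canonical_name": "foo-bar"}),
    data=Data(metrics={"validation_loss": 0.3}),
)
```

Keeping the three parts as separate records mirrors the split of responsibilities: DS supplies the model and tags, while the platform can fill in most of the rest.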

The Function
Artifact + Shape
- Serialized model (bytes) including state
- Serialization metadata
model=model,
api_input=df_X,
api_output=df_y,
DS passes object, platform serializes
Platform derives metadata
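A rough sketch of "DS passes object, platform serializes" is shown here, assuming pickle as the serializer. The talk does not say which serializer the platform uses, so that choice, the metadata keys, and the stand-in model are all assumptions.

```python
import pickle
import platform
from typing import Any, Dict, Tuple


def serialize_with_metadata(model: Any) -> Tuple[bytes, Dict[str, Any]]:
    """Serialize a model object and derive metadata describing how it was serialized."""
    protocol = pickle.HIGHEST_PROTOCOL
    artifact = pickle.dumps(model, protocol=protocol)
    metadata = {
        "serializer": "pickle",  # assumption: the real platform may use something else
        "pickle_protocol": protocol,
        "python_version": platform.python_version(),
        "model_class": f"{type(model).__module__}.{type(model).__qualname__}",
        "size_bytes": len(artifact),
    }
    return artifact, metadata


# Stand-in for a trained model: a plain dict of learned parameters.
model = {"weights": [0.2, 0.8], "bias": 0.1}
artifact, metadata = serialize_with_metadata(model)
restored = pickle.loads(artifact)
```

The metadata is what lets a later consumer know how to deserialize the bytes without asking the DS.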

The Function
Artifact + Shape
- Function inputs
- Function outputs
model=model,
api_input=df_X,
api_output=df_y,
DS passes sample dataframe or specifies type annotations
Platform serializes, represents in custom format
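For the type-annotation route, a function's shape can be read from its signature. This is a minimal sketch using only the standard library; the output layout is an assumption and not the platform's custom format.

```python
import inspect
from typing import Any, Callable, Dict, get_type_hints


def _type_name(tp: Any) -> str:
    # Plain classes such as int or float expose __name__; fall back to str() otherwise.
    return getattr(tp, "__name__", str(tp))


def derive_shape(fn: Callable[..., Any]) -> Dict[str, Dict[str, str]]:
    """Derive input/output type names from a function's type annotations."""
    hints = get_type_hints(fn)
    params = inspect.signature(fn).parameters
    inputs = {name: _type_name(hints[name]) for name in params if name in hints}
    outputs = {"return": _type_name(hints["return"])} if "return" in hints else {}
    return {"inputs": inputs, "outputs": outputs}


# Hypothetical query function, used only to demonstrate the derivation.
def predict(age: int, spend: float) -> float:
    return 0.1 * age + 0.01 * spend


shape = derive_shape(predict)
# shape == {'inputs': {'age': 'int', 'spend': 'float'}, 'outputs': {'return': 'float'}}
```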

The Context
Environment + Index
- Installed packages
- Custom code
- Language + version
import my_custom_fancy_ml_module
model=model,
api_input=df_X,
api_output=df_y,
tags={'canonical_name':'foo-bar'},
# pip_env=['scikit-learn', 'pandas'],  # edge case, only if needed
custom_modules=[my_custom_fancy_ml_module])
Platform automagically derived, or DS passes pointers
DS passes in as needed
Platform automagically derived
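One way a platform could derive the environment automatically is to inspect the running interpreter at save time. The sketch below uses the standard library's importlib.metadata; whether the Model Envelope does it this way is not stated in the talk, and the output keys are assumptions.

```python
import platform
from importlib import metadata
from typing import Dict


def capture_environment() -> Dict[str, object]:
    """Record the language version and every installed package with its version."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip malformed distributions that carry no name
            packages[name] = dist.version
    return {
        "language": "python",
        "language_version": platform.python_version(),
        "packages": dict(sorted(packages.items())),
    }


env = capture_environment()
# Pinned requirement strings, e.g. for building a matching image later.
requirements = [f"{name}=={version}" for name, version in env["packages"].items()]
```

Freezing exact versions at training time is what makes it possible to rebuild the same environment for inference.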

The Context
Environment + Index
- Key-value tags
- Spine/index of envelope registry
import my_custom_fancy_ml_module
model=model,
api_input=df_X,
api_output=df_y,
tags={'canonical_name':'foo-bar'},
custom_modules=[my_custom_fancy_ml_module])
Platform derives base tags
DS passes custom tags as desired
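Because tags act as the index of the registry, looking up a model reduces to a filter on key-value pairs. A minimal sketch, with an in-memory list standing in for the registry; the `region` tag and the record layout are hypothetical.

```python
from typing import Dict, List


def query_by_tags(registry: List[Dict], **wanted: str) -> List[Dict]:
    """Return every envelope whose tags contain all of the requested key-value pairs."""
    return [
        envelope for envelope in registry
        if all(envelope["tags"].get(key) == value for key, value in wanted.items())
    ]


registry = [
    {"id": 1, "tags": {"canonical_name": "foo-bar", "region": "US"}},
    {"id": 2, "tags": {"canonical_name": "foo-bar", "region": "UK"}},
    {"id": 3, "tags": {"canonical_name": "baz", "region": "US"}},
]

matches = query_by_tags(registry, canonical_name="foo-bar", region="US")
```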

The Data
Training Data + Metrics
- Features
- Summary statistics
model=model,
api_input=df_X,
api_output=df_y,
feature_store_pointers=...)
DS (optionally) passes spec for features
Platform derives summary stats from passed data
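Deriving summary stats from the passed data can be as simple as a per-column pass. The sketch below uses plain Python lists in place of a dataframe; the particular statistics chosen are an assumption.

```python
import statistics
from typing import Dict, List


def summarize(columns: Dict[str, List[float]]) -> Dict[str, Dict[str, float]]:
    """Compute simple per-column summary statistics from training data."""
    return {
        name: {
            "count": len(values),
            "mean": statistics.fmean(values),
            "stdev": statistics.pstdev(values),  # population standard deviation
            "min": min(values),
            "max": max(values),
        }
        for name, values in columns.items()
    }


stats = summarize({"age": [20.0, 30.0, 40.0], "spend": [10.0, 10.0, 10.0]})
```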

The Data
- Scalars
- Fancy metrics
model=model,
api_input=df_X,
api_output=df_y,
feature_store_pointers=...)
evaluations = model.predict_proba(df_X)[:, 1]  # positive-class scores; assumes a binary classifier
my_envelope.log_metrics(
    validation_loss=metrics.log_loss(df_y, evaluations),
    roc_curve=metrics.roc_curve(df_y, evaluations))
DS logs metrics using Platform metric-schema library

Online Inference
Approach: Generate and automatically deploy a microservice for model predictions
DS:
1. Generates, tests out service locally
2. Sets up automatic deployment “rule”
3. Publishes model, waits
Platform:
1. Runs cron job to determine models for deployment
2. Generates code to run model microservice
3. Deploys models with config to AWS
4. Monitors/manages model infrastructure
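The periodic "determine models for deployment" step can be thought of as a reconciliation between deployment rules and what is currently live. This is a sketch under assumptions: the rule format, the `published_at` field, and the service names are all hypothetical.

```python
from typing import Dict, List, Optional


def latest_match(registry: List[Dict], rule_tags: Dict[str, str]) -> Optional[Dict]:
    """Return the most recently published envelope matching a rule's tags, if any."""
    matches = [
        e for e in registry
        if all(e["tags"].get(k) == v for k, v in rule_tags.items())
    ]
    return max(matches, key=lambda e: e["published_at"], default=None)


def plan_deployments(registry: List[Dict],
                     rules: Dict[str, Dict[str, str]],
                     deployed: Dict[str, int]) -> Dict[str, int]:
    """Return service name -> model id for every service whose model should change."""
    plan = {}
    for service, rule_tags in rules.items():
        candidate = latest_match(registry, rule_tags)
        if candidate is not None and deployed.get(service) != candidate["id"]:
            plan[service] = candidate["id"]
    return plan


registry = [
    {"id": 1, "published_at": 100, "tags": {"canonical_name": "foo-bar"}},
    {"id": 2, "published_at": 200, "tags": {"canonical_name": "foo-bar"}},
    {"id": 3, "published_at": 150, "tags": {"canonical_name": "baz"}},
]
rules = {
    "foo-bar-service": {"canonical_name": "foo-bar"},
    "baz-service": {"canonical_name": "baz"},
}

# Only foo-bar-service is stale: model 2 was published after model 1 was deployed.
plan = plan_deployments(registry, rules, deployed={"foo-bar-service": 1, "baz-service": 3})
```

With a loop like this running on a schedule, publishing a new model that matches a rule is enough to trigger a rollout.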

Online Inference
The Function
- Serialized artifact loaded on service instantiation, called by endpoints
- Function shape used to create OpenAPI spec/validate inputs
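As an illustration of turning a function shape into a request schema and validating against it, here is a small hand-rolled sketch. It emits a JSON-Schema-style object of the kind an OpenAPI request body uses; the type mapping and error messages are assumptions, and a real service would likely lean on an existing validation library.

```python
from typing import Any, Dict, List

JSON_TYPES = {"int": "integer", "float": "number", "str": "string", "bool": "boolean"}
PYTHON_TYPES = {"integer": int, "number": (int, float), "string": str, "boolean": bool}


def to_request_schema(input_shape: Dict[str, str]) -> Dict[str, Any]:
    """Build a JSON-Schema-style object description from a stored input shape."""
    return {
        "type": "object",
        "properties": {name: {"type": JSON_TYPES[t]} for name, t in input_shape.items()},
        "required": list(input_shape),
        "additionalProperties": False,
    }


def validate(payload: Dict[str, Any], schema: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the payload is valid."""
    errors = [f"missing field: {name}" for name in schema["required"] if name not in payload]
    for name, value in payload.items():
        spec = schema["properties"].get(name)
        if spec is None:
            errors.append(f"unexpected field: {name}")
            continue
        # bool is a subclass of int in Python, so exclude it from numeric fields.
        is_bool = isinstance(value, bool)
        ok = isinstance(value, PYTHON_TYPES[spec["type"]]) and (spec["type"] == "boolean" or not is_bool)
        if not ok:
            errors.append(f"{name}: expected {spec['type']}")
    return errors


schema = to_request_schema({"age": "int", "spend": "float"})
good = validate({"age": 31, "spend": 12.5}, schema)
bad = validate({"age": "31"}, schema)
```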

Online Inference
The Context
- Tag spec used to automatically deploy whenever new model is published
- Note: user never has to call deploy()! Done through system-managed CD.
- Stored package versions used to build docker images
- Custom code made accessible to model for deserialization, execution
[Diagram: CD builds a Docker image containing installed Python packages and custom code]

Online Inference
The Data
- Summary stats used to validate/monitor input (data drift)
- Feature pointer used to load feature data
Feature Store
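A simple way to use stored summary stats for input monitoring is a per-feature z-score check against the training distribution. This is only a sketch: the threshold, the stats layout, and the choice of z-scores as the drift signal are assumptions, not the platform's actual method.

```python
from typing import Dict


def drift_flags(row: Dict[str, float],
                training_stats: Dict[str, Dict[str, float]],
                max_z: float = 3.0) -> Dict[str, bool]:
    """Flag each input feature that lies far from what was seen in training."""
    flags = {}
    for name, value in row.items():
        stats = training_stats[name]
        if stats["stdev"] == 0:
            # Constant feature in training: any other value counts as drift.
            flags[name] = value != stats["mean"]
        else:
            flags[name] = abs(value - stats["mean"]) / stats["stdev"] > max_z
    return flags


training_stats = {
    "age": {"mean": 30.0, "stdev": 8.0},
    "spend": {"mean": 10.0, "stdev": 0.0},
}

# age is 5 standard deviations from the training mean; spend matches training exactly.
flags = drift_flags({"age": 70.0, "spend": 10.0}, training_stats)
```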

Batch Inference
Approach: Generate batch job in Stitch Fix workflow system (on top of Airflow/Flotilla)
DS:
1. Creates config for batch job (local/Spark)
a. tag query to choose model
b. input/output tables
2. Executes as part of ETL
Platform:
1. Spins up Spark cluster (if specified)
2. Loads input data, optionally joins with features
3. Executes model’s predict function over input
4. Saves to output table

Batch Inference
The function
- Serialized artifact loaded on batch job start
- Function shape used to validate against inputs and outputs
- mapPartitions + PyArrow used to run models that take in DFs efficiently on Spark -- abstracted away from user
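The partition-at-a-time idea can be illustrated in plain Python: the model is called once per chunk of rows rather than once per row. This is an analogy only; it does not use Spark or PyArrow, and the function names are hypothetical.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def predict_in_partitions(rows: Iterable[float],
                          predict: Callable[[List[float]], List[float]],
                          partition_size: int) -> Iterator[float]:
    """Apply a batch-oriented predict function to fixed-size chunks of an input stream."""
    iterator = iter(rows)
    while True:
        partition = list(islice(iterator, partition_size))
        if not partition:
            return
        # One model call per partition amortizes per-call overhead across many rows.
        yield from predict(partition)


calls: List[int] = []


def double_all(batch: List[float]) -> List[float]:
    """Stand-in for a model's predict function; records the size of each batch it sees."""
    calls.append(len(batch))
    return [2 * value for value in batch]


predictions = list(predict_in_partitions(range(5), double_all, partition_size=2))
```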

Batch Inference
The context
- Frozen package, language versions used in installing dependencies
- Custom code made accessible to model for deserialization, execution
- Tags used to determine which model to run
[Diagram: Docker image containing installed Python packages and custom code]

Batch Inference
The data
- Feature pointer used to load feature data if IDs specified
- Evaluation table pointers stored in the registry
Feature Store

Metrics Tracking
Approach: Allow for metrics tracking with tag-based querying
DS:
1. Logs metrics using Python client
2. Explores in the Model Operations Dashboard
3. Saves URL for favorite viz
Platform:
1. Builds/manages dashboard
2. Adds fancy new metric types!

Value Added by Separating Concerns
DS concerned with...
- Creating the best model
- Choosing the best libraries
- Determining the right metrics to log
Platform concerned with...
- Making deployment easy
- Ensuring environment in prod == environment in training
- Providing easy metrics analysis
- Wrapping up complex systems
- Behind-the-scenes best practices
DS focuses on creating the best model [writing the recipe]
Platform focuses on optimal infrastructure [cooking it]

Some Ideas...
More advanced use of the data
- Production monitoring: utilize training data/stats to have visibility into prod/training drift
More deployment contexts
- Predictions on streaming/Kafka topics
More sophisticated feature tracking/integration
- Feature stores are all the rage…
Lambda-like architecture
- Rather than requiring a deploy, can we query system for a model’s predictions?
- Requires more unified environments…
Attach external capabilities to replace home-built components of our own system...

Questions?
Find me at:
@elijahbenizzy
linkedin.com/in/elijahbenizzy
elijah.benizzy@stitchfix.com
Try out Stitch Fix → goo.gl/Q3tCQ3
