Autonomy and ownership are core to working at Stitch Fix, particularly on the Algorithms team. We enable data scientists to deploy and operate their models independently, with minimal need for handoffs or gatekeeping. By writing a simple function and calling out to an intuitive API, data scientists can harness a suite of platform-provided tooling meant to make ML operations easy. In this talk, we will dive into the abstractions the Data Platform team has built to enable this. We will go over the interface data scientists use to specify a model and what that hooks into, including online deployment, batch execution on Spark, and metrics tracking and visualization.
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
1. The Function, the Context, and the Data
Building an Abstraction for Simpler ML Ops at Stitch Fix
Elijah ben Izzy
Data Platform Engineer - Model Lifecycle
@elijahbenizzy
linkedin.com/in/elijahbenizzy
Try out Stitch Fix → goo.gl/Q3tCQ3
2.
- Stitch Fix/Data Science (DS) @ Stitch Fix
- Common Workflows/Motivation
- Representing a Model
- Unlocked Capabilities
- Future Musings
Agenda
6. DAIS 2021
Stitch Fix is a Personal Styling Service
Shop at your personal curated store. Check out what you like.
7.
Data Science is Behind Everything We Do
algorithms-tour.stitchfix.com
Algorithms Org.
- 145+ Data Scientists and Platform Engineers
- 3 main verticals + platform
Data Platform
9.
Common Approaches to Data Science
Typical organization:
● Horizontal teams
● Handoffs between functions
● Coordination required
(Diagram: separate data science/research teams, ETL teams, and engineering teams)
10.
Data Scientists (DS) are Full Stack
At Stitch Fix:
● Single organization
● No handoffs
● End-to-end ownership
● Lots of DS!
● Built on top of data platform tools & abstractions
See https://cultivating-algos.stitchfix.com/
(Diagram: each DS owns data science, ETL, and engineering)
12.
The Problem with Verticals
“DS are full stack” != “DS builds the stack from the ground up”
Goal: scale without
-> more complex infrastructure
-> more cognitive burden on DS
DS should always be full stack... but can we shorten the stack?
(Diagram: the ML platform shortens the stack)
13.
Examining Workflows
Training (run at a regular cadence): etl.py -> save on S3 -> copy to production
Inference: model microservice, predictions in batch, streaming predictions
Analysis: track metrics, share with other teams
14.
Optimizing the Workflow
Goal: Build abstraction to give DS all these capabilities for free
Caveat: Largely uniform workflows with independent technologies
(Diagram: a single unknown abstraction “???????” feeding the model microservice, predictions in batch, streaming predictions, metrics tracking, sharing with other teams, ...)
16.
Build or Buy?
We built our own
- Seamless integration with current infrastructure -> leverage
- Model tracking/management data model was not standard
- We have lots of segments/varying ways to slice and dice our models
- Custom build allows for pivoting as needed
- Invest in interface design to allow for plug/play with open-source options
Called it the Model Envelope
Hats off to MLflow, TFX, and ModelDB!
17.
What we Built
DS only writes the training script -- the rest is configuration-driven
import model_envelope as me
from sklearn import linear_model, metrics
df_X, df_y = load_data_somehow()
model = linear_model.LogisticRegression(multi_class='auto')
model.fit(df_X, df_y)
my_envelope = me.save_model(
    instance_name='my_model_instance_name',
    instance_description='my_model_instance_description',
    model=model,
    query_function='predict',
    api_input=df_X,
    api_output=df_y,
    tags={'canonical_name': 'foo-bar'},
)
my_envelope.log_metrics(
    validation_loss=metrics.log_loss(df_y, model.predict_proba(df_X))
)
18.
Model Envelope (ctd.)
(Diagram: the model envelope registry feeds the model microservice, predictions in batch, streaming predictions, metrics tracking, sharing with other teams, ...)
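The registry idea above can be sketched with a toy, in-memory stand-in. The real Model Envelope registry is internal to Stitch Fix and its API is not shown in the talk; the class name and methods below are hypothetical, meant only to illustrate tag-based lookup of the most recent matching model.

```python
import itertools


class ModelRegistry:
    """Toy in-memory stand-in for a model-envelope registry (illustrative only)."""

    def __init__(self):
        self._instances = []
        self._ids = itertools.count(1)

    def save(self, model, tags):
        """Register a model instance; later saves with matching tags supersede earlier ones."""
        instance_id = next(self._ids)
        self._instances.append({"id": instance_id, "model": model, "tags": dict(tags)})
        return instance_id

    def latest(self, **tag_query):
        """Return the most recently saved instance whose tags match the query."""
        matches = [
            inst for inst in self._instances
            if all(inst["tags"].get(k) == v for k, v in tag_query.items())
        ]
        if not matches:
            raise LookupError(f"no model matches {tag_query}")
        return matches[-1]


registry = ModelRegistry()
registry.save("model-v1", tags={"canonical_name": "foo-bar"})
latest_id = registry.save("model-v2", tags={"canonical_name": "foo-bar"})
assert registry.latest(canonical_name="foo-bar")["id"] == latest_id
```

The key design point from the talk is that downstream systems (deployment, batch jobs, dashboards) select models by tag query rather than by hard-coded identifier.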
29.
The Data
Training Data + Metrics
- Features
- Summary statistics
my_envelope = me.save_model(
    instance_name='my_model_instance_name',
    instance_description='my_model_instance_description',
    model=model,
    query_function='predict',
    api_input=df_X,
    api_output=df_y,
    feature_store_pointers=...,
)
DS (optionally) passes spec for features
Platform derives summary stats from passed data
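Deriving summary statistics from the passed training data might look like the following minimal sketch. The function name and the dict-of-columns input format are assumptions for illustration; the actual platform works on the DataFrame passed as `api_input`.

```python
import math


def summarize_features(columns):
    """Derive simple per-feature summary statistics from training data.

    `columns` maps feature name -> list of numeric values, standing in for
    the api_input DataFrame passed to save_model.
    """
    summary = {}
    for name, values in columns.items():
        n = len(values)
        mean = sum(values) / n
        variance = sum((v - mean) ** 2 for v in values) / n
        summary[name] = {
            "count": n,
            "mean": mean,
            "std": math.sqrt(variance),
            "min": min(values),
            "max": max(values),
        }
    return summary


stats = summarize_features({"num_items": [1, 2, 3, 4]})
assert stats["num_items"]["mean"] == 2.5
```

Stats like these are cheap to compute at save time and, as later slides show, become the baseline for production monitoring.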
30.
The Data
Training Data + Metrics
- Scalars
- Fancy metrics
my_envelope = me.save_model(
    instance_name='my_model_instance_name',
    instance_description='my_model_instance_description',
    model=model,
    query_function='predict',
    api_input=df_X,
    api_output=df_y,
    feature_store_pointers=...,
)
evaluations = model.predict_proba(df_X)
my_envelope.log_metrics(
    validation_loss=metrics.log_loss(df_y, evaluations),
    roc_curve=metrics.roc_curve(df_y, evaluations[:, 1]),
)
DS logs metrics using the Platform metric-schema library
32.
Online Inference
Approach: Generate and automatically deploy a microservice for model predictions

Platform:
1. Runs a cron job to determine models for deployment
2. Generates code to run the model microservice
3. Deploys models with config to AWS
4. Monitors/manages model infrastructure

DS:
1. Generates and tests out the service locally
2. Sets up an automatic deployment “rule”
3. Publishes the model, waits
33.
Online Inference
The Function
- Serialized artifact loaded on service instantiation, called by the prediction endpoints
- Function shape used to create the OpenAPI spec and validate inputs
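Deriving an input schema from the query function's shape can be sketched with `inspect.signature`. The `predict` example function and the helper names below are hypothetical; the real platform generates a full OpenAPI spec rather than this simple type check.

```python
import inspect


def predict(age: float, num_purchases: int) -> float:
    """Example query function; its signature defines the service's input schema."""
    return 0.1 * age + 0.5 * num_purchases


def input_schema(fn):
    """Derive a field-name -> type mapping from the function's signature."""
    return {
        name: param.annotation
        for name, param in inspect.signature(fn).parameters.items()
    }


def validate(fn, payload):
    """Reject payloads whose fields don't match the derived schema."""
    schema = input_schema(fn)
    if set(payload) != set(schema):
        raise ValueError(f"expected fields {sorted(schema)}, got {sorted(payload)}")
    for name, value in payload.items():
        if not isinstance(value, schema[name]):
            raise TypeError(f"{name} should be {schema[name].__name__}")


validate(predict, {"age": 34.0, "num_purchases": 12})  # passes
```

Because the schema is derived rather than hand-written, the service's contract cannot drift from the function the DS actually saved.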
34.
Online Inference
The Context
- Tag spec used to automatically deploy whenever a new model is published
- Note: the user never has to call deploy()! Done through system-managed CD.
- Stored package versions used to build Docker images
- Custom code made accessible to the model for deserialization and execution
(Diagram: CD builds a Docker image containing the installed Python packages and custom code)
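Capturing package versions at save time, so the serving image can reproduce the training environment, might look like this minimal sketch (the function name is an assumption; the real platform stores this inside the envelope):

```python
import sys
from importlib import metadata


def capture_environment():
    """Snapshot the interpreter and installed-package versions at model-save
    time, so the platform can later build a matching Docker image for serving."""
    return {
        "python": "{}.{}.{}".format(*sys.version_info[:3]),
        "packages": {
            dist.metadata["Name"]: dist.version
            for dist in metadata.distributions()
        },
    }


env = capture_environment()
# env["python"] is e.g. "3.11.4"; env["packages"] maps package name -> version
```

Freezing the environment with the artifact is what makes "environment in prod == environment in training" (a later slide) achievable without DS effort.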
35.
Online Inference
The Data
- Summary stats used to validate and monitor inputs (data drift)
- Feature pointer used to load feature data from the Feature Store
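A simple way to use the stored summary stats for drift monitoring is a z-score check of live inputs against training statistics. This is an illustrative sketch, not the platform's actual monitoring logic; the function name and threshold are assumptions.

```python
def check_drift(train_stats, live_values, threshold=3.0):
    """Flag features whose live mean is more than `threshold` training
    standard deviations away from the training mean."""
    drifted = []
    for name, stats in train_stats.items():
        values = live_values[name]
        live_mean = sum(values) / len(values)
        if stats["std"] > 0 and abs(live_mean - stats["mean"]) / stats["std"] > threshold:
            drifted.append(name)
    return drifted


train_stats = {"age": {"mean": 35.0, "std": 5.0}}
assert check_drift(train_stats, {"age": [36.0, 34.0]}) == []
assert check_drift(train_stats, {"age": [80.0, 90.0]}) == ["age"]
```

Because the baseline stats were captured automatically at save time, this check costs the DS nothing to enable.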
36.
Batch Inference
Approach: Generate a batch job in the Stitch Fix workflow system (built on top of Airflow/Flotilla)

Platform:
1. Spins up a Spark cluster (if specified)
2. Loads input data, optionally joins it with features
3. Executes the model’s predict function over the input
4. Saves the results to the output table

DS:
1. Creates config for the batch job (local/Spark)
   a. tag query to choose the model
   b. input/output tables
2. Executes it as part of ETL
37.
Batch Inference
The Function
- Serialized artifact loaded on batch-job start
- Function shape used to validate inputs and outputs
- mapPartitions + PyArrow used to efficiently run DataFrame-based models on Spark -- abstracted away from the user
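The mapPartitions trick above is about batching a partition's rows into one DataFrame so a DataFrame-native model runs one vectorized call per partition instead of one call per row. Here is a minimal sketch of just the per-partition function, with the PySpark and PyArrow glue omitted; the helper name and the toy model are assumptions for illustration.

```python
import pandas as pd


def make_partition_predictor(model):
    """Build the function handed to Spark's mapPartitions: batch the
    partition's rows into a single DataFrame, run one vectorized model
    call, and yield the rows back with a prediction column attached."""
    def predict_partition(rows):
        batch = pd.DataFrame(list(rows))
        if batch.empty:
            return
        batch["prediction"] = model(batch)
        yield from batch.to_dict("records")
    return predict_partition


# Toy "model": score is a linear function of two columns.
model = lambda df: 0.5 * df["x"] + df["y"]
rows = [{"x": 2.0, "y": 1.0}, {"x": 4.0, "y": 0.0}]
out = list(make_partition_predictor(model)(iter(rows)))
assert [r["prediction"] for r in out] == [2.0, 2.0]
```

In the real platform this batching is generated for the DS, so a model written against pandas DataFrames runs on Spark unchanged.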
38.
Batch Inference
The Context
- Frozen package and language versions used to install dependencies
- Custom code made accessible to the model for deserialization and execution
- Tags used to determine which model to run
(Diagram: a Docker image containing the installed Python packages and custom code)
39.
Batch Inference
The Data
- Feature pointer used to load feature data from the Feature Store if IDs are specified
- Evaluation-table pointers stored in the registry
40.
Metrics Tracking
Approach: Allow for metrics tracking with tag-based querying

Platform:
1. Builds and manages the dashboard
2. Adds fancy new metric types!

DS:
1. Logs metrics using the Python client
2. Explores them in the Model Operations Dashboard
3. Saves the URL for a favorite viz
45.
Value Added by Separating Concerns

DS concerned with...
- Creating the best model
- Choosing the best libraries
- Determining the right metrics to log

Platform concerned with...
- Making deployment easy
- Ensuring environment in prod == environment in training
- Providing easy metrics analysis
- Wrapping up complex systems
- Behind-the-scenes best practices

DS focuses on creating the best model [writing the recipe]
Platform focuses on optimal infrastructure [cooking it]
47.
Some Ideas...
More advanced use of the data
- production monitoring: utilize training data/stats to have visibility into prod/training drift
More deployment contexts
- Predictions on streaming/kafka topics
More sophisticated feature tracking/integration
- Feature stores are all the rage…
Lambda-like architecture
- Rather than requiring a deploy, can we query the system for a model’s predictions?
- Requires more unified environments…
Attach external capabilities to replace home-built components of our own system...