The Function, the Context, and the Data
Building an Abstraction for Simpler ML Ops at Stitch Fix
Elijah ben Izzy
Data Platform Engineer - Model Lifecycle
@elijahbenizzy
linkedin.com/in/elijahbenizzy
Try out Stitch Fix → goo.gl/Q3tCQ3
Agenda
- Stitch Fix/Data Science (DS) @ Stitch Fix
- Common Workflows/Motivation
- Representing a Model
- Unlocked Capabilities
- Future Musings
Take Home
The right abstraction enables separation of concerns between DS and Platforms
DAIS 2021 4
whoami
Stitch Fix?
Stitch Fix is a Personal Styling Service
Shop at your personal curated store. Check out what you like.
Data Science is Behind Everything We Do
algorithms-tour.stitchfix.com
Algorithms Org.
- 145+ Data Scientists and Platform Engineers
- 3 main verticals + platform
Data Platform
Data Science
@ Stitch Fix
Common Approaches to Data Science
Typical organization:
● Horizontal teams
● Hand-offs between functions
● Coordination required
DATA SCIENCE /
RESEARCH TEAMS
ETL TEAMS
ENGINEERING TEAMS
Data Scientists (DS) are Full Stack
At Stitch Fix:
● Single organization
● No handoffs
● End-to-end ownership
● Lots of DS!
● Built on top of data platform tools & abstractions
See https://cultivating-algos.stitchfix.com/
DATA SCIENCE
ETL
ENGINEERING
The Problem
The Problem with Verticals
“DS are full stack” != “DS builds stack from the ground up”
Goal: scale without
- more complex infrastructure
- more cognitive burden on DS
DS should always be full stack... ...but can we shorten the stack?
ML platform
Examining Workflows
[Diagram: three workflow stages]
Training (run at a regular cadence): etl.py, save on S3, copy to production
Inference: model microservice, predictions in batch, streaming predictions
Analysis: track metrics, share with other teams
Optimizing the Workflow
Goal: Build abstraction to give DS all these capabilities for free
Caveat: Largely uniform workflows with independent technologies
[Diagram: ??????? feeding into model microservice, predictions in batch, streaming predictions, track metrics, share with other teams, ...]
The Lede
Build or Buy?
We built our own
- Seamless integration with current infrastructure -> leverage
- Model tracking/management data model was not standard
- We have lots of segments/varying ways to slice and dice our models
- Custom build allows for pivoting as needed
- Invest in interface design to allow for plug/play with open-source options
Called it the Model Envelope
Hats off to MLflow, TFX, ModelDB!
What we Built
DS only writes training script -- the rest is configuration-driven
import model_envelope as me
from sklearn import linear_model, metrics

df_X, df_y = load_data_somehow()
model = linear_model.LogisticRegression(multi_class='auto')
model.fit(df_X, df_y)

my_envelope = me.save_model(instance_name='my_model_instance_name',
                            instance_description='my_model_instance_description',
                            model=model,
                            query_function='predict',
                            api_input=df_X, api_output=df_y,
                            tags={'canonical_name': 'foo-bar'})
# log_loss expects (true labels, predicted probabilities), not (features, labels)
my_envelope.log_metrics(validation_loss=metrics.log_loss(df_y, model.predict_proba(df_X)))
Model Envelope (ctd.)
[Diagram: model envelope registry feeding into model microservice, predictions in batch, streaming predictions, track metrics, share with other teams, ...]
Representing a Model
Writing a Recipe
The instructions
The cookware
The ingredients
Representing a Model
The function: what the model does
The context: where/how to run the model
The data: data the model needs to run
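One way to picture the three parts is as a plain record. The sketch below is illustrative only: the class and field names (`ModelFunction`, `ModelContext`, `ModelData`, `Envelope`) and the example values are assumptions made for this example, not the Model Envelope's actual data model.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class ModelFunction:
    """What the model does: the serialized artifact plus its input/output shape."""
    artifact: bytes
    query_function: str
    input_schema: Dict[str, str]
    output_schema: Dict[str, str]


@dataclass
class ModelContext:
    """Where/how to run the model: the environment plus the tags that index it."""
    python_version: str
    packages: Dict[str, str]
    tags: Dict[str, str] = field(default_factory=dict)


@dataclass
class ModelData:
    """Data the model needs to run, plus data describing it: stats and metrics."""
    feature_pointers: List[str] = field(default_factory=list)
    summary_stats: Dict[str, Dict[str, float]] = field(default_factory=dict)
    metrics: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Envelope:
    """Hypothetical container tying the three parts together under one name."""
    instance_name: str
    function: ModelFunction
    context: ModelContext
    data: ModelData


# Placeholder values throughout; b"..." stands in for real serialized bytes.
envelope = Envelope(
    instance_name="my_model_instance_name",
    function=ModelFunction(b"...", "predict", {"age": "float"}, {"score": "float"}),
    context=ModelContext("3.9", {"scikit-learn": "1.0"}, {"canonical_name": "foo-bar"}),
    data=ModelData(),
)
```

Keeping the three parts as separate records mirrors the recipe analogy: each downstream capability can read only the part it needs.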
The Function
Artifact + Shape
- Serialized model (bytes) including state
- Serialization metadata
my_envelope = me.save_model(instance_name='my_model_instance_name',
                            instance_description='my_model_instance_description',
                            model=model,
                            query_function='predict',
                            api_input=df_X,
                            api_output=df_y,
                            tags={'canonical_name': 'foo-bar'})
DS passes object, platform serializes
Platform derives metadata
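A minimal sketch of the artifact half, assuming `pickle` as the serializer: the platform takes the object it is handed, turns it into bytes, and records metadata about how it did so. The function name `serialize_model` and the metadata keys are hypothetical, and a fitted `statistics.NormalDist` stands in for a trained model so the example needs only the standard library.

```python
import hashlib
import pickle
import statistics
import sys
from typing import Any, Dict, Tuple


def serialize_model(model: Any) -> Tuple[bytes, Dict[str, Any]]:
    """Serialize a model object and derive metadata describing the serialization."""
    artifact = pickle.dumps(model, protocol=pickle.HIGHEST_PROTOCOL)
    metadata = {
        "serializer": "pickle",
        "pickle_protocol": pickle.HIGHEST_PROTOCOL,
        "python_version": "{}.{}.{}".format(*sys.version_info[:3]),
        "model_class": f"{type(model).__module__}.{type(model).__qualname__}",
        "size_bytes": len(artifact),
        "sha256": hashlib.sha256(artifact).hexdigest(),
    }
    return artifact, metadata


# Toy "trained model": a normal distribution fitted to samples (its state is mu/sigma).
model = statistics.NormalDist.from_samples([1.0, 2.0, 3.0])
artifact, metadata = serialize_model(model)

# Deserializing restores the object together with its learned state.
restored = pickle.loads(artifact)
```

The data scientist only ever touches `model`; everything in `metadata` is derived, which is the division of labor the slide describes.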
The Function
Artifact + Shape
- Function inputs
- Function outputs
my_envelope = me.save_model(instance_name='my_model_instance_name',
                            instance_description='my_model_instance_description',
                            model=model,
                            query_function='predict',
                            api_input=df_X,
                            api_output=df_y,
                            tags={'canonical_name': 'foo-bar'})
DS passes sample dataframe or specifies type-annotations
Platform serializes, represents in custom format
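The type-annotation route can be sketched with the standard library alone. This is an assumption about how such derivation might look, not the platform's custom format: `describe_shape` and the toy `predict` function are hypothetical names, and the output is a plain dictionary of type names.

```python
import inspect
from typing import Dict, List, get_type_hints


def _type_name(tp) -> str:
    """Render a type as a readable string, e.g. 'float' or 'typing.List[float]'."""
    return tp.__name__ if isinstance(tp, type) else str(tp)


def describe_shape(func) -> Dict[str, Dict[str, str]]:
    """Derive a simple input/output description from a function's type annotations."""
    hints = get_type_hints(func)
    output = hints.pop("return", None)
    params = inspect.signature(func).parameters
    # Keep parameter order, and skip parameters that carry no annotation.
    inputs = {name: _type_name(hints[name]) for name in params if name in hints}
    return {"inputs": inputs, "outputs": {"return": _type_name(output)}}


def predict(age: float, past_purchases: int) -> List[float]:
    """Toy query function standing in for a model's predict method."""
    return [0.1 * age + past_purchases]


shape = describe_shape(predict)
```

A sample dataframe would yield the same kind of description from its column names and dtypes instead of from annotations.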
The Context
Environment + Index
- Installed packages
- Custom code
- Language + version
import my_custom_fancy_ml_module

my_envelope = me.save_model(instance_name='my_model_instance_name',
                            instance_description='my_model_instance_description',
                            model=model,
                            query_function='predict',
                            api_input=df_X,
                            api_output=df_y,
                            tags={'canonical_name': 'foo-bar'},
                            # pip_env=['scikit-learn', 'pandas'],  # edge case if needed
                            custom_modules=[my_custom_fancy_ml_module])
Installed packages: platform derives automagically, or DS passes pointers
Custom code: DS passes in as needed
Language + version: platform derives automagically
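Deriving the environment automatically might look like the sketch below, which snapshots the interpreter version and installed distributions with `importlib.metadata`. The function names and the returned dictionary layout are assumptions for illustration; the source does not say how the platform captures this.

```python
import platform
from importlib import metadata
from typing import Dict


def capture_environment() -> Dict[str, object]:
    """Snapshot the language version and installed packages for later replay."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions whose metadata has no name
            packages[name] = dist.version
    return {
        "language": "python",
        "language_version": platform.python_version(),
        "packages": dict(sorted(packages.items())),
    }


def to_requirements(env: Dict[str, object]) -> str:
    """Render pinned requirements, e.g. for installing into a Docker image."""
    return "\n".join(f"{name}=={version}" for name, version in env["packages"].items())


env = capture_environment()
requirements_txt = to_requirements(env)
```

Pinning exact versions at save time is what later makes "environment in prod == environment in training" achievable.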
The Context
Environment + Index
- Key-value tags
- Spine/index of envelope registry
import my_custom_fancy_ml_module

my_envelope = me.save_model(instance_name='my_model_instance_name',
                            instance_description='my_model_instance_description',
                            model=model,
                            query_function='predict',
                            api_input=df_X,
                            api_output=df_y,
                            tags={'canonical_name': 'foo-bar'},
                            custom_modules=[my_custom_fancy_ml_module])
Platform derives base tags
DS passes custom tags as desired
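To make "tags as the index of the registry" concrete, here is a small in-memory sketch. The `Registry` class, its methods, and the `region` tag are hypothetical; the real registry's storage and query interface are not described in the source.

```python
from typing import Dict, List, Optional


class Registry:
    """Hypothetical in-memory registry that indexes envelopes by key-value tags."""

    def __init__(self) -> None:
        self._entries: List[Dict[str, object]] = []

    def publish(self, instance_name: str, tags: Dict[str, str]) -> None:
        # Entries are appended in publication order, so later means newer.
        self._entries.append({"instance_name": instance_name, "tags": dict(tags)})

    def query(self, **tag_query: str) -> List[str]:
        """Return names of all envelopes whose tags match every key/value given."""
        return [
            entry["instance_name"]
            for entry in self._entries
            if all(entry["tags"].get(k) == v for k, v in tag_query.items())
        ]

    def latest(self, **tag_query: str) -> Optional[str]:
        """Return the most recently published match, or None if nothing matches."""
        matches = self.query(**tag_query)
        return matches[-1] if matches else None


registry = Registry()
registry.publish("model_v1", {"canonical_name": "foo-bar", "region": "US"})
registry.publish("model_v2", {"canonical_name": "foo-bar", "region": "UK"})
registry.publish("model_v3", {"canonical_name": "foo-bar", "region": "US"})
registry.publish("other_v1", {"canonical_name": "baz", "region": "US"})
```

A query such as `registry.latest(canonical_name="foo-bar", region="US")` is the kind of tag spec that deployment rules and batch jobs can use to pick a model without naming a specific instance.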
The Data
Training Data + Metrics
- Features
- Summary statistics
my_envelope = me.save_model(instance_name='my_model_instance_name',
                            instance_description='my_model_instance_description',
                            model=model,
                            query_function='predict',
                            api_input=df_X,
                            api_output=df_y,
                            feature_store_pointers=...)
DS (optionally) passes spec for features
Platform derives summary stats from passed data
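Deriving summary statistics from the passed data could be as simple as the sketch below. It assumes numeric columns held as plain lists (a stand-in for a dataframe), and the choice of statistics and the `summarize` name are illustrative rather than taken from the platform.

```python
import statistics
from typing import Dict, List


def summarize(columns: Dict[str, List[float]]) -> Dict[str, Dict[str, float]]:
    """Compute per-column summary statistics from training data."""
    return {
        name: {
            "count": len(values),
            "mean": statistics.fmean(values),
            # Sample standard deviation needs at least two points.
            "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
            "min": min(values),
            "max": max(values),
        }
        for name, values in columns.items()
    }


training_data = {"age": [20.0, 30.0, 40.0], "past_purchases": [1.0, 3.0, 5.0]}
stats = summarize(training_data)
```

Storing these alongside the model gives later stages a baseline to compare production inputs against.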
The Data
Training Data + Metrics
- Scalars
- Fancy metrics
my_envelope = me.save_model(instance_name='my_model_instance_name',
                            instance_description='my_model_instance_description',
                            model=model,
                            query_function='predict',
                            api_input=df_X,
                            api_output=df_y,
                            feature_store_pointers=...)
# Positive-class probabilities (assumes a binary classifier)
evaluations = model.predict_proba(df_X)[:, 1]
my_envelope.log_metrics(
    validation_loss=metrics.log_loss(df_y, evaluations),
    roc_curve=metrics.roc_curve(df_y, evaluations))
DS logs metrics using Platform metric-schema library
Unlocked Capabilities
Online Inference
Approach: generate and automatically deploy a microservice for model predictions
DS
1. Generates, tests out service locally
2. Sets up automatic deployment “rule”
3. Publishes model, waits
Platform
1. Runs cron job to determine models for deployment
2. Generates code to run model microservice
3. Deploys models with config to AWS
4. Monitors/manages model infrastructure
Online Inference
The Function
- Serialized artifact loaded on service instantiation, called from endpoints
- Function shape used to create OpenAPI spec/validate inputs
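As a rough illustration of turning a stored function shape into a request schema and validating against it, the sketch below builds a JSON Schema object of the kind OpenAPI uses for request bodies. The shape format (column name to Python type name) and both function names are assumptions; the platform's actual spec generation is not shown in the source.

```python
from typing import Any, Dict, List

# Map the Python type names used in the stored shape to JSON Schema types.
_JSON_TYPES = {"float": "number", "int": "integer", "str": "string", "bool": "boolean"}
_PY_TYPES = {"number": (int, float), "integer": (int,), "string": (str,), "boolean": (bool,)}


def to_request_schema(input_shape: Dict[str, str]) -> Dict[str, Any]:
    """Build a JSON Schema object describing the request body for a model endpoint."""
    return {
        "type": "object",
        "properties": {name: {"type": _JSON_TYPES[t]} for name, t in input_shape.items()},
        "required": sorted(input_shape),
    }


def validate(payload: Dict[str, Any], schema: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the payload is valid."""
    errors = [f"missing field: {name}" for name in schema["required"] if name not in payload]
    for name, value in payload.items():
        spec = schema["properties"].get(name)
        if spec is None:
            errors.append(f"unexpected field: {name}")
        elif isinstance(value, bool) and spec["type"] != "boolean":
            # bool is a subclass of int in Python, so reject it explicitly.
            errors.append(f"{name}: expected {spec['type']}")
        elif not isinstance(value, _PY_TYPES[spec["type"]]):
            errors.append(f"{name}: expected {spec['type']}")
    return errors


schema = to_request_schema({"age": "float", "past_purchases": "int"})
good = validate({"age": 31.5, "past_purchases": 4}, schema)
bad = validate({"age": "old"}, schema)
```

Because the schema comes from the saved shape, the service can reject malformed requests without the data scientist writing any validation code.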
Online Inference
The Context
- Tag spec used to automatically deploy whenever new model is published
- Note: user never has to call deploy()! Done through system-managed CD.
- Stored package versions used to build docker images
- Custom code made accessible to model for deserialization, execution
[Diagram: CD builds a Docker image containing installed Python packages and custom code]
Online Inference
The Data
- Summary stats used to validate/monitor input (data drift)
- Feature pointer used to load feature data
Feature Store
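One simple way stored summary stats could be used to monitor inputs is a z-score on the mean of a live batch. This check is a stand-in chosen for brevity, not the platform's drift method, and the function name, threshold, and sample numbers are all assumptions.

```python
import math
import statistics
from typing import Dict, List


def drifted_columns(
    training_stats: Dict[str, Dict[str, float]],
    live_batch: Dict[str, List[float]],
    z_threshold: float = 3.0,
) -> List[str]:
    """Flag columns whose live mean is far from the training mean.

    Uses a z-score on the batch mean:
    |live_mean - train_mean| / (train_stdev / sqrt(n)).
    """
    flagged = []
    for name, values in live_batch.items():
        if not values:
            continue  # nothing observed for this column
        train = training_stats[name]
        standard_error = train["stdev"] / math.sqrt(len(values))
        if standard_error == 0:
            continue  # constant training column; this simple check cannot score it
        z = abs(statistics.fmean(values) - train["mean"]) / standard_error
        if z > z_threshold:
            flagged.append(name)
    return flagged


training_stats = {
    "age": {"mean": 30.0, "stdev": 10.0},
    "past_purchases": {"mean": 3.0, "stdev": 2.0},
}
live_batch = {
    "age": [55.0, 60.0, 58.0, 62.0],         # much older than the training population
    "past_purchases": [2.0, 4.0, 3.0, 3.0],  # close to the training population
}
flagged = drifted_columns(training_stats, live_batch)
```

Here only `age` is flagged: its batch mean of 58.75 sits 5.75 standard errors from the training mean, while `past_purchases` matches the training mean exactly.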
Batch Inference
Approach: generate batch job in Stitch Fix workflow system (on top of Airflow/Flotilla)
DS
1. Creates config for batch job (local/spark)
a. tag query to choose model
b. input/output tables
2. Executes as part of ETL
Platform
1. Spins up Spark cluster (if specified)
2. Loads input data, optionally joins with features
3. Executes model’s predict function over input
4. Saves to output table
Batch Inference
The function
- Serialized artifact loaded on batch job start
- Function shape used to validate against inputs and outputs
- mapPartitions + PyArrow used to run models that take in DFs efficiently on Spark -- abstracted away from user
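The per-partition pattern can be illustrated without Spark. The sketch below is a pure-Python stand-in, not the platform's Spark code: it deserializes the model once per partition and then scores rows in chunks, which is the cost structure that `mapPartitions` makes possible. The function name, chunk size, and the `statistics.NormalDist` toy model are assumptions.

```python
import pickle
import statistics
from itertools import islice
from typing import Iterable, Iterator, List


def predict_partition(
    artifact: bytes, rows: Iterable[float], chunk_size: int = 2
) -> Iterator[float]:
    """Score one partition: deserialize the model once, then predict chunk by chunk."""
    model = pickle.loads(artifact)  # paid once per partition, not once per row
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            break
        # A real model would receive the chunk as a dataframe; here each value is scored.
        yield from (model.cdf(value) for value in chunk)


# Toy "model": a standard normal distribution queried through its cdf method.
artifact = pickle.dumps(statistics.NormalDist(mu=0.0, sigma=1.0))
partitions: List[List[float]] = [[-1.0, 0.0, 1.0], [0.0]]
predictions = [list(predict_partition(artifact, part)) for part in partitions]
```

In an actual Spark job the chunks would arrive as Arrow-backed batches, so a model that expects dataframes can be called on each batch directly.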
Batch Inference
The context
- Frozen package and language versions used to install dependencies
- Custom code made accessible to model for deserialization, execution
- Tags used to determine which model to run
[Diagram: Docker image containing installed Python packages and custom code]
Batch Inference
The data
- Feature pointer used to load feature data if IDs specified
- Evaluation table pointers stored in the registry
Feature Store
Metrics Tracking
Approach: allow for metrics tracking with tag-based querying
DS
1. Logs metrics using Python client
2. Explores in the Model Operations Dashboard
3. Saves URL for favorite viz
Platform
1. Builds/manages dashboard
2. Adds fancy new metric types!
In Summation
Value Added by Separating Concerns
DS concerned with...
- Creating the best model
- Choosing the best libraries
- Determining the right metrics to log
Platform concerned with...
- Making deployment easy
- Ensuring environment in prod == environment in training
- Providing easy metrics analysis
- Wrapping up complex systems
- Behind-the-scenes best practices
DS focuses on creating the best model [writing the recipe]
Platform focuses on optimal infrastructure [cooking it]
Future Musings
Some Ideas...
More advanced use of the data
- production monitoring: utilize training data/stats to have visibility into prod/training drift
More deployment contexts
- Predictions on streaming/Kafka topics
More sophisticated feature tracking/integration
- Feature stores are all the rage…
Lambda-like architecture
- Rather than requiring a deploy, can we query system for a model’s predictions?
- Requires more unified environments…
Attach external capabilities to replace home-built components of our own system...
Questions?
Find me at:
@elijahbenizzy
linkedin.com/in/elijahbenizzy
elijah.benizzy@stitchfix.com
Try out Stitch Fix → goo.gl/Q3tCQ3
