The document discusses Stitch Fix's efforts to transform visibility into recommendations customers will love through machine learning. It summarizes the development of their Design the Line architecture, including model training, featurization, prediction, and deployment processes. It also discusses learnings around ways of working like steel thread development, code standards, and prioritizing people. The goal is to scale recommendations by leveraging internal ML products and integrating ML into operations for more efficient buying decisions.
8. 2020 Hired the first person into the role of ML
Integration. This role has been a foundational unlock in
designing ML systems.
About the Role
This role is responsible for unlocking business
opportunities for Stitch Fix to more efficiently grow
merchandise by leveraging in house ML products. On a
daily basis, this may involve researching how merchandise
is purchased for Stitch Fix or coding customizations to our
existing ML products to enable new use cases. This role
will involve both a solid understanding of machine learning
products from features to evaluation, and the creativity to
see how ML can be integrated into Stitch Fix for better
buying decisions and more efficient operations.
ML Integration
9. Set a Standard of Development
The standard doesn’t have to be the highest bar, but uniformity is a good baseline
Code Standards:
○ PEP 8/Black/Lint/etc
○ Google Python Style Guide
○ Documentation/Sphinx
Testing:
○ Unit/Integration/%
○ Deployment processes
Code Reviews:
○ Primary/Secondary Reviewers
○ Size of a Code Review
Blocker Resolution & Feedback Processes
10. Steel Thread v Modular Development
Modular Development
○ Create an overall architecture map
○ Mock out endpoints
○ Build deep within each module
○ Connect modules all together at the end
○ Release with a fully fledged product
Steel Thread
○ Create an overall architecture map
○ Mock out endpoints
○ Build bare minimum for each module
○ Connect modules as quickly as possible
○ Release with a ‘make it barely work’ product
○ Rapid tuning of bottlenecks for a ‘it works’
product
○ Long term investment in upgrading modules
Boehm's Spiral???
11. Modular Development
○ Great for known complexity
○ Good ROI of development
○ Increasingly Available
Steel Thread
○ Quicker release of major milestones
○ Laser focus
○ Requires
Steel Thread v Modular Development
12. Enable Focus:
○ Daily Stand Ups
○ Complete List of Everything that Needed to Be Built
○ Steel Thread naturally lends itself to bite sized
tasks
○ Use low uncertainty solutions
○ Increased Pair Coding over Code Reviews
○ Clear Cross Functional Buy In
○ Mocked Out Endpoints Between Teams
Take Care of People:
○ Rotating ‘adjustment’ PTO
○ Mental Health Days
○ Pre-emptive No Meeting Days
○ Customize to what people need
○ “It’s okay to be happy at work. It’s okay to enjoy
being good at what you do.”
○ Increased online social interactions, lunches, etc
2020 Steel Thread Support
14. Stages of Development:
- Count level metrics
- Ratio metrics
- Domain Specific Business Value metrics
- Historically corrected labels
- [Wishlist] Distribution of a metric's labels
Gotchas: Expect rapid schema changes as client
metadata, business context, and metrics evolve
with the business.
Labeler
[Diagram labels: Labeler, Client Sales, Client Metadata]
15. Metric stability is a function of different levels of certainty. Fashion (like the stock market) has high levels of chaotic
influence, much higher than many areas of tech.
Manage this by adding 2nd and 3rd moment metrics to gauge the stability of predictions in production, i.e., not just absolute
loss, but also the standard deviation of the error and higher moments.
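A minimal sketch of tracking more than absolute loss, assuming predictions and actuals arrive as plain lists of numbers. The function name `error_moments` and the sample values are hypothetical, chosen only for illustration.

```python
import math


def error_moments(y_true, y_pred):
    """Return mean absolute error plus the 2nd and 3rd moments of the error."""
    errors = [p - t for t, p in zip(y_true, y_pred)]
    n = len(errors)
    mean = sum(errors) / n
    mae = sum(abs(e) for e in errors) / n
    # 2nd central moment (population variance) and its square root
    variance = sum((e - mean) ** 2 for e in errors) / n
    std = math.sqrt(variance)
    # 3rd standardized moment (skewness); defined as 0.0 when errors have no spread
    skew = 0.0 if std == 0 else (sum((e - mean) ** 3 for e in errors) / n) / std ** 3
    return {"mae": mae, "std": std, "skew": skew}


# Illustrative values only: one large over-prediction skews the errors positive
weekly = error_moments([10, 12, 9, 11], [11, 12, 8, 15])
```

Two releases with the same absolute loss can differ sharply in `std` and `skew`, which is the instability the extra moments are meant to surface.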
Labeler
Columns: Deterministic Influence | Probabilistic Influence | Chaotic Influence
Known: Victory Lap | Continuous Development | Use in Confidence Bounds
Unknown - measurable: Roadmap | Roadmap
Unknown - unmeasurable:
16. In Steel Thread development, pick a feature family covering each of the main
types of data {categorical, numerical, image, text} to put strong connectors
in place between each of the components. If the connectors are strong, then
additional feature families can be added at a later date without breaking
downstream data type assumptions.
Gotchas: Client Input features are calculated on a different timeline than
ML computed features. Handle by allowing null features to be returned and
taken into account at the model routing stage.
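One way the null-feature handling above could look, as a sketch only. The feature names (`ml_score`, `client_input_score`), the two toy models, and their weights are all hypothetical; the point is that routing inspects which features are present before a model is chosen.

```python
from typing import Callable, Dict, Optional

Features = Dict[str, Optional[float]]


def model_with_client_input(f: Features) -> float:
    # Hypothetical model that uses both feature families
    return 0.7 * f["ml_score"] + 0.3 * f["client_input_score"]


def model_ml_only(f: Features) -> float:
    # Hypothetical fallback model for when client input has not arrived yet
    return f["ml_score"]


def route(features: Features) -> Callable[[Features], float]:
    """Pick a model based on which features are non-null."""
    if features.get("client_input_score") is None:
        return model_ml_only
    return model_with_client_input


def predict(features: Features) -> float:
    return route(features)(features)
```

Because the featurization step is allowed to return `None`, the slower client-input timeline never blocks a prediction; it only changes which model serves it.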
Featurization
[Diagram labels: Image Service, Featurization, Client Input, Priors]
17. Why do embeddings work?
○ There’s a lot of space in high dimensions. The
probability that adding a set of vectors together lands
near a given point by chance is extraordinarily low.
What is a meaningful level of near in a high dimensional
space?
○ Use the variation of known similar vectors to
create a localized meaningful distance threshold.
[Tunkelang]
Featurization - Embeddings
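A rough sketch of a localized distance threshold, assuming cosine distance and a small set of pairs already known to be similar. The rule used here (mean plus `k` standard deviations of the known-similar distances) is one possible reading of the idea, not necessarily the cited approach; `k` and the sample vectors are hypothetical.

```python
import math
import statistics


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def local_threshold(similar_pairs, k=2.0):
    """Derive a 'near' cutoff from the spread of distances between known-similar pairs."""
    dists = [cosine_distance(a, b) for a, b in similar_pairs]
    return statistics.mean(dists) + k * statistics.pstdev(dists)


def is_near(a, b, threshold):
    return cosine_distance(a, b) <= threshold


# Illustrative 3-d vectors; real embeddings would have far more dimensions
known_similar = [
    ([1.0, 0.0, 0.1], [1.0, 0.1, 0.0]),
    ([0.9, 0.1, 0.0], [1.0, 0.0, 0.0]),
    ([1.0, 0.2, 0.0], [1.0, 0.0, 0.1]),
]
threshold = local_threshold(known_similar)
```

The threshold is only meaningful in the neighborhood the known-similar pairs came from, so a different region of the space would get its own cutoff.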
18. To prevent time travel, create a “memoryless circuit” at training time,
where only the information that would be known at inference time is
known about the training data.
Common Forms of Time Travel:
- Randomly assigned test & train data sets
- Duplicate records of varying degrees
- Features calculated off of current tables v historical snapshots
Data Set Creation
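A small sketch of guarding against the first two forms of time travel, assuming each record carries an `event_date` and an `item_id` (both field names are hypothetical). The split is by time rather than at random, and test records whose item already appears in training are dropped.

```python
from datetime import date


def time_split(records, cutoff):
    """Train on everything before the cutoff, test on everything at or after it."""
    train = [r for r in records if r["event_date"] < cutoff]
    test = [r for r in records if r["event_date"] >= cutoff]
    return train, test


def drop_seen_items(train, test, key="item_id"):
    """Remove test records that duplicate an item already present in training."""
    seen = {r[key] for r in train}
    return [r for r in test if r[key] not in seen]


records = [
    {"item_id": "a", "event_date": date(2020, 1, 5)},
    {"item_id": "b", "event_date": date(2020, 2, 10)},
    {"item_id": "a", "event_date": date(2020, 3, 1)},   # repeat of an item seen in training
    {"item_id": "c", "event_date": date(2020, 3, 15)},
]
train, test = time_split(records, cutoff=date(2020, 3, 1))
test = drop_seen_items(train, test)
```

The third form, features computed from current tables, needs the same discipline on the feature side: join against a snapshot taken as of each record's date rather than the live table.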
19. This is the fun stuff. Try to build with an interchangeable parts
mindset to enable rapid iteration.
Gotchas
- Choosing a common set of tooling and approaches will
enable more dynamic resourcing for sprints
- Using default parameters that slow down the pipeline
for no improvement in accuracy
Model Training
[Diagram labels: Dimensionality Reduction, Calibration, Model Training {GBDT, Regression, etc}, Hyperparameter Optimization]
20. UMAP
○ Faster than T-SNE
○ Biased towards preserving short distances at the
expense of ignoring large distances
T-SNE
○ Groundbreaking way to visualize high dimensional data
over large datasets
○ Preserves local neighborhoods at the expense of large
(global) distances
PCA
○ Doesn’t do well with the cloudiness of large, high
dimensional data sets. If the dimensionality is large
enough nearly all points are equidistant.
LDA (may not be applicable)
Dimensionality Reduction
21. In practice, the dimensionality reduction step is a hybrid
approach with features being grouped for different levels of
compression.
I.e., price features should not be compressed, but embedding
features should be.
Dimensionality Reduction (in practice)
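A sketch of grouping features for different levels of compression. To stay dependency-free it uses a Gaussian random projection as a stand-in compressor, which is a swapped-in technique and not one of the methods on the previous slide; in practice the same grouping would wrap whichever reducer is chosen. The group names and target dimensions are hypothetical.

```python
import random


def random_projection(vectors, out_dim, seed=0):
    """Compress vectors with a Gaussian random projection (stand-in compressor)."""
    rng = random.Random(seed)
    in_dim = len(vectors[0])
    scale = 1.0 / out_dim ** 0.5
    matrix = [[rng.gauss(0.0, 1.0) * scale for _ in range(in_dim)]
              for _ in range(out_dim)]
    return [[sum(w * x for w, x in zip(row, v)) for row in matrix] for v in vectors]


def reduce_by_group(rows, groups):
    """groups maps a feature-group name to a target dimension, or None to pass through."""
    out = [[] for _ in rows]
    for name, out_dim in groups.items():
        block = [row[name] for row in rows]
        if out_dim is not None:
            block = random_projection(block, out_dim)
        for i, values in enumerate(block):
            out[i].extend(values)
    return out


rows = [
    {"price": [29.0, 0.15], "embedding": [0.1 * i for i in range(16)]},
    {"price": [58.0, 0.0], "embedding": [0.05 * i for i in range(16)]},
]
# Price features pass through untouched; the 16-d embedding is compressed to 4-d
reduced = reduce_by_group(rows, {"price": None, "embedding": 4})
```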
22. Grid Search
○ Higher dimensional spaces lead to spending most
of the time searching the boundary of the
parameter space
Random Search
○ Searches the space more evenly
Bayesian Optimization
○ At least as good as random … but so much quicker
It’s Free
○ Stop spending so many resources re-coding a free
solution that you won’t be able to beat
○ …...SigOpt
Hyper Parameter Optimization (HPO)
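A short sketch of the grid versus random search point. `grid_boundary_fraction` counts how much of a regular grid sits on the edge of the parameter space, and `random_search` is a bare-bones uniform sampler. The toy objective and the parameter names `learning_rate` and `subsample` are hypothetical.

```python
import random


def grid_boundary_fraction(points_per_dim, n_dims):
    """Fraction of grid points with at least one coordinate on the edge of its range."""
    interior = max(points_per_dim - 2, 0) ** n_dims
    return 1.0 - interior / points_per_dim ** n_dims


def random_search(objective, bounds, n_trials, seed=0):
    """Minimize objective by sampling each parameter uniformly within its bounds."""
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(n_trials):
        params = {name: rng.uniform(lo, hi) for name, (lo, hi) in bounds.items()}
        score = objective(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_loss(p):
    # Hypothetical smooth objective with its minimum at (0.1, 0.8)
    return (p["learning_rate"] - 0.1) ** 2 + (p["subsample"] - 0.8) ** 2


bounds = {"learning_rate": (0.01, 0.3), "subsample": (0.5, 1.0)}
best_params, best_score = random_search(toy_loss, bounds, n_trials=50)
```

With 5 grid values per parameter, about 64% of grid points lie on the boundary in 2 dimensions and about 95% in 6, which is why a grid spends most of its budget at the edges as parameters are added.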
23. Many model types are primarily good at sorting
datasets, but struggle with biases that can cause
absolute accuracy to suffer.
Calibration corrects for known biases to improve
absolute accuracy.
Calibration
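A minimal sketch of one calibration technique, histogram binning, assuming model scores fall in [0, 1] and held-out binary outcomes are available. This is an illustrative choice, not necessarily the method used in the talk; the sample scores and outcomes are made up.

```python
def fit_binned_calibrator(scores, outcomes, n_bins=5):
    """Learn the observed outcome rate for each score bin from held-out data."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for s, y in zip(scores, outcomes):
        b = min(int(s * n_bins), n_bins - 1)
        sums[b] += y
        counts[b] += 1
    # Bins with no data fall back to the bin midpoint (i.e., no correction)
    return [sums[b] / counts[b] if counts[b] else (b + 0.5) / n_bins
            for b in range(n_bins)]


def calibrate(score, bin_rates):
    """Map a raw score to the observed rate of its bin."""
    b = min(int(score * len(bin_rates)), len(bin_rates) - 1)
    return bin_rates[b]


# Illustrative over-confident model: scores near 0.9 are right only half the time
scores = [0.90, 0.95, 0.92, 0.91, 0.10, 0.15, 0.12, 0.11]
outcomes = [1, 0, 1, 0, 0, 0, 0, 1]
bin_rates = fit_binned_calibrator(scores, outcomes)
```

The ranking of items is unchanged within a bin, so the sorting strength of the model is kept while the absolute values are pulled toward what was actually observed.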
37. It is hard to get multiple points of measurement to know
when the chasm is crossed.
Measure 2 points:
○ Current system performance (Human
Accuracy)
○ Perfect performance
Rule of Thumb:
○ Release once ML accuracy is greater than the
split (the midpoint between the two points)
Split the Difference
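The rule of thumb can be written down directly, reading "the split" as the midpoint between human accuracy and perfect performance. The function names and the accuracy figures are hypothetical.

```python
def release_threshold(human_accuracy, perfect_accuracy=1.0):
    """Midpoint between current (human) performance and perfect performance."""
    return (human_accuracy + perfect_accuracy) / 2.0


def ready_to_release(ml_accuracy, human_accuracy, perfect_accuracy=1.0):
    return ml_accuracy > release_threshold(human_accuracy, perfect_accuracy)


# Illustrative numbers only: humans at 50% puts the release bar at 75%
threshold = release_threshold(0.5)
```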
40. Common System Parameters
○ Client Segment
○ Business Context
○ Target Metric
○ Time Scale
Software Engineering Best Practices:
○ Every Degree of Freedom in a system has a
cost in maintenance and design complexity.
○ Adding Degrees of Freedom often requires a
refactor
Set Flexible & Narrow System Parameters
41. When predictions are used in multiple
contexts, providing them in context is
important for enabling strong decision making.
Examples of setting context
○ Provide a summary recommendation of buy,
unknown, or don’t buy
○ Provide a historical baseline of performance
with those predictions
○ Provide an example of the next best or most
similar item already in the system
Context Context Context
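A sketch of a prediction payload that carries the three kinds of context listed above. Every field name and the `buy_above` / `dont_buy_below` cutoffs are hypothetical; the actual thresholds would come from the business context.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PredictionInContext:
    item_id: str
    predicted_value: float
    recommendation: str               # "buy", "unknown", or "don't buy"
    historical_baseline: float        # past performance of comparable predictions
    most_similar_item: Optional[str]  # closest item already in the system, if any


def summarize(predicted_value, buy_above, dont_buy_below):
    """Collapse a raw prediction into a summary recommendation."""
    if predicted_value >= buy_above:
        return "buy"
    if predicted_value <= dont_buy_below:
        return "don't buy"
    return "unknown"


def with_context(item_id, predicted_value, historical_baseline, most_similar_item,
                 buy_above=0.7, dont_buy_below=0.3):
    return PredictionInContext(
        item_id=item_id,
        predicted_value=predicted_value,
        recommendation=summarize(predicted_value, buy_above, dont_buy_below),
        historical_baseline=historical_baseline,
        most_similar_item=most_similar_item,
    )
```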
42. Elevate Program was designed to give a leg up to
emerging BIPOC designers at a time when it was
needed.
Access to data insights, predictions, and early
product market fit indicators for scaling help plan
supply chains, highlight growth areas, and help
emerging brands optimize their digital presence.
Building recommender systems is expensive; reusing
them is cheap. I encourage folks to think about how
their work can be reused by building up compassion
for what will help others.
Elevate Program