An embarrassment of riches
Harmonizing several recommendation problems
NVIDIA RecSys Summit - July 28, 2022

Dr. Bryan Bischof
Head of Data Science @ Weights and Biases
Email: bryan.bischof@gmail.com
Overview
1. Framing some problems [Business, RecSys]
2. Some causes for concern [Case studies for 3 models]
3. Mistakes and challenges [Type 1, Org, HITL, Weak Composition]
4. Very good things [Great place to do great work]
5. Good Big Model? [One with everything]
6. Learnings [Some things to strive for]
1. Stitch Fix Business problem
- 5 items in a box
- Shipments ~1/month
- Feedback from the customer on each item
- 100Kʼs of users, 10Kʼs of items, 10Kʼs of shipments
- History of items in shipments, and their outcomes
- Attributes of items, onboarding-quiz features for customers

c.f. Recommendations as Unique as You, 2014
1. Stitch Fix RecSys problems
- Match Score: Which items a customer will purchase (MS)
- Fix Generation: Set of 5 items for the customer based on a request note (FG)
- Style Prediction: Which items a customer will like, as part of a game where they rate (binary) photos of clothing (SS)
- E-commerce: Best items to show a customer on their homepage (PS)
- Similar Styles: Items similar to an uploaded image that a customer will purchase (SB)
- Inventory Health: Inventory health at a given time for a given client (IH)
- Search: Best items to show a customer based on a text search (IS)

c.f. Colson 2015, Klingenberg 2015, Boyle 2018, Zielnicki 2019, Bischof-Horn 2021
1a. The Core recommendation model
- Match Score: Which items a customer will purchase (MS)
- A stylist was shown available inventory for a client (user) and the output of the recommendation model for the pair; this output was called the match score.
- A stylist was asked to use information about the client, information from previous shipments (fixes), information from a request note, and the match score to assemble a 5-item box of clothes.
- Stylists were judged on a high average match score of included items, time to style, and outcomes from fixes.

c.f. Klingenberg 2015
2a. Concerns with match score
- Match Score: Which items a customer will purchase (MS)
- The match score was uncalibrated, and the loss was the global success rate of items (see the calibration sketch after this list)
- It was used as a lever to influence stylist behavior: upscale/downscale MS to nudge stylists relative to inventory
- Stylists were a mediator between ranking and outcomes, so poor model performance was very hard to detect live
- Lacked access to many item features for a while
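An uncalibrated score canʼt be read as a purchase probability or compared directly against other models. A minimal sketch of post-hoc calibration with isotonic regression on synthetic data; `raw_scores` and `purchased` are hypothetical stand-ins, not the actual MS pipeline:

```python
# A minimal sketch of post-hoc calibration for an uncalibrated match score.
# `raw_scores` and `purchased` are synthetic, hypothetical stand-ins for
# historical model outputs and binary purchase outcomes.
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
raw_scores = rng.uniform(size=10_000)                    # uncalibrated outputs
purchased = rng.uniform(size=10_000) < raw_scores ** 2   # synthetic outcomes

# Fit a monotone map from raw score -> empirical purchase probability.
calibrator = IsotonicRegression(out_of_bounds="clip")
calibrator.fit(raw_scores, purchased.astype(float))

calibrated = calibrator.predict(raw_scores)
# Calibrated scores can be compared across models and read as probabilities.
```

Once calibrated, upscaling/downscaling the score to manage inventory also becomes an explicit, auditable transformation of a probability rather than an opaque nudge.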
1b. The combinatorial model
- Fix Generation: Set of 5 items for the customer based on a request note (FG)
- Needed to include text features from request notes
- Should attempt to find mutually good items for a fix instead of 5 individually good items, natively considering interactions (see the selection sketch after this list)
- Should fit into a similar UI and HITL experience for stylists

c.f. Zielnicki 2019
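A minimal sketch of what "natively considering interactions" can look like as greedy set selection; `item_score` and `pair_score` are hypothetical placeholders, not the actual FG model from Zielnicki 2019:

```python
# A minimal sketch of fix generation as greedy set selection: each pick
# trades off per-item relevance against compatibility with items already
# in the box. The scoring functions are hypothetical stand-ins.

def item_score(i: int) -> float:
    return 1.0 / (1 + i)                       # placeholder relevance

def pair_score(i: int, j: int) -> float:
    return 0.1 if (i + j) % 2 == 0 else 0.0    # placeholder compatibility

def generate_fix(candidates: list[int], k: int = 5) -> list[int]:
    """Greedily build a k-item box scoring relevance + mutual fit."""
    box: list[int] = []
    pool = set(candidates)
    while len(box) < k and pool:
        best = max(pool, key=lambda i: item_score(i)
                   + sum(pair_score(i, j) for j in box))
        box.append(best)
        pool.remove(best)
    return box

print(generate_fix(list(range(100))))
```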
2b. Concerns with Fix Generation
- Fix Generation: Set of 5 items for the customer based on a request note (FG)
- Ultimately did not replace MS, and instead was deployed and run in parallel. Two major reasons why:
  - Org structure (leadership champions on both sides)
  - Disjoint loss functions (different evaluation methods made it easy to avoid choosing)
- Evaluation/testing remained independent, with no direct model comparison and no mutual validation
1c. A style game
- Style Prediction: Which items a customer will like, as part of a game where they rate (binary) photos of clothing (SS)
- The loss function was completely focused on item ratings, not purchases
- Assumed all ratings were independent
- Much larger scale: clients could play the game anytime, for as long as they liked
- Ignored availability
- Purely photo-based

c.f. Boyle 2018
2c. Concerns with Style Shuffle
- Style Prediction: Which items a customer will like, as part of a game where they rate (binary) photos of clothing (SS)
- Different loss function, so not compared to other models directly
- Used for feature engineering for downstream models (as a latent style space)
- Early on, the features seemed to improve other models when added in, but as MS grew more sophisticated and its architecture adopted technologies similar to the SS model, the features were never accurately retested
- Feature importance was calculated as part of the FG pipeline, but only via inclusion/exclusion (see the ablation sketch after this list)
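A minimal sketch of the kind of inclusion/exclusion (ablation) test described above, on synthetic data; all features and the model here are hypothetical stand-ins:

```python
# A minimal sketch of inclusion/exclusion feature testing: compare a model
# with and without the latent-style features. Everything here is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(7)
base = rng.normal(size=(5_000, 10))        # baseline MS-style features
latent = rng.normal(size=(5_000, 4))       # SS latent-style features
y = (base[:, 0] + 0.1 * latent[:, 0] + rng.normal(size=5_000)) > 0

for name, X in [("without SS", base), ("with SS", np.hstack([base, latent]))]:
    auc = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                          scoring="roc_auc", cv=5).mean()
    print(f"{name}: AUC={auc:.3f}")
```

The measured delta shrinks as the baseline model improves, which is exactly why such tests need to be re-run whenever the baseline changes.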
2c. More Concerns with Style Shuffle
- Data leakage was not accounted for in prequential training (see the split sketch after this list)
- Relied instead on A/B tests, but the tests werenʼt actually tied to business outcomes
- Massive organizational pressure to use the output of this model
- Latent spaces from other models had no feature APIs or feature stores through which to use them
- Muddled story about including item attributes in the feature space
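A minimal sketch of a prequential split that trains only on the past and evaluates only on the future, avoiding the leakage described above; the DataFrame and `rated_at` column are hypothetical:

```python
# A minimal sketch of prequential (train-on-past, test-on-future) splits.
# Column names are hypothetical stand-ins.
import pandas as pd

def prequential_splits(df: pd.DataFrame, time_col: str = "rated_at",
                       n_folds: int = 4):
    """Yield (train, test) frames where each test fold is strictly later
    in time than all of the data it is evaluated against."""
    df = df.sort_values(time_col)
    cuts = [int(len(df) * (i + 1) / (n_folds + 1)) for i in range(n_folds + 1)]
    for i in range(n_folds):
        train = df.iloc[: cuts[i]]          # everything before the fold
        test = df.iloc[cuts[i]: cuts[i + 1]]  # the next slice of time
        yield train, test
```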
3a. Type 1 error
- A widely held belief internally was that we had rampant type 1 error, i.e. we believed more experiments were successful than actually were
- Why? Peeking and early stopping, variance in experimental design, not controlling for covariates, and inconsistent treatment assignment (see the peeking simulation below)
- Matching + randomization is hard, especially with HITL
- Virtual Warehouses were a hard problem, but we tried
- Ultimately, client experience was the priority; sometimes we sacrificed rigor for it
- Over time, sequential-testing bias added up, and we didnʼt have global holdouts or re-testing
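A minimal simulation of how peeking inflates type 1 error: an A/A test with no true effect, stopped the first time any interim look shows p < 0.05. The numbers are illustrative, not Stitch Fixʼs:

```python
# A minimal A/A simulation: with no true effect, "stopping at the first
# significant look" still declares success far more than 5% of the time.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_experiments, n_looks, n_per_look = 2_000, 10, 200

false_positives = 0
for _ in range(n_experiments):
    a = rng.normal(size=(n_looks, n_per_look))
    b = rng.normal(size=(n_looks, n_per_look))
    for look in range(1, n_looks + 1):
        _, p = stats.ttest_ind(a[:look].ravel(), b[:look].ravel())
        if p < 0.05:            # peek: stop at the first "significant" result
            false_positives += 1
            break

print(f"type 1 error with peeking: {false_positives / n_experiments:.2%}")
# Substantially above the nominal 5%.
```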
3b. Balkanization
- There was no team ultimately responsible for evaluation and validation of models across the org; every team was expected to develop quickly and independently
- We did not have continuous testing/continuous learning; new models were tested via singular effect estimates
- Teams were incentivized to bring something “new” to the table to show impact separately, with no emphasis on improving existing systems

c.f. Colson 2019, Algorithms Tour 2019
3c. Humans out of the loop
- HITL processes were used for the product, but humans were not part of model evaluation
- The metrics on humans encouraged faith in the algorithm at the cost of human influence
- The algorithms created more and more simultaneous objectives for stylists to consider
  - E.g. request notes, sizing, weather in the clientʼs region, recent sends, global trends, low inventory
3d. Weak composition
- Many of the model dependencies created a weak-composition structure:
  - Two (or more) models are composed
  - The models are not trained jointly
  - The second model uses a byproduct of the first (think: some embedding space)
  - There are no type/schema requirements between the two
- These situations lead to unmeasured interdependency (see the contract sketch below)
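A minimal sketch of strengthening weak composition via an explicit, versioned schema contract between the models; the names and the feature-store stub are hypothetical:

```python
# A minimal sketch of strong composition: the downstream model consumes
# upstream embeddings only through an explicit, versioned contract rather
# than an untyped byproduct. All names here are hypothetical.
from dataclasses import dataclass
import numpy as np

@dataclass(frozen=True)
class EmbeddingContract:
    """Schema the upstream model promises and the downstream model checks."""
    name: str
    version: str
    dim: int

STYLE_SPACE_V2 = EmbeddingContract(name="style_latent", version="2.0", dim=64)

def load_style_embedding(client_id: str,
                         contract: EmbeddingContract) -> np.ndarray:
    vec = np.zeros(contract.dim)     # stand-in for a feature-store lookup
    assert vec.shape == (contract.dim,), "upstream output violates contract"
    return vec

# Downstream (e.g. FG) declares which version it was trained against, so an
# upstream retrain that changes the space breaks loudly, not silently.
features = load_style_embedding("client-123", STYLE_SPACE_V2)
```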
4. Upsides
- Access: Data Scientists had access to, and visibility into, a huge variety of data and work across the company
- Experimentation Framework: Extremely powerful and developer-friendly
- Massive applicability: ML was applied to almost all aspects of the business, and the intellectual freedom to try different things/problems was incredible
- Technical Competency: Very strong knowledge and skill broadly across ICs
- Impact-driven culture: A great culture focused on having real impact
- Developer experience: DevExp for DS was very good, with a strong data platform

c.f. Bradley 2019
5a. Is the solution a massively multi-objective NN?
- What about a Good Big Model? Why canʼt you simply make everything a large multi-objective problem and throw more and more features into one master model?
- Bias-variance tradeoff: this requires more and more overparameterization
- The relative importance of the components in your loss is hard to establish
  - If you set it through human intuition, youʼre not guaranteed the weighting is well suited for learning; if you set it automatically, youʼre likely to drift away from expert knowledge
- Larger models are more detached from the data-generating process and harder to explain, but they can themselves serve as interesting and useful data generators
5b. Obligatory Karpathy Reference
Karpathyʼs team on self-driving was distributed over many components of a massively multi-task problem. In addition to adversarial collaboration, he generally found it difficult to optimize how to compose their efforts.
Maybe he should try back-propagation to learn a better weighting (see the sketch below).

c.f. Karpathy, ICML 2019
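Taking the joke half-seriously: a minimal sketch of learning multi-task loss weights by back-propagation, in the spirit of uncertainty weighting (Kendall et al. 2018), on synthetic tasks:

```python
# A minimal sketch of learned multi-task loss weighting: per-task log
# variances are trained alongside the model. Tasks and data are synthetic.
import torch

n_tasks = 3
log_vars = torch.zeros(n_tasks, requires_grad=True)  # learned per-task weights
model = torch.nn.Linear(8, n_tasks)
opt = torch.optim.Adam([*model.parameters(), log_vars], lr=1e-2)

x = torch.randn(256, 8)
y = torch.randn(256, n_tasks)

for _ in range(200):
    pred = model(x)
    per_task = ((pred - y) ** 2).mean(dim=0)          # one MSE per task
    # Each task loss is scaled by a learned precision; the +log_var term
    # stops the optimizer from collapsing every weight to zero.
    loss = (torch.exp(-log_vars) * per_task + log_vars).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()

print(torch.exp(-log_vars).detach())   # learned relative task weightings
```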
6a. Some recommendations (get it?)
1. Global holdouts, published validation artifacts, random re-testing, and no peeking can decrease the risk of bridges to nowhere (long-term type 1 error); see the holdout sketch after this list
2. There should be a team who ultimately signs off on all experimental design
3. Weak composition should be avoided, or strengthened into strong composition
4. Better experiment documentation for ML experiments and development
5. Internal task leaderboards should be institutionalized and integrated with model registries/automatic validation pipelines
6. Partnership should be upheld as a core value, and preferred to 🍠YAM (yet another model) developed independently
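A minimal sketch of item 1ʼs global holdout: a fixed, deterministically hashed slice of clients is excluded from every experiment and rollout so long-term effects can be re-measured. The salt and rate are hypothetical:

```python
# A minimal sketch of a deterministic global holdout. The same clients are
# always held out, independent of any single experiment's assignment.
import hashlib

HOLDOUT_SALT = "global-holdout-2022"   # hypothetical, fixed org-wide
HOLDOUT_RATE = 0.02                    # 2% of clients never get new treatments

def in_global_holdout(client_id: str) -> bool:
    digest = hashlib.sha256(f"{HOLDOUT_SALT}:{client_id}".encode()).hexdigest()
    return int(digest, 16) % 10_000 < HOLDOUT_RATE * 10_000

print(in_global_holdout("client-123"))
```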
6b. Building a platform to facilitate
To help build compositional pipelines/orgs, our platform is built of components, and the platform handles the coherence.

c.f. Weights and Biases
Thanks!
Check out W&B’s composable tools at: wandb.ai
Totally free for individuals & academics.
Get in touch: contact@wandb.ai
Coming early 2023!
Find us on Twitter: @bebischof, @eigenhector
Or LinkedIn: Bryan, Hector