1) The document discusses several recommendation problems at Stitch Fix, including match score, fix generation, style prediction, inventory health, and search. For each problem it outlines concerns such as mismatched loss functions, organizational barriers, and the absence of joint training or validation across models (the first sketch after this summary illustrates that kind of composition).
2) It describes mistakes made along the way: type I errors caused by peeking at experiment results before they finished (simulated in the second sketch below), the "balkanization" of teams working independently, and humans being left out of the model evaluation process. Weak composition of models without joint training was a further challenge.
3) The document advocates practices such as global holdouts, published validation, random re-testing, and strengthening the weak composition between models. It suggests institutionalizing internal task leaderboards and shared validation to improve experimental rigor (the third sketch below shows one way a global holdout could be assigned).
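To make the concern in point 1 concrete, here is a minimal sketch, using synthetic data and generic scikit-learn models, of two models trained with different loss functions and then composed with no joint objective or joint validation. The names `match_model`, `style_model`, and `composed_score` are illustrative assumptions, not Stitch Fix's actual pipeline.

```python
# Hypothetical illustration of "weak composition": two models trained
# independently on different losses, multiplied together at ranking time.
import numpy as np
from sklearn.linear_model import LogisticRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))                              # synthetic client/item features
kept = (X[:, 0] + rng.normal(size=1000) > 0).astype(int)     # did the client keep the item?
style = X[:, 1] * 2 + rng.normal(size=1000)                  # style survey rating

match_model = LogisticRegression().fit(X, kept)   # trained on log loss
style_model = Ridge().fit(X, style)               # trained on squared loss

def composed_score(x):
    """Rank items by an ad hoc product of two independently trained scores."""
    p_keep = match_model.predict_proba(x)[:, 1]
    style_pred = style_model.predict(x)
    # The composition itself is never trained or validated end to end.
    return p_keep * style_pred

print(composed_score(X[:5]))
```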
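The "type I errors from peeking" in point 2 can be demonstrated with a small simulation (my own illustration, not taken from the document): both arms are drawn from the same distribution, yet running a t-test after every batch and stopping at the first significant result rejects the null far more often than the nominal 5% level.

```python
# Simulate how repeatedly checking an A/B test ("peeking") inflates
# the false positive rate relative to a fixed-horizon analysis.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def run_experiment(n_batches=20, batch_size=100, alpha=0.05, peek=True):
    a, b = [], []
    for _ in range(n_batches):
        a.extend(rng.normal(0, 1, batch_size))
        b.extend(rng.normal(0, 1, batch_size))   # no true effect in either arm
        if peek:
            _, p = stats.ttest_ind(a, b)
            if p < alpha:
                return True                       # stopped early, declared "significant"
    _, p = stats.ttest_ind(a, b)                  # single test at the planned horizon
    return p < alpha

n_trials = 1000
peeking_rate = np.mean([run_experiment(peek=True) for _ in range(n_trials)])
fixed_rate = np.mean([run_experiment(peek=False) for _ in range(n_trials)])
print(f"false positive rate with peeking:   {peeking_rate:.3f}")
print(f"false positive rate, fixed horizon: {fixed_rate:.3f}")
```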
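One common way to implement the global holdout advocated in point 3 is deterministic hashing of client IDs, so every team and experiment excludes the same never-treated slice of clients. The salt, the 2% fraction, and the function names below are assumptions for illustration, not the document's actual mechanism.

```python
# Hypothetical sketch of a global holdout: assignment is a pure function of
# the client ID, so it is stable across teams, models, and experiments.
import hashlib

GLOBAL_HOLDOUT_FRACTION = 0.02   # assumed 2% of clients never receive new models
SALT = "global-holdout-v1"       # fixed salt keeps assignment consistent everywhere

def in_global_holdout(client_id: str) -> bool:
    """Deterministically map a client to the global holdout bucket."""
    digest = hashlib.sha256(f"{SALT}:{client_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF   # map the hash to [0, 1]
    return bucket < GLOBAL_HOLDOUT_FRACTION

def eligible_for_experiment(client_id: str) -> bool:
    """Experiments and new models should skip holdout clients."""
    return not in_global_holdout(client_id)

if __name__ == "__main__":
    clients = [f"client-{i}" for i in range(100_000)]
    holdout = sum(in_global_holdout(c) for c in clients)
    print(f"holdout share: {holdout / len(clients):.3%}")   # close to 2%
```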