An embarrassment of riches
Harmonizing several recommendation problems
NVIDIA RecSys Summit - July 28, 2022
Dr. Bryan Bischof
Head of Data Science
@ Weights and Biases
Email: bryan.bischof@gmail.com
Overview
1. Framing some problems [Business, RecSys]
2. Some causes for concern [Case studies for 3 models]
3. Mistakes and challenges [Type 1, Org, HITL, Weak Composition]
4. Very good things [Great place to do great work]
5. Good Big Model? [One with everything]
6. Learnings [Some things to strive for]
Problem Framing
1. Stitch Fix Business problem
- 5 items in a box
- shipments ~1/mo
- feedback from customer on each item
- 100Ks of users, 10Ks of items, 10Ks of shipments
- History of items in shipments, and outcomes
- Attributes of items, onboarding quiz features for customers
cf. Recommendations as Unique as You, 2014
1. Stitch Fix RecSys problems
- Match Score: Which items a customer will purchase (MS)
- Fix Generation: Set of 5 items for the customer based on a request note (FG)
- Style Prediction: Which items a customer will like as part of a game where
they rate (binary) photos of clothing (SS)
- E-commerce: Best items to show a customer on their homepage (PS)
- Similar Styles: Items similar to uploaded image a customer will purchase (SB)
- Inventory Health: Inventory health at a given time for a given client (IH)
- Search: Best items to show a customer based on a text search (IS)
cf. Colson 2015, Klingenberg 2015, Boyle 2018, Zielnicki 2019, Bischof-Horn 2021
Match Score
1a. The Core recommendation model
Klingenberg 2015
- Match Score: Which items a customer will purchase (MS)
- A stylist was shown available inventory for a client (user) and the output of the
recommendation model for the pair; this was called the match score.
- A stylist was asked to use information about the client, information about previous
shipments (fixes), information from a request note, and the match score to
assemble a 5-item box of clothes.
- A stylist was judged on a high average match score of items included, time to
style, and outcomes from fixes.
2a. Concerns with match score
- Match Score: Which items a customer will purchase (MS)
- The match score was uncalibrated, and the loss was the global success rate of the items
- It was used as a lever to influence stylist behavior: MS was scaled up or down to
nudge stylists relative to inventory
- Stylists were a mediator between ranking and outcomes, so poor model
performance was very hard to detect live
- Lacked access to many item features for a while
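To make the calibration concern concrete: a score is calibrated when items scored around 0.7 are purchased about 70% of the time. The sketch below shows one common way to check this (a reliability table plus expected calibration error), assuming scores in [0, 1] and binary purchase outcomes. The function names and toy data are hypothetical and are not taken from the Stitch Fix system.

```python
from collections import defaultdict


def reliability_table(scores, outcomes, n_bins=5):
    """Return (bin_index, mean_score, purchase_rate, count) for each non-empty bin."""
    bins = defaultdict(list)
    for score, outcome in zip(scores, outcomes):
        # Clamp so that a score of exactly 1.0 falls into the last bin.
        idx = min(int(score * n_bins), n_bins - 1)
        bins[idx].append((score, outcome))
    rows = []
    for idx in sorted(bins):
        pairs = bins[idx]
        mean_score = sum(s for s, _ in pairs) / len(pairs)
        purchase_rate = sum(y for _, y in pairs) / len(pairs)
        rows.append((idx, mean_score, purchase_rate, len(pairs)))
    return rows


def expected_calibration_error(scores, outcomes, n_bins=5):
    """Count-weighted average gap between mean score and observed rate."""
    total = len(scores)
    return sum(
        count / total * abs(mean_score - rate)
        for _, mean_score, rate, count in reliability_table(scores, outcomes, n_bins)
    )


# Hypothetical toy data: high scores that only convert half of the time.
toy_scores = [0.9, 0.9, 0.9, 0.9, 0.1, 0.1, 0.1, 0.1]
toy_outcomes = [1, 1, 0, 0, 0, 0, 0, 0]
toy_ece = expected_calibration_error(toy_scores, toy_outcomes)
```

On the toy data the error is about 0.25, driven mostly by the high-score bin. A model can rank items well and still fail this check, which matters when the raw score is shown to a human or rescaled as a lever.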
Fix Generation
1b. The combinatorial model
Zielnicki 2019
- Fix Generation: Set of 5 items for the customer based on a request note (FG)
- Needed to include text features from request notes
- Should attempt to find items that work well together in a fix, rather than 5 individually
good items, natively considering item interactions
- Should fit into a similar UI and HITL experience for stylists
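The difference between "5 good items" and "a good set of 5" can be sketched with a set-level score that adds pairwise interaction terms to the per-item scores. This is a minimal illustration under stated assumptions: the additive scoring form, the function names, and the toy scores are all hypothetical, and the exhaustive search is only workable for tiny candidate pools. It is not the Fix Generation model itself.

```python
from itertools import combinations


def set_score(items, item_score, pair_score):
    """Sum of per-item scores plus pairwise interaction terms (keys are sorted pairs)."""
    total = sum(item_score[i] for i in items)
    for a, b in combinations(sorted(items), 2):
        total += pair_score.get((a, b), 0.0)
    return total


def best_set(candidates, item_score, pair_score, k=5):
    """Exhaustively pick the k-item set with the highest set-level score."""
    return max(
        combinations(sorted(candidates), k),
        key=lambda s: set_score(s, item_score, pair_score),
    )


def top_k_independent(candidates, item_score, k=5):
    """Baseline: take the k individually best items, ignoring interactions."""
    ranked = sorted(candidates, key=lambda i: -item_score[i])
    return tuple(sorted(ranked[:k]))


# Hypothetical toy data with k=2 to keep the example readable.
toy_item_score = {"jeans": 0.9, "skirt": 0.8, "blouse": 0.7}
toy_pair_score = {("jeans", "skirt"): -0.5, ("blouse", "jeans"): 0.3}
toy_candidates = list(toy_item_score)

independent_pick = top_k_independent(toy_candidates, toy_item_score, k=2)
joint_pick = best_set(toy_candidates, toy_item_score, toy_pair_score, k=2)
```

Here the independent baseline picks jeans and skirt (the two highest individual scores), while the set-level score prefers blouse and jeans because of the interaction terms.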
2b. Concerns with Fix Generation
- Fix Generation: Set of 5 items for the customer based on a request note (FG)
- Ultimately did not replace MS, and instead was deployed and run in parallel.
Two major reasons why:
- Org structure (leadership champions on both sides)
- Disjoint loss functions (different methods for evaluation made it easy to
avoid choice)
- Evaluation/testing remained independent without direct model comparison;
no mutual validation
Style Shuffle
1c. A style game
Boyle 2018
- Style Prediction: Which items a customer will like as part of a game where
they rate (binary) photos of clothing (SS)
- Loss function was completely focused on item rating, not purchases
- Assumed all ratings were independent
- Much larger scale: customers could play the game at any time, for as long as they liked
- Ignored availability
- Purely photo based
2c. Concerns with Style Shuffle
- Style Prediction: Which items a customer will like as part of a game where
they rate (binary) photos of clothing (SS)
- Different loss function, so not compared to other models directly
- Used for feature engineering for downstream models (as a latent style space)
- Early on, the features seemed to improve other models when added, but as MS grew more
sophisticated and its architecture came to use technologies similar to the SS model, the
features were never accurately retested
- Feature importance was calculated as part of the FG pipeline, but only via inclusion/exclusion
2c. More Concerns with Style Shuffle
- Data leakage was not accounted for in prequential training
- Relied instead on A/B tests, but the tests weren't actually tied to business outcomes
- Massive organizational pressure to use the output of this model
- Latent spaces from other models did not have API availability for features, or
feature stores to use them
- Muddled story about including item attributes in the feature space
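For readers unfamiliar with the term, prequential ("test-then-train") evaluation scores each event with the model as it stood before that event, then updates the model on it. Leakage creeps in when the update or the features see anything from the event's future. The sketch below is a minimal illustration with a hypothetical running-rate model; the function names and event layout are assumptions, not the pipeline described in the talk.

```python
def prequential_eval(events, predict, update, state):
    """Score each event before training on it.

    events: iterable of (timestamp, features, label) in any order.
    Returns the per-event absolute errors and the final model state.
    """
    errors = []
    # Sorting by timestamp guards against training on the future.
    for _, features, label in sorted(events, key=lambda e: e[0]):
        errors.append(abs(predict(state, features) - label))  # test first
        state = update(state, features, label)                # then train
    return errors, state


def predict_rate(state, _features):
    """Hypothetical model: predict the like-rate seen so far (0.5 before any data)."""
    likes, total = state
    return likes / total if total else 0.5


def update_rate(state, _features, label):
    likes, total = state
    return likes + label, total + 1


# Hypothetical ratings arriving out of order: (timestamp, features, liked).
toy_events = [(3, None, 1), (1, None, 1), (2, None, 0)]
toy_errors, toy_state = prequential_eval(toy_events, predict_rate, update_rate, (0, 0))
```

Swapping the two lines inside the loop (train, then test) would be the simplest form of leakage: every prediction would already have seen its own label.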
Mistakes and Challenges
3a. Type 1 error
- A widely held belief internally was that we had rampant type 1
error, i.e. we thought more experiments were successful than actually were
- Why? Peeking and early stopping, variance in ExpDesign, not controlling for
covariates, treatment assignment inconsistencies
- Matching + Randomization is hard, especially with HITL
- Virtual Warehouses were a hard problem but we tried
- Ultimately, client experience was the priority; sometimes we
sacrificed rigor for it
- Over time, sequential testing bias added up, and we didn't have
global holdouts or re-testing
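The effect of peeking can be shown with a small simulation of A/A tests, where both arms are drawn from the same distribution so every "significant" result is a false positive. This is a sketch under simplifying assumptions (unit-variance Gaussian outcomes, a two-sided z-test at roughly the 5% level, equally spaced looks); the function name and all parameters are hypothetical.

```python
import math
import random


def simulate_aa_tests(n_runs=1000, n_per_arm=200, looks=5, z_crit=1.96, seed=0):
    """Simulate A/A tests and return two false positive rates:
    (significant at ANY interim look, significant at the single final look)."""
    rng = random.Random(seed)
    checkpoints = [n_per_arm * (i + 1) // looks for i in range(looks)]
    peeking_hits = 0
    fixed_hits = 0
    for _ in range(n_runs):
        sum_a = sum_b = 0.0
        n = 0
        significant_at_any_look = False
        z = 0.0
        for stop in checkpoints:
            while n < stop:
                sum_a += rng.gauss(0.0, 1.0)
                sum_b += rng.gauss(0.0, 1.0)
                n += 1
            # z-statistic for the difference in means with known unit variance.
            z = (sum_a - sum_b) / n / math.sqrt(2.0 / n)
            if abs(z) > z_crit:
                significant_at_any_look = True
        peeking_hits += significant_at_any_look
        fixed_hits += abs(z) > z_crit  # z now holds the final-look statistic
    return peeking_hits / n_runs, fixed_hits / n_runs


peeking_rate, fixed_rate = simulate_aa_tests()
```

Because the final look is one of the looks, the peeking rate can never be lower than the fixed-horizon rate, and in practice it is noticeably higher. Stopping an experiment the first time it looks significant is one way rampant type 1 error accumulates.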
3b. Balkanization
- There was no team ultimately responsible for evaluation and
validation of models across the org; every team was expected to
develop quickly and independently
- We did not have continuous testing/continuous learning; new
models were tested via singular effect estimates
- Teams were incentivized to bring something “new” to the table to
show impact separately, with no emphasis on improving existing models
Colson 2019, Algorithms Tour 2019
3c. Humans out of the loop
- HITL processes were used for the product, but humans were not
part of model evaluation
- The metrics on humans encouraged faith in the algorithm at the
cost of the human influence
- The algorithms created more and more simultaneous objectives
for stylists to consider
- E.g. request notes, sizing, weather in client's region, recent
sends, global trends, low inventory
3d. Weak composition
- Many of the model dependencies created a weak-composition
structure:
- two (or more) models are composed
- the models are not trained jointly
- the second model uses a byproduct of the first (think of some embedding space)
- there are no type/schema requirements between the two
- These situations lead to unmeasured interdependency
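One way to strengthen a weak composition is to make the hand-off between the two models explicit. The sketch below attaches a small contract (producer, version, dimensionality) to an embedding and has the consumer check it before use. The class and function names are hypothetical, and this is only one possible shape for such a contract, not a description of any existing system.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class EmbeddingContract:
    """Hypothetical schema describing an embedding passed between models."""
    producer: str
    version: str
    dim: int


class ContractError(ValueError):
    """Raised when an upstream byproduct does not match what the consumer expects."""


def check_embedding(
    vector: Sequence[float],
    meta: EmbeddingContract,
    expected: EmbeddingContract,
) -> Sequence[float]:
    """Return the vector unchanged if it satisfies the expected contract."""
    if meta != expected:
        raise ContractError(f"expected {expected}, got {meta}")
    if len(vector) != expected.dim:
        raise ContractError(f"expected {expected.dim} dimensions, got {len(vector)}")
    return vector


# Hypothetical usage: the downstream model pins the upstream version it was trained on.
expected_contract = EmbeddingContract(producer="style_space", version="v2", dim=3)
checked = check_embedding([0.1, 0.2, 0.3], expected_contract, expected_contract)
```

With a check like this, retraining the upstream model under a new version fails loudly in the consumer instead of silently shifting its inputs, which turns an unmeasured interdependency into a visible one.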
Very Good Things
4. Upsides
- Access: Data Scientists had access and visibility into a huge variety of data and
work across the company
- Experimentation Framework: extremely powerful and developer friendly
- Massive applicability: ML was applied to almost all aspects of the business,
and intellectual freedom to try different things/problems was incredible
- Technical Competency: Very strong knowledge and skill broadly across ICs
- Impact driven culture: A great culture focused on having real impact
- Developer experience: DevExp for DS was very good, strong data platform
Bradley 2019
Good Big Model?
5a. Is the solution a massively multi-objective NN?
- What about a Good Big Model? Why can't you simply make everything a large
multi-objective problem and throw more and more features into one master
model?
- Bias-variance tradeoff: this will require more and more overparameterization
- Relative importance of components in your loss is hard to establish
- if you attempt to do this through human intuition, you're not guaranteed that this will be well
suited for learning; if you attempt to do this automatically, you're likely to get away from expert
knowledge
- Larger models are more detached from the data-generating process and harder to
explain, but can themselves serve as interesting and useful data generators
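A toy illustration of why the relative weighting of loss components matters, assuming a single shared parameter and two squared-error objectives (a deliberate simplification, not the models from the talk). For a loss of the form sum_i w_i * (theta - t_i)^2, the minimizer is the weighted mean of the targets, so the weights alone decide which objective wins. The function names are hypothetical.

```python
def weighted_loss(theta, targets, weights):
    """Multi-objective loss: weighted sum of squared errors against each target."""
    return sum(w * (theta - t) ** 2 for w, t in zip(weights, targets))


def weighted_minimizer(targets, weights):
    """Closed-form minimizer of weighted_loss: the weighted mean of the targets."""
    return sum(w * t for w, t in zip(weights, targets)) / sum(weights)


# Two objectives that disagree: one wants theta = 0, the other wants theta = 1.
toy_targets = [0.0, 1.0]
balanced = weighted_minimizer(toy_targets, [1.0, 1.0])
skewed = weighted_minimizer(toy_targets, [3.0, 1.0])
```

With equal weights the solution sits halfway between the objectives; tripling the first weight pulls it to a quarter of the way. Nothing in the data says which weighting is right, which is the difficulty the slide points at.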
5b. Obligatory Karpathy Reference
Karpathy's team on self-driving was
distributed over many components of a
massively-multitask problem. In addition
to adversarial collaboration, he generally
found difficulty in optimizing how to
compose their efforts.
Maybe he should try back-propagation to
learn a better weighting.
cf. Karpathy, ICML 2019
Learnings
6a. Some recommendations (get it?)
1. Global holdouts, published validation artifacts, random re-testing, and no peeking
can decrease the risk of bridges to nowhere (long-term type 1 error)
2. There should be a team that ultimately signs off on all experimental design.
3. Weak composition should be avoided or strengthened to strong composition
4. Better experiment documentation for ML experiments and development
5. Internal task leaderboards should be institutionalized and integrated with
model registry/automatic validation pipelines
6. Partnership should be upheld as a core value, and preferred to 🍠 YAM (yet
another model) developed independently
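As an illustration of the first recommendation, a global holdout can be assigned deterministically by hashing the user ID with a fixed salt, so the same users stay out of every experiment without storing an assignment table. This is a minimal sketch; the salt, the 5% default, and the function name are hypothetical choices rather than anything prescribed in the talk.

```python
import hashlib


def in_global_holdout(
    user_id: str,
    salt: str = "global-holdout-v1",
    holdout_pct: float = 5.0,
) -> bool:
    """Deterministically place roughly holdout_pct percent of users in the holdout."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 10_000  # stable bucket in 0..9999
    return bucket < holdout_pct * 100


# Hypothetical usage: filter holdout users out before any experiment assignment.
eligible_users = [u for u in ("u1", "u2", "u3") if not in_global_holdout(u)]
```

Keeping the salt fixed across experiments is what makes the holdout global; changing it would reshuffle who is held out and lose the long-run comparison.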
6b. Building a platform to facilitate
To help build compositional pipelines/orgs, our platform is built of components,
and our platform handles the coherence.
cf. Weights and Biases
Thanks!
Check out W&B’s composable tools at:
wandb.ai
Totally free for individuals & academics.
Get in touch: contact@wandb.ai.
Coming early 2023!
Find us on Twitter:
@bebischof, @eigenhector
Or LinkedIn:
Bryan, Hector
