Bing_Controlled Experimentation_Panel_The Hive

1. Why Didn’t My Feature Improve the Metric?
Ya Xu
Based on two papers (KDD’2012 and WSDM’2013) with Ronny Kohavi, Alex Deng, Toby Walker, Brian Frasca and Roger Longbotham
Experimentation Panel, 3/20/2013
2. What Metric?
• Overall Evaluation Criterion (OEC): the metric(s) used to decide whether A or B is better.
• Long-term goal for Bing: query share & revenue.
• Puzzling outcome:
  – A ranking bug in an experiment resulted in very poor search results.
  – Queries went up +10% and revenue went up +30%.
  – What should a search engine use as its OEC?
• We use Sessions-per-User (a comparison sketch follows below).
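To make the OEC concrete, here is a minimal sketch (not from the talk) of how a Sessions-per-User comparison could be run: average per-user session counts in treatment and control, compared with a two-sample t-test. The data, group sizes, and Poisson rates below are invented for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical per-user session counts over the experiment period.
sessions_a = rng.poisson(lam=5.00, size=10_000)  # control
sessions_b = rng.poisson(lam=5.05, size=10_000)  # treatment

# Sessions-per-User is the mean of per-user session counts; a two-sample
# (Welch) t-test decides whether the treatment moved the OEC.
t_stat, p_value = stats.ttest_ind(sessions_b, sessions_a, equal_var=False)
print(sessions_b.mean() - sessions_a.mean(), p_value)
```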
3. REASON #1
The feature just wasn’t as good as you thought…
We are poor at assessing the value of ideas.
Jim Manzi: “Google ran approximately 12,000 randomized experiments in 2009, with [only] about 10% of these leading to business changes.”
4. REASON #2: CARRYOVER EFFECT
5. Background
• Puzzling outcome:
  – Several experiments showed surprising results.
  – Reran them and the effects disappeared. Why?
• Bucket system (Bing/Google/Yahoo):
  – Assign users into buckets, then assign buckets to experiments.
  – Buckets are reused from one experiment to the next (see the bucketing sketch below).
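A minimal sketch, assuming a hash-based bucketing scheme of the kind described above; the bucket count, seed, and function names are hypothetical, not Bing's production design.

```python
import hashlib

NUM_BUCKETS = 1000  # hypothetical bucket count for the layer

def bucket_of(user_id: str, layer_seed: str = "layer-1") -> int:
    """Deterministically map a user to a bucket by hashing (seed, user_id)."""
    digest = hashlib.md5(f"{layer_seed}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % NUM_BUCKETS

# An experiment is then handed a block of buckets, e.g. buckets 0-49 as
# treatment and 50-99 as control. Because the layer seed is fixed, the same
# users keep landing in the same buckets from one experiment to the next,
# which is exactly how a prior experiment's effect can carry over.
```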
6. Carryover Effect
• Explanation:
  – The bucket system recycles users, and a prior experiment in the same buckets had carryover effects.
  – Effects last for months.
• Solution:
  – Run an A/A test.
  – Local re-randomization (sketch below).
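One way to picture local re-randomization (a hedged sketch; the seed scheme and the A/A check below are illustrative, not the production system): re-hash the recycled users with a fresh, experiment-specific seed, so that users who shared a bucket in a previous experiment are scattered across the new treatment and control groups.

```python
import hashlib

def assignment(user_id: str, experiment_seed: str) -> str:
    """Re-randomize per experiment: a fresh seed breaks any correlation
    with bucket assignments from earlier experiments."""
    digest = hashlib.md5(f"{experiment_seed}:{user_id}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 2 == 0 else "control"

# A/A sanity check: with no real treatment, the split should be ~50/50 and
# metric differences between the two groups should be statistically null.
users = [f"user-{i}" for i in range(100_000)]
share = sum(assignment(u, "exp-42") == "treatment" for u in users) / len(users)
print(share)  # expect roughly 0.5
```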
7. REASON #3: STATISTICAL SENSITIVITY
8. Background
• Performance matters:
  – Bing: +100 msec = -0.6% revenue
  – Amazon: +100 msec = -1% revenue
  – Google: +100 msec = -0.2% queries
• But not for Etsy.com? “faster results better? Meh”
Insensitive experimentation can lead to the wrong conclusion that a feature has no impact.
9. How to Achieve Better Sensitivity?
1. Get more users.
2. Run longer experiments:
  – We recruit users continuously.
  – Longer experiment = more users = more power?
  – Wrong! This doesn’t always get us more power: the confidence interval for Sessions-per-User doesn’t shrink over a month (see the sketch below).
3. CUPED: Controlled experiments Using Pre-Experiment Data.
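To see why more users beat longer duration, here is a small sketch (not from the talk) of the standard minimum-detectable-effect formula for comparing two means: sensitivity improves only like 1/sqrt(n), and extending the experiment helps only if it adds independent users without inflating the per-user variance, which is why the Sessions-per-User confidence interval can stay flat over a month.

```python
from math import sqrt
from scipy.stats import norm

def min_detectable_effect(sigma: float, n_per_group: int,
                          alpha: float = 0.05, power: float = 0.8) -> float:
    """Absolute minimum detectable effect for a two-sample test of means."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return z * sigma * sqrt(2.0 / n_per_group)

# Doubling the users shrinks the detectable effect only by sqrt(2):
print(min_detectable_effect(sigma=3.0, n_per_group=100_000))
print(min_detectable_effect(sigma=3.0, n_per_group=200_000))
```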
10. CUPED
• Currently live in Bing’s experiment system.
• Allows for running experiments with
  – Half the users, or
  – Half the duration.
• Leverages pre-experiment data to improve sensitivity.
• Intuition (mixture model): total variance = between-group variance + within-group variance. A sketch of the adjustment follows below.
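A minimal sketch of a CUPED-style adjustment, assuming the covariate is the same metric measured in the pre-experiment period (e.g. each user's pre-experiment session count); the data and variable names are illustrative.

```python
import numpy as np

def cuped_adjust(y: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Adjust in-experiment metric y with pre-experiment covariate x:
    y_cv = y - theta * (x - mean(x)), where theta = cov(x, y) / var(x).
    The variance of y_cv is (1 - corr(x, y)**2) times the variance of y."""
    theta = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    return y - theta * (x - x.mean())

# Illustration: pre-experiment sessions predict in-experiment sessions,
# so the adjusted metric has much less variance (and tighter intervals).
rng = np.random.default_rng(1)
pre = rng.poisson(5.0, size=50_000).astype(float)
post = pre + rng.normal(0.0, 1.0, size=pre.size)
print(post.var(ddof=1), cuped_adjust(post, pre).var(ddof=1))
```

Because the adjustment term has mean zero, the treatment-versus-control difference in means is unchanged; only the noise shrinks, which is what allows the same power with half the users or half the duration.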
11.
• One top reason not discussed: instrumentation bugs.
• For more insights, check out our papers (KDD’2012 and WSDM’2013) or find me at the networking session.
