Why Didn’t My Feature Improve the Metric?
Ya Xu
Based on two papers (KDD’2012 and WSDM’2013) with Ronny Kohavi, Alex Deng, Toby Walker, Brian Frasca and Roger Longbotham
Experimentation Panel, 3/20/2013
What Metric?
• Overall Evaluation Criterion (OEC): the metric(s) used to decide whether A or B is better.
• Long-term goal for Bing: query share & revenue
• Puzzling outcome:
  – A ranking bug in an experiment resulted in very poor search results
  – Queries per user up +10% and revenue up +30%
  – What should a search engine use as its OEC?
• We use Sessions-per-User (a sketch of computing it follows below).
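To make the metric choice concrete, here is a minimal Python sketch, not from the talk: the log schema, the session IDs, and the sessionization rule are all assumptions. It computes Sessions-per-User alongside Queries-per-User from per-query logs; a ranking bug that forces users to issue more queries inflates Queries-per-User, while Sessions-per-User tracks engagement more faithfully:

```python
import pandas as pd

# Hypothetical log schema: one row per query, with a user_id, the
# experiment variant, and a session_id (queries are often grouped into
# sessions by a 30-minute inactivity rule; the talk does not specify).
logs = pd.DataFrame({
    "user_id":    [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "variant":    ["A", "A", "A", "A", "A", "B", "B", "B", "B"],
    "session_id": [10, 10, 11, 20, 21, 30, 30, 30, 31],
})

def sessions_per_user(df: pd.DataFrame) -> float:
    """Mean number of distinct sessions per user: the OEC here."""
    return df.groupby("user_id")["session_id"].nunique().mean()

def queries_per_user(df: pd.DataFrame) -> float:
    """Queries per user: inflated by a ranking bug, hence a poor OEC."""
    return df.groupby("user_id").size().mean()

for name, grp in logs.groupby("variant"):
    print(name, sessions_per_user(grp), queries_per_user(grp))
```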
REASON #1
The feature just wasn’t as good as you thought…
We are poor at assessing the value of ideas.
Jim Manzi: “Google ran approximately 12,000 randomized experiments in 2009, with [only] about 10% of these leading to business changes.”
Background
• Puzzling outcome:
  – Several experiments showed surprising results
  – Rerunning them made the effects disappear
  – Why?
• Bucket system (Bing/Google/Yahoo):
  – Assign users into buckets, then assign buckets to experiments.
  – Buckets are reused from one experiment to the next.
Carryover Effect
• Explanation:
  – The bucket system recycles users, so effects from a prior experiment carry over into the next one
  – These effects can last for months
• Solution:
  – Run an A/A test between experiments to detect carryover
  – Local re-randomization (one hash-based implementation is sketched below)
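One common way to implement local re-randomization is to hash the user ID together with an experiment-specific salt, so assignments are independent across experiments instead of inheriting a prior experiment's buckets. This is a sketch under my own assumptions, not the talk's actual system; all names are illustrative:

```python
import hashlib

def assign_variant(user_id: str, experiment_id: str,
                   variants: tuple = ("A", "B")) -> str:
    """Assign a user to a variant by hashing user_id with an
    experiment-specific salt. Because the salt differs per experiment,
    assignments are freshly randomized for each experiment, removing
    carryover from whatever bucket the user occupied previously."""
    key = f"{experiment_id}:{user_id}".encode()
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % len(variants)
    return variants[bucket]

# The same user can land in different variants of different experiments:
print(assign_variant("user42", "exp-ranking"))
print(assign_variant("user42", "exp-ui"))
```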
Background
• Performance matters:
  – Bing: +100 msec = -0.6% revenue
  – Amazon: +100 msec = -1% revenue
  – Google: +100 msec = -0.2% queries
• But not for Etsy.com? “faster results better? Meh”
• Insensitive experimentation can lead to the wrong conclusion that a feature has no impact (see the power calculation below).
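To see why insensitivity matters, the standard two-sample power calculation below estimates how many users each variant needs before a small shift becomes detectable at all. This is textbook statistics, not from the talk, and the numbers plugged in are made up:

```python
from scipy.stats import norm

def users_per_variant(sigma: float, delta: float,
                      alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate users needed per variant to detect a mean shift of
    `delta` in a metric with standard deviation `sigma`, using a
    two-sided z-test: n = 2 * (z_{1-alpha/2} + z_{1-beta})^2 * (sigma/delta)^2."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    return int(2 * (z_alpha + z_beta) ** 2 * (sigma / delta) ** 2) + 1

# Detecting a 0.6% shift on a metric with mean 1 and std 0.3
# (illustrative values) takes roughly 39,000 users per variant:
print(users_per_variant(sigma=0.3, delta=0.006))
```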
How to Achieve Better Sensitivity?
1. Get more users
2. Run longer experiments:
   – We recruit users continuously, so longer experiment = more users = more power?
   – Wrong! This doesn’t always get us more power: the confidence interval for Sessions-per-User doesn’t shrink even over a month (a toy illustration follows below).
3. CUPED: Controlled Experiments Using Pre-Experiment Data
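The point in item 2 can be illustrated with a toy simulation built entirely on my own assumptions (Poisson session counts, user observation windows spread uniformly over the experiment). Because Sessions-per-User is cumulative, running longer grows both the sample size and the per-user variance, so the confidence interval does not shrink; in this toy model it actually widens:

```python
import numpy as np

rng = np.random.default_rng(0)

for days in (7, 14, 28):
    n_users = 1000 * days                  # users trickle in over time
    rates = rng.gamma(2.0, 0.5, n_users)   # per-user session rate per day
    # Each user has been observed between 1 day and the full duration.
    observed = rng.uniform(1, days, n_users)
    sessions = rng.poisson(rates * observed)
    mean = sessions.mean()
    # Both the mean and the std of sessions grow with duration, so the
    # 95% CI half-width does not shrink despite the larger sample.
    ci_half = 1.96 * sessions.std(ddof=1) / np.sqrt(n_users)
    print(f"{days:2d} days: sessions/user = {mean:.2f} +/- {ci_half:.3f}")
```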
CUPED
• Currently live in Bing’s experiment system
• Allows for running experiments with:
  – Half the users, or
  – Half the duration
• Leverages pre-experiment data to improve sensitivity
• Intuition (mixture model): total variance = between-group variance + within-group variance. Grouping users by their pre-experiment behavior lets us remove the between-group component, leaving only within-group variance in the comparison (minimal sketch below).
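Below is a minimal sketch of the covariate form of the CUPED adjustment from the WSDM’2013 paper: regress the in-experiment metric on its pre-experiment counterpart and subtract the explained part. The simulated data are made up; in practice theta is estimated once from both variants pooled:

```python
import numpy as np

def cuped_adjust(y: np.ndarray, x: np.ndarray) -> np.ndarray:
    """CUPED: use the pre-experiment metric x as a control covariate for
    the in-experiment metric y. Variance shrinks by a factor of
    (1 - rho^2), where rho = corr(x, y); the mean is unchanged."""
    theta = np.cov(x, y)[0, 1] / x.var(ddof=1)
    return y - theta * (x - x.mean())

rng = np.random.default_rng(1)
x = rng.normal(10, 3, 10_000)            # pre-experiment sessions/user
y = 0.8 * x + rng.normal(2, 2, 10_000)   # correlated in-experiment metric
y_adj = cuped_adjust(y, x)
print(y.var(), y_adj.var())     # adjusted variance is much smaller
print(y.mean(), y_adj.mean())   # means agree
```

Since the same adjustment applies to every variant, the treatment effect estimate is unchanged while its confidence interval tightens, which is what allows half the users or half the duration.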
• One top reason not discussed: instrumentation bugs
• For more insights, check out our papers (KDD’2012 and WSDM’2013) or find me at the networking session