A/B Testing @ Internet Scale
Ya Xu
8/12/2014 @ Coursera
A/B Testing in One Slide
20% 80%
Collect results to determine which one is better
Join now
Control Treatment
Outline
§ Culture Challenge
–  Why A/B testing
–  What to A/B test
§ Building a scalable experimentation system
§ Best practices
Why A/B Testing
Amazon Shopping Cart Recommendation
•  At Amazon, Greg Linden had this idea of showing
recommendations based on cart items
•  Trade-offs
•  Pro: cross-sell more items (increase average basket size)
•  Con: distract people from checking out (reduce conversion)
•  HiPPO (Highest Paid Person’s Opinion): stop the project
From Greg Linden’s Blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html
http://www.exp-platform.com/Documents/2012-08%20Puzzling%20Outcomes%20KDD.pptx
MSN Real Estate
§ “Find a house” widget variations
§ Revenue to MSN generated every time a user
clicks search/find button
A B
http://www.exp-platform.com/Documents/2012-08%20Puzzling%20Outcomes%20KDD.pptx
Take-away
Experiments are the only way to prove causality.
Use A/B testing to:
§ Guide product development
§ Measure impact (assess ROI)
§ Gain “real” customer feedback
What to A/B Test
Ads CTR Drop
Sudden drop
on 11/11/2013
Profile top ads
Root-Cause
5 Pixels!!
Navigation bar
Profile top ads
What to A/B Test
§ Evaluating new ideas:
–  Visual changes
–  Complete redesign of web page
–  Relevance algorithms
–  …
§ Platform changes
§ Code refactoring
§ Bug fixes
Test Everything!
Startups vs. Big Websites
§ Do startups have enough users to A/B test?
–  Startups typically look for larger effects
–  5% vs. 0.5% difference → 100 times more users! (required sample size grows with the inverse square of the effect size)
§ Startups should establish A/B testing culture
early
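The "100 times" figure can be checked with the usual normal-approximation sample-size formula for comparing two groups. This is a sketch under assumptions that are not on the slide: the same variance σ², significance level α, and power 1 − β in both scenarios, with δ the difference to detect.

```latex
n \approx \frac{2\,(z_{1-\alpha/2} + z_{1-\beta})^{2}\,\sigma^{2}}{\delta^{2}}
\quad\Longrightarrow\quad
\frac{n_{0.5\%}}{n_{5\%}} = \left(\frac{5\%}{0.5\%}\right)^{2} = 100
```

Everything except δ cancels in the ratio, so a 10 times smaller effect needs roughly 100 times the users per variant.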
A Scalable Experimentation System
A/B Testing 3 Steps
Design
•  What/Whom to experiment on
Deploy
•  Code deployment
Analyze
•  Impact on metrics
A/B Testing Platform Architecture
1.  Experiment Management
2.  Online Infrastructure
3.  Offline Analysis
Example: Bing A/B
1. Experiment Management
§ Define experiments
–  Whom to target?
–  How to split traffic?
§ Start/stop an experiment
§ Important addition:
–  Define success criteria
–  Power analysis
2. Online Infrastructure
1)  Hash & partition: random & consistent
2)  Deploy: server-side, as a change to
–  The default configuration (Bing)
–  The default code path (LinkedIn)
3)  Data logging
[Figure: Hash (ID) maps each ID onto a 0% to 100% range; Treatment1 and Treatment2 each take 20%, the rest is Control.]
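A minimal sketch of "hash & partition: random & consistent", assuming a SHA-256 hash of a salted ID and 100 buckets. The function names, the salt format, and the 20/20/60 split are illustrative assumptions, not the actual Bing or LinkedIn implementation.

```python
import hashlib


def bucket(unit_id: str, experiment: str, buckets: int = 100) -> int:
    """Map (experiment, unit_id) to a stable bucket in [0, buckets)."""
    # Hypothetical salt format: the experiment name is prepended to the ID.
    digest = hashlib.sha256(f"{experiment}:{unit_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets


def assign(unit_id: str, experiment: str) -> str:
    """Assumed split: 20% treatment1, 20% treatment2, 60% control."""
    b = bucket(unit_id, experiment)
    if b < 20:
        return "treatment1"
    if b < 40:
        return "treatment2"
    return "control"


if __name__ == "__main__":
    # The same ID always gets the same variant (consistent), while the
    # hash spreads IDs roughly evenly across buckets (random).
    for uid in ("member-1", "member-2", "member-3"):
        print(uid, assign(uid, "join-now-button"))
```

Because the assignment is a pure function of the ID and the experiment name, no per-user state has to be stored to keep a user in the same variant across requests.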
Hash & Partition @ Scale (I)
§ Pure bucket system (Google/Bing before 200X)
[Figure: the 0% to 100% traffic range split into disjoint buckets for Exp. 1, Exp. 2, and Exp. 3 (labels: 20%, 20%, 60%; variants red, green, yellow at 15%, 15%, 30%).]
•  Does not scale
•  Traffic management
Hash & Partition @ Scale (II)
§ Fully overlapping system
[Figure: Exp. 1 (A1, B1, control) and Exp. 2 (A2, B2, control) each span the full 0% to 100% traffic range.]
•  Each experiment gets 100% traffic
•  A user is in “all” experiments simultaneously
•  Randomization between experiments is independent
(unique hashID per experiment)
•  Cannot avoid interaction
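To illustrate why a unique hashID per experiment makes the randomizations independent, here is a small sketch. It assumes a SHA-256 hash salted with a hypothetical per-experiment `hash_id` and a 50/50 split, and counts how members fall into the four combinations of two overlapping experiments.

```python
import hashlib
from collections import Counter


def in_treatment(member_id: str, hash_id: str) -> bool:
    """50/50 split driven by a hash salted with the experiment's hash_id."""
    digest = hashlib.sha256(f"{hash_id}:{member_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100 < 50


def joint_counts(n_members: int = 20000) -> Counter:
    """Count members in each (exp1, exp2) treatment combination."""
    counts = Counter()
    for i in range(n_members):
        member = f"member-{i}"
        # Different hash_ids: being treated in exp1 says nothing about exp2.
        counts[(in_treatment(member, "exp1"), in_treatment(member, "exp2"))] += 1
    return counts


if __name__ == "__main__":
    for combo, count in sorted(joint_counts().items()):
        print(combo, count)
```

With distinct salts each of the four cells should hold roughly a quarter of the members. Reusing the same `hash_id` for both experiments would instead put every member in the same arm of both, which is the degenerate case the unique hashID avoids.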
Hash & Partition @ Scale (III)
§ Hybrid: Layer + Domain
•  Centralized management (Bing)
•  Central exp. team creates/manages layers/domains
•  De-centralized management (LinkedIn)
•  Each experiment is one “layer” by default
•  Experimenter controls hashID to create a “domain”
Data Logging
§  Trigger
§  Trigger-based logging
–  Log whether a request is actually affected by the
experiment
–  Log for both factual & counter-factual
[Figure: all LinkedIn members (300MM+), with the triggered subset being members visiting the contacts page.]
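One possible reading of trigger-based logging, sketched below: a trigger event is written only when a request reaches the page the experiment changes, and it is written for every variant, so control members who would have seen the change (the counter-factual) are logged alongside treated members (the factual). The `TriggerEvent` type, `handle_request` function, and the "contacts" page name are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TriggerEvent:
    member_id: str
    experiment: str
    variant: str  # the variant the member is actually assigned to


def handle_request(member_id: str, page: str, experiment: str,
                   variant: str, log: List[TriggerEvent]) -> bool:
    """Log a trigger event only if this request touches the experiment."""
    triggered = page == "contacts"  # assumed: experiment only changes this page
    if triggered:
        # Logged for control as well as treatment, so the analysis compares
        # like with like: members who did or would have seen the change.
        log.append(TriggerEvent(member_id, experiment, variant))
    return triggered


if __name__ == "__main__":
    events: List[TriggerEvent] = []
    handle_request("m1", "contacts", "contacts-redesign", "control", events)
    handle_request("m2", "home", "contacts-redesign", "treatment", events)
    print(events)
```

Restricting the analysis to triggered members removes the noise contributed by the large majority who never reach the changed page.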
3. Automated Offline Analysis
§  Large-scale data processing, e.g. daily @LinkedIn
–  200+ experiments
–  700+ metrics
–  Billions of experiment trigger events
§  Statistical analysis
–  Metrics design
–  Statistical significance test (p-value, confidence interval)
–  Deep-dive: slicing & dicing capability
§  Monitoring & alerting
–  Data quality
–  Early termination
Best Practices
Example: Unified Search
What to Experiment?
Measure one change at a time.
[Figure: En-US traffic split 50% unified search (experiments 1+2+…N) vs. 50% pre-unified search.]
What to Measure?
§ Success metrics: summarize whether
treatment is better
§ Puzzling example:
–  Key metrics for Bing: number of searches &
revenue
–  Ranking bug in experiment resulted in poor search
results
–  Number of searches up +10% and revenue up +30%, because poor results made users
issue more queries and click more ads
Success metrics should reflect long-term impact
Scientific Experiment Design
§ How long to run the experiment?
§ How much traffic to allocate to treatment?
Story:
§  Site speed matters
–  Bing: +100msec = -0.6% revenue
–  Amazon: +100msec = -1.0% revenue
–  Google: +100msec = -0.2% queries
§  But not for Etsy.com?
“Faster results better? … meh”
Power
§ Power: the chance of detecting a
difference when there really is one.
§ Two reasons your feature doesn’t move
metrics
1.  No “real” impact
2.  Not enough power
Properly power up your experiment!
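A rough power-analysis sketch for a conversion-rate metric, using the standard normal approximation for comparing two proportions. The baseline rate, lifts, and the defaults of α = 0.05 and 80% power are illustrative assumptions, not values from the talk.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(baseline_rate: float, relative_lift: float,
                            alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate users needed per variant to detect a relative lift."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


if __name__ == "__main__":
    # Hypothetical 10% baseline conversion rate.
    for lift in (0.05, 0.005):
        n = sample_size_per_variant(0.10, lift)
        print(f"relative lift {lift:.1%}: about {n} users per variant")
```

Running this before launch tells you whether the planned traffic allocation and duration can detect the effect you care about, which separates "no real impact" from "not enough power".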
Statistical Significance
§ Which experiment has a bigger impact?
Experiment 1 Experiment 2
Pageviews 1.5% 12.9%
Revenue 0.8% 2.4%
Statistical Significance
§ Must consider statistical significance
–  A 12.9% delta can still be noise!
–  Identify signal from noise; focus on the “real” movers
–  Ensure results are reproducible
             Experiment 1                Experiment 2
Pageviews    1.5%                        12.9%
Revenue      0.8% (stat. significant)    2.4%
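The point that a large delta can still be noise can be sketched with a two-proportion z-test that returns a p-value and a confidence interval. The counts below are made up for illustration and are not the data behind the table; the pooled z-test is one common choice, not necessarily the test used in the talk.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_test(x_c: int, n_c: int, x_t: int, n_t: int,
                        confidence: float = 0.95):
    """Return (p_value, (ci_low, ci_high)) for the difference p_t - p_c."""
    p_c, p_t = x_c / n_c, x_t / n_t
    pooled = (x_c + x_t) / (n_c + n_t)
    se_pooled = sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_t))
    z = (p_t - p_c) / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided
    se = sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    margin = NormalDist().inv_cdf(0.5 + confidence / 2) * se
    return p_value, (p_t - p_c - margin, p_t - p_c + margin)


if __name__ == "__main__":
    # Hypothetical: a 13% relative lift on 1,000 users per variant ...
    print(two_proportion_test(100, 1_000, 113, 1_000))
    # ... versus a 0.8% relative lift on 10 million users per variant.
    print(two_proportion_test(1_000_000, 10_000_000, 1_008_000, 10_000_000))
```

In this made-up example the larger delta comes with a confidence interval that spans zero, while the smaller delta does not, which is why the size of a delta alone does not say which experiment has the bigger real impact.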
Multiple Testing
§ Famous xkcd comic on Jelly Beans
Multiple Testing Concerns
§ Multiple ramps
–  Pre-decide a ramp to base decision on (e.g. 50/50)
§ Multiple “peeks”
–  Rely on “full”-week results
§ Multiple variants
–  Choose the best, then rerun to see whether the result replicates
§ Multiple metrics
An irrelevant metric is statistically
significant. What to do?
§  Which metric?
§  How “significant”? (p-value)
Tiered significance thresholds:
–  1st order metrics (directly impacted by exp.): p-value < 0.05
–  2nd order metrics (maybe impacted by exp.): p-value < 0.01
–  All metrics: p-value < 0.001
Watch out for multiple testing
With 100 metrics, how many would you see stat. significant
even if your experiment does NOTHING? About 5, at a significance level of 0.05 (100 × 0.05 = 5).
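The "about 5 out of 100" figure can be checked with a small A/A simulation. This is a sketch under assumptions: 100 independent, normally distributed metrics, identical distributions in both groups (the experiment does nothing), and a z-test at α = 0.05. All names and sizes are illustrative.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, variance


def count_false_positives(n_metrics: int = 100, n_users: int = 200,
                          alpha: float = 0.05, seed: int = 0) -> int:
    """Count metrics that look significant when there is no real effect."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_metrics):
        # Both groups are drawn from the same distribution: an A/A test.
        control = [rng.gauss(0.0, 1.0) for _ in range(n_users)]
        treatment = [rng.gauss(0.0, 1.0) for _ in range(n_users)]
        se = sqrt(variance(control) / n_users + variance(treatment) / n_users)
        z = (mean(treatment) - mean(control)) / se
        p_value = 2 * (1 - NormalDist().cdf(abs(z)))
        hits += p_value < alpha
    return hits


if __name__ == "__main__":
    for seed in range(5):
        print(f"seed {seed}: {count_false_positives(seed=seed)} of 100 metrics significant")
```

Individual runs vary, but the count averages out near 5, so a handful of significant metrics in a 100-metric report is what a feature with no effect at all is expected to produce.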
References
§  Tang, Diane, et al. Overlapping Experiment Infrastructure: More, Better,
Faster Experimentation. KDD 2010: Proceedings of the 16th ACM SIGKDD international
conference on Knowledge discovery and data mining. 2010.
§  Kohavi, Ron, et al. Online Controlled Experiments at Large Scale. KDD
2013: Proceedings of the 19th ACM SIGKDD international conference on
Knowledge discovery and data mining. 2013.
§  LinkedIn blog post:
http://engineering.linkedin.com/ab-testing/xlnt-platform-driving-ab-testing-linkedin
Additional Resources: RecSys’14 A/B testing workshop

Talks@Coursera - A/B Testing @ Internet Scale

  • 1. A/B Testing @ Internet Scale Ya Xu 8/12/2014 @ Coursera
  • 2. A/B Testing in One Slide 20%80% Collect results to determine which one is better Join now Control Treatment
  • 3. Outline § Culture Challenge –  Why A/B testing –  What to A/B test § Building a scalable experimentation system § Best practices 3
  • 5. Amazon Shopping Cart Recommendation 5 •  At Amazon, Greg Linden had this idea of showing recommendations based on cart items •  Trade-offs •  Pro: cross-sell more items (increase average basket size) •  Con: distract people from checking out (reduce conversion) •  HiPPO (Highest Paid Person’s Opinion) : stop the project From Greg Linden’s Blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html http://www.exp-platform.com/Documents/2012-08%20Puzzling%20Outcomes%20KDD.pptx
  • 6. MSN Real Estate § “Find a house” widget variations § Revenue to MSN generated every time a user clicks search/find button 6 A B http://www.exp-platform.com/Documents/2012-08%20Puzzling%20Outcomes%20KDD.pptx
  • 7. Take-away Experiments are the only way to prove causality. 7 Use A/B testing to: § Guide product development § Measure impact (assess ROI) § Gain “real” customer feedback
  • 8. What to A/B Test 8
  • 9. Ads CTR Drop 9 Sudden drop on 11/11/2013 Profile top ads
  • 11. What to A/B Test § Evaluating new ideas: –  Visual changes –  Complete redesign of web page –  Relevance algorithms –  … § Platform changes § Code refactoring § Bug fixes 11 Test Everything!
  • 12. Startups vs. Big Websites § Do startups have enough users to A/B test? –  Startups typically look for larger effects –  5% vs. 0.5% difference è 100 times more users! § Startups should establish A/B testing culture early 12
  • 14. A/B Testing 3 Steps 14 Design •  What/Whom to experiment on Deploy •  Code deployment Analyze •  Impact on metrics
  • 15. A/B Testing Platform Architecture 1.  Experiment Management 2.  Online Infrastructure 3.  Offline Analysis 15 Example: Bing A/B
  • 16. 1. Experiment Management § Define experiments –  Whom to target? –  How to split traffic? § Start/stop an experiment § Important addition: –  Define success criteria –  Power analysis 16
  • 17. 2. Online Infrastructure 1)  Hash & partition: random & consistent 2)  Deploy: server-side, as a change to –  The default configuration (Bing) –  The default code path (LinkedIn) 3)  Data logging 17 0% 100% Treatment1 D20% D20% Hash (ID) Treatment2 Control
  • 18. Hash & Partition @ Scale (I) § Pure bucket system (Google/Bing before 200X) 18 0% 100% Exp. 1 D20% D20% Exp. 2 Exp. 3 60% red green yellow 15% 15%30% •  Does not scale •  Traffic management
  • 19. Hash & Partition @ Scale (II) § Fully overlapping system • Each experiment gets 100% traffic • A user is in “all” experiments simultaneously • Randomization between experiments is independent (unique hashID) • Cannot avoid interaction [Figure: Exp. 1 (control, A1, B1) and Exp. 2 (control, A2, B2) each span the full 0% to 100% traffic range]
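The independence claim for the fully overlapping system can be checked with a small sketch: when each experiment hashes with its own hash ID, knowing a member's variant in one experiment says nothing about their variant in another. The hash function, the `exp-1` and `exp-2` hash IDs, and the 50/50 splits are assumptions for illustration.

```python
import hashlib

def in_treatment(member_id: str, hash_id: str) -> bool:
    """50/50 split driven by the experiment's own hash ID plus the member ID."""
    digest = hashlib.sha256(f"{hash_id}:{member_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 2 == 0

# Synthetic member IDs; every member is in both experiments at once.
members = [f"member-{i}" for i in range(20000)]
exp1_treated = [m for m in members if in_treatment(m, "exp-1")]
treated_in_both = [m for m in exp1_treated if in_treatment(m, "exp-2")]

share_exp1 = len(exp1_treated) / len(members)
# If the two randomizations are independent, this stays close to 0.5 as well.
share_exp2_given_exp1 = len(treated_in_both) / len(exp1_treated)
print(share_exp1, share_exp2_given_exp1)
```

Independent randomization balances the other experiment's variants across each arm, but as the slide notes it does not prevent two treatments from interacting when a member receives both.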
  • 20. Hash & Partition @ Scale (III) § Hybrid: Layer + Domain • Centralized management (Bing) • Central exp. team creates/manages layers/domains • De-centralized management (LinkedIn) • Each experiment is one “layer” by default • Experimenter controls hashID to create a “domain”
  • 21. Data Logging § Trigger § Trigger-based logging – Log whether a request is actually affected by the experiment – Log for both factual & counter-factual [Figure: all LinkedIn members (300MM+); triggered: members visiting contacts page]
  • 22. 3. Automated Offline Analysis § Large-scale data processing, e.g. daily @LinkedIn – 200+ experiments – 700+ metrics – Billions of experiment trigger events § Statistical analysis – Metrics design – Statistical significance test (p-value, confidence interval) – Deep-dive: slicing & dicing capability § Monitoring & alerting – Data quality – Early termination
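One common way to produce the p-value and confidence interval mentioned above is a two-sample z-test on per-member metric averages. The sketch below is an illustration under that assumption, not the analysis pipeline described in the talk; the summary statistics in the example are made up.

```python
from math import sqrt
from statistics import NormalDist

def compare(mean_c, var_c, n_c, mean_t, var_t, n_t, alpha=0.05):
    """Two-sample z-test: returns (difference, two-sided p-value, confidence interval)."""
    diff = mean_t - mean_c
    std_err = sqrt(var_c / n_c + var_t / n_t)
    z = diff / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    half_width = NormalDist().inv_cdf(1 - alpha / 2) * std_err
    return diff, p_value, (diff - half_width, diff + half_width)

# Made-up summary statistics: pageviews per member in control vs. treatment.
diff, p_value, ci = compare(mean_c=10.0, var_c=25.0, n_c=10_000,
                            mean_t=10.2, var_t=25.0, n_t=10_000)
print(diff, p_value, ci)
```

With these numbers the interval excludes zero and the p-value falls below 0.01, so the difference would be reported as statistically significant at the 5% level.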
  • 25. What to Experiment? Measure one change at a time. [Figure: Unified Search (Experiments 1+2+…N) on 50% En-US vs. pre-unified search on 50% En-US]
  • 26. What to Measure? § Success metrics: summarize whether treatment is better § Puzzling example: – Key metrics for Bing: number of searches & revenue – Ranking bug in experiment resulted in poor search results – Number of searches up 10% and revenue up 30% Success metrics should reflect long-term impact
  • 27. Scientific Experiment Design § How long to run the experiment? § How much traffic to allocate to treatment? Story: § Site speed matters – Bing: +100msec = -0.6% revenue – Amazon: +100msec = -1.0% revenue – Google: +100msec = -0.2% queries § But not for Etsy.com? “Faster results better? … meh”
  • 28. Power § Power: the chance of detecting a difference when there really is one. § Two reasons your feature doesn’t move metrics 1. No “real” impact 2. Not enough power Properly power up your experiment!
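To make "not enough power" concrete, the sketch below estimates the power of a two-sided two-proportion z-test for a given sample size, using the normal approximation. The 10% baseline rate, the 5% relative lift, and the two sample sizes are assumptions chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist

def power(baseline_rate, relative_lift, users_per_variant, alpha=0.05):
    """Approximate chance of a significant result when the lift is real."""
    delta = baseline_rate * relative_lift
    std_err = sqrt(2 * baseline_rate * (1 - baseline_rate) / users_per_variant)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    shift = delta / std_err
    # Probability of landing beyond either critical value under the true effect.
    return NormalDist().cdf(shift - z_crit) + NormalDist().cdf(-shift - z_crit)

# Same hypothetical feature, two traffic allocations.
underpowered = power(0.10, 0.05, users_per_variant=10_000)
well_powered = power(0.10, 0.05, users_per_variant=60_000)
print(round(underpowered, 2), round(well_powered, 2))
```

Under these assumptions the smaller experiment would detect a real 5% lift only about one time in five, so a flat result there says little about whether the feature works.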
  • 29. Statistical Significance § Which experiment has a bigger impact? [Table: Experiment 1 / Experiment 2; Pageviews 1.5% / 12.9%; Revenue 0.8% / 2.4%]
  • 30. Statistical Significance § Which experiment has a bigger impact? Experiment 1 Experiment 2 Pageviews 1.5% 12.9% Revenue 0.8% Stat. significant 2.4%
  • 31. Statistical Significance § Must consider statistical significance – A 12.9% delta can still be noise! – Identify signal from noise; focus on the “real” movers – Ensure results are reproducible Experiment 1 Experiment 2 Pageviews 1.5% 12.9% Revenue 0.8% Stat. significant 2.4%
  • 32. Multiple Testing § Famous xkcd comic on Jelly Beans
  • 33. Multiple Testing Concerns § Multiple ramps – Pre-decide a ramp to base decision on (e.g. 50/50) § Multiple “peeks” – Rely on “full”-week results § Multiple variants – Choose the best, then rerun to see if the result replicates § Multiple metrics
  • 34. An irrelevant metric is statistically significant. What to do? § Which metric? § How “significant”? (p-value) [Figure labels: All metrics, 2nd order metrics, 1st order metrics; p-value < 0.05, p-value < 0.01, p-value < 0.001; Directly impacted by exp., Maybe impacted by exp.] Watch out for multiple testing: with 100 metrics, how many would you see stat. significant even if your experiment does NOTHING? Answer: 5
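The answer of 5 on the slide above is the expected count at a 0.05 threshold: 100 metrics × 0.05 = 5. The simulation below is a sketch that assumes the 100 metrics are independent, so that under a do-nothing experiment each p-value is uniform on [0, 1]; real metrics are usually correlated, which changes the spread but not the expected count.

```python
import random

def false_positives(num_metrics=100, alpha=0.05, rng=random):
    """Count metrics that look significant when the experiment has no effect."""
    # Under the null hypothesis a p-value is uniformly distributed on [0, 1].
    return sum(rng.random() < alpha for _ in range(num_metrics))

rng = random.Random(0)  # fixed seed so the run is repeatable
counts = [false_positives(rng=rng) for _ in range(2000)]
average = sum(counts) / len(counts)

# Chance that at least one of 100 independent metrics is falsely significant.
p_any = 1 - (1 - 0.05) ** 100
print(average, p_any)
```

The average lands near 5, and the chance of seeing at least one false positive is above 99%, which is why a single significant but irrelevant metric deserves a stricter threshold.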
  • 35. References § Tang, Diane, et al. Overlapping Experiment Infrastructure: More, Better, Faster Experimentation. Proceedings of the 16th Conference on Knowledge Discovery and Data Mining. 2010. § Kohavi, Ron, et al. Online Controlled Experiments at Large Scale. KDD 2013: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2013. § LinkedIn blog post: http://engineering.linkedin.com/ab-testing/xlnt-platform-driving-ab-testing-linkedin Additional Resources: RecSys’14 A/B testing workshop