The document discusses building a culture of experimentation at Pinterest and outlines a maturity model for experimentation. It describes 5 stages for experimentation maturity: get started, get big, get better, get out, and get tools. For each stage, it identifies common problems, such as underutilization or needing guidance, and provides recommendations for how to address them, like evangelizing experiments or implementing processes to help scale experimentation. The overall aim is to establish a systematic approach to experimentation that helps transition an organization from initial experimentation to widespread experimentation supported by automation and analysis tools.
28. @arburbank
Write down the process
What mistakes do you see in experiments?
What questions do you answer repeatedly?
How will learning this help others?
47. @arburbank
Implement a process
1. Checklists for experiments
2. @experiments-help mention in code review
3. e+ as part of code review
4. Mailing list: experiments-help@
5. Experiment document template
6. Rotation of experiment helpers
49. @arburbank
Train your successors
So you want to be an experiment helper?
• Step 1: read the documentation
• Step 2: take the experiment quiz
• Step 3: review all experiments for a week
Hi there! I’m Andrea Burbank, and I’m a data scientist at Pinterest.
WHAT MOVES THE NEEDLE
When I was asked to speak at a conference for data scientists, I thought for a while about what that means, and which aspects of my experience would be most interesting to folks who are working on the same sorts of problems that I tackle every day. What I ultimately decided was that the work that moves the needle at Pinterest wasn’t just the analysis we do to understand our ecosystem or to predict user engagement, but the culture of experimentation we’ve built up across the entire company. It wasn’t something that happened overnight, and I hope that by sharing our experience I can help you scale data science at your companies as well.
As data scientists, we often think of ourselves as a hybrid between a software engineer and a statistician, blending the best of both to build a talented data machine. When we hit on a problem like AB testing, we tend to approach it from that perspective: what tools and frameworks should I engineer and what statistical comparisons are most relevant in order to build a successful AB testing program?
Those are both tremendously important tools, obviously. But what will end up making or breaking your experimentation program is neither of those: it’s the people. It’s building up a culture of AB testing, one person at a time.
Perhaps you’ve heard of a notion of an organizational maturity model. In software, there are basic steps you follow to improve your software engineering quality:
Use source control, write unit tests, and so on.
MODEL + PEOPLE + ANTICIPATE
Along those lines, I’d like to propose a model for the cultural maturity of experimentation. Every time you solve the big problem facing you at the moment, you move on to the next stage of experimentation, and you create a new problem.
And fundamentally, each of the problems you face is about the people and the culture, and the solutions you form are only as successful as the culture you foster to nourish them. For us, we didn’t recognize this pattern until we’d already stumbled partway through this evolution, and even when we recognized that the solution was in the culture and the people, it took us a while to embrace that approach.
My hope is that by talking about these stages I can help you, unlike us, to recognize the stage of the maturity model you’re currently in, to frame and solve it as a human problem, and then to start to anticipate the next phase before it becomes absolutely necessary.
So what are those stages? I’d say they look like this.
Get STARTED. Get BIG. Get BETTER. Get OUT. Get TOOLS.
Let’s dive in.
Stage 1: get started. This is where you actually build the experiment framework.
The problem: people are making bad decisions. Maybe they’re shipping things willy-nilly without measuring them at all, or maybe they’re watching trends over time and attributing change to newly released products when in fact the change might be completely unrelated. So you decide to build an AB testing framework.
FRAMEWORK + PIPELINE + UI -> WE MADE IT!
In my first couple months at Pinterest, I built up our experiment framework to have all the capabilities I thought were important:
- triggering users at the moment the experiment actually affected their experience,
- keeping track of novelty effects, and
- functioning correctly for offline experiments.
I built a data pipeline to capture all the most important metrics automatically and a UI to surface those metrics. I ran a few experiments myself, validated the findings with A/A tests and on real experiments, and figured AHA! We’d made it. Now we could run experiments and actually understand the effects of the feature changes we made.
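The triggering idea — counting a user only at the moment the experiment actually affects their experience — can be sketched roughly like this. This is a hypothetical illustration, not Pinterest’s actual framework; every name here is made up:

```python
import hashlib

def assign_group(user_id: str, experiment: str,
                 groups: tuple = ("control", "treatment")) -> str:
    """Deterministically bucket a user with a salted hash, so the same
    user always lands in the same group for a given experiment."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return groups[int(digest, 16) % len(groups)]

def get_group_and_trigger(user_id: str, experiment: str,
                          exposure_log: list) -> str:
    """Log exposure only when this is called, i.e. at the moment the user
    actually hits the tested surface. Users who never reach the feature
    never enter the analysis, so they don't dilute the measured effect."""
    group = assign_group(user_id, experiment)
    exposure_log.append((user_id, experiment, group))
    return group
```

The key property is determinism: a user can be re-bucketed on every request and always see the same variant, without any assignment storage.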
When you’ve toiled and coded and tested and built, you may think you’re done. After all, you now have a working framework. But in fact, you only just got started.
ACTUALLY USE IT
The next stage is to get BIG. What I mean by that is to get people to actually use the framework you’ve built.
DRIVE ADOPTION
The problem you’re facing now is that your framework on its own is useless; you need to drive adoption. I think it’s easy … to underestimate how important this phase is.
A GREAT PRODUCT SPEAKS FOR ITSELF
I think it’s easy to underestimate how important this phase is. Again, we are engineers. There’s a part of us that really wants to believe that a great product speaks for itself. It’s so tremendously useful! You’ve anticipated all the use cases and made it easy to actually understand the effect your feature is having on users! How on earth is this not the holy grail??
Unfortunately, this is almost never actually true. Even high-quality tools don’t magically attract users. So stage 2 is about getting people to actually adopt your new framework, to buy into the idea of running experiments.
Once you have your experiment framework in place, your #1 priority is to get people to use it. That means you do marketing.
That means you do evangelism.
TECH TALKS + DEMO + PM + BENEFITS + STRATEGIC PROJECT
That means that you have to be a salesman (or woman).
Give tech talks. Do a demo. Give impassioned speeches to anyone who will listen.
Whenever you hear about a feature going out, go find the PM, chat with the engineer, try to convince them to run an experiment.
Tell them what they will learn, how they will benefit, how easy it will be.
Find a strategically important project, suggest running it as an experiment, and don’t take no for an answer.
SHOW VALUE -> AGAIN AND AGAIN
If you demonstrate the metrics impact of a strategic initiative, or you earn call-outs at the company all-hands for lifting a metric by 5%, or you help the company avoid a huge mistake, people like it. They want you to do it again. And again.
And now, you’ve done it: you got big.
LOTS OF PEOPLE -> NUDGES
In stage 3, your experiment framework is big and things are going swimmingly. People start running lots of experiments, and they firmly believe that running an AB test is the best way to understand the performance of their feature.
But now that you’re not the person running all the experiments, you find that they need some nudges here and there to make their experiments run correctly. Instead of evangelizing, you spend your time helping people run experiments: come up with a hypothesis, determine how they’ll detect failure, consider how changes might affect individual users’ experience.
DECIDE ON OWN -> NEED YOUR GUIDANCE
In stage 2, no one was trying to run experiments unless you cajoled them into it, so you were always right there to help with implementation. Now that folks have bought in and are doing it on their own, guidance is needed, and you become the human to provide that guidance.
FUN!
And depending on your personality, your patience, and how quickly your company is growing, this stage might last a while. If you’re in this stage now, you might think it’s pretty great. Your framework is getting used, people are making good decisions, and you have the added perk that you get to be connected to feature development across the whole company, so you always know what’s going on.
And honestly, that’s a lot of fun.
SPOF! NAPA
But after a while, you realize that you are a single point of failure. When you’re not there, people ship experiments when there aren’t enough users. They add new variants without thinking about how to measure them. They start experiments that accidentally trigger for everyone instead of only the affected users.
For me, stage 3 became suboptimal pretty abruptly when I found myself trying to do code reviews on my iPhone while on my anniversary bike trip in Napa.
GO INSANE OR STOP LEARNING
Now, I hit this problem fairly quickly because Pinterest was growing at a breakneck pace. You might think you can last in stage 3 for a while, or even indefinitely. But as someone who enjoyed that stage tremendously, I’d advise against it. If your organization grows and you don’t scale, the culture will spin apart and you’ll go insane. If it doesn’t grow, you’ll keep needing to play the same role of experiments diva, and you won’t get a chance to learn what else you can contribute.
This is important. It’s not your career goal to be the experiments person. (You should have higher ambitions.)
Making experiments run is important. It’s interesting. But it’s not what you should be doing with the rest of your life.
HELP PEOPLE SUCCEED.
So that’s stage three. Once you have momentum behind your experiment framework, help people succeed with it.
Help them think through their setup and their data. Help them figure out whether they have enough people, or it’s not working for a subset of the population. Help review their code, check their triggering, and figure out how to relaunch when things go wrong. Having you in the loop will make your company’s experiments successful.
But also: start thinking about how you can move on to stage 4.
TEACH OTHERS
In stage 4, you start thinking about how you can teach others to fulfill the role you’ve been taking on in helping people run successful experiments, and how you can get out. In stage 4, you start to scale yourself.
Scale yourself. Figure out what you do and write it down. Develop repeatable processes, guidelines, checklists.
LIST ERRORS
At this point, you’ve been helping people with experiments for a while. What mistakes do you see happening? What questions do you answer repeatedly? And how can you get others to want to understand experiments better?
The first thing I did was try to write down every problem I’d seen in an experiment.
I dug up that list when I was writing this talk.
It was three pages long in small font.
When I shared this list with a coworker, he said:
TRAIN ENGINEERS -> PEOPLE PROBLEM + PEOPLE SOLUTION
I rephrased his reaction as a question: how do you train engineers to run experiments accurately? Again, this is a people problem, and the solution again comes from people.
The answer: make your process clear and easily repeatable. If you haven’t read Atul Gawande’s The Checklist Manifesto, you should. Even the most complex human processes (performing surgery, flying a plane, building a skyscraper) can be improved by simple checklists. There are so many pieces to keep track of that a simple list can help you get the important things right.
To make a checklist: for every important mistake, explain why it’s wrong and how to avoid it.
LIFECYCLE
We also thought about the experiment lifecycle. In the end, there are three major phases of every experiment. First, the experiment has to launch. Before it actually takes off with users on board, you want to make sure that it’s configured correctly so you can learn what you want.
Once an experiment is in flight, we may need to make adjustments. Perhaps we had an idea for a new take on the feature we’re testing, or we just want to increase our experimental power to measure the effect on a larger population.
And finally, when we’re ready to land the experiment, we need to make sure that we’ve learned what we want to learn and that we’re making the right decision from the data.
So we built checklists: what should you watch for in each of these phases?
THINKING
Launch is the most important thing. If the experiment is trying to measure the wrong thing or is set up incorrectly, you won’t learn anything from it. Before an experiment begins, most of the work is in the thinking. What are you trying to do? Why? Can you measure what you want to change?
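One concrete piece of that pre-launch thinking is whether the experiment can detect the change you care about at all. A standard sample-size calculation for a rate metric, using the two-proportion normal approximation (a sketch I’m adding for illustration; it isn’t from the talk), looks like:

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(p_base: float, mde: float,
                          alpha: float = 0.05, power: float = 0.80) -> int:
    """Users needed per group to detect an absolute lift of `mde`
    on a baseline rate `p_base`, with a two-sided test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value, two-sided
    z_beta = NormalDist().inv_cdf(power)           # power requirement
    p_avg = p_base + mde / 2                       # average rate across groups
    variance = 2 * p_avg * (1 - p_avg)             # sum of both groups' variances
    return ceil((z_alpha + z_beta) ** 2 * variance / mde ** 2)
```

Detecting a one-point lift on a 10% baseline needs roughly 15,000 users per group, and halving the detectable effect roughly quadruples the requirement — which is why “do we have enough users?” belongs on the launch checklist.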
WHY CHANGE?
Sometimes an experiment owner will want to make changes to an experiment after launch. Usually they want to increase group sizes to get more statistical power, but sometimes they want to change the population they’re measuring or add new types of treatment. Sometimes they want to change an experiment but they haven’t actually checked to see whether it’s working as expected.
All of these things then turned into the in-flight checklist.
READY? RIGHT DECISION?
Lastly, at some point every experiment should be shut down. Sometimes people try to shut it down too early, before they have enough data or before we can understand the long-term effects. Or they’re right that it should be shut down now, but they turn it off when it should actually be shipped because metrics are up, or they ship it without acknowledging that metrics are down. The landing checklist tries to anticipate these issues and make sure we’re avoiding them.
IMPLEMENT: NO FRICTION + GET OTHERS ON BOARD
But all the checklists in the world are meaningless if nobody implements them. So we spent a while thinking about two things: how we could try to improve the quality of experiments being run without introducing too much friction, and how we could get others on board to help monitor experiments’ quality.
PIGGYBACK ON R+ = OK. OPTIONAL IN THEORY. INTERNS. E+. YOUR CULTURE.
To improve the quality of experiments without introducing friction, we piggybacked on the concept we already have of getting an r+.
If you’re not familiar with an r+, it’s a naming convention that we adopted from Mozilla, but it’s just a way of signing off on a code review. When a code reviewer signs off with r+ on a review, it means that they think the new code improves the codebase.
We had a culture of r+ that we stole for e+. We never said it was mandatory, just that it was recommended, but practically it was mandatory. No one ships code without an r+ except for new people and interns.
For e+, we took that and said, hey, just as with code review, making sure that an experiment goes out correctly is critical. When you set up or change an experiment in a code review, ask someone who knows about experiments to take a look at it and provide feedback on your experiment setup.
You need to find something that works within your culture. We could leverage this part of our existing engineering culture to create improved experiments. What is that lever at your company?
The other key to our success was getting others on board to be the experiment reviewers. I think there were a couple pieces that were important here:
1) Calling it experiments-help. We considered experiments on-call but who finds on-call glamorous? Everyone wants to help others.
2) Getting partners in engineering: move faster, badge value (certification) and owning the process themselves, not gatekeepers.
3) Choosing the right people. The first few helpers were really thoughtful, well-respected engineers in the organization. Other people looked to them as leaders.
And so we announced a process. We introduced the experiments-help@ email alias and just asked people to come to us for help if they wanted to learn from their experiments.
ALL THE PIECES. TRAIN FIRST SET
Now we had all the pieces in place: checklists for experiments, a way for people to ask for help in code review and for a certified helper to sign off, a mailing list for questions outside code review, and a standard experiment document template. Now we just had to train our first set of experiment helpers.
LEARN BY DOING -> APPRENTICESHIP
We are strong believers in learning by doing.
So we set up the experiment helper program as an apprenticeship.
QUIZ + ON THE HOOK
Sure, people could read the documentation. But it’s not until they were put on the spot that they’d really begin to develop a sense of what to do.
We created a quiz for prospective experiment helpers to test their understanding and ability to detect common problems. Nothing fancy – ours was just a Google doc with an answer key at the end.
And then when your week of the rotation came along, you were on the hook for every question that came into experiments-help and every code review. When you were exposed to the variety of experiments people ran and had to be the person who kept them going in the right direction, you learned quickly.
And so we expanded from just me, to me and Dan and John.
And from Dan and John to a small set of respected engineers on a variety of teams, who started to build up the culture of experiments within their own smaller organizations.
50 PEOPLE + SELF-PROPELLING: QUEUE, COMMUNITY, TEAMS
And now we have 50 trained experiment helpers distributed across all the product engineering teams at the company.
It’s become self-propelling: we have queues of folks waiting to train as helpers, folks jumping in to answer each other’s questions, and individual engineering teams honing their own team’s experiment processes.
We add questions to the quiz as new problems arise, and we now have a small army of folks equipped with experimental understanding who can explain new changes to their teams and help our process continue to grow.
REMOVE YOURSELF FROM THE LOOP BY TRAINING OTHERS. GROWING VOLUME OF EXPERIMENTS.
So that’s stage four. Remove yourself from the loop by training others to take over your role. Get them to ask the hard questions, to help experiment owners avoid pitfalls and follow best practices.
At this point, you’ve built a well-oiled, self-sustaining machine.
The volume of experiments grows and grows. Problems that were rare when you started now crop up often enough that they’re really starting to get irritating, and so you start to think about what else you could invest in to simplify experiments and increase their likelihood of success.
MANY ERRORS ARE HUMAN, BUT SIMPLE ONES (FLIP) ARE PREVENTABLE. LETS HUMANS FOCUS ON THINKING.
A lot of the things that can go wrong with an experiment are human: you can’t automate them away. Is it worth running an experiment in the first place? Does your hypothesis make sense, given the feature you’re building? Have you thought about what will happen to users if you remove the treatment? How will we decide whether the experiment is a success?
But as you step back from individually reviewing everyone’s experiments, you may start to notice patterns in where simple things go wrong, and you have the opportunity to eliminate the problems that can be solved by better tools and automation.
By solving this set of problems, you allow humans to focus on the hard stuff: the thinking.
Some of the simple mistakes happen at launch. If you can remove all the implementation details, you allow the experiment helper to focus on the important questions of what the experiment is trying to measure.
For us, that meant simplifying the experiment API, removing untriggered experiments, and creating helper functions for common user populations, like only experimenting on the latest app version.
Other mistakes happen in-flight. By building tools to take care of those details, we allowed the experiment helper to pay attention instead to why the experiment was changing and how it would be measured.
WRONG DECISION -> HURTS USERS, WRONG DIRECTION
Perhaps the most worrisome set of mistakes happens when someone decides to land an experiment. If they make the wrong decision here, it could result not only in shipping a product that hurts users, but in shaping future product decisions based on erroneous learnings! So we invested especially heavily in helping people avoid mistakes in interpreting their experiment results.
First off, an experiment will be invalid if the randomization produced groups that aren’t actually the same, so we built a number of tools to detect errors.
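The canonical version of that check is the sample ratio mismatch test: if the observed group sizes differ from the configured split by more than chance allows, the assignment itself is broken and no downstream metric can be trusted. A stdlib-only sketch (my illustration, not the actual tooling):

```python
from math import erfc, sqrt

def srm_check(n_control: int, n_treatment: int,
              expected_ratio: float = 0.5, threshold: float = 1e-3):
    """Chi-square goodness-of-fit test (1 degree of freedom) for
    sample ratio mismatch. Returns (p_value, is_mismatch)."""
    total = n_control + n_treatment
    exp_c = total * expected_ratio
    exp_t = total * (1 - expected_ratio)
    stat = ((n_control - exp_c) ** 2 / exp_c
            + (n_treatment - exp_t) ** 2 / exp_t)
    p_value = erfc(sqrt(stat / 2))  # survival function of chi2 with 1 df
    # A very small p-value means the observed split is inconsistent
    # with the configured ratio: investigate before reading any metrics.
    return p_value, p_value < threshold
```

The strict threshold is deliberate: a mismatch should halt analysis, so you want it to fire only when the imbalance is essentially impossible by chance.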
Other mistakes resulted from people trying to do their own analysis on metrics: querying the data incorrectly, making comparisons that didn’t make sense, or just not thinking about statistical significance.
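For the significance piece, the comparison most experiments need for a rate metric is a two-proportion z-test. A minimal sketch of the kind of check the tools can run automatically (again an illustration, not the production code):

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z(successes_c: int, n_c: int,
                     successes_t: int, n_t: int):
    """Pooled two-proportion z-test. Returns (z, two-sided p-value)."""
    p_c = successes_c / n_c
    p_t = successes_t / n_t
    p_pool = (successes_c + successes_t) / (n_c + n_t)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
    z = (p_t - p_c) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided
    return z, p_value
```

Baking this into the UI means nobody has to hand-query the data and eyeball whether a 0.3% lift “looks real,” which is exactly the class of mistake we were seeing.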
So that’s stage five, where Pinterest currently finds itself. After stepping back from the day-to-day review of experiments, we built tools so that the experiment helpers can focus on the important part: deciding what to build and understanding how it affects our users.
SUMMARIZE. NOT JUST ENGINEERING: PEOPLE. BUY-IN, TEACHING, HARDER TO SHOOT FOOT.
We’ve built an experiment framework that allows us to track changes on all parts of our service, gotten it widespread adoption, built up a core set of 50 engineers who lead their teams in running experiments, and automated tools to make all of the aspects of the experiment lifecycle harder to screw up.
At each stage, while engineering and statistical know-how were part of the equation, the real solution lay in building a culture of experimentation: getting the humans who make up the organization to buy into experiments, teaching them to help each other make decisions, and building tools that make it harder to shoot yourself in the foot.
NEXT?? LESSONS BEYOND EXPERIMENTATION.
I don’t know yet what the next stage will look like. If you do, I’d love to find out.
But I think the lessons extend beyond just experimentation.
NOT JUST ENGINEERING AND STATS. CONVINCE PEOPLE.
Data science is not just engineering and statistics: your recommender system will gather dust unless you convince someone it’s useful, and your analysis will not change product strategy until it’s changed people’s minds. Spending time actively investing in building a data-driven culture will pay off handsomely in the long run.