The document discusses building a culture of experimentation at Pinterest and outlines a maturity model for experimentation. It describes 5 stages for experimentation maturity: get started, get big, get better, get out, and get tools. For each stage, it identifies common problems, such as underutilization or needing guidance, and provides recommendations for how to address them, like evangelizing experiments or implementing processes to help scale experimentation. The overall aim is to establish a systematic approach to experimentation that helps transition an organization from initial experimentation to widespread experimentation supported by automation and analysis tools.
28. @arburbank
Write down the process
What mistakes do you see in experiments?
What questions do you answer repeatedly?
How will learning this help others?
47. @arburbank
Implement a process
1. Checklists for experiments
2. @experiments-help mention in code review
3. e+ as part of code review
4. Mailing list: experiments-help@
5. Experiment document template
6. Rotation of experiment helpers
49. @arburbank
Train your successors
So you want to be an experiment helper?
• Step 1: read the documentation
• Step 2: take the experiment quiz
• Step 3: review all experiments for a week
Hi there! I’m Andrea Burbank, and I’m a data scientist at Pinterest.
WHAT MOVES THE NEEDLE
When I was asked to speak at a conference for data scientists, I thought for a while about what that means, and which aspects of my experience would be most interesting to folks who are working on the same sorts of problems that I tackle every day. What I ultimately decided was that the work that moves the needle at Pinterest wasn’t just the analysis we do to understand our ecosystem or to predict user engagement, but the culture of experimentation we’ve built up across the entire company. It wasn’t something that happened overnight, and I hope that by sharing our experience I can help you scale data science at your companies as well.
As data scientists, we often think of ourselves as a hybrid between a software engineer and a statistician, blending the best of both to build a talented data machine. When we hit on a problem like AB testing, we tend to approach it from that perspective: what tools and frameworks should I engineer and what statistical comparisons are most relevant in order to build a successful AB testing program?
Those are both tremendously important tools, obviously. But what will end up making or breaking your experimentation program is neither of those: it’s the people. It’s building up a culture of AB testing, one person at a time.
Perhaps you’ve heard of a notion of an organizational maturity model. In software, there are basic steps you follow to improve your software engineering quality:
Use source control, write unit tests, and so on.
MODEL + PEOPLE + ANTICIPATE
Along those lines, I’d like to propose a model for the cultural maturity of experimentation. Every time you solve the big problem facing you at the moment, you move on to the next stage of experimentation, and you create a new problem.
And fundamentally, each of the problems you face is about the people and the culture, and the solutions you form are only as successful as the culture you foster to nourish them. For us, we didn’t recognize this pattern until we’d already stumbled partway through this evolution, and even when we recognized that the solution was in the culture and the people, it took us a while to embrace that approach.
My hope is that by talking about these stages I can help you, unlike us, to recognize the stage of the maturity model you’re currently in, to frame and solve it as a human problem, and then to start to anticipate the next phase before it becomes absolutely necessary.
So what are those stages? I’d say they look like this.
Get STARTED. Get BIG. Get BETTER. Get OUT. Get TOOLS.
Let’s dive in.
Stage 1: get started. This is where you actually build the experiment framework.
The problem: people are making bad decisions. Maybe they’re shipping things willy-nilly without measuring them at all, or maybe they’re watching trends over time and attributing change to newly released products when in fact the change might be completely unrelated. So you decide to build an AB testing framework.
FRAMEWORK + PIPELINE + UI -> WE MADE IT!
In my first couple months at Pinterest, I built up our experiment framework to have all the capabilities I thought were important:
- triggering users at the moment the experiment actually affected their experience,
- keeping track of novelty effects, and
- functioning correctly for offline experiments.
I built a data pipeline to capture all the most important metrics automatically and a UI to surface those metrics. I ran a few experiments myself, validated the findings with A/A tests and on real experiments, and figured AHA! We’d made it. Now we could run experiments and actually understand the effects of the feature changes we made.
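The triggering idea — counting a user only at the moment the experiment actually affects their experience — can be sketched roughly like this. This is a hypothetical illustration, not Pinterest’s actual framework; every name here is made up:

```python
import hashlib

def assign_group(user_id: str, experiment: str,
                 groups: tuple = ("control", "treatment")) -> str:
    """Deterministically bucket a user with a salted hash, so the same
    user always lands in the same group for a given experiment."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return groups[int(digest, 16) % len(groups)]

def get_group_and_trigger(user_id: str, experiment: str,
                          exposure_log: list) -> str:
    """Log exposure only when this is called, i.e. at the moment the user
    actually hits the tested surface. Users who never reach the feature
    never enter the analysis, so they don't dilute the measured effect."""
    group = assign_group(user_id, experiment)
    exposure_log.append((user_id, experiment, group))
    return group
```

The key property is determinism: a user can be re-bucketed on every request and always see the same variant, without any assignment storage.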
When you’ve toiled and coded and tested and built, you may think you’re done. After all, you now have a working framework. But in fact, you only just got started.
ACTUALLY USE IT
The next stage is to get BIG. What I mean by that is to get people to actually use the framework you’ve built.
DRIVE ADOPTION
The problem you’re facing now is that your framework on its own is useless; you need to drive adoption. I think it’s easy … to underestimate how important this phase is.
A GREAT PRODUCT SPEAKS FOR ITSELF
I think it’s easy to underestimate how important this phase is. Again, we are engineers. There’s a part of us that really wants to believe that a great product speaks for itself. It’s so tremendously useful! You’ve anticipated all the use cases and made it easy to actually understand the effect your feature is having on users! How on earth is this not the holy grail??
Unfortunately, this is almost never actually true. Even high-quality tools don’t magically attract users. So stage 2 is about getting people to actually adopt your new framework, to buy into the idea of running experiments.
Once you have your experiment framework in place, your #1 priority is to get people to use it. That means you do marketing.
That means you do evangelism.
TECH TALKS + DEMO + PM + BENEFITS + STRATEGIC PROJECT
That means that you have to be a salesman (or woman).
Give tech talks. Do a demo. Give impassioned speeches to anyone who will listen.
Whenever you hear about a feature going out, go find the PM, chat with the engineer, try to convince them to run an experiment.
Tell them what they will learn, how they will benefit, how easy it will be.
Find a strategically important project, suggest running it as an experiment, and don’t take no for an answer.
SHOW VALUE -> AGAIN AND AGAIN
If you demonstrate the metrics impact of a strategic initiative, or you earn call-outs at the company all-hands for lifting a metric by 5%, or you help the company avoid a huge mistake, people like it. They want you to do it again. And again.
And now, you’ve done it: you got big.
LOTS OF PEOPLE -> NUDGES
In stage 3, your experiment framework is big and things are going swimmingly. People start running lots of experiments, and they firmly believe that running an AB test is the best way to understand the performance of their feature.
But now that you’re not the person running all the experiments, you find that they need some nudges here and there to make their experiments run correctly. Instead of evangelizing, you spend your time helping people run experiments: come up with a hypothesis, determine how they’ll detect failure, consider how changes might affect individual users’ experience.
DECIDE ON OWN -> NEED YOUR GUIDANCE
In stage 2, no one was trying to run experiments unless you cajoled them into it, so you were always right there to help with implementation. Now that folks have bought in and are doing it on their own, guidance is needed, and you become the human to provide that guidance.
FUN!
And depending on your personality, your patience, and how quickly your company is growing, this stage might last a while. If you’re in this stage now, you might think it’s pretty great. Your framework is getting used, people are making good decisions, and you have the added perk that you get to be connected to feature development across the whole company, so you always know what’s going on.
And honestly, that’s a lot of fun.
SPOF! NAPA
But after a while, you realize that you are a single point of failure. When you’re not there, people ship experiments when there aren’t enough users. They add new variants without thinking about how to measure them. They start experiments that accidentally trigger for everyone instead of only the affected users.
For me, stage 3 became suboptimal pretty abruptly when I found myself trying to do code reviews on my iPhone while on my anniversary bike trip in Napa.
GO INSANE OR STOP LEARNING
Now, I hit this problem fairly quickly because Pinterest was growing at a breakneck pace. You might think you can last in stage 3 for a while, or even indefinitely. But as someone who enjoyed that stage tremendously, I’d advise against it. If your organization grows and you don’t scale, the culture will spin apart and you’ll go insane. If it doesn’t grow, you’ll keep needing to play the same role of experiments diva, and you won’t get a chance to learn what else you can contribute.
This is important. It’s not your career goal to be the experiments person. (You should have higher ambitions.)
Making experiments run is important. It’s interesting. But it’s not what you should be doing with the rest of your life.
HELP PEOPLE SUCCEED.
So that’s stage three. Once you have momentum behind your experiment framework, help people succeed with it.
Help them think through their setup and their data. Help them figure out whether they have enough people, or it’s not working for a subset of the population. Help review their code, check their triggering, and figure out how to relaunch when things go wrong. Having you in the loop will make your company’s experiments successful.
But also: start thinking about how you can move on to stage 4.
TEACH OTHERS
In stage 4, you start thinking about how you can teach others to fulfill the role you’ve been taking on in helping people run successful experiments, and how you can get out. In stage 4, you start to scale yourself.
Scale yourself. Figure out what you do and write it down. Develop repeatable processes, guidelines, checklists.
LIST ERRORS
At this point, you’ve been helping people with experiments for a while. What mistakes do you see happening? What questions do you answer repeatedly? And how can you get others to want to understand experiments better?
The first thing I did was try to write down every problem I’d seen in an experiment.
I dug up that list when I was writing this talk.
It was three pages long in small font.
When I shared this list with a coworker, he said:
TRAIN ENGINEERS -> PEOPLE PROBLEM + PEOPLE SOLUTION
I rephrased his reaction as a question: how do you train engineers to run experiments accurately? Again, this is a people problem, and the solution again comes from people.
The answer: make your process clear and easily repeatable. If you haven’t read Atul Gawande’s The Checklist Manifesto, you should. Even the most complex human processes (performing surgery, flying a plane, building a skyscraper) can be improved by simple checklists. There are so many pieces to keep track of that a simple list can help you get the important things right.
To make a checklist: for every important mistake, explain why it’s wrong and how to avoid it.
LIFECYCLE
We also thought about the experiment lifecycle. In the end, there are three major phases of every experiment. First, the experiment has to launch. Before it actually takes off with users on board, you want to make sure that it’s configured correctly so you can learn what you want.
Once an experiment is in flight, we may need to make adjustments. Perhaps we had an idea for a new take on the feature we’re testing, or we just want to increase our experimental power to measure the effect on a larger population.
And finally, when we’re ready to land the experiment, we need to make sure that we’ve learned what we want to learn and that we’re making the right decision from the data.
So we built checklists: what should you watch for in each of these phases?
THINKING
Launch is the most important thing. If the experiment is trying to measure the wrong thing or is set up incorrectly, you won’t learn anything from it. Before an experiment begins, most of the work is in the thinking. What are you trying to do? Why? Can you measure what you want to change?
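One concrete piece of that pre-launch thinking is whether the experiment can detect the change you care about at all. A standard sample-size calculation for a rate metric, using the two-proportion normal approximation (a sketch I’m adding for illustration; it isn’t from the talk), looks like:

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(p_base: float, mde: float,
                          alpha: float = 0.05, power: float = 0.80) -> int:
    """Users needed per group to detect an absolute lift of `mde`
    on a baseline rate `p_base`, with a two-sided test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value, two-sided
    z_beta = NormalDist().inv_cdf(power)           # power requirement
    p_avg = p_base + mde / 2                       # average rate across groups
    variance = 2 * p_avg * (1 - p_avg)             # sum of both groups' variances
    return ceil((z_alpha + z_beta) ** 2 * variance / mde ** 2)
```

Detecting a one-point lift on a 10% baseline needs roughly 15,000 users per group, and halving the detectable effect roughly quadruples the requirement — which is why “do we have enough users?” belongs on the launch checklist.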
WHY CHANGE?
Sometimes an experiment owner will want to make changes to an experiment after launch. Usually they want to increase group sizes to get more statistical power, but sometimes they want to change the population they’re measuring or add new types of treatment. Sometimes they want to change an experiment but they haven’t actually checked to see whether it’s working as expected.
All of these things then turned into the in-flight checklist.
READY? RIGHT DECISION?
Lastly, at some point every experiment should be shut down. Sometimes people try to shut it down too early, before they have enough data or before we can understand the long-term effects. Or they’re right that it should be shut down now, but they turn it off when it should actually be shipped because metrics are up, or they ship it without acknowledging that metrics are down. The landing checklist tries to anticipate these issues and make sure we’re avoiding them.
IMPLEMENT: NO FRICTION + GET OTHERS ON BOARD
But all the checklists in the world are meaningless if nobody implements them. So we spent a while thinking about two things: how we could try to improve the quality of experiments being run without introducing too much friction, and how we could get others on board to help monitor experiments’ quality.
PIGGYBACK ON R+ = OK. OPTIONAL IN THEORY. INTERNS. E+. YOUR CULTURE.
To improve the quality of experiments without introducing friction, we piggybacked on the concept we already have of getting an r+.
If you’re not familiar with an r+, it’s a naming convention that we adopted from Mozilla, but it’s just a way of signing off on a code review. When a code reviewer signs off with r+ on a review, it means that they think the new code improves the codebase.
We had a culture of r+ that we stole for e+. We never said it was mandatory, just that it was recommended, but practically it was mandatory. No one ships code without an r+ except for new people and interns.
For e+, we took that and said, hey, just as with code review, making sure that an experiment goes out correctly is critical. When you set up or change an experiment in a code review, ask someone who knows about experiments to take a look at it and provide feedback on your experiment setup.
You need to find something that works within your culture. We could leverage this part of our existing engineering culture to create improved experiments. What is that lever at your company?
The other key to our success was getting others on board to be the experiment reviewers. I think there were a couple pieces that were important here:
1) Calling it experiments-help. We considered experiments on-call but who finds on-call glamorous? Everyone wants to help others.
2) Getting partners in engineering: move faster, badge value (certification) and owning the process themselves, not gatekeepers.
3) Choosing the right people. The first few helpers were really thoughtful, well-respected engineers in the organization. Other people looked to them as leaders.
And so we announced a process. We introduced the experiments-help@ email alias and just asked people to come to us for help if they wanted to learn from their experiments.
ALL THE PIECES. TRAIN FIRST SET
Now we had all the pieces in place: checklists for experiments, a way for people to ask for help in code review and for a certified helper to sign off, a mailing list for questions outside code review, and a standard experiment document template. Now we just had to train our first set of experiment helpers.
LEARN BY DOING -> APPRENTICESHIP
We are strong believers in learning by doing.
So we set up the experiment helper program as an apprenticeship.
QUIZ + ON THE HOOK
Sure, people could read the documentation. But it’s not until they were put on the spot that they’d really begin to develop a sense of what to do.
We created a quiz for prospective experiment helpers to test their understanding and ability to detect common problems. Nothing fancy – ours was just a Google doc with an answer key at the end.
And then when your week of the rotation came along, you were on the hook for every question that came into experiments-help and every code review. When you were exposed to the variety of experiments people ran and had to be the person who kept them going in the right direction, you learned quickly.
And so we expanded from just me, to me and Dan and John.
And from Dan and John to a small set of respected engineers on a variety of teams, who started to build up the culture of experiments within their own smaller organizations.
50 PEOPLE + SELF-PROPELLING: QUEUE, COMMUNITY, TEAMS
And now we have 50 trained experiment helpers distributed across all the product engineering teams at the company.
It’s become self-propelling: we have queues of folks waiting to train as helpers, folks jumping in to answer each other’s questions, and individual engineering teams honing their own team’s experiment processes.
We add questions to the quiz as new problems arise, and we now have a small army of folks equipped with experimental understanding who can explain new changes to their teams and help our process continue to grow.
REMOVE YOURSELF FROM THE LOOP BY TRAINING OTHERS. GROWING VOLUME OF EXPERIMENTS.
So that’s stage four. Remove yourself from the loop by training others to take over your role. Get them to ask the hard questions, to help experiment owners avoid pitfalls and follow best practices.
At this point, you’ve built a well-oiled, self-sustaining machine.
The volume of experiments grows and grows. Problems that were rare when you started now crop up often enough that they’re really starting to get irritating, and so you start to think about what else you could invest in to simplify experiments and increase their likelihood of success.
MANY ERRORS ARE HUMAN, BUT SIMPLE ONES (FLIP) ARE PREVENTABLE. LETS HUMANS FOCUS ON THINKING.
A lot of the things that can go wrong with an experiment are human: you can’t automate them away. Is it worth running an experiment in the first place? Does your hypothesis make sense, given the feature you’re building? Have you thought about what will happen to users if you remove the treatment? How will we decide whether the experiment is a success?
But as you step back from individually reviewing everyone’s experiments, you may start to notice patterns in where simple things go wrong, and you have the opportunity to eliminate the problems that can be solved by better tools and automation.
By solving this set of problems, you allow humans to focus on the hard stuff: the thinking.
Some of the simple mistakes happen at launch. If you can remove all the implementation details, you allow the experiment helper to focus on the important questions of what the experiment is trying to measure.
For us, that meant simplifying the experiment API, removing untriggered experiments, and creating helper functions for common user populations, like only experimenting on the latest app version.
Other mistakes happen in-flight. By building tools to take care of those details, we allowed the experiment helper to pay attention instead to why the experiment was changing and how it would be measured.
WRONG DECISION -> HURTS USERS, WRONG DIRECTION
Perhaps the most worrisome set of mistakes happens when someone decides to land an experiment. If they make the wrong decision here, it could result not only in shipping a product that hurts users, but in shaping future product decisions based on erroneous learnings! So we invested especially heavily in helping people avoid mistakes in interpreting their experiment results.
First off, an experiment will be invalid if the randomization produced groups that aren’t actually the same, so we built a number of tools to detect errors.
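The canonical version of that check is the sample ratio mismatch test: if the observed group sizes differ from the configured split by more than chance allows, the assignment itself is broken and no downstream metric can be trusted. A stdlib-only sketch (my illustration, not the actual tooling):

```python
from math import erfc, sqrt

def srm_check(n_control: int, n_treatment: int,
              expected_ratio: float = 0.5, threshold: float = 1e-3):
    """Chi-square goodness-of-fit test (1 degree of freedom) for
    sample ratio mismatch. Returns (p_value, is_mismatch)."""
    total = n_control + n_treatment
    exp_c = total * expected_ratio
    exp_t = total * (1 - expected_ratio)
    stat = ((n_control - exp_c) ** 2 / exp_c
            + (n_treatment - exp_t) ** 2 / exp_t)
    p_value = erfc(sqrt(stat / 2))  # survival function of chi2 with 1 df
    # A very small p-value means the observed split is inconsistent
    # with the configured ratio: investigate before reading any metrics.
    return p_value, p_value < threshold
```

The strict threshold is deliberate: a mismatch should halt analysis, so you want it to fire only when the imbalance is essentially impossible by chance.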
Other mistakes resulted from people trying to do their own analysis on metrics: querying the data incorrectly, making comparisons that didn’t make sense, or just not thinking about statistical significance.
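For the significance piece, the comparison most experiments need for a rate metric is a two-proportion z-test. A minimal sketch of the kind of check the tools can run automatically (again an illustration, not the production code):

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z(successes_c: int, n_c: int,
                     successes_t: int, n_t: int):
    """Pooled two-proportion z-test. Returns (z, two-sided p-value)."""
    p_c = successes_c / n_c
    p_t = successes_t / n_t
    p_pool = (successes_c + successes_t) / (n_c + n_t)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
    z = (p_t - p_c) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided
    return z, p_value
```

Baking this into the UI means nobody has to hand-query the data and eyeball whether a 0.3% lift “looks real,” which is exactly the class of mistake we were seeing.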
So that’s stage five, where Pinterest currently finds itself. After stepping back from the day-to-day review of experiments, we built tools so that the experiment helpers can focus on the important part: deciding what to build and understanding how it affects our users.
SUMMARIZE. NOT JUST ENGINEERING: PEOPLE. BUY-IN, TEACHING, HARDER TO SHOOT FOOT.
We’ve built an experiment framework that allows us to track changes on all parts of our service, gotten it widespread adoption, built up a core set of 50 engineers who lead their teams in running experiments, and automated tools to make all of the aspects of the experiment lifecycle harder to screw up.
At each stage, while engineering and statistical know-how were part of the equation, the real solution lay in building a culture of experimentation: getting the humans who make up the organization to buy into experiments, teaching them to help each other make decisions, and building tools that make it harder to shoot yourself in the foot.
NEXT?? LESSONS BEYOND EXPERIMENTATION.
I don’t know yet what the next stage will look like. If you do, I’d love to find out.
But I think the lessons extend beyond just experimentation.
NOT JUST ENGINEERING AND STATS. CONVINCE PEOPLE.
Data science is not just engineering and statistics: your recommender system will gather dust unless you convince someone it’s useful, and your analysis will not change product strategy until it’s changed people’s minds. Spending time actively investing in building a data-driven culture will pay off handsomely in the long run.