by Asa Schachar
Ship Confidently with
Progressive Delivery
and Experimentation
A book of best practices to enable your engineering team to adopt feature flags,
phased rollouts, A/B testing, and other proven techniques to deliver the right
features faster and with confidence.
Intro
Today, we’re living through the third major change in the way companies
deliver better software products. First came the Agile Manifesto,
encouraging teams to iterate quickly based on customer feedback,
resulting in teams building and releasing features in small pieces. Secondly,
as software development moved to the cloud, many teams also adopted
DevOps practices like continuous integration and continuous delivery
to push code to production more frequently, further reducing risk and
increasing speed to market for new features.
However, today’s most successful software companies like Google,
Facebook, Amazon, Netflix, Lyft, Uber, and Airbnb have gone one step
further, ushering in a third major change. They move fast by releasing code
to small portions of their traffic and improve confidence in product decisions
by testing their hypotheses with real customers. Instead of guessing at
the best user experience and deploying it to everyone, these companies
progressively release new features to live traffic in the form of gradual
rollouts, targeted feature flags, and A/B tests. This process helps product
and engineering teams reduce uncertainty, make data-driven decisions,
and deliver the right experience to end users faster. When customers are
more engaged with features and products, it ultimately drives retention and
increased revenue for these businesses.
The name for this third major shift is progressive delivery and
experimentation. Progressive delivery with feature flags gives teams new
ways to test in production and validate changes before releasing them to
everyone, instead of scrambling to roll back changes or deploy hotfixes.
With experimentation, development teams gain the confidence of knowing
they’re building the most impactful products and features because they
can validate product decisions well before expensive investments are lost.
At the core of this practice is a platform that gives teams the ability to control
the release of new features to production, decouple deployment from
feature enablement, and measure the impact of those changes with real
users in production.
As with any new process or platform, incorporating progressive delivery and experimentation into your software development and delivery process can bring up many questions for engineering teams:

How can we get started with feature flagging and A/B testing without creating technical debt down the line?

How can we scale to thousands of flags and still retain good governance and QA processes?

How can we adopt these new practices organization-wide without slowing down development and delivery?

It's not enough to continuously integrate and continuously deliver new code. Progressive delivery and experimentation enable you to test and learn so you can move quickly with the confidence that you're delivering the right features.
Contents

01 Get started with progressive delivery and experimentation
02 Scale from tens to hundreds of feature flags and experiments
03 Enable company-wide experimentation and safe feature delivery
04 Use progressive delivery and experimentation to innovate faster
In today’s fast-moving world, it’s no longer enough to just ship small
changes quickly. Teams must master these new best practices to test and
learn in order to deliver better software, products, and growth faster. By
going from a foundational understanding to learning how to avoid pitfalls
in process, technology, and strategy, this book will enable any engineering
team to successfully incorporate feature flags, phased rollouts, and data-
driven A/B tests as core development practices. By following this journey,
you and your team will unlock the power of progressive delivery and
experimentation like the software giants already have.
This book is intended for software engineers and software engineering leaders, but adjacent disciplines will also find it useful. If your job involves software engineering, product, or quality assurance, then you're in the right place.
01

Get started with progressive delivery and experimentation

In building for experimentation, you'll first want to understand the different ways that software products can enable progressive delivery and experimentation. In this first chapter, we'll cover the basics of feature flags, phased rollouts, and A/B tests, and best practices for implementing them. You'll discover how these techniques fit together, and how an effective feature flag process can evolve into product experimentation and statistically rigorous A/B testing. Once you've implemented these techniques, you'll be able to ship the right features safely.

Feature flags: Enable feature releases without deploying code

A feature flag (aka feature toggle), in its most basic form, is a switch that allows product teams to enable or disable functionality of their product without deploying new code changes.

For example, let's say you're building a new front-end dashboard. You could wait until the entire dashboard code is complete before merging and releasing. Alternatively, you could put unfinished dashboard code behind a feature flag that is currently disabled and only enable the flag for your users once the dashboard works as expected.

In code, a feature flag might look like the following:

if (isFeatureEnabled('new_dashboard')) { // true or false
  showNewDashboard();
} else {
  showExistingDashboard();
}
Feature flags toggle features on and off, giving you another layer of control over what your users experience.
[Figure: New feature → feature flag or toggle → consumers]

When the function isFeatureEnabled is connected to a database or remote connection that controls whether it returns true or false, we can dynamically control the visibility of the new dashboard without deploying new code. Even with simple feature flags like this one, you get the benefits of:

Seamless releases
Instead of worrying about the feature code merging and releasing at the right time, you can use a feature flag for more control over when features are released. Use feature flags for a fully controllable product launch, allowing you to decide whether the code is ready to show your users.

Feature kill switches
If a release goes wrong, a feature flag can act as a kill switch that allows you to quickly turn off the feature and mitigate the impact of a bad release.

Trunk-based development
Instead of having engineers work in long-lived, hard-to-merge, conflict-ridden feature branches, the team can merge code faster and more frequently in a trunk-based development workflow.[1]

Platform for experimentation
A simple feature flag is the start of a platform that enables experimentation, as we will see later in this chapter.
Feature rollouts: Enable safe, feedback-driven releases

Basic feature flags are either enabled or disabled for everyone, but feature flags become more powerful when you can control whether a feature flag is exposed to a portion of your traffic. A feature rollout is the idea that you enable a feature for only a subset of your users at a time rather than for everyone at once.

In our dashboard example, let's say you have many different users for your application. By providing a user identifier to isFeatureEnabled, the method will have just enough information to return different values for different users.

// true or false depending on the user
if (isFeatureEnabled('new_dashboard', 'user123')) {
  showNewDashboard();
} else {
  showExistingDashboard();
}

There are two general ways to perform a feature rollout, targeted and random, and each suits different use cases.

01 Targeted rollout: For specific users first

A targeted rollout enables features for specific users at a time, allowing for different types of experimentation.

Experiment with beta software
If you have power users or early adopters who are excited to use your features as soon as they are developed, then you can release your features directly to these users first, getting valuable feedback early on while the feature is still subject to change.

Experiment across regions
If your users behave differently based on their attributes, like their country of origin, you can use targeted rollouts to expose features to specific subsets of users or configure the features differently for each group.

Experiment with prototype designs
Similarly, for new designs that dramatically change the experience for users, it's useful to target specific test users to see how these changes will realistically be used when put in the hands of real users.
Rollouts allow you to
control which subset
of users can see your
flagged feature.
02 Random rollout: For small samples of users first
Another way of performing a feature rollout is by random percentage.
Perhaps at first you show the new feature to only 10% of your users; then a
week later, you enable the new feature for 50% of your users; and finally, you
enable the feature for all 100% of your users in the third week. With a phased
rollout that introduces a feature to parts of your traffic, you unlock several
different types of experimentation.
Experiment with gradual launches: Instead of focusing on a big launch day
for your feature that has the risk of going horribly wrong for all your users,
you can slowly roll out your new feature to fractions of your traffic as an
experiment to make sure you catch bugs early and mitigate the risk of losing
user trust. Gradually rolling out your features limits your blast radius to only
the affected customers versus your entire user base. This process limits
negative user sentiment and potential revenue loss.
Experiment with risky changes: A random rollout can give you confidence
that particularly risky changes, like data migrations, large refactors, or
infrastructure changes, are not going to negatively impact your product or
business.
Experiment with scale: Performance and scale issues are challenging.
Unlike other software problems, they are hard to predict and hard to
simulate. Often, you only find out that your feature has a scale problem when
it’s too late. Instead of waiting for your performance dashboards to show
ugly spikes in response times or error rates, using a phased rollout can help
you foresee any real-world scale problems beforehand by understanding the
scaling characteristics of parts of your traffic first.
Experiment with painted-door tests
If you’re not sure whether you should build a feature, you may consider
running a painted-door test[2]
where you build only the suggestion of
the feature and analyze how different random users are affected by the
appearance of the new feature. For instance, adding a new button or
navigation element in your UI for a feature you’re considering can show you
how many people interact with it.
[Figure: New feature → feature rollout → some users get the feature]
Best practice: Balancing low-latency decisions with dynamism

Relying on an external database to store whether your feature is enabled can increase latency, while hard coding the variable diminishes your ability to change it dynamically. Find an architecture that strikes a balance between the two.

When the logic of your codebase depends on the return value of a function like isFeatureEnabled, you'll want to make sure that isFeatureEnabled returns its decision as fast as possible.

In the worst case, if you rely on an external database to store whether the feature is enabled, you risk increasing the latency of your application by requiring a roundtrip network request across the internet, even when the new feature is not enabled.

In the best case for performance, the isFeatureEnabled function is hard coded to true or false, either as a variable or an environment variable, but then you lose the ability to change the value of the feature flag dynamically without code deploys or reconfiguring your application.

So, one of the first challenges of feature flags is striking this balance between a low-latency decision and the ability to change that decision dynamically and quickly.

There are multiple methods for achieving this balance to suit the needs and capabilities of different applications. An architecture that strikes this balance well and is suitable on many platforms will:

1. Fetch the feature flag configuration when the application starts up
2. Cache the feature flag configuration in-memory so that decisions can be made with low latency
3. Listen for updates to the feature flag configuration so that updates are pushed to the application in as close to real time as possible
4. Poll for updates to the feature flag configuration at regular intervals, so that if a push fails, the application is still guaranteed to have the latest feature configuration within some well-defined interval

As an example, a mobile application may initially fetch the feature flag configuration when the app is started, then cache the configuration in-memory on the phone. The mobile app can use push notifications to listen for updates to the feature flag configuration, as well as poll on regular, 10-minute intervals to ensure the feature flags stay up to date in case a push notification fails.
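As a minimal sketch of this architecture, a client might look like the following. The helper names fetchFlagConfig and subscribeToFlagUpdates are hypothetical, standing in for whatever transport your platform provides:

// Sketch: cache flag config in-memory, listen for pushes, and poll as a fallback.
// fetchFlagConfig() and subscribeToFlagUpdates() are hypothetical helpers.
let flagConfig = {};

async function initFeatureFlags() {
  flagConfig = await fetchFlagConfig();      // 1. fetch on startup
  subscribeToFlagUpdates((newConfig) => {    // 3. push-based updates
    flagConfig = newConfig;
  });
  setInterval(async () => {                  // 4. poll in case a push fails
    flagConfig = await fetchFlagConfig();
  }, 10 * 60 * 1000);                        // e.g., every 10 minutes
}

function isFeatureEnabled(featureKey) {
  return flagConfig[featureKey] === true;    // 2. low-latency in-memory decision
}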
[Figure: An admin panel pushes feature flag configuration to client apps and app servers]
A/B tests: Make data-driven product decisions

With phased rollouts, your application has the ability to simultaneously deliver two different experiences: one with the feature on and another with the feature off. But how do you know which one is better? And how can you use data to determine which is best for your users and the metrics you care about? An A/B test can point you in the right direction. By shipping different versions of your product simultaneously to different portions of your traffic, you can use the usage data to determine which version is better. If you're resource constrained, you can simply test the presence or absence of a new feature to validate whether it has a positive, negative, or neutral impact on application performance and user metrics.

By being precise about how, when, and who is exposed to these different feature configurations, you can run a controlled product experiment, get statistically significant data, and be scientific about developing the features that are right for your users, rather than relying on educated guesses. If you want objective data to resolve differing opinions within your organization, then running an A/B test is right for you.

[Figure: New feature → A/B test → some users get variation A, others variation B]
Simply testing the
presence or absence of
a new feature can help
you validate whether
it has a positive,
negative, or neutral
impact on application
performance and user
metrics.
When your A/B test
is precise with the
how, when, and who is
exposed to different
feature configurations,
you can make data-
driven decisions as
opposed to educated
guesses.
01 Best practice: Deterministic experiment bucketing—hashing over Math.random()

If you're building a progressive delivery and experimentation platform, you may be tempted to rely on a built-in function like Math.random() to randomly bucket users into variations. Once bucketed, a user should only see their assigned variation for the lifetime of the experiment. However, introducing Math.random() adds indeterminism to your codebase, which will be hard to reason about and hard to test later. Storing the result of the bucketing decision also forces your platform to be stateful. A better approach is to use hashing as a random but deterministic and stateless way of bucketing users.

To visualize how hashing can be used for bucketing, let's represent the traffic to your application as a number line from 0 to 10,000. For an experiment with two variations of 50% each, the first 5,000 numbers of the number line correspond to the 50% of your traffic that will get variation A, and the second 5,000 numbers correspond to the 50% of your traffic that will receive variation B. The bucketing process then simplifies to assigning each user a number between 0 and 10,000.

Using a hashing function that takes as input the user id (ex: user123) and experiment id (ex: homepage_experiment) or feature key, and outputs a number between 0 and 10,000, you achieve that numbered assignment for assigning variations:

hash('user123', 'homepage_experiment') -> 6756 // variation B

Because of the properties of a good hash function, you are always guaranteed a deterministic but pseudo-random output given the same inputs, which gives you several benefits:
Your application runs predictably for a given user
Automated tests run predictably because the inputs can be controlled
Your progressive delivery and experimentation platform is stateless by
re-evaluating the hashing function at any time rather than storing the
result
A large hashing range like 0 to 10,000 allows assigning traffic granularity
at fine increments of 0.01%
The same pseudo-random bucketing can be used for random phased
rollouts
You can exclude a percentage of traffic from an experiment by excluding
a portion of the hashing range
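To make this concrete, here is a minimal sketch of hash-based bucketing in Node.js. md5 is used purely for illustration; production systems often prefer a fast non-cryptographic hash such as MurmurHash:

// Sketch: deterministic, stateless bucketing via hashing.
const crypto = require('crypto');

function bucket(userId, experimentId) {
  const digest = crypto.createHash('md5')
    .update(userId + ':' + experimentId)
    .digest();
  // Map the first 4 bytes of the digest onto the 0-9,999 number line.
  return digest.readUInt32BE(0) % 10000;
}

function getVariation(userId, experimentId) {
  // 50/50 split: 0-4,999 -> variation A, 5,000-9,999 -> variation B.
  return bucket(userId, experimentId) < 5000 ? 'A' : 'B';
}

getVariation('user123', 'homepage_experiment'); // same inputs, same variation, every time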
02 Best practice: Use A/B tests for insight on specific metrics

A/B tests make the most sense when you want to test different hypotheses for improving a specific metric. The following are examples of the types of metrics that allow you to run statistically significant A/B tests.

Product metrics: For product improvements

Landing page signups
Ever wonder which landing page would lead to the most signups for your product? During the 2008 presidential campaign, Barack Obama's optimization team ran several A/B tests to determine the best image of Obama and corresponding button text to put on the landing page of the campaign website. These A/B-tested adjustments increased signups and led to $60 million of additional donations from their website.

Referral signups through shares
Want to know which referral program would increase the virality of your product most cost-effectively through sharing? Ride-sharing services like Lyft and Uber often experiment on the amount of money to reward users for bringing other users to their platforms (ex: give $20 to a friend and get $20 yourself). It's important to get this amount right so the cost of growth doesn't negatively impact your business in the long term.
[Figure: hash() maps user123, user456, and user981 to 1812, 5934, and 8981 on the 0-10,000 number line; user123 falls below 5,000 and is bucketed into variation A, while user456 is bucketed into variation B]
Operational metrics: For infrastructure improvements

Latency & throughput
If engineers are debating over which implementation will perform best under real-world conditions, you can gather statistical significance on which solution is more performant with metrics like throughput and latency.

Error rates
If your team is working on a platform or language shift and has a theory that your application will produce fewer errors after the change, then error rates can serve as a metric to determine which platform is more stable.

03 Best practice: Capture event information for analyzing an A/B test

When instrumenting for A/B testing and tracking metrics, it's important to track both impression and conversion events because each type includes key information about the experiment.

Impression event
An impression event occurs when a user is assigned to a variation of an A/B test. For these events, the following information is useful to send as a payload to an analytics system: an identifier of the user, an identifier of the variation the user was exposed to, and a timestamp of when the user was exposed to the variation. With this information as an event in an analytics system, you can attribute all subsequent actions (or conversion events) that the user takes to being exposed to that specific variation.

Conversion event
A conversion event corresponds to the desired outcome of the experiment. Looking at the example metrics above, you could have conversion events for when a user signs up, when a user shares a product, the time it takes a dashboard to load, or when an error occurs while using the product. With conversion events, the following information is useful to send as a payload to an analytics system: an identifier of the user, an identifier of the type of event that happened (ex: signup, share, etc.), and a timestamp.
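As a rough sketch of what this instrumentation can look like in application code, the endpoint URL, trackEvent helper, and payload shape below are illustrative rather than a specific analytics API:

// Sketch: sending impression and conversion events to an analytics system.
function trackEvent(payload) {
  // Fire-and-forget POST to a hypothetical analytics endpoint.
  fetch('https://analytics.example.com/events', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(payload),
  });
}

// Impression: recorded when the user is assigned a variation.
trackEvent({
  userId: 'user123',
  variationId: 'free-shipping',
  timestamp: new Date().toISOString(),
});

// Conversion: recorded when the desired outcome happens.
trackEvent({
  userId: 'user123',
  eventId: 'purchase',
  value: 50, // omitted for binary events like signed_up
  timestamp: new Date().toISOString(),
});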
04 Best practice: Avoid common experiment analysis pitfalls

Once you have the above events, you can run an experiment analysis to compare the number of conversion events in each variation and determine which one is statistically stronger.

However, experiment analysis is not always straightforward. It's best to consult data scientists or trained statisticians to help ensure your experiment analysis is done correctly. Although this book does not dive deep into statistics, you should keep an eye out for these common pitfalls.

Multiple comparisons
Creating too many variations or evaluating too many metrics will increase the likelihood of seeing a false positive just by chance. To avoid that outcome, make sure the variations of your experiment are backed by a meaningful hypothesis, or use a statistical platform that provides false discovery rate control.

Small sample size
If you calculate the results of an A/B test when only a small number of users have been exposed to the experiment, the results may be due to random chance rather than the difference between variations. Make sure your sample size is big enough for the statistical confidence you want.

Peeking at results
A classical experiment should be set up and run to completion before any statistical analysis is done to determine which variation is the winner. This is referred to as fixed-horizon testing. Allowing experimenters to peek at the results before the experiment has reached its sample size increases the likelihood of seeing false positives and making the wrong decisions based on the experiment. However, in the modern digital world, employing solutions like sequential testing can allow analysis to be done in real time during the course of the experiment.

Example impression events

USER ID   | VARIATION ID  | TIMESTAMP
Caroline  | original      | 2019-10-08T02:13:01
Dennis    | free-shipping | 2019-10-08T05:30:46
Flynn     | free-shipping | 2019-10-09T01:11:20

Example conversion events

USER ID   | EVENT ID    | VALUE | TIMESTAMP
Caroline  | purchase    | 50    | 2019-10-08T00:05:32
Dennis    | purchase    | 30    | 2019-10-08T00:07:19
Flynn     | add_to_cart | 10    | 2019-10-09T01:15:51
Flynn     | add_to_cart | 5     | 2019-10-09T01:14:23
Erin      | signed_up   | -     | 2019-11-09T12:02:36

Note: The identifiers used in the tables above are just for illustration purposes. Typically, identifiers are globally unique, non-identifiable strings of digits and characters. Also note that a value is included for conversion events that are non-binary (ex: how much money was associated with a purchase event). Binary events, like someone signing up, have no associated value: they either happened or did not.
A/B/n tests go beyond two variations to test feature configurations

A simple feature flag is just an on-and-off switch that corresponds to the A and B variations of an A/B test. However, feature flags become more powerful when they expose not only whether the feature is enabled, but also how the feature is configured. For example, if you were building a new dashboard feature for an email application, you could expose an integer variable that controls the number of emails to show in the dashboard at a time, a boolean variable that determines whether to show a preview of each email in the dashboard list, or a string variable that controls the text of the button for removing emails from the dashboard list.

By exposing feature configurations as remote variables, you can enable A/B tests beyond just two variations. In the dashboard example, you can experiment not only with turning the dashboard on or off, but also with different versions of the dashboard itself. You can see whether email previews and fewer emails on screen enable users to get through their email faster.

A feature configuration might look like:

{
  title: "Latest Updates",
  color: "#36d4FF",
  num_items: 3,
}
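Here is a sketch of how application code might consume these remote variables; the getFeatureVariable accessor and renderDashboard function are hypothetical stand-ins for whatever your platform exposes:

// Sketch: reading remote feature variables for an A/B/n test.
const showPreviews = getFeatureVariable('new_dashboard', 'show_previews');     // boolean
const numEmails = getFeatureVariable('new_dashboard', 'num_items');            // integer
const removeLabel = getFeatureVariable('new_dashboard', 'remove_button_text'); // string

renderDashboard({
  previews: showPreviews,
  pageSize: numEmails,
  removeButtonLabel: removeLabel, // each variation can supply different values
});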
Feature flag driven development

One challenge besides knowing what to feature flag for an experiment is knowing how to integrate this new process into your team's existing software development cycle.

Taking a look at each of your in-progress initiatives and asking questions upfront can help you build feature flags into your standard development process. For example, by asking "How can we de-risk this project with feature flags?" you highlight how the benefits of feature flags outweigh the cost of an expensive bug in production or a disruptive, off-hours hotfix. Similarly, by asking "How can we run an experiment to validate or invalidate our hypothesis for why we should build this feature?" you will find that spending engineering time building the wrong feature is much more costly than investing in a well-designed experiment. These questions should increase your overall net productivity by moving your team toward a progressive delivery and experiment-driven process.

You can start to incorporate feature flags into your development cycle by asking a simple question in your technical design document template. For instance, by asking "What feature flag or experiment are you using to roll out or validate your feature?" you insert feature flags into discussions early in the development process. Because technical design docs are a standard part of the process for large features, the document template is a natural place for feature flags to help de-risk complex or big launches.

Best practice: Ask feature-flag questions in technical design docs:

"How can we de-risk this project with feature flags?"

"How can we run an experiment to validate or invalidate our hypothesis for why we should build this feature?"
02

Scale from tens to hundreds of feature flags and experiments

After completing your first few experiments, you will probably want to take a step back and start thinking about improvements to help scale your experimentation program. The following best practices are things you will need to consider when scaling from tens to hundreds of feature flags and experiments.

Decide when to feature flag, roll out, or A/B test

One challenge with experiments, feature flags, and rollouts is that you may be tempted to use them for every possible change. It's worth recognizing that even in an advanced experimentation organization, you likely won't be feature flagging every single change or A/B testing every feature. The high-level decision tree below can be useful when determining whether to run an experiment or set up a feature behind a flag.

Best practice: Don't feature flag or A/B test every change.

[Figure: Decision tree "Should I run an experiment or a rollout?" with branches for working on docs, a refactor, a bug, or a feature]
Reduce complexity by centralizing constants

When dealing with experiments or feature flags, it's best practice to use a human-readable string identifier or key to refer to the experiment so that the code describes itself. For example, you might use the key 'promotional_discount' to refer to the feature flag powering a promotional discount feature that is enabled for certain users.

It's easy to define a constant for this feature flag or experiment key exactly where it's going to be used in your codebase. However, as you start using a lot of feature flags and experiments, your codebase will soon be riddled with keys. Centralizing the constants into one place can help.

01 Centralize constants to visualize complexity
Centralizing all feature keys gives you a better sense of how many features exist in a codebase. This enables an engineering team to develop processes for revisiting the list of feature flags and removing stale ones. Having a sense of how many feature flags exist also gives a sense of the codebase and product complexity. Some organizations may decide to limit the number of active feature flags to reduce this complexity.

Best practice: Compile all the feature flags currently in use in an application into a single constants file.

02 Share constants to keep applications in sync
Some feature flags and experiments are cross-product or cross-codebase. By having a centralized list of feature key constants, you are less likely to have typos that prevent this type of coordination across your product. For example, let's say you have a feature flag file that is stored in your application backend but passed to your frontend. This way, the frontend and backend not only reference the same feature keys but can also deliver a consistent experience across the frontend and backend.

Best practice: Use the same feature key constants across the backend and frontend of your application.
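As an illustrative sketch, a shared constants module might look like this; the file name and keys are hypothetical:

// featureFlags.js — a single shared constants module imported by both
// backend and frontend code.
const FeatureFlags = {
  NEW_DASHBOARD: 'new_dashboard',
  PROMOTIONAL_DISCOUNT: 'promotional_discount',
  HOMEPAGE_HERO_EXPERIMENT: 'homepage_hero_experiment',
};

module.exports = FeatureFlags;

// Elsewhere, on either side of the stack:
// const FeatureFlags = require('./featureFlags');
// if (isFeatureEnabled(FeatureFlags.NEW_DASHBOARD, userId)) { ... }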
Document your feature flags to understand original context

As a company implements more feature flags and experiments, the codebase gains more identifiers or keys referencing these items (like site_redesign_phase_1 or homepage_hero_experiment). However, the keys used in code will inevitably lack the full context of what the rollout or experiment actually does. For example, if an engineer saw site_redesign_phase_1, it's unclear what the redesign includes or what is in phase 1. Although you could increase the verbosity of these keys so that they are self-explanatory, it's a better practice to have a process by which anyone can find the original context or documentation behind a given feature rollout or experiment.

Best practice: Make sure your team can easily find out:

What the rollout or experiment changes
The owner of the rollout or experiment
Whether the rollout or experiment can be safely paused or rolled back
Whether the lifetime of the experiment or rollout is temporary or permanent

Oftentimes, engineering teams will rely on an integration between their task tracking system and their feature flag and experiment service to add context to their feature flags and experiments.

Ensure quality with the right automated testing & QA strategy

Ensuring your software works before your users try it out is paramount to building trustworthy solutions. A common way for engineering teams to ensure their software is running as expected is to write automated unit, integration, and end-to-end tests. However, as you scale experimentation, you'll need a strategy to ensure you still deliver high-quality software without an explosion of automated tests.

Testing every combination is not going to be sustainable. As an example, let's say you have an application with 10 features and 10 corresponding automated tests. If you add just 8 on/off feature flags, you theoretically now have 2^8 = 256 possible additional states, which is nearly 25 times as many tests as you started with. Because testing every possible combination is nearly impossible, you'll want to get the most value out of the automated tests you do write. Make sure you understand the different levels of automated testing, which include:
[Figure: The testing pyramid—unit tests at the base, then integration tests, end-to-end tests, and manual QA at the top]
Best practice: Ensure the building
blocks of your application are well
tested with lots of unit tests. The
smaller units are often unaware
of experiment or feature state. For
those units that are aware, use
mocks and stubs to control this
white-box testing environment.
01 Unit tests—test frequently for solid building blocks
Unit tests are the smallest pieces of testable code. It’s best practice that
these units are so small that they are not aware or are not affected by
experiments or feature flags. As an example, if a feature flag forks into
two separate code paths, each code path should have its own set of
independent unit tests. You should frequently test these small units of
code to ensure high code coverage, just as you would if you didn’t have any
feature flags or experiments in your codebase.
If the code you are unit testing does need to contain code that is affected by
a feature flag or experiment, take a look at the techniques of mocking and
stubbing described in the integration tests section below.
02 Integration tests—force states to test code
For integration tests, you are combining units into higher-level business
logic. This is where experiments and feature flags will likely affect the logical
flow of the code, and you’ll have to force a particular variation or a state of a
feature flag in order to test the code.
In some integration tests, you’ll still have complete access to the code’s
executing environment where you can mock out the function calls to
external systems or internal SDKs that power your experiments to force
particular code paths to execute during your integration tests. For example,
you can mock an isFeatureEnabled SDK call to always return true in an
integration test. This removes any unpredictability, allowing your tests to run
deterministically.
In other integration tests, you may not have access to individual function
calls, but you can still stub out API calls to external systems. For example,
you can stub data powering the feature flag or experimentation platform to
return an experiment in a specific state to force a given code path.
Although you can mock out indeterminism coming from experiments or feature flags at this stage of testing, it's still best practice for your code and tests to have as little awareness of experiments or feature flags as possible, and to focus on verifying that the code paths of the variations execute as expected.
Best practice: Use mocks and stubs
to control feature and experiment
states. Focus on individual code
paths to ensure proper integration
and business logic.
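For instance, here is a minimal sketch in a Jest-style test runner; the module paths, flag client, and page renderer are illustrative, and your project may use a different framework:

// Sketch: forcing a feature-flag state by mocking the module that exposes
// isFeatureEnabled, so the test never touches a live flag service.
jest.mock('./featureFlagClient', () => ({
  isFeatureEnabled: jest.fn().mockReturnValue(true), // force the flag on
}));

const { renderDashboardPage } = require('./dashboardPage');

test('shows the new dashboard when the flag is on', () => {
  const page = renderDashboardPage('user123');
  expect(page).toContain('new-dashboard'); // deterministic code path under test
});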
Best practice: Do not test every
possible combination of experiment
or feature with end-to-end tests.
Instead, focus on important
variations or tests that ensure your
application still works if all features
are on/off.
03 End-to-end tests—focus testing on critical variations
End-to-end tests are the most expensive tests to write and maintain
because they’re often black-box tests that don’t provide good control over
their running environment and you may have to rely on external systems.
For this reason, avoid relying on end-to-end or fully black-box tests to
verify every branch of every experiment or feature flag. This combinatorial
explosion of end-to-end tests will slow down your product development.
Instead, reserve end-to-end tests for the most business-critical paths of an
experiment or feature flag or use them to test the state of your application
when most or all of your feature flags are in a given state. For example, you
may want one end-to-end test for when all your feature flags are on, and
another when all your feature flags are off. The latter test can simulate what
would happen if the system powering your feature flags goes down and
must degrade gracefully.
When you do require end-to-end tests, make sure you can still control the
experiment or feature-flag state to remove indeterminism. For example,
in a web application, you may want to have a special test user, a special
test cookie, or a special test query parameter that can be used to force
a particular variation of an experiment or feature flag. Note that when
implementing these special overrides, be sure to make them internal-only
so that your users don’t have the same control over their own feature or
experiment states.
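One possible shape for such an override, sketched as a request-level check; the query parameter name and the isInternalRequest helper are hypothetical:

// Sketch: an internal-only override hook for end-to-end tests.
function resolveVariation(req, userId, experimentId) {
  const forced = req.query.force_variation;
  if (forced && isInternalRequest(req)) { // never honor overrides from real users
    return forced;                        // e.g., ?force_variation=B in a test run
  }
  return getVariation(userId, experimentId); // normal deterministic bucketing
}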
04 Manual verification (QA)—reserve for business-critical functions

Similar to end-to-end tests, manual verification of different variations can be difficult and time consuming, which is why organizations typically run only a few manual QA tests. Reserve manual verification for business-critical functions. And if you implemented special parameters to control the states of experiments or feature flags for end-to-end tests, these same parameters can be used by a QA team to force a variation and verify a particular experience.

Best practice: Save time and resources by reserving manual QA for testing the most critical variations. Make sure you provide tools for QA to force feature and experiment states.

Increase safety with the right permissions

As more individuals contribute to your progressive delivery process, it becomes imperative to have safe and transparent practices. Permissioning, exposing user states, and emulation enable your team to keep the process secure and viable as you scale.
01 Establish permissions based on roles

With rollouts and experiments, your team will typically have a dashboard where you can edit production configurations without making changes to the core development repository. With this setup, you'll want to consider the permissions of different parts of your organization and their ability to make changes to your rollouts and experiments. Ideally, your permissions should match the permissioning you would typically use for feature development, which includes:

Read-level access
Almost everyone at your company should likely have at least read-level access, allowing them to see the rollouts and experiments actively running at any given time.

Edit-level access
Anyone who can do standard feature development should have standard edit-level access. However, it's best practice to require a higher level of edit access for important system-wide infrastructure or billing configuration.

Administrative access
Individuals with the ability to provision or change permissions for standard feature development should have administrative access to the feature flag and experimentation configuration. This allows an IT team or super user to provision and secure the above roles for individual developers on the team.
02 Expose user state for observability
As your company uses more feature flags and experiments, the different
possible states that a given customer could experience begins to
combinatorially explode. But if one of those customers encounters an issue,
you’ll want to know which states of rollouts or experiments may be active to
understand the full state of the world for that particular individual. It’s best
practice to have an interface either through a UI, a command line tool, or a
dashboard to query which feature flags or experiments a given customer/
user/request receives for increased observability of your system.
If your bucketing tool is deterministic (it will always give the same bucketing
decision given the same inputs), then you can easily provide a tool that
takes the customer information as inputs and the state they should be
experiencing as outputs.
Best practice: Consider
enabling features
customer by customer
and using a centralized
tool for anyone to
input a customerId and
see what combination
of features that
particular customer has
enabled.
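Because bucketing is deterministic, such a tool can simply re-evaluate every flag for a given customer. A sketch, where listAllFeatureKeys is a hypothetical helper that enumerates your flag registry:

// Sketch: report every flag decision for one customer.
function getUserState(customerId) {
  const state = {};
  for (const featureKey of listAllFeatureKeys()) {
    state[featureKey] = isFeatureEnabled(featureKey, customerId);
  }
  return state;
}

// Example: feed the output to a dashboard or CLI for support and debugging.
// getUserState('customer42') -> { new_dashboard: true, promotional_discount: false, ... }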
03 Allow emulation for faster debugging
Even if you know the particular combination of features and experiments
that a given customer has access to, it can sometimes be difficult to reason about why they are seeing a particular experience. This is similar to a more
general problem of debugging complex customer issues. Many engineering
organizations build an ability to emulate a user’s view of the product to
make it easier to see what the customer sees when debugging. If using this
technique, ensure you use appropriate access controls, permissions, and
restrictions. Some engineering organizations also have the ability to let their
production site load a development version of their code in order to enable
engineers to test out fixes on hard-to-replicate setups. Both techniques are
extremely helpful in minimizing the time to debug specific issues that are
only relevant to a certain combination of feature flag or experiment states.
03

Enable company-wide software experimentation

Scaling your experimentation program across your entire organization can become complicated quickly. With these best practices you'll be able to minimize the complexity of an advanced system running hundreds of experiments or rollouts simultaneously.

Prevent emergencies with smart defaults

When you opt for a separate experimentation or rollout service to control your configuration, you must be prepared for when the service goes down. This is where smart defaults can help, by answering the following questions ahead of time. If the feature flag service went down:

Would you prefer that all users have the experience of the feature being on or off?
Would you prefer that all users get the version that was most recently delivered by the feature flag or experimentation service?
What configuration or feature variable values would be preferred for your users?

Some organizations save a snapshot of the feature flag and experiment state in the codebase at a regular fixed interval. This provides a smart, local fallback that is fairly recent in case the remote system goes down.

Best practice: Think through all possible failure scenarios and prepare for them with smart defaults for when your feature flagging or experimentation services go down.
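A sketch of what a snapshot-plus-defaults fallback can look like in code; the service client, snapshot loader, and flag names are all illustrative:

// Sketch: wrap flag decisions with smart defaults and a local snapshot fallback.
const FLAG_DEFAULTS = { new_dashboard: false, promotional_discount: false };
let lastKnownGoodConfig = loadSnapshotFromDisk(); // hypothetical periodic snapshot

function isFeatureEnabledSafe(featureKey) {
  try {
    return remoteFlagService.isEnabled(featureKey); // normal path
  } catch (err) {
    // Service down: prefer the most recent snapshot, then the hard default.
    if (lastKnownGoodConfig && featureKey in lastKnownGoodConfig) {
      return lastKnownGoodConfig[featureKey];
    }
    return FLAG_DEFAULTS[featureKey] ?? false;
  }
}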
Avoid broken states with simple feature flag dependency graphs

As a company builds more features and experiments, it's likely that an engineering team will find a feature flag or experiment built on top of an existing one. Take, for example, a feature flag used to roll out a new dashboard UI. As you roll out the new dashboard, your team may want to experiment on a component of the new UI. Although you could manage both the rollout and the experiment with the same feature flag, there are reasons you might want to separate the two, allowing the rollout to happen independently of the experiment. However, to see the different variations of the dashboard component, a user must both have the feature flag enabled and be in a particular variation of the experiment. You now have a kind of dependency graph of your feature flags and experiments. Naturally, your systems will develop more and more of these feature flag dependencies where one feature depends on another.

It's best practice to minimize these dependencies as much as possible and strive for feature flag combinations that won't break the application state. If enabling feature flags A and B, but not C, results in a broken application, then it's likely your team lacks this contextual knowledge and could accidentally put your application in a bad state just by changing feature flag configurations.

One option is to keep your feature flag hierarchy extremely shallow—for instance, a simple 1:1 mapping between feature flags and customer-facing features—ensuring there are few dependencies between flags.

Best practice: Keep dependencies between flags simple to avoid broken states.

Balance low-latency, consistency, and compatibility in a microservice architecture

When you start to deploy multiple versions of features and experiments across multiple separate applications or services, you'll want to ensure the services are consistent in how they evaluate feature flags and experiments. If they aren't—where some evaluate the feature flag as on and others evaluate it as off—you could run into backward or forward compatibility issues between services, and your application might break. Below are two options for developing in this microservice architecture.
01 Services independently evaluate feature state
The benefit of each service independently evaluating the state of feature
flags on its own is that it minimizes the dependencies of a given service.
The downside is that it requires updating every service. Also, if the services
are truly independent, then they will be less consistent with the state of a
feature flag. For example, when you toggle a feature flag on, there will be a
time when some services get the update and evaluate the feature flag as on
while others are still receiving the update and evaluate it as off. Eventually,
all services will get the update and evaluate the flag as on. In other words,
the independent services are eventually consistent. In this case, it’s best
practice to put in the extra work to make sure the different feature flag and
experiment states are forward and backward compatible with the other
services to prevent unexpected states across services.
Best practice: Put in
the extra work to make
sure your different
states are forward and
backward compatible
with the other
services.
Example: Services are independent
02 Services depend on feature state service
In this architecture, the services all depend on a centralized hub, ensuring
they’re all consistent with the way they evaluate a feature flag or experiment.
Although this architecture is consistent and does not have to worry about
backward and forward compatibility, it comes at the cost of latency.
Because each service has to communicate to this separate feature flag
or experimentation service, you will have added the necessary latency to
achieve a consistent state across services.
A separate feature and experiment service does have the added benefits of
being:
Example: Services depend on central service
Easily implemented in a microservice architecture
Compatible with other services in different languages by exposing APIs in
the form of language-agnostic protocols like HTTP or gRPC
Centralized for easier maintenance, monitoring, and upgrading
Best practice: Expect some added latency as the price of consistent evaluation of feature flags and experiments across services.
Prevent technical debt by understanding feature flag and experiment lifetimes

As your organization uses more feature flags and experiments, it's paramount to understand that some of these changes are ephemeral and should be removed from your codebase before they become outdated and add technical debt and complexity.

One heuristic you can track is how long the feature flag or experiment has been in your system and how many different states it's in. If the feature flag has been in your system for a long time and all of your users have the same state of the feature flag, then it should likely be removed. However, it's smart to always evaluate the purpose of a flag or experiment before removing it. The real lifetime of an experiment or feature flag depends heavily on its use case.

01 Remove temporary flags and experiments

If a feature is designed to be rolled out to everyone and you don't expect to experiment on the feature once it's launched, then you'll want to ensure you have a ticket tracking its removal as soon as the feature has been fully launched to your users. These temporary flags may last weeks, months, or quarters. Examples include:

Painted-door experiments
These experiments are intended to be used only in the early phases of the software development lifecycle and aren't intended to be kept in the product after they have validated or invalidated the experiment hypothesis.

Performance experiments
These experiments are intended to pit two different implementations against each other in a live, real-world performance experiment. Once enough data has been gathered to determine the more performant solution, it's usually best to move all traffic over to the higher-performing variation.

Large-scale refactors
When moving between frameworks, languages, or implementation details, it's useful to deploy these rather risky changes behind a feature flag so that you have confidence they will not negatively impact your users or business. However, once the refactor is done, you hopefully won't go back in the other direction.
Best practice: To
avoid technical debt,
regularly review
flags in case they’re
obsolete or should be
deprecated, even if
they’re meant to be
permanent.
Product re-brands
If your business decides to change the look and feel of your product for brand purposes, it's useful to have a rollout to gracefully move to the new branding. After the new branding is established, it's a good idea to remove the feature flag powering the switch.

02 Review permanent flags and experiments

If a feature is designed to have different states for different customers, or you want to control its configuration for operational processes, then it's likely the flag will stay in your product for a longer period of time. Examples of these flags and experiments include:

Permission flags
These flags are useful if you have different permission levels in your product, like a read-only level that doesn't allow edit access to the feature. They are also useful if you have modular pricing, like an inexpensive "starter plan" that doesn't include the feature and a more costly "enterprise plan" that does.

Operational flags
These flags control the operational knobs of your application. For example, they can control whether you batch events sent from your application to minimize the number of outbound requests. You could also use them to control the number of machines used for horizontally scaling your application. In addition, they can be used to disable a computationally expensive, non-essential service or allow for a graceful switchover from one third-party service to another during an outage.

Configuration-based software
For any software or product that is powered by a config file, this is a great place to seamlessly insert experimentation that has a low maintenance cost and still allows endless product experimentation. For example, some companies have their product layout powered by a config file that describes in abstract terms whether different modules are included and how they are positioned on the screen. With this architectural setup, even if you aren't running an experiment right now, you can still enable future product experimentation.

Note that even if a flag is meant to be permanent, it's still paramount to regularly review these flags in case they are obsolete or should be deprecated. Otherwise, keeping these permanent flags may add technical debt to your codebase.
Make your organization resilient with a code-ownership strategy

As with any feature you build, the individuals and teams that originally engineered the feature or experiment are not going to be around forever. As a best practice, your engineering organization should agree on who is responsible for owning and maintaining a feature or experiment. This is particularly important for when you need to remove or clean up old feature flags or experiments. Options around ownership include:

01 Individual ownership
Individual developers are labeled as owners of a feature or an experiment. At a regular cadence, for example every two quarters, ownership is re-evaluated and transferred if necessary.
Pros: Simple and understandable.
Cons: Hard to maintain if engineers frequently move between projects or code areas.

02 Feature team ownership
The team responsible for building the feature takes ownership of the feature and experiment.
Pros: Resilient to individual contributor changes.
Cons: Hard to maintain if teams are constantly changing or are unbalanced and have an uneven distribution of ownership.

03 Centralized ownership
This ownership falls under a dedicated experimentation or growth team with experts who set up, run, and remove experiments. The downside is that it severely limits the scale of your experimentation program to the size of this central team. The upside is that this centralized team can be the experiment experts at your company and help ensure experiments are always run with high quality. This method can be especially helpful when getting started, and it's useful to have one team prove the best practices before fanning them out to other parts of the organization.
Pros: Resilient to many changes and simplest to reason about.
Cons: Central teams aren't going to be the experts where the experiment is actually implemented and may require a lot of help from other teams. The size of this team will eventually limit the number of experiments and feature flags that your company can handle.

Some organizations use an integration between a task tracking system and their feature flag and experiment service to manage this cleanup process seamlessly and quickly. If the state of feature flags and experiments can be synced with a ticket tracking tool, then an engineering manager can run queries for all feature flags and experiments whose state has not changed in the past 30 days and track down the owners of those flags and experiments for review. Other organizations hold a recurring feature flag and experiment removal day, in which engineers review the oldest items in the list at a regular cadence.
Minimize on-call surprises by treating changes
like deploys
Companies often have core engineering hours for a given product or feature,
for example, a core team working Monday through Friday in similar time
zones. Even companies that do continuous integration and continuous
delivery realize that deploying production code changes outside of these
core working hours (either late at night or on the weekends) is usually a bad
idea. If something goes wrong with a deploy outside of these working hours,
teams run the risk of releasing issues when the full team is not around. With
fewer teammates, it’s slower to fix and mitigate the impact that the issue
will have. Because experiments and rollouts give individuals the ability to
easily make production code changes, it’s best practice to treat changes in
an experiment or a rollout with the same level of focus as standard software
deploys. Avoid making changes when no one is around or no one is aware of
the change. If it is advantageous to make changes during off-hours, do so
transparently with proper communication so no one is caught by surprise.
Best practice: Save rollout or experiment changes for core working hours and avoid making these changes on Fridays and before a weekend or a holiday. If you must, do it responsibly.
Expect changes and recover from mistakes faster with audit logs

When someone at your organization deploys code changes to your product, it's best practice to have a change log that lets everyone know what the change was, who made it, and why. With this information, you won't be surprised by changes to your product or if a user sees something new. This practice of increasing visibility into the changes to your application is no different for feature flags and experiments—you'll want to be able to quickly answer questions like:

A user's experience changed recently; did we start an experiment?

Did anyone recently change the traffic allocation or targeting of this feature to include this set of users?

An unexpected bug recently occurred for a customer, but we didn't deploy recently; did anyone change anything regarding an experiment or feature-flag state?

Having a change history or audit log for your feature management and experimentation platform is key to scaling your experiments while still having the ability to quickly diagnose changes to your users' experience. An audit log can also speed the time to recovery by pinpointing the root cause of an undesirable change and more quickly revealing its implications.

Best practice: Build in broader visibility with a change history or audit logs so you can quickly diagnose issues and pinpoint root causes.
Understand your systems with intelligent alerting

With many possible states, you'll want visibility into what's actually happening in production for your customers and intelligent alerting for when things are not acting as expected. For example, you may have thought you released a feature to all of your customers, only to realize that a targeting condition prevented the feature from being visible to a portion of them. An alert for when the feature flag has been accessed by X number of users can be a useful way to ensure that your feature flags are acting as expected in the wild. Some organizations even set up systems to auto-stop a rollout if errors can be attributed to the new feature-flag state in production.

Code smarter systems by making them feature flag and experiment agnostic

Not all your code has to know about all the experiments and feature flags in your product. In fact, the less your code has to worry about the experiments being run on your product, the better. By striving for code that's experiment agnostic or feature-flag unaware, you can focus on the particular product or feature you are building without having to worry about the different states. The techniques below are just two examples of the different design patterns available[3]:

Move the fork

If you have a feature flag with two different experiences, moving the point where you fork the experience can affect whether individual code paths depend on the feature-flag state. For example, you could either have a frontend component fork the experience inside the component, or you could have the frontend page swap out different components entirely, so they don't have to be aware of the feature-flag state.
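A minimal sketch of moving the fork up to the page level, in the style of the earlier examples; the component names are illustrative:

// Sketch: the page swaps whole components, so neither dashboard component
// is flag-aware.
function renderDashboardPage(userId) {
  return isFeatureEnabled('new_dashboard', userId)
    ? renderNewDashboard()
    : renderExistingDashboard();
}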
Avoid conditionals

Instead of deploying a feature flag or experiment using if/else branches
in an imperative style, consider a declarative style where your feature is
controlled by a variable configuration that could be managed by not only a
feature flag service but any remote service.

For example, if you were experimenting on an email subject, you could
code the variation as:

email = new Email();
email.subject = "Welcome to Optimizely";
variation = activateExperiment("email_feature");
if (variation == "treatment_A") {
  email.subject = "Hello from Optimizely";
}

Or you could decide to remove the conditional and have the subject just be
provided as a variable by the experimentation service:

email = new Email();
email.subject = getVariable("email_feature", "subject");

Or you could recognize that you can declaratively code the subject as a
variable property on your email class that has defaults that a feature or
experimentation service can override:

@feature("email_feature")
class Email {
  @variable("subject")
  subject = "Welcome to Optimizely"
}

In these latter two implementation techniques, you reduce how experiment
aware or feature-flag aware your email code paths are.

Understand your systems with
intelligent alerting

With many possible states, you'll want visibility into what's actually
happening in production for your customers and have intelligent alerting for
when things are not acting as expected. For example, you may have thought
you released a feature to all of your customers, only to realize that a targeting
condition made the feature visible to only a portion of them.
Having an alert for when the feature flag has been accessed by X number
of users can be a useful way to ensure that your feature flags are acting as
expected in the wild. Some organizations even set up systems to auto-stop a
rollout if errors can be attributed to the new feature-flag state in production.
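As a hedged sketch of what such a check could look like (every helper below
is a hypothetical stub you would wire to your own monitoring and feature-flag
systems):

// Hypothetical helpers, stubbed for illustration
const countExposures = (flagKey) => 0; // users who hit the flag recently
const countErrors = (flagKey) => 0; // errors attributed to the flag
const sendAlert = (message) => console.log("ALERT:", message);
const pauseRollout = (flagKey) => console.log("Pausing rollout of", flagKey);

const EXPECTED_MIN_EXPOSURES = 1000; // tune to your traffic
const ERROR_RATE_THRESHOLD = 0.05; // 5% of exposed users hitting errors

function checkRollout(flagKey) {
  const exposures = countExposures(flagKey);
  if (exposures < EXPECTED_MIN_EXPOSURES) {
    sendAlert(flagKey + " reached fewer users than expected; check targeting");
  }
  const errorRate = countErrors(flagKey) / Math.max(exposures, 1);
  if (errorRate > ERROR_RATE_THRESHOLD) {
    pauseRollout(flagKey); // auto-stop before more users are affected
  }
}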
Increase developer speed with
framework-specific engineering
Integrating feature flags and experiments into the same tools and systems
that you already use will make scaling your experimentation program easier.
To some, this means making experiments and feature flags ergonomic; to
others, it means working where you work and how you work. For React,
it's much easier to develop with the mindset of components. For Express, it's
much easier to develop with the mindset of middleware.

Each framework and platform has its own idiomatic ways of developing.
The closer your feature flags and experiments match those same idiomatic
patterns, the more likely you are to easily integrate them into your
development, testing, and deployment processes.
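As a rough sketch of the Express case (the isFeatureEnabled stub is
illustrative; a real implementation would consult your flag service), flag
decisions can be made once per request in an idiomatic middleware:

const express = require("express");
const app = express();

// Stub for illustration
const isFeatureEnabled = (flagKey, userId) => false;

// Middleware: decide flags once per request, the idiomatic Express way
app.use((req, res, next) => {
  req.features = {
    newDashboard: isFeatureEnabled("new_dashboard", req.query.userId),
  };
  next();
});

app.get("/dashboard", (req, res) => {
  res.send(req.features.newDashboard ? "new dashboard" : "existing dashboard");
});

app.listen(3000);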
Leverage configuration-based
design to foster a culture of
experimentation
To truly achieve a culture of experimentation, you have to enable non-
technical users to experiment across your business and product.
This starts with architecting a system that does not require engineering
involvement but instead has enough safeguards to prevent individuals from
breaking your product.
Configuration-based development is an architectural pattern commonly
used to enable this type of large-scale experimentation because it uses
configuration files to power your product. For instance, a configuration file
that powers the layout of content in a mobile application, or a configuration
file that controls the scaling characteristics of a distributed system. By
centralizing the different possible product states to a configuration file that
can be validated programmatically, you can enable experiments to power
different configuration setups while maintaining confidence that your
application can’t be put in a broken state.
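As a minimal sketch (the config shape and validator below are illustrative
assumptions, not a prescribed schema), validating a layout configuration
before applying it might look like:

const DEFAULT_LAYOUT = { heroStyle: "banner", itemsPerRow: 3 };

// A layout config that a feature or experimentation service could vary
const remoteLayout = { heroStyle: "carousel", itemsPerRow: 4 };

// Validate programmatically so no variation can put the app in a broken state
function isValidLayout(config) {
  const validHeroStyles = ["banner", "carousel"];
  return (
    validHeroStyles.includes(config.heroStyle) &&
    Number.isInteger(config.itemsPerRow) &&
    config.itemsPerRow >= 1 &&
    config.itemsPerRow <= 6
  );
}

const layout = isValidLayout(remoteLayout) ? remoteLayout : DEFAULT_LAYOUT;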
Best practice:
Architect a system
that’s accessible for
non-technical users and
use configuration files
to power your products
and features.
Best practice:
Aim to match your
feature flags and
experiments to
the idiomatic patterns
of your development
framework to easily
integrate them into
your development,
testing, and deploying
processes.
Evaluating feature delivery and
experimentation platforms

Now that you've learned about progressive delivery and experimentation,
you might be considering whether to build your own testing framework,
integrate an open source solution, or extend a commercial progressive
delivery and experimentation platform like Optimizely. When deciding which
option is right for your organization, consider the following:
Ease of use for developers and non-technical stakeholders

Usability for both technical and non-technical users can be the difference
between running a few experiments a year and running thousands. An
enterprise system often includes remote configuration capabilities—the
ability to start and stop a rollout or experiment, change traffic allocation, or
update targeting conditions in real time from a central dashboard without a
code deploy.

When a progressive delivery and experimentation system is easy for your
engineering organization to adopt, more teams will be able to deploy quickly
and safely. Developers will spend less time figuring out how to manage the
release or experiment of their code, and more time working on customer-
facing feature work. Look for systems with robust documentation, multiple
implementation options, and open APIs.

Statistical rigor and system accuracy

To learn quickly through experimentation, teams need to trust that tests are
being run reliably and that the results are accurate. You'll need a tool that
can track events, provide or connect to a data pipeline that can filter and
aggregate those events, and correctly integrate into your system. Vet the
statistical models used to calculate significance to ensure your team can
make decisions quickly, backed by accurate data.

Total cost of developing and maintaining your system

Building in-house or adopting an open source framework typically comes
with a relatively small upfront investment. Over time, additional features and
customizations become necessary as more teams use the platform, and
maintenance burdens like bug fixes, UI improvements, and more begin to
distract engineers from a core product focus.

Committing to building a platform yourself is a commitment to continuing
to innovate on experimentation and develop new functionality to support
your teams. Companies that successfully scale experimentation with a
homebuilt system have engineers on staff dedicated to enabling others and
supporting the system with ongoing maintenance.
Use progressive delivery
and experimentation to
innovate faster
04
In this book, we’ve gone from building a basic foundation of progressive
delivery and experimentation, to more advanced best practices on how to do
it well. Many of the most successful software companies have gone on this
journey to not only deploy features safely and effectively, but also to make
sure they're building the right features to begin with. In today's age of rapid
change, having the tools and techniques to quickly adapt and experiment is
crucial to staying ahead of the curve.
Appendix
01 Trunk-based development workflow

What is trunk-based development?
Trunk-based development is a software development strategy where
engineers merge smaller changes more frequently into the main codebase
and work off the trunk copy rather than work on long-lived feature branches.
With many engineers working in the same codebase, it’s important to have a
strategy for how individuals work together. To avoid overriding each other’s
changes, engineers create their own copies of the codebase, called branches.
Following an analogy of a tree, the master copy is sometimes called the
mainline or the trunk. The process of incorporating the changes of an
individual’s copy into the main master trunk is called merging.
To understand how trunk-based development works, it’s useful to first look
at the alternative strategy, feature branch development.
Feature-branched development

In feature branch development, individual software developers or teams
of engineers do not merge their new branch until a feature is complete,
sometimes working for weeks or months at a time on a separate copy.

This long stretch of time can make the process of merging difficult because
the trunk or master has likely changed due to other engineers merging their
code changes. This can result in a lengthy code review process where the
changes in different pull requests are analyzed to resolve merge conflicts.

[Diagram: feature-branched development, with long-lived feature branches
off master (or trunk)]
Benefits of trunk-based development
Trunk-based development takes a more continuous delivery approach
to software development, and branches are short-lived and merged as
frequently as possible. The branches are smaller because they often contain
only a part of a feature. These short-lived branches make the process of
merging easier because there is less time for divergence between the main
trunk and the branch copies.
Thus, trunk-based development is a methodology for releasing new
features and small changes quickly while helping to avoid lengthy bug
fixes and "merge hell." It is an increasingly popular DevOps practice among
agile development teams, and is often paired with feature flags to ensure
that any new features can be rolled back quickly and easily if any bugs are
discovered.
[Diagram: trunk-based development, with short-lived branches off master
(or trunk); merging is done more frequently and more easily for shorter
branches]

02 Painted-door experiment in-depth example

When you aren't sure whether to build a feature, a painted-door experiment
is a high-value option. In a painted-door experiment, instead of putting in
all the time to build a feature, you first verify that the feature is worth it by
building just the suggestion of the feature into your product and measuring
engagement.
Let’s say your company was deciding whether to build a “Recommended
For You” feature where users are recommended content based on what
they have previously viewed. Building such a system may require not only
frontend changes, but backend API changes, backend model changes, a
place to run a recommendation algorithm, as well as enough data to make
the recommendations worth it. This feature will take countless hours of
engineering time to build. Furthermore, in the time you take to build this new
feature, you may not have confidence that users will actually end up using
the recommended content.
So how do you gain confidence that users will use the new feature without
building the entire thing?
One option is to build just the frontend button to view the recommended
content, but have the button lead to a notice that the team is still working on
this feature and is still seeking feedback on how it should be built. This way,
you build only a suggestion of the “Recommended For You” feature, yet you
can measure how many users engage and interact with it. This is a painted-
door experiment.
These experiments either validate the hypothesis that the feature is worth
building, or they invalidate the hypothesis and save countless engineering
hours that would have otherwise been spent building a feature not worth the
time.
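The implementation cost of the painted door itself can be tiny. As an
illustrative sketch (trackEvent and the coming-soon notice are hypothetical
stand-ins for your own analytics and UI):

// Hypothetical helpers, stubbed for illustration
const trackEvent = (name) => console.log("tracked:", name);
const showComingSoonNotice = () =>
  alert("This feature is in progress. Tell us how it should work!");

// The "Recommended For You" button exists, but the feature behind it does not
document
  .getElementById("recommended-for-you")
  .addEventListener("click", () => {
    trackEvent("recommended_for_you_clicked"); // conversion event for the experiment
    showComingSoonNotice();
  });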
03 Additional implementation techniques
Check out Martin Fowler's blog for additional techniques for implementing
feature toggles: https://optimize.ly/2TVig0C
Bio

Asa Schachar
Principal Developer Advocate

Asa is the Principal Developer Advocate for Optimizely. Previously, he was
the engineering manager for Optimizely's Full Stack product, responsible
for leading multiple cross-functional engineering teams in charge of
Optimizely's fastest growing product to enable companies to experiment
more across websites, apps, and every level of the stack. Prior to joining
Optimizely, Asa worked at Microsoft as a software developer where he
built the Internet Explorer site recommendation algorithm. He studied
mathematics and computer science at Harvard University and is an avid
break dancer.
© 2020 Optimizely, Inc.

More Related Content

Similar to Ship Confidently with progressive delivery and experimentation.pdf

Unifying feature management with experiments - Server Side Webinar (1).pdf
Unifying feature management with experiments - Server Side Webinar (1).pdfUnifying feature management with experiments - Server Side Webinar (1).pdf
Unifying feature management with experiments - Server Side Webinar (1).pdfVWO
 
Feature flag launchdarkly
Feature flag launchdarklyFeature flag launchdarkly
Feature flag launchdarklySandeep Soni
 
Implementing a testing strategy
Implementing a testing strategyImplementing a testing strategy
Implementing a testing strategyDaniel Giraldo
 
How Crowd Testing Works
How Crowd Testing WorksHow Crowd Testing Works
How Crowd Testing Works99tests
 
Tackling software testing challenges in the agile era
Tackling software testing challenges in the agile eraTackling software testing challenges in the agile era
Tackling software testing challenges in the agile eraQASymphony
 
What is Regression Testing Definition, Tools, Examples.pdf
What is Regression Testing Definition, Tools, Examples.pdfWhat is Regression Testing Definition, Tools, Examples.pdf
What is Regression Testing Definition, Tools, Examples.pdfRohitBhandari66
 
Automation Testing Best Practices.pdf
Automation Testing Best Practices.pdfAutomation Testing Best Practices.pdf
Automation Testing Best Practices.pdfKMSSolutionsMarketin
 
6 Things To Consider When Selecting Mobile Testing Tools?
6 Things To Consider When Selecting Mobile Testing Tools?6 Things To Consider When Selecting Mobile Testing Tools?
6 Things To Consider When Selecting Mobile Testing Tools?headspin2
 
What is Cloud Testing Everything you need to know.pdf
What is Cloud Testing Everything you need to know.pdfWhat is Cloud Testing Everything you need to know.pdf
What is Cloud Testing Everything you need to know.pdfpcloudy2
 
An Ultimate Guide to Continuous Testing in Agile Projects.pdf
An Ultimate Guide to Continuous Testing in Agile Projects.pdfAn Ultimate Guide to Continuous Testing in Agile Projects.pdf
An Ultimate Guide to Continuous Testing in Agile Projects.pdfKMSSolutionsMarketin
 
White-Paper-Continuous-Delivery
White-Paper-Continuous-DeliveryWhite-Paper-Continuous-Delivery
White-Paper-Continuous-Deliveryalkhan50
 
Testing Hourglass at Jira Frontend - by Alexey Shpakov, Sr. Developer @ Atlas...
Testing Hourglass at Jira Frontend - by Alexey Shpakov, Sr. Developer @ Atlas...Testing Hourglass at Jira Frontend - by Alexey Shpakov, Sr. Developer @ Atlas...
Testing Hourglass at Jira Frontend - by Alexey Shpakov, Sr. Developer @ Atlas...Applitools
 
Smoke Testing
Smoke TestingSmoke Testing
Smoke TestingKanoah
 
Enabling Continuous Quality in Mobile App Development
Enabling Continuous Quality in Mobile App DevelopmentEnabling Continuous Quality in Mobile App Development
Enabling Continuous Quality in Mobile App DevelopmentMatthew Young
 
Step by-step mobile testing approaches and strategies
Step by-step mobile testing approaches and strategiesStep by-step mobile testing approaches and strategies
Step by-step mobile testing approaches and strategiesAlisha Henderson
 
The ultimate guide to release management process
The ultimate guide to release management processThe ultimate guide to release management process
The ultimate guide to release management processEnov8
 
Software Testing Trends in 2023
Software Testing Trends in 2023Software Testing Trends in 2023
Software Testing Trends in 2023Enov8
 
What is DevOps Services_ Tools and Benefits.pdf
What is DevOps Services_ Tools and Benefits.pdfWhat is DevOps Services_ Tools and Benefits.pdf
What is DevOps Services_ Tools and Benefits.pdfkomalmanu87
 
What is DevOps Services_ Tools and Benefits.pdf
What is DevOps Services_ Tools and Benefits.pdfWhat is DevOps Services_ Tools and Benefits.pdf
What is DevOps Services_ Tools and Benefits.pdfkomalmanu87
 

Similar to Ship Confidently with progressive delivery and experimentation.pdf (20)

Unifying feature management with experiments - Server Side Webinar (1).pdf
Unifying feature management with experiments - Server Side Webinar (1).pdfUnifying feature management with experiments - Server Side Webinar (1).pdf
Unifying feature management with experiments - Server Side Webinar (1).pdf
 
Feature flag launchdarkly
Feature flag launchdarklyFeature flag launchdarkly
Feature flag launchdarkly
 
Implementing a testing strategy
Implementing a testing strategyImplementing a testing strategy
Implementing a testing strategy
 
How Crowd Testing Works
How Crowd Testing WorksHow Crowd Testing Works
How Crowd Testing Works
 
Tackling software testing challenges in the agile era
Tackling software testing challenges in the agile eraTackling software testing challenges in the agile era
Tackling software testing challenges in the agile era
 
What is Regression Testing Definition, Tools, Examples.pdf
What is Regression Testing Definition, Tools, Examples.pdfWhat is Regression Testing Definition, Tools, Examples.pdf
What is Regression Testing Definition, Tools, Examples.pdf
 
Automation Testing Best Practices.pdf
Automation Testing Best Practices.pdfAutomation Testing Best Practices.pdf
Automation Testing Best Practices.pdf
 
6 Things To Consider When Selecting Mobile Testing Tools?
6 Things To Consider When Selecting Mobile Testing Tools?6 Things To Consider When Selecting Mobile Testing Tools?
6 Things To Consider When Selecting Mobile Testing Tools?
 
What is Cloud Testing Everything you need to know.pdf
What is Cloud Testing Everything you need to know.pdfWhat is Cloud Testing Everything you need to know.pdf
What is Cloud Testing Everything you need to know.pdf
 
An Ultimate Guide to Continuous Testing in Agile Projects.pdf
An Ultimate Guide to Continuous Testing in Agile Projects.pdfAn Ultimate Guide to Continuous Testing in Agile Projects.pdf
An Ultimate Guide to Continuous Testing in Agile Projects.pdf
 
White-Paper-Continuous-Delivery
White-Paper-Continuous-DeliveryWhite-Paper-Continuous-Delivery
White-Paper-Continuous-Delivery
 
Testing Hourglass at Jira Frontend - by Alexey Shpakov, Sr. Developer @ Atlas...
Testing Hourglass at Jira Frontend - by Alexey Shpakov, Sr. Developer @ Atlas...Testing Hourglass at Jira Frontend - by Alexey Shpakov, Sr. Developer @ Atlas...
Testing Hourglass at Jira Frontend - by Alexey Shpakov, Sr. Developer @ Atlas...
 
Smoke Testing
Smoke TestingSmoke Testing
Smoke Testing
 
Agile case studies
Agile case studiesAgile case studies
Agile case studies
 
Enabling Continuous Quality in Mobile App Development
Enabling Continuous Quality in Mobile App DevelopmentEnabling Continuous Quality in Mobile App Development
Enabling Continuous Quality in Mobile App Development
 
Step by-step mobile testing approaches and strategies
Step by-step mobile testing approaches and strategiesStep by-step mobile testing approaches and strategies
Step by-step mobile testing approaches and strategies
 
The ultimate guide to release management process
The ultimate guide to release management processThe ultimate guide to release management process
The ultimate guide to release management process
 
Software Testing Trends in 2023
Software Testing Trends in 2023Software Testing Trends in 2023
Software Testing Trends in 2023
 
What is DevOps Services_ Tools and Benefits.pdf
What is DevOps Services_ Tools and Benefits.pdfWhat is DevOps Services_ Tools and Benefits.pdf
What is DevOps Services_ Tools and Benefits.pdf
 
What is DevOps Services_ Tools and Benefits.pdf
What is DevOps Services_ Tools and Benefits.pdfWhat is DevOps Services_ Tools and Benefits.pdf
What is DevOps Services_ Tools and Benefits.pdf
 

Recently uploaded

Avoid the 2025 web accessibility rush: do not fear WCAG compliance
Avoid the 2025 web accessibility rush: do not fear WCAG complianceAvoid the 2025 web accessibility rush: do not fear WCAG compliance
Avoid the 2025 web accessibility rush: do not fear WCAG complianceDamien ROBERT
 
Labour Day Celebrating Workers and Their Contributions.pptx
Labour Day Celebrating Workers and Their Contributions.pptxLabour Day Celebrating Workers and Their Contributions.pptx
Labour Day Celebrating Workers and Their Contributions.pptxelizabethella096
 
GreenSEO April 2024: Join the Green Web Revolution
GreenSEO April 2024: Join the Green Web RevolutionGreenSEO April 2024: Join the Green Web Revolution
GreenSEO April 2024: Join the Green Web RevolutionWilliam Barnes
 
Unraveling the Mystery of the Hinterkaifeck Murders.pptx
Unraveling the Mystery of the Hinterkaifeck Murders.pptxUnraveling the Mystery of the Hinterkaifeck Murders.pptx
Unraveling the Mystery of the Hinterkaifeck Murders.pptxelizabethella096
 
Enjoy Night⚡Call Girls Dlf City Phase 4 Gurgaon >༒8448380779 Escort Service
Enjoy Night⚡Call Girls Dlf City Phase 4 Gurgaon >༒8448380779 Escort ServiceEnjoy Night⚡Call Girls Dlf City Phase 4 Gurgaon >༒8448380779 Escort Service
Enjoy Night⚡Call Girls Dlf City Phase 4 Gurgaon >༒8448380779 Escort ServiceDelhi Call girls
 
Social Samosa Guidebook for SAMMIES 2024.pdf
Social Samosa Guidebook for SAMMIES 2024.pdfSocial Samosa Guidebook for SAMMIES 2024.pdf
Social Samosa Guidebook for SAMMIES 2024.pdfSocial Samosa
 
Brand experience Peoria City Soccer Presentation.pdf
Brand experience Peoria City Soccer Presentation.pdfBrand experience Peoria City Soccer Presentation.pdf
Brand experience Peoria City Soccer Presentation.pdftbatkhuu1
 
April 2024 - VBOUT Partners Meeting Group
April 2024 - VBOUT Partners Meeting GroupApril 2024 - VBOUT Partners Meeting Group
April 2024 - VBOUT Partners Meeting GroupVbout.com
 
BDSM⚡Call Girls in Sector 150 Noida Escorts >༒8448380779 Escort Service
BDSM⚡Call Girls in Sector 150 Noida Escorts >༒8448380779 Escort ServiceBDSM⚡Call Girls in Sector 150 Noida Escorts >༒8448380779 Escort Service
BDSM⚡Call Girls in Sector 150 Noida Escorts >༒8448380779 Escort ServiceDelhi Call girls
 
Unraveling the Mystery of The Circleville Letters.pptx
Unraveling the Mystery of The Circleville Letters.pptxUnraveling the Mystery of The Circleville Letters.pptx
Unraveling the Mystery of The Circleville Letters.pptxelizabethella096
 
Brighton SEO April 2024 - The Good, the Bad & the Ugly of SEO Success
Brighton SEO April 2024 - The Good, the Bad & the Ugly of SEO SuccessBrighton SEO April 2024 - The Good, the Bad & the Ugly of SEO Success
Brighton SEO April 2024 - The Good, the Bad & the Ugly of SEO SuccessVarn
 
Aryabhata I, II of mathematics of both.pptx
Aryabhata I, II of mathematics of both.pptxAryabhata I, II of mathematics of both.pptx
Aryabhata I, II of mathematics of both.pptxtegevi9289
 
VIP 7001035870 Find & Meet Hyderabad Call Girls Film Nagar high-profile Call ...
VIP 7001035870 Find & Meet Hyderabad Call Girls Film Nagar high-profile Call ...VIP 7001035870 Find & Meet Hyderabad Call Girls Film Nagar high-profile Call ...
VIP 7001035870 Find & Meet Hyderabad Call Girls Film Nagar high-profile Call ...aditipandeya
 
Marketing Management Presentation Final.pptx
Marketing Management Presentation Final.pptxMarketing Management Presentation Final.pptx
Marketing Management Presentation Final.pptxabhishekshetti14
 

Recently uploaded (20)

Top 5 Breakthrough AI Innovations Elevating Content Creation and Personalizat...
Top 5 Breakthrough AI Innovations Elevating Content Creation and Personalizat...Top 5 Breakthrough AI Innovations Elevating Content Creation and Personalizat...
Top 5 Breakthrough AI Innovations Elevating Content Creation and Personalizat...
 
Avoid the 2025 web accessibility rush: do not fear WCAG compliance
Avoid the 2025 web accessibility rush: do not fear WCAG complianceAvoid the 2025 web accessibility rush: do not fear WCAG compliance
Avoid the 2025 web accessibility rush: do not fear WCAG compliance
 
Labour Day Celebrating Workers and Their Contributions.pptx
Labour Day Celebrating Workers and Their Contributions.pptxLabour Day Celebrating Workers and Their Contributions.pptx
Labour Day Celebrating Workers and Their Contributions.pptx
 
Turn Digital Reputation Threats into Offense Tactics - Daniel Lemin
Turn Digital Reputation Threats into Offense Tactics - Daniel LeminTurn Digital Reputation Threats into Offense Tactics - Daniel Lemin
Turn Digital Reputation Threats into Offense Tactics - Daniel Lemin
 
GreenSEO April 2024: Join the Green Web Revolution
GreenSEO April 2024: Join the Green Web RevolutionGreenSEO April 2024: Join the Green Web Revolution
GreenSEO April 2024: Join the Green Web Revolution
 
Unraveling the Mystery of the Hinterkaifeck Murders.pptx
Unraveling the Mystery of the Hinterkaifeck Murders.pptxUnraveling the Mystery of the Hinterkaifeck Murders.pptx
Unraveling the Mystery of the Hinterkaifeck Murders.pptx
 
Generative AI Master Class - Generative AI, Unleash Creative Opportunity - Pe...
Generative AI Master Class - Generative AI, Unleash Creative Opportunity - Pe...Generative AI Master Class - Generative AI, Unleash Creative Opportunity - Pe...
Generative AI Master Class - Generative AI, Unleash Creative Opportunity - Pe...
 
Brand Strategy Master Class - Juntae DeLane
Brand Strategy Master Class - Juntae DeLaneBrand Strategy Master Class - Juntae DeLane
Brand Strategy Master Class - Juntae DeLane
 
Enjoy Night⚡Call Girls Dlf City Phase 4 Gurgaon >༒8448380779 Escort Service
Enjoy Night⚡Call Girls Dlf City Phase 4 Gurgaon >༒8448380779 Escort ServiceEnjoy Night⚡Call Girls Dlf City Phase 4 Gurgaon >༒8448380779 Escort Service
Enjoy Night⚡Call Girls Dlf City Phase 4 Gurgaon >༒8448380779 Escort Service
 
Social Samosa Guidebook for SAMMIES 2024.pdf
Social Samosa Guidebook for SAMMIES 2024.pdfSocial Samosa Guidebook for SAMMIES 2024.pdf
Social Samosa Guidebook for SAMMIES 2024.pdf
 
Foundation First - Why Your Website and Content Matters - David Pisarek
Foundation First - Why Your Website and Content Matters - David PisarekFoundation First - Why Your Website and Content Matters - David Pisarek
Foundation First - Why Your Website and Content Matters - David Pisarek
 
Brand experience Peoria City Soccer Presentation.pdf
Brand experience Peoria City Soccer Presentation.pdfBrand experience Peoria City Soccer Presentation.pdf
Brand experience Peoria City Soccer Presentation.pdf
 
April 2024 - VBOUT Partners Meeting Group
April 2024 - VBOUT Partners Meeting GroupApril 2024 - VBOUT Partners Meeting Group
April 2024 - VBOUT Partners Meeting Group
 
BDSM⚡Call Girls in Sector 150 Noida Escorts >༒8448380779 Escort Service
BDSM⚡Call Girls in Sector 150 Noida Escorts >༒8448380779 Escort ServiceBDSM⚡Call Girls in Sector 150 Noida Escorts >༒8448380779 Escort Service
BDSM⚡Call Girls in Sector 150 Noida Escorts >༒8448380779 Escort Service
 
Unraveling the Mystery of The Circleville Letters.pptx
Unraveling the Mystery of The Circleville Letters.pptxUnraveling the Mystery of The Circleville Letters.pptx
Unraveling the Mystery of The Circleville Letters.pptx
 
Creator Influencer Strategy Master Class - Corinne Rose Guirgis
Creator Influencer Strategy Master Class - Corinne Rose GuirgisCreator Influencer Strategy Master Class - Corinne Rose Guirgis
Creator Influencer Strategy Master Class - Corinne Rose Guirgis
 
Brighton SEO April 2024 - The Good, the Bad & the Ugly of SEO Success
Brighton SEO April 2024 - The Good, the Bad & the Ugly of SEO SuccessBrighton SEO April 2024 - The Good, the Bad & the Ugly of SEO Success
Brighton SEO April 2024 - The Good, the Bad & the Ugly of SEO Success
 
Aryabhata I, II of mathematics of both.pptx
Aryabhata I, II of mathematics of both.pptxAryabhata I, II of mathematics of both.pptx
Aryabhata I, II of mathematics of both.pptx
 
VIP 7001035870 Find & Meet Hyderabad Call Girls Film Nagar high-profile Call ...
VIP 7001035870 Find & Meet Hyderabad Call Girls Film Nagar high-profile Call ...VIP 7001035870 Find & Meet Hyderabad Call Girls Film Nagar high-profile Call ...
VIP 7001035870 Find & Meet Hyderabad Call Girls Film Nagar high-profile Call ...
 
Marketing Management Presentation Final.pptx
Marketing Management Presentation Final.pptxMarketing Management Presentation Final.pptx
Marketing Management Presentation Final.pptx
 

Ship Confidently with progressive delivery and experimentation.pdf

  • 1. 1 by Asa Schachar Ship Confidently with Progressive Delivery and Experimentation A book of best practices to enable your engineering team to adopt feature flags, phased rollouts, A/B testing, and other proven techniques to deliver the right features faster and with confidence.
  • 2. 2 Today, we’re living through the third major change in the way companies deliver better software products. First came the Agile Manifesto, encouraging teams to iterate quickly based on customer feedback, resulting in teams building and releasing features in small pieces. Secondly, as software development moved to the cloud, many teams also adopted DevOps practices like continuous integration and continuous delivery to push code to production more frequently, further reducing risk and increasing speed to market for new features. However, today’s most successful software companies like Google, Facebook, Amazon, Netflix, Lyft, Uber, and Airbnb have gone one step further, ushering in a third major change. They move fast by releasing code to small portions of their traffic and improve confidence in product decisions by testing their hypotheses with real customers. Instead of guessing at the best user experience and deploying it to everyone, these companies progressively release new features to live traffic in the form of gradual rollouts, targeted feature flags, and A/B tests. This process helps product and engineering teams reduce uncertainty, make data-driven decisions, and deliver the right experience to end users faster. When customers are more engaged with features and products, it ultimately drives retention and increased revenue for these businesses. The name for this third major shift is progressive delivery and experimentation. Progressive delivery with feature flags gives teams new ways to test in production and validate changes before releasing them to everyone, instead of scrambling to roll back changes or deploy hotfixes. With experimentation, development teams gain the confidence of knowing they’re building the most impactful products and features because they can validate product decisions well before expensive investments are lost. At the core of this practice is a platform that gives teams the ability to control the release of new features to production, decouple deployment from feature enablement, and measure the impact of those changes with real users in production. As with any new process or platform, incorporating progressive delivery and experimentation into your software development and delivery process can bring up many questions for engineering teams: It’s not enough to continuously integrate and continuously deliver new code. Progressive delivery and experimentation enable you to test and learn to move quickly with the confidence you’re delivering the right features. How can we get started with feature flagging and A/B testing without creating technical debt down the line? Intro
  • 3. 3 Get started with progressive delivery and experimentation Enable company-wide experimentation and safe feature delivery Use progressive delivery and experimentation to innovate faster Contents 01 02 03 04 p4 p17 p26 p38 Scale from tens to hundreds of feature flags and experiments In today’s fast-moving world, it’s no longer enough to just ship small changes quickly. Teams must master these new best practices to test and learn in order to deliver better software, products, and growth faster. By going from a foundational understanding to learning how to avoid pitfalls in process, technology, and strategy, this book will enable any engineering team to successfully incorporate feature flags, phased rollouts, and data- driven A/B tests as core development practices. By following this journey, you and your team will unlock the power of progressive delivery and experimentation like the software giants already have. This book is intended for software engineers and software engineering leaders, but adjacent disciplines will also find it useful. If your job involves softwareengineering,product,or qualityassurance,then you’rein theright place. How can we scale to thousands of flags and still retain good governance and QA processes? How can we adopt these new practices organization-wide without slowing down development and delivery?
  • 4. 4 A feature flag (aka feature toggle), in its most basic form, is a switch that allows product teams to enable or disable functionality of their product without deploying new code changes. For example, let’s say you’re building a new front-end dashboard. You could wait until the entire dashboard code is complete before merging and releasing. Alternatively, you could put unfinished dashboard code behind a feature flag that is currently disabled and only enable the flag for your users once the dashboard works as expected. In code, a feature flag might look like the following: Get started with progressive delivery and experimentation In building for experimentation, you’ll first want to understand the different ways that software products can enable progressive delivery and experimentation. In this first chapter, we’ll cover the basics of feature flags, phased rollouts, and A/B tests, and best practices for implementing them. You’ll discover how these techniques all fit together, where an effective feature flag process can easily evolve into product experimentation and statistically rigorous A/B testing. Once you’ve implemented these techniques, you’ll be able to ship the right features safely. 01 Feature flags: Enable feature releases without deploying code if (isFeatureEnabled(‘new_dashboard’)) { // true or false showNewDashboard(); } else { showExistingDashboard(); } Feature flags toggle features on and off giving you another layer of control over what your users experience.
  • 5. 5 New Feature Feature Flag or Toggle Consumers Seamless releases Feature kill switches Trunk-based development Platform for experimentation When the function isFeatureEnabled is connected to a database or remote connection to control whether it returns true or false, we can dynamically control the visibility of the new dashboard without deploying new code. Even with simple feature flags like this one, you get the benefits of: Instead of worrying about the feature code merging and releasing at the right time, you can use a feature flag for more control over when features are released. Use feature flags for a fully controllable product launch, allowing you to decide whether the code is ready to show your users. If a release goes wrong, a feature flag can act as a kill switch that allows you to quickly turn off the feature and mitigate the impact of a bad release. Instead of having engineers work in long-lived, hard-to-merge, conflict-ridden feature branches, the team can merge code faster and more frequently in a trunk-based development workflow.[1] A simple feature flag is the start of a platform that enables experimentation, as we will see later in this chapter. New feature Feature flag or toggle
  • 6. 6 There are two general ways to perform a feature rollout, targeted and random, which both suit different use cases. 01 Targeted rollout: For specific users first A targeted rollout enables features for specific users at a time, allowing for different types of experimentation. Experiment with beta software If you have power users or early adopters who are excited to use your features as soon as they are developed, then you can release your features directly to these users first, getting valuable feedback early on while the feature is still subject to change. Experiment across regions If your users behave differently based on their attributes, like their country of origin, you can use targeted rollouts to expose features to specific subsets of users or configure the features differently for each group. Experiment with prototype designs Similarly, for new designs that dramatically change the experience for users, it’s useful to target specific test users to see how these changes will realistically be used when put in the hands of real users. Basic feature flags are either enabled or disabled for everyone, but feature flags become more powerful when you can control whether a feature flag is exposed to a portion of your traffic. A feature rollout is the idea that you only enable a feature for a subset of your users at a time rather than all at once. In our dashboard example, let’s say you have many different users for your application. By providing a user identifier to isFeatureEnabled, the method will have just enough information to return different values for different users. Feature rollouts: Enable safe, feedback-driven releases // true or false depending on the user if (isFeatureEnabled(‘new_dashboard’, ‘user123’)) showNewDashboard(); } else { showExistingDashboard(); } Rollouts allow you to control which subset of users can see your flagged feature.
  • 7. 7 02 Random rollout: For small samples of users first Another way of performing a feature rollout is by random percentage. Perhaps at first you show the new feature to only 10% of your users; then a week later, you enable the new feature for 50% of your users; and finally, you enable the feature for all 100% of your users in the third week. With a phased rollout that introduces a feature to parts of your traffic, you unlock several different types of experimentation. Experiment with gradual launches: Instead of focusing on a big launch day for your feature that has the risk of going horribly wrong for all your users, you can slowly roll out your new feature to fractions of your traffic as an experiment to make sure you catch bugs early and mitigate the risk of losing user trust. Gradually rolling out your features limits your blast radius to only the affected customers versus your entire user base. This process limits negative user sentiment and potential revenue loss. Experiment with risky changes: A random rollout can give you confidence that particularly risky changes, like data migrations, large refactors, or infrastructure changes, are not going to negatively impact your product or business. Experiment with scale: Performance and scale issues are challenging. Unlike other software problems, they are hard to predict and hard to simulate. Often, you only find out that your feature has a scale problem when it’s too late. Instead of waiting for your performance dashboards to show ugly spikes in response times or error rates, using a phased rollout can help you foresee any real-world scale problems beforehand by understanding the scaling characteristics of parts of your traffic first. Experiment with painted-door tests If you’re not sure whether you should build a feature, you may consider running a painted-door test[2] where you build only the suggestion of the feature and analyze how different random users are affected by the appearance of the new feature. For instance, adding a new button or navigation element in your UI for a feature you’re considering can show you how many people interact with it. New Feature Feature Rollout Some users get the feature New feature Feature rollout
  • 8. 8 Relying on an external database to store whether your feature is enabled can increase latency, while hard coding the variable diminishes your ability to dynamically change. Find an architecture that strikes a balance between the two. Fetch feature flag configuration when the application starts up Cache feature flag configuration in-memory so that decisions can be made with low latency Listen for updates to the feature flag configuration so that updates are pushed to the application in as real time as possible Poll for updates to the feature flag configuration at regular intervals, so if a push fails, the application is still guaranteed to have the latest feature configuration within some well-defined interval 1 2 3 4 Best practice: Balancing low-latency decisions with dynamism When the logic of your codebase depends on the return value of a function like isFeatureEnabled, you’ll want to make sure that isFeatureEnabled returns its decision as fast as possible. In the worst case, if you rely on an external database to store whether the feature is enabled, you risk increasing the latency of your application by requiring a roundtrip external network request across the internet, even when the new feature is not enabled. In the best-case performance, the isFeatureEnabled function is hard coded to true or false either as a variable or environment variable, but then you lose the ability to dynamically change the value of the feature flag without code deploys or reconfiguring your application. So, one of the first challenges of feature flags is striking this balance between a low-latency decision and the ability to change that decision dynamically and quickly. There are multiple methods for achieving this balance to suit the needs and capabilities of different applications. An architecture that strikes this balance well and is suitable on many platforms will: As an example, a mobile application may initially fetch the feature flag configuration when the app is started, then cache the feature flag configuration in-memory on the phone. The mobile app can use push notifications to listen for updates to the feature flag configuration as well as poll on regular, 10-minute intervals to ensure that the feature flags are always up to date in case the push notification fails.
  • 9. 9 Admin panel Client apps App servers
  • 10. 10 With phased rollouts, your application has the ability to simultaneously deliver two different experiences: one with the feature on and another with the feature off. But how do you know which one is better? And how can you use data to determine which is best for your users and the metrics you care about? An A/B test can point you in the right direction. By shipping different versions of your product simultaneously to different portions of your traffic, you can use the usage data to determine which version is better. If you’re resource constrained, you can simply test the presence or absence of a new feature to validate whether it has a positive, negative, or neutral impact on application performance and user metrics. By being precise with how, when, and who is exposed to these different feature configurations, you can run a controlled product experiment, get statistically significant data, and be scientific about developing the features that are right for your users, rather than relying on educated guesses. If you want to use objective-truth data to resolve differing opinions within your organization, then running an A/B test is right for you. A/B tests: Make data-driven product decisions New Feature Variation A A/B Test Some users get variation A or B Variation B Simply testing the presence or absence of a new feature can help you validate whether it has a positive, negative, or neutral impact on application performance and user metrics. When your A/B test is precise with the how, when, and who is exposed to different feature configurations, you can make data- driven decisions as opposed to educated guesses. New feature A/B test
  • 11. 11 Because of the properties of a good hash function, you are always guaranteed a deterministic but random output given the same inputs, which gives you several benefits: 01 Best practice: Deterministic experiment bucketing — hashing over Math.Random() If you’re building a progressive delivery and experimentation platform, you may be tempted to rely on a built-in function like Math.random() to randomly bucket users into variations. Once bucketed, a user should only see their assigned variation for the lifetime of the experiment. However, introducing Math.random() adds indeterminism to your codebase, which will be hard to reason about and hard to test later. Storing the result of the bucketing decision also forces your platform to be stateful. A better approach is to use hashing as a random but deterministic and stateless way of bucketing users. To visualize how hashing can be used for bucketing, let’s represent the traffic to your application as a number line from 0 to 10,000. For an experiment with two variations of 50% each, the first 5,000 numbers of your number line can correspond to the 50% of your traffic that will get variation A, and the second 5,000 numbers can correspond to the 50% of your traffic that will receive variation B. The bucketing process is simplified to assigning a number between 0 and 10,0000 for each user. Using a hashing function that takes as input the user id (ex: user123) and experiment id (ex: homepage_experiment) or feature key and outputs a number between 0 and 10,0000, you achieve that numbered assignment for assigning variations: hash(‘user123’, ‘homepage_experiment’) -> 6756 // variation B Your application runs predictably for a given user Automated tests run predictably because the inputs can be controlled Your progressive delivery and experimentation platform is stateless by re-evaluating the hashing function at any time rather than storing the result A large hashing range like 0 to 10,000 allows assigning traffic granularity at fine increments of 0.01% The same pseudo-random bucketing can be used for random phased rollouts You can exclude a percentage of traffic from an experiment by excluding a portion of the hashing range
  • 12. 12 Everwonderwhichlandingpagewouldleadtothemostsignupsforyour product?Duringthe2008presidentialcampaign,BarackObama’soptimization teamranseveralA/BteststodeterminethebestimageofObamaand correspondingbuttontexttoputonthelandingpageofthecampaignwebsite. TheseA/Btestedadjustmentsincreasedsignupsandledto$60millionof additionaldonationsfromtheirwebsite. Product metrics: For product improvements Landing page signups 02 Best practice: Use A/B tests for insight on specific metrics A/B tests make the most sense when you want to test different hypotheses for improving a specific metric. The following are examples of the types of metrics that allow you to run statistically significant A/B tests. Referral signups through shares Wanttoknowwhichreferralprogramwouldincreaseviralityofyourproduct themostcost-effectivelythroughsharing?Ride-sharingserviceslikeLyftand Uberoftenexperimentontheamountofmoneytorewardusersforbringing otheruserstotheirplatforms(ex:give$20toafriendandget$20yourself). It’simportanttogetthisamountrightsothecostofgrowthdoesn’tnegatively impactyourbusinessinthelongterm. 0 10000 5000 user123 1812 5934 8981 user456 user981 hash() hash() hash() user123 gets bucketed into Variation A user456 gets bucketed into Variation B Variation A Variation B
  • 13. 13 An impression event occurs when a user is assigned to a variation of an A/B test. For these events, the following information is useful to send as a payload to an analytics system: an identifier of the user, an identifier of the variation the user was exposed to, and a timestamp of when the user was exposed to the variation. With this information as an event in an analytics system, you can attribute all subsequent actions (or conversion events) that the user takes to being exposed to that specific variation. Conversion event A conversion event corresponds to the desired outcome of the experiment. Looking at the example metrics above, you could have conversion events for when a user signs up, when a user shares a product, the time it takes a dashboard to load, or when an error occurs while using the product. With conversion events, the following information is useful to send as a payload to an analytics system: an identifier of the user, an identifier of the type of event that happened (ex: signup, share, etc.), and a timestamp. 03 Best practice: Capture event information for analyzing an A/B test When instrumenting for A/B testing and tracking metrics, it’s important to track both impression and conversion events because each type includes key information about the experiment. Operational metrics: For infrastructure improvements Latency & throughput Ifengineersaredebatingoverwhichimplementationwillperformbestunder real-worldconditions,youcangatherstatisticalsignificanceonwhichsolution ismoreperformantwithmetricslikethroughputandlatency. Error rates Impression event Ifyourteamisworkingonaplatformorlanguageshiftandhasatheorythatyour applicationwillresultinfewererrorsafterthechange,thenerrorratescanserve asametrictodeterminewhichplatformismorestable.
  • 14. 14 04 Best practice: Avoid common experiment analysis pitfalls Once you have the above events, you can run an experiment analysis to compare the number of conversion events in each variation and determine which one is statistically stronger. However, experiment analysis is not always straightforward. It’s best to consult data scientists or trained statisticians to help ensure your experiment analysis is done correctly. Although this book does not dive deep into statistics, you should keep an eye out for these common pitfalls. Example impression events Example conversion events U S E R I D U S E R I D Caroline Caroline Dennis Dennis Flynn Flynn Flynn Erin original purchase free-shipping purchase add_to_cart free-shipping add_to_cart signed_up 2019-10-08T02:13:01 50 2019-10-08T00:05:32 2019-10-08T05:30:46 30 2019-10-08T00:07:19 10 2019-10-09T01:15:51 2019-10-09T01:11:20 5 2019-10-09T01:14:23 - 2019-11-09T12:02:36 VA R I AT I O N I D E V E N T I D T I M E STA M P VA L U E T I M E STA M P Creating too many variations or evaluating too many metrics will increase the likelihood of seeing a false positive just by chance. To avoid that outcome, make sure the variations of your experiment are backed by a meaningful hypothesis or use a statistical platform that provides false discovery rate control. Note: The identifiers used in the table on the right are just for illustration purposes. Typically, identifiers are globally unique and are non-identifiable strings of digits and characters. Also note that a value is included in the conversion events that are non-binary (ex: how much money was associated with a purchase event). However, binary events like someone signing up, does not have a value associated. Rather these events are binary: they either happened or did not. If you calculate the results of an A/B test when only a small number of users have been exposed to the experiment, the results may be due to random chance rather than the difference between variations. Make sure your sample size is big enough for the statistical confidence you want. Multiple comparisons Small sample size
  • 15. 15 A simple feature flag is just an on-and-off switch that corresponds to the A and B variations of an A/B test. However, feature flags can become more powerful when they expose not only whether the feature is enabled, but also how the feature is configured. For example, if you were building a new dashboard feature for an email application, you could expose an integer variable that controls the number of emails to show in the dashboard at a time, a boolean variable that determines whether you show a preview of each email in the dashboard list, or a string variable that controls the button text to remove emails from the dashboard list. By exposing feature configurations as remote variables, you can enable A/B tests beyond just two variations. In the dashboard example, you can experiment not only with turning the dashboard on or off, but also with different versions of the dashboard itself. You can see whether email previews and fewer emails on screen will enable users to go through their email faster. A/B/n tests go beyond two variations to test feature configurations { title: “Latest Updates”, Feature Configuration color: “#36d4FF”, num_items: 3, } A classical experiment should be set up and run to completion before any statistical analysis is done to determine which variation is a winner. This is referred to as fixed-horizon testing. Allowing experimenters to peek at the results before the experiment has reached its sample size increases the likelihood of seeing false positives and making the wrong decisions based on the experiment. However, in the modern digital world, employing solutions like sequential testing can allow analysis to be done in real time during the course of the experiment. Peeking at results Feature configuration
  • 16. 16 One challenge besides knowing what to feature flag for an experiment is knowing how to integrate this new process into your team’s existing software development cycle. Taking a look at each of your in-progress initiatives and asking questions upfront can help you build feature flags into your standard development process. For example, by asking “How can we de-risk this project with feature flags?” you highlight how the benefits of feature flags outweigh the cost of an expensive bug in production or a disruptive, off-hours hotfix. Similarly, by asking “How can we run an experiment to validate or invalidate our hypothesis for why we should build this feature?” you will find that spending engineering time building the wrong feature is much more costly than investing in a well-designed experiment. These questions should speed your overall net productivity by enabling your team to move toward a progressive delivery and experiment-driven process. You can start to incorporate feature flags into your development cycle by asking a simple question in your technical design document template. For instance, by asking “What feature flag or experiment are you using to rollout/validate your feature?” you insert feature flags into discussions early in the development process. Because technical design docs are used as a standard process for large features, the document template is a natural place for feature flags to help de-risk complex or big launches. Feature flag driven development “How can we de-risk this project with feature flags?” “How can we run an experiment to validate or invalidate our hypothesis for why we should build this feature?” Best practice: Ask feature-flag questions in technical design docs
  • 17. 17 One challenge with experiments, feature flags, and rollouts is that you may be tempted to use them for every possible change. It’s a good idea to recognize that even in an advanced experimentation organization, you likely won’t be feature flagging every single change or A/B testing every feature. This high-level decision tree can be useful when determining when to run an experiment or set up a feature behind a flag. Scale from tens to hundreds of feature flags and experiments After completing your first few experiments, you will probably want to take a step back and start thinking about improvements to help scale your experimentation program. The following best practices are things you will need to consider when scaling from tens to hundreds of feature flags and experiments. 02 Decide when to feature flag, rollout, or A/B test Best practice: Don’t feature flag or A/B test every change.
  • 18. 18 Should I run an experiment or a rollout? Working on docs? Working on refactor? Working on a bug? Working on a feature?
  • 19. 19 02 Share constants to keep applications in sync Some feature flags and experiments are cross-product or cross-codebase. By having a centralized list of feature key constants, you are less likely to have typos that prevent this type of coordination across your product. For example, let’s say you have a feature flag file that is stored in your application backend but passed to your frontend. This way the frontend and backend not only reference the same feature keys but you can also deliver a consistent experience across the frontend and backend. When dealing with experiments or feature flags, it’s best practice to use a human-readable string identifier or key to refer to the experiment so that the code describes itself. For example, you might use the key ‘promotional_ discount’ to refer to the feature flag powering a promotional discount feature that is enabled for certain users. It’s easy to define a constant for this feature flag or experiment key exactly where it’s going to be used in your codebase. However, as you start using a lot of feature flags and experiments, your codebase will soon be riddled with keys. Centralizing the constants into one place can help. Reduce complexity by centralizing constants 01 Centralize constants to visualize complexity Centralizing all feature keys gives you a better sense of how many features exist in a codebase. This enables an engineering team to develop processes around revisiting the list of feature flags for removal. Having a sense of how many feature flags exist also gives a sense of the codebase and product complexity. Some organizations may decide to have a limit on the number of active feature flags to reduce this complexity. Best practice: Compile all your feature flags currently in use in an application into a single constants file. Best practice: Use the same feature key constants across the backend and frontend of your application.
  • 20. 20 As a company implements more feature flags and experiments, the codebase gets more identifiers or keys referencing these items (like: site_ redesign_phase_1 or homepage_hero_experiment). However, the keys that are used in-code will inevitably lack the full context of what the rollout or experiment actually does. For example, if an engineer saw site_redesign_ phase_1, it’s unclear what the redesign includes or what is included in phase 1. Although you could just increase the verbosity of these keys so that they are self-explanatory, it’s a better practice to have a process by which anyone can understand the original context or documentation behind a given feature rollout or experiment. Document your feature flags to understand original context Ensuring your software works before your users try it out is paramount to building trustworthy solutions. A common way for engineering teams to ensure their software is running as expected is to write automated unit, integration, and end-to-end tests. However, because you’re scaling experimentation, you’ll need a strategy to ensure you still deliver high quality software without an explosion of automated tests. Having a strategy to test every combination is not going to be sustainable. As an example, let’s say you have an application with 10 features and 10 corresponding automated tests. If you add just 8 on/off feature flags, you theoretically now have 2^8 = 256 possible additional states, which is nearly 25 times as many tests as you started with. Because testing every possible combination is nearly impossible, you’ll want to get the most value out of writing automated tests. Make sure you understand the different levels of automated testing, which include: Ensure quality with the right automated testing & QA strategy What the rollout or experiment changes were The owner for the rollout or experiment If the rollout or experiment can be safely paused or rolled back Whether the lifetime of this experiment or rollout is temporary or permanent Often times, engineering teams will rely on an integration between their task tracking system and their feature flag and experiment service to add context to their feature flags and experiments. Best practice: Make sure your team can easily find out:
Ensure quality with the right automated testing & QA strategy

Ensuring your software works before your users try it out is paramount to building trustworthy solutions. A common way for engineering teams to make sure their software runs as expected is to write automated unit, integration, and end-to-end tests. However, as you scale experimentation, you'll need a strategy for continuing to deliver high-quality software without an explosion of automated tests, because testing every combination of flag states is not sustainable. As an example, say you have an application with 10 features and 10 corresponding automated tests. If you add just 8 on/off feature flags, you theoretically now have 2^8 = 256 possible additional states, more than 25 times as many tests as you started with. Because testing every possible combination is practically impossible, you'll want to get the most value out of the automated tests you do write. Make sure you understand the different levels of automated testing, which include:

- Unit tests
- Integration tests
- End-to-end tests
- Manual QA

01 Unit tests—test frequently for solid building blocks

Unit tests cover the smallest pieces of testable code. It's best practice to keep these units so small that they are not aware of, and not affected by, experiments or feature flags. For example, if a feature flag forks into two separate code paths, each code path should have its own set of independent unit tests. You should test these small units of code frequently to ensure high code coverage, just as you would if you didn't have any feature flags or experiments in your codebase. If the code you are unit testing does contain logic affected by a feature flag or experiment, take a look at the mocking and stubbing techniques described in the integration tests section below.

Best practice: Ensure the building blocks of your application are well tested with lots of unit tests. The smaller units are often unaware of experiment or feature state. For those units that are aware, use mocks and stubs to control this white-box testing environment.
02 Integration tests—force states to test code

For integration tests, you combine units into higher-level business logic. This is where experiments and feature flags will likely affect the logical flow of the code, and you'll have to force a particular variation or feature-flag state in order to test the code.

In some integration tests, you'll have complete access to the code's executing environment, where you can mock out the function calls to external systems or to the internal SDKs that power your experiments, forcing particular code paths to execute during your integration tests. For example, you can mock an isFeatureEnabled SDK call to always return true in an integration test. This removes any unpredictability, allowing your tests to run deterministically.

In other integration tests, you may not have access to individual function calls, but you can still stub out API calls to external systems. For example, you can stub the data powering the feature flag or experimentation platform to return an experiment in a specific state and force a given code path.

Although you can mock out the indeterminism coming from experiments or feature flags at this stage of testing, it's still best practice for your code and tests to have as little awareness of experiment or feature-flag state as possible, and to focus on verifying that the code paths of each variation execute as expected.

Best practice: Use mocks and stubs to control feature and experiment states. Focus on individual code paths to ensure proper integration and business logic.
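As a minimal sketch of this mocking technique (assuming a Jest-style test runner and hypothetical featureClient and checkout modules, rather than any specific vendor SDK), an integration test might pin the flag state like this:

// featureClient.js is a hypothetical wrapper around your feature flag SDK,
// exporting isFeatureEnabled(featureKey, userId) -> boolean

// checkout.integration.test.js
jest.mock('./featureClient', () => ({
  // Force the flag on so the test always exercises the discount path
  isFeatureEnabled: jest.fn().mockReturnValue(true),
}));

const { isFeatureEnabled } = require('./featureClient');
const { calculateTotal } = require('./checkout'); // hypothetical module under test

test('applies the promotional discount when the flag is on', () => {
  // With the mock in place, calculateTotal deterministically takes
  // the discounted code path regardless of the real rollout state.
  expect(calculateTotal({ price: 100, userId: 'user-123' })).toBe(90);
  expect(isFeatureEnabled).toHaveBeenCalledWith('promotional_discount', 'user-123');
});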
03 End-to-end tests—focus testing on critical variations

End-to-end tests are the most expensive tests to write and maintain because they're often black-box tests that don't provide good control over the running environment and may rely on external systems. For this reason, avoid relying on end-to-end or fully black-box tests to verify every branch of every experiment or feature flag. This combinatorial explosion of end-to-end tests will slow down your product development.

Instead, reserve end-to-end tests for the most business-critical paths of an experiment or feature flag, or use them to test the state of your application when most or all of your feature flags are in a given state. For example, you may want one end-to-end test for when all your feature flags are on, and another for when all your feature flags are off. The latter test can simulate what happens if the system powering your feature flags goes down and your application must degrade gracefully.

When you do require end-to-end tests, make sure you can still control the experiment or feature-flag state to remove indeterminism. For example, in a web application, you may want a special test user, a special test cookie, or a special test query parameter that can be used to force a particular variation of an experiment or feature flag, as in the sketch below. Note that when implementing these special overrides, be sure to make them internal-only so that your users don't have the same control over their own feature or experiment states.

Best practice: Do not test every possible combination of experiment or feature with end-to-end tests. Instead, focus on important variations or on tests that ensure your application still works when all features are on or off.
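A minimal sketch of such an override, assuming an Express-style request handler; the forceFlag query parameter, the internal-traffic header, and the flagClient wrapper are all illustrative assumptions, not a specific product's API:

// Force a flag state via an internal-only query parameter, e.g.
//   GET /checkout?forceFlag=promotional_discount:on
function resolveFlag(req, flagKey, userId) {
  // Only honor overrides from internal traffic (VPN range, test header, etc.)
  const isInternal = req.headers['x-internal-test'] === 'true';
  const override = req.query.forceFlag; // e.g. "promotional_discount:on"

  if (isInternal && override) {
    const [key, state] = override.split(':');
    if (key === flagKey) return state === 'on';
  }
  // Otherwise fall back to the real flag evaluation
  // (flagClient is your hypothetical feature flag SDK wrapper)
  return flagClient.isFeatureEnabled(flagKey, userId);
}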
04 Manual verification (QA)—reserve for business-critical functions

Similar to end-to-end tests, manually verifying different variations can be difficult and time-consuming, which is why organizations typically keep only a few manual QA tests. Reserve manual verification for business-critical functions. And if you implemented special parameters to control the states of experiments or feature flags for end-to-end tests, a QA team can use those same parameters to force a variation and verify a particular experience.

Best practice: Save time and resources by reserving manual QA for the most critical variations. Make sure you provide tools for QA to force feature and experiment states.

Increase safety with the right permissions

As more individuals contribute to your progressive delivery process, it becomes imperative to have safe and transparent practices. Permissioning, exposing user states, and emulation enable your team to keep the process secure and viable as you scale.

01 Establish permissions based on your roles

With rollouts and experiments, your team will typically have a dashboard where you can edit production configurations without making changes to the core development repository. With this setup, you'll want to consider the permissions of different parts of your organization and their ability to make changes to your rollouts and experiments. Ideally, these permissions should match the permissioning you would typically use for feature development, which includes:
Edit-level access
Anyone who can do standard feature development should have standard edit-level access. However, it's best practice to require a higher level of edit access for important system-wide infrastructure or billing configuration.

Administrative access
Individuals with the ability to provision or change permissions for standard feature development should have administrative access to the feature flag and experimentation configuration. This allows an IT team or super user to provision and secure the above roles for individual developers on the team.

Read-level access
Almost everyone at your company should likely have at least read-level access, allowing them to see the rollouts and experiments actively running at any given time.

02 Expose user state for observability

As your company uses more feature flags and experiments, the number of possible states a given customer could experience begins to combinatorially explode. If one of those customers encounters an issue, you'll want to know which rollout or experiment states may be active in order to understand the full state of the world for that particular individual. It's best practice to have an interface, whether a UI, a command-line tool, or a dashboard, to query which feature flags or experiments a given customer, user, or request receives, for increased observability of your system. If your bucketing tool is deterministic (it always gives the same bucketing decision given the same inputs), then you can easily provide a tool that takes the customer information as input and outputs the states they should be experiencing (see the sketch below).

Best practice: Consider enabling features customer by customer and using a centralized tool where anyone can input a customerId and see which combination of features that particular customer has enabled.

03 Allow emulation for faster debugging

Even if you know the particular combination of features and experiments a given customer has access to, it can sometimes be difficult to reason about why they are seeing a particular experience. This is similar to the more general problem of debugging complex customer issues. Many engineering organizations build the ability to emulate a user's view of the product, making it easier to see what the customer sees when debugging. If you use this technique, ensure you apply appropriate access controls, permissions, and restrictions. Some engineering organizations also let their production site load a development version of their code so engineers can test fixes on hard-to-replicate setups. Both techniques are extremely helpful in minimizing the time to debug issues that are only relevant to a certain combination of feature flag or experiment states.
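As a sketch of the deterministic lookup tool described under "Expose user state" above (the hashing scheme and flag definitions here are illustrative assumptions):

const crypto = require('crypto');

// Deterministic bucketing: the same userId + flagKey always lands in
// the same bucket, so this tool can reproduce any user's flag states.
function bucket(userId, flagKey) {
  const hash = crypto.createHash('md5').update(`${flagKey}:${userId}`).digest();
  return hash.readUInt32BE(0) % 100; // a stable number in [0, 100)
}

// Hypothetical flag definitions with rollout percentages
const flags = {
  promotional_discount: { rolloutPercent: 25 },
  site_redesign_phase_1: { rolloutPercent: 100 },
};

// The centralized lookup anyone can run: customerId in, flag states out
function flagStatesFor(userId) {
  const states = {};
  for (const [key, { rolloutPercent }] of Object.entries(flags)) {
    states[key] = bucket(userId, key) < rolloutPercent;
  }
  return states;
}

console.log(flagStatesFor('user-123'));
// => e.g. { promotional_discount: false, site_redesign_phase_1: true }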
03 Enable company-wide software experimentation

Scaling your experiment program across your entire organization can become complicated quickly. With these best practices you'll be able to minimize the complexity of an advanced system running hundreds of experiments or rollouts simultaneously.

Prevent emergencies with smart defaults

When you opt for a separate experimentation or rollout service to control your configuration, you must be prepared for when that service goes down. This is where smart defaults can help, by answering the following questions. If the feature flag service went down:

- Would you prefer that all users have the experience of the feature being on, or off?
- Would you prefer that all users get the version that was most recently delivered by the feature flag or experimentation service?
- What configuration or feature variable values would be preferred for your users?

Some organizations save a snapshot of the feature flag and experiment state in the codebase at a regular fixed interval. This process provides a smart, local fallback that is fairly recent in case the remote system goes down.

Best practice: Think through all possible failure scenarios and prepare for them with smart defaults for when your feature flagging or experimentation services go down.
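A minimal sketch of this fallback pattern, assuming a hypothetical remoteFlagService and a snapshot file written at build or deploy time:

// flagSnapshot.json is regenerated at a regular interval (e.g., each deploy)
// and bundled with the application as a local fallback.
const snapshot = require('./flagSnapshot.json'); // e.g. { "promotional_discount": true }

// Hard-coded smart defaults for flags missing from the snapshot
const defaults = { promotional_discount: false };

async function isFeatureEnabled(flagKey, userId) {
  try {
    // Normal path: ask the remote feature flag service (hypothetical client)
    return await remoteFlagService.isEnabled(flagKey, userId);
  } catch (err) {
    // Service is down: prefer the recent snapshot, then the default
    if (flagKey in snapshot) return snapshot[flagKey];
    return defaults[flagKey] ?? false; // fail closed if entirely unknown
  }
}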
Avoid broken states with simple feature flag dependency graphs

As a company builds more features and experiments, it's likely that an engineering team will build a feature flag or experiment on top of an existing one. Take, for example, a feature flag used to roll out a new dashboard UI. As you roll out the new dashboard, your team may want to experiment on a component of the new UI. Although you could manage both the rollout and the experiment with the same feature flag, there are reasons you might want to separate the two so the rollout can proceed independently of the experiment. However, to see the different variations of the dashboard component, a user must both have the feature flag enabled and be in a particular variation of the experiment. You now have a kind of dependency graph of your feature flags and experiments.

Naturally, your systems will develop more and more of these feature flag dependencies, where one feature depends on another. It's best practice to minimize these dependencies as much as possible and strive for feature flag combinations that won't break the application state. If enabling feature flags A and B, but not C, results in a broken application, it's likely your team lacks that contextual knowledge and could accidentally put your application in a bad state just by changing feature flag configurations. One option is to keep your feature flag hierarchy extremely shallow (for instance, a simple 1:1 mapping between feature flags and customer-facing features), ensuring there are few dependencies between flags.

Best practice: Keep dependencies between flags simple to avoid broken states.
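A minimal sketch of keeping such a dependency explicit and safe (the flag keys and flagClient wrapper are hypothetical): evaluate the child experiment only when the parent flag is on, so an impossible combination can never render.

// Parent flag: the dashboard rollout. Child: an experiment on one component.
function dashboardVariation(userId) {
  // If the parent rollout is off, never evaluate the child experiment.
  if (!flagClient.isFeatureEnabled('new_dashboard', userId)) {
    return { dashboard: 'old', widget: null };
  }
  // Parent is on: the child experiment is now safe to evaluate.
  const widget = flagClient.activateExperiment('dashboard_widget_experiment', userId);
  return { dashboard: 'new', widget }; // e.g. widget = 'control' or 'treatment'
}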
Balance low latency, consistency, and compatibility in a microservice architecture

When you start to deploy multiple versions of features and experiments across multiple separate applications or services, you'll want to ensure the services are consistent in how they evaluate feature flags and experiments. If they aren't, where some evaluate a feature flag as on and others evaluate it as off, you could run into backward- or forward-compatibility issues between services, and your application might break. Below are two options for developing in this microservice architecture.

01 Services independently evaluate feature state

The benefit of each service independently evaluating the state of feature flags is that it minimizes the dependencies of a given service. The downside is that it requires updating every service. Also, if the services are truly independent, they will be less consistent about the state of a feature flag. For example, when you toggle a feature flag on, there will be a period when some services have received the update and evaluate the flag as on while others are still receiving the update and evaluate it as off. Eventually, all services will get the update and evaluate the flag as on; in other words, the independent services are eventually consistent. In this case, it's best practice to put in the extra work to make sure the different feature flag and experiment states are forward and backward compatible across services, to prevent unexpected states.

[Diagram: Services are independent. A Store.com browser user and a Store.com native mobile user are served by services that each evaluate feature state on their own.]

Best practice: Put in the extra work to make sure your different states are forward and backward compatible with the other services.
02 Services depend on a feature state service

In this architecture, the services all depend on a centralized hub, ensuring they're consistent in the way they evaluate a feature flag or experiment. Although this architecture is consistent and doesn't have to worry about backward and forward compatibility, it comes at the cost of latency: because each service has to communicate with the separate feature flag or experimentation service, you add the latency necessary to achieve a consistent state across services. A separate feature and experiment service does have the added benefits of being:

- Easily implemented in a microservice architecture
- Compatible with services in different languages, by exposing APIs in the form of language-agnostic protocols like HTTP or gRPC
- Centralized, for easier maintenance, monitoring, and upgrading

[Diagram: Services depend on a central service. A Store.com browser user and a Store.com native mobile user are served by services that all consult one central feature state service.]

Best practice: Expect some added latency in exchange for consistent evaluation of feature flags and experiments across services.
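As a sketch of the centralized option, each service might call a small internal decision API over HTTP. The /decide endpoint, its request shape, and the internal hostname are hypothetical, not any particular product's API (this also assumes a runtime with a global fetch, such as Node 18+):

// Any service, in any language, asks the central hub for a decision.
async function isFeatureEnabled(flagKey, userId) {
  const res = await fetch('http://feature-state-service.internal/decide', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ flagKey, userId }),
  });
  if (!res.ok) throw new Error(`flag service returned ${res.status}`);
  const { enabled } = await res.json();
  return enabled; // every service sees the same, centrally evaluated answer
}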
Prevent technical debt by understanding feature flag and experiment lifetimes

As your organization uses more feature flags and experiments, it's paramount to understand that some of these changes are ephemeral and should be removed from your codebase before they become outdated and add technical debt and complexity. One heuristic you can track is how long a feature flag or experiment has been in your system and how many different states it is in. If a feature flag has been in your system for a long time and all of your users see the same state, it should likely be removed (a small sketch of this check appears at the end of this section). However, it's smart to always evaluate the purpose of a flag or experiment before removing it; the real lifetime of an experiment or feature flag depends heavily on its use case.

01 Remove temporary flags and experiments

If a feature is designed to be rolled out to everyone and you don't expect to experiment on it once it's launched, then you'll want to ensure you have a ticket tracking its removal as soon as the feature has been fully launched to your users. These temporary flags may last weeks, months, or quarters. Examples include:

Painted-door experiments
These experiments are intended to be used only in the early phases of the software development lifecycle and aren't meant to be kept in the product after they have validated or invalidated the experiment hypothesis.

Performance experiments
These experiments pit two different implementations against each other in a live, real-world performance experiment. Once enough data has been gathered to determine the more performant solution, it's usually best to move all traffic over to the higher-performing variation.

Large-scale refactors
When moving between frameworks, languages, or implementation details, it's useful to deploy these rather risky changes behind a feature flag so that you have confidence they will not negatively impact your users or business. However, once the refactor is done, you hopefully won't go back in the other direction.

Best practice: To avoid technical debt, regularly review flags in case they're obsolete or should be deprecated, even if they're meant to be permanent.
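As a sketch of the staleness heuristic mentioned above (the registry shape and thresholds are hypothetical), a small script could list flags that are both old and fully rolled out, making them candidates for removal:

const DAY_MS = 24 * 60 * 60 * 1000;

// Hypothetical registry entries with launch metadata
const flags = [
  { key: 'site_redesign_phase_1', createdAt: '2019-01-15', rolloutPercent: 100 },
  { key: 'homepage_hero_experiment', createdAt: '2020-03-01', rolloutPercent: 50 },
];

function staleFlags(allFlags, maxAgeDays = 90) {
  const cutoff = Date.now() - maxAgeDays * DAY_MS;
  // Old, and everyone sees the same state => likely removable
  return allFlags.filter(
    (f) =>
      new Date(f.createdAt).getTime() < cutoff &&
      (f.rolloutPercent === 100 || f.rolloutPercent === 0)
  );
}

console.log(staleFlags(flags).map((f) => f.key));
// => [ 'site_redesign_phase_1' ]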
02 Review permanent flags and experiments

If a feature is designed to have different states for different customers, or you want to control its configuration for operational processes, then the flag will likely stay in your product for a longer period of time. Examples of these flags and experiments include:

Permission flags
These flags are useful if you have different permission levels in your product, like a read-only level that doesn't allow edit access to a feature. They are also useful if you have modular pricing, like an inexpensive "starter plan" that doesn't include the feature and a more costly "enterprise plan" that does.

Operational flags
These flags control the operational knobs of your application. For example, they can control whether you batch events sent from your application to minimize the number of outbound requests (see the sketch below). You could also use them to control the number of machines used to horizontally scale your application. In addition, they can disable a computationally expensive, non-essential service or allow for a graceful switchover from one third-party service to another during an outage.

Configuration-based software
For any software or product that is powered by a config file, this is a great place to seamlessly insert experimentation that has a low maintenance cost and still allows endless product experimentation. For example, some companies have their product layout powered by a config file that describes in abstract terms whether different modules are included and how they are positioned on the screen. With this architectural setup, even if you aren't running an experiment right now, you can still enable future product experimentation.

Product re-brands
If your business decides to change the look and feel of your product for brand purposes, it's useful to have a rollout to gracefully move to the new branding. After the new branding is established, it's a good idea to remove the feature flag powering the switch.

Note that even if a flag is meant to be permanent, it's still paramount to regularly review these flags in case they are obsolete or should be deprecated. Otherwise, keeping these permanent flags may add technical debt to your codebase.
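A small sketch of the operational-flag example above (the event_batching key, flagClient wrapper, and send function are hypothetical):

const queue = [];

function trackEvent(event) {
  // Operational flag: batch outbound events to reduce request volume.
  if (flagClient.isFeatureEnabled('event_batching')) {
    queue.push(event);
    if (queue.length >= 50) flush(); // send in batches of 50
  } else {
    send([event]); // flag off: send each event immediately
  }
}

function flush() {
  if (queue.length > 0) send(queue.splice(0, queue.length));
}

// send(events) posts to your analytics backend (not shown here)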
Make your organization resilient with a code-ownership strategy

As with any feature you build, the individuals and teams that originally engineered a feature or experiment are not going to be around forever. As a best practice, your engineering organization should agree on who is responsible for owning and maintaining each feature or experiment. This is particularly important when you need to remove or clean up old feature flags or experiments.

Some organizations use an integration between a task-tracking system and their feature flag and experiment service to manage this cleanup process seamlessly and quickly. If the state of feature flags and experiments can be synced with a ticket-tracking tool, an engineering manager can query for all feature flags and experiments whose state has not changed in the past 30 days and track down their owners to schedule a review. Other organizations hold a recurring feature flag and experiment removal day, in which engineers review the oldest items in the list at a regular cadence.

Options around ownership include:

01 Individual ownership
Individual developers are labeled as owners of a feature or an experiment. At a regular cadence, for example every two quarters, ownership is re-evaluated and transferred if necessary.
Pros: Simple and understandable.
Cons: Hard to maintain if engineers frequently move between projects or code areas.

02 Feature team ownership
The team responsible for building the feature takes ownership of the feature and experiment.
Pros: Resilient to individual contributor changes.
Cons: Hard to maintain if teams are constantly changing, or are unbalanced with an uneven distribution of ownership.

03 Centralized ownership
Ownership falls to a dedicated experimentation or growth team with experts who set up, run, and remove experiments.
The upside is that this centralized team can be the experimentation experts at your company and help ensure experiments are always run with high quality; the downside is that it limits the scale of your experimentation program to the size of this central team. This method can be especially helpful when getting started, and it's useful to have one team prove out the best practices before fanning them out to other parts of the organization.
Pros: Resilient to many changes and simplest to reason about.
Cons: Central teams aren't going to be the experts in the areas where the experiment is actually implemented and may require a lot of help from other teams. The size of this team will eventually limit the number of experiments and feature flags your company can handle.

Minimize on-call surprises by treating changes like deploys

Companies often have core engineering hours for a given product or feature; for example, a core team working Monday through Friday in similar time zones. Even companies that practice continuous integration and continuous delivery realize that deploying production code changes outside of these core working hours (late at night or on weekends) is usually a bad idea. If something goes wrong with a deploy outside of working hours, teams run the risk of releasing issues when the full team is not around; with fewer teammates available, it's slower to fix the issue and mitigate its impact.

Because experiments and rollouts give individuals the ability to easily make production code changes, it's best practice to treat changes to an experiment or rollout with the same level of care as standard software deploys. Avoid making changes when no one is around or no one is aware of the change. If it is advantageous to make changes during off-hours, do so transparently, with proper communication, so no one is caught by surprise.

Best practice: Save rollout or experiment changes for core working hours and avoid making these changes on Fridays or before a weekend or holiday. If you must, do it responsibly.

Expect changes and recover from mistakes faster with audit logs

When someone at your organization deploys code changes to your product, it's best practice to have a change log that lets everyone know what the change was, who made it, and why. With this information, you won't be surprised by changes to your product or when a user sees something new. This practice of increasing visibility into changes to your application is no different for feature flags and experiments; you'll want to be able to quickly answer questions like:

- A user's experience changed recently; did we start an experiment?
- Did anyone recently change the traffic allocation or targeting of this feature to include this set of users?
- An unexpected bug recently occurred for a customer, but we didn't deploy recently; did anyone change anything regarding an experiment or feature-flag state?

Having a change history or audit log for your feature management and experimentation platform is key to scaling your experiments while keeping the ability to quickly diagnose changes to your users' experience. An audit log can also speed time to recovery by pinpointing the root cause of an undesirable change and making its implications easier to understand.

Best practice: Build in broader visibility with a change history or audit logs so you can quickly diagnose issues and pinpoint root causes.

Understand your systems with intelligent alerting

With many possible states, you'll want visibility into what's actually happening in production for your customers, and intelligent alerting for when things are not behaving as expected. For example, you may have thought you released a feature to all of your customers, only to realize that a targeting condition prevented a portion of them from seeing it. An alert for when a feature flag has been accessed by X number of users can be a useful way to ensure your feature flags are acting as expected in the wild. Some organizations even set up systems to auto-stop a rollout if errors can be attributed to the new feature-flag state in production.

Code smarter systems by making them feature flag and experiment agnostic

Not all of your code has to know about all the experiments and feature flags in your product. In fact, the less your code has to worry about the experiments being run on your product, the better. By striving for code that is experiment-agnostic or feature-flag-unaware, you can focus on the particular product or feature you are building without having to worry about the different states. The techniques below are just two examples of the design patterns available[3]:

Move the fork

If you have a feature flag with two different experiences, moving the point where you fork the experience can affect whether individual code paths depend on the feature-flag state. For example, a frontend component could fork the experience inside itself, or the frontend page could swap out entire components, so the components themselves don't have to be aware of the feature-flag state.
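As a sketch of moving the fork in a React-style frontend (the component names and flagClient wrapper are hypothetical), the page swaps whole components so that neither component knows about the flag:

// Fork inside the component: SearchBar must know about the flag.
function SearchBar({ userId }) {
  const isNew = flagClient.isFeatureEnabled('new_search', userId);
  return isNew ? renderNewSearch() : renderOldSearch(); // hypothetical helpers
}

// Fork moved up to the page: each component is flag-agnostic.
function SearchPage({ userId }) {
  const isNew = flagClient.isFeatureEnabled('new_search', userId);
  return isNew ? <NewSearchBar /> : <OldSearchBar />;
}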
Avoid conditionals

Instead of deploying a feature flag or experiment using if/else branches in an imperative style, consider a declarative style where your feature is controlled by a variable configuration that could be managed not only by a feature flag service but by any remote service. For example, if you were experimenting on an email subject, you could code the variation as:

email = new Email();
email.subject = "Welcome to Optimizely";
variation = activateExperiment("email_feature");
if (variation == "treatment_A") {
  email.subject = "Hello from Optimizely";
}

Or you could remove the conditional and have the subject provided as a variable by the experimentation service:

email = new Email();
email.subject = getVariable("email_feature", "subject");

Or you could recognize that you can declaratively code the subject as a variable property on your email class, with defaults that a feature or experimentation service can override:

@feature("email_feature")
class Email {
  @variable("subject")
  subject = "Welcome to Optimizely";
}

In these latter two implementation techniques, you reduce how experiment-aware or feature-flag-aware your email code paths are.
Leverage configuration-based design to foster a culture of experimentation

To truly achieve a culture of experimentation, you have to enable non-technical users to experiment across your business and product. This starts with architecting a system that does not require engineering involvement and has enough safeguards to prevent individuals from breaking your product. Configuration-based development is an architectural pattern commonly used to enable this type of large-scale experimentation because it uses configuration files to power your product: for instance, a configuration file that powers the layout of content in a mobile application, or one that controls the scaling characteristics of a distributed system. By centralizing the different possible product states in a configuration file that can be validated programmatically, you can let experiments power different configuration setups while maintaining confidence that your application can't be put in a broken state.

Best practice: Architect a system that's accessible to non-technical users and use configuration files to power your products and features.

Increase developer speed with framework-specific engineering

Integrating feature flags and experiments into the tools and systems you already use will make scaling your experimentation program easier. To some, this means making experiments and feature flags ergonomic; to others, it means working where you work and how you work. In React, it's much easier to develop with the mindset of components. In Express, it's much easier to develop with the mindset of middleware. Each framework and platform has its own idiomatic ways of developing. The closer feature flags and experiments match those same idiomatic patterns, the more likely you are to easily integrate them into your development, testing, and deployment processes.

Best practice: Aim to match your feature flags and experiments to the idiomatic patterns of your development framework so they integrate easily into your development, testing, and deployment processes.
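As a sketch of the framework-idiomatic approach in Express (the flagClient wrapper and the x-user-id header are hypothetical), feature decisions can be computed once in middleware and then read by any route:

const express = require('express');
const app = express();

// Middleware: evaluate flags once per request, in the idiomatic Express style.
app.use((req, res, next) => {
  const userId = req.get('x-user-id') || 'anonymous';
  req.features = {
    promotionalDiscount: flagClient.isFeatureEnabled('promotional_discount', userId),
  };
  next();
});

// Routes stay simple: they just read the precomputed decisions.
app.get('/pricing', (req, res) => {
  res.json({ discountShown: req.features.promotionalDiscount });
});

app.listen(3000);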
Evaluating feature delivery and experimentation platforms

Now that you've learned about progressive delivery and experimentation, you might be considering whether to build your own testing framework, integrate an open source solution, or extend a commercial progressive delivery and experimentation platform like Optimizely. When deciding which option is right for your organization, consider the following:

Ease of use for developers and non-technical stakeholders

Usability for both technical and non-technical users can be the difference between running a few experiments a year and running thousands. An enterprise system often includes remote configuration capabilities: the ability to start and stop a rollout or experiment, change traffic allocation, or update targeting conditions in real time from a central dashboard without a code deploy. When a progressive delivery and experimentation system is easy for your engineering organization to adopt, more teams will be able to deploy quickly and safely. Developers will spend less time figuring out how to manage the release or experimentation of their code, and more time on customer-facing feature work. Look for systems with robust documentation, multiple implementation options, and open APIs.

Statistical rigor and system accuracy

To learn quickly through experimentation, teams need to trust that tests are run reliably and that the results are accurate. You'll need a tool that can track events, provide or connect to a data pipeline that can filter and aggregate those events, and integrate correctly into your system. Vet the statistical models used to calculate significance to ensure your team can make decisions quickly, backed by accurate data.

Total cost of developing and maintaining your system

Building in-house or adopting an open source framework typically comes with a relatively small upfront investment. Over time, additional features and customizations become necessary as more teams use the platform, and maintenance burdens like bug fixes and UI improvements begin to distract engineers from a core product focus. Committing to building a platform yourself is a commitment to continually innovating on experimentation and developing new functionality to support your teams. Companies that successfully scale experimentation with a homebuilt system have engineers on staff dedicated to enabling others and supporting the system with ongoing maintenance.
04 Use progressive delivery and experimentation to innovate faster

In this book, we've gone from building a basic foundation of progressive delivery and experimentation to more advanced best practices on how to do it well. Many of the most successful software companies have gone on this journey not only to deploy features safely and effectively, but also to make sure they're building the right features to begin with. In today's age of rapid change, having the tools and techniques to quickly adapt and experiment is crucial to staying ahead of the curve.
Appendix

01 Trunk-based development workflow

What is trunk-based development?

Trunk-based development is a software development strategy in which engineers merge smaller changes more frequently into the main codebase and work off the trunk copy, rather than working on long-lived feature branches.

With many engineers working in the same codebase, it's important to have a strategy for how individuals work together. To avoid overriding each other's changes, engineers create their own copies of the codebase, called branches. Following the analogy of a tree, the master copy is sometimes called the mainline or the trunk. The process of incorporating the changes from an individual's copy into the main trunk is called merging.

Why trunk-based development?

To understand how trunk-based development works, it's useful to first look at the alternative strategy: feature-branch development. In feature-branch development, individual software developers or teams of engineers do not merge their new branch until a feature is complete, sometimes working for weeks or months at a time on a separate copy.

[Diagram: Feature-branch development. Long-lived feature branches split off the master (or trunk).]
This long stretch of time can make the process of merging difficult, because the trunk or master has likely changed as other engineers merged their own code. This can result in a lengthy code review process in which the changes in different pull requests are analyzed to resolve merge conflicts.

Benefits of trunk-based development

Trunk-based development takes a more continuous delivery approach to software development: branches are short-lived and merged as frequently as possible. The branches are smaller because they often contain only part of a feature. These short-lived branches make merging easier because there is less time for divergence between the main trunk and the branch copies. Thus, trunk-based development is a methodology for releasing new features and small changes quickly while helping teams avoid lengthy bug fixes and "merge hell." It is an increasingly popular DevOps practice among agile development teams, and it is often paired with feature flags to ensure that any new feature can be rolled back quickly and easily if bugs are discovered.

[Diagram: Trunk-based development. Short-lived branches are merged more frequently and more easily into the master (or trunk).]

02 Painted-door experiment in-depth example

When you aren't sure whether to build a feature, a painted-door experiment is a high-value experiment. In a painted-door experiment, instead of putting in all the time to build the feature, you first verify that the feature is worth it by building just the suggestion of the feature into your product and measuring engagement.
Let's say your company is deciding whether to build a "Recommended For You" feature, where users are recommended content based on what they have previously viewed. Building such a system may require not only frontend changes but also backend API changes, backend model changes, a place to run a recommendation algorithm, and enough data to make the recommendations worthwhile. This feature will take countless hours of engineering time to build. Furthermore, in the time it takes to build, you may not have confidence that users will actually use the recommended content.

So how do you gain confidence that users will use the new feature without building the entire thing? One option is to build just the frontend button to view the recommended content, but have the button lead to a notice that the team is still working on this feature and is seeking feedback on how it should be built. This way, you build only a suggestion of the "Recommended For You" feature, yet you can measure how many users engage and interact with it. This is a painted-door experiment. These experiments either validate the hypothesis that the feature is worth building, or they invalidate the hypothesis and save countless engineering hours that would otherwise have been spent building a feature not worth the time.

03 Additional implementation techniques

Check out Martin Fowler's blog for additional techniques for implementing feature toggles: https://optimize.ly/2TVig0C
Bio

Asa Schachar
Principal Developer Advocate

Asa is the Principal Developer Advocate for Optimizely. Previously, he was the engineering manager for Optimizely's Full Stack product, responsible for leading multiple cross-functional engineering teams in charge of Optimizely's fastest-growing product, which enables companies to experiment across websites, apps, and every level of the stack. Prior to joining Optimizely, Asa worked at Microsoft as a software developer, where he built the Internet Explorer site recommendation algorithm. He studied mathematics and computer science at Harvard University and is an avid break dancer.

© 2020 Optimizely, Inc.