War Stories and a Few Lessons Learned
1. War Stories and a Few
Lessons Learned
Tommy Guy
Principal Data Scientist
Microsoft
Content from the entire Analysis and Experimentation Team at Microsoft.
3. Too much of a good thing
CONTROL (12 slides) TREATMENT (16 slides)
4. Too much of a good thing
WHAT HAPPENED
MSN noticed that competitors were displaying more content on their infopanes.
They decided to run an experiment to test whether or not a similar feature
would be useful for their customers.
Despite the initial hypothesis, the test results appeared to indicate that the
treatment (16 panes) was significantly worse than the control (12 panes).
A sample-ratio-mismatch (SRM) alert notified the experimenters there were
fewer users in treatment than control (49.8% instead of 50%).
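To see why a split as close as 49.8% / 50.2% can still fire an alert, here is a minimal sketch of an SRM check as a chi-squared goodness-of-fit test. The user counts and the 0.001 threshold are assumptions chosen for illustration; the slides give neither, and this is not necessarily how the ExP alert is implemented.

```python
import math

def srm_p_value(control_users, treatment_users, expected_treatment_share=0.5):
    """Chi-squared goodness-of-fit test (1 degree of freedom) for a two-arm split."""
    total = control_users + treatment_users
    expected_treatment = total * expected_treatment_share
    expected_control = total - expected_treatment
    chi2 = ((control_users - expected_control) ** 2 / expected_control
            + (treatment_users - expected_treatment) ** 2 / expected_treatment)
    # Survival function of a chi-squared variable with 1 degree of freedom:
    # P(X > chi2) = erfc(sqrt(chi2 / 2))
    return math.erfc(math.sqrt(chi2 / 2))

# Hypothetical counts: 1,000,000 users with 49.8% landing in treatment.
p_value = srm_p_value(control_users=502_000, treatment_users=498_000)

# Assumed alert threshold; a strict value keeps false alarms rare.
srm_detected = p_value < 0.001
```

With the same ratio but only 1,000 users, the p-value is far above the threshold, so whether 49.8% is alarming depends on the sample size.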
5. Too much of a good thing
THE REAL CULPRIT
It turns out that the 16 pane variant was doing
extremely well. A little TOO well.
Treatment engagement increased so much that the
heaviest users were being classified as bots.
Power users were removed from the experiment
altogether, causing an uneven distribution of visitors
and the subsequent SRM alert.
CONCLUSION
Once the issue was fixed, the SRM alert no longer
fired and the 16 pane variant performed extremely
well. This resulted in a significant improvement in
engagement and $1.2M in annual revenue.
But this wasn't the only insight. MSN's machine
learning algorithms had to be retrained to account
for the change in user behavior, which helped
prevent lost revenue during the experiment.
SRM notifications helped uncover major issues that,
when resolved, both dramatically changed the
outcome of an experiment and shed light on
deeper classification flaws.
7. THE RESULT
An experiment quality alert indicated there were more users in the
treatment than in the control.
THIS IS AN ODD (RARE) EVENT. WHY?
Our Store telemetry has two event types, page views and clicks.
All events should be logged, but first-impression events had been lost in earlier incidents.
Typically event loss occurs equally in treatment and control (or occasionally more in treatment).
Increased event loss in the control is unusual.
Uneven Telemetry Loss
8. What was the Reason?
An analysis of the scorecard revealed that the SRM was caused
by a change in user behavior interacting with logging anomalies.
Meaning…
Treatment and control had the same rate of lost first impressions.
The treatment had extra clicks from users closing the lightbox.
Hence the treatment ended up with more total users in the logs.
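The mechanism above can be sketched with simple arithmetic. All numbers below are hypothetical, and the model rests on two loud assumptions that the slides do not state: a user whose first impression is lost appears in the logs only if they also clicked, and click events are never lost.

```python
def observed_users(assigned, first_impression_loss, click_share):
    """Estimate how many assigned users show up in the logs.

    Assumption: a user with a lost first impression is invisible unless
    they also logged a click; clicks themselves are never dropped.
    """
    invisible = assigned * first_impression_loss * (1 - click_share)
    return assigned - invisible

assigned = 100_000   # hypothetical users assigned to each arm
loss = 0.05          # hypothetical loss rate, identical in both arms

# The lightbox in treatment prompts extra "close" clicks (hypothetical shares).
control_seen = observed_users(assigned, loss, click_share=0.30)
treatment_seen = observed_users(assigned, loss, click_share=0.50)

# Same assignment and same loss rate, yet treatment shows more users.
extra_in_treatment = treatment_seen - control_seen
```

Under these assumptions the extra clicks "rescue" users who would otherwise vanish from the logs, so treatment appears larger even though both arms were assigned equally.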
Uneven Telemetry Loss
9. STRATEGIC QUESTIONS
Needed to set a testable, product-wide
guardrail that teams can work within
independently.
Should Bing invest in reducing time to
display search results?
EXPERIMENT DEFINITION
Add artificial delays to Page Load Time.
Study the effects of small and large
differences on Bing metrics like revenue
and sessions.
aka.ms/exp/paper/speed
The Value of Speed
What’s your estimate? An improvement of _______ seconds can pay for a senior
engineer for one year.
10. What was the result?
IMPACT AND DECISIONS MADE
A ‘performance’ budget was created per
team to enable distributed ship decisions.
Using the ExP metrics authoring service,
Bing added Ads Revenue to the set of
required Bing guardrail metrics, ensuring
it was computed for all future Bing
experiments.
A 4-millisecond gain can fund one engineer for a year.
The Value of Speed
11. TREATMENT CONTROL
Making a Successful Experiment Better
The Outlook team had a hypothesis that adding an 'unread'
badge would increase frequent engagement with the
application. An experiment was run to test this idea on
Android phones.
12. THE RESULT..?
The test was a strong winner at the
aggregate level. The experiment
was a good ship candidate.
Making a Successful Experiment Better
THE DISCOVERY
However, Segments of Interest
alerted the Outlook team to
additional insights:
On certain devices the lift was flat,
while others saw substantial
positive results.
Why? Because some manufacturers
used a custom skin that hid the badge!
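A per-segment breakdown of this kind can be sketched as follows. The segment names, metric values, and the 1% cutoff are all made up for illustration, and the slides do not describe how Segments of Interest works internally; a real system would presumably apply statistical tests rather than a fixed cutoff.

```python
def relative_lift(control_mean, treatment_mean):
    """Relative change of the treatment metric over the control metric."""
    return (treatment_mean - control_mean) / control_mean

def flat_segments(segments, min_lift=0.01):
    """Return the segments whose lift falls below an assumed cutoff."""
    return [name for name, (control, treatment) in segments.items()
            if relative_lift(control, treatment) < min_lift]

# Hypothetical (control_mean, treatment_mean) engagement per device maker.
segments = {
    "manufacturer_a": (10.0, 11.0),  # clear positive lift
    "manufacturer_b": (10.0, 10.0),  # flat: candidate for investigation
    "manufacturer_c": (8.0, 9.0),    # clear positive lift
}

flat = flat_segments(segments)
```

An aggregate view averages these segments together and still looks like a win, which is why the flat segment is easy to miss without the breakdown.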
13. THE OUTCOME
Armed with this knowledge,
Outlook invested additional
resources to create a version of the
badge for each incompatible
device.
The initial test success was
replicated across all devices.
Using Segments of Interest helped
Outlook maximize the reach of a
successful idea.
Making a Successful Experiment Better