When in doubt, go live
Techniques for decision making
based on real user behavior
© 2020 ThoughtWorks
Irene Torres
Klaus Fleerkötter
You save time and make better decisions
by establishing shorter feedback loops
from feature idea to feature usage.
© 2020 ThoughtWorks
Irene Torres
Developer @ TW
PhD Neuroscience
Science perspective
Klaus Fleerkötter
Developer @ TW
Information Systems
Techie perspective
Klaus
Who’s talking?
© 2020 ThoughtWorks
What is this talk about?
Specific use cases
that worked for us
Tech & Research
And what is it not...
© 2020 ThoughtWorks
Extensive coverage of
user research
Software testing
One of Germany’s
biggest online retailers
Top 5 highest traffic
e-commerce sites
(Germany)
Orders: <= 10 per second
Qualified visits:
Ø 1.6 million / day
Examples
© 2020 ThoughtWorks
Establishing Feedback Loops
PO
Team
Stakeholders
Users
Establishing Feedback Loops
PO
Delivery Pipeline
Feature Toggle
Shadow Traffic
Lab Test
Focus Group Survey
Visual Report
A/B Test
Prerequisites
© 2020 ThoughtWorks
PO
An Iterative and Incremental development process
© 2020 ThoughtWorks
Services that can be built independently by cross-functional
teams that are structured around business domains
© 2020 ThoughtWorks
Dev
PO
QA Ops
UX
DA
The Delivery Pipeline
© 2020 ThoughtWorks
Delivery Pipeline
Iterative and Incremental development
Independent Teams
The Delivery Pipeline
© 2020 ThoughtWorks
Build Test Deploy
Gain situational awareness
Knowing that you went live and nothing’s on fire
© 2020 ThoughtWorks
Feature Toggles
© 2020 ThoughtWorks
Delivery Pipeline
Feature Toggle
Iterative and Incremental development
Independent Teams
Feature Toggles
Decouple go-live from deployment
© 2020 ThoughtWorks
© CC BY 2.0 "Switch" Jon_Callow_Images
if (toggleIsOn) {
    executeNewBehavior();
} else {
    executeOldBehavior();
}
Feature Toggles
Flip for experimentation
© CC BY-ND 2.0 "Off?" Nicholas Liby
Without Recompile?
Without Restart?
Per Request?
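A minimal sketch of what a per-request toggle can look like, assuming a simple in-memory registry (the class and names are illustrative, not a specific library). Because the flag is read on every evaluation, flipping it needs no recompile and no restart:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory toggle registry, updatable at runtime
// (e.g. via a config poller or an admin endpoint).
public final class FeatureToggles {
    private static final Map<String, Boolean> TOGGLES = new ConcurrentHashMap<>();

    public static void set(String name, boolean on) {
        TOGGLES.put(name, on);
    }

    // Read on every evaluation, so a flip takes effect on the next request.
    public static boolean isOn(String name) {
        return TOGGLES.getOrDefault(name, false);
    }
}

The if/else above then becomes: if (FeatureToggles.isOn("newBehavior")) { ... }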
© 2020 ThoughtWorks
While
developing,
go live
© 2020 ThoughtWorks
Shadow Traffic
© 2020 ThoughtWorks
Delivery Pipeline
Feature Toggle
Shadow Traffic
Iterative and Incremental development
Independent Teams
Shadow Traffic
Not just for testing
© 2020 ThoughtWorks
[Diagram] Run both: the user is served the old behavior and sees no difference;
the team runs the new behavior alongside it and compares.
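A minimal sketch of the run-both idea, assuming the new behavior is cheap enough to run off the hot path (all names here are illustrative):

import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

// The user always gets the old result; the new code runs asynchronously on the
// same input, and mismatches are logged for the team to analyse.
public final class ShadowRunner {

    public static <T> T runBoth(Supplier<T> oldBehavior, Supplier<T> newBehavior) {
        T oldResult = oldBehavior.get();            // this is what the user sees
        CompletableFuture.runAsync(() -> {
            try {
                T newResult = newBehavior.get();    // shadow call
                if (!oldResult.equals(newResult)) {
                    System.out.printf("shadow mismatch: old=%s new=%s%n",
                            oldResult, newResult);
                }
            } catch (RuntimeException e) {
                // A failing shadow call must never affect the user.
                System.out.println("shadow call failed: " + e.getMessage());
            }
        });
        return oldResult;
    }
}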
Shadow Traffic
Get early feedback
60% 40%
Min 3 items?
Mostly fashion?
Not sold out?
Max 1 of each kind?
Maximize!
© 2020 ThoughtWorks
Visual Report
© 2020 ThoughtWorks
Delivery Pipeline
Feature Toggle
Shadow Traffic
Visual Report
Iterative and Incremental development
Independent Teams
Visual Report
Quality of a feature
© 2020 ThoughtWorks
Assess that the MVP applies the correct business rules
● Visual report (e.g. an HTML page) showing manually curated and automatically
generated results side by side
[Screenshot: manual vs. auto teasers for beach pants, leather bags, jackets]
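One way such a page could be generated; a hedged sketch that assumes the curated items are available as plain strings (class and method names are made up for illustration):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Map;

// Renders manual and auto results per category into one HTML table so a human
// can judge the business rules at a glance.
public final class VisualReport {

    public static void write(Map<String, List<String>> manual,
                             Map<String, List<String>> auto,
                             Path out) throws IOException {
        StringBuilder html = new StringBuilder("<html><body><table border='1'>");
        html.append("<tr><th>Category</th><th>manual</th><th>auto</th></tr>");
        for (String category : manual.keySet()) {
            html.append("<tr><td>").append(category)
                .append("</td><td>").append(String.join("<br>", manual.get(category)))
                .append("</td><td>")
                .append(String.join("<br>", auto.getOrDefault(category, List.of())))
                .append("</td></tr>");
        }
        html.append("</table></body></html>");
        Files.writeString(out, html);
    }
}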
© 2020 ThoughtWorks
Go live
without
flying blind
© 2020 ThoughtWorks
A/B Testing
© 2020 ThoughtWorks
Delivery Pipeline
Feature Toggle
Shadow Traffic
Visual Report
A/B Test
Iterative and Incremental development
Independent Teams
A/B testing
© 2020 ThoughtWorks
“You want your data to inform, to guide, to improve your business model, to help
you decide on a course of action.” (Lean Analytics)
STATS
Focus on understanding the underlying statistics that drive the calculation of a
sample size.
A/B testing ≡ a set of statistical tests that evaluate two independent groups, a
control group and a test group (groups = variants).
“Independent groups” -> a between-subjects design
A/B testing
© 2020 ThoughtWorks
Control [A]
Test [B]
A/B testing
A/B testing mostly uses statistical hypothesis testing to calculate how likely it
is that a change to your website made a meaningful difference.
Null hypothesis (H0): the state of the world; there is no effect, no difference
when you apply changes.
H0: Our <KPIs> remained the “same” in the control group and in the test group
Alternative hypothesis (H1): the changes in the test group had a real effect.
H1: Our users actively engage in clicking the button and therefore our A2B
(add-to-basket) rate increases by a relative 5%
© 2020 ThoughtWorks
A/B testing
© 2020 ThoughtWorks
Source: https://abtestguide.com/abtestsize/
[Screenshot: sample-size calculator]
Metrics we know
We decide from previous data or knowledge about this variable [effect size]
Dependent on the variable and what we are looking for [normally two-sided]
We can play with it, but mostly by convention and dependent on traffic [accuracy]
A/B testing
© 2020 ThoughtWorks
Source: https://abtestguide.com/abtestsize/
Effect size
The magnitude of the effect: how important the difference is. An improvement
that is meaningful for your business.
Relative improvement (%) = 100 * (Test conversion rate - Control conversion rate) / Control conversion rate
Example: a 2% control conversion rate with a 15% relative improvement gives
Test conversion rate = 2% + 15% * 2% = 2.3% (an absolute +0.3 percentage points)
A/B testing
© 2020 ThoughtWorks
Source: https://abtestguide.com/abtestsize/
One-sided or two-sided?
[Chart: overlapping control and test distributions, showing the difference in means]
Is the difference significant enough to reject the null hypothesis?
H0 : 𝝻t = 𝝻c (𝝻t : mean test, 𝝻c : mean control)
A/B testing
© 2020 ThoughtWorks
Source: https://abtestguide.com/abtestsize/
One-sided or two-sided?
H1 : 𝝻t > 𝝻c (one-sided, directional)
H1 : 𝝻t ≠ 𝝻c (two-sided)
Two-sided tends to be the best option
(𝝻t : mean test, 𝝻c : mean control)
A/B testing
© 2020 ThoughtWorks
Power, significance level & confidence level
A/B testing
© 2020 ThoughtWorks
Power of a test: the probability of finding an effect when it is really there.
Power = 1 - 𝛃, the complement of the type II error rate (false negatives).
Source: https://towardsdatascience.com/a-guide-for-selecting-an-appropriate-metric-for-your-a-b-test-9068cccb7fb
Typical value is 80% (a convention)
Higher power -> lower chance to miss a true effect, but a larger required sample size
Power, significance level & confidence level
A/B testing
© 2020 ThoughtWorks
Source: https://www.youtube.com/watch?v=CSBCKVQLf8c

Real world       | Our study: effect present  | Our study: effect absent
Effect present   | Reject H0                  | Type II error (miss)
Effect absent    | Type I error (false alarm) | Reject H1

Type II error: the probability to miss an effect that is really there (the odds
of not detecting it)
A/B testing
© 2020 ThoughtWorks
Source: https://www.youtube.com/watch?v=CSBCKVQLf8c

Real world       | Our study: effect present  | Our study: effect absent
Effect present   | Reject H0 (power 1-𝛃)      | Type II error (miss) (𝛃 risk)
Effect absent    | Type I error (false alarm) | Reject H1

Type II error: miss -> probability less than 20% (𝛃)
Power is 1-𝛃 -> 80%
Higher power -> lower chance to miss a true effect, larger sample size
A/B testing
© 2020 ThoughtWorks
Source: https://towardsdatascience.com/a-guide-for-selecting-an-appropriate-metric-for-your-a-b-test-9068cccb7fb
Significance level (𝛂): the probability of detecting an effect that is really not there
Typical value is 5% (a convention); the corresponding confidence level is 95%
Power, significance level & confidence level
A/B testing
© 2020 ThoughtWorks
Source: https://www.youtube.com/watch?v=CSBCKVQLf8c
Type I error: false alarm -> probability less than 5% (𝛂). Confidence level is 1-𝛂 : 95%
Significance level 𝛂 relates to the p-value: reject H0 when p-value < 𝛂

Real world       | Our study: effect present           | Our study: effect absent
Effect present   | Reject H0                           | Type II error (miss)
Effect absent    | Type I error (false alarm) (𝛂 risk) | Reject H1
A/B testing
© 2020 ThoughtWorks
Confidence level: the complement of the significance level (1 - 𝛂). The
probability that the value of a parameter falls within a specified range of values
Source: https://towardsdatascience.com/a-guide-for-selecting-an-appropriate-metric-for-your-a-b-test-9068cccb7fb
Typical value is 95% (a convention)
Significance level 𝛂 ~ 0.05 (5%): it tells you the probability that the effect
you found was just chance; reject H0 when the p-value < 𝛂 (p-value < 0.05)
Higher confidence level -> larger required sample size
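To make the decision rule concrete, here is a hedged sketch of a two-proportion z-test (normal approximation with an Abramowitz-Stegun erf; illustrative only, not the exact method behind any particular tool). Reject H0 when the returned p-value is below 𝛂 = 0.05:

public final class TwoProportionZTest {

    // Two-sided p-value for conversions/visitors in control (C) and test (T).
    public static double pValue(long convC, long nC, long convT, long nT) {
        double pC = (double) convC / nC;
        double pT = (double) convT / nT;
        double pPool = (double) (convC + convT) / (nC + nT);
        double se = Math.sqrt(pPool * (1 - pPool) * (1.0 / nC + 1.0 / nT));
        double z = Math.abs(pT - pC) / se;
        return 2 * (1 - phi(z));
    }

    // Standard normal CDF via an Abramowitz-Stegun erf approximation.
    private static double phi(double x) {
        return 0.5 * (1 + erf(x / Math.sqrt(2)));
    }

    private static double erf(double x) {
        double t = 1 / (1 + 0.3275911 * Math.abs(x));
        double y = 1 - (((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t
                - 0.284496736) * t + 0.254829592) * t) * Math.exp(-x * x);
        return x >= 0 ? y : -y;
    }

    public static void main(String[] args) {
        // e.g. 2.0% vs 2.3% conversion with 40,000 users per variant
        double p = pValue(800, 40_000, 920, 40_000);
        System.out.println(p < 0.05 ? "significant" : "not significant");
    }
}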
A/B testing
© 2020 ThoughtWorks
Source: https://abtestguide.com/abtestsize/
Meaningful for your
business
Power and confidence
level influence your
sample size and the
probability of finding a
true effect
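As a sketch of what such calculators compute, this is a standard normal-approximation formula for the required sample size per variant (not necessarily the exact formula that site uses):

// z-values for 95% confidence (two-sided) and 80% power, the usual conventions.
public final class SampleSize {
    private static final double Z_ALPHA_HALF = 1.96;
    private static final double Z_BETA = 0.84;

    // p1 = control conversion rate, relLift = relative improvement (e.g. 0.15)
    public static long perVariant(double p1, double relLift) {
        double p2 = p1 * (1 + relLift);
        double variance = p1 * (1 - p1) + p2 * (1 - p2);
        double delta = p2 - p1;
        return (long) Math.ceil(
                Math.pow(Z_ALPHA_HALF + Z_BETA, 2) * variance / (delta * delta));
    }

    public static void main(String[] args) {
        // 2% control conversion, 15% relative lift:
        // about 36,600 users per variant under this approximation.
        System.out.println(perVariant(0.02, 0.15));
    }
}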
A/B testing
© 2020 ThoughtWorks
Important points
High traffic: you can choose KPIs wisely, even with a low effect size (e.g.
+0.5%); aim for accuracy and minimise risk; preferably A/B, but MVT is also an
option.
Low traffic: choose KPIs with a high expected increase (large effect size, e.g.
+5%); stick to plain A/B; also run qualitative tests.
Never stop an experiment before its planned end, even if you “find” significant
results (danger: false positives rising!)
Source: https://www.evanmiller.org/how-not-to-run-an-ab-test.html
https://vwo.com/blog/ab-split-testing-low-traffic-sites/
Before
development
© 2020 ThoughtWorks
Focus Group Survey
© 2020 ThoughtWorks
Delivery Pipeline
Feature Toggle
Shadow Traffic
Focus Group Survey
Visual Report
A/B Test
Iterative and Incremental development
Independent Teams
Focus Group Survey
© 2020 ThoughtWorks
What is it
A study using inferential statistics to verify a hypothesis.
When
As part of the discovery of a
feature, during development
Why
Short feedback loops
Data-driven decisions
Caution! You need experience
designing and analysing statistical
tests.
The shopteaser survey
© 2020 ThoughtWorks
Focus Group Survey
Focus Group Survey
© 2020 ThoughtWorks
Likert scale: Strongly disagree / Disagree / Neutral / Agree / Strongly agree
[categorical variable]
The shopteaser survey
Your research question will drive the design of the experiment and also the
analysis of your data
[Screenshots: four survey trials]
Focus Group Survey
© 2020 ThoughtWorks
Likert scale: Strongly disagree / Disagree / Neutral / Agree / Strongly agree
[categorical variable that can be transformed to continuous: scale 1-5]
The shopteaser survey
[Screenshots: three survey trials]
Things that could go wrong:
- Familiarity bias
Methodology examples:
- Gave 5 s per trial so the answers would be spontaneous
- The first trials were discarded
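A tiny sketch of that categorical-to-continuous transformation (the 1-5 coding is the usual convention; treating Likert data as continuous is itself a methodological choice to make consciously):

import java.util.List;
import java.util.Map;

// Map Likert categories to the conventional 1-5 coding and average them.
public final class Likert {
    private static final Map<String, Integer> SCALE = Map.of(
            "Strongly disagree", 1, "Disagree", 2, "Neutral", 3,
            "Agree", 4, "Strongly agree", 5);

    public static double meanScore(List<String> answers) {
        return answers.stream().mapToInt(SCALE::get).average().orElse(Double.NaN);
    }
}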
During the design phase we also took into account:
● Collect demographic data: you can never have enough
● Collect feedback at the end of the survey: did they understand the task, did
something go wrong?
● Write clear instructions: if you are not there, participants cannot ask and
will make assumptions
© 2020 ThoughtWorks
Focus Group Survey
The shopteaser survey
Insights from a focus group
The shopteaser survey
© 2020 ThoughtWorks
[Chart: selected vs. manual results]
Lab test
© 2020 ThoughtWorks
Delivery Pipeline
Feature Toggle
Shadow Traffic
Lab Test
Focus Group Survey
Visual Report
A/B Test
Iterative and Incremental development
Independent Teams
UX designers test the design and usability of a feature with a test group.
● Small group of people in person (~5-10 participants)
● Remote, web-based testing with users
● Qualitative questions
○ e.g. Did you like it? Was it easy to find?
UX Lab tests
© 2020 ThoughtWorks
Wrapping up
© 2020 ThoughtWorks
PO
Delivery Pipeline
Feature Toggle
Shadow Traffic
Lab Test
Focus Group Survey
Visual Report
A/B Test
Techniques for faster and better decisions
Iterative and Incremental development
Independent Teams
When is your next release? Could it be earlier?
Do you have a solid hypothesis and measurable KPIs for it?
Which measurements could you be using instead of
assuming the user’s preference?
Which of your meetings in the next 2 weeks could be
replaced by a lean experiment?
© 2020 ThoughtWorks
Thank you
Irene Torres
Klaus Fleerkötter
© 2020 ThoughtWorks
Questions?
© 2020 ThoughtWorks
#talk5-when-in-doubt-go-live
Irene Torres
Klaus Fleerkötter