2. Motivation
• Innovation iteration -> correct evaluation
– Blindingly obvious
– Clear, but requires involved deductive reasoning
– A/B Testing
• Segment-based optimization
• Multi-dimensional and stochastic impact
• Incremental Radicalism
• Disclaimer: Some parts of this platform already exist; more will come to life, and we will solicit more inputs and involvement
3. Experimentation Platform
Components
• Bucketing (A or B), sketched in code after this list
– Web Bucketing on User Cohorts
– Supply Chain Bucketing on Order Basket or Warehouse (e.g. Packing)
• Control variables – what is being tested
– Price
– Gift Wrap
– Position on Web Page
– Recommendation Positioning
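Returning to the bucketing bullet above: a minimal sketch of deterministic, hash-based bucket assignment. The function name and hashing scheme are illustrative assumptions, not the platform's actual library.

    import hashlib

    def bucket(user_id: str, experiment: str, treatment_fraction: float = 0.5) -> str:
        # Hash (experiment, user) so assignment is sticky per user and
        # independent across experiments.
        digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
        point = int(digest[:8], 16) / 0xFFFFFFFF  # uniform in [0, 1]
        return "B" if point < treatment_fraction else "A"

    print(bucket("user-42", "gift-wrap-test"))  # same bucket on every call

Hashing the (experiment, user) pair rather than the user alone keeps one experiment's split uncorrelated with another's.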
4. Experimentation Platform
• Result variables (often studied for a week to a month)
– Repeat Visit
– Repeat Buy
– Repeat Engagement
– Spend
• Result interpretation
– Z-test
– T-test
– Chi Squared
5. Bucketing (Web)
• Bucketing: Declarative Common Cohorts
– User (sync): Cohorts are complex queries, run async when sufficiently complex, e.g.
• Users who bought books with increasing spend but did not buy electronics
• User Activity Store: searches, clicks, views, etc.
• Cached and hit at web scale
• Cohorts can be selected declaratively (see the sketch after this list), e.g.
– Category Purchased
– Search Ranking
– Email Marketing
– Spend slope
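A minimal sketch of what declarative cohort selection could look like: the spec is data, so no code is needed per cohort. Field names and operators here are illustrative assumptions, not the platform's actual schema.

    # Hypothetical declarative cohort spec; evaluated against a user profile.
    cohort = {
        "all": [
            {"field": "category_purchased", "op": "contains", "value": "Books"},
            {"field": "category_purchased", "op": "not_contains", "value": "Electronics"},
            {"field": "spend_slope", "op": ">", "value": 0},
        ]
    }

    OPS = {
        ">": lambda a, b: a > b,
        "contains": lambda a, b: b in a,
        "not_contains": lambda a, b: b not in a,
    }

    def matches(user: dict, spec: dict) -> bool:
        # A user is in the cohort when every predicate holds.
        return all(OPS[c["op"]](user[c["field"]], c["value"]) for c in spec["all"])

    user = {"category_purchased": ["Books"], "spend_slope": 0.3}
    print(matches(user, cohort))  # True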
6. Bucketing (Fulfilment)
– Order Fulfilment (async): Rules
• RETE evaluation of rules: predicates are evaluated a minimal number of times, even across ~1000 rules
• Async process => on-the-fly evaluation
– Interaction Plots need to be looked into for multiple experiments
– Exclusive buckets on control variables (sketch below)
• e.g. 2 experiments cannot both decide on gift wrap
• Price cannot be influenced by 2 different experiments
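A minimal sketch of enforcing that exclusivity, assuming a central registry that experiments claim control variables from; class and method names are illustrative.

    class ControlVariableRegistry:
        def __init__(self):
            self._owner = {}  # control variable -> owning experiment

        def claim(self, experiment: str, variable: str) -> None:
            holder = self._owner.get(variable)
            if holder and holder != experiment:
                raise ValueError(f"{variable!r} is already controlled by {holder!r}; "
                                 f"{experiment!r} cannot also decide it")
            self._owner[variable] = experiment

    registry = ControlVariableRegistry()
    registry.claim("exp-12", "gift_wrap")
    registry.claim("exp-45", "price")
    try:
        registry.claim("exp-99", "gift_wrap")  # conflict: rejected
    except ValueError as e:
        print(e)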
7. Control Variables
• Control Variables: configuration-based delta
– Price elasticity
– Position on page
– Recommendation
– Gift Wrap
– Business Flow (e.g. in Mumbai, a new packing technique) => BPM
8. Execution
– Client Library to evaluate
– if (experiment45) { ….. }
– Configuration-based deviators (sketch below)
• Better still, evaluate an experiment deviator, e.g.
• SLA = SLA - experimentDelta (experimenting with early delivery)
– experimentDelta comes from config service
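A minimal sketch of such a deviator, assuming the delta is read from a config service; fetch_config is a hypothetical stand-in for that lookup.

    def fetch_config(key: str, default: float = 0.0) -> float:
        # Stand-in for the config service call.
        return {"sla.experimentDelta": 1.0}.get(key, default)

    def effective_sla(base_sla_days: float, in_treatment: bool) -> float:
        # Apply the experiment's delta only to bucketed (treatment) traffic.
        if not in_treatment:
            return base_sla_days
        return base_sla_days - fetch_config("sla.experimentDelta")

    print(effective_sla(4.0, in_treatment=True))   # 3.0: early-delivery experiment
    print(effective_sla(4.0, in_treatment=False))  # 4.0: control unchanged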
• Multi-armed bandit to apply the changes?
– 90% greedy and 10% random (epsilon-greedy, epsilon = 0.1)
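A minimal epsilon-greedy bandit sketch matching the 90/10 split above; the reward bookkeeping is illustrative.

    import random

    class EpsilonGreedy:
        def __init__(self, arms, epsilon=0.1):
            self.epsilon = epsilon
            self.counts = {a: 0 for a in arms}
            self.values = {a: 0.0 for a in arms}  # running mean reward per arm

        def select(self):
            if random.random() < self.epsilon:            # 10%: explore at random
                return random.choice(list(self.counts))
            return max(self.values, key=self.values.get)  # 90%: exploit the best arm

        def update(self, arm, reward):
            self.counts[arm] += 1
            self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

    bandit = EpsilonGreedy(["control", "treatment"])
    arm = bandit.select()
    bandit.update(arm, reward=1.0)  # e.g. 1.0 if the order converted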
9. Binomial at Large # -> Normal
• Binomial (most human decisions) -> Normal for large n:
(p + q)^n = Σ_r C(n,r) p^r q^(n-r)
Y_r = C(n,r) p^r q^(n-r)
Consider (Y_{r+1} - Y_r)/Y_r for large n; in the limit
dY/Y = -(x/σ²) dx => Y ∝ exp(-x²/(2σ²)), the normal curve with σ² = npq
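A quick numerical check of that limit using scipy's standard distributions: for large n the binomial pmf closely tracks the normal density with μ = np and σ² = npq.

    from scipy.stats import binom, norm

    n, p = 10_000, 0.3
    mu, sigma = n * p, (n * p * (1 - p)) ** 0.5  # sigma^2 = npq

    for r in (2900, 3000, 3100):
        exact = binom.pmf(r, n, p)       # binomial probability
        approx = norm.pdf(r, mu, sigma)  # normal approximation
        print(f"r={r}: binomial={exact:.6f}  normal={approx:.6f}")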
10. Interaction Plot
– From Peltier Stats on OKCupid data
– Smile: no interaction with eye contact
– Flirty face: significant interaction
Beware of interaction between experiments
11. Result Interpretation
– T-test: samples of fewer than 30 [fatter tails]
– Z-test: z = (x - μ)/σ, significant at |z| > 1.96 [Normal]
– Paired t-test: Return/Refund-> Gift -> Repeat Buys
– Chi Squared
– F test
• Do we lose anything by repeated testing until test convergence?
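A minimal sketch of the tests above with scipy, on synthetic data (the numbers are illustrative, not real results):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    a = rng.normal(100.0, 15.0, 25)  # spend in bucket A; n < 30 -> t-test
    b = rng.normal(108.0, 15.0, 25)  # spend in bucket B

    t, p = stats.ttest_ind(a, b)
    print(f"t-test: t={t:.2f}, p={p:.4f}")

    # Chi-squared on a 2x2 table: [converted, not converted] per bucket
    table = [[120, 880], [150, 850]]
    chi2, p, dof, _ = stats.chi2_contingency(table)
    print(f"chi-squared: chi2={chi2:.2f}, p={p:.4f}")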
12. Development Paradigm
– Simplify during experiment
– Scalability: Build experiment to work out of memory
– Availability: Fail-Open (sketch after this list)
– Sharding and Database: Not big scale
– Performance: In Memory for a few nodes
– Figure out control variables
Upper bound of expected results -> 90% of experiments may not need to be scaled out
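A minimal sketch of fail-open behaviour: if the experiment lookup fails, serve the control experience rather than failing the request. lookup_bucket simulates an outage here and is illustrative.

    def in_experiment(user_id: str, experiment: str) -> bool:
        try:
            return lookup_bucket(user_id, experiment) == "B"  # remote call
        except Exception:
            return False  # fail-open: fall back to control, never break the page

    def lookup_bucket(user_id: str, experiment: str) -> str:
        raise TimeoutError("experiment service unavailable")  # simulated outage

    print(in_experiment("user-42", "gift-wrap-test"))  # False: control served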
13. Decision Paradigm
– No code needed to test an idea
– Experiments run in parallel
– Need to test for interaction and main effects
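A minimal sketch of testing main effects and their interaction across two concurrent experiments with a two-way ANOVA; the data is synthetic and the column names are illustrative.

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.formula.api import ols

    rng = np.random.default_rng(1)
    df = pd.DataFrame({
        "expA": rng.choice(["control", "treatment"], 400),
        "expB": rng.choice(["control", "treatment"], 400),
    })
    # Simulate a main effect of expA only; no real interaction.
    df["spend"] = 100 + 5 * (df.expA == "treatment") + rng.normal(0, 10, 400)

    model = ols("spend ~ C(expA) * C(expB)", data=df).fit()
    print(sm.stats.anova_lm(model, typ=2))  # rows for expA, expB, and expA:expB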
15. Summary
• A/B testing platform becomes key beyond the trivially obvious
• Configuration-based A/B tests (trivial to check a curiosity)
• Result interpretation is non-trivial and varies
Editor's Notes
Don’t be brave, you will be wrong. Your predecessors were bright too; a good breakfast enhanced your mood more than your IQ. Experiment without fear. Free to experiment, but not free to put things into production until sure that it will help. Try every experiment. Enable everyone in the company to experiment.