Building an experimentation framework

OSCON talk on building a simple but powerful framework for feature ramp ups, A/B and multivariate testing, and other types of experiments in web apps.

Published in: Technology, Education

  1. Building an experimentation framework for web apps. Zhi-Da Zhong, zz@etsy.com. Tuesday, July 26, 2011
  2. About the talk: Why • What • Framework • Break / hack • Tech Details • Test design • Analysis
  3. Why?
  4. Questions: “What will happen if I do X?” “Is X better than Y?”
  5. The future & alternate universes (We’re bad at those.)
  6. Then what?
  7. Experiments
  8. Experiments: Try it out.
  9. Experiments: Try it out. Data beats speculation.
  10. Experiments: Try different alternatives on different people.
  11. Experiments: Try different alternatives on different people.
  12. Which is better? vs.
  13. Not a great experiment
  14. Web apps
  15. Front end experiments • Layout, colors, images, copy, ... • No functional changes • Impact can be surprisingly high
  16. A little more complex... • Multipage flows • Functionality changes
  17. Backend experiments • Why not? • Algorithms, architectures, batch processes, ...
  18. The Etsy search backend • New algorithm • New RPC protocol • New result data structure • New Solr trunk snapshot [Diagram labels: Web app; search(); searchA(), searchB(); Search cluster A, Search cluster B]
  19. DB re-architecture • Postgres => Sharded MySQL • Multiple experiments
  20. Whole new features: New pages + New DB tables + New batch jobs + ...
  21. Not just 2 variants • A/B/C... tests • Multivariate tests
  22. Caveats • Content not under your control • Price tests? • Hard-to-measure/quantify things • Long term impact?
  23. Other tests • Internal user testing • Whitelisted user testing
  24. Opt-in experiments
  25. Complementary techniques • Observed/recorded testing: show different people the same thing • Side-by-side testing: show each person 2 alternatives
  26. Side-by-side testing
  27. How
  28. A common approach • JS-based • Non-techie UI • “No IT!” • “Designed For Marketers, By Marketers”
  29. Our approach • The developer is the user • Code as configuration • An integral part of the dev process
  30. Developer as the user • The builder of the feature writes the test • Not just a marketing tool
  31. Code as config • Simplicity • Expressivity • Quality • Version => complete system state • Revision history
  32. Part of the dev process: Every change is an experiment!
  33. What does it look like?
  34. (no text on slide)
  35. Default => Experiment => (new) Default
  36. To add a new feature...
      + $config['new_search'] = array(
      +     'enabled' => 'off'
      + );
        function search() {
      +     if ($cfg->isEnabled('new_search')) {
      +         return do_new_search();
      +     }
            // existing stuff
        }
  37. Deploy that
  38. Now we go crazy...
        function do_new_search() {
            // exciting new stuff
            // that might or might not work
            // but we can deploy it anyway
            // since it's flagged off
        }
  39. Internal user testing
        $config['new_search'] = array(
      +     'enabled' => 'rampup',
      +     'rampup' => array(
      +         'admin' => true
            )
        );
  40. Whitelists
        $config['new_search'] = array(
            'enabled' => 'rampup',
            'rampup' => array(
      +         'whitelist' => array('zhida'),
                'admin' => true
            )
        );
  41. Opt-in experiments
        $config['new_search'] = array(
            'enabled' => 'rampup',
            'rampup' => array(
      +         'group' => 12345,
                'admin' => true
            )
        );
  42. A/B
        $config['new_search'] = array(
            'enabled' => 'rampup',
            'rampup' => array(
      +         'percent' => 1.5,
                'admin' => true
            )
        );
  43. If it works...
        $config['new_search'] = array(
      +     'enabled' => 'on'
        );
  44. Order matters: Whitelist / Blacklist > Internal > Opt-in > Random
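
The flag settings in slides 36 to 44 could be evaluated roughly as follows. This is a minimal sketch in Python rather than the PHP used on the slides. The names `is_enabled`, `User`, and `bucket`, and the use of MD5 for the percent check, are assumptions for illustration; the slides show only the config shape and the precedence order. The blacklist mentioned on slide 44 is left out because the slides do not show its config key.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class User:
    """Hypothetical subject; the field names are assumptions."""
    name: str
    is_admin: bool = False
    groups: set = field(default_factory=set)


def bucket(feature: str, user_id: str) -> float:
    """Map (feature, user) to a stable number in [0, 100)."""
    digest = hashlib.md5(f"{feature}:{user_id}".encode()).hexdigest()
    return int(digest[:8], 16) / 0x100000000 * 100


def is_enabled(config: dict, feature: str, user: User) -> bool:
    """Decide whether `feature` is on for `user` under the slide-style config."""
    settings = config.get(feature, {"enabled": "off"})
    mode = settings.get("enabled", "off")
    if mode in ("on", "off"):
        return mode == "on"
    rampup = settings.get("rampup", {})
    # Precedence from slide 44: whitelist > internal > opt-in > random.
    if user.name in rampup.get("whitelist", []):
        return True
    if rampup.get("admin") and user.is_admin:
        return True
    if rampup.get("group") in user.groups:
        return True
    # 'percent' => 1.5 is read here as 1.5% of subjects (an assumption).
    return bucket(feature, user.name) < rampup.get("percent", 0)
```

With only positive rules the order of the checks does not change the outcome; it starts to matter once a blacklist has to override the other rules.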
  45. The framework
  46. As easy as...
  47. As easy as... 1. Pick a variant
  48. As easy as... 1. Pick a variant 2. Do what it says
  49. As easy as... 1. Pick a variant 2. Do what it says 3. Log the event
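
The three steps above can be sketched in a few lines. This is an illustrative Python outline, not the framework's actual API: `run_test`, `log_event`, and the in-memory `events` list are hypothetical names, and the logger's argument order simply mirrors the `log($testKey, $variantKey, $subjectKey)` signature shown later in the talk.

```python
from typing import Callable, Dict, List, Tuple

# Stand-in for a real event log (assumption: a list is enough for a sketch).
events: List[Tuple[str, str, str]] = []


def log_event(test_key: str, variant_key: str, subject_key: str) -> None:
    """Record which subject saw which variant of which test."""
    events.append((test_key, variant_key, subject_key))


def run_test(
    test_key: str,
    subject_id: str,
    select: Callable[[str], str],
    handlers: Dict[str, Callable[[], str]],
) -> str:
    variant = select(subject_id)              # 1. pick a variant
    result = handlers[variant]()              # 2. do what it says
    log_event(test_key, variant, subject_id)  # 3. log the event
    return result
```

A caller supplies the selector and one handler per variant, for example `run_test("search_test", "42", my_selector, {"A": old_search, "B": new_search})`, where those three callables are the application's own.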
  50. What's in a test?
  51. Variants • Key-value pairs • interpreted by the app • Name • mostly for logging
  52. SubjectIdProvider: function getID() • Why? • hashing and other selectors • logging • Types of subjects • Users...but not always • Different groups of users: sellers vs buyers, etc. • Different ways to identify them: signed in vs signed out
  53. Selectors: function select($subjectID) => Variant Name
  54. Combining multiple selectors • OR • breaks blacklists • AND • breaks whitelists • Sequence • works!
  55. Selector sequence • Defines an ordering • Returns A/B/C/... or <don't care>
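
One way to read the selector sequence is as a chain in which each selector either names a variant or passes. The Python sketch below assumes that `None` stands for the slide's "don't care" and that the first selector with an opinion wins; the helper names (`whitelist_selector`, `blacklist_selector`, `select_in_sequence`) are hypothetical.

```python
from typing import Callable, Iterable, Optional, Sequence

# A selector maps a subject ID to a variant name, or None for "don't care".
Selector = Callable[[str], Optional[str]]


def whitelist_selector(names: Iterable[str], variant: str) -> Selector:
    """Force `variant` for listed subjects; pass on everyone else."""
    allowed = set(names)
    return lambda subject_id: variant if subject_id in allowed else None


def blacklist_selector(names: Iterable[str], control: str) -> Selector:
    """Force the control variant for listed subjects; pass on everyone else."""
    banned = set(names)
    return lambda subject_id: control if subject_id in banned else None


def select_in_sequence(
    selectors: Sequence[Selector], subject_id: str, default: str
) -> str:
    """Return the first definite answer, falling back to `default`."""
    for selector in selectors:
        choice = selector(subject_id)
        if choice is not None:
            return choice
    return default
```

Because the blacklist can sit first in the sequence, it overrides a later whitelist, which a plain OR of selectors could not do; likewise a whitelist can admit a subject without every later selector agreeing, which AND would require.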
  56. Loggers: function log($testKey, $variantKey, $subjectKey)
  57. More => better • More data • More ways to track • access logs • 3P analytics • custom
  58. Access log augmentation • Apache note • Lots of log analysis tools • grep • $$
  59. 3P Analytics • Quick to start • May be cheap • Volume? • Lag time? • Flexibility / customization?
  60. 3P Analytics - how • Custom variables • take note of number & size limits • Custom segments • Canned metrics
  61. 3P Analytics - example
        <script type="text/javascript">
          var pageTracker = _gat._getTracker("UA-1234567-8");
          pageTracker._initData();
          pageTracker._setCustomVar(2, "AB", "search_test.variantC", 3);
          pageTracker._trackPageview();
        </script>
  62. Our own event tracking • HTML beacons • Hadoop • Cloud [Diagram labels: Web app; HTML, JS; event beacon; Event log; Hadoop; Results]
  63. Break / hack: https://github.com/etsy/ab
  64. Building on top of the core API
  65. Test builders • Capture common patterns • feature ramp ups • opt-in experiments • Help with test design • weight equalization • multivariate testing
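
As an illustration of what a multivariate test builder might do, the Python sketch below expands a set of factors into one variant per combination and gives every variant the same weight. The function name `build_multivariate`, the variant dictionary layout, and the equal-weight choice are all assumptions; the slides name the pattern but do not show an implementation.

```python
from itertools import product
from typing import Dict, List, Sequence


def build_multivariate(factors: Dict[str, Sequence[str]]) -> List[dict]:
    """Build one equally weighted variant per combination of factor levels."""
    names = sorted(factors)  # fixed order so variant names are stable
    combos = list(product(*(factors[name] for name in names)))
    weight = 1.0 / len(combos)
    return [
        {
            "name": "-".join(combo),                 # used for logging
            "properties": dict(zip(names, combo)),   # interpreted by the app
            "weight": weight,
        }
        for combo in combos
    ]
```

Two factors with two levels each, such as button color and page layout, yield four variants; the app reads `properties` to decide what to render and logs `name`.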
  66. Automatic Dispatchers • Separate dispatching and work • Work with components that have well-defined invocation APIs • Define a particular level of granularity • Feel like magic
  67. Dispatcher example - MVC • View dispatch • Controller dispatch • Spring framework, etc.
  68. Selector Registry • Reuse • Clarity • Documentation
        $selectorReg = array(
            'staff' => 'InternalUserSelector',
            'whitelist' => 'WhitelistSelector',
            'percent' => 'WeightedSelector'
        );
  69. Randomized Selector
  70. What does it mean?
  71. What does it mean? • Independent of subject attributes
  72. What does it mean? • Independent of subject attributes • Independent of other tests
  73. What does it mean? • Independent of subject attributes • Independent of other tests • Independent of (coarse-grained) time
  74. Persistence
  75. Persistence • Better experience
  76. Persistence • Better experience • Better data
  77. Persistence • Better experience • Better data • Multi-part tests
  78. Persistence • Better experience • Better data • Multi-part tests • ...but not forever
  79. Ramping up/down • Vary group sizes • Reduce risk • Distribute load
  80. Persistence + Ramping • Minimize inconsistency • Ramping up • Should just add people to the treatment group • Ramping down • Should just remove part of the treatment group
  81. rand() • Explicit persistence • Cookie • DB • Scaling • Maintenance
  82. Hashing: variant = H(id)
  83. Hashing: variant = H(id) • Persistence
  84. Hashing: variant = H(id) • Persistence
  85. Hashing: variant = H(id) • Attribute independence • Persistence
  86. Hashing: variant = H(id) • Persistence • Attribute independence
  87. Hashing: variant = H(id) • Test independence? • Persistence • Attribute independence
  88. Hashing: variant = H(test id, id) • Test independence • Persistence • Attribute independence
  89. Hashing: variant = H(test id, id) • Persistence • Attribute independence • Test independence
  90. Hashing: variant = H(test id, id) • What else? • Persistence • Attribute independence • Test independence
  91. Hashing: variant = H(test id, id) • Weights! • Persistence • Attribute independence • Test independence
  92. Hashing: h = H(test id, id) • Persistence • Attribute independence • Test independence
  93. Hashing: h = H(test id, id); variant = P(h, weights) • Persistence • Attribute independence • Test independence
  94. Partitioning [Diagram: hash range from 0 to 1]
  95. Partitioning [Diagram: hash range from 0 to 1, partition at .5]
  96. Partitioning [Diagram: hash range from 0 to 1, partition at .5, A below it and B above]
  97. Ramping up [Diagram: hash range from 0 to 1, partition moved to .7, A below it and B above]
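
The two-stage scheme on the hashing and partitioning slides, h = H(test id, id) followed by variant = P(h, weights), can be sketched as below. This Python version assumes MD5 as H and a simple cumulative-weight walk as P; `hash_unit`, `partition`, and `pick_variant` are hypothetical names, not the framework's API.

```python
import hashlib
from typing import List, Tuple


def hash_unit(test_id: str, subject_id: str) -> float:
    """H(test id, id): a stable number in [0, 1) for this test and subject."""
    digest = hashlib.md5(f"{test_id}:{subject_id}".encode()).digest()
    return int.from_bytes(digest[:4], "big") / 2**32


def partition(h: float, weights: List[Tuple[str, float]]) -> str:
    """P(h, weights): walk the cumulative weights until h falls in a slot."""
    total = sum(weight for _, weight in weights)
    upper = 0.0
    for name, weight in weights:
        upper += weight / total
        if h < upper:
            return name
    return weights[-1][0]  # guard against floating-point rounding at the top


def pick_variant(test_id: str, subject_id: str,
                 weights: List[Tuple[str, float]]) -> str:
    return partition(hash_unit(test_id, subject_id), weights)
```

Because a subject's hash never changes, moving the partition from .5 to .7 only reassigns subjects whose hash lies between the two points: everyone already in A stays in A, which is the "should just add people" property from the Persistence + Ramping slide. Including the test ID in the hash input gives each test its own assignment of subjects.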
  98. Which hash function? • MD5/SHA-256/... • Test it! • But be careful...
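
A rough way to "test it" is to hash a batch of IDs and count how they land. The Python sketch below assumes MD5 and sequential numeric IDs; `hash_bucket`, `bucket_counts`, and `joint_counts` are hypothetical helpers. The first count checks that buckets fill evenly, the second that assignments in two different tests look unrelated.

```python
import hashlib
from collections import Counter


def hash_bucket(test_id: str, subject_id: str, buckets: int = 10) -> int:
    """Assign a subject to one of `buckets` slots for a given test."""
    digest = hashlib.md5(f"{test_id}:{subject_id}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % buckets


def bucket_counts(test_id: str, n: int, buckets: int = 10) -> Counter:
    """Count how many of the IDs 0..n-1 land in each bucket."""
    return Counter(hash_bucket(test_id, str(i), buckets) for i in range(n))


def joint_counts(test_a: str, test_b: str, n: int) -> Counter:
    """Count (variant in test A, variant in test B) pairs for two-way splits."""
    return Counter(
        (hash_bucket(test_a, str(i), 2), hash_bucket(test_b, str(i), 2))
        for i in range(n)
    )
```

For a well-behaved hash, each of the 10 buckets should hold close to n/10 subjects and each of the four joint cells close to n/4. A formal check such as a chi-squared test would be the natural next step.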
  99. A/B + opt-in • Need to separate the groups for analysis • Solution: use more than 2 variants! • Act according to variant properties • Track by variant name
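
The "more than 2 variants" idea can be sketched as follows: opted-in subjects get their own variant name but share the treatment's properties, so the app behaves identically while the logs keep the groups apart. The variant names, the `new_search` property, and the helper functions are illustrative assumptions in Python.

```python
from typing import Callable, Dict, List, Set, Tuple

# Two variants share the same behavior but keep distinct names for analysis.
VARIANTS: Dict[str, Dict[str, bool]] = {
    "control": {"new_search": False},
    "treatment": {"new_search": True},
    "opt_in": {"new_search": True},
}


def choose_variant(subject_id: str, opted_in: Set[str],
                   random_pick: Callable[[str], str]) -> str:
    """Opt-in takes precedence over the randomized assignment."""
    if subject_id in opted_in:
        return "opt_in"
    return random_pick(subject_id)


def handle_request(subject_id: str, opted_in: Set[str],
                   random_pick: Callable[[str], str],
                   log: List[Tuple[str, str, str]]) -> str:
    name = choose_variant(subject_id, opted_in, random_pick)
    log.append(("search_test", name, subject_id))  # track by variant name
    # Act according to variant properties, not the name.
    return "new results" if VARIANTS[name]["new_search"] else "old results"
```

Analysis can then compare `control` with `treatment` alone, leaving the self-selected `opt_in` group out of the randomized comparison.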
  100. Analysis
  101. ... Confidence interval ... something something ... Binomial ... blah blah ...
  102. Confidence Intervals • How sure are we? • What if it were random?
  103. Binomial experiments
  104. Binomial experiments: HT HTTT HT H H
  105. Binomial experiments: HT HTTT HT H H T HT HTT H HT H
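
Treating each visit as a coin flip (converted or not), a confidence interval for a conversion rate can be approximated as below. The talk does not say which interval it uses; this Python sketch assumes the normal (Wald) approximation with z = 1.96 for roughly 95% confidence, and the function names are hypothetical.

```python
import math
from typing import Tuple


def proportion_ci(successes: int, trials: int,
                  z: float = 1.96) -> Tuple[float, float]:
    """Approximate confidence interval for a single conversion rate."""
    p = successes / trials
    half_width = z * math.sqrt(p * (1 - p) / trials)
    return max(0.0, p - half_width), min(1.0, p + half_width)


def difference_ci(succ_a: int, n_a: int, succ_b: int, n_b: int,
                  z: float = 1.96) -> Tuple[float, float]:
    """Approximate confidence interval for rate(B) - rate(A)."""
    p_a, p_b = succ_a / n_a, succ_b / n_b
    std_err = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    diff = p_b - p_a
    return diff - z * std_err, diff + z * std_err
```

If the interval for the difference excludes zero, the observed gap is unlikely to be chance alone at that confidence level. The approximation is weakest for small samples or rates near 0 or 1.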
  106. Results
  107. Dashboards
  108. A few test design tips
  109. What's the question?
  110. What's the question? What metrics?
  111. What's the question? What metrics? How much better?
  112. Who? • Different roles • Old vs new • Novelty • Habit • Expectation
  113. When? • User types vary • Activity patterns vary • Site content might vary • Performance might vary • Full weeks are often a good starting point
  114. Summary
  115. Better living through experimentation • More risk taking => better product • MTTR • Lower stress
  116. You can too.
