[@IndeedEng] Managing Experiments and Behavior Dynamically with Proctor

Video available at: http://youtu.be/Q1T5J0KXUwY

At this very moment, Indeed is running more than one hundred A/B experiments. In previous @IndeedEng talks, we have discussed how we use A/B testing to develop better products.

In this tech talk, software engineer Matt Schemmel and product manager Tom Bergman describe Proctor, the system we developed to define and manage all of these experiments. They explain how we use Proctor to target users with data-driven rules, adjust experiments on the fly, and ensure clean results for multivariate tests. Over time, Proctor has evolved from a system for managing experiments into one that manages overall system behavior through dynamic "feature toggle" functionality. Matt and Tom also share lessons we have learned from years of experimenting at web scale.

Matt Schemmel is a Senior Software Engineer working primarily on our Resume products.

Tom Bergman is a Product Manager currently working on our Aggregation systems. He previously helped evolve many of Indeed's data analysis tools, and also helped us launch and grow our sites in Japan, Korea, and China.

Published in: Technology, Business

  1. 1. Proctor Managing A/B Tests and More
  2. 2. Tom Bergman Product Manager Aggregation Matt Schemmel Software Engineer Resume
  3. 3. We help people get jobs.
  4. 4. What's best for the job seeker?
  5. 5. Test & Measure EVERYTHING
  6. 6. A/B Testing: Definition A/B testing is an experimental methodology comparing at least two variants, a control group A and test group B, in a controlled experiment
  7. 7. A/B Testing Key Points Test and Control Groups should be: 1. Unbiased 2. Independent 3. Representative
  8. 8. 103 tests 315 variations 2^147 combinations
  9. 9. Control
  10. 10. Control 10% test 10% test 10% test 10% test 10% test 10% test
  11. 11. Control +2.9% +2.0% +2.3% +12.8% +5.2% +9.6%
  12. 12. Control +2.9% +2.0% +2.3% +12.8% +5.2% +9.6% +614M emails
  13. 13. Why A/B Testing?
  14. 14. Before and After Before and After is bad science.
  15. 15. Visitors Weekly Traffic Mon Tues Wed Thur Fri
  16. 16. Visitors Yearly Traffic
  17. 17. Mid Year Test B Visitors A A<B
  18. 18. End of Year Test A B A>B
  19. 19. Obligatory XKCD Comic
  20. 20. History of A/B Testing @Indeed
  21. 21. Next we tried ... ● Multiple Code Versions ● Separate Configuration ● "Sampling by Load Balancer"
  22. 22. Load Balancer: Multiple Versions Load Balancer CONTROL (Old Version Code) TEST (New Version Code)
  23. 23. Load Balancer: Multiple Versions It worked, but ... 1. Tedious 2. Expensive 3. Inflexible
  24. 24. Finally ... Built Libraries, hand-write code per test to: 1. Arbitrarily Group Users 2. Select Test Groups 3. Implement Variations
  25. 25. Custom Coded Tests Allowed us: 1. Sophisticated Tests 2. Scientifically Valid Methods 3. Low Operational Overhead
  26. 26. Custom Coded Tests: Stats
  27. 27. Goals: 1. Increase Engineering Velocity 2. Standardize Representation 3. Work Seamlessly Across Products
  28. 28. Proctor Indeed’s Java Framework for Managing A/B Tests and More
  29. 29. Proctor Indeed's Open Source Java Framework for Managing A/B Tests and More github.com/indeedeng/proctor
  30. 30. Using Proctor 1. Background and Design 2. Running A/B Tests with Proctor 3. Beyond the Basics
  31. 31. Background and Design
  32. 32. Running a Test 1. Define the Experiment 2. Select Groups 3. Implement the Behavior 4. Log the Results
  33. 33. Running a Test 1. Define the Experiment 2. Select Groups 3. Implement the Behavior 4. Log the Results
  34. 34. Existing Behavior
  35. 35. Test Behavior Save Alert
  36. 36. Define the Experiment Key characteristics: 1. Buckets 2. Sample Sizes
  37. 37. 50% Control: Gray 50% Test: Blue
  38. 38. Division of Responsibilities Define the Experiment Apply the Experiment (global) (each product) Proctor Library Test Definition Test Specification
  39. 39. Buckets Enumerate the Test Variations ● ID, for code ● Long Description, for people ● Short Name, for people
  40. 40. 0 "Control Group" Gray 1 "Test Group" Blue
  41. 41. Sizing the Buckets 1. Buckets 2. Sample Sizes
  42. 42. Selecting a Test Bucket Good science requires good sampling: ● Independent ● Unbiased Good user experience does, too: ● Fast ● Consistent
  43. 43. Round robin assignment Assign each subsequent visitor to the next bucket. ● Requires global state for "next bucket" ● Requires state for assigned buckets ✘Fast ~ Consistent ✓Independent ✓Unbiased
  44. 44. Randomized Assignment At small scale, you might need round-robin to ensure equal sample sizes. At large scale, randomized assignment is uniform enough. ? Fast ? Consistent ✓Independent ✓Unbiased
  45. 45. Roll the dice as needed Select a bucket at random at the point of execution. ✓Fast ✘Consistent ✓Independent ✓Unbiased
  46. 46. Roll Once and Cache in a Cookie ● Single-domain, Single-device ● N cookies: Hard to evolve ● One cookie: Fragile to edit ● Size scales with # experiments ~ Fast ~ Consistent ✓Independent ✓Unbiased
  47. 47. Roll Once and Cache in Session ● Consistent only to length of session ● Tied to one server / data-center ● Many apps don’t use sessions ~ Fast ✘Consistent ✓Independent ✓Unbiased
  48. 48. Roll Dice and Cache in DB ● DB hit on every request ● More infrastructure ✘Fast ~ Consistent ✓Independent ✓Unbiased
  49. 49. We can do better Flaws stem from the need to record selected buckets. What if we didn't?
  50. 50. Don’t Record. Recalculate. 1. Assign each user a unique ID 2. Map that ID to a bucket 3. Store the ID, not the assignments ? Fast ? Consistent ? Independent ? Unbiased
  51. 51. Simple Mapping: Mod N id mod N=> bucket Doesn’t work: ● Should provide uniform distribution; mod N assumes it. ● Limited bucket distributions
  52. 52. Range Mapping id / MAX_ID => bucket control 0 test 0.5 1
  53. 53. Buckets can be any size control test 0 control 0 0.5 1 (inactive) 1 test 0.5 1
  54. 54. Sequential IDs No Longer Uniform MAX_ID 2 control 0 test 0.5 ✘Unbiased 1
  55. 55. Hashed Range Mapping hash ( id ) => bucket control MIN_INT test 0 MAX_INT Kept: ● Arbitrary bucket allocations ok
  56. 56. Unbiased Distribution for Any ID 50 / 50: 33 / 33 / 33: ✓Unbiased
  57. 57. But is it independent? Sign Up vs Activate Sign Up vs Sign Up
  58. 58. Should look like this 25% Sign Up 25% Sign Up 25% Activate 25% Activate
  59. 59. But our inputs are consistent hash ( id ) => bucket control MIN_INT test 0 MAX_INT
  60. 60. So our buckets are identical S A S S A S A A A S A S A S A S S A S A A A S A S A
  61. 61. And we look like this Sign Up / Sign Up: 50% Sign Up / Activate: 0% Activate / Sign Up: 0% Activate / Activate: 50% ✘Independent
  62. 62. Add Salt to Test hash ( id + test.salt ) => bucket Kept: ● Arbitrary bucket allocations ● Uniform distribution
  63. 63. Uncorrelated Distribution A S A S A S S A A A S S A A S A S A S S A A A S S A ✓Independent
  64. 64. But is it fast? 0.90 0.85 0.80 0.75 0.70 0.65 0.60 Resume Editor Resume Search
  65. 65. But is it consistent? Consistency bounded only by ID
  66. 66. We Usually Use Tracking Cookies ● Easy ● Ubiquitous on the web ● Require no server-side storage ● Best we can do with no user action ~ Consistent
  67. 67. Best we’ve seen so far… ✓Fast ~ Consistent ✓Independent ✓Unbiased
  68. 68. Definitions Map Buckets to ID Range Each bucket maps to a % of the hashed range Bucket Range gray 0.50 blue 0.50
  69. 69. Sometimes, Though, Cookies Won't Do ● Some People Block Cookies ● Cross-Domain ● Cross-Device ● Cookies are Web-Only
  70. 70. Many Ways to ID a User Session ID 557206C363F… Email Address me@indeed.com Tracking cookie: UID#1 Access Token: 4/rymOMYE… Account # 12345
  71. 71. Proctor Uses Any Set of IDs We use… ID Type... Tracked By... USER Tracking Cookie ACCOUNT Account ID EMAIL Email Address … …
  72. 72. Account ID ● Authenticated ● Consistent across domains ● Consistent across devices ● Consistent across visits
  73. 73. Email Address ● Sometimes available without account ● Identified, though not authenticated
  74. 74. Each Test Applies to One ID Type ● Test groups split by that identifier ● Visitors without that identifier are ignored
  75. 75. Running A/B Tests
  76. 76. Test Definitions Encoded in JSON ● Compact ● Simple and Flexible ● Editable by Humans ● Editable by Machines
  77. 77. Basic Data in the Test Definition "description": "Button colors", "salt": "buttonBgColorTst", "type": "USER"
  78. 78. Buckets in the Test Definition "buckets": [{ "id": 0, "name": "gray", "description": "Control group" }, { "id": 1, "name": "blue", "description": "Test group" }]
  79. 79. Mapping Buckets to Ranges "ranges": [{ "bucketValue": 0, "length": 0.5 }, { "bucketValue": 1, "length": 0.5 }]
  80. 80. Complete Test Definition { "description": "Button colors", "type": "USER", "salt": "buttonBgColorTst", "buckets": […], "allocations": [{ "ranges": […] }] }
  81. 81. Division of Responsibilities Define the Experiment Apply the Experiment proctor data (each product) Proctor Library Test Definition Test Specification
  82. 82. Proctor includes several modules Common Maven Builder Proctor Ant Builder Codegen
  83. 83. Product Test Specification lists active tests References into the global pool: "tests": [{ "buttonBgColorTst": { "buckets": { "gray": 0, "blue": 1 } } }]
  84. 84. Apply the Experiment On every request… 1. Select Groups 2. Render the Response 3. Log the Action
  85. 85. Determining Buckets in Code On every request… 1. Collect identifiers 2. Select buckets for opted-in tests
  86. 86. Collect identifiers for all ID Types // Product code String cookie = getTrackingCookie(request); String accountId = getAccountIdOrNull(request); // Proctor preparation Identifiers identifiers = Identifiers.of( TestType.USER, cookie, TestType.ACCOUNT, accountId );
  87. 87. Select Buckets for Opted-In Tests // Proctor preparation Identifiers identifiers = Identifiers.of( TestType.USER, trackingCookie, TestType.ACCOUNT, accountId ); // Proctor assignments ProctorResult assignments = proctor.determineBuckets(identifiers);
  88. 88. Apply the Experiment On every request… 1. Select Groups 2. Render the Response 3. Log the Action
  89. 89. Choose behavior for selected bucket int bgColorBucket; /* … */ // Choose a background color for templates if (bgColorBucket == 1) { // Test model.put("buttonBgColor", "#00f"); } else { // Control group model.put("buttonBgColor", "#ccc"); }
  90. 90. ProctorResult exposes buckets… verbosely // Proctor assignments ProctorResult assignments = proctor.determineBuckets(identifiers); // Get selected bucket for this user int bgColorBucket = assignments // Map<String, TestBucket>: All tests .getBuckets() // TestBucket: This assignment .get("buttonBgColorTst") // TestBucket // int: Enumerated ID .getValue();
  91. 91. "Redundant" names in test spec… "buttonBgColorTst": { "buckets": { "gray": 0, "blue": 1 } }
  92. 92. … are used to generate helper methods // Choose a background color for templates ResumeSearchGroups groups = new ResumeSearchGroups(assignments); // Enumerated value by test name groups.getButtonBgColorTstValue(); // Boolean accessors for each test & bucket groups.isButtonBgColorTstGray(); groups.isButtonBgColorTstBlue();
  93. 93. Helper designed for use in UI layer This immutable bean is trivial to: ● Read from JSP/JSF ● Read from Templates ○ Freemarker, Velocity, Closure, etc ● Serialize as JSON
  94. 94. Apply the Experiment On every request… 1. Select Groups 2. Render the Response 3. Log the Action
  95. 95. Logging Bucket Assignments Proctor just selects the buckets. When and how you log are up to you: ● On related events only ● On every event
  96. 96. Publication
  97. 97. Test Definitions in Source Control ● No new infrastructure ● Lots of desirable features for free History Diff Access Control
  98. 98. Proctor Data Test Definitions Publish Artifact Periodic Refresh App App Servers
  99. 99. Publication is also via Source Control Individual test changes pushed to a named branch: /trunk /branches/production
  100. 100. Overwriting Tests on a Named Branch Not required to use proctor, but beneficial: ● Same features for free History, Diff, ACL ● No merging ● Easy roll-back, roll-forward
  101. 101. Proctor Data Project Test Definitions Test Specifications Compile Publish Deliverable Build Servers Deploy Artifact Periodic Refresh App App Servers
  102. 102. Beyond the Basics
  103. 103. Test Segmentation
  104. 104. Segmentation Tests often apply only to certain users: ● Specific markets ● Specific languages ● Specific devices
  105. 105. Segmentation through Test Rules ● Test definition allows one optional rule ● A rule is simply a boolean expression ● If the rule passes, the user is assigned to a test bucket Rules are written in Unified EL
  106. 106. Simple Things are Simple ● No deployment needed ● Changes live within minutes { "description": "Button colors", "rule": "country == 'CA'", "buckets": […] }
  107. 107. Primitive and rich data types "userAgent.phone || userAgent.tablet" "userAgent.supports.html5" "userAgent.supports.geolocation" "userAgent.supports.fileUpload"
  108. 108. Commons EL is Easily Extended JSTL Standard Functions "rule": "fn:endsWith( account.email, '@indeed.com')" Custom code "rule": "proctor:contains( ['US', 'CA'], country)"
  109. 109. Arbitrary Complexity Sometimes rules are unavoidably complex: "Android v2.1+": userAgent.android && ( userAgent.OS.majorVersion gt 2 || ( userAgent.OS.majorVersion == 2 && userAgent.OS.minorVersion ge 1 ) )
  110. 110. What context is available? So far we've seen: ● country ● language ● userAgent ● account What's the full list of available context variables?
  111. 111. Context Defined in Test Specification ● Test spec declares available context variables ● This is a contract to provide values at runtime { "tests": […], "providedContext": { "country": "String", "language": "String", "userAgent": "com.indeed.web.UserAgent" } }
  112. 112. Provided While Determining Buckets Also generated from test specification: private ResumeSearchProctor proctor; // Proctor assignments ProctorResult assignments = proctor.determineBuckets( identifiers, country, language, userAgent);
  113. 113. Payloads
  114. 114. Even Tiny Changes Need Deploys // Choose a background color for templates if (bgColorBucket == 1) { // Test model.put("btnBgcolor", "#00f"); } else { // Control group model.put("btnBgcolor", "#ccc"); }
  115. 115. Some Tests Just Vary Data Many tests have no behavioral change: ● CSS Colors ● Display Text ● Algorithm Weights
  116. 116. Payloads ● Values added for each bucket in a test ● Proctor verifies payloads are "all or none" Control: Gray Test: Blue
  117. 117. Payloads ● Values added for each bucket in a test ● Proctor verifies payloads are "all or none" Control: Gray "#ccc" Test: Blue "#00f"
  118. 118. Part of Test Definition ● No deployment needed ● Changes live within minutes "buckets": [{ "id": 0, "name": "gray", "description": "Control group", "payload": { "stringValue": "#ccc" } }, …]
  119. 119. Declared in Project Test Specification ● Type definition only ● Must match test definition "buttonBgColorTst": { "buckets": […], "payload": { "type": "stringValue" } }
  120. 120. Cleaner Code, Only Data Deploy // Choose a background color model.put( "btnBgcolor", groups.getButtonBgColorTstPayload() );
  121. 121. Cross-Product Tests
  122. 122. Cross-Product Tests Many flavors of cross-product test, including ● Peer webapps ● Client / Service ● Mobile Native / Web
  123. 123. Cross-Product Tests Even more ways to coordinate tests ● Tracking parameters on links, requests ● Service response metadata ● Different service calls Proctor offers an interesting alternative
  124. 124. Two products can share test groups As long as both products ● Share the test’s identifier ● Provide the context variables it uses Deterministic selection guarantees identical bucket assignment.
  125. 125. Evolving Tests
  126. 126. Evolving Tests control test
  127. 127. Evolving Tests control test (inactive) 10%
  128. 128. Changed allocations, not ID mapping control OOPS! test ● Inconsistent experience ● Polluted results
  129. 129. Evolving Tests Smoothly control test (inactive) [ 10%, 10%] [ 10%, 10%, 80% ]
  130. 130. Evolving Tests Smoothly control test (inactive) [ 10%, 10%, 80% ] [ 10%, 10%, 40%, 40%] control test control test
  131. 131. Evolving Tests Smoothly control (inactive) test [ 10%, 80%, 10% ] [ 50%, 50% ] control test
  132. 132. Evolving Tests… Turbulently hash ( uid + test.salt ) => bucket Test range: control test1 Any ID: test1 After re-salt: test
  133. 133. Contextual Sampling
  134. 134. Contextual Allocation 10% (US): control (inactive) test (inactive) test 20% (CA): control 50% (Rest of World): control test
  135. 135. Allocations Each test definition ● has one or more allocations Each allocation ● has a rule and ranges totaling 1.0 ● except the last, which has no rule.
  136. 136. Allocation Rules ● Use Unified EL, same as test rules. ● Use the same context variables as test rules. ● Choose the first matching allocation.
  137. 137. Allocations in the Test Definition { "description": "Button colors", "type": "USER", "salt": "buttonBgColorTst", "buckets": [ … ], "allocations": [{ "rule": "country == 'US'", "ranges": [ … ] }, { "ranges": [ … ] }] }
  138. 138. Pre-Production
  139. 139. Environments Local commit Integration push QA push Production
  140. 140. Show test matrix /private/showTestMatrix
  141. 141. Show test bucket assignments /private/showGroups /private/showGroups
  142. 142. Privileged users can force assignments
  143. 143. Privileged users can force assignments ?prforceGroups=buttonColorTst1
  144. 144. Beyond A/B Testing Proctor Patterns for Managing Behavior
  145. 145. Kill Switch When ● New Feature How ● 'Active' bucket @ 100%
  146. 146. Phased Rollout When ● Experimental Feature How ● 'Active' bucket @ 0% ● 'Active' → 1% → 5% → 100%
  147. 147. Throttle When ● Downsampling ○ trace logging ○ survey How ● 'Active' bucket @ 0% ● 'Active' → 1% → 10% → 5% → ??
  148. 148. Feature Toggles When ● Localized Behavior ● Device-Specific Behavior ● Logged-in, w/ Resume, etc. How ● Multiple Allocations ● Targeted Rules
  149. 149. Dark Deploys When ● Partial Implementations ● Additional QA is needed How ● 'Active' bucket @ 0% ● 'Active' → 100%
  150. 150. Cross-Product Coordination When ● Dependencies between products ○ Resume Wizard feature How ● 'Active' bucket at 0% ● Resume Wizard allocation: → 100% ● Home page promo allocation: → 100%
  151. 151. Pre-Proctor Tests
  152. 152. Post-Proctor Tests 103 Proctor 42
  153. 153. Post-Proctor Tests + Toggles 103 Proctor 65 42 10
  154. 154. Proctor Webapp A/B Test Change Management (Coming Soon to github)
  155. 155. Proctor Webapp
  156. 156. Proctor Webapp
  157. 157. Proctor Webapp
  158. 158. Proctor Webapp
  159. 159. Building On Proctor (Not Open Source)
  160. 160. Description: Group 0: control - Job alert label: Save Alert (control) Group 1: labelSubscribe - Job alert label: Subscribe Group 2: labelSignUp - Job alert label: Sign up Group 3: labelGetJobs - Job alert label: Get jobs Group 4: labelSendMeNewJobs - Job alert label: Send me new jobs Group 5: labelActivate - Job alert label: Activate Group 6: labelSave - Job alert label: Save
  161. 161. History: jack @ 2013-03-12 (r203267): Promoting jasxjabtnlbltst (trunk r203089) to production JASX-11365: jasxjabtnlbltst disabled ketan @ 2012-12-11 (r190675): merged r190418: JASX-10663: Stop jasxjabtnlbltst in all languages except nl will @ 2012-11-29 (r188801): merged r187452: JASX-10457: exclude US from jasxjabtnlbltst ketan @ 2012-10-25 (r182881): merged r182688: JASX-10234 - Adding new langauges to job alert button label test ketan @ 2012-10-25 (r182876): merged r181938: JASX-10234 - Adding test definition and allocations for job alert button label test
  162. 162. DEMO Get out your Phones and Tablets
  163. 163. http://go.indeed.com/demo Simple: test different background colors 25% 25% 50% 25% 25%
  164. 164. http://go.indeed.com/demo Let’s increase our bucket size... 50% 50%
  165. 165. http://go.indeed.com/demo We have a winner! 50% 100%
  166. 166. http://go.indeed.com/demo Let’s do something wacky! Android iOS Android >= 4 iOS >= 7
  167. 167. http://go.indeed.com/demo Also a reference implementation Running on heroku -- feel free to clone! http://indeedeng-hello-proctor.herokuapp.com Source: github.com/indeedeng/proctor-demo
  168. 168. Q&A Source: github.com/indeedeng/proctor Docs: indeedeng.github.io/proctor
  169. 169. Next @IndeedEng Talk Boxcar Self-balancing distributed services Wednesday, October 30 R.B. Boyer
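
The "Don't Record. Recalculate." idea from slides 50 to 62 (hash the ID together with a per-test salt, then map the hash onto bucket ranges) can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not Proctor's actual code: the class `HashedRangeSketch`, its method names, and the choice of MD5 are hypothetical, since the slides do not say which hash function Proctor uses.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class HashedRangeSketch {

    /** Hashes id + salt to a position in [0, 1). MD5 is an assumption of this sketch. */
    public static double position(String id, String salt) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest((id + salt).getBytes(StandardCharsets.UTF_8));
            // Fold the first four digest bytes into a signed 32-bit int.
            int hash = ((digest[0] & 0xff) << 24) | ((digest[1] & 0xff) << 16)
                     | ((digest[2] & 0xff) << 8) | (digest[3] & 0xff);
            // Shift MIN_INT..MAX_INT onto 0..2^32-1, then scale to [0, 1).
            long shifted = (long) hash - Integer.MIN_VALUE;
            return shifted / 4294967296.0;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 should be available on every Java platform", e);
        }
    }

    /** Walks the ranges in order and returns the bucket whose range contains the position. */
    public static int selectBucket(String id, String salt, int[] bucketValues, double[] lengths) {
        double position = position(id, salt);
        double upper = 0.0;
        for (int i = 0; i < lengths.length; i++) {
            upper += lengths[i];
            if (position < upper) {
                return bucketValues[i];
            }
        }
        return -1; // the ranges did not cover this position: no bucket
    }

    public static void main(String[] args) {
        int[] buckets = {0, 1};          // 0 = gray (control), 1 = blue (test)
        double[] lengths = {0.5, 0.5};   // 50% / 50%
        int first = selectBucket("UID#1", "buttonBgColorTst", buckets, lengths);
        int again = selectBucket("UID#1", "buttonBgColorTst", buckets, lengths);
        System.out.println("bucket=" + first + " repeatable=" + (first == again));
    }
}
```

Because nothing is stored except the ID, any server that knows the test's salt and ranges computes the same bucket. Using a different salt for each test is what keeps one test's split from lining up with another's.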

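Slides 128 to 131 contrast resizing buckets in place with growing a test by converting the inactive range. The sketch below illustrates why that matters, assuming each user already has a fixed position in [0, 1) from the hashed mapping. The class `EvolvingRangesSketch`, its constants, and the sample positions are hypothetical and chosen only for illustration.

```java
final class EvolvingRangesSketch {
    static final int INACTIVE = -1;
    static final int CONTROL = 0;
    static final int TEST = 1;

    /** Returns the bucket whose cumulative range contains position (a value in [0, 1)). */
    public static int bucketAt(double position, int[] bucketValues, double[] lengths) {
        double upper = 0.0;
        for (int i = 0; i < lengths.length; i++) {
            upper += lengths[i];
            if (position < upper) {
                return bucketValues[i];
            }
        }
        return INACTIVE;
    }

    public static void main(String[] args) {
        // Starting point: 10% control, 10% test, 80% inactive.
        int[] startBuckets = {CONTROL, TEST, INACTIVE};
        double[] startLengths = {0.1, 0.1, 0.8};

        // Smooth growth: keep the first two ranges, split the inactive range.
        int[] grownBuckets = {CONTROL, TEST, CONTROL, TEST};
        double[] grownLengths = {0.1, 0.1, 0.4, 0.4};

        // In-place resize: simply change the lengths to 50% / 50%.
        int[] resizedBuckets = {CONTROL, TEST};
        double[] resizedLengths = {0.5, 0.5};

        double[] positions = {0.05, 0.15, 0.5, 0.9};
        for (double p : positions) {
            System.out.printf("position %.2f: start=%d grown=%d resized=%d%n",
                    p,
                    bucketAt(p, startBuckets, startLengths),
                    bucketAt(p, grownBuckets, grownLengths),
                    bucketAt(p, resizedBuckets, resizedLengths));
        }
    }
}
```

In this sketch a user at position 0.15 starts in the test bucket and stays there when the inactive range is split, but lands in control after the in-place resize, which is the inconsistent experience and polluted data that slide 128 warns about.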