Building better products through Experimentation - SDForum Business Intelligence SIG



  1. Building better products through Experimentation
     Deepak Nadig, eBay Principal Architect
     SDForum Business Intelligence SIG, March 27, 2008
  2. What we’re up against
     • eBay manages …
       – Over 276,000,000 registered users
       – Over 1 Billion photos
       – eBay users worldwide trade more than $2,039 worth of goods every second
       – eBay averages well over 1 billion page views per day
       – At any given time, there are over 113 million items for sale on the site
       – eBay stores over 2 Petabytes of data – over 200 times the size of the Library of Congress!
       – eBay analytics processes over 25 Petabytes of data on any day
       – The eBay platform handles 4.4 billion API calls per month
     • In a dynamic environment
       – 300+ features per quarter
       – We roll 100,000+ lines of code every two weeks
     • In 39 countries, in seven languages, 24x7
     Callouts: An SUV is sold every 5 minutes. A sporting good sells every 2 seconds. Over ½ Million pounds of Kimchi are sold every year! >44 Billion SQL executions/day!
  3. Site Statistics: in a typical day…
                                 June 1999      Q1 2007        Growth
     Outbound Emails             1 M            41 M           41x
     Total Page Views            54 M           >1 B           19x
     Peak Network Utilization    268 Mbps       16 Gbps        59x
     API Calls                   0              150 M          N/A
     Availability                ~97%           99.94%         50x
                                 43 mins/day    50 sec/day
  4. Velocity of eBay -- Software Development Process
     Callouts: 276M Users; 300+ Features Per Quarter; 99.94%; 100K LOC/Wk; 6M LOC
     • Our site is our product. We change it incrementally by implementing new features.
     • Very predictable development process – trains leave on time at regular intervals (weekly).
     • Parallel development process with significant output -- 100,000 LOC per release.
     • Always on – over 99.94% available.
     All while supporting a 24x7 environment
  5. James Lind and cure for scurvy
     Slide labels: cider, elixir of vitriol, sea water, garlic, vinegar, mustard, horseradish, orange, lemon
  6. Reminder for data/analytics driven decisions
     • Auction vs. Stores
     • Combined search results
       – Return a broader mix of inventory
       – Listings of core + stores were combined
       – More exposure to store listings
     • Results
       – Business metrics were down – bids, average sales price, etc.
       – Latency in discovering this
     • Analysis
       – Overall cost of a store listing is less than that of an auction listing
       – Sellers shifted inventory to save on fees
     • Rolled back in 03/2006
       – Higher fees for store listings
  7. Many Insights Methods (By Data Source vs. Approach)
     [Figure: methods plotted by DATA SOURCE (self-reported (stated) / mixture / observed behavior) against APPROACH (qualitative (direct) to quantitative (indirect))]
     Methods shown: Focus Groups / “Voices”; Desirability studies; Exit Surveys; Phone Interviews; Cardsorting; Product Tracker; Diary/Camera Study; Message Board Mining; “Visits” (onsite interviews) / Ethnographic Field Studies (extended observation); Intent Discovery; Usability Lab Studies (task-based); Quantitative user experience assessments; Usability benchmarking (in lab); Data mining; Eyetracking; Experimentation; Clickstreams
     KEY – Context of data collection with respect to product use: De-contextualized / not using product; Scripted or lab-based use of product; Natural use of product; Combination / hybrid
  8. Concepts
     • Unit (of experimentation, analysis)
       – Entity on which the experiment or analysis is performed
       – e.g. user, seller, buyer, item
     • Factor (or variable)
       – Something that can have multiple values
       – Independent or controlled (cause), Dependent or response (effect)
     • Treatment (or experience)
       – A variation of information (e.g. page flow, page, module) served to the unit. The variation is characterized by a change in one or more factors or variables
     • Sample
       – A group of users who are served the same treatment.
     • Evaluation Metric
       – A metric used to compare the response to different treatments
     • Experimentation
       – A method of comparing 2 or more treatments based on a measurable metric. One variant, the status quo, is referred to as the ‘control’.
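As an illustration of how an evaluation metric can be used to compare the control against a variant, the sketch below applies a pooled two-proportion z-test to a conversion-style metric. The slides do not say which statistical test eBay uses; the test choice, the function name `two_proportion_z_test`, and the counts in the usage note are assumptions made for this example only.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Compare a rate metric between control (A) and variant (B).

    conv_* is the number of units that converted, n_* the sample size.
    Returns (z, two-sided p-value) under the pooled-variance normal
    approximation. Hypothetical helper, not part of any eBay API.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis of "no difference".
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

With made-up counts, `two_proportion_z_test(100, 1000, 150, 1000)` gives a z-score above 3 and a p-value well below 0.01, so the difference would be treated as significant; identical rates give z = 0 and p = 1.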
  9. Treatment (or experience)
     • Module
       – Strict subset of the page
       – User is treated to changes to a module
       – E.g. zebra vs. integrated vs. distinct ads
     • Page
       – User is treated to different variations of the page
       – E.g. 2L1R (left column is twice as wide as right) vs. 1L2R
     • Page Flow or Use Case
       – User is treated to different variations of a use case
       – E.g. different flows for listing an item for sale
  10. Sampling
      • Population
        – Group you want to generalize to
      • Sample
        – Units from the population selected
      • Sampling
        – Process of selecting units from a population of interest
        – By studying the sample you can fairly generalize the results to the population
      • External validity (Generalizability)
      • Mechanisms
        – Random
        – Stratified random
        – …
      • What matters is number of samples
      Diagram labels: People, Time, Place, Setting
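The two sampling mechanisms named on the slide can be sketched as follows. This is a minimal illustration using Python's standard `random` module; the function names, the `(user_id, country)` unit shape in the usage note, and the fixed seed are assumptions made for the example, not details from the talk.

```python
import random
from collections import defaultdict


def simple_random_sample(units, k, seed=0):
    """Pick k units uniformly at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(units, k)


def stratified_random_sample(units, stratum_of, fraction, seed=0):
    """Sample the same fraction from every stratum.

    stratum_of maps a unit to its stratum key (for example a country),
    so that small strata stay represented in the sample.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)
    sample = []
    for members in strata.values():
        # Take at least one unit per stratum.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical population: 90 US users and 10 DE users.
population = [(i, "US") for i in range(90)] + [(i, "DE") for i in range(90, 100)]
stratified = stratified_random_sample(population, lambda u: u[1], 0.1)
```

Here the stratified sample always contains 9 US users and 1 DE user, whereas a simple random sample of 10 may contain no DE users at all.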
  11. Experiments
      • A/B testing
        – A form of testing in which two treatments, a control (‘A’) and a variant (‘B’), are compared.
        – No emphasis on cause (factor)
      • Single-factor testing
        – A form of testing in which treatments corresponding to values of a single factor are compared
        – E.g. Ad – Yes/No
      • Multi-factorial testing (DOE)
        – A method of testing in which treatments corresponding to multiple values of multiple factors are compared
        – E.g. Ad – Yes/No, Location – Top/Bottom
        – Manual vs. Automated
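For the multi-factorial case, the set of treatments in a full factorial design is the Cartesian product of the factor values. The sketch below enumerates them for the slide's Ad and Location example; the function name `full_factorial` and the dictionary representation are assumptions for illustration.

```python
from itertools import product


def full_factorial(factors):
    """Return one treatment (a dict of factor -> value) per combination."""
    names = list(factors)
    value_lists = [factors[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


# Factors and values taken from the slide's example.
factors = {"ad": ["yes", "no"], "location": ["top", "bottom"]}
treatments = full_factorial(factors)

for t in treatments:
    print(t)
```

Two factors with two values each yield four treatments. In practice some combinations are redundant (location is meaningless when no ad is shown), which is one reason a design step may prune or merge cells.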
  12. Objective
      • To explore relationship between factors
      • Relationships
        – None
        – Correlational, Synchronized
          • Positive vs. Negative
          • Third-variable problem
        – Causal relationship
      • Establishing causal relationship
        – If X, then Y
        – If not X, then not Y
      • Distinguish significant factors and interactions
      • Measure impact on the metric
  13. Experiment Lifecycle
      [Figure: cycle of numbered steps around the eBay Experimentation Platform]
      1. Hypothesis
      2. Experimental Design
         • DOE
         • Define Samples, Treatments, Factors
      3. Setup Experiment
         • Setup Experiment Samples, Treatments, Factors
         • Implementation
      4. Launch Experiment
         • User (Experiment, Treatment)
         • Serve Treatment
      5. Measurement
      7. Analysis & Results
      Remaining diagram labels (placement not recoverable): Metrics, Reporting, Idea (!), Learning, Tracking, Monitoring
  14. Reduce Email Guessing
      • Purpose
        – Measure decline in registrations from introduction of blocking message
        – Users cannot create a username which equals their email address
        – E.g. Username: cooky1 Email:
      • Metrics
        – Number of registrations
        – Reduction in phishing
      • Samples
        – 3% US
      • Treatments
        – Classic, Blocked
      • Outcome
        – No difference in registrations
        – Improved security
  15. Text Ads on SRP
      • Purpose
        – Determine the effect of using text ads on search result pages (SRP)
      • Metrics
        – Overall revenue
      • Samples
        – 1% US, International
      • Treatments
        – Ad, No-ad
      • Outcome
        – Overall revenue increased in certain markets
  16. Home Page
      • Purpose
        – Optimal construction of page
        – Per user segment?
      • Metrics
        – Overall revenue
      • Samples
        – Varied per treatment
      • Treatments
        – 100s of variations
        – Ads, Merchandising, P13N, Navigation, Layout
      • Outcome
        – Page structures different for different user segments
  17. What we think about
      • Fidelity of Experiments: The quality of the model and its testing conditions in representing the final feature or product under actual use conditions
      • Cost of Experiments: The total cost of designing, building, running, and analyzing an experiment
      • Iteration time: The time from planning experiments to when the analyzed results are available and used for planning another iteration
      • Concurrency: The number of experiments that can be run at the same time
      • Signal/Noise Ratio: The extent to which the signal (response) of interest is obscured by noise
      • Type/Level of Experiment: Types and levels of experiment that can be carried out
  18. Experimentation Platform
      [Figure: architecture diagram. Recoverable labels: eBay user; Experimenter; Access; Page, Module Design; Experimentation Service; Experiment Metadata; Experiment Lifecycle Management; Finding, Selling, Buying; Experience; Response; Message Bus; Observations; Results; Analysis; Metrics; Alert; Listener; Data Cube; File; Log]
  19. Implementation Considerations
      • User identification
      • User → Sample
        – No bias towards any experiment or treatment
        – Sticky-ness between activities (and sessions)
        – No interaction between experiments
        – Enabling a user to try out a specific treatment
        – Ramping-up to understand generalization effects
      • Sample → Treatment
        – No bias
      • Splitting traffic
        – Inline
        – Application server
        – Load balancer
        – Browser
      • Factor-driven development
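One common way to meet the user-to-sample requirements above (no bias, stickiness, independence between experiments, forced treatments, ramp-up) is deterministic hashing of the user and experiment identifiers. The slides do not describe eBay's actual mechanism; the sketch below is an assumption-laden illustration, and `bucket`, `assign_treatment`, `BUCKETS`, and the `overrides` parameter are hypothetical names.

```python
import hashlib

BUCKETS = 10000  # granularity of traffic splitting (0.01% per bucket)


def bucket(user_id, experiment_id):
    """Map a user to a stable bucket in [0, BUCKETS) for one experiment.

    Salting the hash with experiment_id decorrelates a user's bucket
    across experiments, so experiments do not share the same split.
    """
    key = f"{experiment_id}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % BUCKETS


def assign_treatment(user_id, experiment_id, treatments, sample_percent,
                     overrides=None):
    """Return the treatment name, or None if the user is not sampled."""
    # Lets a specific user try out a specific treatment.
    if overrides and user_id in overrides:
        return overrides[user_id]
    b = bucket(user_id, experiment_id)
    sampled_buckets = BUCKETS * sample_percent // 100
    if b >= sampled_buckets:
        return None
    # Same input always gives the same answer, so assignment is sticky
    # across activities and sessions without storing any state.
    return treatments[b % len(treatments)]
```

Because inclusion depends only on `b < sampled_buckets`, raising `sample_percent` during ramp-up keeps every already-sampled user in the same treatment. For example, `assign_treatment("alice", "email-guessing", ["classic", "blocked"], 3)` would place about 3% of users into one of the two treatments.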
  20. Measurement – A case of traveling shoppers
                     Sunday   Monday   Tuesday  Wednesday  Thursday  Friday   Saturday  Unique users
      Page-1         Alice    Alice    Bob      Charlie    Bob       Charlie  Alice     3
      Page-2         Bob      Alice    Bob      Alice      Bob       Bob      Alice     2
      Unique users   2        1        1        2          1         2        1         3
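The point of the traveling-shoppers table is that unique-user counts are not additive across pages or days. The sketch below recomputes the slide's totals from its visit data; the variable names are chosen for this example.

```python
# One visitor per page per day, Sunday through Saturday, as on the slide.
visits = {
    "Page-1": ["Alice", "Alice", "Bob", "Charlie", "Bob", "Charlie", "Alice"],
    "Page-2": ["Bob", "Alice", "Bob", "Alice", "Bob", "Bob", "Alice"],
}

# Unique users per page over the whole week (row totals).
weekly_per_page = {page: len(set(users)) for page, users in visits.items()}

# Unique users per day across both pages (column totals).
daily_site = [len(set(day)) for day in zip(*visits.values())]

# Unique users across all pages and all days (bottom-right cell).
weekly_site = len(set().union(*visits.values()))

print(weekly_per_page)  # {'Page-1': 3, 'Page-2': 2}
print(daily_site)       # [2, 1, 1, 2, 1, 2, 1]
print(weekly_site)      # 3
```

Summing the daily uniques gives 10 and summing the per-page uniques gives 5, yet only 3 distinct users visited. Metrics built on unique users therefore have to be computed from the underlying visit records at the level of analysis, not rolled up from smaller aggregates.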
  21. Limitations and ways to overcome them …
      • Sticky-ness to user
        – Session-level analysis
      • What, not why
        – When qualitative research complements
      • Short-term vs. Long-term effects
        – Think about the duration of the experiment
      • Newness effect
        – Consider burn-in periods
      • Minor vs. major differences
        – Think about amount of effort being committed
      • Anonymity of tests
        – When qualitative research spills the beans
  22. Key takeaways
      • Experimentation is one of the most effective approaches for gaining quantitative insights
      • Enables businesses to quickly understand and establish relationships between product changes and their impact on business metrics
      • Different types and levels of experiments can be used to gain different amounts of insight
      • Experimentation has limitations, but they can be overcome
      • Think about “experiment-ability” as one more “-ability” in product design
  23. Experimentation Confirms Innovation