Ed Snelson. Counterfactual Analysis


Published in: Software

  1. 1. Counterfactual analysis: a Big Data case-study using Cosmos/SCOPE Ed Snelson
  2. 2. Work by Jonas Peters, Joaquin Quiñonero Candela, Denis Xavier Charles, D. Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Simard, Ed Snelson, Léon Bottou
  4. 4. Search ads
  5. 5. The search ads ecosystem
      [Diagram labels: User, Advertiser, Search engine, Queries, Ads & Bids, Ads, Prices, Clicks (and consequences), Learning. Feedback loops: user feedback loop, advertiser feedback loop, learning feedback loop.]
  6. 6. Learning to run a marketplace
      • The learning machine is not a machine but an organization with lots of people doing stuff! How can we help?
      • Goal: improve the marketplace machinery so that its long-term revenue is maximal
      • Approximate the goal by improving multiple performance measures (KPIs) related to all players
      • Provide data for decision making
      • Automatically optimize parts of the system
  7. 7. Outline from here on
      II. Online Experimentation
      III. Counterfactual measurements
      IV. Cosmos/SCOPE
      V. Implementation details
  9. 9. How do parameters affect KPIs?
      • We want to determine how certain auction parameters affect KPIs
      • Three options:
        1. Offline log analysis – “correlational”
        2. Auction simulation
        3. Online experimentation – “causal”
  10. 10. The problem with correlation analysis (Simpson’s paradox)
      Trying to decide whether a drug helps or not
      • Historical data:

                       All      Survived   Died     Survival rate
        Treated        5,000    2,100      2,900    42%
        Not treated    5,000    2,900      2,100    58%

      • Conclusion: don’t give the drug

      But what if the doctors were saving the drug for the severe cases?

        Severe cases (treatment rate 80%)
                       All      Survived   Died     Survival rate
        Treated        4,000    1,200      2,800    30%
        Not treated    1,000    100        900      10%

        Mild cases (treatment rate 20%)
                       All      Survived   Died     Survival rate
        Treated        1,000    900        100      90%
        Not treated    4,000    2,800      1,200    70%

      • Conclusion reversed: the drug helps for both severe and mild cases
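The reversal on this slide can be checked directly from the survived/total counts. The following is a minimal Python sketch; the dictionary layout and the names `strata`, `pooled`, and `per_stratum` are illustrative assumptions, not part of the talk.

```python
# (survived, total) counts per severity stratum, copied from the slide's tables.
strata = {
    "severe": {"treated": (1200, 4000), "not_treated": (100, 1000)},
    "mild": {"treated": (900, 1000), "not_treated": (2800, 4000)},
}


def rate(survived, total):
    """Survival rate for one group."""
    return survived / total


def pooled(arm):
    """Survival rate for one arm after merging the strata (the naive view)."""
    survived = sum(strata[s][arm][0] for s in strata)
    total = sum(strata[s][arm][1] for s in strata)
    return rate(survived, total)


pooled_treated = pooled("treated")
pooled_untreated = pooled("not_treated")

# Per-stratum (treated, not treated) survival rates.
per_stratum = {
    s: (rate(*arms["treated"]), rate(*arms["not_treated"]))
    for s, arms in strata.items()
}

print(f"pooled:  treated {pooled_treated:.0%}, not treated {pooled_untreated:.0%}")
for s, (t, u) in per_stratum.items():
    print(f"{s}: treated {t:.0%}, not treated {u:.0%}")
```

Pooled over both strata the untreated group looks better, while inside each stratum the treated group does; the only thing that changed is whether severity, which drove the treatment decision, is held fixed.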
  11. 11. Overkill? Pervasive causation paradoxes in ad data!
      Example:
      – Logged data shows a positive correlation between event A, “First mainline ad gets a high quality score”, and event B, “Second mainline ad receives a click”.
      – Do high-quality ads encourage clicking below?
      – Controlling for event C, “Query categorized as commercial”, reverses the correlation for both commercial and non-commercial queries.
  12. 12. Randomized experiments
      Randomly select who to treat
      • Selection is independent of all confounding factors
      • Therefore it eliminates Simpson’s paradox and allows counterfactual estimates
      • If we had given the drug to a fraction $x$ of the patients, the success rate would have been $60\% \times x + 40\% \times (1 - x)$

        All population (treatment rate 30%)
                       All      Survived   Died     Survival rate
        Treated        3,000    1,800      1,200    60%
        Not treated    7,000    2,800      4,200    40%
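As a small worked example of the counterfactual estimate on this slide, the sketch below evaluates the mixture of the two survival rates at a few treatment fractions. The function name `counterfactual_survival` and its default arguments are assumptions made for illustration; the 60% and 40% rates come from the slide's table.

```python
def counterfactual_survival(x, rate_treated=0.60, rate_untreated=0.40):
    """Estimated survival rate if a fraction x of patients had been treated.

    Only valid because treatment was assigned at random, so the two observed
    rates are unbiased for the whole population.
    """
    if not 0.0 <= x <= 1.0:
        raise ValueError("x must be a fraction between 0 and 1")
    return rate_treated * x + rate_untreated * (1.0 - x)


# Sanity check: at the actual treatment rate (30%) the formula reproduces
# the observed overall survival rate, (1,800 + 2,800) / 10,000.
observed_overall = (1800 + 2800) / 10000

for x in (0.0, 0.3, 0.5, 1.0):
    print(f"treat {x:.0%} -> estimated survival {counterfactual_survival(x):.0%}")
```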
  13. 13. Experiments in the online world
      • A/B tests are used throughout the online world to compare different versions of the system
        – A random fraction of the traffic (a flight) uses click-prediction system A
        – Another random fraction uses click-prediction system B
      • Wait for a week, measure KPIs, choose the best!
      • Our framework takes this one step further…
  15. 15. Counterfactuals
      Measuring something that did not happen
      “How would the system have performed if, when the data was collected, we had used $\text{system}^*$ instead of $\text{system}$?”
  16. 16. Replaying past data
      Classification example
      • Collect labeled data in the existing setup
      • Replay the past data to evaluate what the performance would have been if we had used classifier θ.
      • Requires knowledge of all functions connecting the point of change to the point of measurement.
  17. 17. Concrete example: mainline reserve (MLR)
      [Figure labels: Mainline; Sidebar; Ad Score > MLR.]
  18. 18. Online randomization
      Q: Can we estimate the results of a change counterfactually (without actually performing the change)?
      A: Yes, if $\text{system}^*$ and $\text{system}$ are non-deterministic (and close enough)
      [Figure: two panels over the MLR axis. “Deterministic”: point values $MLR$ and $MLR^*$. “Randomized”: distributions $P(MLR)$ and $P^*(MLR)$.]
      For each auction, a random MLR is used online, drawn from the data-collection distribution $P(MLR)$
  19. 19. Estimating counterfactual KPIs
      Usual additive KPI:
        $\mathrm{Clicks}_{total} = \sum_i \mathrm{Clicks}(auction_i)$
      Counterfactual KPI:
        $\mathrm{Clicks}^*_{total} \approx \sum_i w^*_i \, \mathrm{Clicks}(auction_i)$
      • Weighted sum: auctions with MLRs “closer” to the counterfactual distribution get higher weight
        $w^*_i = \dfrac{P^*(MLR_i)}{P(MLR_i)}$
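The weighted sum on this slide can be sketched in a few lines of Python. Everything concrete here is an assumption for illustration: the three-value discrete MLR distributions, the toy log, and the function name `counterfactual_total`. Only the weight $P^*(MLR_i)/P(MLR_i)$ and the weighted sum itself come from the slide.

```python
def counterfactual_total(log, p, p_star):
    """Importance-weighted click total under the counterfactual distribution.

    log    -- iterable of (mlr, clicks) pairs, one per logged auction
    p      -- data-collection distribution: {mlr value: probability}
    p_star -- counterfactual distribution over the same mlr values
    """
    total = 0.0
    for mlr, clicks in log:
        weight = p_star[mlr] / p[mlr]  # w*_i = P*(MLR_i) / P(MLR_i)
        total += weight * clicks
    return total


# Hypothetical toy data: one logged auction per MLR value.
log = [(0.8, 3), (1.0, 2), (1.2, 1)]
p = {0.8: 0.25, 1.0: 0.5, 1.2: 0.25}       # what was actually randomized
p_lower = {0.8: 0.5, 1.0: 0.5, 1.2: 0.0}   # "what if MLR had been lower?"

print(counterfactual_total(log, p, p))        # weights are all 1: plain sum
print(counterfactual_total(log, p, p_lower))  # low-MLR auctions count double
```

Note that `p[mlr]` must be non-zero wherever `p_star[mlr]` is: the counterfactual question can only be answered for MLR values that the data collection actually explored.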
  20. 20. Exploration
      Quality of the estimation
      • Confidence intervals reveal whether the data-collection distribution $P(\omega)$ performs sufficient exploration to answer the counterfactual question of interest.
      [Figure: two plots comparing $P(\omega)$ and $P^*(\omega)$.]
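To illustrate why confidence intervals reveal insufficient exploration, here is a rough sketch that puts a normal-approximation 95% interval around an importance-weighted mean. This is a textbook interval chosen for illustration, not necessarily the construction used in the talk, and the data, weights, and the name `weighted_mean_ci` are all hypothetical.

```python
import math
import statistics


def weighted_mean_ci(values, weights, z=1.96):
    """Normal-approximation interval for the mean of w_i * y_i.

    An assumption-laden sketch: treats the weighted terms as i.i.d. and
    uses their sample standard deviation for the standard error.
    """
    terms = [w * y for w, y in zip(weights, values)]
    mean = statistics.fmean(terms)
    half_width = z * statistics.stdev(terms) / math.sqrt(len(terms))
    return mean - half_width, mean + half_width


# Hypothetical per-auction click indicators.
clicks = [1, 0, 1, 1, 0, 1, 0, 1]

# Counterfactual close to the logged distribution: weights near 1.
weights_close = [1.0] * 8
# Counterfactual far from it: a few auctions carry all the weight.
weights_far = [4.0, 0.0, 0.0, 0.0, 4.0, 0.0, 0.0, 0.0]

ci_close = weighted_mean_ci(clicks, weights_close)
ci_far = weighted_mean_ci(clicks, weights_far)
print("close:", ci_close)
print("far:  ", ci_far)
```

When the counterfactual distribution sits where the logged one rarely sampled, a handful of heavily weighted auctions dominate the estimate and the interval widens, which is the signal that the logged randomization did not explore enough for that question.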
  21. 21. Clicks vs MLR
      [Plot annotations: inner “exploration” interval; outer “sample-size” interval; control with no randomization; control with 18% lower MLR.]
  22. 22. Number of Mainline Ads vs MLR
      This is easy to estimate
  23. 23. Revenue vs MLR
      Revenue always has high sample variance
  24. 24. More with the same data
      How is this related to A/B testing?
      • A/B testing tests 2 specific settings against each other
      • Need to know what questions you want to ask beforehand!
      Big advantage of more general randomization:
      • Collect data first, choose question(s) later
      • Randomizing more stuff increases opportunities
      But…
      • Requires more sophisticated offline log processing
  25. 25. IV. COSMOS/SCOPE
  26. 26. Ad Auction Logs
      • ≈ 10 TB per day of ad-auction logs
      • Cooked and joined from various raw logs
      • Stored in Cosmos, queried via SCOPE
      • Small fraction of total Bing logs and jobs:
        – Tens of thousands of SCOPE jobs daily
        – Tens of PBs read/written daily
  27. 27. Cosmos/SCOPE
      SCOPE ≈ Pig/Hive; Cosmos ≈ HDFS
  28. 28. Cosmos
      • Microsoft’s internal distributed data store (≈ HDFS, GFS)
      • Tens of thousands of commodity servers
      • Append-only file system, optimized for sequential I/O
      • Data replication and compression
  29. 29. Data Representation
      1. Unstructured streams
        – Custom extractors: convert a sequence of bytes into a RowSet, specifying a schema for the columns
      2. Structured streams
        – Data stored alongside metadata: a well-defined schema and structural properties (e.g. partitioning and sorting information)
        – Can be horizontally partitioned into tens of thousands of partitions, e.g. by hash or range partitioning
        – Indexes for random access and index-based joins
  30. 30. SCOPE scripting language
      • SQL-like (in syntax) declarative language specifying a data transformation pipeline
      • Each SCOPE statement takes one or more RowSets as input and outputs another RowSet
      • Highly extensible with C# expressions, custom operators and data types
      • The SCOPE compiler and optimizer are responsible for generating a data flow DAG for efficient parallel execution
  31. 31. C# Expressions and functions
        R1 = SELECT A+C AS ac, B.Trim() AS B1
             FROM R
             WHERE StringOccurs(C, "xyz") > 2;

        #CS
        // Counts occurrences of ptrn in str (overlapping matches included).
        public static int StringOccurs(string str, string ptrn)
        {
            int cnt = 0;
            int pos = -1;
            while (pos + 1 < str.Length)
            {
                // Search from just past the previous match.
                pos = str.IndexOf(ptrn, pos + 1);
                if (pos < 0) break;
                cnt++;
            }
            return cnt;
        }
        #ENDCS
      [Slide callouts: C# String method (B.Trim()); C# String expression.]
  32. 32. C# User-defined types (UDTs)
      – Arbitrary C# classes can be used as column types in scripts
      – Extremely convenient for easy serialization/deserialization
      – Can be referenced in external DLLs, C# backing files, and in-script (#CS … #ENDCS)
        SELECT UserId, SessionId, new RequestInfo(binaryData) AS Request
        FROM InputStream
        WHERE Request.Browser.IsIE();
  33. 33. C# User-defined operators
      – User-defined aggregates
        • Aggregate interface: Initialize, Accumulate, Finalize
        • Can be declared recursive: allows partial aggregation
      – MapReduce-like extensions
        • PROCESS
        • REDUCE
          – Can be declared recursive
        • COMBINE
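The Initialize / Accumulate / Finalize pattern, and why partial aggregation matters, can be illustrated outside SCOPE. The sketch below is a Python analogue, not the SCOPE C# interface: the class `MeanAggregate`, the helper `aggregate`, and in particular the `merge` method are assumptions introduced here to show how per-partition partial results can be combined before finalizing.

```python
class MeanAggregate:
    """Python analogue of an Initialize / Accumulate / Finalize aggregate."""

    def __init__(self):
        # Initialize: start from an empty running state.
        self.total = 0.0
        self.count = 0

    def accumulate(self, value):
        # Accumulate: fold one row into the running state.
        self.total += value
        self.count += 1

    def merge(self, other):
        # Hypothetical step for partial aggregation: combine the state
        # computed on another partition into this one.
        self.total += other.total
        self.count += other.count

    def finalize(self):
        # Finalize: turn the running state into the reported value.
        return self.total / self.count if self.count else float("nan")


def aggregate(values):
    """Run the aggregate over one partition and return its state."""
    agg = MeanAggregate()
    for v in values:
        agg.accumulate(v)
    return agg


# Partial aggregation: each partition is reduced locally, then merged.
partitions = [[1, 2, 3], [4, 5], [6]]
combined = MeanAggregate()
for part in partitions:
    combined.merge(aggregate(part))

single_pass = aggregate([v for part in partitions for v in part])
print(combined.finalize(), single_pass.finalize())
```

Keeping the state as (sum, count) rather than a running mean is what makes the merge exact; the same idea carries over to weighted sums.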
  34. 34. SCOPE compilation and execution
        SELECT query, COUNT() AS count
        FROM "search.log" USING LogExtractor
        GROUP BY query
        HAVING count > 1000
        ORDER BY count DESC;
        OUTPUT TO "qcount.result";
      [Diagram label: Runtime cost-based optimizer]
  35. 35. SCOPE: Pros/Cons (an opinion)
      • Pros:
        – Very quick to write simple queries without thinking about parallelization and execution
        – Highly extensible with deep C# integration
        – UDT columns and C# functions
        – Easy development and debugging from VS
          • Intellisense
      • Cons:
        – No loop/iteration support means a poor fit for many ML algorithms
        – Batch, rather than interactive
  37. 37. Counterfactual computation
      • Ideal for Map-Reduce setting
      • Map: $auction_i \to KPI(auction_i)$
      • Reduce: $\sum_i w^*_i \ldots$
      $KPI^*_{total} = \sum_i w^*_i \, KPI(auction_i)$
  38. 38. Counterfactual grid
  39. 39. SCOPE pseudo-code for counterfactuals
        AuctionLogs = VIEW CosmosLogPath;
        SELECT Auction FROM AuctionLogs;
        SELECT ComputeKPIs(Auction) AS KPIs,
               ComputeWeightGrid(Auction) AS WeightGrid;
        SELECT ComputeWeightedKPIs(KPIs, GridPoint) AS wKPIs
        CROSS APPLY WeightGrid AS GridPoint;
        SELECT AggregateKPIs(wKPIs) AS TotalKPIs
        GROUP BY GridPoint;
        SELECT GridPoint, TotalKPIs.Finalize() AS FinalKPIs;
        OUTPUT TO "Results.tsv";
      Slide callouts:
      • Auction: C# UDT that wraps all logged info about a single auction
      • ComputeKPIs, ComputeWeightGrid, ComputeWeightedKPIs: C# UDFs
      • CROSS APPLY: unrolls the weight grid
      • AggregateKPIs: recursive aggregator, accumulating $\sum_i w_i$, $\sum_i w_i \, KPI_i$, etc.
      • TotalKPIs.Finalize(): calls an instance method on the “TotalKPIs” UDT
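The same pipeline shape can be sketched in plain Python to make the data flow concrete. This is an analogue under stated assumptions, not the authors' implementation: the auction records, the logged distribution `P_LOGGED`, the three-point `GRID` of counterfactual distributions, and the function bodies are all hypothetical. Only the step names (compute KPIs, compute the weight grid, unroll it, aggregate per grid point) mirror the pseudo-code.

```python
from collections import defaultdict

# Hypothetical data-collection distribution over MLR values.
P_LOGGED = {0.8: 0.25, 1.0: 0.5, 1.2: 0.25}

# Hypothetical grid: each grid point is one counterfactual distribution.
GRID = {
    "lower": {0.8: 0.5, 1.0: 0.5, 1.2: 0.0},
    "same": dict(P_LOGGED),
    "higher": {0.8: 0.0, 1.0: 0.5, 1.2: 0.5},
}


def compute_kpis(auction):
    """Map step: per-auction KPIs (here only clicks)."""
    return {"clicks": auction["clicks"]}


def compute_weight_grid(auction):
    """One importance weight P*(MLR) / P(MLR) per grid point."""
    mlr = auction["mlr"]
    return {point: p_star[mlr] / P_LOGGED[mlr] for point, p_star in GRID.items()}


def run(auctions):
    """Unroll the weight grid per auction, then aggregate per grid point."""
    totals = defaultdict(lambda: {"w": 0.0, "w_clicks": 0.0})
    for auction in auctions:
        kpis = compute_kpis(auction)
        # Analogue of CROSS APPLY: one row per (auction, grid point).
        for point, weight in compute_weight_grid(auction).items():
            totals[point]["w"] += weight
            totals[point]["w_clicks"] += weight * kpis["clicks"]
    return dict(totals)


auctions = [
    {"mlr": 0.8, "clicks": 3},
    {"mlr": 1.0, "clicks": 2},
    {"mlr": 1.2, "clicks": 1},
]
results = run(auctions)
for point, agg in results.items():
    print(point, agg)
```

Because each grid point only needs running sums, the per-point state can be computed independently on each partition of the log and added together afterwards, which is what makes the computation a good fit for a batch map-reduce system.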
  40. 40. Conclusions
      • There are systems in the real world that are too complex to easily formalize
      • Causal inference clarifies many problems
        – Ignoring causality => Simpson’s paradox
        – Randomness allows inferring causality
      • The counterfactual framework is modular
        – Randomize in advance, ask later
      • Counterfactual analysis is ideally suited to batch map-reduce