
Machine Learning Summer School 2016


Lectures delivered Aug 8-9, 2016 at MLSS.cc (Arequipa, Peru)
"Data Science @ The New York Times".
Topics include:

descriptive/predictive/prescriptive modeling
boosting & surrogate loss functions (pp. 103–117)
causality via generative models (pp. 141–215)
POISE: "policy optimization via importance sample estimation"
connection w/causal inference & matching (pp. 188–215)
brief intro to instrumental variables (pp. 230–248)
bandits (pp. 249–316)
connecting Thompson sampling with generative modeling (pp. 259–305)


  1. 1. data science @ NYT Chris Wiggins Aug 8/9, 2016
  6. 6. Outline 1. overview of DS@NYT 2. prediction + supervised learning 3. prescription, causality, and RL 4. description + inference 5. (if interest) designing data products
  7. 7. 0. Thank the organizers! Figure 1: prepping slides until last minute
  8. 8. Lecture 1: overview of ds@NYT
  9. 9. data science @ The New York Times chris.wiggins@columbia.edu chris.wiggins@nytimes.com @chrishwiggins references: http://bit.ly/stanf16
  10. 10. data science @ The New York Times
  11. 11. data science: searches
  12. 12. data science: mindset & toolset drew conway, 2010
  13. 13. modern history: 2009
  15. 15. biology: 1892 vs. 1995
  16. 16. biology: 1892 vs. 1995 biology changed for good.
  17. 17. biology: 1892 vs. 1995 new toolset, new mindset
  18. 18. genetics: 1837 vs. 2012 ML toolset; data science mindset
  19. 19. genetics: 1837 vs. 2012
  20. 20. genetics: 1837 vs. 2012 ML toolset; data science mindset arxiv.org/abs/1105.5821 ; github.com/rajanil/mkboost
  21. 21. data science: mindset & toolset
  22. 22. 1851
  23. 23. news: 20th century church state
  24. 24. church
  27. 27. news: 20th century church state
  28. 28. news: 21st century church state data
  29. 29. 1851 1996 newspapering: 1851 vs. 1996
  30. 30. example: millions of views per hour (2015)
  31. 31. "...social activities generate large quantities of potentially valuable data...The data were not generated for the purpose of learning; however, the potential for learning is great’’
  32. 32. "...social activities generate large quantities of potentially valuable data...The data were not generated for the purpose of learning; however, the potential for learning is great’’ - J Chambers, Bell Labs,1993, “GLS”
  33. 33. data science: the web
  34. 34. data science: the web is your “online presence”
  35. 35. data science: the web is a microscope
  36. 36. data science: the web is an experimental tool
  37. 37. data science: the web is an optimization tool
  38. 38. 1851 1996 newspapering: 1851 vs. 1996 vs. 2008 2008
  39. 39. “a startup is a temporary organization in search of a repeatable and scalable business model” —Steve Blank
  40. 40. every publisher is now a startup
  44. 44. news: 21st century church state data
  46. 46. learnings
  47. 47. learnings - predictive modeling - descriptive modeling - prescriptive modeling
  48. 48. (actually ML, shhhh…) - (supervised learning) - (unsupervised learning) - (reinforcement learning)
  49. 49. (actually ML, shhhh…) h/t michael littman
  50. 50. (actually ML, shhhh…) h/t michael littman Supervised Learning Reinforcement Learning Unsupervised Learning (reports)
  51. 51. learnings - predictive modeling - descriptive modeling - prescriptive modeling cf. modelingsocialdata.org
  52. 52. stats.stackexchange.com
  53. 53. from "are you a bayesian or a frequentist" —michael jordan: $L = \sum_{i=1}^{N} \phi\big(y_i f(x_i; \beta)\big) + \lambda \|\beta\|$
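To make the objective concrete, here is a minimal sketch (our own illustration, not from the slides), assuming a linear score $f(x;\beta) = x^\top\beta$, the logistic surrogate for $\phi$, and an $L_2$ norm penalty:

```python
import numpy as np

def surrogate_objective(beta, X, y, lam):
    """L = sum_i phi(y_i f(x_i; beta)) + lam * ||beta||, with the (assumed)
    linear score f(x; beta) = x @ beta and logistic phi(m) = log(1 + e^{-m})."""
    margins = y * (X @ beta)                       # y_i f(x_i; beta)
    return np.log1p(np.exp(-margins)).sum() + lam * np.linalg.norm(beta)

# toy usage
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.sign(X @ np.array([1.0, -2.0, 0.5]))
print(surrogate_objective(np.zeros(3), X, y, lam=0.1))  # = 100 * log(2) at beta = 0
```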
  54. 54. predictive modeling, e.g., cf. modelingsocialdata.org
  55. 55. predictive modeling, e.g., “the funnel” cf. modelingsocialdata.org
  56. 56. interpretable predictive modeling supercoolstuff cf. modelingsocialdata.org
  57. 57. interpretable predictive modeling supercoolstuff cf. modelingsocialdata.org arxiv.org/abs/q-bio/0701021
  58. 58. optimization & learning, e.g., "How The New York Times Works," Popular Mechanics, 2015
  59. 59. optimization & prediction, e.g., (some models) (some moneys) "newsvendor problem," literally (+prediction+experiment)
  60. 60. recommendation as inference
  61. 61. recommendation as inference bit.ly/AlexCTM
  62. 62. descriptive modeling, e.g., "segments" cf. modelingsocialdata.org
  63. 63. descriptive modeling, e.g., "segments" cf. modelingsocialdata.org
  64. 64. descriptive modeling, e.g., "segments" $\mathrm{argmax}_z\, p(z|x) = 14$ cf. modelingsocialdata.org
  65. 65. descriptive modeling, e.g., "segments" "baby boomer" cf. modelingsocialdata.org
  66. 66. - descriptive data product cf. daeilkim.com
  67. 67. descriptive modeling, e.g., cf. daeilkim.com ; import bnpy
  68. 68. modeling your audience bit.ly/Hughes-Kim-Sudderth-AISTATS15
  69. 69. modeling your audience (optimization, ultimately)
  70. 70. also allows insight+targeting as inference modeling your audience
  71. 71. prescriptive modeling
  74. 74. "off policy value estimation" (cf. "causal effect estimation") cf. Langford '08–'16; Horvitz & Thompson '52; Holland '86
  75. 75. "off policy value estimation" (cf. "causal effect estimation") Vapnik's razor: "When solving a (learning) problem of interest, do not solve a more complex problem as an intermediate step."
  76. 76. prescriptive modeling cf. modelingsocialdata.org
  77. 77. prescriptive modeling aka “A/B testing”; RCT cf. modelingsocialdata.org
  78. 78. Reporting Learning Test aka “A/B testing”; business as usual (esp. predictive) Some of the most recognizable personalization in our service is the collection of “genre” rows. …Members connect with these rows so well that we measure an increase in member retention by placing the most tailored rows higher on the page instead of lower. cf. modelingsocialdata.org prescriptive modeling: from A/B to….
  79. 79. real-time A/B -> “bandits” GOOG blog: cf. modelingsocialdata.org
  83. 83. prescriptive modeling, e.g., leverage methods which are predictive yet performant
  84. 84. NB: data-informed, not data-driven
  85. 85. predicting views/cascades: doable? KDD 09: how many people are online?
  86. 86. predicting views/cascades: doable? WWW 14: FB shares
  87. 87. predicting views/cascades: features?
  88. 88. predicting views/cascades: features?
  89. 89. predicting views/cascades: doable? WWW 16: Twitter RTs
  90. 90. predicting views/cascades: doable?
  92. 92. [diagram: descriptive: Reporting; predictive: Learning, Test; prescriptive: Optimizing, Explore]
  93. 93. things: what does DS team deliver? - build data product - build APIs - impact roadmaps
  94. 94. data science @ The New York Times chris.wiggins@columbia.edu chris.wiggins@nytimes.com @chrishwiggins references: http://bit.ly/stanf16
  95. 95. Lecture 2: predictive modeling @ NYT
  96. 96. desc/pred/pres Figure 2: desc/pred/pres ◮ caveat: difference between observation and experiment. why?
  97. 97. blossom example Figure 3: Reminder: Blossom
  98. 98. blossom + boosting (‘exponential’) Figure 4: Reminder: Surrogate Loss Functions
  104. 104. tangent: logistic function as surrogate loss function ◮ define $f(x) \equiv \log\big(p(y{=}1|x)/p(y{=}{-}1|x)\big) \in \mathbb{R}$ ◮ $p(y{=}1|x) + p(y{=}{-}1|x) = 1 \;\Rightarrow\; p(y|x) = 1/(1 + \exp(-y f(x)))$ ◮ $-\log_2 p(\{y\}_1^N) = \sum_i \log_2\big(1 + e^{-y_i f(x_i)}\big) \equiv \sum_i \ell(y_i f(x_i))$ ◮ $\ell'' > 0$, $\ell(\mu) > 1[\mu < 0]\ \forall\,\mu \in \mathbb{R}$ (checked numerically below) ◮ ∴ maximizing log-likelihood is minimizing a surrogate convex loss function for classification (though not strongly convex, cf. Yoram's talk) ◮ but $\sum_i \log_2\big(1 + e^{-y_i w^T h(x_i)}\big)$ is not as easy as $\sum_i e^{-y_i w^T h(x_i)}$
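A quick numerical check of the claim above, that the base-2 logistic loss $\ell(\mu) = \log_2(1+e^{-\mu})$ upper-bounds the 0-1 loss (our own sanity check, not from the slides):

```python
import numpy as np

mu = np.linspace(-5.0, 5.0, 1001)
ell = np.log1p(np.exp(-mu)) / np.log(2)   # l(mu) = log2(1 + e^{-mu})
zero_one = (mu < 0).astype(float)         # 1[mu < 0]
assert np.all(ell > zero_one)             # the surrogate strictly dominates 0-1 loss
```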
  108. 108. boosting¹ exponential surrogate loss function $L$, summed over examples: ◮ $L[F] = \sum_i \exp(-y_i F(x_i))$ ◮ $= \sum_i \exp\big(-y_i \sum_{t' \le t} w_{t'} h_{t'}(x_i)\big) \equiv L_t(w_t)$ ◮ Draw $h_t \in \mathcal{H}$, a large space of rules s.t. $h(x) \in \{-1,+1\}$ ◮ label $y \in \{-1,+1\}$
  113. 113. boosting¹ exponential surrogate loss function $L$, summed over examples: ◮ $L_{t+1}(w_t; w) \equiv \sum_i d_i^t \exp(-y_i\, w\, h_{t+1}(x_i))$ ◮ $= \sum_{y_i = h'(x_i)} d_i^t e^{-w} + \sum_{y_i \ne h'(x_i)} d_i^t e^{+w} \equiv e^{-w} D_+ + e^{+w} D_-$ ◮ $\therefore w_{t+1} = \mathrm{argmin}_w L_{t+1}(w) = (1/2)\log(D_+/D_-)$ ◮ $L_{t+1}(w_{t+1}) = 2\sqrt{D_+ D_-} = 2 D \sqrt{\nu_+(1-\nu_+)}$, where $0 \le \nu_+ \equiv D_+/D = D_+/L_t \le 1$ ◮ update example weights $d_i^{t+1} = d_i^t e^{\mp w}$ Punchlines: sparse, predictive, interpretable, fast (to execute), and easy to extend, e.g., trees, flexible hypothesis spaces, $\ell_1$, $\ell_1/\ell_\infty$, . . . ¹ Duchi + Singer "Boosting with structural sparsity" ICML '09
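A one-round sketch of the closed-form update derived above (our own toy implementation; the function name and data are illustrative):

```python
import numpy as np

def adaboost_round(d, y, h_x):
    """One round of the derivation above: given example weights d_i^t, labels y,
    and weak-rule outputs h_{t+1}(x_i) in {-1,+1}, return the closed-form
    w_{t+1} = (1/2) log(D+/D-) and the updated weights d_i^{t+1} = d_i^t e^{-+w}."""
    correct = (y == h_x)
    D_plus, D_minus = d[correct].sum(), d[~correct].sum()
    w = 0.5 * np.log(D_plus / D_minus)
    return w, d * np.exp(-w * y * h_x)   # e^{-w} when correct, e^{+w} when wrong

# toy usage: a rule that is right on 8 of 10 examples => w = 0.5*log(8/2) ~ 0.69
y   = np.array([1, 1, 1, -1, -1, -1, 1, -1,  1, -1])
h_x = np.array([1, 1, 1, -1, -1, -1, 1, -1, -1,  1])
w, d_next = adaboost_round(np.ones(10), y, h_x)
```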
  116. 116. predicting people ◮ "customer journey" prediction ◮ fun covariates ◮ observational complication v structural models
  117. 117. predicting people (reminder) Figure 5: both in science and in real world, feature analysis guides future experiments
  118. 118. single copy (reminder) Figure 6: from Lecture 1
  119. 119. example in CAR (computer assisted reporting) Figure 7: Tabuchi article
  123. 123. example in CAR (computer assisted reporting) ◮ cf. Freedman's "Statistical Models and Shoe Leather"² ◮ Takata airbag fatalities ◮ 2,219 labeled³ examples from 33,204 comments ◮ cf. Box's "Science and Statistics"⁴ ² Freedman, David A. "Statistical models and shoe leather." Sociological Methodology 21.2 (1991): 291–313. ³ By Hiroko Tabuchi, a Pulitzer winner. ⁴ Box, George E. P. "Science and Statistics." Journal of the American Statistical Association 71, no. 356 (Dec. 1976): 791–799.
  124. 124. computer assisted reporting ◮ Impact Figure 8: impact
  125. 125. Lecture 3: prescriptive modeling @ NYT
  128. 128. the natural abstraction ◮ operators⁵ make decisions ◮ faster horses v. cars ◮ general insights v. optimal policies ⁵ In the sense of business deciders; that said, doctors, including those who operate, also have to make decisions, cf. personalized medicine.
  131. 131. maximizing outcome ◮ the problem: maximizing an outcome over policies. . . ◮ . . . while inferring causality from observation ◮ different from predicting outcome in absence of action/policy
  136. 136. examples ◮ observation is not experiment ◮ e.g., (Med.) smoking hurts vs. unhealthy people smoke ◮ e.g., (Med.) affluent get prescribed different meds/treatment ◮ e.g., (life) veterans earn less vs. the rich serve less⁶ ◮ e.g., (life) admitted to school vs. learn at school? ⁶ Angrist, Joshua D. (1990). "Lifetime Earnings and the Vietnam Draft Lottery: Evidence from Social Security Administrative Records". American Economic Review 80 (3): 313–336.
  140. 140. reinforcement/machine learning/graphical models ◮ key idea: model the joint $p(y,a,x)$ ◮ explore/exploit: family of joints $p_\alpha(y,a,x)$ ◮ "causality": $p_\alpha(y,a,x) = p(y|a,x)\,p_\alpha(a|x)\,p(x)$, "a causes y" ◮ nomenclature: 'response', 'policy'/'bias', 'prior' above
  141. 141. in general Figure 9: policy/bias, response, and prior define the distribution also describes both the ‘exploration’ and ‘exploitation’ distributions
  142. 142. randomized controlled trial Figure 10: RCT: ‘bias’ removed, random ‘policy’ (response and prior unaffected) also Pearl’s ‘do’ distribution: a distribution with “no arrows” pointing to the action variable.
  150. 150. POISE: calculation, estimation, optimization ◮ POISE: "policy optimization via importance sample estimation" ◮ Monte Carlo importance sampling estimation ◮ aka "off policy estimation" ◮ role of "IPW" ◮ reduction ◮ normalization ◮ hyper-parameter searching ◮ unexpected connection: personalized medicine
  156. 156. POISE setup and Goal ◮ "a causes y" $\iff \exists$ family $p_\alpha(y,a,x) = p(y|a,x)\,p_\alpha(a|x)\,p(x)$ ◮ define the off-policy/exploration distribution $p_-(y,a,x) = p(y|a,x)\,p_-(a|x)\,p(x)$ ◮ define the exploitation distribution $p_+(y,a,x) = p(y|a,x)\,p_+(a|x)\,p(x)$ ◮ Goal: maximize $E_+(Y)$ over $p_+(a|x)$ using data drawn from $p_-(y,a,x)$. notation: $\{x,a,y\} \in \{X,A,Y\}$, i.e., $E_\alpha(Y)$ is not a function of $y$
  162. 162. POISE math: IS + Monte Carlo estimation = ISE, i.e., "importance sampling estimation" ◮ $E_+(Y) \equiv \sum_{y,a,x} y\,p_+(y,a,x)$ ◮ $E_+(Y) = \sum_{y,a,x} y\,p_-(y,a,x)\,\big(p_+(y,a,x)/p_-(y,a,x)\big)$ ◮ $E_+(Y) = \sum_{y,a,x} y\,p_-(y,a,x)\,\big(p_+(a|x)/p_-(a|x)\big)$ ◮ $E_+(Y) \approx N^{-1}\sum_i y_i\,\big(p_+(a_i|x_i)/p_-(a_i|x_i)\big)$ Let's spend some time getting to know this last equation, the importance sampling estimate of the outcome in a "causal model" ("a causes y") among $\{y,a,x\}$.
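A minimal sketch of that last equation (our own illustration; `p_plus` and `p_minus` are hypothetical callables returning action propensities):

```python
import numpy as np

def ips_value_estimate(y, a, x, p_plus, p_minus):
    """E+(Y) ~= N^{-1} sum_i y_i * p+(a_i|x_i) / p-(a_i|x_i), using only
    logged triples (x_i, a_i, y_i) collected under the exploration policy p-."""
    w = np.array([p_plus(ai, xi) / p_minus(ai, xi) for ai, xi in zip(a, x)])
    return float(np.mean(np.asarray(y) * w))

# toy usage: uniform logging policy; target policy deterministically matches x > 0
rng = np.random.default_rng(2)
x = rng.normal(size=500)
a = rng.integers(0, 2, size=500)                 # logged actions, p-(a|x) = 1/2
y = (a == (x > 0)).astype(float)                 # reward 1 iff action matched the context
print(ips_value_estimate(y, a, x,
                         p_plus=lambda ai, xi: float(ai == (xi > 0)),
                         p_minus=lambda ai, xi: 0.5))   # ~1.0, the target policy's true value
```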
  167. 167. Observation (cf. Bottou⁷) ◮ factorizing $P_\pm(x)$: $\frac{P_+(x)}{P_-(x)} = \prod_{\text{factors}} \frac{P_{+\text{ but not }-}(x)}{P_{-\text{ but not }+}(x)}$ ◮ origin: importance sampling, $E_q(f) = E_p(f\,q/p)$ (as in variational methods) ◮ the "causal" model $p_\alpha(y,a,x) = p(y|a,x)\,p_\alpha(a|x)\,p(x)$ helps here ◮ the factors left over are the numerator ($p_+(a|x)$, to optimize) and the denominator ($p_-(a|x)$, to infer if not an RCT) ◮ unobserved confounders will confound us (later) ⁷ Counterfactual Reasoning and Learning Systems, arXiv:1209.2355
  172. 172. Reduction (cf. Langford⁸,⁹,¹⁰ ('05, '08, '09)) ◮ consider the numerator for a deterministic policy: $p_+(a|x) = 1[a = h(x)]$ ◮ $E_+(Y) \propto \sum_i \big(y_i/p_-(a_i|x_i)\big)\,1[a_i = h(x_i)] \equiv \sum_i w_i\,1[a_i = h(x_i)]$ ◮ Note: $1[c = d] = 1 - 1[c \neq d]$ ◮ $\therefore E_+(Y) \propto \text{constant} - \sum_i w_i\,1[a_i \neq h(x_i)]$ ◮ ∴ reduces policy optimization to (weighted) classification (see the sketch below) ⁸ Langford & Zadrozny "Relating Reinforcement Learning Performance to Classification Performance" ICML 2005 ⁹ Beygelzimer & Langford "The offset tree for learning with partial labels" KDD 2009 ¹⁰ Tutorial on "Reductions" (including at ICML 2009)
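A minimal sketch of the reduction, assuming scikit-learn, a binary action space, and nonnegative rewards (our own illustration, not Langford's code):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def policy_via_reduction(X, a, y, prop):
    """Reduce policy optimization to weighted classification: predict the
    logged action a_i from x_i with per-example weight w_i = y_i / p-(a_i|x_i)
    (assumes nonnegative rewards y); the fitted classifier *is* the policy h."""
    w = np.asarray(y) / np.asarray(prop)
    clf = LogisticRegression()
    clf.fit(X, a, sample_weight=w)
    return clf  # h(x) = clf.predict(x)
```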
  179. 179. Reduction w/optimistic complication ◮ Prescription $\iff$ classification: $L = \sum_i w_i\,1[a_i = h(x_i)]$ ◮ weight $w_i = y_i/p_-(a_i|x_i)$, inferred or from an RCT ◮ destroys measure by treating $p_-(a|x)$ differently than $1/p_-(a|x)$ ◮ normalize as $\tilde{L} \equiv \frac{\sum_i y_i\,1[a_i = h(x_i)]/p_-(a_i|x_i)}{\sum_i 1[a_i = h(x_i)]/p_-(a_i|x_i)}$ ◮ destroys lovely reduction ◮ simply¹¹ $L(\lambda) = \sum_i (y_i - \lambda)\,1[a_i = h(x_i)]/p_-(a_i|x_i)$ ◮ hidden here is a 2nd parameter in the classification, ∴ harder search ¹¹ Suggestion by Dan Hsu
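A sketch of the normalized estimator $\tilde{L}$ above (self-normalized IPS; our own illustration, with `h` a hypothetical candidate-policy callable and `prop` the logged propensities):

```python
import numpy as np

def snips(y, a, x, h, prop):
    """Self-normalized IPS estimate L~ of a deterministic policy h:
    sum_i y_i 1[a_i = h(x_i)]/prop_i  /  sum_i 1[a_i = h(x_i)]/prop_i,
    where prop_i = p-(a_i|x_i); assumes h matches at least one logged action."""
    match = np.array([float(ai == h(xi)) for ai, xi in zip(a, x)])
    w = match / np.asarray(prop)
    return float(np.sum(np.asarray(y) * w) / np.sum(w))  # ratio treats p- consistently
```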
  183. 183. POISE punchlines ◮ allows policy planning even with implicit logged exploration data¹² ◮ e.g., two hospital story ◮ "personalized medicine" is also a policy ◮ abundant data available, under-explored IMHO ¹² Strehl, Alex, et al. "Learning from logged implicit exploration data." Advances in Neural Information Processing Systems. 2010.
  189. 189. tangent: causality as told by an economist (different, related goal) ◮ they think in terms of ATE/ITE instead of policy ◮ ATE: $\tau \equiv E_0(Y|a{=}1) - E_0(Y|a{=}0) \equiv Q(a{=}1) - Q(a{=}0)$ ◮ CATE, aka Individualized Treatment Effect (ITE): $\tau(x) \equiv E_0(Y|a{=}1,x) - E_0(Y|a{=}0,x) \equiv Q(a{=}1,x) - Q(a{=}0,x)$
  194. 194. Q-note: "generalizing" Monte Carlo w/kernels ◮ MC: $E_p(f) = \sum_x p(x) f(x) \approx N^{-1}\sum_{i \sim p} f(x_i)$ ◮ K: $p \approx N^{-1}\sum_i K(x|x_i)$ ◮ $\Rightarrow \sum_x p(x) f(x) \approx N^{-1}\sum_i \sum_x f(x)\,K(x|x_i)$ ◮ $K$ can be any normalized function, e.g., $K(x|x_i) = \delta_{x,x_i}$, which yields MC. ◮ multivariate: $E_p(f) \approx N^{-1}\sum_i \sum_{y,a,x} f(y,a,x)\,K_1(y|y_i)\,K_2(a|a_i)\,K_3(x|x_i)$
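A small numerical sketch of the kernel-smoothed Monte Carlo idea, using a Gaussian $K$ on a grid as an illustrative choice (ours, not the slides'):

```python
import numpy as np

def kernel_mc(f, samples, grid, bandwidth=0.3):
    """E_p[f] ~= N^{-1} sum_i integral f(x) K(x|x_i) dx, with a Gaussian K
    normalized on a grid; K -> delta_{x, x_i} recovers plain Monte Carlo."""
    dx = grid[1] - grid[0]
    K = np.exp(-0.5 * ((grid[None, :] - samples[:, None]) / bandwidth) ** 2)
    K /= K.sum(axis=1, keepdims=True) * dx        # make each kernel integrate to 1
    return float(np.mean(K @ (f(grid) * dx)))     # average the smoothed integrals

rng = np.random.default_rng(3)
xs = rng.normal(size=2000)
grid = np.linspace(-6.0, 6.0, 601)
print(kernel_mc(lambda x: x**2, xs, grid))        # ~1 for N(0,1), up to O(bandwidth^2) bias
```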
  201. 201. Q-note: application w/strata+matching, setup Helps think about economists' approach: ◮ $Q(a,x) \equiv E(Y|a,x) = \sum_y y\,p(y|a,x) = \sum_y y\,\frac{p_-(y,a,x)}{p_-(a|x)\,p(x)}$ ◮ $= \frac{1}{p_-(a|x)\,p(x)} \sum_y y\,p_-(y,a,x)$ ◮ stratify $x$ using $z(x)$ such that $\cup z = X$ and $z \cap z' = \emptyset$ for $z \ne z'$ ◮ $n(x) = \sum_i 1[z(x_i) = z(x)]$ = number of points in $x$'s stratum ◮ $\Omega(x) = \sum_{x'} 1[z(x') = z(x)]$ = area of $x$'s stratum ◮ $\therefore K_3(x|x_i) = 1[z(x) = z(x_i)]/\Omega(x)$ ◮ as in MC, $K_1(y|y_i) = \delta_{y,y_i}$, $K_2(a|a_i) = \delta_{a,a_i}$
  211. 211. Q-note: application w/strata+matching, payoff ◮ $\sum_y y\,p_-(y,a,x) \approx N^{-1}\Omega(x)^{-1} \sum_{a_i = a,\, z(x_i) = z(x)} y_i$ ◮ $p(x) \approx (n(x)/N)\,\Omega(x)^{-1}$ ◮ $\therefore Q(a,x) \approx p_-(a|x)^{-1}\,n(x)^{-1} \sum_{a_i = a,\, z(x_i) = z(x)} y_i$ "matching" means: choose each $z$ to contain 1 positive example & 1 negative example, so ◮ $p_-(a|x) \approx 1/2$, $n(x) = 2$ ◮ $\therefore \tau(x) = Q(a{=}1,x) - Q(a{=}0,x) = y_1(x) - y_0(x)$ (see the sketch below) ◮ z-generalizations: graphs, digraphs, k-NN, "matching" ◮ K-generalizations: continuous $a$, any metric or similarity you like, . . . IMHO underexplored
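A sketch of the matching recipe above, with nearest-neighbor pairing as the (illustrative) choice of strata; each pair contributes $y_1(x) - y_0(x)$:

```python
import numpy as np

def matched_ate(X1, y1, X0, y0):
    """Match each treated unit (rows of X1) to its nearest control (rows of X0)
    and average the pairwise differences y_1(x) - y_0(x) (an ATT-style estimate)."""
    taus = []
    for xi, yi in zip(X1, y1):
        j = int(np.argmin(np.linalg.norm(X0 - xi, axis=1)))  # nearest-neighbor stratum
        taus.append(yi - y0[j])
    return float(np.mean(taus))
```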
  214. 214. causality, as understood in marketing ◮ a/b testing and RCT ◮ yield optimization ◮ Lorenz curve (vs. ROC plots) Figure 11: Blattberg, Robert C., Byung-Do Kim, and Scott A. Neslin. Database Marketing, Springer New York, 2008
  218. 218. unobserved confounders vs. "causality" modeling ◮ truth: $p_\alpha(y,a,x,u) = p(y|a,x,u)\,p_\alpha(a|x,u)\,p(x,u)$ ◮ but: $p_+(y,a,x,u) = p(y|a,x,u)\,p_+(a|x)\,p(x,u)$ ◮ $E_+(Y) \equiv \sum_{y,a,x,u} y\,p_+(y,a,x,u) \approx N^{-1}\sum_{i \sim p_-} y_i\,p_+(a_i|x_i)/p_-(a_i|x_i,u_i)$ ◮ the denominator cannot be inferred; ignore at your peril
  225. 225. cautionary tale problem: Simpson's paradox ◮ a: admissions (a=1: admitted, a=0: declined) ◮ x: gender (x=1: female, x=0: male) ◮ lawsuit (1973): $.44 = p(a{=}1|x{=}0) > p(a{=}1|x{=}1) = .35$ ◮ 'resolved' by Bickel (1975)¹³ (see also Pearl¹⁴) ◮ u: unobserved department they applied to ◮ $p(a|x) = \sum_{u=1}^{6} p(a|x,u)\,p(u|x)$ ◮ e.g., gender-blind: $p(a|1) - p(a|0) = \sum_u p(a|u)\,\big(p(u|1) - p(u|0)\big)$ ¹³ P.J. Bickel, E.A. Hammel and J.W. O'Connell (1975). "Sex Bias in Graduate Admissions: Data From Berkeley". Science 187 (4175): 398–404. ¹⁴ Pearl, Judea (December 2013). "Understanding Simpson's paradox". UCLA Cognitive Systems Laboratory, Technical Report R-414.
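A toy numerical illustration of the paradox; the numbers below are made up for the sketch, not the actual Berkeley data:

```python
# (x, u) -> (number of applicants, admit rate p(a=1|x,u)); within each
# department u, women (x=1) are admitted at a higher rate, yet the aggregate
# rates reverse because women apply mostly to the harder department.
applicants = {
    (0, 0): (800, 0.60), (0, 1): (200, 0.20),   # men: mostly the easier dept u=0
    (1, 0): (200, 0.65), (1, 1): (800, 0.25),   # women: mostly the harder dept u=1
}
for x in (0, 1):
    n = sum(applicants[(x, u)][0] for u in (0, 1))
    admitted = sum(n_xu * rate for n_xu, rate in (applicants[(x, u)] for u in (0, 1)))
    print(f"p(a=1|x={x}) = {admitted / n:.2f}")   # men 0.52 > women 0.33
```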
  231. 231. confounded approach: quasi-experiments + instruments¹⁷ ◮ Q: does engagement drive retention? (NYT, NFLX, . . . ) ◮ we don't directly control engagement ◮ nonetheless useful, since many things can influence it ◮ Q: does serving in the Vietnam war decrease earnings?¹⁵ ◮ the US didn't directly control serving in Vietnam, either¹⁶ ◮ requires strong assumptions, including a linear model ¹⁵ Angrist, Joshua D. "Lifetime earnings and the Vietnam era draft lottery: evidence from social security administrative records." The American Economic Review (1990): 313–336. ¹⁶ cf. George Bush, Donald Trump, Bill Clinton, Dick Cheney. . . ¹⁷ I thank Sinan Aral, MIT Sloan, for bringing this to my attention.
  232. 232. IV: graphical model assumption Figure 12: independence assumption
  233. 233. IV: graphical model assumption (sideways) Figure 13: independence assumption
  241. 241. IV: review s/OLS/MOM/ ($E$ is the empirical average) ◮ $a$ endogenous ◮ e.g., $\exists u$ s.t. $p(y|a,x,u)$, $p(a|x,u)$ ◮ linear ansatz: $y = \beta^T a + \epsilon$ ◮ if $a$ exogenous (e.g., OLS), use $E[Y A_j] = E[\beta^T A\, A_j] + E[\epsilon A_j]$ (note that $E[A_j A_k]$ gives a square matrix; invert for $\beta$) ◮ add instrument $x$ uncorrelated with $\epsilon$ ◮ $E[Y X_k] = E[\beta^T A\, X_k] + E[\epsilon]E[X_k]$ ◮ $E[Y] = E[\beta^T A] + E[\epsilon]$ (from the ansatz) ◮ $C(Y, X_k) = \beta^T C(A, X_k)$: not an "inversion" problem, requires "two stage regression"
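A minimal two-stage regression sketch for this setup (single endogenous $a$, single instrument $x$; our own illustration):

```python
import numpy as np

def two_stage_ls(y, a, x):
    """Two-stage regression for y = beta*a + eps with instrument x:
    stage 1 fits a ~ x (OLS), stage 2 fits y ~ a_hat; consistent with
    C(Y, X) = beta * C(A, X) in this scalar case."""
    Z = np.column_stack([np.ones_like(x), x])
    a_hat = Z @ np.linalg.lstsq(Z, a, rcond=None)[0]    # stage 1: project a onto x
    A = np.column_stack([np.ones_like(a_hat), a_hat])
    return np.linalg.lstsq(A, y, rcond=None)[0][1]      # stage 2: slope on a_hat

# equivalently, with one instrument: beta_hat = np.cov(y, x)[0, 1] / np.cov(a, x)[0, 1]
```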
  244. 244. IV: binary, binary case (aka "Wald estimator") ◮ $y = \beta a + \epsilon$ ◮ $E(Y|x) = \beta E(A|x) + E(\epsilon)$, evaluate at $x = \{0,1\}$ ◮ $\beta = \big(E(Y|x{=}1) - E(Y|x{=}0)\big)\big/\big(E(A|x{=}1) - E(A|x{=}0)\big)$.
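And the binary-instrument case reduces to a two-line estimate (our sketch):

```python
import numpy as np

def wald_estimate(y, a, x):
    """beta = (E[Y|x=1] - E[Y|x=0]) / (E[A|x=1] - E[A|x=0]) for a binary instrument x."""
    y, a, x = map(np.asarray, (y, a, x))
    dy = y[x == 1].mean() - y[x == 0].mean()
    da = a[x == 1].mean() - a[x == 0].mean()
    return dy / da
```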
  245. 245. bandits: obligatory slide Figure 14: almost all the talks I’ve gone to on bandits have this image
  254. 254. bandits ◮ wide applicability: humane clinical trials, targeting, . . . ◮ replace meetings with code ◮ requires software engineering to replace decisions with, e.g., Javascript ◮ most useful if decisions or items get "stale" quickly ◮ less useful for one-off, major decisions to be "interpreted" examples ◮ ε-greedy (no context, aka 'vanilla', aka 'context-free') ◮ UCB1 (2002) (no context) + LinUCB (with context) ◮ Thompson Sampling (1933)¹⁸,¹⁹,²⁰ (general, with or without context) ¹⁸ Thompson, William R. "On the likelihood that one unknown probability exceeds another in view of the evidence of two samples". Biometrika, 25(3–4):285–294, 1933. ¹⁹ aka "probability matching", "posterior sampling" ²⁰ cf. "Bayesian Bandit Explorer" (link)
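Minimal context-free sketches of the first two examples (our own toy implementations):

```python
import numpy as np

def eps_greedy(means, eps=0.1, rng=None):
    """With probability eps explore a uniform arm; otherwise exploit the best empirical mean."""
    rng = rng or np.random.default_rng()
    if rng.random() < eps:
        return int(rng.integers(len(means)))
    return int(np.argmax(means))

def ucb1(means, counts, t):
    """UCB1 (Auer et al. 2002): optimism bonus sqrt(2 ln t / n_a); assumes counts >= 1."""
    return int(np.argmax(means + np.sqrt(2.0 * np.log(t) / counts)))
```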
  266. 266. TS: connecting w/"generative causal modeling" 0 ◮ WAS $p(y,x,a) = p(y|x,a)\,p_\alpha(a|x)\,p(x)$ ◮ these 3 terms were treated by: ◮ response $p(y|a,x)$: avoid regression/inferring by using importance sampling ◮ policy $p_\alpha(a|x)$: optimize ours, infer theirs ◮ (NB: ours was deterministic: $p(a|x) = 1[a = h(x)]$) ◮ prior $p(x)$: either avoid by importance sampling or estimate via kernel methods ◮ in the economics approach we focus on: ◮ $\tau(\ldots) \equiv Q(a{=}1,\ldots) - Q(a{=}0,\ldots)$, the "treatment effect" ◮ where $Q(a,\ldots) = \sum_y y\,p(y|\ldots)$ In Thompson sampling we will generate 1 datum at a time, by ◮ asserting a parameterized generative model for $p(y|a,x,\theta)$ ◮ using a deterministic but averaged policy
  278. 278. TS: connecting w/"generative causal modeling" 1 ◮ model the true world response function $p(y|a,x)$ parametrically as $p(y|a,x,\theta^*)$ ◮ (i.e., $\theta^*$ is the true value of the parameter)²¹ ◮ if you knew $\theta$: ◮ could compute $Q(a,x,\theta) \equiv \sum_y y\,p(y|x,a,\theta)$ directly ◮ then choose $h(x;\theta) = \mathrm{argmax}_a Q(a,x,\theta)$ ◮ inducing the policy $p(a|x,\theta) = 1[a = h(x;\theta)]$ ◮ idea: use prior data $D = \{y,a,x\}_1^t$ to define a non-deterministic policy: ◮ $p(a|x) = \int d\theta\,p(a|x,\theta)\,p(\theta|D)$ ◮ $p(a|x) = \int d\theta\,1[a = \mathrm{argmax}_{a'} Q(a',x,\theta)]\,p(\theta|D)$ ◮ hold up: ◮ Q1: what's $p(\theta|D)$? ◮ Q2: how am I going to evaluate this integral? ²¹ Note that $\theta$ is a vector, with components for each action.
291. 291. TS: connecting w/“generative causal modeling” 2
◮ Q1: what’s p(θ|D)?
◮ Q2: how am I going to evaluate this integral?
◮ A1: p(θ|D) is definable by choosing a prior p(θ|α) and a likelihood on y given by the (modeled, parameterized) response p(y|a, x, θ).
  ◮ (now you’re not only generative, you’re Bayesian.)
  ◮ p(θ|D) = p(θ | {y}_1^t, {a}_1^t, {x}_1^t, α)
  ◮ ∝ p({y}_1^t | {a}_1^t, {x}_1^t, θ) p(θ|α)
  ◮ = p(θ|α) Π_t p(y_t|a_t, x_t, θ)
  ◮ warning 1: sometimes people write “p(D|θ)”, but we don’t need p(a|θ) or p(x|θ) here
  ◮ warning 2: don’t need a historical record of θ_t.
  ◮ (we used Bayes’ rule, but only in θ and y.)
◮ A2: evaluate the integral by N = 1 Monte Carlo22
  ◮ take 1 sample “θ_t” of θ from p(θ|D)
  ◮ a_t = h(x_t; θ_t) = argmax_a Q(a, x_t, θ_t)
22 AFAIK it is an open research area what happens when you replace N = 1 with N = N_0 t^{−ν}.
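To make the N = 1 Monte Carlo loop concrete, here is a minimal sketch assuming (purely for illustration) Gaussian rewards with known noise and a conjugate normal prior per action; the simulated world, names, and constants are assumptions, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
true_mean = np.array([0.2, 0.5, 0.4])   # the world's theta*, hidden from the learner
K, noise = len(true_mean), 1.0

# A1: prior p(theta|alpha) = N(0, tau0^2) per action; with a Gaussian likelihood
# the posterior p(theta|D) stays Gaussian, so sampling it is easy.
tau0 = 1.0
n = np.zeros(K)          # pulls per action
s = np.zeros(K)          # summed rewards per action

for t in range(5_000):
    prec = 1 / tau0**2 + n / noise**2        # posterior precision per action
    post_mean = (s / noise**2) / prec
    post_sd = np.sqrt(1 / prec)
    # A2: N = 1 Monte Carlo -- draw one theta_t ~ p(theta|D) ...
    theta_t = rng.normal(post_mean, post_sd)
    # ... then act greedily under the sample: a_t = argmax_a Q(a, theta_t)
    a = int(np.argmax(theta_t))
    y = rng.normal(true_mean[a], noise)      # world responds
    n[a] += 1
    s[a] += y

print("pulls per action:", n.astype(int))
```

The randomness in the posterior draw is what keeps the policy non-deterministic: under-explored actions retain wide posteriors and so still win the argmax occasionally.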
301. 301. That sounds hard. No, just general. Let’s do a toy case:
◮ y ∈ {0, 1},
◮ no context x,
◮ Bernoulli (coin flipping); keep track of
  ◮ S_a ≡ number of successes flipping coin a
  ◮ F_a ≡ number of failures flipping coin a
Then
◮ p(θ|D) ∝ p(θ|α) Π_t p(y_t|a_t, θ)
◮ = Π_a θ_a^{α−1} (1 − θ_a)^{β−1} · Π_t θ_{a_t}^{y_t} (1 − θ_{a_t})^{1−y_t}
◮ = Π_a θ_a^{α+S_a−1} (1 − θ_a)^{β+F_a−1}
◮ ∴ θ_a ∼ Beta(α + S_a, β + F_a)
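The toy case in code, a minimal sketch (the true click rates, prior, and horizon are made-up illustration values):

```python
import numpy as np

rng = np.random.default_rng(0)

true_theta = np.array([0.05, 0.04, 0.10])   # unknown to the algorithm
K = len(true_theta)
alpha, beta = 1.0, 1.0                      # Beta(1,1) = uniform prior
S = np.zeros(K)                             # successes per coin
F = np.zeros(K)                             # failures per coin

for t in range(10_000):
    # N=1 Monte Carlo: one posterior draw theta_a ~ Beta(alpha+S_a, beta+F_a)
    theta = rng.beta(alpha + S, beta + F)
    a = int(np.argmax(theta))               # act greedily under the sample
    y = rng.random() < true_theta[a]        # world flips coin a
    S[a] += y
    F[a] += 1 - y

print("pulls per coin:", (S + F).astype(int))
print("posterior means:", (alpha + S) / (alpha + beta + S + F))
```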
302. 302. Thompson sampling: results (2011) Figure 15: Chapelle and Li 2011
303. 303. TS: words Figure 16: from Chapelle and Li 2011
304. 304. TS: p-code Figure 17: from Chapelle and Li 2011
305. 305. TS: Bernoulli bandit p-code
306. 306. TS: Bernoulli bandit p-code (results) Figure 19: from Chapelle and Li 2011
307. 307. UCB1 (2002), p-code Figure 20: UCB1 from Auer, Peter, Nicolò Cesa-Bianchi, and Paul Fischer. “Finite-time analysis of the multiarmed bandit problem.” Machine Learning 47.2-3 (2002): 235–256.
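For comparison, a sketch of UCB1 following the Auer et al. pseudocode: play each arm once, then pull the arm maximizing the empirical mean plus the √(2 ln t / n_a) confidence bonus (the simulated arms are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
true_theta = np.array([0.05, 0.04, 0.10])
K, T = len(true_theta), 10_000

n = np.zeros(K)      # pulls per arm
s = np.zeros(K)      # summed rewards per arm

# initialization: play each arm once
for a in range(K):
    s[a] += rng.random() < true_theta[a]
    n[a] += 1

for t in range(K, T):
    ucb = s / n + np.sqrt(2 * np.log(t) / n)   # empirical mean + confidence bonus
    a = int(np.argmax(ucb))
    s[a] += rng.random() < true_theta[a]
    n[a] += 1

print("pulls per arm:", n.astype(int))
```

Note the contrast with Thompson sampling: the exploration here is deterministic, driven by the shrinking confidence bonus rather than by posterior draws.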
308. 308. TS: with context Figure 21: from Chapelle and Li 2011
  309. 309. LinUCB: UCB with context Figure 22: LinUCB
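A sketch of the disjoint-model LinUCB idea, assuming a made-up linear-reward world: per-arm ridge regression gives θ̂_a, and the bonus is α √(xᵀ A_a⁻¹ x); the dimensions and constants are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, alpha = 5, 3, 1.0                  # context dim, arms, exploration weight

# hypothetical linear world: E[r | a, x] = w_a . x  (for simulation only)
W = rng.normal(size=(K, d)) / np.sqrt(d)

A = np.stack([np.eye(d)] * K)            # per-arm ridge Gram matrices
b = np.zeros((K, d))                     # per-arm response vectors

for t in range(5_000):
    x = rng.normal(size=d)
    Ainv = np.linalg.inv(A)                              # (K, d, d)
    theta = np.einsum("kij,kj->ki", Ainv, b)             # ridge estimates
    bonus = alpha * np.sqrt(np.einsum("i,kij,j->k", x, Ainv, x))
    p = theta @ x + bonus                                # UCB score per arm
    a = int(np.argmax(p))
    r = W[a] @ x + 0.1 * rng.normal()                    # noisy reward
    A[a] += np.outer(x, x)                               # rank-one update
    b[a] += r * x
```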
  310. 310. TS: with context (results)
311. 311. Bandits: Regret via Lai and Robbins (1985) Figure 24: Lai and Robbins (1985)
312. 312. Thompson sampling (1933) and optimality (2013) Figure 25: TS result from S. Agrawal, N. Goyal, “Further optimal regret bounds for Thompson Sampling”, AISTATS 2013; see also Agrawal, Shipra, and Navin Goyal. “Analysis of Thompson Sampling for the Multi-armed Bandit Problem.” COLT 2012, and Emilie Kaufmann, Nathaniel Korda, and Rémi Munos. “Thompson Sampling: An Asymptotically Optimal Finite-time Analysis.” In Algorithmic Learning Theory, pages 199–213. Springer, 2012.
313. 313. other ‘Causalities’: structure learning Figure 26: from Heckerman 1995. D. Heckerman. A Tutorial on Learning with Bayesian Networks. Technical Report MSR-TR-95-06, Microsoft Research, March 1995.
317. 317. other ‘Causalities’: potential outcomes
◮ model distribution of p(y_i(1), y_i(0), a_i, x_i)
◮ “action” replaced by “observed outcome”
◮ aka the Neyman–Rubin causal model: Neyman (’23); Rubin (’74)
◮ see Morgan + Winship24 for connections between frameworks
24 Morgan, Stephen L., and Christopher Winship. Counterfactuals and Causal Inference. Cambridge University Press, 2014.
  318. 318. Lecture 4: descriptive modeling @ NYT
326. 326. review: (latent) inference and clustering
◮ what does k-means mean?
  ◮ given x_i ∈ R^D
  ◮ given d : R^D → R^1
  ◮ assign z_i
◮ generative modeling gives meaning
  ◮ given p(x|z, θ)
  ◮ maximize p(x|θ)
  ◮ output assignment p(z|x, θ)
335. 335. actual math
◮ define P ≡ p(x, z|θ)
◮ log-likelihood L ≡ log p(x|θ) = log Σ_z P = log E_q[P/q] (cf. importance sampling)
◮ Jensen’s: L ≥ L̃ ≡ E_q[log P/q] = E_q[log P] + H[q] = −(U − H) = −F
◮ analogy to free energy in physics
◮ alternate optimization on θ and on q
◮ NB: the q step gives q(z) = p(z|x, θ)
◮ NB: log P is convenient for independent examples w/ exponential families
  ◮ e.g., GMMs: µ_k ← E[x|z] and σ_k² ← E[(x − µ_k)²|z] are sufficient statistics
  ◮ e.g., LDA: word counts are sufficient statistics
343. 343. tangent: more math on GMMs, part 1
Energy U (to be minimized):
◮ −U ≡ E_q[log P] = Σ_z Σ_i q_i(z) log P(x_i, z_i) ≡ −(U_x + U_z)
◮ −U_x ≡ Σ_z Σ_i q_i(z) log p(x_i|z_i)
◮ = Σ_i Σ_z q_i(z) Σ_k 1[z_i = k] log p(x_i|z_i)
◮ define r_ik ≡ Σ_z q_i(z) 1[z_i = k]
◮ −U_x = Σ_{i,k} r_ik log p(x_i|k).
◮ Gaussian25 ⇒ −U_x = Σ_{i,k} r_ik [−½(x_i − µ_k)² λ_k + ½ ln λ_k − ½ ln 2π]
simple to minimize for parameters ϑ = {µ_k, λ_k}
25 math is simpler if you work with λ_k ≡ σ_k^{−2}
346. 346. tangent: more math on GMMs, part 2
◮ −U_x = Σ_{i,k} r_ik [−½(x_i − µ_k)² λ_k + ½ ln λ_k − ½ ln 2π]
◮ µ_k ← E[x|k] solves µ_k Σ_i r_ik = Σ_i r_ik x_i
◮ λ_k^{−1} ← E[(x − µ_k)²|k] solves Σ_i r_ik (x_i − µ_k)² = λ_k^{−1} Σ_i r_ik
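Putting the E and M steps together for a 1-D GMM gives a short sketch (the toy data and initialization are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# toy 1-D data from two well-separated components
x = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(3, 1.0, 700)])
N, K = len(x), 2

pi = np.full(K, 1 / K)                   # mixing weights p(z = k)
mu = rng.choice(x, K, replace=False)     # component means
lam = np.ones(K)                         # precisions lambda_k = 1/sigma_k^2

for step in range(100):
    # E step: responsibilities r_ik = q_i(z = k) = p(z = k | x_i, theta)
    log_p = (np.log(pi) + 0.5 * np.log(lam)
             - 0.5 * lam * (x[:, None] - mu) ** 2)       # (N, K), up to a const
    log_p -= log_p.max(axis=1, keepdims=True)            # for numerical stability
    r = np.exp(log_p)
    r /= r.sum(axis=1, keepdims=True)

    # M step: moment matching, exactly the fixed-point equations above
    Nk = r.sum(axis=0)
    pi = Nk / N
    mu = (r * x[:, None]).sum(axis=0) / Nk               # mu_k <- E[x|k]
    lam = Nk / (r * (x[:, None] - mu) ** 2).sum(axis=0)  # 1/lam_k <- E[(x-mu_k)^2|k]

print("means:", mu, "precisions:", lam, "weights:", pi)
```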
356. 356. tangent: Gaussians ∈ exponential family27
◮ as before, −U = Σ_{i,k} r_ik log p(x_i|k)
◮ define p(x_i|k) = exp(η(θ) · T(x) − A(θ) + B(x))
◮ e.g., the Gaussian case26:
  ◮ T_1 = x,
  ◮ T_2 = x²
  ◮ η_1 = µ/σ² = µλ
  ◮ η_2 = −½λ = −1/(2σ²)
  ◮ A = λµ²/2 − ½ ln λ
  ◮ exp(B(x)) = (2π)^{−1/2}
◮ note that in a mixture model, there are separate η (and thus A(η)) for each value of z
26 Choosing η(θ) = η is called ‘canonical form’
27 NB: Gaussians ∈ exponential family, GMM ∉ exponential family! (Thanks to Eszter Vértes for pointing out this error in an earlier title.)
360. 360. tangent: variational joy ∈ exponential family
◮ as before, −U = Σ_{i,k} r_ik (η_k^T T(x_i) − A(η_k) + B(x_i))
◮ η_{k,α} solves Σ_i r_ik T_{k,α}(x_i) = (∂A(η_k)/∂η_{k,α}) Σ_i r_ik (canonical)
◮ ∴ ∂A(η_k)/∂η_{k,α} ← E[T_{k,α}|k] (canonical)
◮ nice connection w/ physics, esp. mean field theory28
28 read MacKay, David JC. Information Theory, Inference and Learning Algorithms, Cambridge University Press, 2003 to learn more. Actually you should read it regardless.
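A quick numeric check of the moment-matching identity ∂A/∂η ← E[T] for the canonical Gaussian above (the parameter values are arbitrary):

```python
import numpy as np

# Canonical Gaussian: eta1 = mu*lam, eta2 = -lam/2, and rewriting
# A = lam*mu^2/2 - (1/2) ln lam in natural parameters gives
# A(eta) = -eta1^2/(4*eta2) - (1/2) ln(-2*eta2).
mu, lam = 1.3, 0.7
eta = np.array([mu * lam, -lam / 2])

def A(e):
    return -e[0] ** 2 / (4 * e[1]) - 0.5 * np.log(-2 * e[1])

# finite-difference gradient of A w.r.t. (eta1, eta2)
eps = 1e-6
grad = np.array([
    (A(eta + eps * np.eye(2)[i]) - A(eta - eps * np.eye(2)[i])) / (2 * eps)
    for i in range(2)
])

# moments of N(mu, 1/lam): E[T1] = E[x] = mu, E[T2] = E[x^2] = mu^2 + 1/lam
print(grad, [mu, mu ** 2 + 1 / lam])   # the two should agree
```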
365. 365. clustering and inference: GMM/k-means case study
◮ generative model gives meaning and optimization
◮ large freedom to choose different optimization approaches
  ◮ e.g., hard clustering limit
  ◮ e.g., streaming solutions
  ◮ e.g., stochastic gradient methods
371. 371. general framework: E+M/variational
◮ e.g., GMM + hard clustering gives k-means (see the sketch after this list)
◮ e.g., some favorite applications:
  ◮ HMMs
  ◮ vbmod: arXiv:0709.3512
  ◮ ebfret: ebfret.github.io
  ◮ EDHMM: edhmm.github.io
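As referenced in the list above: sending λ_k → ∞ hardens the responsibilities r_ik into nearest-mean assignments, and EM for the GMM collapses to k-means. A minimal sketch (same toy data as the GMM sketch; assumes no cluster goes empty):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(3, 1.0, 700)])
K = 2
mu = rng.choice(x, K, replace=False)    # initialize centroids at data points

for step in range(50):
    # hard E step: r_ik -> 1[k = argmin_k (x_i - mu_k)^2]
    z = np.argmin((x[:, None] - mu) ** 2, axis=1)
    # M step unchanged: mu_k <- mean of the points assigned to k
    mu = np.array([x[z == k].mean() for k in range(K)])

print("centroids:", mu)
```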
  372. 372. example application: LDA+topics Figure 27: From Blei 2003
373. 373. rec engine via CTM Figure 28: From Blei 2011
  374. 374. recall: recommendation via factoring Figure 29: From Blei 2011
  375. 375. CTM: combined loss function Figure 30: From Blei 2011
  376. 376. CTM: updates for factors Figure 31: From Blei 2011
  377. 377. CTM: (via Jensen’s, again) bound on loss Figure 32: From Blei 2011
378. 378. Lecture 5: data product
384. 384. data science and design thinking
◮ knowing customer
◮ right tool for right job
◮ practical matters:
  ◮ munging
  ◮ data ops
  ◮ ML in prod
  385. 385. Thanks! Thanks MLSS students for your great questions; please contact me @chrishwiggins or chris.wiggins@{nytimes,gmail}.com with any questions, comments, or suggestions!
