Slideshare uses cookies to improve functionality and performance, and to provide you with relevant advertising. If you continue browsing the site, you agree to the use of cookies on this website. See our User Agreement and Privacy Policy.

Slideshare uses cookies to improve functionality and performance, and to provide you with relevant advertising. If you continue browsing the site, you agree to the use of cookies on this website. See our Privacy Policy and User Agreement for details.

Successfully reported this slideshow.

Like this presentation? Why not share!

- Ds industri hilir migas 1, 2 by Zoelfikar Mannoiy... 3632 views
- Resumo cubo rubiks by João Silva 221 views
- Going viral by FINN 671 views
- CTG Ed 542_T-28-29 by Hafizh Fauzan 47 views
- The Ideal Proxy Statement by Stanford GSB Corp... 875 views
- Guilford standards for promotion an... by Dave Dobson 128 views

3,335 views

Published on

"Data Science @ The New York Times".

Topics include:

descriptive/predictive/prescriptive modeling

boosting & surrogate loss functions pp103-117

causality via generative models (p141-215)

POISE: “policy optimization via importance sample estimation”

connection w/causal inference & matching (p188-215)

brief intro to instrumental variables (p230-248)

bandits (p249-316)

connecting thompson sampling with generative modeling (p259-305)

Published in:
Engineering

No Downloads

Total views

3,335

On SlideShare

0

From Embeds

0

Number of Embeds

133

Shares

0

Downloads

87

Comments

0

Likes

10

No embeds

No notes for slide

- 1. data science @ NYT Chris Wiggins Aug 8/9, 2016
- 2. Outline 1. overview of DS@NYT
- 3. Outline 1. overview of DS@NYT 2. prediction + supervised learning
- 4. Outline 1. overview of DS@NYT 2. prediction + supervised learning 3. prescription, causality, and RL
- 5. Outline 1. overview of DS@NYT 2. prediction + supervised learning 3. prescription, causality, and RL 4. description + inference
- 6. Outline 1. overview of DS@NYT 2. prediction + supervised learning 3. prescription, causality, and RL 4. description + inference 5. (if interest) designing data products
- 7. 0. Thank the organizers! Figure 1: prepping slides until last minute
- 8. Lecture 1: overview of ds@NYT
- 9. data science @ The New York Times chris.wiggins@columbia.edu chris.wiggins@nytimes.com @chrishwiggins references: http://bit.ly/stanf16
- 10. data science @ The New York Times
- 11. data science: searches
- 12. data science: mindset & toolset drew conway, 2010
- 13. modern history: 2009
- 14. modern history: 2009
- 15. biology: 1892 vs. 1995
- 16. biology: 1892 vs. 1995 biology changed for good.
- 17. biology: 1892 vs. 1995 new toolset, new mindset
- 18. genetics: 1837 vs. 2012 ML toolset; data science mindset
- 19. genetics: 1837 vs. 2012
- 20. genetics: 1837 vs. 2012 ML toolset; data science mindset arxiv.org/abs/1105.5821 ; github.com/rajanil/mkboost
- 21. data science: mindset & toolset
- 22. 1851
- 23. news: 20th century church state
- 24. church
- 25. church
- 26. church
- 27. news: 20th century church state
- 28. news: 21st century church state data
- 29. 1851 1996 newspapering: 1851 vs. 1996
- 30. example: millions of views per hour2015
- 31. "...social activities generate large quantities of potentially valuable data...The data were not generated for the purpose of learning; however, the potential for learning is great’’
- 32. "...social activities generate large quantities of potentially valuable data...The data were not generated for the purpose of learning; however, the potential for learning is great’’ - J Chambers, Bell Labs,1993, “GLS”
- 33. data science: the web
- 34. data science: the web is your “online presence”
- 35. data science: the web is a microscope
- 36. data science: the web is an experimental tool
- 37. data science: the web is an optimization tool
- 38. 1851 1996 newspapering: 1851 vs. 1996 vs. 2008 2008
- 39. “a startup is a temporary organization in search of a repeatable and scalable business model” —Steve Blank
- 40. every publisher is now a startup
- 41. every publisher is now a startup
- 42. every publisher is now a startup
- 43. every publisher is now a startup
- 44. news: 21st century church state data
- 45. news: 21st century church state data
- 46. learnings
- 47. learnings - predictive modeling - descriptive modeling - prescriptive modeling
- 48. (actually ML, shhhh…) - (supervised learning) - (unsupervised learning) - (reinforcement learning)
- 49. (actually ML, shhhh…) h/t michael littman
- 50. (actually ML, shhhh…) h/t michael littman Supervised Learning Reinforcement Learning Unsupervised Learning (reports)
- 51. learnings - predictive modeling - descriptive modeling - prescriptive modeling cf. modelingsocialdata.org
- 52. stats.stackexchange.com
- 53. from “are you a bayesian or a frequentist” —michael jordan L = NX i=1 ϕ (yif(xi; β)) + λ||β||
- 54. predictive modeling, e.g., cf. modelingsocialdata.org
- 55. predictive modeling, e.g., “the funnel” cf. modelingsocialdata.org
- 56. interpretable predictive modeling supercoolstuff cf. modelingsocialdata.org
- 57. interpretable predictive modeling supercoolstuff cf. modelingsocialdata.org arxiv.org/abs/q-bio/0701021
- 58. optimization & learning, e.g., “How The New York Times Works “popular mechanics, 2015
- 59. optimization & prediction, e.g., (some models) (somemoneys) “newsvendor problem,” literally (+prediction+experiment)
- 60. recommendation as inference
- 61. recommendation as inference bit.ly/AlexCTM
- 62. descriptive modeling, e.g, “segments” cf. modelingsocialdata.org
- 63. descriptive modeling, e.g, “segments” cf. modelingsocialdata.org
- 64. descriptive modeling, e.g, “segments” argmax_z p(z|x)=14 cf. modelingsocialdata.org
- 65. descriptive modeling, e.g, “segments” “baby boomer” cf. modelingsocialdata.org
- 66. - descriptive data product cf. daeilkim.com
- 67. descriptive modeling, e.g, cf. daeilkim.com ; import bnpy
- 68. modeling your audience bit.ly/Hughes-Kim-Sudderth-AISTATS15
- 69. modeling your audience (optimization, ultimately)
- 70. also allows insight+targeting as inference modeling your audience
- 71. prescriptive modeling
- 72. prescriptive modeling
- 73. prescriptive modeling
- 74. “off policy value estimation” (cf. “causal effect estimation”) cf. Langford `08-`16; Horvitz & Thompson `52; Holland `86
- 75. “off policy value estimation” (cf. “causal effect estimation”) Vapnik’s razor “ When solving a (learning) problem of interest, do not solve a more complex problem as an intermediate step.”
- 76. prescriptive modeling cf. modelingsocialdata.org
- 77. prescriptive modeling aka “A/B testing”; RCT cf. modelingsocialdata.org
- 78. Reporting Learning Test aka “A/B testing”; business as usual (esp. predictive) Some of the most recognizable personalization in our service is the collection of “genre” rows. …Members connect with these rows so well that we measure an increase in member retention by placing the most tailored rows higher on the page instead of lower. cf. modelingsocialdata.org prescriptive modeling: from A/B to….
- 79. real-time A/B -> “bandits” GOOG blog: cf. modelingsocialdata.org
- 80. prescriptive modeling, e.g,
- 81. prescriptive modeling, e.g,
- 82. prescriptive modeling, e.g,
- 83. prescriptive modeling, e.g, leverage methods which are predictive yet performant
- 84. NB: data-informed, not data-driven
- 85. predicting views/cascades: doable? KDD 09: how many people are online?
- 86. predicting views/cascades: doable? WWW 14: FB shares
- 87. predicting views/cascades: features?
- 88. predicting views/cascades: features?
- 89. predicting views/cascades: doable? WWW 16: TWIT RT’s
- 90. predicting views/cascades: doable?
- 91. Reporting Learning Test Optimizing Exploredescriptive: predictive: prescriptive:
- 92. Reporting Learning Test Optimizing Exploredescriptive: predictive: prescriptive:
- 93. things: what does DS team deliver? - build data product - build APIs - impact roadmaps
- 94. data science @ The New York Times chris.wiggins@columbia.edu chris.wiggins@nytimes.com @chrishwiggins references: http://bit.ly/stanf16
- 95. Lecture 2: predictive modeling @ NYT
- 96. desc/pred/pres Figure 2: desc/pred/pres ◮ caveat: diﬀerence between observation and experiment. why?
- 97. blossom example Figure 3: Reminder: Blossom
- 98. blossom + boosting (‘exponential’) Figure 4: Reminder: Surrogate Loss Functions
- 99. tangent: logistic function as surrogate loss function ◮ deﬁne f (x) ≡ log p(y = 1|x)/p(y = −1|x) ∈ R
- 100. tangent: logistic function as surrogate loss function ◮ deﬁne f (x) ≡ log p(y = 1|x)/p(y = −1|x) ∈ R ◮ p(y = 1|x) + p(y = −1|x) = 1 → p(y|x) = 1/(1 + exp(−yf ))
- 101. tangent: logistic function as surrogate loss function ◮ deﬁne f (x) ≡ log p(y = 1|x)/p(y = −1|x) ∈ R ◮ p(y = 1|x) + p(y = −1|x) = 1 → p(y|x) = 1/(1 + exp(−yf )) ◮ − log2 p({y}N 1 ) = i log2 1 + e−yi f (xi ) ≡ i ℓ(yi f (xi ))
- 102. tangent: logistic function as surrogate loss function ◮ deﬁne f (x) ≡ log p(y = 1|x)/p(y = −1|x) ∈ R ◮ p(y = 1|x) + p(y = −1|x) = 1 → p(y|x) = 1/(1 + exp(−yf )) ◮ − log2 p({y}N 1 ) = i log2 1 + e−yi f (xi ) ≡ i ℓ(yi f (xi )) ◮ ℓ′′ > 0, ℓ(µ) > 1[µ < 0] ∀µ ∈ R.
- 103. tangent: logistic function as surrogate loss function ◮ deﬁne f (x) ≡ log p(y = 1|x)/p(y = −1|x) ∈ R ◮ p(y = 1|x) + p(y = −1|x) = 1 → p(y|x) = 1/(1 + exp(−yf )) ◮ − log2 p({y}N 1 ) = i log2 1 + e−yi f (xi ) ≡ i ℓ(yi f (xi )) ◮ ℓ′′ > 0, ℓ(µ) > 1[µ < 0] ∀µ ∈ R. ◮ ∴ maximizing log-likelihood is minimizing a surrogate convex loss function for classiﬁcation (though not strongly convex, cf. Yoram’s talk)
- 104. tangent: logistic function as surrogate loss function ◮ deﬁne f (x) ≡ log p(y = 1|x)/p(y = −1|x) ∈ R ◮ p(y = 1|x) + p(y = −1|x) = 1 → p(y|x) = 1/(1 + exp(−yf )) ◮ − log2 p({y}N 1 ) = i log2 1 + e−yi f (xi ) ≡ i ℓ(yi f (xi )) ◮ ℓ′′ > 0, ℓ(µ) > 1[µ < 0] ∀µ ∈ R. ◮ ∴ maximizing log-likelihood is minimizing a surrogate convex loss function for classiﬁcation (though not strongly convex, cf. Yoram’s talk) ◮ but i log2 1 + e−yi wT h(xi ) not as easy as i e−yi wT h(xi )
- 105. boosting 1 L exponential surrogate loss function, summed over examples: ◮ L[F] = i exp (−yi F(xi ))
- 106. boosting 1 L exponential surrogate loss function, summed over examples: ◮ L[F] = i exp (−yi F(xi )) ◮ = i exp −yi t t′ wt′ ht′ (xi ) ≡ Lt(wt)
- 107. boosting 1 L exponential surrogate loss function, summed over examples: ◮ L[F] = i exp (−yi F(xi )) ◮ = i exp −yi t t′ wt′ ht′ (xi ) ≡ Lt(wt) ◮ Draw ht ∈ H large space of rules s.t. h(x) ∈ {−1, +1}
- 108. boosting 1 L exponential surrogate loss function, summed over examples: ◮ L[F] = i exp (−yi F(xi )) ◮ = i exp −yi t t′ wt′ ht′ (xi ) ≡ Lt(wt) ◮ Draw ht ∈ H large space of rules s.t. h(x) ∈ {−1, +1} ◮ label y ∈ {−1, +1}
- 109. boosting 1 L exponential surrogate loss function, summed over examples: ◮ Lt+1(wt; w) ≡ i dt i exp (−yi wht+1(xi )) Punchlines: sparse, predictive, interpretable, fast (to execute), and easy to extend, e.g., trees, ﬂexible hypotheses spaces, L1, L∞ 1, . . .
- 110. boosting 1 L exponential surrogate loss function, summed over examples: ◮ Lt+1(wt; w) ≡ i dt i exp (−yi wht+1(xi )) ◮ = y=h′ dt i e−w + y=h′ dt i e+w ≡ e−w D+ + e+w D− Punchlines: sparse, predictive, interpretable, fast (to execute), and easy to extend, e.g., trees, ﬂexible hypotheses spaces, L1, L∞ 1, . . .
- 111. boosting 1 L exponential surrogate loss function, summed over examples: ◮ Lt+1(wt; w) ≡ i dt i exp (−yi wht+1(xi )) ◮ = y=h′ dt i e−w + y=h′ dt i e+w ≡ e−w D+ + e+w D− ◮ ∴ wt+1 = argminw Lt+1(w) = (1/2) log D+/D− Punchlines: sparse, predictive, interpretable, fast (to execute), and easy to extend, e.g., trees, ﬂexible hypotheses spaces, L1, L∞ 1, . . .
- 112. boosting 1 L exponential surrogate loss function, summed over examples: ◮ Lt+1(wt; w) ≡ i dt i exp (−yi wht+1(xi )) ◮ = y=h′ dt i e−w + y=h′ dt i e+w ≡ e−w D+ + e+w D− ◮ ∴ wt+1 = argminw Lt+1(w) = (1/2) log D+/D− ◮ Lt+1(wt+1) = 2 D+D− = 2 ν+(1 − ν+)/D, where 0 ≤ ν+ ≡ D+/D = D+/Lt ≤ 1 Punchlines: sparse, predictive, interpretable, fast (to execute), and easy to extend, e.g., trees, ﬂexible hypotheses spaces, L1, L∞ 1, . . .
- 113. boosting 1 L exponential surrogate loss function, summed over examples: ◮ Lt+1(wt; w) ≡ i dt i exp (−yi wht+1(xi )) ◮ = y=h′ dt i e−w + y=h′ dt i e+w ≡ e−w D+ + e+w D− ◮ ∴ wt+1 = argminw Lt+1(w) = (1/2) log D+/D− ◮ Lt+1(wt+1) = 2 D+D− = 2 ν+(1 − ν+)/D, where 0 ≤ ν+ ≡ D+/D = D+/Lt ≤ 1 ◮ update example weights dt+1 i = dt i e∓w Punchlines: sparse, predictive, interpretable, fast (to execute), and easy to extend, e.g., trees, ﬂexible hypotheses spaces, L1, L∞ 1, . . . 1 Duchi + Singer “Boosting with structural sparsity” ICML ’09
- 114. predicting people ◮ “customer journey” prediction
- 115. predicting people ◮ “customer journey” prediction ◮ fun covariates
- 116. predicting people ◮ “customer journey” prediction ◮ fun covariates ◮ observational complication v structural models
- 117. predicting people (reminder) Figure 5: both in science and in real world, feature analysis guides future experiments
- 118. single copy (reminder) Figure 6: from Lecture 1
- 119. example in CAR (computer assisted reporting) Figure 7: Tabuchi article
- 120. example in CAR (computer assisted reporting) ◮ cf. Friedman’s “Statistical models and Shoe Leather”2 2 Freedman, David A. “Statistical models and shoe leather.” Sociological methodology 21.2 (1991): 291-313.
- 121. example in CAR (computer assisted reporting) ◮ cf. Friedman’s “Statistical models and Shoe Leather”2 ◮ Takata airbag fatalities 2 Freedman, David A. “Statistical models and shoe leather.” Sociological methodology 21.2 (1991): 291-313.
- 122. example in CAR (computer assisted reporting) ◮ cf. Friedman’s “Statistical models and Shoe Leather”2 ◮ Takata airbag fatalities ◮ 2219 labeled3 examples from 33,204 comments 2 Freedman, David A. “Statistical models and shoe leather.” Sociological methodology 21.2 (1991): 291-313. 3 By Hiroko Tabuchi, a Pulitzer winner
- 123. example in CAR (computer assisted reporting) ◮ cf. Friedman’s “Statistical models and Shoe Leather”2 ◮ Takata airbag fatalities ◮ 2219 labeled3 examples from 33,204 comments ◮ cf. Box’s “Science and Statistics”4 2 Freedman, David A. “Statistical models and shoe leather.” Sociological methodology 21.2 (1991): 291-313. 3 By Hiroko Tabuchi, a Pulitzer winner 4 Science and Statistics, George E. P. Box Journal of the American Statistical Association, Vol. 71, No. 356. (Dec., 1976), pp. 791-799.
- 124. computer assisted reporting ◮ Impact Figure 8: impact
- 125. Lecture 3: prescriptive modeling @ NYT
- 126. the natural abstraction ◮ operators5 make decisions 5 In the sense of business deciders; that said, doctors, including those who operate, also have to make decisions, cf., personalized medicines
- 127. the natural abstraction ◮ operators5 make decisions ◮ faster horses v. cars 5 In the sense of business deciders; that said, doctors, including those who operate, also have to make decisions, cf., personalized medicines
- 128. the natural abstraction ◮ operators5 make decisions ◮ faster horses v. cars ◮ general insights v. optimal policies 5 In the sense of business deciders; that said, doctors, including those who operate, also have to make decisions, cf., personalized medicines
- 129. maximizing outcome ◮ the problem: maximizing an outcome over policies. . .
- 130. maximizing outcome ◮ the problem: maximizing an outcome over policies. . . ◮ . . . while inferring causality from observation
- 131. maximizing outcome ◮ the problem: maximizing an outcome over policies. . . ◮ . . . while inferring causality from observation ◮ diﬀerent from predicting outcome in absence of action/policy
- 132. examples ◮ observation is not experiment
- 133. examples ◮ observation is not experiment ◮ e.g., (Med.) smoking hurts vs unhealthy people smoke
- 134. examples ◮ observation is not experiment ◮ e.g., (Med.) smoking hurts vs unhealthy people smoke ◮ e.g., (Med.) aﬄuent get prescribed diﬀerent meds/treatment
- 135. examples ◮ observation is not experiment ◮ e.g., (Med.) smoking hurts vs unhealthy people smoke ◮ e.g., (Med.) aﬄuent get prescribed diﬀerent meds/treatment ◮ e.g., (life) veterans earn less vs the rich serve less6 6 Angrist, Joshua D. (1990). “Lifetime Earnings and the Vietnam Draft Lottery: Evidence from Social Security Administrative Records”. American Economic Review 80 (3): 313–336.
- 136. examples ◮ observation is not experiment ◮ e.g., (Med.) smoking hurts vs unhealthy people smoke ◮ e.g., (Med.) aﬄuent get prescribed diﬀerent meds/treatment ◮ e.g., (life) veterans earn less vs the rich serve less6 ◮ e.g., (life) admitted to school vs learn at school? 6 Angrist, Joshua D. (1990). “Lifetime Earnings and the Vietnam Draft Lottery: Evidence from Social Security Administrative Records”. American Economic Review 80 (3): 313–336.
- 137. reinforcement/machine learning/graphical models ◮ key idea: model joint p(y, a, x)
- 138. reinforcement/machine learning/graphical models ◮ key idea: model joint p(y, a, x) ◮ explore/exploit: family of joints pα(y, a, x)
- 139. reinforcement/machine learning/graphical models ◮ key idea: model joint p(y, a, x) ◮ explore/exploit: family of joints pα(y, a, x) ◮ “causality”: pα(y, a, x) = p(y|a, x)pα(a|x)p(x) “a causes y”
- 140. reinforcement/machine learning/graphical models ◮ key idea: model joint p(y, a, x) ◮ explore/exploit: family of joints pα(y, a, x) ◮ “causality”: pα(y, a, x) = p(y|a, x)pα(a|x)p(x) “a causes y” ◮ nomenclature: ‘response’, ‘policy’/‘bias’, ‘prior’ above
- 141. in general Figure 9: policy/bias, response, and prior deﬁne the distribution also describes both the ‘exploration’ and ‘exploitation’ distributions
- 142. randomized controlled trial Figure 10: RCT: ‘bias’ removed, random ‘policy’ (response and prior unaﬀected) also Pearl’s ‘do’ distribution: a distribution with “no arrows” pointing to the action variable.
- 143. POISE: calculation, estimation, optimization ◮ POISE: “policy optimization via importance sample estimation”
- 144. POISE: calculation, estimation, optimization ◮ POISE: “policy optimization via importance sample estimation” ◮ Monte Carlo importance sampling estimation
- 145. POISE: calculation, estimation, optimization ◮ POISE: “policy optimization via importance sample estimation” ◮ Monte Carlo importance sampling estimation ◮ aka “oﬀ policy estimation”
- 146. POISE: calculation, estimation, optimization ◮ POISE: “policy optimization via importance sample estimation” ◮ Monte Carlo importance sampling estimation ◮ aka “oﬀ policy estimation” ◮ role of “IPW”
- 147. POISE: calculation, estimation, optimization ◮ POISE: “policy optimization via importance sample estimation” ◮ Monte Carlo importance sampling estimation ◮ aka “oﬀ policy estimation” ◮ role of “IPW” ◮ reduction
- 148. POISE: calculation, estimation, optimization ◮ POISE: “policy optimization via importance sample estimation” ◮ Monte Carlo importance sampling estimation ◮ aka “oﬀ policy estimation” ◮ role of “IPW” ◮ reduction ◮ normalization
- 149. POISE: calculation, estimation, optimization ◮ POISE: “policy optimization via importance sample estimation” ◮ Monte Carlo importance sampling estimation ◮ aka “oﬀ policy estimation” ◮ role of “IPW” ◮ reduction ◮ normalization ◮ hyper-parameter searching
- 150. POISE: calculation, estimation, optimization ◮ POISE: “policy optimization via importance sample estimation” ◮ Monte Carlo importance sampling estimation ◮ aka “oﬀ policy estimation” ◮ role of “IPW” ◮ reduction ◮ normalization ◮ hyper-parameter searching ◮ unexpected connection: personalized medicine
- 151. POISE setup and Goal ◮ “a causes y” ⇐⇒ ∃ family pα(y, a, x) = p(y|a, x)pα(a|x)p(x)
- 152. POISE setup and Goal ◮ “a causes y” ⇐⇒ ∃ family pα(y, a, x) = p(y|a, x)pα(a|x)p(x) ◮ deﬁne oﬀ-policy/exploration distribution p−(y, a, x) = p(y|a, x)p−(a|x)p(x)
- 153. POISE setup and Goal ◮ “a causes y” ⇐⇒ ∃ family pα(y, a, x) = p(y|a, x)pα(a|x)p(x) ◮ deﬁne oﬀ-policy/exploration distribution p−(y, a, x) = p(y|a, x)p−(a|x)p(x) ◮ deﬁne exploitation distribution p+(y, a, x) = p(y|a, x)p+(a|x)p(x)
- 154. POISE setup and Goal ◮ “a causes y” ⇐⇒ ∃ family pα(y, a, x) = p(y|a, x)pα(a|x)p(x) ◮ deﬁne oﬀ-policy/exploration distribution p−(y, a, x) = p(y|a, x)p−(a|x)p(x) ◮ deﬁne exploitation distribution p+(y, a, x) = p(y|a, x)p+(a|x)p(x) ◮ Goal: Maximize E+(Y ) over p+(a|x) using data drawn from p−(y, a, x).
- 155. POISE setup and Goal ◮ “a causes y” ⇐⇒ ∃ family pα(y, a, x) = p(y|a, x)pα(a|x)p(x) ◮ deﬁne oﬀ-policy/exploration distribution p−(y, a, x) = p(y|a, x)p−(a|x)p(x) ◮ deﬁne exploitation distribution p+(y, a, x) = p(y|a, x)p+(a|x)p(x) ◮ Goal: Maximize E+(Y ) over p+(a|x) using data drawn from p−(y, a, x).
- 156. POISE setup and Goal ◮ “a causes y” ⇐⇒ ∃ family pα(y, a, x) = p(y|a, x)pα(a|x)p(x) ◮ deﬁne oﬀ-policy/exploration distribution p−(y, a, x) = p(y|a, x)p−(a|x)p(x) ◮ deﬁne exploitation distribution p+(y, a, x) = p(y|a, x)p+(a|x)p(x) ◮ Goal: Maximize E+(Y ) over p+(a|x) using data drawn from p−(y, a, x). notation: {x, a, y} ∈ {X, A, Y } i.e., Eα(Y ) is not a function of y
- 157. POISE math: IS+Monte Carlo estimation=ISE i.e, “importance sampling estimation” ◮ E+(Y ) ≡ yax yp+(y, a, x)
- 158. POISE math: IS+Monte Carlo estimation=ISE i.e, “importance sampling estimation” ◮ E+(Y ) ≡ yax yp+(y, a, x) ◮ E+(Y ) = yax yp−(y, a, x)(p+(y, a, x)/p−(y, a, x))
- 159. POISE math: IS+Monte Carlo estimation=ISE i.e, “importance sampling estimation” ◮ E+(Y ) ≡ yax yp+(y, a, x) ◮ E+(Y ) = yax yp−(y, a, x)(p+(y, a, x)/p−(y, a, x)) ◮ E+(Y ) = yax yp−(y, a, x)(p+(a|x)/p−(a|x))
- 160. POISE math: IS+Monte Carlo estimation=ISE i.e, “importance sampling estimation” ◮ E+(Y ) ≡ yax yp+(y, a, x) ◮ E+(Y ) = yax yp−(y, a, x)(p+(y, a, x)/p−(y, a, x)) ◮ E+(Y ) = yax yp−(y, a, x)(p+(a|x)/p−(a|x)) ◮ E+(Y ) ≈ N−1 i yi (p+(ai |xi )/p−(ai |xi ))
- 161. POISE math: IS+Monte Carlo estimation=ISE i.e, “importance sampling estimation” ◮ E+(Y ) ≡ yax yp+(y, a, x) ◮ E+(Y ) = yax yp−(y, a, x)(p+(y, a, x)/p−(y, a, x)) ◮ E+(Y ) = yax yp−(y, a, x)(p+(a|x)/p−(a|x)) ◮ E+(Y ) ≈ N−1 i yi (p+(ai |xi )/p−(ai |xi ))
- 162. POISE math: IS+Monte Carlo estimation=ISE i.e, “importance sampling estimation” ◮ E+(Y ) ≡ yax yp+(y, a, x) ◮ E+(Y ) = yax yp−(y, a, x)(p+(y, a, x)/p−(y, a, x)) ◮ E+(Y ) = yax yp−(y, a, x)(p+(a|x)/p−(a|x)) ◮ E+(Y ) ≈ N−1 i yi (p+(ai |xi )/p−(ai |xi )) let’s spend some time getting to know this last equation, the importance sampling estimate of outcome in a “causal model” (“a causes y”) among {y, a, x}
- 163. Observation (cf. Bottou7 ) ◮ factorizing P±(x): P+(x) P−(x) = Πfactors P+but not−(x) P−but not+(x)
- 164. Observation (cf. Bottou7 ) ◮ factorizing P±(x): P+(x) P−(x) = Πfactors P+but not−(x) P−but not+(x) ◮ origin: importance sampling Eq(f ) = Ep(fq/p) (as in variational methods)
- 165. Observation (cf. Bottou7 ) ◮ factorizing P±(x): P+(x) P−(x) = Πfactors P+but not−(x) P−but not+(x) ◮ origin: importance sampling Eq(f ) = Ep(fq/p) (as in variational methods) ◮ the “causal” model pα(y, a, x) = p(y|a, x)pα(a|x)p(x) helps here
- 166. Observation (cf. Bottou7 ) ◮ factorizing P±(x): P+(x) P−(x) = Πfactors P+but not−(x) P−but not+(x) ◮ origin: importance sampling Eq(f ) = Ep(fq/p) (as in variational methods) ◮ the “causal” model pα(y, a, x) = p(y|a, x)pα(a|x)p(x) helps here ◮ factors left over are numerator (p+(a|x), to optimize) and denominator (p−(a|x), to infer if not a RCT)
- 167. Observation (cf. Bottou7 ) ◮ factorizing P±(x): P+(x) P−(x) = Πfactors P+but not−(x) P−but not+(x) ◮ origin: importance sampling Eq(f ) = Ep(fq/p) (as in variational methods) ◮ the “causal” model pα(y, a, x) = p(y|a, x)pα(a|x)p(x) helps here ◮ factors left over are numerator (p+(a|x), to optimize) and denominator (p−(a|x), to infer if not a RCT) ◮ unobserved confounders will confound us (later) 7 Counterfactual Reasoning and Learning Systems, arXiv:1209.2355
- 168. Reduction (cf. Langford8 ,9 ,10 (’05, ’08, ’09 )) ◮ consider numerator for deterministic policy: p+(a|x) = 1[a = h(x)]
- 169. Reduction (cf. Langford8 ,9 ,10 (’05, ’08, ’09 )) ◮ consider numerator for deterministic policy: p+(a|x) = 1[a = h(x)] ◮ E+(Y ) ∝ i (yi /p−(a|x))1[a = h(x)] ≡ i wi 1[a = h(x)]
- 170. Reduction (cf. Langford8 ,9 ,10 (’05, ’08, ’09 )) ◮ consider numerator for deterministic policy: p+(a|x) = 1[a = h(x)] ◮ E+(Y ) ∝ i (yi /p−(a|x))1[a = h(x)] ≡ i wi 1[a = h(x)] ◮ Note: 1[c = d] = 1 − 1[c = d]
- 171. Reduction (cf. Langford8 ,9 ,10 (’05, ’08, ’09 )) ◮ consider numerator for deterministic policy: p+(a|x) = 1[a = h(x)] ◮ E+(Y ) ∝ i (yi /p−(a|x))1[a = h(x)] ≡ i wi 1[a = h(x)] ◮ Note: 1[c = d] = 1 − 1[c = d] ◮ ∴ E+(Y ) ∝ constant − i wi 1[a = h(x)]
- 172. Reduction (cf. Langford8 ,9 ,10 (’05, ’08, ’09 )) ◮ consider numerator for deterministic policy: p+(a|x) = 1[a = h(x)] ◮ E+(Y ) ∝ i (yi /p−(a|x))1[a = h(x)] ≡ i wi 1[a = h(x)] ◮ Note: 1[c = d] = 1 − 1[c = d] ◮ ∴ E+(Y ) ∝ constant − i wi 1[a = h(x)] ◮ ∴ reduces policy optimization to (weighted) classiﬁcation 8 Langford & Zadrozny “Relating Reinforcement Learning Performance to Classiﬁcation Performance” ICML 2005 9 Beygelzimer & Langford “The oﬀset tree for learning with partial labels” (KDD 2009) 10 Tutorial on “Reductions” (including at ICML 2009)
- 173. Reduction w/optimistic complication ◮ Prescription ⇐⇒ classiﬁcation L = i wi 1[ai = h(xi )]
- 174. Reduction w/optimistic complication ◮ Prescription ⇐⇒ classiﬁcation L = i wi 1[ai = h(xi )] ◮ weight wi = yi /p−(ai |xi ), inferred or RCT
- 175. Reduction w/optimistic complication ◮ Prescription ⇐⇒ classiﬁcation L = i wi 1[ai = h(xi )] ◮ weight wi = yi /p−(ai |xi ), inferred or RCT ◮ destroys measure by treating p−(a|x) diﬀerently than 1/p−(a|x)
- 176. Reduction w/optimistic complication ◮ Prescription ⇐⇒ classiﬁcation L = i wi 1[ai = h(xi )] ◮ weight wi = yi /p−(ai |xi ), inferred or RCT ◮ destroys measure by treating p−(a|x) diﬀerently than 1/p−(a|x) ◮ normalize as ˜L ≡ i y1[ai =h(xi )]/p−(ai |xi ) i 1[ai =h(xi )]/p−(ai |xi )
- 177. Reduction w/optimistic complication ◮ Prescription ⇐⇒ classiﬁcation L = i wi 1[ai = h(xi )] ◮ weight wi = yi /p−(ai |xi ), inferred or RCT ◮ destroys measure by treating p−(a|x) diﬀerently than 1/p−(a|x) ◮ normalize as ˜L ≡ i y1[ai =h(xi )]/p−(ai |xi ) i 1[ai =h(xi )]/p−(ai |xi ) ◮ destroys lovely reduction
- 178. Reduction w/optimistic complication ◮ Prescription ⇐⇒ classiﬁcation L = i wi 1[ai = h(xi )] ◮ weight wi = yi /p−(ai |xi ), inferred or RCT ◮ destroys measure by treating p−(a|x) diﬀerently than 1/p−(a|x) ◮ normalize as ˜L ≡ i y1[ai =h(xi )]/p−(ai |xi ) i 1[ai =h(xi )]/p−(ai |xi ) ◮ destroys lovely reduction ◮ simply11 L(λ) = i (yi − λ)1[ai = h(xi )]/p−(ai |xi ) 11 Suggestion by Dan Hsu
- 179. Reduction w/optimistic complication ◮ Prescription ⇐⇒ classiﬁcation L = i wi 1[ai = h(xi )] ◮ weight wi = yi /p−(ai |xi ), inferred or RCT ◮ destroys measure by treating p−(a|x) diﬀerently than 1/p−(a|x) ◮ normalize as ˜L ≡ i y1[ai =h(xi )]/p−(ai |xi ) i 1[ai =h(xi )]/p−(ai |xi ) ◮ destroys lovely reduction ◮ simply11 L(λ) = i (yi − λ)1[ai = h(xi )]/p−(ai |xi ) ◮ hidden here is a 2nd parameter, in classiﬁcation, ∴ harder search 11 Suggestion by Dan Hsu
- 180. POISE punchlines ◮ allows policy planning even with implicit logged exploration data12 12 Strehl, Alex, et al. “Learning from logged implicit exploration data.” Advances in Neural Information Processing Systems. 2010.
- 181. POISE punchlines ◮ allows policy planning even with implicit logged exploration data12 ◮ e.g., two hospital story 12 Strehl, Alex, et al. “Learning from logged implicit exploration data.” Advances in Neural Information Processing Systems. 2010.
- 182. POISE punchlines ◮ allows policy planning even with implicit logged exploration data12 ◮ e.g., two hospital story ◮ “personalized medicine” is also a policy 12 Strehl, Alex, et al. “Learning from logged implicit exploration data.” Advances in Neural Information Processing Systems. 2010.
- 183. POISE punchlines ◮ allows policy planning even with implicit logged exploration data12 ◮ e.g., two hospital story ◮ “personalized medicine” is also a policy ◮ abundant data available, under-explored IMHO 12 Strehl, Alex, et al. “Learning from logged implicit exploration data.” Advances in Neural Information Processing Systems. 2010.
- 184. tangent: causality as told by an economist diﬀerent, related goal ◮ they think in terms of ATE/ITE instead of policy
- 185. tangent: causality as told by an economist diﬀerent, related goal ◮ they think in terms of ATE/ITE instead of policy ◮ ATE
- 186. tangent: causality as told by an economist diﬀerent, related goal ◮ they think in terms of ATE/ITE instead of policy ◮ ATE ◮ τ ≡ E0(Y |a = 1) − E0(Y |a = 0) ≡ Q(a = 1) − Q(a = 0)
- 187. tangent: causality as told by an economist diﬀerent, related goal ◮ they think in terms of ATE/ITE instead of policy ◮ ATE ◮ τ ≡ E0(Y |a = 1) − E0(Y |a = 0) ≡ Q(a = 1) − Q(a = 0) ◮ CATE aka Individualized Treatment Eﬀect (ITE)
- 188. tangent: causality as told by an economist diﬀerent, related goal ◮ they think in terms of ATE/ITE instead of policy ◮ ATE ◮ τ ≡ E0(Y |a = 1) − E0(Y |a = 0) ≡ Q(a = 1) − Q(a = 0) ◮ CATE aka Individualized Treatment Eﬀect (ITE) ◮ τ(x) ≡ E0(Y |a = 1, x) − E0(Y |a = 0, x)
- 189. tangent: causality as told by an economist diﬀerent, related goal ◮ they think in terms of ATE/ITE instead of policy ◮ ATE ◮ τ ≡ E0(Y |a = 1) − E0(Y |a = 0) ≡ Q(a = 1) − Q(a = 0) ◮ CATE aka Individualized Treatment Eﬀect (ITE) ◮ τ(x) ≡ E0(Y |a = 1, x) − E0(Y |a = 0, x) ◮ ≡ Q(a = 1, x) − Q(a = 0, x)
- 190. Q-note: “generalizing” Monte Carlo w/kernels ◮ MC: Ep(f ) = x p(x)f (x) ≈ N−1 i∼p f (xi )
- 191. Q-note: “generalizing” Monte Carlo w/kernels ◮ MC: Ep(f ) = x p(x)f (x) ≈ N−1 i∼p f (xi ) ◮ K: p ≈ N−1 i K(x|xi )
- 192. Q-note: “generalizing” Monte Carlo w/kernels ◮ MC: Ep(f ) = x p(x)f (x) ≈ N−1 i∼p f (xi ) ◮ K: p ≈ N−1 i K(x|xi ) ◮ ⇒ x p(x)f (x) ≈ N−1 i x f (x)K(x|xi )
- 193. Q-note: “generalizing” Monte Carlo w/kernels ◮ MC: Ep(f ) = x p(x)f (x) ≈ N−1 i∼p f (xi ) ◮ K: p ≈ N−1 i K(x|xi ) ◮ ⇒ x p(x)f (x) ≈ N−1 i x f (x)K(x|xi ) ◮ K can be any normalized function, e.g., K(x|xi ) = δx,xi , which yields MC.
- 194. Q-note: “generalizing” Monte Carlo w/kernels ◮ MC: Ep(f ) = x p(x)f (x) ≈ N−1 i∼p f (xi ) ◮ K: p ≈ N−1 i K(x|xi ) ◮ ⇒ x p(x)f (x) ≈ N−1 i x f (x)K(x|xi ) ◮ K can be any normalized function, e.g., K(x|xi ) = δx,xi , which yields MC. ◮ multivariate Ep(f ) ≈ N−1 i yax f (y, a, x)K1(y|yi )K2(a|ai )K3(x|xi )
- 195. Q-note: application w/strata+matching, setup Helps think about economists’ approach: ◮ Q(a, x) ≡ E(Y |a, x) = y yp(y|a, x) = y y p−(y,a,x) p−(a|x)p(x)
- 196. Q-note: application w/strata+matching, setup Helps think about economists’ approach: ◮ Q(a, x) ≡ E(Y |a, x) = y yp(y|a, x) = y y p−(y,a,x) p−(a|x)p(x) ◮ = 1 p−(a|x)p(x) y yp−(y, a, x)
- 197. Q-note: application w/strata+matching, setup Helps think about economists’ approach: ◮ Q(a, x) ≡ E(Y |a, x) = y yp(y|a, x) = y y p−(y,a,x) p−(a|x)p(x) ◮ = 1 p−(a|x)p(x) y yp−(y, a, x) ◮ stratify x using z(x) such that ∪z = X, and ∩z, z′ =
- 198. Q-note: application w/strata+matching, setup Helps think about economists’ approach: ◮ Q(a, x) ≡ E(Y |a, x) = y yp(y|a, x) = y y p−(y,a,x) p−(a|x)p(x) ◮ = 1 p−(a|x)p(x) y yp−(y, a, x) ◮ stratify x using z(x) such that ∪z = X, and ∩z, z′ = ◮ n(x) = i 1[z(xi ) = z(x)]=number of points in x’s stratum
- 199. Q-note: application w/strata+matching, setup Helps think about economists’ approach: ◮ Q(a, x) ≡ E(Y |a, x) = y yp(y|a, x) = y y p−(y,a,x) p−(a|x)p(x) ◮ = 1 p−(a|x)p(x) y yp−(y, a, x) ◮ stratify x using z(x) such that ∪z = X, and ∩z, z′ = ◮ n(x) = i 1[z(xi ) = z(x)]=number of points in x’s stratum ◮ Ω(x) = x′ 1[z(x′) = z(x)]=area of x’s stratum
- 200. Q-note: application w/strata+matching, setup Helps think about economists’ approach: ◮ Q(a, x) ≡ E(Y |a, x) = y yp(y|a, x) = y y p−(y,a,x) p−(a|x)p(x) ◮ = 1 p−(a|x)p(x) y yp−(y, a, x) ◮ stratify x using z(x) such that ∪z = X, and ∩z, z′ = ◮ n(x) = i 1[z(xi ) = z(x)]=number of points in x’s stratum ◮ Ω(x) = x′ 1[z(x′) = z(x)]=area of x’s stratum ◮ ∴ K3(x|xi ) = 1[z(x) = z(xi )]/Ω(x)
- 201. Q-note: application w/strata+matching, setup Helps think about economists’ approach: ◮ Q(a, x) ≡ E(Y |a, x) = y yp(y|a, x) = y y p−(y,a,x) p−(a|x)p(x) ◮ = 1 p−(a|x)p(x) y yp−(y, a, x) ◮ stratify x using z(x) such that ∪z = X, and ∩z, z′ = ◮ n(x) = i 1[z(xi ) = z(x)]=number of points in x’s stratum ◮ Ω(x) = x′ 1[z(x′) = z(x)]=area of x’s stratum ◮ ∴ K3(x|xi ) = 1[z(x) = z(xi )]/Ω(x) ◮ as in MC, K1(y|yi ) = δy,yi , K2(a|ai ) = δa,ai
- 202. Q-note: application w/strata+matching, payoﬀ ◮ y yp−(y, a, x) ≈ N−1Ω(x)−1 ai =a,z(xi )=z(x) yi
- 203. Q-note: application w/strata+matching, payoﬀ ◮ y yp−(y, a, x) ≈ N−1Ω(x)−1 ai =a,z(xi )=z(x) yi ◮ p(x) ≈ (n(x)/N)Ω(x)−1
- 204. Q-note: application w/strata+matching, payoﬀ ◮ y yp−(y, a, x) ≈ N−1Ω(x)−1 ai =a,z(xi )=z(x) yi ◮ p(x) ≈ (n(x)/N)Ω(x)−1 ◮ ∴ Q(a, x) ≈ p−(a|x)−1n(x)−1 ai =a,z(xi )=z(x) yi
- 205. Q-note: application w/strata+matching, payoﬀ ◮ y yp−(y, a, x) ≈ N−1Ω(x)−1 ai =a,z(xi )=z(x) yi ◮ p(x) ≈ (n(x)/N)Ω(x)−1 ◮ ∴ Q(a, x) ≈ p−(a|x)−1n(x)−1 ai =a,z(xi )=z(x) yi
- 206. Q-note: application w/strata+matching, payoﬀ ◮ y yp−(y, a, x) ≈ N−1Ω(x)−1 ai =a,z(xi )=z(x) yi ◮ p(x) ≈ (n(x)/N)Ω(x)−1 ◮ ∴ Q(a, x) ≈ p−(a|x)−1n(x)−1 ai =a,z(xi )=z(x) yi “matching” means: choose each z to contain 1 positive example & 1 negative example, ◮ p−(a|x) ≈ 1/2, n(x) = 2
- 207. Q-note: application w/strata+matching, payoﬀ ◮ y yp−(y, a, x) ≈ N−1Ω(x)−1 ai =a,z(xi )=z(x) yi ◮ p(x) ≈ (n(x)/N)Ω(x)−1 ◮ ∴ Q(a, x) ≈ p−(a|x)−1n(x)−1 ai =a,z(xi )=z(x) yi “matching” means: choose each z to contain 1 positive example & 1 negative example, ◮ p−(a|x) ≈ 1/2, n(x) = 2 ◮ ∴ τ(a, x) = Q(a = 1, x) − Q(a = 0, x) = y1(x) − y0(x)
- 208. Q-note: application w/strata+matching, payoﬀ ◮ y yp−(y, a, x) ≈ N−1Ω(x)−1 ai =a,z(xi )=z(x) yi ◮ p(x) ≈ (n(x)/N)Ω(x)−1 ◮ ∴ Q(a, x) ≈ p−(a|x)−1n(x)−1 ai =a,z(xi )=z(x) yi “matching” means: choose each z to contain 1 positive example & 1 negative example, ◮ p−(a|x) ≈ 1/2, n(x) = 2 ◮ ∴ τ(a, x) = Q(a = 1, x) − Q(a = 0, x) = y1(x) − y0(x) ◮ z-generalizations: graphs, digraphs, k-NN, “matching”
- 209. Q-note: application w/strata+matching, payoﬀ ◮ y yp−(y, a, x) ≈ N−1Ω(x)−1 ai =a,z(xi )=z(x) yi ◮ p(x) ≈ (n(x)/N)Ω(x)−1 ◮ ∴ Q(a, x) ≈ p−(a|x)−1n(x)−1 ai =a,z(xi )=z(x) yi “matching” means: choose each z to contain 1 positive example & 1 negative example, ◮ p−(a|x) ≈ 1/2, n(x) = 2 ◮ ∴ τ(a, x) = Q(a = 1, x) − Q(a = 0, x) = y1(x) − y0(x) ◮ z-generalizations: graphs, digraphs, k-NN, “matching” ◮ K-generalizations: continuous a, any metric or similarity you like,. . .
- 210. Q-note: application w/strata+matching, payoﬀ ◮ y yp−(y, a, x) ≈ N−1Ω(x)−1 ai =a,z(xi )=z(x) yi ◮ p(x) ≈ (n(x)/N)Ω(x)−1 ◮ ∴ Q(a, x) ≈ p−(a|x)−1n(x)−1 ai =a,z(xi )=z(x) yi “matching” means: choose each z to contain 1 positive example & 1 negative example, ◮ p−(a|x) ≈ 1/2, n(x) = 2 ◮ ∴ τ(a, x) = Q(a = 1, x) − Q(a = 0, x) = y1(x) − y0(x) ◮ z-generalizations: graphs, digraphs, k-NN, “matching” ◮ K-generalizations: continuous a, any metric or similarity you like,. . .
- 211. Q-note: application w/strata+matching, payoﬀ ◮ y yp−(y, a, x) ≈ N−1Ω(x)−1 ai =a,z(xi )=z(x) yi ◮ p(x) ≈ (n(x)/N)Ω(x)−1 ◮ ∴ Q(a, x) ≈ p−(a|x)−1n(x)−1 ai =a,z(xi )=z(x) yi “matching” means: choose each z to contain 1 positive example & 1 negative example, ◮ p−(a|x) ≈ 1/2, n(x) = 2 ◮ ∴ τ(a, x) = Q(a = 1, x) − Q(a = 0, x) = y1(x) − y0(x) ◮ z-generalizations: graphs, digraphs, k-NN, “matching” ◮ K-generalizations: continuous a, any metric or similarity you like,. . . IMHO underexplored
- 212. causality, as understood in marketing ◮ a/b testing and RCT Figure 11: Blattberg, Robert C., Byung-Do Kim, and Scott A. Neslin. Database Marketing, Springer New York, 2008
- 213. causality, as understood in marketing ◮ a/b testing and RCT ◮ yield optimization Figure 11: Blattberg, Robert C., Byung-Do Kim, and Scott A. Neslin. Database Marketing, Springer New York, 2008
- 214. causality, as understood in marketing ◮ a/b testing and RCT ◮ yield optimization ◮ Lorenz curve (vs ROC plots) Figure 11: Blattberg, Robert C., Byung-Do Kim, and Scott A. Neslin. Database Marketing, Springer New York, 2008
- 215. unobserved confounders vs. “causality” modeling ◮ truth: pα(y, a, x, u) = p(y|a, x, u)pα(a|x, u)p(x, u)
- 216. unobserved confounders vs. “causality” modeling ◮ truth: pα(y, a, x, u) = p(y|a, x, u)pα(a|x, u)p(x, u) ◮ but: p+(y, a, x, u) = p(y|a, x, u)p−(a|x)p(x, u)
- 217. unobserved confounders vs. “causality” modeling ◮ truth: pα(y, a, x, u) = p(y|a, x, u)pα(a|x, u)p(x, u) ◮ but: p+(y, a, x, u) = p(y|a, x, u)p−(a|x)p(x, u) ◮ E+(Y ) ≡ yaxu yp+(yaxu) ≈ N−1 i∼p− yi p+(a|x)/p−(a|x, u)
- 218. unobserved confounders vs. “causality” modeling ◮ truth: pα(y, a, x, u) = p(y|a, x, u)pα(a|x, u)p(x, u) ◮ but: p+(y, a, x, u) = p(y|a, x, u)p−(a|x)p(x, u) ◮ E+(Y ) ≡ yaxu yp+(yaxu) ≈ N−1 i∼p− yi p+(a|x)/p−(a|x, u) ◮ denominator can not be inferred, ignore at your peril
- 219. cautionary tale problem: Simpson’s paradox ◮ a: admissions (a=1: admitted, a=0: declined)
- 220. cautionary tale problem: Simpson’s paradox ◮ a: admissions (a=1: admitted, a=0: declined) ◮ x: gender (x=1: female, x=0: male)
- 221. cautionary tale problem: Simpson’s paradox ◮ a: admissions (a=1: admitted, a=0: declined) ◮ x: gender (x=1: female, x=0: male) ◮ lawsuit (1973): .44 = p(a = 1|x = 0) > p(a = 1|x = 1) = .35
- 222. cautionary tale problem: Simpson’s paradox ◮ a: admissions (a=1: admitted, a=0: declined) ◮ x: gender (x=1: female, x=0: male) ◮ lawsuit (1973): .44 = p(a = 1|x = 0) > p(a = 1|x = 1) = .35 ◮ ‘resolved’ by Bickel (1975)13 (See also Pearl14 ) 13 P.J. Bickel, E.A. Hammel and J.W. O’Connell (1975). “Sex Bias in Graduate Admissions: Data From Berkeley”. Science 187 (4175): 398–404 14 Pearl, Judea (December 2013). “Understanding Simpson’s paradox”. UCLA Cognitive Systems Laboratory, Technical Report R-414.
- 223. cautionary tale problem: Simpson’s paradox ◮ a: admissions (a=1: admitted, a=0: declined) ◮ x: gender (x=1: female, x=0: male) ◮ lawsuit (1973): .44 = p(a = 1|x = 0) > p(a = 1|x = 1) = .35 ◮ ‘resolved’ by Bickel (1975)13 (See also Pearl14 ) ◮ u: unobserved department they applied to 13 P.J. Bickel, E.A. Hammel and J.W. O’Connell (1975). “Sex Bias in Graduate Admissions: Data From Berkeley”. Science 187 (4175): 398–404 14 Pearl, Judea (December 2013). “Understanding Simpson’s paradox”. UCLA Cognitive Systems Laboratory, Technical Report R-414.
- 224. cautionary tale problem: Simpson’s paradox ◮ a: admissions (a=1: admitted, a=0: declined) ◮ x: gender (x=1: female, x=0: male) ◮ lawsuit (1973): .44 = p(a = 1|x = 0) > p(a = 1|x = 1) = .35 ◮ ‘resolved’ by Bickel (1975)13 (See also Pearl14 ) ◮ u: unobserved department they applied to ◮ p(a|x) = u=6 u=1 p(a|x, u)p(u|x) 13 P.J. Bickel, E.A. Hammel and J.W. O’Connell (1975). “Sex Bias in Graduate Admissions: Data From Berkeley”. Science 187 (4175): 398–404 14 Pearl, Judea (December 2013). “Understanding Simpson’s paradox”. UCLA Cognitive Systems Laboratory, Technical Report R-414.
- 225. cautionary tale problem: Simpson’s paradox ◮ a: admissions (a=1: admitted, a=0: declined) ◮ x: gender (x=1: female, x=0: male) ◮ lawsuit (1973): .44 = p(a = 1|x = 0) > p(a = 1|x = 1) = .35 ◮ ‘resolved’ by Bickel (1975)13 (See also Pearl14 ) ◮ u: unobserved department they applied to ◮ p(a|x) = u=6 u=1 p(a|x, u)p(u|x) ◮ e.g., gender-blind: p(a|1) − p(a|0) = p(a|u) · (p(u|1) − p(u|0)) 13 P.J. Bickel, E.A. Hammel and J.W. O’Connell (1975). “Sex Bias in Graduate Admissions: Data From Berkeley”. Science 187 (4175): 398–404 14 Pearl, Judea (December 2013). “Understanding Simpson’s paradox”. UCLA Cognitive Systems Laboratory, Technical Report R-414.
- 226. confounded approach: quasi-experiments + instruments 17 ◮ Q: does engagement drive retention? (NYT, NFLX, . . . )
- 227. confounded approach: quasi-experiments + instruments 17 ◮ Q: does engagement drive retention? (NYT, NFLX, . . . ) ◮ we don’t directly control engagement
- 228. confounded approach: quasi-experiments + instruments 17 ◮ Q: does engagement drive retention? (NYT, NFLX, . . . ) ◮ we don’t directly control engagement ◮ nonetheless useful since many things can inﬂuence it
- 229. confounded approach: quasi-experiments + instruments 17 ◮ Q: does engagement drive retention? (NYT, NFLX, . . . ) ◮ we don’t directly control engagement ◮ nonetheless useful since many things can inﬂuence it ◮ Q: does serving in Vietnam war decrease earnings15? 15 Angrist, Joshua D. “Lifetime earnings and the Vietnam era draft lottery: evidence from social security administrative records.” The American Economic Review (1990): 313-336.
- 230. confounded approach: quasi-experiments + instruments 17 ◮ Q: does engagement drive retention? (NYT, NFLX, . . . ) ◮ we don’t directly control engagement ◮ nonetheless useful since many things can inﬂuence it ◮ Q: does serving in Vietnam war decrease earnings15? ◮ US didn’t directly control serving in Vietnam, either16 15 Angrist, Joshua D. “Lifetime earnings and the Vietnam era draft lottery: evidence from social security administrative records.” The American Economic Review (1990): 313-336. 16 cf., George Bush, Donald Trump, Bill Clinton, Dick Cheney. . .
- 231. confounded approach: quasi-experiments + instruments 17 ◮ Q: does engagement drive retention? (NYT, NFLX, . . . ) ◮ we don’t directly control engagement ◮ nonetheless useful since many things can inﬂuence it ◮ Q: does serving in Vietnam war decrease earnings15? ◮ US didn’t directly control serving in Vietnam, either16 ◮ requires strong assumptions, including linear model 15 Angrist, Joshua D. “Lifetime earnings and the Vietnam era draft lottery: evidence from social security administrative records.” The American Economic Review (1990): 313-336. 16 cf., George Bush, Donald Trump, Bill Clinton, Dick Cheney. . . 17 I thank Sinan Aral, MIT Sloan, for bringing this to my attention
- 232. IV: graphical model assumption Figure 12: independence assumption
- 233. IV: graphical model assumption (sideways) Figure 13: independence assumption
- 234. IV: review s/OLS/MOM/ (E is empirical average) ◮ a endogenous
- 235. IV: review s/OLS/MOM/ (E is empirical average) ◮ a endogenous ◮ e.g., ∃u s.t. p(y|a, x, u), p(a|x, u)
- 236. IV: review s/OLS/MOM/ (E is empirical average) ◮ a endogenous ◮ e.g., ∃u s.t. p(y|a, x, u), p(a|x, u) ◮ linear ansatz: y = βT a + ǫ
- 237. IV: review s/OLS/MOM/ (E is empirical average) ◮ a endogenous ◮ e.g., ∃u s.t. p(y|a, x, u), p(a|x, u) ◮ linear ansatz: y = βT a + ǫ ◮ if a exogenous (e.g., OLS), use E[YAj] = E[βT AAj] + E[ǫAj] (note that E[AjAk] gives square matrix; invert for β)
- 238. IV: review s/OLS/MOM/ (E is empirical average) ◮ a endogenous ◮ e.g., ∃u s.t. p(y|a, x, u), p(a|x, u) ◮ linear ansatz: y = βT a + ǫ ◮ if a exogenous (e.g., OLS), use E[YAj] = E[βT AAj] + E[ǫAj] (note that E[AjAk] gives square matrix; invert for β) ◮ add instrument x uncorrelated with ǫ
- 239. IV: review s/OLS/MOM/ (E is empirical average) ◮ a endogenous ◮ e.g., ∃u s.t. p(y|a, x, u), p(a|x, u) ◮ linear ansatz: y = βT a + ǫ ◮ if a exogenous (e.g., OLS), use E[YAj] = E[βT AAj] + E[ǫAj] (note that E[AjAk] gives square matrix; invert for β) ◮ add instrument x uncorrelated with ǫ ◮ E[YXk] = E[βT AXk] + E[ǫ]E[Xk]
- 240. IV: review s/OLS/MOM/ (E is empirical average) ◮ a endogenous ◮ e.g., ∃u s.t. p(y|a, x, u), p(a|x, u) ◮ linear ansatz: y = βT a + ǫ ◮ if a exogenous (e.g., OLS), use E[YAj] = E[βT AAj] + E[ǫAj] (note that E[AjAk] gives square matrix; invert for β) ◮ add instrument x uncorrelated with ǫ ◮ E[YXk] = E[βT AXk] + E[ǫ]E[Xk] ◮ E[Y ] = E[βT A] + E[ǫ] (from ansatz)
- 241. IV: review s/OLS/MOM/ (E is empirical average) ◮ a endogenous ◮ e.g., ∃u s.t. p(y|a, x, u), p(a|x, u) ◮ linear ansatz: y = βT a + ǫ ◮ if a exogenous (e.g., OLS), use E[YAj] = E[βT AAj] + E[ǫAj] (note that E[AjAk] gives square matrix; invert for β) ◮ add instrument x uncorrelated with ǫ ◮ E[YXk] = E[βT AXk] + E[ǫ]E[Xk] ◮ E[Y ] = E[βT A] + E[ǫ] (from ansatz) ◮ C(Y , Xk) = βT C(A, Xk), not an “inversion” problem, requires “two stage regression”
- 242. IV: binary, binary case (aka “Wald estimator”) ◮ y = βa + ǫ
- 243. IV: binary, binary case (aka “Wald estimator”) ◮ y = βa + ǫ ◮ E(Y |x) = βE(A|x) + E(ǫ), evaluate at x = {0, 1}
- 244. IV: binary, binary case (aka “Wald estimator”) ◮ y = βa + ǫ ◮ E(Y |x) = βE(A|x) + E(ǫ), evaluate at x = {0, 1} ◮ β = (E(Y |x = 1) − E(Y |x = 0))/(E(A|x = 1) − E(A|x = 0)).
- 245. bandits: obligatory slide Figure 14: almost all the talks I’ve gone to on bandits have this image
- 246. bandits ◮ wide applicability: humane clinical trials, targeting, . . .
- 247. bandits ◮ wide applicability: humane clinical trials, targeting, . . . ◮ replace meetings with code
- 248. bandits ◮ wide applicability: humane clinical trials, targeting, . . . ◮ replace meetings with code ◮ requires software engineering to replace decisions with, e.g., Javascript
- 249. bandits ◮ wide applicability: humane clinical trials, targeting, . . . ◮ replace meetings with code ◮ requires software engineering to replace decisions with, e.g., Javascript ◮ most useful if decisions or items get “stale” quickly
- 250. bandits ◮ wide applicability: humane clinical trials, targeting, . . . ◮ replace meetings with code ◮ requires software engineering to replace decisions with, e.g., Javascript ◮ most useful if decisions or items get “stale” quickly ◮ less useful for one-oﬀ, major decisions to be “interpreted”
- 251. bandits ◮ wide applicability: humane clinical trials, targeting, . . . ◮ replace meetings with code ◮ requires software engineering to replace decisions with, e.g., Javascript ◮ most useful if decisions or items get “stale” quickly ◮ less useful for one-oﬀ, major decisions to be “interpreted”
- 252. bandits ◮ wide applicability: humane clinical trials, targeting, . . . ◮ replace meetings with code ◮ requires software engineering to replace decisions with, e.g., Javascript ◮ most useful if decisions or items get “stale” quickly ◮ less useful for one-oﬀ, major decisions to be “interpreted” examples ◮ ǫ-greedy (no context, aka ‘vanilla’, aka ‘context-free’)
- 253. bandits ◮ wide applicability: humane clinical trials, targeting, . . . ◮ replace meetings with code ◮ requires software engineering to replace decisions with, e.g., Javascript ◮ most useful if decisions or items get “stale” quickly ◮ less useful for one-oﬀ, major decisions to be “interpreted” examples ◮ ǫ-greedy (no context, aka ‘vanilla’, aka ‘context-free’) ◮ UCB1 (2002) (no context) + LinUCB (with context)
- 254. bandits ◮ wide applicability: humane clinical trials, targeting, . . . ◮ replace meetings with code ◮ requires software engineering to replace decisions with, e.g., Javascript ◮ most useful if decisions or items get “stale” quickly ◮ less useful for one-oﬀ, major decisions to be “interpreted” examples ◮ ǫ-greedy (no context, aka ‘vanilla’, aka ‘context-free’) ◮ UCB1 (2002) (no context) + LinUCB (with context) ◮ Thompson Sampling (1933)18,19,20 (general, with or without context) 18 Thompson, William R. “On the likelihood that one unknown probability exceeds another in view of the evidence of two samples”. Biometrika, 25(3–4):285–294, 1933. 19 AKA “probability matching”, “posterior sampling” 20 cf., “Bayesian Bandit Explorer” (link)
- 255. TS: connecting w/“generative causal modeling” 0 ◮ WAS p(y, x, a) = p(y|x, a)pα(a|x)p(x)
- 256. TS: connecting w/“generative causal modeling” 0 ◮ WAS p(y, x, a) = p(y|x, a)pα(a|x)p(x) ◮ These 3 terms were treated by
- 257. TS: connecting w/“generative causal modeling” 0 ◮ WAS p(y, x, a) = p(y|x, a)pα(a|x)p(x) ◮ These 3 terms were treated by ◮ response p(y|a, x): avoid regression/inferring using importance sampling
- 258. TS: connecting w/“generative causal modeling” 0 ◮ WAS p(y, x, a) = p(y|x, a)pα(a|x)p(x) ◮ These 3 terms were treated by ◮ response p(y|a, x): avoid regression/inferring using importance sampling ◮ policy pα(a|x): optimize ours, infer theirs
- 259. TS: connecting w/“generative causal modeling” 0 ◮ WAS p(y, x, a) = p(y|x, a)pα(a|x)p(x) ◮ These 3 terms were treated by ◮ response p(y|a, x): avoid regression/inferring using importance sampling ◮ policy pα(a|x): optimize ours, infer theirs ◮ (NB: ours was deterministic: p(a|x) = 1[a = h(x)])
- 260. TS: connecting w/“generative causal modeling” 0 ◮ WAS p(y, x, a) = p(y|x, a)pα(a|x)p(x) ◮ These 3 terms were treated by ◮ response p(y|a, x): avoid regression/inferring using importance sampling ◮ policy pα(a|x): optimize ours, infer theirs ◮ (NB: ours was deterministic: p(a|x) = 1[a = h(x)]) ◮ prior p(x): either avoid by importance sampling or estimate via kernel methods
- 261. TS: connecting w/“generative causal modeling” 0 ◮ WAS p(y, x, a) = p(y|x, a)pα(a|x)p(x) ◮ These 3 terms were treated by ◮ response p(y|a, x): avoid regression/inferring using importance sampling ◮ policy pα(a|x): optimize ours, infer theirs ◮ (NB: ours was deterministic: p(a|x) = 1[a = h(x)]) ◮ prior p(x): either avoid by importance sampling or estimate via kernel methods ◮ In the economics approach we focus on
- 262. TS: connecting w/“generative causal modeling” 0 ◮ WAS p(y, x, a) = p(y|x, a)pα(a|x)p(x) ◮ These 3 terms were treated by ◮ response p(y|a, x): avoid regression/inferring using importance sampling ◮ policy pα(a|x): optimize ours, infer theirs ◮ (NB: ours was deterministic: p(a|x) = 1[a = h(x)]) ◮ prior p(x): either avoid by importance sampling or estimate via kernel methods ◮ In the economics approach we focus on ◮ τ(. . .) ≡ Q(a = 1, . . .) − Q(a = 0, . . .) “treatment eﬀect”
- 263. TS: connecting w/“generative causal modeling” 0 ◮ WAS p(y, x, a) = p(y|x, a)pα(a|x)p(x) ◮ These 3 terms were treated by ◮ response p(y|a, x): avoid regression/inferring using importance sampling ◮ policy pα(a|x): optimize ours, infer theirs ◮ (NB: ours was deterministic: p(a|x) = 1[a = h(x)]) ◮ prior p(x): either avoid by importance sampling or estimate via kernel methods ◮ In the economics approach we focus on ◮ τ(. . .) ≡ Q(a = 1, . . .) − Q(a = 0, . . .) “treatment eﬀect” ◮ where Q(a, . . .) = y yp(y| . . .)
- 264. TS: connecting w/“generative causal modeling” 0 ◮ WAS p(y, x, a) = p(y|x, a)pα(a|x)p(x) ◮ These 3 terms were treated by ◮ response p(y|a, x): avoid regression/inferring using importance sampling ◮ policy pα(a|x): optimize ours, infer theirs ◮ (NB: ours was deterministic: p(a|x) = 1[a = h(x)]) ◮ prior p(x): either avoid by importance sampling or estimate via kernel methods ◮ In the economics approach we focus on ◮ τ(. . .) ≡ Q(a = 1, . . .) − Q(a = 0, . . .) “treatment eﬀect” ◮ where Q(a, . . .) = y yp(y| . . .)
- 265. TS: connecting w/“generative causal modeling” 0 ◮ WAS p(y, x, a) = p(y|x, a)pα(a|x)p(x) ◮ These 3 terms were treated by ◮ response p(y|a, x): avoid regression/inferring using importance sampling ◮ policy pα(a|x): optimize ours, infer theirs ◮ (NB: ours was deterministic: p(a|x) = 1[a = h(x)]) ◮ prior p(x): either avoid by importance sampling or estimate via kernel methods ◮ In the economics approach we focus on ◮ τ(. . .) ≡ Q(a = 1, . . .) − Q(a = 0, . . .) “treatment eﬀect” ◮ where Q(a, . . .) = y yp(y| . . .) In Thompson sampling we will generate 1 datum at a time, by ◮ asserting a parameterized generative model for p(y|a, x, θ)
- 266. TS: connecting w/“generative causal modeling” 0 ◮ WAS p(y, x, a) = p(y|x, a)pα(a|x)p(x) ◮ These 3 terms were treated by ◮ response p(y|a, x): avoid regression/inferring using importance sampling ◮ policy pα(a|x): optimize ours, infer theirs ◮ (NB: ours was deterministic: p(a|x) = 1[a = h(x)]) ◮ prior p(x): either avoid by importance sampling or estimate via kernel methods ◮ In the economics approach we focus on ◮ τ(. . .) ≡ Q(a = 1, . . .) − Q(a = 0, . . .) “treatment eﬀect” ◮ where Q(a, . . .) = y yp(y| . . .) In Thompson sampling we will generate 1 datum at a time, by ◮ asserting a parameterized generative model for p(y|a, x, θ) ◮ using a deterministic but averaged policy
- 267. TS: connecting w/“generative causal modeling” 1 ◮ model true world response function p(y|a, x) parametrically as p(y|a, x, θ∗)
- 268. TS: connecting w/“generative causal modeling” 1 ◮ model true world response function p(y|a, x) parametrically as p(y|a, x, θ∗) ◮ (i.e., θ∗ is the true value of the parameter)21 21 Note that θ is a vector, with components for each action.
- 269. TS: connecting w/“generative causal modeling” 1 ◮ model true world response function p(y|a, x) parametrically as p(y|a, x, θ∗) ◮ (i.e., θ∗ is the true value of the parameter)21 ◮ if you knew θ: 21 Note that θ is a vector, with components for each action.
- 270. TS: connecting w/“generative causal modeling” 1 ◮ model true world response function p(y|a, x) parametrically as p(y|a, x, θ∗) ◮ (i.e., θ∗ is the true value of the parameter)21 ◮ if you knew θ: ◮ could compute Q(a, x, θ) ≡ y yp(y|x, a, θ∗ ) directly 21 Note that θ is a vector, with components for each action.
- 271. TS: connecting w/“generative causal modeling” 1 ◮ model true world response function p(y|a, x) parametrically as p(y|a, x, θ∗) ◮ (i.e., θ∗ is the true value of the parameter)21 ◮ if you knew θ: ◮ could compute Q(a, x, θ) ≡ y yp(y|x, a, θ∗ ) directly ◮ then choose h(x; θ) = argmaxaQ(a, x, θ) 21 Note that θ is a vector, with components for each action.
- 272. TS: connecting w/“generative causal modeling” 1 ◮ model true world response function p(y|a, x) parametrically as p(y|a, x, θ∗) ◮ (i.e., θ∗ is the true value of the parameter)21 ◮ if you knew θ: ◮ could compute Q(a, x, θ) ≡ y yp(y|x, a, θ∗ ) directly ◮ then choose h(x; θ) = argmaxaQ(a, x, θ) ◮ inducing policy p(a|x, θ) = 1[a = h(x; θ) = argmaxaQ(a, x, θ)] 21 Note that θ is a vector, with components for each action.
- 273. TS: connecting w/“generative causal modeling” 1 ◮ model true world response function p(y|a, x) parametrically as p(y|a, x, θ∗) ◮ (i.e., θ∗ is the true value of the parameter)21 ◮ if you knew θ: ◮ could compute Q(a, x, θ) ≡ y yp(y|x, a, θ∗ ) directly ◮ then choose h(x; θ) = argmaxaQ(a, x, θ) ◮ inducing policy p(a|x, θ) = 1[a = h(x; θ) = argmaxaQ(a, x, θ)] ◮ idea: use prior data D = {y, a, x}t 1 to deﬁne non-deterministic policy: 21 Note that θ is a vector, with components for each action.
- 274. TS: connecting w/“generative causal modeling” 1 ◮ model true world response function p(y|a, x) parametrically as p(y|a, x, θ∗) ◮ (i.e., θ∗ is the true value of the parameter)21 ◮ if you knew θ: ◮ could compute Q(a, x, θ) ≡ y yp(y|x, a, θ∗ ) directly ◮ then choose h(x; θ) = argmaxaQ(a, x, θ) ◮ inducing policy p(a|x, θ) = 1[a = h(x; θ) = argmaxaQ(a, x, θ)] ◮ idea: use prior data D = {y, a, x}t 1 to deﬁne non-deterministic policy: ◮ p(a|x) = dθp(a|x, θ)p(θ|D) 21 Note that θ is a vector, with components for each action.
- 275. TS: connecting w/“generative causal modeling” 1 ◮ model true world response function p(y|a, x) parametrically as p(y|a, x, θ∗) ◮ (i.e., θ∗ is the true value of the parameter)21 ◮ if you knew θ: ◮ could compute Q(a, x, θ) ≡ y yp(y|x, a, θ∗ ) directly ◮ then choose h(x; θ) = argmaxaQ(a, x, θ) ◮ inducing policy p(a|x, θ) = 1[a = h(x; θ) = argmaxaQ(a, x, θ)] ◮ idea: use prior data D = {y, a, x}t 1 to deﬁne non-deterministic policy: ◮ p(a|x) = dθp(a|x, θ)p(θ|D) ◮ p(a|x) = dθ1[a = argmaxa′ Q(a′ , x, θ)]p(θ|D) 21 Note that θ is a vector, with components for each action.
- 276. TS: connecting w/“generative causal modeling” 1 ◮ model true world response function p(y|a, x) parametrically as p(y|a, x, θ∗) ◮ (i.e., θ∗ is the true value of the parameter)21 ◮ if you knew θ: ◮ could compute Q(a, x, θ) ≡ y yp(y|x, a, θ∗ ) directly ◮ then choose h(x; θ) = argmaxaQ(a, x, θ) ◮ inducing policy p(a|x, θ) = 1[a = h(x; θ) = argmaxaQ(a, x, θ)] ◮ idea: use prior data D = {y, a, x}t 1 to deﬁne non-deterministic policy: ◮ p(a|x) = dθp(a|x, θ)p(θ|D) ◮ p(a|x) = dθ1[a = argmaxa′ Q(a′ , x, θ)]p(θ|D) ◮ hold up: 21 Note that θ is a vector, with components for each action.
- 277. TS: connecting w/“generative causal modeling” 1 ◮ model true world response function p(y|a, x) parametrically as p(y|a, x, θ∗) ◮ (i.e., θ∗ is the true value of the parameter)21 ◮ if you knew θ: ◮ could compute Q(a, x, θ) ≡ y yp(y|x, a, θ∗ ) directly ◮ then choose h(x; θ) = argmaxaQ(a, x, θ) ◮ inducing policy p(a|x, θ) = 1[a = h(x; θ) = argmaxaQ(a, x, θ)] ◮ idea: use prior data D = {y, a, x}t 1 to deﬁne non-deterministic policy: ◮ p(a|x) = dθp(a|x, θ)p(θ|D) ◮ p(a|x) = dθ1[a = argmaxa′ Q(a′ , x, θ)]p(θ|D) ◮ hold up: ◮ Q1: what’s p(θ|D)? 21 Note that θ is a vector, with components for each action.
- 278. TS: connecting w/“generative causal modeling” 1 ◮ model true world response function p(y|a, x) parametrically as p(y|a, x, θ∗) ◮ (i.e., θ∗ is the true value of the parameter)21 ◮ if you knew θ: ◮ could compute Q(a, x, θ) ≡ y yp(y|x, a, θ∗ ) directly ◮ then choose h(x; θ) = argmaxaQ(a, x, θ) ◮ inducing policy p(a|x, θ) = 1[a = h(x; θ) = argmaxaQ(a, x, θ)] ◮ idea: use prior data D = {y, a, x}t 1 to deﬁne non-deterministic policy: ◮ p(a|x) = dθp(a|x, θ)p(θ|D) ◮ p(a|x) = dθ1[a = argmaxa′ Q(a′ , x, θ)]p(θ|D) ◮ hold up: ◮ Q1: what’s p(θ|D)? ◮ Q2: how am I going to evaluate this integral? 21 Note that θ is a vector, with components for each action.
- 279. TS: connecting w/“generative causal modeling” 2 ◮ Q1: what’s p(θ|D)?
- 280. TS: connecting w/“generative causal modeling” 2 ◮ Q1: what’s p(θ|D)? ◮ Q2: how am I going to evaluate this integral?
- 281. TS: connecting w/“generative causal modeling” 2 ◮ Q1: what’s p(θ|D)? ◮ Q2: how am I going to evaluate this integral? ◮ A1: p(θ|D) deﬁnable by choosing prior p(θ|α) and likelihood on y given by the (modeled, parameterized) response p(y|a, x, θ).
- 282. TS: connecting w/“generative causal modeling” 2 ◮ Q1: what’s p(θ|D)? ◮ Q2: how am I going to evaluate this integral? ◮ A1: p(θ|D) deﬁnable by choosing prior p(θ|α) and likelihood on y given by the (modeled, parameterized) response p(y|a, x, θ). ◮ (now you’re not only generative, you’re Bayesian.)
- 283. TS: connecting w/“generative causal modeling” 2 ◮ Q1: what’s p(θ|D)? ◮ Q2: how am I going to evaluate this integral? ◮ A1: p(θ|D) deﬁnable by choosing prior p(θ|α) and likelihood on y given by the (modeled, parameterized) response p(y|a, x, θ). ◮ (now you’re not only generative, you’re Bayesian.) ◮ p(θ|D) = p(θ|{y}t 1, {a}t 1, {x}t 1, α)
- 284. TS: connecting w/“generative causal modeling” 2 ◮ Q1: what’s p(θ|D)? ◮ Q2: how am I going to evaluate this integral? ◮ A1: p(θ|D) deﬁnable by choosing prior p(θ|α) and likelihood on y given by the (modeled, parameterized) response p(y|a, x, θ). ◮ (now you’re not only generative, you’re Bayesian.) ◮ p(θ|D) = p(θ|{y}t 1, {a}t 1, {x}t 1, α) ◮ ∝ p({y}t 1|{a}t 1, {x}t 1, θ)p(θ|α)
- 285. TS: connecting w/“generative causal modeling” 2 ◮ Q1: what’s p(θ|D)? ◮ Q2: how am I going to evaluate this integral? ◮ A1: p(θ|D) deﬁnable by choosing prior p(θ|α) and likelihood on y given by the (modeled, parameterized) response p(y|a, x, θ). ◮ (now you’re not only generative, you’re Bayesian.) ◮ p(θ|D) = p(θ|{y}t 1, {a}t 1, {x}t 1, α) ◮ ∝ p({y}t 1|{a}t 1, {x}t 1, θ)p(θ|α) ◮ = p(θ|α)Πtp(yt|at, xt, θ)
- 286. TS: connecting w/“generative causal modeling” 2 ◮ Q1: what’s p(θ|D)? ◮ Q2: how am I going to evaluate this integral? ◮ A1: p(θ|D) deﬁnable by choosing prior p(θ|α) and likelihood on y given by the (modeled, parameterized) response p(y|a, x, θ). ◮ (now you’re not only generative, you’re Bayesian.) ◮ p(θ|D) = p(θ|{y}t 1, {a}t 1, {x}t 1, α) ◮ ∝ p({y}t 1|{a}t 1, {x}t 1, θ)p(θ|α) ◮ = p(θ|α)Πtp(yt|at, xt, θ) ◮ warning 1: sometimes people write “p(D|θ)” but we don’t need p(a|θ) or p(x|θ) here
- 287. TS: connecting w/“generative causal modeling” 2 ◮ Q1: what’s p(θ|D)? ◮ Q2: how am I going to evaluate this integral? ◮ A1: p(θ|D) deﬁnable by choosing prior p(θ|α) and likelihood on y given by the (modeled, parameterized) response p(y|a, x, θ). ◮ (now you’re not only generative, you’re Bayesian.) ◮ p(θ|D) = p(θ|{y}t 1, {a}t 1, {x}t 1, α) ◮ ∝ p({y}t 1|{a}t 1, {x}t 1, θ)p(θ|α) ◮ = p(θ|α)Πtp(yt|at, xt, θ) ◮ warning 1: sometimes people write “p(D|θ)” but we don’t need p(a|θ) or p(x|θ) here ◮ warning 2: don’t need historical record of θt.
- 288. TS: connecting w/“generative causal modeling” 2 ◮ Q1: what’s p(θ|D)? ◮ Q2: how am I going to evaluate this integral? ◮ A1: p(θ|D) deﬁnable by choosing prior p(θ|α) and likelihood on y given by the (modeled, parameterized) response p(y|a, x, θ). ◮ (now you’re not only generative, you’re Bayesian.) ◮ p(θ|D) = p(θ|{y}t 1, {a}t 1, {x}t 1, α) ◮ ∝ p({y}t 1|{a}t 1, {x}t 1, θ)p(θ|α) ◮ = p(θ|α)Πtp(yt|at, xt, θ) ◮ warning 1: sometimes people write “p(D|θ)” but we don’t need p(a|θ) or p(x|θ) here ◮ warning 2: don’t need historical record of θt. ◮ (we used Bayes rule, but only in θ and y.)
- 289. TS: connecting w/“generative causal modeling” 2 ◮ Q1: what’s p(θ|D)? ◮ Q2: how am I going to evaluate this integral? ◮ A1: p(θ|D) deﬁnable by choosing prior p(θ|α) and likelihood on y given by the (modeled, parameterized) response p(y|a, x, θ). ◮ (now you’re not only generative, you’re Bayesian.) ◮ p(θ|D) = p(θ|{y}t 1, {a}t 1, {x}t 1, α) ◮ ∝ p({y}t 1|{a}t 1, {x}t 1, θ)p(θ|α) ◮ = p(θ|α)Πtp(yt|at, xt, θ) ◮ warning 1: sometimes people write “p(D|θ)” but we don’t need p(a|θ) or p(x|θ) here ◮ warning 2: don’t need historical record of θt. ◮ (we used Bayes rule, but only in θ and y.) ◮ A2: evaluate integral by N = 1 Monte Carlo22 22 AFAIK it is open research area what happens when you replace N = 1 with N = N0t−ν .
- 290. TS: connecting w/“generative causal modeling” 2 ◮ Q1: what’s p(θ|D)? ◮ Q2: how am I going to evaluate this integral? ◮ A1: p(θ|D) deﬁnable by choosing prior p(θ|α) and likelihood on y given by the (modeled, parameterized) response p(y|a, x, θ). ◮ (now you’re not only generative, you’re Bayesian.) ◮ p(θ|D) = p(θ|{y}t 1, {a}t 1, {x}t 1, α) ◮ ∝ p({y}t 1|{a}t 1, {x}t 1, θ)p(θ|α) ◮ = p(θ|α)Πtp(yt|at, xt, θ) ◮ warning 1: sometimes people write “p(D|θ)” but we don’t need p(a|θ) or p(x|θ) here ◮ warning 2: don’t need historical record of θt. ◮ (we used Bayes rule, but only in θ and y.) ◮ A2: evaluate integral by N = 1 Monte Carlo22 ◮ take 1 sample “θt” of θ from p(θ|D) 22 AFAIK it is open research area what happens when you replace N = 1 with N = N0t−ν .
- 291. TS: connecting w/“generative causal modeling” 2 ◮ Q1: what’s p(θ|D)? ◮ Q2: how am I going to evaluate this integral? ◮ A1: p(θ|D) deﬁnable by choosing prior p(θ|α) and likelihood on y given by the (modeled, parameterized) response p(y|a, x, θ). ◮ (now you’re not only generative, you’re Bayesian.) ◮ p(θ|D) = p(θ|{y}t 1, {a}t 1, {x}t 1, α) ◮ ∝ p({y}t 1|{a}t 1, {x}t 1, θ)p(θ|α) ◮ = p(θ|α)Πtp(yt|at, xt, θ) ◮ warning 1: sometimes people write “p(D|θ)” but we don’t need p(a|θ) or p(x|θ) here ◮ warning 2: don’t need historical record of θt. ◮ (we used Bayes rule, but only in θ and y.) ◮ A2: evaluate integral by N = 1 Monte Carlo22 ◮ take 1 sample “θt” of θ from p(θ|D) ◮ at = h(xt; θt) = argmaxaQ(a, x, θt) 22 AFAIK it is open research area what happens when you replace N = 1 with N = N0t−ν .
- 292. That sounds hard. No, just general. Let’s do toy case: ◮ y ∈ {0, 1},
- 293. That sounds hard. No, just general. Let’s do toy case: ◮ y ∈ {0, 1}, ◮ no context x,
- 294. That sounds hard. No, just general. Let’s do toy case: ◮ y ∈ {0, 1}, ◮ no context x, ◮ Bernoulli (coin ﬂipping), keep track of
- 295. That sounds hard. No, just general. Let’s do toy case: ◮ y ∈ {0, 1}, ◮ no context x, ◮ Bernoulli (coin ﬂipping), keep track of ◮ Sa ≡ number of successes ﬂipping coin a
- 296. That sounds hard. No, just general. Let’s do toy case: ◮ y ∈ {0, 1}, ◮ no context x, ◮ Bernoulli (coin ﬂipping), keep track of ◮ Sa ≡ number of successes ﬂipping coin a ◮ Fa ≡ number of failures ﬂipping coin a
- 297. That sounds hard. No, just general. Let’s do toy case: ◮ y ∈ {0, 1}, ◮ no context x, ◮ Bernoulli (coin ﬂipping), keep track of ◮ Sa ≡ number of successes ﬂipping coin a ◮ Fa ≡ number of failures ﬂipping coin a
- 298. That sounds hard. No, just general. Let’s do toy case: ◮ y ∈ {0, 1}, ◮ no context x, ◮ Bernoulli (coin ﬂipping), keep track of ◮ Sa ≡ number of successes ﬂipping coin a ◮ Fa ≡ number of failures ﬂipping coin a Then ◮ p(θ|D) ∝ p(θ|α)Πtp(yt|at, θ)
- 299. That sounds hard. No, just general. Let’s do toy case: ◮ y ∈ {0, 1}, ◮ no context x, ◮ Bernoulli (coin ﬂipping), keep track of ◮ Sa ≡ number of successes ﬂipping coin a ◮ Fa ≡ number of failures ﬂipping coin a Then ◮ p(θ|D) ∝ p(θ|α)Πtp(yt|at, θ) ◮ = Πaθα−1 a (1 − θa)β−1 Πt,at θyt at (1 − θat )1−yt
- 300. That sounds hard. No, just general. Let’s do toy case: ◮ y ∈ {0, 1}, ◮ no context x, ◮ Bernoulli (coin ﬂipping), keep track of ◮ Sa ≡ number of successes ﬂipping coin a ◮ Fa ≡ number of failures ﬂipping coin a Then ◮ p(θ|D) ∝ p(θ|α)Πtp(yt|at, θ) ◮ = Πaθα−1 a (1 − θa)β−1 Πt,at θyt at (1 − θat )1−yt ◮ = Πaθα+Sa−1(1 − θa)β+Fa−1
- 301. That sounds hard. No, just general. Let’s do toy case: ◮ y ∈ {0, 1}, ◮ no context x, ◮ Bernoulli (coin ﬂipping), keep track of ◮ Sa ≡ number of successes ﬂipping coin a ◮ Fa ≡ number of failures ﬂipping coin a Then ◮ p(θ|D) ∝ p(θ|α)Πtp(yt|at, θ) ◮ = Πaθα−1 a (1 − θa)β−1 Πt,at θyt at (1 − θat )1−yt ◮ = Πaθα+Sa−1(1 − θa)β+Fa−1 ◮ ∴ θa ∼ Beta(α + Sa, β + Fa)
- 302. Thompson sampling: results (2011) Figure 15: Chaleppe and Li 2011
- 303. TS: words Figure 16: from Chaleppe and Li 2011
- 304. TS: p-code Figure 17: from Chaleppe and Li 2011
- 305. TS: Bernoulli bandit p-code23
- 306. TS: Bernoulli bandit p-code (results) Figure 19: from Chaleppe and Li 2011
- 307. UCB1 (2002), p-code Figure 20: UCB1 from Auer, Peter, Nicolo Cesa-Bianchi, and Paul Fischer. “Finite-time analysis of the multiarmed bandit problem.” Machine learning 47.2-3 (2002): 235-256.
- 308. TS: with context Figure 21: from Chaleppe and Li 2011
- 309. LinUCB: UCB with context Figure 22: LinUCB
- 310. TS: with context (results)
- 311. Bandits: Regret via Lai and Robbins (1985) Figure 24: Lai Robbins
- 312. Thompson sampling (1933) and optimality (2013) Figure 25: TS result from S. Agrawal, N. Goyal, ”Further optimal regret bounds for Thompson Sampling”, AISTATS 2013.; see also Agrawal, Shipra, and Navin Goyal. “Analysis of Thompson Sampling for the Multi-armed Bandit Problem.” COLT. 2012 and Emilie Kaufmann, Nathaniel Korda, and R´emi Munos. Thompson sampling: An asymptotically optimal ﬁnite-time analysis. In Algorithmic Learning Theory, pages 199–213. Springer, 2012.
- 313. other ‘Causalities’: structure learning Figure 26: from heckerman 1995 D. Heckerman. A Tutorial on Learning with Bayesian Networks. Technical Report MSR-TR-95-06, Microsoft Research, March, 1995.
- 314. other ‘Causalities’: potential outcomes ◮ model distribution of p(yi (1), yi (0), ai , xi )
- 315. other ‘Causalities’: potential outcomes ◮ model distribution of p(yi (1), yi (0), ai , xi ) ◮ “action” replaced by “observed outcome”
- 316. other ‘Causalities’: potential outcomes ◮ model distribution of p(yi (1), yi (0), ai , xi ) ◮ “action” replaced by “observed outcome” ◮ aka Neyman-Rubin causal model: Neyman (’23); Rubin (’74)
- 317. other ‘Causalities’: potential outcomes ◮ model distribution of p(yi (1), yi (0), ai , xi ) ◮ “action” replaced by “observed outcome” ◮ aka Neyman-Rubin causal model: Neyman (’23); Rubin (’74) ◮ see Morgan + Winship24 for connections between frameworks 24 Morgan, Stephen L., and Christopher Winship. Counterfactuals and causal inference Cambridge University Press, 2014.
- 318. Lecture 4: descriptive modeling @ NYT
- 319. review: (latent) inference and clustering ◮ what does kmeans mean?
- 320. review: (latent) inference and clustering ◮ what does kmeans mean? ◮ given xi ∈ RD
- 321. review: (latent) inference and clustering ◮ what does kmeans mean? ◮ given xi ∈ RD ◮ given d : RD → R1
- 322. review: (latent) inference and clustering ◮ what does kmeans mean? ◮ given xi ∈ RD ◮ given d : RD → R1 ◮ assign zi
- 323. review: (latent) inference and clustering ◮ what does kmeans mean? ◮ given xi ∈ RD ◮ given d : RD → R1 ◮ assign zi ◮ generative modeling gives meaning
- 324. review: (latent) inference and clustering ◮ what does kmeans mean? ◮ given xi ∈ RD ◮ given d : RD → R1 ◮ assign zi ◮ generative modeling gives meaning ◮ given p(x|z, θ)
- 325. review: (latent) inference and clustering ◮ what does kmeans mean? ◮ given xi ∈ RD ◮ given d : RD → R1 ◮ assign zi ◮ generative modeling gives meaning ◮ given p(x|z, θ) ◮ maximize p(x|θ)
- 326. review: (latent) inference and clustering ◮ what does kmeans mean? ◮ given xi ∈ RD ◮ given d : RD → R1 ◮ assign zi ◮ generative modeling gives meaning ◮ given p(x|z, θ) ◮ maximize p(x|θ) ◮ output assignment p(z|x, θ)
- 327. actual math ◮ deﬁne P ≡ p(x, z|θ)
- 328. actual math ◮ deﬁne P ≡ p(x, z|θ) ◮ log-likelihood L ≡ log p(x|θ) = log z P = log EqP/q (cf. importance sampling)
- 329. actual math ◮ deﬁne P ≡ p(x, z|θ) ◮ log-likelihood L ≡ log p(x|θ) = log z P = log EqP/q (cf. importance sampling) ◮ Jensen’s: L ≥ ˜L ≡ Eq log P/q = Eq log P + H[q] = −(U − H) = −F
- 330. actual math ◮ deﬁne P ≡ p(x, z|θ) ◮ log-likelihood L ≡ log p(x|θ) = log z P = log EqP/q (cf. importance sampling) ◮ Jensen’s: L ≥ ˜L ≡ Eq log P/q = Eq log P + H[q] = −(U − H) = −F ◮ analogy to free energy in physics
- 331. actual math ◮ deﬁne P ≡ p(x, z|θ) ◮ log-likelihood L ≡ log p(x|θ) = log z P = log EqP/q (cf. importance sampling) ◮ Jensen’s: L ≥ ˜L ≡ Eq log P/q = Eq log P + H[q] = −(U − H) = −F ◮ analogy to free energy in physics ◮ alternate optimization on θ and on q
- 332. actual math ◮ deﬁne P ≡ p(x, z|θ) ◮ log-likelihood L ≡ log p(x|θ) = log z P = log EqP/q (cf. importance sampling) ◮ Jensen’s: L ≥ ˜L ≡ Eq log P/q = Eq log P + H[q] = −(U − H) = −F ◮ analogy to free energy in physics ◮ alternate optimization on θ and on q ◮ NB: q step gives q(z) = p(z|x, θ)
- 333. actual math ◮ deﬁne P ≡ p(x, z|θ) ◮ log-likelihood L ≡ log p(x|θ) = log z P = log EqP/q (cf. importance sampling) ◮ Jensen’s: L ≥ ˜L ≡ Eq log P/q = Eq log P + H[q] = −(U − H) = −F ◮ analogy to free energy in physics ◮ alternate optimization on θ and on q ◮ NB: q step gives q(z) = p(z|x, θ) ◮ NB: log P convenient for independent examples w/ exponential families
- 334. actual math ◮ deﬁne P ≡ p(x, z|θ) ◮ log-likelihood L ≡ log p(x|θ) = log z P = log EqP/q (cf. importance sampling) ◮ Jensen’s: L ≥ ˜L ≡ Eq log P/q = Eq log P + H[q] = −(U − H) = −F ◮ analogy to free energy in physics ◮ alternate optimization on θ and on q ◮ NB: q step gives q(z) = p(z|x, θ) ◮ NB: log P convenient for independent examples w/ exponential families ◮ e.g., GMMs: µk ← E[x|z] and σ2 k ← E[(x − µ)2 |z] are suﬃcient statistics
- 335. actual math ◮ deﬁne P ≡ p(x, z|θ) ◮ log-likelihood L ≡ log p(x|θ) = log z P = log EqP/q (cf. importance sampling) ◮ Jensen’s: L ≥ ˜L ≡ Eq log P/q = Eq log P + H[q] = −(U − H) = −F ◮ analogy to free energy in physics ◮ alternate optimization on θ and on q ◮ NB: q step gives q(z) = p(z|x, θ) ◮ NB: log P convenient for independent examples w/ exponential families ◮ e.g., GMMs: µk ← E[x|z] and σ2 k ← E[(x − µ)2 |z] are suﬃcient statistics ◮ e.g., LDAs: word counts are suﬃcient statistics
- 336. tangent: more math on GMMs, part 1 Energy U (to be minimized): ◮ −U ≡ Eq log P = z i qi (z) log P(xi , zi ) ≡ Ux + Uz
- 337. tangent: more math on GMMs, part 1 Energy U (to be minimized): ◮ −U ≡ Eq log P = z i qi (z) log P(xi , zi ) ≡ Ux + Uz ◮ −Ux ≡ z i qi (z) log p(xi |zi )
- 338. tangent: more math on GMMs, part 1 Energy U (to be minimized): ◮ −U ≡ Eq log P = z i qi (z) log P(xi , zi ) ≡ Ux + Uz ◮ −Ux ≡ z i qi (z) log p(xi |zi ) ◮ = i z qi (z) k 1[zi = k] log p(xi |zi )
- 339. tangent: more math on GMMs, part 1 Energy U (to be minimized): ◮ −U ≡ Eq log P = z i qi (z) log P(xi , zi ) ≡ Ux + Uz ◮ −Ux ≡ z i qi (z) log p(xi |zi ) ◮ = i z qi (z) k 1[zi = k] log p(xi |zi ) ◮ deﬁne rik = z qi (z)1[zi = k]
- 340. tangent: more math on GMMs, part 1 Energy U (to be minimized): ◮ −U ≡ Eq log P = z i qi (z) log P(xi , zi ) ≡ Ux + Uz ◮ −Ux ≡ z i qi (z) log p(xi |zi ) ◮ = i z qi (z) k 1[zi = k] log p(xi |zi ) ◮ deﬁne rik = z qi (z)1[zi = k] ◮ −Ux = i rik log p(xi |k).
- 341. tangent: more math on GMMs, part 1 Energy U (to be minimized): ◮ −U ≡ Eq log P = z i qi (z) log P(xi , zi ) ≡ Ux + Uz ◮ −Ux ≡ z i qi (z) log p(xi |zi ) ◮ = i z qi (z) k 1[zi = k] log p(xi |zi ) ◮ deﬁne rik = z qi (z)1[zi = k] ◮ −Ux = i rik log p(xi |k). ◮ Gaussian25 ⇒ −Ux = i rik −1 2 (xi − µk)2λk + 1 2 ln λk − 1 2 ln 2π 25 math is simpler if you work with λk ≡ σ−2
- 342. tangent: more math on GMMs, part 1 Energy U (to be minimized): ◮ −U ≡ Eq log P = z i qi (z) log P(xi , zi ) ≡ Ux + Uz ◮ −Ux ≡ z i qi (z) log p(xi |zi ) ◮ = i z qi (z) k 1[zi = k] log p(xi |zi ) ◮ deﬁne rik = z qi (z)1[zi = k] ◮ −Ux = i rik log p(xi |k). ◮ Gaussian25 ⇒ −Ux = i rik −1 2 (xi − µk)2λk + 1 2 ln λk − 1 2 ln 2π 25 math is simpler if you work with λk ≡ σ−2
- 343. tangent: more math on GMMs, part 1 Energy U (to be minimized): ◮ −U ≡ Eq log P = z i qi (z) log P(xi , zi ) ≡ Ux + Uz ◮ −Ux ≡ z i qi (z) log p(xi |zi ) ◮ = i z qi (z) k 1[zi = k] log p(xi |zi ) ◮ deﬁne rik = z qi (z)1[zi = k] ◮ −Ux = i rik log p(xi |k). ◮ Gaussian25 ⇒ −Ux = i rik −1 2 (xi − µk)2λk + 1 2 ln λk − 1 2 ln 2π simple to minimize for parameters ϑ = {µk, λk} 25 math is simpler if you work with λk ≡ σ−2
- 344. tangent: more math on GMMs, part 2 ◮ -Ux = i rik −1 2 (xi − µk)2λk + 1 2 ln λk − 1 2 ln 2π
- 345. tangent: more math on GMMs, part 2 ◮ -Ux = i rik −1 2 (xi − µk)2λk + 1 2 ln λk − 1 2 ln 2π ◮ µk ← E[x|k] solves i rik = i rikxi
- 346. tangent: more math on GMMs, part 2 ◮ -Ux = i rik −1 2 (xi − µk)2λk + 1 2 ln λk − 1 2 ln 2π ◮ µk ← E[x|k] solves i rik = i rikxi ◮ λk ← E[(x − µ)2|k] solves i rik 1 2 (xi − µk)2 = λ−1 k i rik
- 347. tangent: Gaussians ∈ exponential family27 ◮ as before, −U = i rik log p(xi |k)
- 348. tangent: Gaussians ∈ exponential family27 ◮ as before, −U = i rik log p(xi |k) ◮ deﬁne p(xi |k) = exp (η(θ) · T(x) − A(θ) + B(x))
- 349. tangent: Gaussians ∈ exponential family27 ◮ as before, −U = i rik log p(xi |k) ◮ deﬁne p(xi |k) = exp (η(θ) · T(x) − A(θ) + B(x)) ◮ e.g., Gaussian case 26, 26 Choosing η(θ) = η called ‘canonical form’
- 350. tangent: Gaussians ∈ exponential family27 ◮ as before, −U = i rik log p(xi |k) ◮ deﬁne p(xi |k) = exp (η(θ) · T(x) − A(θ) + B(x)) ◮ e.g., Gaussian case 26, ◮ T1 = x, 26 Choosing η(θ) = η called ‘canonical form’
- 351. tangent: Gaussians ∈ exponential family27 ◮ as before, −U = i rik log p(xi |k) ◮ deﬁne p(xi |k) = exp (η(θ) · T(x) − A(θ) + B(x)) ◮ e.g., Gaussian case 26, ◮ T1 = x, ◮ T2 = x2 26 Choosing η(θ) = η called ‘canonical form’
- 352. tangent: Gaussians ∈ exponential family27 ◮ as before, −U = i rik log p(xi |k) ◮ deﬁne p(xi |k) = exp (η(θ) · T(x) − A(θ) + B(x)) ◮ e.g., Gaussian case 26, ◮ T1 = x, ◮ T2 = x2 ◮ η1 = µ/σ2 = µλ 26 Choosing η(θ) = η called ‘canonical form’
- 353. tangent: Gaussians ∈ exponential family27 ◮ as before, −U = i rik log p(xi |k) ◮ deﬁne p(xi |k) = exp (η(θ) · T(x) − A(θ) + B(x)) ◮ e.g., Gaussian case 26, ◮ T1 = x, ◮ T2 = x2 ◮ η1 = µ/σ2 = µλ ◮ η2 = −1 2 λ = −1/(2σ2 ) 26 Choosing η(θ) = η called ‘canonical form’
- 354. tangent: Gaussians ∈ exponential family27 ◮ as before, −U = i rik log p(xi |k) ◮ deﬁne p(xi |k) = exp (η(θ) · T(x) − A(θ) + B(x)) ◮ e.g., Gaussian case 26, ◮ T1 = x, ◮ T2 = x2 ◮ η1 = µ/σ2 = µλ ◮ η2 = −1 2 λ = −1/(2σ2 ) ◮ A = λµ2 /2 − 1 2 ln λ 26 Choosing η(θ) = η called ‘canonical form’
- 355. tangent: Gaussians ∈ exponential family27 ◮ as before, −U = i rik log p(xi |k) ◮ deﬁne p(xi |k) = exp (η(θ) · T(x) − A(θ) + B(x)) ◮ e.g., Gaussian case 26, ◮ T1 = x, ◮ T2 = x2 ◮ η1 = µ/σ2 = µλ ◮ η2 = −1 2 λ = −1/(2σ2 ) ◮ A = λµ2 /2 − 1 2 ln λ ◮ exp(B(x)) = (2π)−1/2 26 Choosing η(θ) = η called ‘canonical form’
- 356. tangent: Gaussians ∈ exponential family27 ◮ as before, −U = i rik log p(xi |k) ◮ deﬁne p(xi |k) = exp (η(θ) · T(x) − A(θ) + B(x)) ◮ e.g., Gaussian case 26, ◮ T1 = x, ◮ T2 = x2 ◮ η1 = µ/σ2 = µλ ◮ η2 = −1 2 λ = −1/(2σ2 ) ◮ A = λµ2 /2 − 1 2 ln λ ◮ exp(B(x)) = (2π)−1/2 ◮ note that in a mixture model, there are separate η (and thus A(η)) for each value of z 26 Choosing η(θ) = η called ‘canonical form’ 27 NB: Gaussians ∈ exponential family, GMM /∈ exponential family! (Thanks to Eszter Vértes for pointing out this error in earlier title.)
- 357. tangent: variational joy ∈ exponential family ◮ as before, −U = i rik ηT k T(xi ) − A(ηk) + B(xi )
- 358. tangent: variational joy ∈ exponential family ◮ as before, −U = i rik ηT k T(xi ) − A(ηk) + B(xi ) ◮ ηk,α solves i rikTk,α(xi ) = ∂A(ηk ) ∂ηk,α i rik (canonical)
- 359. tangent: variational joy ∈ exponential family ◮ as before, −U = i rik ηT k T(xi ) − A(ηk) + B(xi ) ◮ ηk,α solves i rikTk,α(xi ) = ∂A(ηk ) ∂ηk,α i rik (canonical) ◮ ∴ ∂ηk,α A(ηk) ← E[Tk,α|k] (canonical)
- 360. tangent: variational joy ∈ exponential family ◮ as before, −U = i rik ηT k T(xi ) − A(ηk) + B(xi ) ◮ ηk,α solves i rikTk,α(xi ) = ∂A(ηk ) ∂ηk,α i rik (canonical) ◮ ∴ ∂ηk,α A(ηk) ← E[Tk,α|k] (canonical) ◮ nice connection w/physics, esp. mean ﬁeld theory28 28 read MacKay, David JC. Information theory, inference and learning algorithms, Cambridge university press, 2003 to learn more. Actually you should read it regardless.
- 361. clustering and inference: GMM/k-means case study ◮ generative model gives meaning and optimization
- 362. clustering and inference: GMM/k-means case study ◮ generative model gives meaning and optimization ◮ large freedom to choose diﬀerent optimization approaches
- 363. clustering and inference: GMM/k-means case study ◮ generative model gives meaning and optimization ◮ large freedom to choose diﬀerent optimization approaches ◮ e.g., hard clustering limit
- 364. clustering and inference: GMM/k-means case study ◮ generative model gives meaning and optimization ◮ large freedom to choose diﬀerent optimization approaches ◮ e.g., hard clustering limit ◮ e.g., streaming solutions
- 365. clustering and inference: GMM/k-means case study ◮ generative model gives meaning and optimization ◮ large freedom to choose diﬀerent optimization approaches ◮ e.g., hard clustering limit ◮ e.g., streaming solutions ◮ e.g., stochastic gradient methods
- 366. general framework: E+M/variational ◮ e.g., GMM+hard clustering gives kmeans
- 367. general framework: E+M/variational ◮ e.g., GMM+hard clustering gives kmeans ◮ e.g., some favorite applications:
- 368. general framework: E+M/variational ◮ e.g., GMM+hard clustering gives kmeans ◮ e.g., some favorite applications: ◮ hmm
- 369. general framework: E+M/variational ◮ e.g., GMM+hard clustering gives kmeans ◮ e.g., some favorite applications: ◮ hmm ◮ vbmod: arXiv:0709.3512
- 370. general framework: E+M/variational ◮ e.g., GMM+hard clustering gives kmeans ◮ e.g., some favorite applications: ◮ hmm ◮ vbmod: arXiv:0709.3512 ◮ ebfret: ebfret.github.io
- 371. general framework: E+M/variational ◮ e.g., GMM+hard clustering gives kmeans ◮ e.g., some favorite applications: ◮ hmm ◮ vbmod: arXiv:0709.3512 ◮ ebfret: ebfret.github.io ◮ EDHMM: edhmm.github.io
- 372. example application: LDA+topics Figure 27: From Blei 2003
- 373. rec engine via CTM 29 Figure 28: From Blei 2011
- 374. recall: recommendation via factoring Figure 29: From Blei 2011
- 375. CTM: combined loss function Figure 30: From Blei 2011
- 376. CTM: updates for factors Figure 31: From Blei 2011
- 377. CTM: (via Jensen’s, again) bound on loss Figure 32: From Blei 2011
- 378. Lecture 5 data product
- 379. data science and design thinking ◮ knowing customer
- 380. data science and design thinking ◮ knowing customer ◮ right tool for right job
- 381. data science and design thinking ◮ knowing customer ◮ right tool for right job ◮ practical matters:
- 382. data science and design thinking ◮ knowing customer ◮ right tool for right job ◮ practical matters: ◮ munging
- 383. data science and design thinking ◮ knowing customer ◮ right tool for right job ◮ practical matters: ◮ munging ◮ data ops
- 384. data science and design thinking ◮ knowing customer ◮ right tool for right job ◮ practical matters: ◮ munging ◮ data ops ◮ ML in prod
- 385. Thanks! Thanks MLSS students for your great questions; please contact me @chrishwiggins or chris.wiggins@{nytimes,gmail}.com with any questions, comments, or suggestions!

No public clipboards found for this slide

Be the first to comment