Let's Practice What We Preach: Likelihood Methods for Monte Carlo Data

Xiao-Li Meng's slides for his talks at Columbia, Sept. 2011, and ICERM, Nov. 2012

  1. Let's Practice What We Preach: Likelihood Methods for Monte Carlo Data
     Xiao-Li Meng, Department of Statistics, Harvard University
     September 24, 2011
     Based on: Kong, McCullagh, Meng, Nicolae, and Tan (2003, JRSS-B, with discussions); Kong, McCullagh, Meng, and Nicolae (2006, Doksum Festschrift); Tan (2004, JASA); ...; Meng and Tan (201X).
  2. Importance sampling (IS)
     Estimand: $c_1 = \int_\Gamma q_1(x)\,\mu(dx) = \int_\Gamma \frac{q_1(x)}{p_2(x)}\,p_2(x)\,\mu(dx)$.
     Data: $\{X_{i2},\ i = 1, \dots, n_2\} \sim p_2 = q_2/c_2$.
     Estimating Equation (EE): $r \equiv \dfrac{c_1}{c_2} = E_2\!\left[\dfrac{q_1(X)}{q_2(X)}\right]$.
     The EE estimator: $\hat r = \dfrac{1}{n_2}\sum_{i=1}^{n_2} \dfrac{q_1(X_{i2})}{q_2(X_{i2})}$.
     This is the standard IS estimator for $c_1$ when $c_2 = 1$.
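A minimal numerical sketch of the EE estimator above (not from the slides; the densities q1 and q2 below are illustrative assumptions): draw from p2 and average the ratio q1/q2.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Illustrative choices (assumptions, not from the slides):
# q1 is an unnormalized N(0, 2^2) kernel, so c1 = 2*sqrt(2*pi);
# q2 is the standard normal density, already normalized, so c2 = 1 and r = c1.
q1 = lambda x: np.exp(-x**2 / (2 * 2.0**2))
q2 = lambda x: stats.norm.pdf(x)

n2 = 10_000
x2 = rng.standard_normal(n2)              # draws from p2 = q2 / c2

r_hat = np.mean(q1(x2) / q2(x2))          # the EE (standard IS) estimator of r = c1/c2
print(r_hat, 2 * np.sqrt(2 * np.pi))      # estimate vs. the known value of c1
```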
  3. What about MLE?
     The "likelihood" is $f(X_{12}, \dots, X_{n_2 2}) = \prod_{i=1}^{n_2} p_2(X_{i2})$ — free of the estimand $c_1$!
     So why are $\{X_{i2},\ i = 1, \dots, n_2\}$ even relevant? A violation of the likelihood principle?
     What are we "inferring"? What is the "unknown" model parameter?
  4. Bridge sampling (BS)
     Data: $\{X_{ij},\ i = 1, \dots, n_j\} \sim p_j = q_j/c_j$, $j = 1, 2$.
     Estimating Equation (Meng and Wong, 1996):
     $r \equiv \dfrac{c_1}{c_2} = \dfrac{E_2[\alpha(X)\, q_1(X)]}{E_1[\alpha(X)\, q_2(X)]}$, for all $\alpha$ with $0 < \left|\int \alpha\, q_1 q_2\, d\mu\right| < \infty$.
     Optimal choice: $\alpha_O(x) \propto [\,n_1 q_1(x) + n_2\, r\, q_2(x)\,]^{-1}$.
     Optimal estimator $\hat r_O$, the limit of
     $\hat r_O^{(t+1)} = \dfrac{\frac{1}{n_2}\sum_{i=1}^{n_2} \frac{q_1(X_{i2})}{s_1 q_1(X_{i2}) + s_2\, \hat r_O^{(t)} q_2(X_{i2})}}{\frac{1}{n_1}\sum_{i=1}^{n_1} \frac{q_2(X_{i1})}{s_1 q_1(X_{i1}) + s_2\, \hat r_O^{(t)} q_2(X_{i1})}}$, with $s_j = n_j/(n_1 + n_2)$.
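A sketch of the fixed-point iteration for the optimal bridge sampling estimator above, taking $s_j = n_j/(n_1+n_2)$; the two unnormalized densities are illustrative assumptions with known ratio $r = 0.5$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative unnormalized densities (assumptions): kernels of N(1,1) and N(0,4).
q1 = lambda x: np.exp(-(x - 1.0)**2 / 2)   # c1 = sqrt(2*pi)
q2 = lambda x: np.exp(-x**2 / 8)           # c2 = 2*sqrt(2*pi), so r = c1/c2 = 0.5

n1, n2 = 5_000, 5_000
x1 = rng.normal(1.0, 1.0, n1)              # draws from p1 = q1/c1
x2 = rng.normal(0.0, 2.0, n2)              # draws from p2 = q2/c2
s1, s2 = n1 / (n1 + n2), n2 / (n1 + n2)

r = 1.0                                    # starting value; iterate the displayed recursion
for _ in range(100):
    num = np.mean(q1(x2) / (s1 * q1(x2) + s2 * r * q2(x2)))
    den = np.mean(q2(x1) / (s1 * q1(x1) + s2 * r * q2(x1)))
    r = num / den
print(r)                                   # should settle near the true ratio 0.5
```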
  5. What about MLE?
     The "likelihood" is $\prod_{j=1}^{2}\prod_{i=1}^{n_j} \dfrac{q_j(X_{ij})}{c_j} \propto c_1^{-n_1} c_2^{-n_2}$ — free of data!
     What went wrong: $c_j$ is not a "free parameter", because $c_j = \int_\Gamma q_j(x)\,\mu(dx)$ and $q_j$ is known.
     So what is the "unknown" model parameter?
     It turns out that $\hat r_O$ is the same as Bennett's (1976) optimal acceptance-ratio estimator, as well as Geyer's (1994) reverse logistic regression estimator.
     So why is that? Can it be improved upon without any "sleight of hand"?
  6. Pretending the measure is unknown!
     Because $c = \int_\Gamma q(x)\,\mu(dx)$, and $q$ is known in the sense that we can evaluate it at any sample value, the only way to make $c$ "unknown" is to assume the underlying measure $\mu$ is "unknown".
     This is natural because Monte Carlo simulation means we use samples to represent, and thus estimate/infer, the underlying population $q(x)\mu(dx)$, and hence estimate/infer $\mu$, since $q$ is known.
     Monte Carlo integration is about finding a tractable discrete $\hat\mu$ to approximate the intractable $\mu$.
  7. Importance Sampling Likelihood
     Estimand: $c_1 = \int_\Gamma q_1(x)\,\mu(dx)$.
     Data: $\{X_{i2},\ i = 1, \dots, n_2\}$ i.i.d. $\sim c_2^{-1} q_2(x)\,\mu(dx)$.
     Likelihood for $\mu$: $L(\mu) = \prod_{i=1}^{n_2} c_2^{-1} q_2(X_{i2})\,\mu(X_{i2})$. Note that $c_2$ is a functional of $\mu$.
     The nonparametric MLE of $\mu$ is $\hat\mu(dx) = \hat P(dx)/q_2(x)$, where $\hat P$ is the empirical measure.
  8. Importance Sampling Likelihood
     Thus the MLE for $r \equiv c_1/c_2$ is $\hat r = \int q_1(x)\,\hat\mu(dx) = \dfrac{1}{n_2}\sum_{i=1}^{n_2} \dfrac{q_1(X_{i2})}{q_2(X_{i2})}$.
     When $c_2 = 1$ and $q_2 = p_2$, the standard IS estimator for $c_1$ is obtained.
     $\{X_{(i)2},\ i = 1, \dots, n_2\}$ is (minimal) sufficient for $\mu$ on $S_2 = \{x : q_2(x) > 0\}$, and hence $\hat c_1$ is guaranteed to be consistent only when $S_1 \subset S_2$.
  9. Bridge Sampling Likelihood
     Estimand: $c_j = \int_\Gamma q_j(x)\,\mu(dx)$, $j = 1, \dots, J$.
     Data: $\{X_{ij},\ 1 \le i \le n_j\} \sim c_j^{-1} q_j(x)\,\mu(dx)$, $1 \le j \le J$.
     Likelihood for $\mu$: $L(\mu) = \prod_{j=1}^{J}\prod_{i=1}^{n_j} c_j^{-1} q_j(X_{ij})\,\mu(X_{ij})$.
     Writing $\theta(x) = \log \mu(x)$, then
     $\log L(\mu) = n \int_\Gamma \theta(x)\, d\hat P - \sum_{j=1}^{J} n_j \log c_j(\theta)$,
     where $\hat P$ is the empirical measure on $\{X_{ij},\ 1 \le i \le n_j,\ 1 \le j \le J\}$.
  10. Bridge Sampling Likelihood
     The MLE for $\mu$ is given by equating the canonical sufficient statistic $\hat P$ to its expectation:
     $n \hat P(dx) = \sum_{j=1}^{J} n_j\, \hat c_j^{-1} q_j(x)\,\hat\mu(dx)$, i.e.
     $\hat\mu(dx) = \dfrac{n \hat P(dx)}{\sum_{j=1}^{J} n_j\, \hat c_j^{-1} q_j(x)}. \qquad (A)$
     Consequently, the MLE for $\{c_1, \dots, c_J\}$ must satisfy
     $\hat c_r = \int_\Gamma q_r(x)\, d\hat\mu = \sum_{j=1}^{J}\sum_{i=1}^{n_j} \dfrac{q_r(x_{ij})}{\sum_{s=1}^{J} n_s\, \hat c_s^{-1} q_s(x_{ij})}. \qquad (B)$
     (B) is the "dual" equation of (A), and is also the same as the equation for the optimal multiple bridge sampling estimator (Tan 2004).
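A sketch of solving the dual equation (B) by fixed-point iteration for J = 3 illustrative densities (my own choices, not from the slides). Since (B) determines the c_j only up to a common scale, the code reports the ratios c_r/c_1.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative setup (assumed): three unnormalized normal kernels on the real line.
means, sds = np.array([0.0, 1.0, 2.0]), np.array([1.0, 1.5, 0.5])
def q(j, x):
    return np.exp(-(x - means[j])**2 / (2 * sds[j]**2))
true_c = np.sqrt(2 * np.pi) * sds                 # true normalizing constants

n = np.array([4000, 4000, 4000])
x = np.concatenate([rng.normal(means[j], sds[j], n[j]) for j in range(3)])   # pooled draws
Q = np.stack([q(j, x) for j in range(3)])         # Q[r, k] = q_r at pooled point k

c = np.ones(3)                                    # any positive start; solution unique up to scale
for _ in range(200):
    denom = (n / c) @ Q                           # sum_s n_s c_s^{-1} q_s(x) at each pooled point
    c = Q @ (1.0 / denom)                         # equation (B)

print(c / c[0])                                   # estimated ratios c_r / c_1
print(true_c / true_c[0])                         # true ratios 1, 1.5, 0.5
```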
  11. But We Can Ignore Less ...
     Restrict the parameter space for $\mu$ by using some knowledge of the known $\mu$; that is, set up a sub-model.
     The new MLE has a smaller asymptotic variance under the sub-model than under the full model.
     Examples:
       Group-invariance sub-model
       Linear sub-model
       Log-linear sub-model
  12. A Universally Improved IS
     Estimand: $r = c_1/c_2$; $c_j = \int_{\mathbb{R}^d} q_j(x)\,\mu(dx)$.
     Data: $\{X_{i2},\ i = 1, \dots, n_2\}$ i.i.d. $\sim c_2^{-1} q_2(x)\,\mu(dx)$.
     Taking $G = \{I_d, -I_d\}$ leads to
     $\hat r_G = \dfrac{1}{n_2}\sum_{i=1}^{n_2} \dfrac{q_1(X_{i2}) + q_1(-X_{i2})}{q_2(X_{i2}) + q_2(-X_{i2})}.$
     Because of the Rao-Blackwellization, $V(\hat r_G) \le V(\hat r)$.
     It needs twice as many function evaluations, but typically this is a small insurance premium.
     Consider $S_1 = \mathbb{R}$ and $S_2 = \mathbb{R}^+$. Then $\hat r_G$ is consistent for $r$:
     $\hat r_G = \dfrac{1}{n_2}\sum_{i=1}^{n_2} \dfrac{q_1(X_{i2})}{q_2(X_{i2})} + \dfrac{1}{n_2}\sum_{i=1}^{n_2} \dfrac{q_1(-X_{i2})}{q_2(X_{i2})},$
     but the standard IS estimator $\hat r$ only estimates $\int_0^\infty q_1(x)\,\mu(dx)/c_2$.
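A sketch of $\hat r_G$ with $G = \{I_d, -I_d\}$ in the support-mismatch case just described ($S_1 = \mathbb{R}$, $S_2 = \mathbb{R}^+$); the particular densities are illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Illustrative densities (assumed, both normalized, so r = c1/c2 = 1):
# q1 is the N(0.5, 1) density on all of R; q2 is the Exp(1) density, supported only on R+.
q1 = lambda x: stats.norm.pdf(x, loc=0.5)
q2 = lambda x: np.where(x > 0, np.exp(-x), 0.0)

n2 = 50_000
x2 = rng.exponential(1.0, n2)                    # draws from p2 = q2

r_plain = np.mean(q1(x2) / q2(x2))               # standard IS: only sees the part of q1 on R+
r_group = np.mean((q1(x2) + q1(-x2)) / (q2(x2) + q2(-x2)))   # group-averaged estimator
print(r_plain, r_group)                          # about 0.69 (= int_0^inf q1 dx) vs. about 1.0
```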
  13. There are many more improvements ...
     Define a sub-model by requiring $\mu$ to be $G$-invariant, where $G$ is a finite group on $\Gamma$.
     The new MLE of $\mu$ is
     $\hat\mu_G(dx) = \dfrac{n \hat P_G(dx)}{\sum_{j=1}^{J} n_j\, \hat c_j^{-1}\, \bar q_j^{\,G}(x)},$
     where $\hat P_G(A) = \mathrm{ave}_{g \in G}\, \hat P(gA)$ and $\bar q_j^{\,G}(x) = \mathrm{ave}_{g \in G}\, q_j(gx)$.
     When the draws are i.i.d. within each $p_s\, d\mu$, $\hat\mu_G = E[\hat\mu \mid GX]$, i.e., the Rao-Blackwellization of $\hat\mu$ given the orbit.
     Consequently, $\hat c_j^{\,G} = \int_\Gamma q_j(x)\,\hat\mu_G(dx) = E[\hat c_j \mid GX]$.
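The same averaging idea, sketched for a larger finite group: the $2^d$ coordinate-wise sign flips in $\mathbb{R}^d$, under which Lebesgue measure is invariant. The densities are illustrative assumptions; each point now costs $|G|$ function evaluations, which is the trade-off the next slide describes.

```python
import numpy as np
from itertools import product
from scipy import stats

rng = np.random.default_rng(4)
d = 3
signs = np.array(list(product([1.0, -1.0], repeat=d)))   # the 2^d group elements

# Illustrative normalized densities (assumed), so the true ratio is r = c1/c2 = 1.
q1 = lambda x: stats.multivariate_normal.pdf(x, mean=np.full(d, 0.5), cov=np.eye(d))
q2 = lambda x: stats.multivariate_normal.pdf(x, mean=np.zeros(d), cov=2 * np.eye(d))

def group_average(q, x):
    """Average q over the orbit {gx : g in G}; |G| evaluations per point."""
    return np.mean([q(x * s) for s in signs], axis=0)

n2 = 20_000
x2 = rng.multivariate_normal(np.zeros(d), 2 * np.eye(d), n2)   # draws from p2

r_plain = np.mean(q1(x2) / q2(x2))
r_group = np.mean(group_average(q1, x2) / group_average(q2, x2))
print(r_plain, r_group)     # both estimate r = 1; the group-averaged version should vary less
```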
  14. Using Groups to model trade-off
     If $G_1$ is a subgroup of $G_2$, then $\mathrm{Var}(\hat c^{\,G_2}) \le \mathrm{Var}(\hat c^{\,G_1})$.
     The statistical efficiency increases with the size of $G$, but so does the computational cost of function evaluation (though not the cost of sampling, because no additional samples are involved).
  15. Linear sub-model: stratified sampling (Tan 2004)
     Data: $\{X_{ij},\ 1 \le i \le n_j\}$ i.i.d. $\sim p_j(x)\,\mu(dx)$, $1 \le j \le J$.
     The sub-model has parameter space
     $\left\{\mu : \int_\Gamma p_j(x)\,\mu(dx),\ 1 \le j \le J, \text{ are equal (to 1)}\right\}.$
     Likelihood for $\mu$: $L(\mu) = \prod_{j=1}^{J}\prod_{i=1}^{n_j} p_j(X_{ij})\,\mu(X_{ij})$.
     The MLE is
     $\hat\mu_{\mathrm{lin}}(dx) = \dfrac{\hat P(dx)}{\sum_{j=1}^{J} \hat\pi_j\, p_j(x)},$
     where the $\hat\pi_j$ are MLEs from a mixture model: the data are treated as i.i.d. $\sim \sum_{j=1}^{J} \pi_j\, p_j(\cdot)$ with the $\pi_j$ unknown.
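A sketch of the ingredient this sub-model MLE needs: the mixture-model MLE $\hat\pi_j$ of the design proportions, fit by a simple EM iteration that ignores the strata labels. The two sampling densities are illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# Illustrative normalized sampling densities p_1, p_2 (assumed):
p = [lambda x: stats.norm.pdf(x), lambda x: stats.cauchy.pdf(x)]

n_j = [300, 300]
x = np.concatenate([rng.standard_normal(n_j[0]), rng.standard_cauchy(n_j[1])])
P = np.stack([pj(x) for pj in p])            # P[j, i] = p_j(x_i); labels are not used below

pi_hat = np.full(len(p), 1.0 / len(p))       # EM iteration for the mixture-weight MLE
for _ in range(500):
    w = pi_hat[:, None] * P
    w /= w.sum(axis=0)                       # E-step: membership probabilities
    pi_hat = w.mean(axis=1)                  # M-step: updated weights

# Atoms of mu_lin: mass (1/n) / sum_j pi_hat_j p_j(x_i) at each pooled point x_i.
mu_lin_mass = (1.0 / len(x)) / (pi_hat @ P)
print(pi_hat, mu_lin_mass[:3])
```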
  16. So why MLE?
     Goal: to estimate $c = \int_\Gamma q(x)\,\mu(dx)$.
     For an arbitrary vector $b$, consider the control-variate estimator (Owen and Zhou 2000)
     $\hat c_b \equiv \sum_{j=1}^{J}\sum_{i=1}^{n_j} \dfrac{q(x_{ji}) - b^\top g(x_{ji})}{\sum_{s=1}^{J} n_s\, p_s(x_{ji})},$
     where $g = (p_2 - p_1, \dots, p_J - p_1)^\top$.
     A more general class: for $\sum_{j=1}^{J} \lambda_j(x) \equiv 1$ and $\sum_{j=1}^{J} \lambda_j(x)\, b_j(x) \equiv b$, consider (Veach and Guibas 1995 for $b_j \equiv 0$; Tan 2004)
     $\hat c_{\lambda, B} = \sum_{j=1}^{J} \dfrac{1}{n_j}\sum_{i=1}^{n_j} \lambda_j(x_{ji})\, \dfrac{q(x_{ji}) - b_j^\top(x_{ji})\, g(x_{ji})}{p_j(x_{ji})}.$
     Should $\hat c_{\lambda, B}$ be more efficient than $\hat c_b$? Could there be something even more efficient?
  17. Three estimators for $c = \int_\Gamma q(x)\,\mu(dx)$:
     IS: $\dfrac{1}{n}\sum_{i=1}^{n} \dfrac{q(x_i)}{\sum_{j=1}^{J} \pi_j\, p_j(x_i)}$, where the $\pi_j = n_j/n$ are the true proportions.
     Reg: $\dfrac{1}{n}\sum_{i=1}^{n} \dfrac{q(x_i) - \hat\beta^\top g(x_i)}{\sum_{j=1}^{J} \pi_j\, p_j(x_i)}$, where $\hat\beta$ is the estimated regression coefficient, ignoring stratification.
     Lik: $\dfrac{1}{n}\sum_{i=1}^{n} \dfrac{q(x_i)}{\sum_{j=1}^{J} \hat\pi_j\, p_j(x_i)}$, where the $\hat\pi_j$ are the estimated proportions, ignoring stratification.
     Which one is most efficient? Least efficient?
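A sketch computing the three estimators side by side on a small one-dimensional example (my own choices of q, p_1, p_2, not the deck's 10-dimensional study); here the true value is $c = \int e^{-|x|}\,dx = 2$, and $\hat\beta$ is taken to be the ordinary least-squares slope on the control variate.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)

q = lambda x: np.exp(-np.abs(x))                                    # integrand, true c = 2
p = [lambda x: stats.norm.pdf(x), lambda x: stats.cauchy.pdf(x)]    # normalized samplers (assumed)

n_j = [500, 500]
x = np.concatenate([rng.standard_normal(n_j[0]), rng.standard_cauchy(n_j[1])])
n = len(x)
P = np.stack([pj(x) for pj in p])
pi_true = np.array(n_j) / n

mix = pi_true @ P                         # sum_j pi_j p_j(x_i) with the true proportions
h = (P[1] - P[0]) / mix                   # control variate g(x)/mix(x), mean 0 under the design

est_is = np.mean(q(x) / mix)              # IS: true proportions, labels only

beta = np.polyfit(h, q(x) / mix, 1)[0]    # Reg: estimated regression coefficient
est_reg = np.mean(q(x) / mix - beta * h)

pi_hat = np.array([0.5, 0.5])             # Lik: mixture-weight MLE by EM, ignoring labels
for _ in range(500):
    w = pi_hat[:, None] * P
    w /= w.sum(axis=0)
    pi_hat = w.mean(axis=1)
est_lik = np.mean(q(x) / (pi_hat @ P))

print(est_is, est_reg, est_lik)           # all three should land near 2
```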
  18. Let's find out ...
     $\Gamma = \mathbb{R}^{10}$ and $\mu$ is Lebesgue measure.
     The integrand is
     $q(x) = 0.8 \prod_{j=1}^{10} \phi(x_j) + 0.2 \prod_{j=1}^{10} \psi(x_j; 4),$
     where $\phi(\cdot)$ is the standard normal density and $\psi(\cdot; 4)$ is the $t_4$ density.
     Two sampling designs: (i) $q_2(x)$ with $n$ draws, or (ii) $q_1(x)$ and $q_2(x)$ each with $n/2$ draws, where
     $q_1(x) = \prod_{j=1}^{10} \phi(x_j), \qquad q_2(x) = \prod_{j=1}^{10} \psi(x_j; 1).$
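A sketch of this simulation setup (the integrand q and the two designs); it only builds the data and the three ingredient functions, and does not attempt to reproduce the table on the next slide.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
d, n = 10, 500

def q(x):       # 0.8 * prod_j phi(x_j) + 0.2 * prod_j t4(x_j); integrates to 1
    return (0.8 * np.prod(stats.norm.pdf(x), axis=-1)
            + 0.2 * np.prod(stats.t.pdf(x, df=4), axis=-1))

def q1(x):      # product of standard normal densities
    return np.prod(stats.norm.pdf(x), axis=-1)

def q2(x):      # product of t1 (Cauchy) densities
    return np.prod(stats.t.pdf(x, df=1), axis=-1)

x_design_i = rng.standard_cauchy((n, d))                      # (i) n draws from q2 alone
x_design_ii = np.vstack([rng.standard_normal((n // 2, d)),    # (ii) n/2 draws from q1 ...
                         rng.standard_cauchy((n // 2, d))])   #      ... and n/2 from q2

# The IS/Reg/Lik estimators of c = int q(x) dx (true value 1) can then be formed as in
# the sketch after slide 17, using p_1 = q1 and p_2 = q2.
print(q(x_design_i[:2]), q1(x_design_ii[:2]), q2(x_design_ii[:2]))
```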
  19. A little surprise?
     Table: Comparison of design and estimator

                       one sampler                 two samplers
                  IS      Reg      Lik        IS       Reg      Lik
     Sqrt MSE    .162   .00942   .00931     .0175    .00881   .00881
     Std Err     .162   .00919   .00920     .0174    .00885   .00884

     Note: Sqrt MSE is the square root of the mean squared error of the point estimates, and Std Err is the mean of the square-root variance estimates, from 10000 repeated simulations of size n = 500.
  20. Comparison of efficiency:
     Statistical efficiency: IS < Reg ≈ Lik.
     IS is a stratified estimator, which uses only the labels.
     Reg is the conventional method of control variates.
     Lik is the constrained MLE, which uses the $p_j$'s but ignores the labels; it is exact if $q = p_j$ for any particular $j$.
  21. Building intuition ...
     Suppose we make $n = 2$ draws, one from $N(0, 1)$ and one from Cauchy$(0, 1)$, hence $\pi_1 = \pi_2 = 50\%$.
     Suppose the draws are $\{1, 1\}$: what would be the MLE $(\hat\pi_1, \hat\pi_2)$?
     Suppose the draws are $\{1, 3\}$: what would be the MLE $(\hat\pi_1, \hat\pi_2)$?
     Suppose the draws are $\{3, 3\}$: what would be the MLE $(\hat\pi_1, \hat\pi_2)$?
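A quick numerical check of the three intuition questions: maximize the two-point mixture log-likelihood over $\pi_1$ (with $\pi_2 = 1 - \pi_1$) for each pair of draws.

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize_scalar

def pi1_mle(draws):
    """MLE of pi_1 in the mixture pi_1 * N(0,1) + (1 - pi_1) * Cauchy(0,1), labels ignored."""
    def neg_loglik(p):
        dens = p * stats.norm.pdf(draws) + (1 - p) * stats.cauchy.pdf(draws)
        return -np.sum(np.log(dens))
    return minimize_scalar(neg_loglik, bounds=(0.0, 1.0), method="bounded").x

for draws in ([1.0, 1.0], [1.0, 3.0], [3.0, 3.0]):
    print(draws, round(pi1_mle(np.array(draws)), 3))
```

Note that the maximizing $\hat\pi_1$ can land on the boundary of $[0, 1]$ even though the true design proportions are $(0.5, 0.5)$.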
  22. What Did I Learn?
     Model what we ignore, not what we know!
     Model comparison/selection is not about which model is true (as all of them are "true"), but about which model represents a better compromise among human, computational, and statistical efficiency.
     There is a cure for our "schizophrenia" — we now can analyze Monte Carlo data using the same sound statistical principles and methods used for analyzing real data.
  23. If you are looking for theoretical research topics ...
     RE-EXAMINE OLD ONES AND DERIVE NEW ONES!
     Prove it is the MLE, or a good approximation to the MLE.
     Or derive the MLE, or a cost-effective approximation to it.
     Markov chain Monte Carlo (Tan 2006, 2008)
     More ...
