Robustness under Independent Contamination Model
Joint UBC/SFU student seminar presentation by Mike Danilov, PhD student at UBC


Published in: Technology, Business
Transcript

  • 1. Robustness under Independent Contamination. Mike Danilov, November 21, 2009.
  • 2. Traditional robustness
        • Definition of contamination
        • Simple examples
        • Weighted representation
      Independent Contamination
        • The Idea
        • Why traditional robust estimates don't work
        • Naive approaches
        • Cell-weighting approach
  • 3. The Problem (aka Disclaimer) and Terminology
      Estimation of the mean vector $\mu$ and covariance matrix $\Sigma$ of a supposedly i.i.d. multivariate sample $x_1, \dots, x_n \in \mathbb{R}^p$.
      Data matrix:
      $$X = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix}$$
      Vectors $x_i \in \mathbb{R}^p$: data cases.
      Values $x_{ij} \in \mathbb{R}$: data values or cells.
  • 4. Types of error in Statistics
      1. Usual statistical error. Every observation is moderately affected:
         $$X_{\text{obs}} = X_{\text{mean}} + e, \quad e \sim N(0, \sigma^2),$$
         where the variance of $e$ defines the quality of the data.
      2. Contamination. Some observations are ruined:
         $$X_{\text{obs}} = \begin{cases} X_{\text{good}}, & \text{usually} \\ X_{\text{horrible}}, & \text{sometimes.} \end{cases}$$
         Typically comes on top of the usual error: $X_{\text{good}} = X_{\text{mean}} + e$.
  • 5. Mixture contamination model
      Observed data come from the mixture distribution
      $$F = (1 - \varepsilon) F_0(\theta) + \varepsilon H,$$
      where $F_0(\theta)$ is the distribution of interest and $H$ is an arbitrary unknown nuisance distribution.
      Equivalently,
      $$X = (1 - B) X_{\text{good}} + B X_{\text{horrible}},$$
      where $B$ is a Bernoulli($\varepsilon$) indicator.
      Estimate $T(F)$: feed data from $F$, obtain estimates for $\theta$.
      Breakdown point:
      $$\varepsilon_{BP}(T) = \sup_{\varepsilon} \left\{ \sup_{H} T(F(\theta, \varepsilon, H)) < \infty \right\},$$
      that is, the maximum $\varepsilon$ such that $T$ can still isolate $F_0$ from $H$.
      Maximum achievable (and desirable): $\varepsilon_{BP}(T) \le 0.5$.
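The mixture model on this slide can be simulated in a few lines. The sketch below is only an illustration under assumptions that are not in the slides: $F_0$ is taken to be $N(0, 1)$, $H$ is a point mass at 1000, and the helper name `sample_mixture` is hypothetical. It compares the sample mean, which breaks down, with the median, which does not.

```python
import random
import statistics

def sample_mixture(n, eps, rng, horrible=1000.0):
    """Draw n values from (1 - eps) * N(0, 1) + eps * H.

    Assumption for illustration: H is a point mass at `horrible`.
    """
    data = []
    for _ in range(n):
        b = rng.random() < eps          # B ~ Bernoulli(eps)
        x_good = rng.gauss(0.0, 1.0)    # draw from F0 = N(0, 1)
        data.append(horrible if b else x_good)
    return data

rng = random.Random(0)
x = sample_mixture(n=1000, eps=0.2, rng=rng)
mean_x = statistics.fmean(x)      # pulled far away from 0 by the contamination
median_x = statistics.median(x)   # stays near the centre of F0
print(f"mean = {mean_x:.2f}, median = {median_x:.2f}")
```

With 20% contamination the median is shifted slightly but remains bounded, which is the behaviour the breakdown point is meant to capture.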
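  • 6. Examples: simple robust estimates
      Location
        • Median: $x_{(n/2)}$
        • Trimmed mean: $\dfrac{1}{n(1-\delta)} \sum_{i = n\delta/2}^{n(1-\delta/2)} x_{(i)}$, with $\delta \in (0, 1)$.
      Scale
        • MAD: $\operatorname{Median}_i \left| x_i - \operatorname{Median}_j x_j \right|$
        • IQR: $x_{(3n/4)} - x_{(n/4)}$
      Regression
        • LMS: $\arg\min_{\beta} \operatorname{Median}_i \, (y_i - \beta^\top x_i)^2$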
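The location and scale estimates listed on this slide are easy to write down with the Python standard library. This is a minimal sketch: the function names and the toy data are made up for illustration, and `statistics.quantiles` interpolates between order statistics, so the IQR it gives can differ slightly from the plain order-statistic formula on the slide.

```python
import statistics

def trimmed_mean(xs, delta):
    """Mean after dropping a fraction delta/2 of the points from each tail."""
    s = sorted(xs)
    k = int(len(s) * delta / 2)      # number of points trimmed per tail
    kept = s[k:len(s) - k]
    return sum(kept) / len(kept)

def mad(xs):
    """Median absolute deviation from the median (no consistency factor)."""
    med = statistics.median(xs)
    return statistics.median([abs(x - med) for x in xs])

def iqr(xs):
    """Interquartile range: upper quartile minus lower quartile."""
    q1, _, q3 = statistics.quantiles(xs, n=4)
    return q3 - q1

# Nine ordinary values and one gross outlier.
data = [2.1, 1.9, 2.0, 2.2, 1.8, 2.0, 2.1, 1.9, 2.0, 50.0]
print("mean        ", statistics.fmean(data))
print("median      ", statistics.median(data))
print("trimmed mean", trimmed_mean(data, 0.2))
print("MAD         ", mad(data))
print("IQR         ", iqr(data))
```

The single outlier drags the mean well away from 2, while the median, trimmed mean, MAD and IQR barely notice it.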
  • 7. Examples: multivariate robust estimates
      Minimum Covariance Determinant (MCD) by Rousseeuw (1985): minimize the determinant of the sample covariance of 50% of the data points.
      [Figure: plot with legend entries Sample Covariance, MCD, and Clean; vertical axis from −6 to 6]
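For a very small data set the MCD definition can be applied literally by trying every subset. The sketch below does exactly that; it is an exhaustive search for illustration only (practical implementations use approximate algorithms), and the function name `mcd_bruteforce`, the toy data and the subset size `h=7` are assumptions, not taken from the slides.

```python
import itertools
import numpy as np

def mcd_bruteforce(X, h):
    """Find the h-subset whose sample covariance has the smallest determinant."""
    best_det, best_idx = np.inf, None
    for idx in itertools.combinations(range(len(X)), h):
        S = np.cov(X[list(idx)], rowvar=False)
        d = np.linalg.det(S)
        if d < best_det:
            best_det, best_idx = d, idx
    sub = X[list(best_idx)]
    return sub.mean(axis=0), np.cov(sub, rowvar=False), best_idx

rng = np.random.default_rng(3)
X = rng.normal(size=(12, 2))
X[:3] += 20.0                      # three gross outlying cases
mu_mcd, Sigma_mcd, idx = mcd_bruteforce(X, h=7)   # roughly half of the 12 points
print("sample mean:", X.mean(axis=0))
print("MCD mean:   ", mu_mcd)
print("subset used:", idx)
```

Any subset that includes an outlying case has a strongly inflated covariance, so the minimizing subset is made up of clean cases only.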
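  • 8. Weighted representation
      Many robust estimates can be represented as weighted versions of familiar estimates:
      $$\hat\mu = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}, \qquad \hat\Sigma = \frac{\sum_{i=1}^{n} w_i (x_i - \hat\mu)(x_i - \hat\mu)^\top}{\sum_{i=1}^{n} w_i},$$
      with weights depending on the estimates themselves,
      $$w_i = w(\operatorname{MD}(x_i; \hat\mu, \hat\Sigma)),$$
      where the Mahalanobis distances are given by
      $$\operatorname{MD}(x_i; \hat\mu, \hat\Sigma) = (x_i - \hat\mu)^\top \hat\Sigma^{-1} (x_i - \hat\mu).$$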
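Because the weights depend on the estimates and the estimates depend on the weights, the representation above is naturally computed by iteration. The following is a rough sketch under assumptions of my own choosing, none of which come from the slides: a hard-rejection weight function (weight 1 below a cutoff, 0 above), a starting point built from coordinatewise medians and MADs (scaled by the usual normal consistency factor 1.4826), and a cutoff of 9.21, the 0.99 quantile of the chi-square distribution with 2 degrees of freedom.

```python
import numpy as np

def mahalanobis_sq(X, mu, Sigma):
    """(x - mu)^T Sigma^{-1} (x - mu) for every row x of X."""
    D = X - mu
    return np.einsum("ij,ij->i", D @ np.linalg.inv(Sigma), D)

def reweighted_estimate(X, cutoff, n_iter=20):
    """Iterate the weighted mean/covariance equations with 0/1 weights."""
    # Assumed starting values: coordinatewise median and MAD-based scales.
    mu = np.median(X, axis=0)
    scale = 1.4826 * np.median(np.abs(X - mu), axis=0)
    Sigma = np.diag(scale ** 2)
    for _ in range(n_iter):
        w = (mahalanobis_sq(X, mu, Sigma) <= cutoff).astype(float)
        mu = (w[:, None] * X).sum(axis=0) / w.sum()
        D = X - mu
        Sigma = (w[:, None] * D).T @ D / w.sum()
    return mu, Sigma, w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
X[:20] = [10.0, 10.0]              # 10% of the cases are contaminated
mu_hat, Sigma_hat, w = reweighted_estimate(X, cutoff=9.21)
print("sample mean:    ", X.mean(axis=0))
print("reweighted mean:", mu_hat)
```

Hard rejection slightly shrinks the covariance estimate on clean data; smoother weight functions or a consistency correction are the usual remedies, and are left out here to keep the sketch short.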
  • 9. Contaminated cells not cases
      [Figure: two panels, Traditional Contamination and Independent Contamination, with $\varepsilon = 10\%$]
  • 10. Generalized Contamination
      Data entry errors, hardware malfunction, etc.
      Can express as
      $$X_j = (1 - B_j)(X_{\text{Good}})_j + B_j (X_{\text{Horrible}})_j, \quad j = 1, \dots, p,$$
      or, in matrix form, as
      $$X = (1 - B) X^{\text{Good}} + B X^{\text{Horrible}},$$
      where $B$ is a vector of Bernoulli r.v.'s.
        • The dependence structure of $B$ is important.
        • Will assume Independent Contamination: all $B_j$ are independent of each other and of the $X$'s.
        • Also: $P[B_j = 1] = \varepsilon$ for simplicity.
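The independent contamination model is straightforward to simulate cell by cell. In this sketch the clean distribution (standard normal) and the contaminating value (a constant 100) are arbitrary choices made for illustration; only the independent Bernoulli($\varepsilon$) indicators come from the slide.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, eps = 10_000, 20, 0.05

X_good = rng.normal(size=(n, p))           # assumed clean distribution
X_horrible = np.full((n, p), 100.0)        # assumed contaminating values
B = rng.random((n, p)) < eps               # independent Bernoulli(eps) per cell
X = np.where(B, X_horrible, X_good)        # X_j = (1 - B_j) good_j + B_j horrible_j

cell_rate = B.mean()                       # fraction of contaminated cells
clean_case_rate = (~B.any(axis=1)).mean()  # fraction of cases with no bad cell
print(f"contaminated cells: {cell_rate:.3f}")
print(f"fully clean cases:  {clean_case_rate:.3f}")
```

Although only about 5% of the cells are contaminated, a much smaller share of the cases survives untouched, which is the point taken up on the next slide.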
  • 11. Number of clean cases
      Each contaminated case will appear as an outlier if diagnosed with MD's.
      $$P[\text{case is clean}] = (1 - \varepsilon)^p$$
        • E.g. with $\varepsilon = 0.05$ and $p = 20$, $(0.95)^{20} \approx 0.36$, so only about 36% of the cases are clean.
        • Waste of data.
        • Exceeds the breakdown point of traditional robust estimates.
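The arithmetic behind this slide can be checked directly. The two helper functions below are hypothetical names introduced for illustration; the second one finds the smallest dimension at which fewer than half of the cases are expected to be clean, that is, where the share of contaminated cases passes the 0.5 breakdown bound.

```python
import math

def clean_case_fraction(eps, p):
    """P[case is clean] = (1 - eps)^p under independent contamination."""
    return (1.0 - eps) ** p

def min_dim_exceeding_half(eps):
    """Smallest p for which fewer than half of the cases are expected to be clean."""
    return math.floor(math.log(0.5) / math.log(1.0 - eps)) + 1

for p in (5, 10, 20, 50):
    print(p, round(clean_case_fraction(0.05, p), 3))
print("first p with < 50% clean cases:", min_dim_exceeding_half(0.05))
```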
  • 12. Affine-equivariance
      Definition: if the data set is $Y = A + XB$, then
      $$\hat\mu(Y) = A + B^\top \hat\mu(X), \qquad \hat\Sigma(Y) = B^\top \hat\Sigma(X) B.$$
        • Desirable: easy to study, etc. If you know how the estimate behaves on $X$, then you know how it behaves on $Y$, and vice versa.
        • Most "respectable" robust estimates are affine-equivariant (A-E).
        • Alqallaf et al (2009) have a proof that reasonable A-E estimates cannot be robust against independent contamination (IC).
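The definition can be verified numerically for the classical estimates. The sketch below checks that the sample mean and sample covariance are affine-equivariant, and that the coordinatewise median is not. The skewed (exponential) toy data and the random matrix `B` are assumptions made for the demonstration, with each row of `X` holding one case.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.exponential(size=(500, 3))     # skewed toy data, one case per row
A = np.array([1.0, -2.0, 0.5])         # shift, added to every row
B = rng.normal(size=(3, 3))            # linear transformation
Y = A + X @ B

# Sample mean and covariance transform exactly as the definition requires.
mean_ok = np.allclose(Y.mean(axis=0), A + B.T @ X.mean(axis=0))
cov_ok = np.allclose(np.cov(Y, rowvar=False), B.T @ np.cov(X, rowvar=False) @ B)

# The coordinatewise median does not: it depends on the coordinate system.
median_ok = np.allclose(np.median(Y, axis=0), A + B.T @ np.median(X, axis=0))

print("mean equivariant:  ", mean_ok)
print("cov equivariant:   ", cov_ok)
print("median equivariant:", median_ok)
```

Coordinatewise procedures like the median give up affine-equivariance, which is the kind of trade-off the result cited on this slide points to.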
  • 13. Affine Transformation of Contaminated Data
      [Figure: two panels, Original Contaminated and Transformed, linked by $X \to Y = XB$]
  • 14. Pairwise approach
      $P[\text{pair of variables is clean}] = (1 - \varepsilon)^2 > (1 - \varepsilon)^p$ for $p > 2$.
        • Estimate all elements $\hat\Sigma_{ab}$, for $a, b = 1, \dots, p$, separately.
        • Problem: the multivariate structure is damaged or destroyed.
        • Particular problem: the result may not be positive-definite. This may or may not matter, but usually it does.
        • Studied to some extent by Alqallaf (2003, PhD thesis).
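The loss of positive-definiteness is easy to reproduce. In the sketch below each $\hat\Sigma_{ab}$ is computed only from the cases in which both cells $a$ and $b$ are available, with discarded cells marked as NaN. The plain (non-robust) covariance used for each pair and the tiny hand-made data set are assumptions chosen so that the effect is exact; the slides do not specify a bivariate estimator.

```python
import numpy as np

def pairwise_cov(X):
    """Estimate each Sigma_ab from the cases where cells a and b are both present."""
    p = X.shape[1]
    S = np.empty((p, p))
    for a in range(p):
        for b in range(a, p):
            ok = ~np.isnan(X[:, a]) & ~np.isnan(X[:, b])
            xa, xb = X[ok, a], X[ok, b]
            S[a, b] = S[b, a] = np.mean((xa - xa.mean()) * (xb - xb.mean()))
    return S

nan = np.nan
X = np.array([
    [-1.0, -1.0,  nan],   # a and b move together
    [ 1.0,  1.0,  nan],
    [ nan, -1.0, -1.0],   # b and c move together
    [ nan,  1.0,  1.0],
    [-1.0,  nan,  1.0],   # a and c move in opposite directions
    [ 1.0,  nan, -1.0],
])
S = pairwise_cov(X)
min_eig = np.linalg.eigvalsh(S).min()
print(S)
print("smallest eigenvalue:", min_eig)
```

Each entry is a perfectly valid estimate on its own, but the three pairwise relationships cannot hold at once, so the assembled matrix has a negative eigenvalue.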
  • 15. Detecting cells
        • Some are obvious: univariate outliers.
        • Some only show up with respect to other cells: structural outliers.
        • Van Aelst et al (2009) use Stahel-Donoho projections.
        • Little and Smith (1987) used partial Mahalanobis distances: if $\operatorname{MD}(x; \hat\mu, \hat\Sigma)$ is large, consider $\operatorname{MD}(x_{-j}; \hat\mu, \hat\Sigma)$ for all $j = 1, \dots, p$.
        • Mike explores the MD-approach and iterative estimation of covariances in his thesis.
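One plausible reading of the partial-distance idea is sketched below: drop coordinate $j$ from the case and from the estimates, recompute the distance, and suspect the cell whose removal makes the case look ordinary again. The exact procedure of Little and Smith (1987) is not given in the slides, so the selection rule (smallest partial distance), the function names and the example values are all assumptions.

```python
import numpy as np

def md_sq(x, mu, Sigma):
    """(x - mu)^T Sigma^{-1} (x - mu)."""
    d = x - mu
    return float(d @ np.linalg.solve(Sigma, d))

def partial_md_sq(x, mu, Sigma):
    """Distance of x with coordinate j removed, for each j."""
    p = len(x)
    out = np.empty(p)
    for j in range(p):
        keep = np.arange(p) != j
        out[j] = md_sq(x[keep], mu[keep], Sigma[np.ix_(keep, keep)])
    return out

mu = np.zeros(3)
Sigma = np.array([[1.0, 0.8, 0.8],
                  [0.8, 1.0, 0.8],
                  [0.8, 0.8, 1.0]])
# Every value is unremarkable on its own; the third cell breaks the correlation
# structure, which makes it a structural outlier.
x = np.array([1.0, 1.0, -1.0])

full = md_sq(x, mu, Sigma)
partial = partial_md_sq(x, mu, Sigma)
suspect = int(np.argmin(partial))
print("full distance:    ", full)
print("partial distances:", partial)
print("suspect cell:     ", suspect)
```

Removing the third coordinate brings the distance down sharply, while removing either of the other two leaves it large.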
  • 16. Weighted estimate with cell weights
        • Van Aelst et al (2009) proposed a weighted estimate, but it is pairwise and not SPD (symmetric positive-definite).
        • Mike knows how to deal with zero weights: remove the values and treat them as MCAR (missing completely at random). Then do MLE via EM, for example.
        • A proper cell-weighted estimate is still to be developed.
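The zero-weight recipe (delete the flagged cells, treat them as MCAR, fit by EM) can be sketched with the textbook EM algorithm for a multivariate normal with missing values. This is a generic illustration and not the author's implementation: the function name `em_mvn`, the starting values, the fixed number of iterations and the simulated data are all assumptions, and every case is assumed to keep at least one observed cell.

```python
import numpy as np

def em_mvn(X, n_iter=50):
    """EM estimate of mean and covariance from data with NaNs (assumed MCAR)."""
    n, p = X.shape
    mu = np.nanmean(X, axis=0)
    Sigma = np.diag(np.nanvar(X, axis=0))
    for _ in range(n_iter):
        sum_x = np.zeros(p)
        sum_xx = np.zeros((p, p))
        for x in X:
            m = np.isnan(x)              # missing coordinates
            o = ~m                       # observed coordinates
            x_hat = x.copy()
            C = np.zeros((p, p))         # conditional covariance of the missing part
            if m.any():
                S_oo = Sigma[np.ix_(o, o)]
                S_mo = Sigma[np.ix_(m, o)]
                K = S_mo @ np.linalg.inv(S_oo)
                x_hat[m] = mu[m] + K @ (x[o] - mu[o])      # E-step: conditional mean
                C[np.ix_(m, m)] = Sigma[np.ix_(m, m)] - K @ S_mo.T
            sum_x += x_hat
            sum_xx += np.outer(x_hat, x_hat) + C
        mu = sum_x / n                                      # M-step
        Sigma = sum_xx / n - np.outer(mu, mu)
    return mu, Sigma

rng = np.random.default_rng(4)
true_mu = np.array([1.0, 2.0, 3.0])
true_Sigma = np.array([[1.0, 0.5, 0.5],
                       [0.5, 1.0, 0.5],
                       [0.5, 0.5, 1.0]])
X_full = rng.multivariate_normal(true_mu, true_Sigma, size=300)
X_miss = X_full.copy()
mask = rng.random(X_miss.shape) < 0.1      # cells given zero weight
mask[mask.all(axis=1)] = False             # keep at least one cell per case
X_miss[mask] = np.nan
mu_em, Sigma_em = em_mvn(X_miss)
print("EM mean:", mu_em)
print("EM covariance:\n", Sigma_em)
```

Unlike the pairwise approach, this likelihood-based estimate uses all remaining cells jointly, and in this example the resulting covariance matrix is positive-definite.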
  • 17. The End
