
# Robustness under Independent Contamination Model



Joint UBC/SFU student seminar presentation by Mike Danilov, PhD student at UBC



1. Robustness under Independent Contamination. Mike Danilov, November 21, 2009.
2. Outline
   - Traditional robustness: definition of contamination, simple examples, weighted representation
   - Independent Contamination: the idea, why traditional robust estimates don't work, naive approaches, cell-weighting approach
3. The Problem (aka Disclaimer) and Terminology
   - Estimation of the mean vector $\mu$ and covariance matrix $\Sigma$ of a supposedly i.i.d. multivariate sample $x_1, \dots, x_n \in \mathbb{R}^p$.
   - Data matrix:

     $$X = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix}$$

   - Vectors $x_i \in \mathbb{R}^p$ are the data cases.
   - Values $x_{ij} \in \mathbb{R}$ are the data values, or cells.
4. Types of error in Statistics
   1. Usual statistical error. Every observation is moderately affected:

      $$X_{\text{obs}} = X_{\text{mean}} + e, \quad e \sim N(0, \sigma^2),$$

      where the variance of $e$ defines the quality of the data.
   2. Contamination. Some observations are ruined:

      $$X_{\text{obs}} = \begin{cases} X_{\text{good}}, & \text{usually} \\ X_{\text{horrible}}, & \text{sometimes.} \end{cases}$$

      It typically comes on top of the usual error: $X_{\text{good}} = X_{\text{mean}} + e$.
5. Mixture contamination model
   - Observed data come from the mixture distribution

     $$F = (1 - \varepsilon) F_0(\theta) + \varepsilon H,$$

     where $F_0(\theta)$ is the distribution of interest and $H$ is an arbitrary, unknown nuisance distribution.
   - Equivalently, $X = (1 - B) X_{\text{good}} + B X_{\text{horrible}}$, where $B$ is a Bernoulli($\varepsilon$) indicator.
   - Estimate $T(F)$: feed in data from $F$, obtain estimates for $\theta$.
   - Breakdown point:

     $$\varepsilon_{BP}(T) = \sup \Bigl\{ \varepsilon : \sup_{H} T(F(\theta, \varepsilon, H)) < \infty \Bigr\},$$

     that is, the maximum $\varepsilon$ such that $T$ can still isolate $F_0$ from $H$.
   - The maximum achievable (and desirable) breakdown point is 0.5: $\varepsilon_{BP}(T) \le 0.5$.
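The Bernoulli form of the mixture model can be simulated directly. The sketch below is only an illustration: the choices $F_0 = N(0, 1)$, $H$ = a point mass at 100, and the helper name `sample_mixture` are assumptions made here, not part of the slides.

```python
import random
import statistics

def sample_mixture(n, eps, draw_good, draw_horrible, rng):
    """Draw n values of X = (1 - B) * X_good + B * X_horrible, B ~ Bernoulli(eps)."""
    sample = []
    for _ in range(n):
        b = 1 if rng.random() < eps else 0          # contamination indicator B
        sample.append((1 - b) * draw_good(rng) + b * draw_horrible(rng))
    return sample

rng = random.Random(0)
# Assumed for illustration: F0 = N(0, 1), H = point mass at 100, eps = 10%.
data = sample_mixture(1000, 0.1,
                      lambda r: r.gauss(0.0, 1.0),
                      lambda r: 100.0,
                      rng)

mean_est = statistics.mean(data)      # dragged towards the contamination
median_est = statistics.median(data)  # stays near the centre of F0
print(mean_est, median_est)
```

With 10% contamination the sample mean is pulled far from 0 while the median moves only slightly, which is the behaviour the breakdown point is meant to quantify.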
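6. Examples: simple robust estimates
   - Location
     - Median: $x_{(n/2)}$
     - Trimmed mean: $\frac{1}{n(1-\delta)} \sum_{i = n\delta/2}^{n(1 - \delta/2)} x_{(i)}$, with $\delta \in (0, 1)$
   - Scale
     - MAD: $\operatorname{Median}_i \, \lvert x_i - \operatorname{Median}_j \, x_j \rvert$
     - IQR: $x_{(3n/4)} - x_{(n/4)}$
   - Regression
     - LMS: $\arg\min_{\beta} \operatorname{Median}_i \, (y_i - \beta^\top x_i)^2$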
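The location and scale estimates on this slide are easy to sketch with the standard library. The rounding conventions for the order-statistic indices (rounding down, 1-based ranks) and the function names are assumptions made here; other conventions are equally common.

```python
import statistics

def trimmed_mean(xs, delta):
    """Mean after dropping a fraction delta/2 of the sorted values from each tail."""
    s = sorted(xs)
    k = int(len(s) * delta / 2)        # values trimmed per tail, rounded down
    kept = s[k:len(s) - k]
    return sum(kept) / len(kept)

def mad(xs):
    """Median absolute deviation from the median (no consistency factor)."""
    m = statistics.median(xs)
    return statistics.median([abs(x - m) for x in xs])

def iqr(xs):
    """x_(3n/4) - x_(n/4) using 1-based ranks rounded down; assumes len(xs) >= 4."""
    s = sorted(xs)
    n = len(s)
    return s[(3 * n) // 4 - 1] - s[n // 4 - 1]

xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]   # one gross outlier
print(statistics.mean(xs))       # 104.5, ruined by the outlier
print(statistics.median(xs))     # 5.5
print(trimmed_mean(xs, 0.2))     # 5.5
print(mad(xs), iqr(xs))          # 2.5 5
```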
7. Examples: multivariate robust estimates
   - Minimum Covariance Determinant (MCD) by Rousseeuw (1985): minimize the determinant of the sample covariance of 50% of the data points.
   - [Figure: plot with legend entries Sample Covariance, MCD, and Clean.]
8. Weighted representation
   - Many robust estimates can be represented as weighted versions of familiar estimates:

     $$\hat\mu = \frac{\sum_{i=1}^n w_i x_i}{\sum_{i=1}^n w_i}, \qquad \hat\Sigma = \frac{\sum_{i=1}^n w_i (x_i - \hat\mu)(x_i - \hat\mu)^\top}{\sum_{i=1}^n w_i},$$

   - with weights depending on the estimates themselves, $w_i = w(\mathrm{MD}(x_i; \hat\mu, \hat\Sigma))$, where the Mahalanobis distances are given by

     $$\mathrm{MD}(x_i; \hat\mu, \hat\Sigma) = (x_i - \hat\mu)^\top \hat\Sigma^{-1} (x_i - \hat\mu).$$
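Because the weights depend on the estimates and the estimates on the weights, such estimates are usually computed by iteration. The following is a minimal univariate sketch of that fixed-point idea, not any specific published estimator: the hard-rejection weight function, the cutoff 9, the median/MAD starting values, and the normal-consistency factor 1.4826 are all assumptions chosen for illustration.

```python
import statistics

def hard_rejection(d, cutoff=9.0):
    """Hypothetical weight function: keep a point if its squared distance <= cutoff."""
    return 1.0 if d <= cutoff else 0.0

def weighted_fit(xs, weight, n_iter=20):
    """Iterate weighted mean and variance in one dimension (p = 1)."""
    mu = statistics.median(xs)                               # robust start
    scale = 1.4826 * statistics.median([abs(x - mu) for x in xs])
    var = scale ** 2
    w = [1.0] * len(xs)
    for _ in range(n_iter):
        # In one dimension MD(x; mu, var) = (x - mu)^2 / var.
        w = [weight((x - mu) ** 2 / var) for x in xs]
        total = sum(w)                                       # assumed to stay > 0
        mu = sum(wi * x for wi, x in zip(w, xs)) / total
        var = sum(wi * (x - mu) ** 2 for wi, x in zip(w, xs)) / total
    return mu, var, w                                        # w: weights of the last update

xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]
mu, var, w = weighted_fit(xs, hard_rejection)
print(mu, var, w[-1])
```

On this toy sample the outlier at 1000 receives weight zero and the estimates settle on the mean and variance of the remaining nine points.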
9. Contaminated cells, not cases
   - [Figure: two panels, Traditional Contamination and Independent Contamination, with $\varepsilon = 10\%$.]
10. Generalized Contamination
    - Data entry errors, hardware malfunction, etc.
    - Can be expressed as

      $$X_j = (1 - B_j)(X_{\text{Good}})_j + B_j (X_{\text{Horrible}})_j, \quad j = 1, \dots, p,$$

      or, in matrix form, as

      $$X = (1 - B) X_{\text{Good}} + B X_{\text{Horrible}},$$

      where $B$ is a vector of Bernoulli r.v.'s.
    - The dependence structure of $B$ is important.
    - We will assume Independent Contamination: all $B_j$ are independent of each other and of the $X$'s. Also, $P[B_j = 1] = \varepsilon$ for simplicity.
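The componentwise formula translates into a short simulation in which every cell gets its own independent Bernoulli($\varepsilon$) indicator. The clean distribution (standard normal), the contaminating value 100, and the helper name `contaminate_cells` are illustrative assumptions.

```python
import random

def contaminate_cells(clean_rows, eps, horrible_value, rng):
    """Apply X_j = (1 - B_j) * good_j + B_j * horrible_j with independent B_j."""
    data, flags = [], []
    for row in clean_rows:
        b = [1 if rng.random() < eps else 0 for _ in row]   # one indicator per cell
        data.append([(1 - bj) * x + bj * horrible_value for bj, x in zip(b, row)])
        flags.append(b)
    return data, flags

rng = random.Random(1)
n, p, eps = 2000, 20, 0.05
clean = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
data, flags = contaminate_cells(clean, eps, 100.0, rng)

# Fraction of contaminated cells; should be close to eps.
cell_fraction = sum(sum(b) for b in flags) / (n * p)
print(cell_fraction)
```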
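11. Number of clean cases
    - A case with any contaminated cell will appear as an outlier if diagnosed with MDs.
    - $P[\text{case is clean}] = (1 - \varepsilon)^p$
    - E.g., with $\varepsilon = 0.05$ and $p = 20$, only about 36% of cases are clean.
    - Downweighting whole cases is a waste of data.
    - The fraction of contaminated cases exceeds the breakdown point of traditional robust estimates.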
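The formula $(1 - \varepsilon)^p$ is easy to evaluate. The sketch below computes it and, as an added illustration, finds the largest dimension for which the expected fraction of contaminated cases stays at or below a 50% breakdown point; the function names are hypothetical. For $\varepsilon = 0.05$ and $p = 20$ it gives roughly 0.36.

```python
def prob_clean_case(eps, p):
    """P[case is clean] = (1 - eps)^p under independent cell contamination."""
    return (1.0 - eps) ** p

def max_dimension_below_breakdown(eps, breakdown=0.5):
    """Largest p with 1 - (1 - eps)^p <= breakdown; requires 0 < eps < 1."""
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie strictly between 0 and 1")
    p = 0
    while 1.0 - prob_clean_case(eps, p + 1) <= breakdown:
        p += 1
    return p

print(round(prob_clean_case(0.05, 20), 4))    # 0.3585
print(max_dimension_below_breakdown(0.05))    # 13
```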
12. Affine-equivariance
    - Definition: if data set $Y = A + XB$, then

      $$\hat\mu(Y) = A + B^\top \hat\mu(X), \qquad \hat\Sigma(Y) = B^\top \hat\Sigma(X) B.$$

    - Desirable: easy to study, etc. If we know how the estimate behaves on $X$, then we know how it behaves on $Y$, and vice versa.
    - Most "respectable" robust estimates are affine-equivariant (A-E).
    - Alqallaf et al. (2009) have a proof that reasonable A-E estimates cannot be robust against independent contamination (IC).
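The ordinary sample mean and sample covariance are the simplest affine-equivariant estimates, so they make a convenient numerical check of the definition. In this sketch the shift is written as a vector `a` added to every row, and the small data set and matrix `B` are made up for illustration.

```python
def mean_vec(rows):
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]

def cov_mat(rows):
    """Sample covariance matrix with divisor n - 1."""
    n, m = len(rows), mean_vec(rows)
    p = len(m)
    return [[sum((r[a] - m[a]) * (r[b] - m[b]) for r in rows) / (n - 1)
             for b in range(p)] for a in range(p)]

def matmul(P, Q):
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]

def transpose(M):
    return [list(col) for col in zip(*M)]

X = [[1.0, 2.0], [2.0, 1.0], [4.0, 5.0], [3.0, 7.0]]
a = [10.0, -3.0]                       # shift applied to every case
B = [[2.0, 1.0], [0.0, 3.0]]           # nonsingular linear map

# Y = A + XB, row by row.
Y = [[v + s for v, s in zip(row, a)] for row in matmul(X, B)]

lhs_mean = mean_vec(Y)
# mu(X)^T B as a row vector equals (B^T mu(X))^T.
rhs_mean = [s + v for s, v in zip(a, matmul([mean_vec(X)], B)[0])]

lhs_cov = cov_mat(Y)
rhs_cov = matmul(matmul(transpose(B), cov_mat(X)), B)
print(lhs_mean, rhs_mean)
print(lhs_cov, rhs_cov)
```

Both sides agree up to floating-point error, as the definition requires.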
13. Affine Transformation of Contaminated Data
    - [Figure: two panels, Original Contaminated and Transformed, illustrating $X \to Y = XB$.]
14. Pairwise approach
    - $P[\text{pair of variables is clean}] = (1 - \varepsilon)^2 \gg (1 - \varepsilon)^p$
    - Estimate all elements $\hat\Sigma_{ab}$, for $a, b = 1, \dots, p$, separately.
    - Problem: the multivariate structure is damaged or destroyed.
    - Particular problem: the resulting matrix may not be positive-definite. This may or may not be a problem. Usually it is.
    - Studied to some extent by Alqallaf (2003, PhD thesis).
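A small made-up example shows how assembling a matrix entry by entry can fail to be positive-definite. The three pairwise correlations below (0.9, 0.9, and -0.9) are each valid on their own but mutually incompatible; the check uses Sylvester's criterion for a symmetric 3x3 matrix. The numbers and function names are illustrative assumptions.

```python
def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def is_positive_definite_3x3(m):
    """Sylvester's criterion: all leading principal minors of symmetric m are positive."""
    minor1 = m[0][0]
    minor2 = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return minor1 > 0 and minor2 > 0 and det3(m) > 0

# Each pairwise correlation is plausible alone, but the three cannot hold together.
pairwise = [[1.0, 0.9, 0.9],
            [0.9, 1.0, -0.9],
            [0.9, -0.9, 1.0]]
# A mutually consistent matrix for comparison.
consistent = [[1.0, 0.5, 0.5],
              [0.5, 1.0, 0.5],
              [0.5, 0.5, 1.0]]

print(det3(pairwise), is_positive_definite_3x3(pairwise))
print(det3(consistent), is_positive_definite_3x3(consistent))
```

The first matrix has a negative determinant (about -2.888), so it cannot be a covariance or correlation matrix even though every 2x2 block is valid.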
15. Detecting cells
    - Some are obvious: univariate outliers.
    - Some only show up with respect to other cells: structural outliers.
    - Van Aelst et al. (2009) use Stahel-Donoho projections.
    - Little and Smith (1987) used partial Mahalanobis distances: if $\mathrm{MD}(x; \hat\mu, \hat\Sigma)$ is large, consider $\mathrm{MD}(x_{-j}; \hat\mu, \hat\Sigma)$ for all $j = 1, \dots, p$.
    - Mike explores the MD approach and iterative estimation of covariances in his thesis.
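The partial-distance idea can be sketched as follows. This is one possible reading, not necessarily the procedure of Little and Smith: $\mathrm{MD}(x_{-j}; \hat\mu, \hat\Sigma)$ is taken here to mean the distance computed after deleting coordinate $j$ from $x$, from $\hat\mu$, and from the rows and columns of $\hat\Sigma$. The values of `mu`, `sigma`, and `x` are made up.

```python
def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (M[r][n] - sum(M[r][c] * z[c] for c in range(r + 1, n))) / M[r][r]
    return z

def md(x, mu, sigma):
    """(x - mu)^T sigma^{-1} (x - mu)."""
    d = [xi - mi for xi, mi in zip(x, mu)]
    return sum(di * zi for di, zi in zip(d, solve(sigma, d)))

def partial_mds(x, mu, sigma):
    """Distance with coordinate j removed, for each j (assumed reading of x_{-j})."""
    p = len(x)
    out = []
    for j in range(p):
        keep = [k for k in range(p) if k != j]
        out.append(md([x[k] for k in keep],
                      [mu[k] for k in keep],
                      [[sigma[a][b] for b in keep] for a in keep]))
    return out

mu = [0.0, 0.0, 0.0]
sigma = [[1.0, 0.9, 0.0],
         [0.9, 1.0, 0.0],
         [0.0, 0.0, 1.0]]
x = [2.0, -2.0, 0.0]     # no cell is extreme alone, but cells 1 and 2 clash

print(md(x, mu, sigma))          # about 80: the case looks like an outlier
print(partial_mds(x, mu, sigma)) # about [4, 4, 80]
```

Dropping either of the first two cells brings the distance down sharply, while dropping the third changes nothing, which points at the first two cells as a structural outlier.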
16. Weighted estimate with cell weights
    - Van Aelst et al. (2009) proposed a weighted estimate, but it is pairwise and not symmetric positive-definite (SPD).
    - Mike knows how to deal with zero weights: remove the values and treat them as missing completely at random (MCAR). Then do MLE via EM, for example.
    - A proper cell-weighted estimate is still to be developed.
17. The End