Multivariate outlier detection

Estimating Distance Distributions and Testing Observation Outlyingness for Complex Surveys
Jianqiang Wang
Major Professor: Jean Opsomer
Committee: Wayne A. Fuller, Song X. Chen, Dan Nettleton, Dimitris Margaritis

2/52
Outline
• Introduction
• Notation and assumptions
• Mean, median-based inference
• Variance estimation
• Simulation study
• Application in National Resources Inventory
• Theoretical extensions

3/52
Structure of survey data
• Many finite populations targeted by surveys consist of homogeneous subpopulations.
• “Homogeneity” refers to the variables being collected, which generally differ from the design variables.
• Example
  • We are interested in the health condition of U.S. residents between 45 and 60 years old and stratify by county; homogeneity refers to the health condition variables we collect.

4/52
Conceptual ideas
• Given this population structure, provide a measure of outlyingness and flag unusual points.
• Assign each point to a subpopulation and define a measure of outlyingness.
• Uses: dimension reduction, describing multivariate populations, identifying outliers and discriminating between objects.

5/52
Outlier identification procedure (1)
• Identify the target variables on which we want to test outlyingness.
• Partition the population into a number of relatively homogeneous groups.
• Define a measure of subpopulation center and a distance metric from each point to the subpopulation center.
• Define the outlyingness of each point as the fraction of points with a less extreme distance in its subpopulation.

6/52
Outlier identification procedure (2)
• Estimate the distance distribution and the outlyingness of each point.
• Flag observations whose measure of outlyingness exceeds a prespecified threshold (0.95 or 0.98).
• Make decisions on the list of suspicious points.
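
The last steps of the procedure can be sketched as follows. This is a minimal illustration, assuming the distances to the subpopulation center are already computed and that design weights 1/π_i are available; the function names `outlyingness` and `flag` and the toy data are hypothetical, not part of the talk.

```python
def outlyingness(distances, weights):
    """Weighted fraction of points whose distance is strictly smaller.

    distances[i] is the distance of point i to its subpopulation center;
    weights[i] is an assumed design weight 1/pi_i (use 1.0 for a census).
    """
    total = sum(weights)
    return [
        sum(w for d, w in zip(distances, weights) if d < di) / total
        for di in distances
    ]


def flag(distances, weights, threshold=0.95):
    """Indices of points whose outlyingness exceeds the threshold."""
    scores = outlyingness(distances, weights)
    return [i for i, s in enumerate(scores) if s > threshold]


# Toy subpopulation: 29 ordinary points and one far from the center.
dists = [0.1 * k for k in range(1, 30)] + [25.0]
wts = [1.0] * len(dists)
flagged = flag(dists, wts, threshold=0.95)  # only the far point is flagged
```

Because outlyingness is a rank-type quantity, the threshold fixes roughly what share of each subpopulation gets reviewed; the size of the far point's distance does not change its score.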

7/52
Inference in survey sampling
• The target measure of outlyingness is defined at the finite population level.
• Two mechanisms
  • Mechanism for generating the finite population
  • Mechanism for drawing a sample
• Condition on the finite population and use design-based inference.
• Asymptotic theory in survey sampling
  • Sequence of finite populations
  • Sequence of sampling designs

8/52
Sequence of finite populations
• Let $U_\nu = \{1, 2, \dots, N_\nu\}$ be the population index set.
• Associated with the $i$-th population element is a $p$-dimensional vector $y_i = (y_{i,1}, \dots, y_{i,p})$, with inclusion probability $\pi_i$.
• The finite population $U_\nu = \bigcup_{g=1}^{G} U_{\nu g}$ has subpopulation sizes $N_{\nu g}$; the sample $A_\nu = \bigcup_{g=1}^{G} A_{\nu,g}$ has subsample sizes $n_{\nu g}$, with expected sample size $n^*_{\nu g} = E(n_{\nu g} \mid \mathcal{F}_\nu)$.
• Population composition: $N_g = f_{Ng} N$, with $f_{Ng} \in [f_L, f_H]$.
• Assume we know $G$ and the subpopulation association.
• Let $\mathcal{F}_\nu$ be the power set of $\{y_1, y_2, \dots, y_{N_\nu}\}$.

9/52
Sequence of sampling designs
• A probability sample $A_N$ is drawn from $U_N$ with respect to some measurable design.
• Associate a sampling indicator $I(i \in A_N)$ with each element.
• Inclusion probabilities:
  $$\pi_i = \Pr(i \in A_N) = E(I(i \in A_N) \mid \mathcal{F}_N)$$
  $$\pi_{ij} = \Pr(i, j \in A_N) = E(I(i \in A_N)\, I(j \in A_N) \mid \mathcal{F}_N)$$
• Sample size $n$ with expectation $n^* = E(n \mid \mathcal{F}_N)$.

10/52
Examples of sampling designs and estimators
• Simple random sampling without replacement: $\pi_i = \frac{n}{N}$; $\pi_{ij} = \frac{(n-1)n}{(N-1)N}$, $i \neq j$.
• Poisson sampling: arbitrary $\pi_i$;
  $$\pi_{ij} = \begin{cases} \pi_i \pi_j, & i \neq j \\ \pi_i, & i = j \end{cases}$$
• Horvitz-Thompson and Hájek estimators of the mean
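
The slide names the two estimators without writing them out. A minimal sketch, assuming the usual definitions (Horvitz-Thompson divides the weighted total by the known N, Hájek divides by the estimated size Σ 1/π_i); the simulated population and the particular Poisson inclusion probabilities are hypothetical.

```python
import random


def ht_mean(y, pi, N):
    """Horvitz-Thompson estimator of the population mean (N known)."""
    return sum(yi / p for yi, p in zip(y, pi)) / N


def hajek_mean(y, pi):
    """Hajek estimator: replaces N by its estimate sum(1 / pi_i)."""
    n_hat = sum(1.0 / p for p in pi)
    return sum(yi / p for yi, p in zip(y, pi)) / n_hat


# SRS without replacement: pi_i = n / N for every element.
random.seed(1)
N, n = 1000, 100
population = [random.gauss(50.0, 10.0) for _ in range(N)]
sample = random.sample(population, n)
pi_srs = [n / N] * n
ht_srs = ht_mean(sample, pi_srs, N)
hajek_srs = hajek_mean(sample, pi_srs)

# Poisson sampling: independent draws with arbitrary pi_i, random sample size.
pi_all = [0.05 + 0.10 * (i / (N - 1)) for i in range(N)]
drawn = [i for i in range(N) if random.random() < pi_all[i]]
y_pois = [population[i] for i in drawn]
pi_pois = [pi_all[i] for i in drawn]
ht_pois = ht_mean(y_pois, pi_pois, N)
hajek_pois = hajek_mean(y_pois, pi_pois)
```

Under SRS the two estimators coincide with the sample mean. Under Poisson sampling the realized sample size is random, so they differ, and the Hájek form always stays within the range of the sampled values.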

11/52
Norms
• Use the notion of a norm to quantify the distance between an observation and a measure of center.
• A norm $\|\mu\| : \mathbb{R}^p \to \mathbb{R}^+$ satisfies:
  • Non-degenerate: $\|\mu\| = 0 \Leftrightarrow \mu = 0$
  • Homogeneity: $\|\alpha\mu\| = |\alpha|\,\|\mu\|$
  • Triangle inequality: $\|\mu_1 + \mu_2\| \le \|\mu_1\| + \|\mu_2\|$

12/52
Example of norms and unit circle
• General norms
  • Manhattan distance, $L_1$: $\|\mu\|_1 = \sum_{i=1}^{p} |\mu_i|$
  • Euclidean distance, $L_2$: $\|\mu\|_2 = \sqrt{\sum_{i=1}^{p} \mu_i^2}$
  • Supremum norm, $L_\infty$: $\|\mu\|_\infty = \max\{|\mu_1|, \dots, |\mu_p|\}$
• Quadratic norm, $L_A$: $\|\mu\|_A = \sqrt{\mu' A \mu}$
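
The four norms are easy to compare on a single vector. The sketch below is illustrative only; the function names are hypothetical and the matrix `A` is assumed symmetric positive definite.

```python
import math


def l1_norm(u):
    return sum(abs(x) for x in u)


def l2_norm(u):
    return math.sqrt(sum(x * x for x in u))


def sup_norm(u):
    return max(abs(x) for x in u)


def quadratic_norm(u, A):
    """sqrt(u' A u) for a symmetric positive definite matrix A (list of rows)."""
    Au = [sum(a * x for a, x in zip(row, u)) for row in A]
    return math.sqrt(sum(x * y for x, y in zip(u, Au)))


u = [3.0, -4.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
values = (l1_norm(u), l2_norm(u), sup_norm(u), quadratic_norm(u, identity))
```

With `A` equal to the identity the quadratic norm reduces to the Euclidean norm; choosing `A` as an inverse covariance matrix gives the Mahalanobis distance used later in the talk.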

13/52
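Distribution of population distances
• Population: $D_{\nu,d}(\mu_\nu) = \frac{1}{N_\nu} \sum_{U_\nu} I(\|y_i - \mu_\nu\| \le d)$
• Sample: $\widehat D_{\nu,d}(\hat\mu_\nu) = \frac{1}{\widehat N_\nu} \sum_{A_\nu} \frac{1}{\pi_i} I(\|y_i - \hat\mu_\nu\| \le d)$, where $\widehat N_\nu = \sum_{A_\nu} \frac{1}{\pi_i}$
• Measure of center: mean vector
  • Population: $\mu_\nu = \frac{1}{N_\nu} \sum_{U_\nu} y_i$
  • Sample: $\hat\mu_\nu = \frac{1}{\widehat N_\nu} \sum_{A_\nu} \frac{y_i}{\pi_i}$
• In $D_{\nu,d}(\mu_\nu)$, $\mu_\nu$ is the location and $d$ is the radius.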
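
A small numerical sketch of the sample quantities above, assuming the Euclidean norm; the function names and the four-point toy sample are hypothetical.

```python
import math


def hajek_mean_vector(ys, pis):
    """Design-weighted (Hajek) mean vector of the sampled points."""
    n_hat = sum(1.0 / p for p in pis)
    dim = len(ys[0])
    return [sum(y[k] / p for y, p in zip(ys, pis)) / n_hat for k in range(dim)]


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def distance_cdf(ys, pis, center, d, dist=euclidean):
    """Weighted fraction of sampled points within distance d of the center."""
    n_hat = sum(1.0 / p for p in pis)
    inside = sum(1.0 / p for y, p in zip(ys, pis) if dist(y, center) <= d)
    return inside / n_hat


# Toy bivariate sample with unequal inclusion probabilities.
ys = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (4.0, 4.0)]
pis = [0.5, 0.5, 0.25, 0.25]
mu_hat = hajek_mean_vector(ys, pis)               # location
d_hat = distance_cdf(ys, pis, mu_hat, d=2.0)      # radius d = 2
```

Here the weights are 2, 2, 4, 4, the estimated center is (1.5, 5/3), and the two points within radius 2 carry half of the total weight.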

14/52
Bivariate population
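[Figure: bivariate population with center $\mu$ and contour $\{y : \|y - \mu\| = d\}$, illustrating $D_{\nu,d}(\mu_\nu)$ and $\widehat D_{\nu,d}(\hat\mu_\nu)$.]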

15/52
Nondifferentiability with respect to location

16/52
General design assumptions
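• Assumptions on $n$, $\pi_i$, and the design variance:
  • $n = O_p(N_\nu^{\beta})$, with $\beta \in \left(\frac{2p}{2p+1}, 1\right]$
  • $K_L \le \frac{N}{n}\pi_j \le K_U$
  • $\mathrm{Var}(\bar z_{N,\pi} \mid \mathcal{F}_N) \le K_1 \mathrm{Var}_{SRS}(\bar z_{N,SRS} \mid \mathcal{F}_N)$
• For any vector $z$ with finite $2+\delta$ moments, define $\bar z_{N,\pi}$ as the HT estimator of the mean, and assume
  $$n^{*1/2}(\bar z_{N,\pi} - \bar z_N) \mid \mathcal{F}_N \xrightarrow{d} N(0, \Sigma_{zz})$$
• For any $z$ with positive definite population variance-covariance matrix and finite fourth moment,
  $$[V(\bar z_{N,\pi} \mid \mathcal{F}_N)]^{-1}\, \widehat V_{HT}\{\bar z_{N,\pi}\} - I_{p \times p} = O_p(n^{*-1/2})$$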

17/52
Application specific assumption 1
• The population distance distribution converges to a limiting function,
  $$\lim_{N \to \infty} D_{\nu,d}(\mu) = D_d(\mu),$$
  where $(d, \mu) \in [0, \infty) \times \mathbb{R}^p$.
• The limiting function $D_d(\mu)$ is continuous in $d \in [0, \infty)$ and $\mu \in \mathbb{R}^p$, with finite derivatives $\frac{\partial D_d(\mu)}{\partial d}$, $\frac{\partial D_d(\mu)}{\partial \mu}$ and $\frac{\partial^2 D_d(\mu)}{\partial \mu^2}$.
• The norm $\|\cdot\|$ is continuous on $\mathbb{R}^p$, with a continuous derivative $\psi(\cdot)$ and bounded second derivative matrix $H_s(\cdot)$.

18/52
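Application specific assumption 2
• The population quantity
  $$\sqrt{N_\nu}\left\{ \frac{1}{N_\nu} \sum_{U_\nu} I(d < \|y_i - \mu\| \le d + h_{N_\nu}) - \frac{\partial D_d(\mu)}{\partial d}\, h_{N_\nu} \right\} \to 0,$$
  where $h_{N_\nu} = O(N_\nu^{-\alpha})$ and $\alpha \in [\tfrac{1}{4}, 1)$.
• Justification assumes a probabilistic model.
• Proof tools: Markov's inequality, Borel-Cantelli lemma.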

19/52
Application specific assumption 3
• The population quantity
  $$\frac{n_\nu^{*1/2}}{N_\nu} \sum_{U_\nu} \left[ I(\|y_i - \mu - n_\nu^{*-1/2} s\| \le d) - I(\|y_i - \mu\| \le d) - D_d(\mu + n_\nu^{*-1/2} s) + D_d(\mu) \right]$$
  converges to 0 uniformly for $s \in C_s$ and $\mu \in \mathbb{R}^p$.
• Justification assumes a probabilistic model.
• Proof: empirical process theory.

20/52
Design consistency
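• Decomposition:
  $$n_\nu^{*1/2}\left( \widehat D_{\nu,d}(\hat\mu_\nu) - D_{\nu,d}(\mu_\nu) \right) = n_\nu^{*1/2}\left( \widehat D_{\nu,d}(\hat\mu_\nu) - \widehat D_{\nu,d}(\mu_\nu) - D_d(\hat\mu_\nu) + D_d(\mu_\nu) \right) + n_\nu^{*1/2}\left( \widehat D_{\nu,d}(\mu_\nu) - D_{\nu,d}(\mu_\nu) \right) + n_\nu^{*1/2}\left( D_d(\hat\mu_\nu) - D_d(\mu_\nu) \right)$$
• Intermediate result:
  $$n_\nu^{*1/2}\left( \widehat D_{\nu,d}(\hat\mu_\nu) - \widehat D_{\nu,d}(\mu_\nu) - D_d(\hat\mu_\nu) + D_d(\mu_\nu) \right) \xrightarrow{p} 0$$
• Consistency:
  $$n_\nu^{*1/2}\left( \widehat D_{\nu,d}(\hat\mu_\nu) - D_d(\mu_\nu) \right) \Big| \mathcal{F}_\nu = O_p(1)$$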

21/52
Asymptotic normality
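• Let $b_{\mu,i} = (I(\|y_i - \mu_\nu\| \le d),\ 1,\ y_i)'$ and let $\Sigma_{\mu,d}$ be the design variance-covariance matrix of the HT estimator of the mean of $b_{\mu,i}$.
• Asymptotic normality:
  $$\left( a_\mu' \Sigma_{\mu,d}\, a_\mu \right)^{-1/2} \left( \widehat D_{\nu,d}(\hat\mu_\nu) - D_{\nu,d}(\mu_\nu) \right) \Big| \mathcal{F}_\nu \xrightarrow{d} N(0, 1),$$
  where
  $$a_\mu = \left[ 1,\ -D_{\nu,d}(\mu_\nu) - \frac{\partial D_d(\mu_\nu)}{\partial \mu_\nu}^{T} \mu_\nu,\ \frac{\partial D_d(\mu_\nu)}{\partial \mu_\nu}^{T} \right]'.$$
• Reading the components of $a_\mu$: the leading 1 is the term that remains when the subpopulation size and mean are assumed known; $-D_{\nu,d}(\mu_\nu)$ accounts for the unknown subpopulation size; the gradient terms account for the unknown subpopulation mean.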

22/52
Multivariate median
• Mean vector
• Generalized median
  • Population: $q_\nu = \arg\inf_q \sum_{U_\nu} \|y_i - q\|$
  • Sample: $\hat q_\nu = \arg\inf_q \sum_{A_\nu} \frac{1}{\pi_i} \|y_i - q\|$
• Existence and uniqueness
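
The talk does not say how the sample median is computed. One common choice for the Euclidean norm is Weiszfeld's fixed-point iteration; the sketch below uses it as an assumed stand-in, with design weights 1/π_i passed as `weights`. The function name and the toy points are hypothetical.

```python
import math


def weighted_geometric_median(ys, weights, tol=1e-10, max_iter=1000):
    """Weiszfeld iterations for arg min_q sum_i w_i * ||y_i - q||_2."""
    dim = len(ys[0])
    total = sum(weights)
    # Start from the weighted mean.
    q = [sum(w * y[k] for y, w in zip(ys, weights)) / total for k in range(dim)]
    for _ in range(max_iter):
        num = [0.0] * dim
        den = 0.0
        for y, w in zip(ys, weights):
            dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(y, q)))
            if dist < 1e-12:
                # Simplification: skip a data point the iterate has landed on,
                # to avoid dividing by zero.
                continue
            for k in range(dim):
                num[k] += w * y[k] / dist
            den += w / dist
        if den == 0.0:
            break
        new_q = [v / den for v in num]
        shift = math.sqrt(sum((a - b) ** 2 for a, b in zip(new_q, q)))
        q = new_q
        if shift < tol:
            break
    return q


# Four corners of the unit square plus one distant point, equal weights.
pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (100.0, 100.0)]
q_hat = weighted_geometric_median(pts, [1.0] * len(pts))
```

The weighted mean of these points is (20.4, 20.4), while the median stays inside the unit square, which is the robustness that motivates a median-based center.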

23/52
Multivariate median
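• Estimating equations
  • Population: $\sum_{U_\nu} \psi(y_i - q) = 0$
  • Sample: $\sum_{A_\nu} \frac{1}{\pi_i} \psi(y_i - q) = 0$
• Linearization of $\hat q_\nu$:
  $$\hat q_\nu = q_\nu + \left[ \frac{1}{N_\nu} \sum_{i \in A_\nu} \frac{H_s(y_i - q_\nu)}{\pi_i} \right]^{-1} \frac{1}{N_\nu} \sum_{i \in A_\nu} \frac{\psi(y_i - q_\nu)}{\pi_i} + o_p(n_\nu^{*-1/2})$$
• What if the estimating equation is not differentiable?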

24/52
Median-based distances
Asymptotic results
• Design consistency and asymptotic normality of $\hat q_\nu$ for $q_\nu$.
• Design consistency and asymptotic normality of $\widehat D_{\nu,d}(\hat q_\nu)$ as an estimator of $D_{\nu,d}(q_\nu)$.

25/52
Mahalanobis distances
• Mean and median-based inference.
• Choose an appropriate norm to match the shape of the underlying multivariate distribution.
• Estimate the variance-covariance matrix or another shape measure of the subpopulation, and use the Mahalanobis distance.
• Estimate the distribution of Mahalanobis distances.
• See the application section for more details.

26/52
Naive variance estimator
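• Use the mean-based case to explain the variance estimators.
• Recall the asymptotic variance of $\widehat D_{\nu,d}(\hat\mu_\nu)$:
  $$V\left( \widehat D_{\nu,d}(\hat\mu_\nu) \right) = a_\mu' \Sigma_{\mu,d}\, a_\mu,$$
  where
  $$a_\mu = \left( 1,\ -D_{\nu,d}(\mu_\nu) - \left( \frac{\partial D_d(\mu_\nu)}{\partial \mu_\nu} \right)' \mu_\nu,\ \left( \frac{\partial D_d(\mu_\nu)}{\partial \mu_\nu} \right)' \right)'.$$
• Claim: the extra variance due to estimating the center can be ignored in elliptical distributions when using a quadratic norm.
• Naïve variance estimator, ignoring the gradient vector:
  $$\hat\sigma^2_{\mu,d,naive} = \left( 1,\ -\widehat D_{\nu,d}(\hat\mu_\nu) \right) \widehat\Sigma_{\mu,d} \left( 1,\ -\widehat D_{\nu,d}(\hat\mu_\nu) \right)'.$$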

27/52
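Estimate the gradient vector by kernel smoothing
• Idea: estimate
  $$D_d(\mu) = \lim_{\nu \to \infty} \frac{1}{N_\nu} \sum_{U_\nu} I(\|y_i - \mu\| \le d)$$
  by
  $$\frac{1}{\widehat N_\nu} \sum_{A_\nu} \mathcal{K}\left( \frac{d - \|y_i - \mu\|}{h} \right) \frac{1}{\pi_i},$$
  where $\mathcal{K}(x) = \int_{-\infty}^{x} K(t)\,dt$ is an integrated kernel, e.g. the CDF of the standard normal.
• Kernel estimator:
  $$\hat\zeta_{\nu,d}(\hat\mu_\nu) = \frac{1}{\widehat N_\nu h} \sum_{A_\nu} K\left( \frac{d - \|y_i - \hat\mu_\nu\|}{h} \right) \psi(y_i - \hat\mu_\nu) \frac{1}{\pi_i}$$
• Design consistent for $\frac{\partial D_d(\mu_\nu)}{\partial \mu_\nu}$ under mild assumptions.
• Jackknife variance estimation has been proposed for the mean-based case.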
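
A sketch of the kernel gradient estimator, assuming a Gaussian kernel and the Euclidean norm, for which ψ(u) = u/‖u‖. The function names and the toy sample are hypothetical. The gradient is the exact derivative of the smoothed distance distribution with respect to the center, which can be checked by finite differences.

```python
import math


def normal_pdf(t):
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def normal_cdf(t):
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def psi(u):
    """Derivative of the Euclidean norm at u (undefined at the origin; 0 used)."""
    r = math.sqrt(sum(x * x for x in u))
    return [x / r for x in u] if r > 0.0 else [0.0] * len(u)


def smoothed_cdf(ys, pis, center, d, h):
    """Kernel-smoothed version of the weighted distance distribution."""
    n_hat = sum(1.0 / p for p in pis)
    total = 0.0
    for y, p in zip(ys, pis):
        r = math.sqrt(sum((a - c) ** 2 for a, c in zip(y, center)))
        total += normal_cdf((d - r) / h) / p
    return total / n_hat


def kernel_gradient(ys, pis, center, d, h):
    """Kernel estimate of the gradient of D_d(mu) with respect to mu at center."""
    n_hat = sum(1.0 / p for p in pis)
    grad = [0.0] * len(center)
    for y, p in zip(ys, pis):
        diff = [a - c for a, c in zip(y, center)]
        r = math.sqrt(sum(x * x for x in diff))
        k = normal_pdf((d - r) / h)
        for j, s in enumerate(psi(diff)):
            grad[j] += k * s / p
    return [g / (n_hat * h) for g in grad]


ys = [(0.0, 0.0), (1.0, 0.5), (2.0, -1.0), (-1.0, 1.5)]
pis = [0.5, 0.25, 0.5, 0.25]
grad = kernel_gradient(ys, pis, center=(0.3, 0.2), d=1.2, h=0.4)
```

The bandwidth `h` trades bias against variance: the simulation slides later in the talk compare h = 0.1 and h = 0.4.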

28/52
Jackknife variance estimator
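• Recall
  $$\widehat D_{\nu,d}(\hat\mu_\nu) = \frac{1}{\widehat N_\nu} \sum_{A_\nu} \frac{1}{\pi_i} I(\|y_i - \hat\mu_\nu\| \le d)$$
• Recalculate the mean for each jackknife replicate? Inconsistent!
• Proposed idea: incorporate an estimated gradient vector in the replication estimation.
• For the $l$-th replicate sample, calculate
  $$\widehat D^{(l)}(\hat\mu_\nu) = \widehat D^{(l)}_{\nu,d}(\hat\mu_\nu) + \hat\zeta_{\nu,d}(\hat\mu_\nu)\,(\hat\mu^{(l)}_\nu - \hat\mu_\nu)$$
  and use
  $$\widehat V_{JK}\left( \widehat D_{\nu,d}(\hat\mu_\nu) \right) = \sum_{l=1}^{L} c_l \left( \widehat D^{(l)}(\hat\mu_\nu) - \widehat D_{\nu,d}(\hat\mu_\nu) \right)^2$$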
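
A sketch of the gradient-corrected jackknife under simplifying assumptions that are not from the talk: delete-one replicates, replicate factor c_l = (n − 1)/n as for simple random sampling, no finite population correction, Euclidean norm and a Gaussian kernel. All function names and the simulated sample are hypothetical.

```python
import math
import random


def _dist(y, c):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, c)))


def w_mean(ys, ws):
    tot = sum(ws)
    return [sum(w * y[k] for y, w in zip(ys, ws)) / tot for k in range(len(ys[0]))]


def w_cdf(ys, ws, center, d):
    return sum(w for y, w in zip(ys, ws) if _dist(y, center) <= d) / sum(ws)


def w_gradient(ys, ws, center, d, h):
    """Gaussian-kernel estimate of the gradient of D_d(mu) at center."""
    grad = [0.0] * len(center)
    for y, w in zip(ys, ws):
        r = _dist(y, center)
        if r == 0.0:
            continue
        k = math.exp(-0.5 * ((d - r) / h) ** 2) / math.sqrt(2.0 * math.pi)
        for j in range(len(center)):
            grad[j] += w * k * (y[j] - center[j]) / r
    tot = sum(ws)
    return [g / (tot * h) for g in grad]


def jackknife_variance(ys, ws, d, h):
    """Delete-one jackknife with the gradient correction for the center."""
    n = len(ys)
    mu = w_mean(ys, ws)
    full = w_cdf(ys, ws, mu, d)
    zeta = w_gradient(ys, ws, mu, d, h)
    c = (n - 1) / n  # assumed replicate factor for delete-one under SRS
    var = 0.0
    for l in range(n):
        ys_l = ys[:l] + ys[l + 1:]
        ws_l = ws[:l] + ws[l + 1:]
        mu_l = w_mean(ys_l, ws_l)
        # The indicator part is evaluated at the full-sample center ...
        rep = w_cdf(ys_l, ws_l, mu, d)
        # ... and the effect of re-estimating the center enters via the gradient.
        rep += sum(z * (a - b) for z, a, b in zip(zeta, mu_l, mu))
        var += c * (rep - full) ** 2
    return var


random.seed(7)
sample = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(200)]
weights = [25.0] * len(sample)  # e.g. SRS of n = 200 from N = 5000
v_jk = jackknife_variance(sample, weights, d=1.41, h=0.4)
```

Keeping the indicator at the full-sample center avoids differencing a step function across replicates; the smooth gradient term carries the contribution of the estimated center instead.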

29/52
Simulation study
• Goal of simulation study:
  • Assess asymptotic properties of the estimators.
  • Compare the naive variance estimator with the kernel estimator.
• Simulation parameters
  • p = 2, G = 5.
  • Subpopulations 1-4 are elliptically contoured; subpopulation 5 is skewed.
  • Stratified SRS.
  • Norm: Euclidean norm.

31/52
Subpopulation distance distribution functions

32/52
Effect of estimating the center
N = 5000, n = 1000, G = 5

                                                   Cluster 4               Cluster 5
$d$                                                1.00   1.41   2.45      1.00   1.41   2.45
$D_{\nu,d}(\mu_\nu)$                               0.44   0.54   0.71      0.31   0.52   0.85
$bias(\widehat D(\hat\mu)) / sd(\widehat D(\hat\mu))$   -0.11   0.00  -0.00     -0.00  -0.00   0.05
$bias(\widehat D(\mu)) / sd(\widehat D(\hat\mu))$       -0.01   0.00  -0.01      0.00   0.00   0.01
$sd(\widehat D(\hat\mu)) / sd(\widehat D(\mu))$          1.03   1.00   1.00      1.30   1.13   1.00

33/52
Effect of estimating the center
N = 5000, n = 200, G = 5

                                                   Cluster 4               Cluster 5
$d$                                                1.00   1.41   2.45      1.00   1.41   2.45
$D_{\nu,d}(\mu_\nu)$                               0.43   0.53   0.68      0.35   0.55   0.88
$bias(\widehat D(\hat\mu)) / sd(\widehat D(\hat\mu))$   -0.28  -0.05   0.12      0.05   0.14   0.10
$bias(\widehat D(\mu)) / sd(\widehat D(\hat\mu))$        0.03   0.04   0.06      0.00  -0.01   0.02
$sd(\widehat D(\hat\mu)) / sd(\widehat D(\mu))$          1.17   1.03   1.01      1.16   1.04   0.96

34/52
Average estimated variance relative to MC variance
N = 5000, n = 1000, G = 5

                                                   Cluster 4               Cluster 5
$d$                                                1.00   1.41   2.45      1.00   1.41   2.45
$D_{\nu,d}(\mu_\nu)$                               0.44   0.54   0.71      0.31   0.52   0.85
$\hat\sigma^2_{d,NV} / \hat\sigma^2_{d,MC}$              0.94   1.00   1.00      0.53   0.78   1.07
$\hat\sigma^2_{d,SM} / \hat\sigma^2_{d,MC}$, $h = 0.1$   1.21   1.15   1.12      1.00   1.01   1.04
$\hat\sigma^2_{d,SM} / \hat\sigma^2_{d,MC}$, $h = 0.4$   1.07   1.06   1.04      0.85   0.94   0.98
35/52
NRI application
• Introduction to NRI.
• Outlier identification for a longitudinal survey.
• Strategy for initial partitioning in NRI.
• How to define Mahalanobis distances.
• Analysis of identified points.

36/52
National Resources Inventory (1)
• The National Resources Inventory is a longitudinal survey of natural resources on non-Federal land in the U.S.
• Conducted by the USDA NRCS, in co-operation with CSSM at Iowa State University.
• Produces a longitudinal database containing numerous agro-environmental variables for scientific investigation and policy-making.
• Information was updated every 5 years before 1997 and annually since then, through a partially overlapping subsampling design.

37/52
National Resources Inventory (2)
• Covers various aspects of land use, farming practice, and environmentally important variables like wetland status and soil erosion.
• Measures both level and change over time in these variables.
• Primary mode of data collection is a combination of aerial photography and field collection.
• Outliers arise from errors in data collection or processing, or from real points that behave abnormally.

38/52
Outlier identification for a longitudinal survey
• Identify outliers for periodically updated data.
• Build outlier identification rules on previous years' data and use the rules to flag current observations.
[Figure: observed years 2001-2005; training set (2001, 2002, 2003), test set (2003, 2004, 2005).]

39/52
Target variables
• Non-pseudo core points with soil erosion in years 2001-2005.
• Variables: broad use, land use, USLE C factor, support practice factor, slope, slope length and USLE loss.
• USLE loss represents the potential long-term soil loss in tons/acre:
  USLELOSS = R * K * LS * C * P
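
Because USLE loss is a product of its factors, an error in one factor scales the loss by the same ratio. A small arithmetic sketch: the C-factor sequence is one of the flagged sequences shown later in the talk, while the values of R, K, LS and P are hypothetical placeholders.

```python
def usle_loss(r, k, ls, c, p):
    """USLE soil loss (tons/acre) as the product of its five factors."""
    return r * k * ls * c * p


# Hypothetical site factors; only the C factor changes between years.
r, k, ls, p = 150.0, 0.3, 1.2, 1.0
c_by_year = [0.013, 0.13, 0.013, 0.013, 0.013]
losses = [usle_loss(r, k, ls, c, p) for c in c_by_year]
ratio = losses[1] / losses[0]  # tenfold jump in C gives a tenfold jump in loss
```

The ratio does not depend on the placeholder values, so a misplaced decimal in the C factor shows up as a tenfold spike in USLE loss for that year.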

40/52
Point classification
b.u.  Point type                 b.u.  Point type
1     Cultivated cropland        7     Urban and built-up land
2     Noncultivated cropland     8     Rural transportation
3     Pastureland                9     Small water areas
4     Rangeland                  10    Large water areas
5     Forest land                11    Federal land
6     Minor land                 12    CRP
(b.u. = broad use code)

41/52
Initial partitioning
• Initial partitioning uses geographical association and broad use category.
  • Partition national data into state-wise categories.
  • Collapse northeastern states.
  • Partition each region based on broad use sequence into (1,1,1), (2,2,2), (3,3,3), (12,12,12) and points with broad use change.
  • Merge points with the same broad use change pattern, say (2,2,3), (1,1,12).

42/52
Defining distances
W Estimate subpopulation mean vector and covariance matrix
W Calculate distance to the center
W The inverse matrix is defined through a principal value
decomposition.
^¹ º = 1
bN º
P
Sº
y i
¼i
b§º = 1
bN º
P
Sº
(yi ¡ ^¹ º )(yi ¡ ^¹ º )0 1
¼i
kyi ¡ ^¹ º kb§ º
=
q
(yi ¡ ^¹ º )0b§¡
º (yi ¡ ^¹ º ):
b§¡
º
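
One way to read "principal value decomposition" is as the spectral (eigen) decomposition with near-zero eigenvalues dropped. Under that assumption, and with a hypothetical tolerance $\epsilon$ that is not given in the talk, the generalized inverse could be written as:

```latex
\widehat\Sigma_\nu = \sum_{k=1}^{p} \hat\lambda_k\, \hat v_k \hat v_k',
\qquad
\widehat\Sigma_\nu^{-} = \sum_{k:\ \hat\lambda_k > \epsilon} \hat\lambda_k^{-1}\, \hat v_k \hat v_k'
```

Here $\hat\lambda_k$ and $\hat v_k$ denote the eigenvalues and orthonormal eigenvectors of $\widehat\Sigma_\nu$. When all eigenvalues exceed $\epsilon$ this is the ordinary inverse; otherwise directions with no variation in the subpopulation are left out of the distance.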

43/52
Source of outlyingness
• Flagged 1% of points in the training set, and compared test distances with the 99% quantile of the training distances.
• Source of outlyingness:
  $$\hat e_{\nu,i} = \frac{\widehat\Sigma_\nu^{-1/2} (\hat\mu_\nu - y_i)}{\|\widehat\Sigma_\nu^{-1/2} (\hat\mu_\nu - y_i)\|}$$

44/52
Analysis of flagged points
• Agricultural specialists analyzed the identified points by suspicious variables.
• C factor: almost all points were considered suspicious.
  • Data entry errors
  • Invalid entries
    • C factor = 1 for hayland, pastureland or CRP
  • Unusual levels or trends in relation to land use, e.g.
    (0.013, 0.13, 0.013, 0.013, 0.013)
    (0.011, 0.06, 0.11, 0.003, 0.003)

45/52
Analysis of flagged points
• P factor: all points are candidates for review because of the change over time, e.g. (1.0, 1.0, 1.0, 0.6, 1.0).
• Slope length: all points were flagged because of the level, not the change over time.
• Land use: most points were flagged because of a change in the type of hayland or pastureland over time. Not a major concern to NRCS reviewers.

46/52
Nondifferentiable survey estimators
• The sample distance distribution $\widehat D_{\nu,d}(\hat\mu_\nu)$ is a nondifferentiable function of the estimated location parameter.
• A general class of survey estimators:
  $$\widehat T(\hat\lambda) = \frac{1}{N} \sum_{i \in S_\nu} \frac{1}{\pi_i} h(y_i; \hat\lambda)$$
  with corresponding population quantity
  $$T_N(\lambda_N) = \frac{1}{N} \sum_{i=1}^{N} h(y_i; \lambda_N),$$
  where $h$ is not necessarily differentiable.
• A direct Taylor linearization may not be applicable; again use a differentiable limiting function $T(\gamma) = \lim_{N \to \infty} T_N(\gamma)$, with derivative $\zeta(\gamma)$.

47/52
Asymptotics
• We provide a set of sufficient conditions on the limiting function and a number of population quantities under which
  $$n^{*1/2} \left[ V(\widehat T(\hat\lambda)) \right]^{-1/2} \left( \widehat T(\hat\lambda) - T_N(\lambda_N) \right) \Big| \mathcal{F} \xrightarrow{d} N(0, 1),$$
  where
  $$V(\widehat T(\hat\lambda)) = \left( 1,\ [\zeta(\lambda_N)]^T \right) V(\bar z_\pi) \begin{pmatrix} 1 \\ \zeta(\lambda_N) \end{pmatrix}.$$
• The extra variance due to estimating the unknown parameter may or may not be negligible, depending on the derivative.
• We propose a kernel estimator for the unknown derivative.

48/52
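Estimating distribution function using auxiliary information
• Ratio model: $y_i = R x_i + \epsilon_i$, $\epsilon_i \sim ID(0, x_i \sigma^2)$
• Use $\widehat R x_i$ as a substitute for $y_i$, where $\widehat R = \frac{\sum_{S_\nu} y_i / \pi_i}{\sum_{S_\nu} x_i / \pi_i}$.
• Difference estimator:
  $$\widehat T(\widehat R) = \frac{1}{N} \left\{ \sum_{S_\nu} \frac{1}{\pi_i} I(y_i \le t) + \left[ \sum_{U} I(\widehat R x_i \le t) - \sum_{S_\nu} \frac{1}{\pi_i} I(\widehat R x_i \le t) \right] \right\}$$
• The extra variance due to estimating the ratio is negligible (RKM, 1990).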

49/52
Estimating a fraction below an estimated quantity
• Estimate the fraction of households in poverty when the poverty line is drawn at 60% of the median income:
  $$\widehat T(\hat q) = \frac{1}{N} \sum_{S_\nu} \frac{1}{\pi_i} I(y_i \le 0.6\,\hat q)$$
  with population quantity
  $$T_N(q_N) = \frac{1}{N} \sum_{i=1}^{N} I(y_i \le 0.6\, q_N)$$
• Assume that $\lim_{N \to \infty} T_N(\gamma) = F_Y(0.6\gamma)$; the extra variance depends on $\frac{\partial F_Y(0.6\gamma)}{\partial \gamma}$.
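
A toy sketch of the poverty-rate estimator, assuming design weights 1/π_i and normalizing by the sum of weights rather than a known N. The function names and the income data are hypothetical.

```python
def weighted_quantile(ys, ws, p):
    """Smallest sampled value whose weighted cumulative share is at least p."""
    total = sum(ws)
    cum = 0.0
    for y, w in sorted(zip(ys, ws)):
        cum += w
        if cum >= p * total:
            return y
    return max(ys)


def fraction_below(ys, ws, cutoff):
    """Weighted fraction of units with y_i <= cutoff."""
    return sum(w for y, w in zip(ys, ws) if y <= cutoff) / sum(ws)


incomes = [12.0, 17.0, 25.0, 30.0, 41.0, 55.0, 80.0, 150.0]
weights = [100.0, 100.0, 200.0, 200.0, 100.0, 100.0, 100.0, 100.0]
q_hat = weighted_quantile(incomes, weights, 0.5)                  # estimated median
poverty_rate = fraction_below(incomes, weights, 0.6 * q_hat)      # share below 60% of it
```

Here the weighted median is 30, the poverty line is 18, and the two lowest incomes carry 20% of the weight. The estimate is a step function of `q_hat`, which is why a direct Taylor expansion in the estimated median is not available.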

50/52
Nondifferentiable estimating equations
• The sample $p$-th quantile can be defined through estimating equations:
  $$\widehat S(t) = \frac{1}{N} \sum_{i \in S} \frac{1}{\pi_i} I(y_i - t \le 0) - p, \qquad \hat\xi = \inf\{t : \widehat S(t) \ge 0\}$$
  $$S_N(t) = \frac{1}{N} \sum_{i=1}^{N} I(y_i - t \le 0) - p, \qquad \xi_N = \inf\{t : S_N(t) \ge 0\}$$
• The usual practice is to linearize the estimating function, but this approach is not applicable due to nondifferentiability.
• We provide a set of sufficient conditions on the monotonicity and smoothness of $S_N(t)$ and its limit for the proof.
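
The sample quantile defined by the estimating function can be computed by scanning the sorted sample values, since the function only jumps at observed points. A minimal sketch with hypothetical function names and data:

```python
def s_hat(t, ys, pis, N, p):
    """Estimating function: weighted share of {y_i <= t} minus p."""
    return sum(1.0 / pi for y, pi in zip(ys, pis) if y - t <= 0) / N - p


def quantile_from_estimating_function(ys, pis, N, p):
    """inf{t : s_hat(t) >= 0}; the infimum is attained at a sampled y value."""
    for t in sorted(ys):
        if s_hat(t, ys, pis, N, p) >= 0:
            return t
    return None  # total weight is below p * N, so the set is empty


ys = [3.0, 1.0, 4.0, 1.5, 9.0]
pis = [0.5] * 5          # each sampled unit represents two population units
xi_hat = quantile_from_estimating_function(ys, pis, N=10, p=0.5)
```

Each sampled unit adds 0.2 to the weighted share, so the estimating function first becomes non-negative at the third smallest value. Between sample values the function is flat, which is the nondifferentiability that rules out the usual linearization.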

51/52
Concluding remarks
• Proposed an estimator for the subpopulation distance distribution and demonstrated its statistical properties.
• Application in a large-scale longitudinal survey.
• Theoretical extensions to nondifferentiable survey estimators.
