Multivariate outlier detection

1
Outlier Identification in National
Resources Inventory and Theoretical
Extensions to Nondifferentiable Survey
Estimators
Jianqiang Wang
Major Professor: Jean Opsomer
Committee: Wayne A. Fuller
Song X. Chen
Dan Nettleton
Dimitris Margaritis

2
Outline
• Introduction
• Notation and assumptions
• Mean and median-based inference
• Variance estimation
• Simulation study
• Application in National Resources Inventory
• Theoretical extensions

3
National Resources Inventory (1)
• The National Resources Inventory (NRI) is a longitudinal survey of natural resources on non-Federal land in the U.S.
• Conducted by the USDA NRCS, in cooperation with CSSM at Iowa State University.
• Produces a longitudinal database containing numerous agro-environmental variables for scientific investigation and policy-making.
• Information was updated every 5 years before 1997, and annually thereafter through a partially overlapping subsampling design.

4
National Resources Inventory (2)
• Covers various aspects of land use, farming practice, and environmentally important variables such as wetland status and soil erosion.
• Measures both the level of these variables and their change over time.
• The primary mode of data collection is a combination of aerial photography and field collection.
• Outliers arise from errors in data collection or processing, or from real points that themselves behave abnormally.

5
Outlier identification for a
longitudinal survey
• Identify outliers in periodically updated data.
• Build outlier identification rules on previous years’ data and use the rules to flag current observations.
Observed years: 2001-2005
Training set: (2001, 2002, 2003)
Test set: (2003, 2004, 2005)

6
Target variables
• Non-pseudo core points with soil erosion in years 2001-2005.
• Training set variables: broad use, land use, C factor, support practice factor (P factor), slope, slope length and USLE loss in years 2001, 2002 and 2003.
• USLE loss represents the potential long-term soil loss in tons/acre:
USLELOSS = R * K * LS * C * P
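The product form of the USLE loss means that an error in any single factor scales the loss proportionally. A minimal Python sketch of the formula follows; the function name `usle_loss` and the R, K and LS values are hypothetical and chosen only for illustration.

```python
def usle_loss(r, k, ls, c, p):
    """Return USLE soil loss (tons/acre) as the product R * K * LS * C * P."""
    factors = {"R": r, "K": k, "LS": ls, "C": c, "P": p}
    for name, value in factors.items():
        if value < 0:
            raise ValueError(f"{name} factor must be non-negative, got {value}")
    return r * k * ls * c * p


# Hypothetical point: R, K, LS and P are held fixed and only the C factor
# changes across the three training years (values are illustrative).
c_by_year = {2001: 0.013, 2002: 0.13, 2003: 0.013}
loss_by_year = {
    year: usle_loss(r=150.0, k=0.3, ls=1.2, c=c, p=1.0)
    for year, c in c_by_year.items()
}
```

In this sketch a tenfold jump in the C factor for one year produces a tenfold jump in that year's USLE loss, which is the kind of pattern an outlier rule on the joint variables could pick up.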

7
Point classification
b.u.  Point type               b.u.  Point type
1     Cultivated cropland      7     Urban and built-up land
2     Noncultivated cropland   8     Rural transportation
3     Pastureland              9     Small water areas
4     Rangeland                10    Large water areas
5     Forest land              11    Federal land
6     Minor land               12    CRP
(b.u. = broad use code)

8
Initial partitioning
• Initial partitioning uses geographical association and broad use category.
  - Partition national data into state-wise categories.
  - Collapse northeastern states.
  - Partition each region based on broad use sequence into (1,1,1), (2,2,2), (3,3,3), (12,12,12) and points with broad use change.
  - Merge points with the same broad use change pattern, e.g. (2,2,3) or (1,1,12).
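The partitioning steps can be sketched as a simple grouping by region and broad use sequence. This is an illustrative sketch only: the point identifiers, region labels and the function name `partition_points` are hypothetical, and the collapsing of northeastern states is assumed to have been applied already when the region label was assigned.

```python
from collections import defaultdict


def partition_points(points):
    """Group point ids by (region, kind, broad use sequence).

    `points` is an iterable of (point_id, region, broad_use_sequence).
    A sequence such as (1, 1, 1) is labelled "constant"; a sequence such as
    (2, 2, 3) is labelled "change", so points sharing the same change
    pattern within a region end up in the same group.
    """
    groups = defaultdict(list)
    for point_id, region, seq in points:
        seq = tuple(seq)
        kind = "constant" if len(set(seq)) == 1 else "change"
        groups[(region, kind, seq)].append(point_id)
    return dict(groups)


# Hypothetical points: (id, region, broad use in 2001, 2002, 2003).
points = [
    ("p1", "IA", (1, 1, 1)),
    ("p2", "IA", (1, 1, 1)),
    ("p3", "IA", (2, 2, 3)),
    ("p4", "Northeast", (2, 2, 3)),
    ("p5", "IA", (2, 2, 3)),
]
groups = partition_points(points)
```

Each resulting group would then be treated as its own subpopulation when building the outlier rule.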

9
Source of outlyingness
• Flag 1% of the points in the training set, and compare test distances with the 99% quantile of the training distances.
• Source of outlyingness:
\[
\hat{e}_{\nu,i} = \frac{\hat{\Sigma}_\nu^{-1/2} (\hat{\mu}_\nu - y_i)}{\bigl\| \hat{\Sigma}_\nu^{-1/2} (\hat{\mu}_\nu - y_i) \bigr\|}
\]
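The flagging rule and the source-of-outlyingness vector can be sketched with NumPy as below. This is a simplified sketch under stated assumptions: it uses the unweighted sample mean and covariance rather than design-weighted survey estimators, the data are simulated, and the names `inv_sqrt` and `outlyingness` are hypothetical.

```python
import numpy as np


def inv_sqrt(cov):
    """Symmetric inverse square root of a positive-definite matrix."""
    eigval, eigvec = np.linalg.eigh(cov)
    return eigvec @ np.diag(eigval ** -0.5) @ eigvec.T


def outlyingness(y, mu, cov):
    """Return (distances, unit direction vectors) for the rows of y.

    Row i of z is Sigma^{-1/2} (mu - y_i); its norm is the distance and
    z_i / ||z_i|| is the standardized direction of outlyingness.
    """
    z = (mu - y) @ inv_sqrt(cov)
    dist = np.linalg.norm(z, axis=1)
    safe = np.where(dist > 0, dist, 1.0)  # avoid dividing by zero
    return dist, z / safe[:, None]


# Simulated data (assumption): three variables, independent standard normal.
rng = np.random.default_rng(0)
train = rng.normal(size=(500, 3))
test = rng.normal(size=(200, 3))
test[0] = [0.0, 8.0, 0.0]  # planted anomaly in the second variable

# Unweighted estimates from the training set (assumption: no survey weights).
mu_hat = train.mean(axis=0)
cov_hat = np.cov(train, rowvar=False)

# The 99% quantile of training distances flags 1% of the training points.
train_dist, _ = outlyingness(train, mu_hat, cov_hat)
cutoff = np.quantile(train_dist, 0.99)

# Flag test points beyond the cutoff and find the dominant component.
test_dist, test_dir = outlyingness(test, mu_hat, cov_hat)
flagged = np.flatnonzero(test_dist > cutoff)
suspect_variable = np.abs(test_dir[flagged]).argmax(axis=1)
```

The component of the unit vector with the largest absolute value points to the variable contributing most to a flagged point's distance, which is one way to direct a reviewer to the suspicious variable.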

10
Analysis of flagged points
• Agricultural specialists analyzed the identified points, grouped by suspicious variable.
• C factor: almost all points were considered suspicious.
• Data entry errors
• Invalid entries
  - C factor = 1 for hayland, pastureland or CRP
• Unusual levels or trends in relation to land use
  (0.013, 0.13, 0.013, 0.013, 0.013)
  (0.011, 0.06, 0.11, 0.003, 0.003)

11
Analysis of flagged points
• P factor: all points were candidates for review because of their change over time.
• Slope length: all points were flagged because of their level, not their change over time.
(1.0, 1.0, 1.0, 0.6, 1.0)

12
Nondifferentiable survey
estimators
• The sample distance distribution \hat{D}_{\nu,d}(\hat{\mu}_\nu) is a nondifferentiable function of the estimated location parameter.
• A general class of survey estimators:
\[
\hat{T}(\hat{\lambda}) = \frac{1}{N} \sum_{i \in S_\nu} \frac{1}{\pi_i} h(y_i; \hat{\lambda})
\]
with corresponding population quantity
\[
T_N(\lambda_N) = \frac{1}{N} \sum_{i=1}^{N} h(y_i; \lambda_N),
\]
where h(y_i; \cdot) is not necessarily differentiable.
• A direct Taylor linearization may not be applicable; again use a differentiable limiting function T(\gamma) = \lim_{N \to \infty} T_N(\gamma), with derivative \zeta(\gamma).

13
Asymptotics
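• Under certain regularity conditions,
\[
n^{*1/2} \bigl[ V(\hat{T}(\hat{\lambda})) \bigr]^{-1/2} \bigl( \hat{T}(\hat{\lambda}) - T_N(\lambda_N) \bigr) \,\Big|\, \mathcal{F} \xrightarrow{d} N(0, 1),
\]
where
\[
V(\hat{T}(\hat{\lambda})) = \bigl( 1, [\zeta(\lambda_N)]^{T} \bigr) \, V(\bar{z}_\pi) \begin{pmatrix} 1 \\ \zeta(\lambda_N) \end{pmatrix}.
\]
• The extra variance due to estimating the unknown parameter may or may not be negligible.
• A kernel estimator is proposed to estimate the unknown derivative.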
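One way to read the variance formula on this slide is through a first-order expansion around the population parameter. The display below is a heuristic sketch, not the formal argument: it assumes that the estimator evaluated at the estimated parameter differs from its value at \lambda_N by approximately the derivative \zeta of the limiting function times the estimation error.

```latex
\hat{T}(\hat{\lambda}) - T_N(\lambda_N)
  \approx \bigl[ \hat{T}(\lambda_N) - T_N(\lambda_N) \bigr]
        + [\zeta(\lambda_N)]^{T} (\hat{\lambda} - \lambda_N)
```

Under this reading, the first term is the usual sampling error with the parameter known, and the second term carries the extra variance from estimating the parameter; the vector (1, [\zeta(\lambda_N)]^{T}) in the variance formula combines the two. The extra term drops out when \zeta(\lambda_N) is zero.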

14
Estimating distribution function
using auxiliary information
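• Ratio model:
\[
y_i = R x_i + \varepsilon_i, \qquad \varepsilon_i \sim N(0, x_i \sigma^2)
\]
• Use \hat{R} x_i as a substitute for y_i, where
\[
\hat{R} = \frac{\sum_{S_\nu} y_i / \pi_i}{\sum_{S_\nu} x_i / \pi_i}.
\]
• Difference estimator:
\[
\hat{T}(\hat{R}) = \frac{1}{N} \Bigl\{ \sum_{S_\nu} \frac{1}{\pi_i} I(y_i \le t)
  + \Bigl[ \sum_{U} I(\hat{R} x_i \le t) - \sum_{S_\nu} \frac{1}{\pi_i} I(\hat{R} x_i \le t) \Bigr] \Bigr\}
\]
• The extra variance due to estimating the ratio is negligible (RKM, 1990).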
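The difference estimator of the distribution function can be sketched in plain Python as follows. The data are hypothetical, the function names `ratio_estimate` and `difference_cdf` are illustrative, and the sketch assumes the auxiliary variable x is known for every unit in the population.

```python
def ratio_estimate(sample):
    """R-hat = sum(y_i / pi_i) / sum(x_i / pi_i) over sampled units.

    `sample` is a list of (x, y, inclusion_probability) triples.
    """
    num = sum(y / prob for x, y, prob in sample)
    den = sum(x / prob for x, y, prob in sample)
    return num / den


def difference_cdf(sample, population_x, t):
    """Difference estimator of the fraction of population units with y <= t."""
    n_pop = len(population_x)
    r_hat = ratio_estimate(sample)
    # Design-weighted count of sampled y values at or below t.
    ht_y = sum((y <= t) / prob for x, y, prob in sample)
    # Count of fitted values R-hat * x at or below t over the whole population.
    pop_fitted = sum(r_hat * x <= t for x in population_x)
    # Design-weighted count of fitted values at or below t in the sample.
    ht_fitted = sum((r_hat * x <= t) / prob for x, y, prob in sample)
    return (ht_y + pop_fitted - ht_fitted) / n_pop


# Hypothetical population of six units with known x; three are sampled,
# each with inclusion probability 0.5.
population_x = [1, 2, 3, 4, 5, 6]
sample = [(1, 2.0, 0.5), (3, 6.5, 0.5), (5, 9.5, 0.5)]

r_hat = ratio_estimate(sample)
estimate = difference_cdf(sample, population_x, t=7.0)
```

The bracketed correction replaces the design-weighted count of fitted values with the exact population count, so the estimator gains efficiency when \hat{R} x_i predicts y_i well.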

15
Estimating a fraction below an
estimated quantity
• Estimate the fraction of households in poverty when the poverty line is drawn at 60% of the median income:
\[
\hat{T}(\hat{q}) = \frac{1}{N} \sum_{S_\nu} \frac{1}{\pi_i} I(y_i \le 0.6 \hat{q})
\]
with population quantity
\[
T_N(q_N) = \frac{1}{N} \sum_{i=1}^{N} I(y_i \le 0.6 q_N).
\]
• Assume that \lim_{N \to \infty} T_N(\gamma) = F_Y(0.6\gamma); the extra variance depends on \partial F_Y(0.6\gamma) / \partial \gamma.
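The poverty-rate estimator and the derivative it depends on can be sketched in plain Python. Since the derivative of F_Y(0.6\gamma) with respect to \gamma is 0.6 times the income density at 0.6\gamma, the sketch below estimates it with a design-weighted Gaussian kernel density. The kernel, the bandwidth, the income values and all function names are assumptions made for illustration, not the specific kernel estimator proposed in the talk.

```python
import math


def weighted_quantile(values, weights, prob):
    """Smallest value whose cumulative weight share reaches `prob`."""
    pairs = sorted(zip(values, weights))
    total = sum(w for _, w in pairs)
    running = 0.0
    for value, w in pairs:
        running += w
        if running >= prob * total:
            return value
    return pairs[-1][0]


def poverty_rate(incomes, incl_probs, n_pop):
    """Return (estimated median, estimated fraction below 60% of it)."""
    weights = [1.0 / p for p in incl_probs]
    q_hat = weighted_quantile(incomes, weights, 0.5)
    below = sum(w for y, w in zip(incomes, weights) if y <= 0.6 * q_hat)
    return q_hat, below / n_pop


def derivative_estimate(incomes, incl_probs, n_pop, q_hat, bandwidth):
    """Kernel estimate of d/d(gamma) F_Y(0.6 * gamma) at gamma = q_hat.

    Uses 0.6 * f_Y(0.6 * q_hat), with f_Y estimated by a design-weighted
    Gaussian kernel density (an assumed choice of kernel).
    """
    point = 0.6 * q_hat
    norm = bandwidth * math.sqrt(2.0 * math.pi)
    density = sum(
        math.exp(-0.5 * ((point - y) / bandwidth) ** 2) / (p * norm)
        for y, p in zip(incomes, incl_probs)
    ) / n_pop
    return 0.6 * density


# Hypothetical sample of 10 households from a population of 100,
# each with inclusion probability 0.1.
incomes = [12, 18, 25, 40, 50, 61, 70, 85, 90, 120]
incl_probs = [0.1] * 10
q_hat, rate = poverty_rate(incomes, incl_probs, n_pop=100)
slope = derivative_estimate(incomes, incl_probs, 100, q_hat, bandwidth=10.0)
```

A larger estimated slope means the poverty rate is more sensitive to error in the estimated median, so the extra variance term matters more.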

16
Concluding remarks
• Proposed an estimator for the subpopulation distance distribution and demonstrated its statistical properties.
• Applied the method to a large-scale longitudinal survey.
• Developed theoretical extensions to nondifferentiable survey estimators.
