Estimating Distance Distributions
and Testing Observation
Outlyingness for Complex Surveys
Jianqiang Wang
Major Professor: Jean Opsomer
Committee: Wayne A. Fuller
Song X. Chen
Dan Nettleton
Dimitris Margaritis
2/52
Outline
• Introduction
• Notation and assumptions
• Mean- and median-based inference
• Variance estimation
• Simulation study
• Application in the National Resources Inventory
• Theoretical extensions
3/52
Structure of survey data
• Many finite populations targeted by surveys consist of relatively homogeneous subpopulations.
• "Homogeneity" refers to the variables being collected, which are generally different from the design variables.
• Example
  • Interested in the health condition of U.S. residents between 45 and 60 years old; stratify by county; homogeneity refers to the health-condition variables we collect.
4/52
Conceptual ideas
• Given this population structure, provide a measure of outlyingness and flag unusual points.
• Assign each point to a subpopulation and define a measure of outlyingness within it.
• Dimension reduction: describe multivariate populations, identify outliers, and discriminate objects.
5/52
Outlier identification procedure
(1)
• Identify the target variables on which we want to test outlyingness.
• Partition the population into a number of relatively homogeneous groups.
• Define a measure of subpopulation center and a distance metric from each point to its subpopulation center.
• Define the outlyingness of each point as the fraction of points in its subpopulation with a less extreme distance.
6/52
Outlier identification procedure
(2)
• Estimate the distance distribution and the outlyingness of each point (see the sketch after this slide).
• Flag observations whose measure of outlyingness exceeds a prespecified threshold (e.g., 0.95 or 0.98).
• Make decisions on the list of suspicious points.
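A minimal end-to-end sketch of steps 3-5 of the procedure, assuming equal weights and Euclidean distance to the group mean; the variable and function names are illustrative, not from the thesis:

```python
import numpy as np

def outlyingness(y, groups):
    """For each point, the fraction of points in its group whose distance
    to the group mean is no larger than the point's own distance."""
    y, groups = np.asarray(y, float), np.asarray(groups)
    out = np.empty(len(y))
    for g in np.unique(groups):
        idx = np.where(groups == g)[0]
        d = np.linalg.norm(y[idx] - y[idx].mean(axis=0), axis=1)
        # empirical distance distribution evaluated at each point's distance
        out[idx] = (d[:, None] >= d[None, :]).mean(axis=1)
    return out

y = np.random.default_rng(1).normal(size=(200, 2))
groups = np.repeat([0, 1], 100)
score = outlyingness(y, groups)
flagged = np.where(score > 0.98)[0]   # prespecified threshold
```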
7/52
Inference in survey sampling
• The target measure of outlyingness is defined at the finite-population level.
• Two mechanisms
  • Mechanism for generating the finite population
  • Mechanism for drawing a sample
• Condition on the finite population and use design-based inference.
• Asymptotic theory in survey sampling
  • Sequence of finite populations
  • Sequence of sampling designs
8/52
Sequence of finite populations
• Let $\nu$ be the population index.
• Associated with the $i$-th population element is a $p$-dimensional vector $y_i = (y_{i,1}, \ldots, y_{i,p})$, with inclusion probability $\pi_i$.
• Finite population $U_\nu = \{1, 2, \cdots, N_\nu\}$ with subpopulations $U_\nu = \cup_{g=1}^{G} U_{\nu g}$ of sizes $N_{\nu g}$, and sample $A_\nu = \cup_{g=1}^{G} A_{\nu,g}$ of sizes $n_{\nu g}$, expected sample sizes $n^*_{\nu g} = E(n_{\nu g} \mid F_\nu)$.
• Population composition $N_{\nu g} = f_{Ng} N_\nu$, with $f_{Ng} \in [f_L, f_H]$.
• Assume we know $G$ and the subpopulation association.
• Let $F_\nu$ be the power set of $\{y_1, y_2, \cdots, y_{N_\nu}\}$.
9/52
Sequence of sampling designs
• A probability sample $A_N$ is drawn from $U_N$ with respect to some measurable design.
• Associate a sampling indicator $I(i \in A_N)$ with each element.
• Inclusion probabilities
  $\pi_i = \Pr(i \in A_N) = E(I(i \in A_N) \mid F_N)$
  $\pi_{ij} = \Pr(i, j \in A_N) = E(I(i \in A_N)\, I(j \in A_N) \mid F_N)$
• Sample size $n$ with expectation $n^* = E(n \mid F_N)$.
10/52
Examples of sampling designs
and estimators
• Simple random sampling without replacement
  • Inclusion probabilities $\pi_i = \frac{n}{N}$, $\pi_{ij} = \frac{n(n-1)}{N(N-1)}$ for $i \neq j$
• Poisson sampling
  • Arbitrary $\pi_i$; $\pi_{ij} = \pi_i \pi_j$ for $i \neq j$ and $\pi_{ii} = \pi_i$
• Horvitz-Thompson and Hajek estimators of the mean (see the sketch below)
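For the last bullet, a small sketch of the two mean estimators under SRS without replacement; the example data and helper names are illustrative, not from the thesis:

```python
import numpy as np

def ht_mean(y, pi, N):
    """Horvitz-Thompson estimator of the population mean: (1/N) * sum(y_i / pi_i)."""
    return np.sum(y / pi) / N

def hajek_mean(y, pi):
    """Hajek estimator: replace N with the estimated size N_hat = sum(1 / pi_i)."""
    return np.sum(y / pi) / np.sum(1.0 / pi)

# Example: SRS without replacement, pi_i = n / N for every sampled unit
rng = np.random.default_rng(0)
N, n = 5000, 500
pop = rng.gamma(2.0, 2.0, size=N)
sample = rng.choice(pop, size=n, replace=False)
pi = np.full(n, n / N)
print(ht_mean(sample, pi, N), hajek_mean(sample, pi), pop.mean())
```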
11/52
Norms
• Use the notion of a norm to quantify the distance between an observation and the measure of center.
• A norm $\|\cdot\| : \mathbb{R}^p \to \mathbb{R}^+$ satisfies:
  • Non-degeneracy: $\|\mu\| = 0 \Leftrightarrow \mu = 0$
  • Homogeneity: $\|\alpha \mu\| = |\alpha|\, \|\mu\|$
  • Triangle inequality: $\|\mu_1 + \mu_2\| \le \|\mu_1\| + \|\mu_2\|$
12/52
Example of norms and unit
circle
• General norms (see the sketch below)
  • Manhattan distance: $L_1: \|\mu\|_1 = \sum_{i=1}^{p} |\mu_i|$
  • Euclidean distance: $L_2: \|\mu\|_2 = \sqrt{\sum_{i=1}^{p} \mu_i^2}$
  • Supremum norm: $L_\infty: \|\mu\|_\infty = \max\{|\mu_1|, \cdots, |\mu_p|\}$
  • Quadratic norm: $L_A: \|\mu\|_A = \sqrt{\mu' A \mu}$
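The four norms written out for a numeric vector; a small sketch where `A` is any user-supplied symmetric positive definite matrix:

```python
import numpy as np

def norm_l1(mu):      return np.sum(np.abs(mu))          # Manhattan
def norm_l2(mu):      return np.sqrt(np.sum(mu ** 2))    # Euclidean
def norm_sup(mu):     return np.max(np.abs(mu))          # supremum
def norm_quad(mu, A): return np.sqrt(mu @ A @ mu)        # quadratic norm ||mu||_A

mu = np.array([3.0, -4.0])
A = np.array([[2.0, 0.5], [0.5, 1.0]])
print(norm_l1(mu), norm_l2(mu), norm_sup(mu), norm_quad(mu, A))
```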
13/52
Distribution of population
distances
• Population: $D_{\nu,d}(\mu_\nu) = \frac{1}{N_\nu} \sum_{U_\nu} I(\|y_i - \mu_\nu\| \le d)$, where $\mu_\nu$ is the location and $d$ the radius.
• Sample: $\hat{D}_{\nu,d}(\hat{\mu}_\nu) = \frac{1}{\hat{N}_\nu} \sum_{A_\nu} \frac{1}{\pi_i} I(\|y_i - \hat{\mu}_\nu\| \le d)$, where $\hat{N}_\nu = \sum_{A_\nu} \frac{1}{\pi_i}$.
• Measure of center: mean vector (see the sketch below)
  • Population: $\mu_\nu = \frac{1}{N_\nu} \sum_{U_\nu} y_i$
  • Sample: $\hat{\mu}_\nu = \frac{1}{\hat{N}_\nu} \sum_{A_\nu} \frac{y_i}{\pi_i}$
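A minimal sketch of the Hajek-type estimators on this slide, for a single subpopulation with the Euclidean norm; function names are illustrative, `y` is an n-by-p array of sample observations and `pi` the inclusion probabilities:

```python
import numpy as np

def weighted_mean(y, pi):
    """Hajek estimator of the subpopulation mean vector."""
    w = 1.0 / pi
    return (w[:, None] * y).sum(axis=0) / w.sum()

def dist_distribution(y, pi, d, center=None):
    """Estimated distance distribution: weighted fraction of points whose
    Euclidean distance to the (estimated) center is at most d."""
    w = 1.0 / pi
    if center is None:
        center = weighted_mean(y, pi)
    dist = np.linalg.norm(y - center, axis=1)
    return np.sum(w * (dist <= d)) / np.sum(w)
```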
14/52
Bivariate population
[Figure: a bivariate population with center $\mu$ and the circle $\{y : \|y - \mu\| = d\}$, illustrating $D_{\nu,d}(\mu_\nu)$ and $\hat{D}_{\nu,d}(\hat{\mu}_\nu)$.]
15/52
Nondifferentiability with respect
to location
16/52
General design assumptions
• Assumptions on $\pi_i$, $n$, and the design variance:
  • $K_L \le \frac{N}{n}\pi_j \le K_U$
  • $n = O_p(N_\nu^\beta)$, with $\beta \in \left(\frac{2p}{2p+1}, 1\right]$
  • For any vector $z$ with finite $2+\delta$ moments, define $\bar{z}_{N,\pi}$ as the HT estimator of the mean and assume $\mathrm{Var}(\bar{z}_{N,\pi} \mid F_N) \le K_1 \mathrm{Var}_{SRS}(\bar{z}_{N,SRS} \mid F_N)$
  • For any $z$ with positive definite population variance-covariance matrix and finite fourth moment,
    $n^{*1/2}(\bar{z}_{N,\pi} - \bar{z}_N) \mid F_N \xrightarrow{d} N(0, \Sigma_{zz})$ and $[V(\bar{z}_{N,\pi} \mid F_N)]^{-1} \hat{V}_{HT}\{\bar{z}_{N,\pi}\} - I_{p \times p} = O_p(n^{*-1/2})$
17/52
Application specific
assumption 1
• The population distance distribution converges to a limiting function:
  $\lim_{N \to \infty} D_{\nu,d}(\mu) = D_d(\mu)$, where $(d, \mu) \in [0, \infty) \times \mathbb{R}^p$.
• The limiting function $D_d(\mu)$ is continuous in $d \in [0, \infty)$ and $\mu \in \mathbb{R}^p$, with finite derivatives $\frac{\partial D_d(\mu)}{\partial d}$, $\frac{\partial D_d(\mu)}{\partial \mu}$ and $\frac{\partial^2 D_d(\mu)}{\partial \mu^2}$.
• The norm $\|\cdot\|$ is continuous on $\mathbb{R}^p$, with a continuous derivative $\psi(\cdot)$ and bounded second derivative matrix $H_s(\cdot)$.
18/52
Application specific
assumption 2
• The population quantity
  $\sqrt{\frac{N_\nu}{n}} \left\{ \frac{1}{N_\nu} \sum_{U_\nu} I(d < \|y_i - \mu\| \le d + h_{N_\nu}) - \frac{\partial D_d(\mu)}{\partial d}\, h_{N_\nu} \right\} \to 0,$
  where $h_\nu = O(N_\nu^{-\alpha})$ and $\alpha \in \left[\frac{1}{4}, 1\right)$.
• Justification assumes a probabilistic model.
• Markov's inequality, Borel-Cantelli lemma.
19/52
Application specific
assumption 3
• The population quantity
  $\frac{n_\nu^{*1/2}}{N_\nu} \sum_{U_\nu} \left[ I(\|y_i - \mu - n_\nu^{*-1/2} s\| \le d) - I(\|y_i - \mu\| \le d) - D_d(\mu + n_\nu^{*-1/2} s) + D_d(\mu) \right]$
  converges to 0 uniformly for $\mu \in \mathbb{R}^p$ and $s \in C_s$.
• Justification assumes a probabilistic model.
• Proof: empirical process theory.
20/52
Design consistency
• Decomposition
  $n_\nu^{*1/2}\left( \hat{D}_{\nu,d}(\hat{\mu}_\nu) - D_{\nu,d}(\mu_\nu) \right)
   = n_\nu^{*1/2}\left( \hat{D}_{\nu,d}(\hat{\mu}_\nu) - \hat{D}_{\nu,d}(\mu_\nu) - D_d(\hat{\mu}_\nu) + D_d(\mu_\nu) \right)
   + n_\nu^{*1/2}\left( \hat{D}_{\nu,d}(\mu_\nu) - D_{\nu,d}(\mu_\nu) \right)
   + n_\nu^{*1/2}\left( D_d(\hat{\mu}_\nu) - D_d(\mu_\nu) \right)$
• Intermediate result
  $n_\nu^{*1/2}\left( \hat{D}_{\nu,d}(\hat{\mu}_\nu) - \hat{D}_{\nu,d}(\mu_\nu) - D_d(\hat{\mu}_\nu) + D_d(\mu_\nu) \right) \xrightarrow{p} 0$
• Consistency
  $n_\nu^{*1/2}\left( \hat{D}_{\nu,d}(\hat{\mu}_\nu) - D_d(\mu_\nu) \right) \Big| F_\nu = O_p(1)$
21/52
Asymptotic normality
• Let $b_{\mu,i} = (I(\|y_i - \mu_\nu\| \le d),\, 1,\, y_i')'$ and $\Sigma_{\mu,d}$ be the design variance-covariance matrix of the HT estimator of the mean of $b_{\mu,i}$.
• Asymptotic normality
  $\left( a_\mu' \Sigma_{\mu,d}\, a_\mu \right)^{-1/2} \left( \hat{D}_{\nu,d}(\hat{\mu}_\nu) - D_{\nu,d}(\mu_\nu) \right) \Big| F_\nu \xrightarrow{d} N(0, 1),$
  where
  $a_\mu = \left[ 1, \; -D_{\nu,d}(\mu_\nu) - \frac{\partial D_d(\mu_\nu)}{\partial \mu_\nu}^{T} \mu_\nu, \; \frac{\partial D_d(\mu_\nu)}{\partial \mu_\nu}^{T} \right]'.$
  The $-D_{\nu,d}(\mu_\nu)$ term accounts for the unknown subpopulation size and the gradient terms for the unknown subpopulation mean; if the subpopulation size and mean were known, only the leading 1 would remain.
22/52
Multivariate median
• Mean vector
• Generalized median
  • Population: $q_\nu = \arg\inf_q \sum_{U_\nu} \|y_i - q\|$
  • Sample: $\hat{q}_\nu = \arg\inf_q \sum_{A_\nu} \frac{1}{\pi_i} \|y_i - q\|$
• Existence and uniqueness
23/52
Multivariate median
• Estimating equations
  • Population: $\sum_{U_\nu} \psi(y_i - q) = 0$
  • Sample: $\sum_{A_\nu} \frac{1}{\pi_i} \psi(y_i - q) = 0$
• Linearization of $\hat{q}_\nu$:
  $\hat{q}_\nu = q_\nu + \left[ \frac{1}{N_\nu} \sum_{i \in A_\nu} \frac{H_s(y_i - q_\nu)}{\pi_i} \right]^{-1} \frac{1}{N_\nu} \sum_{i \in A_\nu} \frac{\psi(y_i - q_\nu)}{\pi_i} + o_p(n_\nu^{*-1/2})$
• What if the estimating equation is not differentiable? (A computational sketch follows.)
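The weighted spatial median above has no closed form; one common way to solve the sample estimating equation numerically is a Weiszfeld-type iteration, sketched below with design weights $1/\pi_i$ (an illustration of the computation, not necessarily the algorithm used in the thesis):

```python
import numpy as np

def weighted_spatial_median(y, pi, tol=1e-8, max_iter=500):
    """Minimize sum_i (1/pi_i) * ||y_i - q|| (Euclidean norm) by Weiszfeld iteration."""
    w = 1.0 / pi
    q = (w[:, None] * y).sum(axis=0) / w.sum()       # start at the weighted mean
    for _ in range(max_iter):
        d = np.linalg.norm(y - q, axis=1)
        d = np.where(d < 1e-12, 1e-12, d)            # guard against division by zero
        u = w / d
        q_new = (u[:, None] * y).sum(axis=0) / u.sum()
        if np.linalg.norm(q_new - q) < tol:
            return q_new
        q = q_new
    return q
```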
24/52
Median-based distances
Asymptotic results
• Design consistency and asymptotic normality of $\hat{q}_\nu$ for $q_\nu$.
• Design consistency and asymptotic normality of $\hat{D}_{\nu,d}(\hat{q}_\nu)$ as an estimator of $D_{\nu,d}(q_\nu)$.
25/52
Mahalanobis distances
• Mean- and median-based inference.
• Choose an appropriate norm to match the shape of the underlying multivariate distribution.
• Estimate the variance-covariance matrix or another shape measure of the subpopulation, and use the Mahalanobis distance.
• Estimate the distribution of Mahalanobis distances.
• See the application section for more details.
26/52
Naive variance estimator
W Use mean-based case to explain variance estimators.
W Recall the asymptotic variance of :
where
W Claim: The extra variance due to estimating the center can be
ignored in elliptical distributions using quadratic norm.
W Naïve variance estimator, ignoring the gradient vector:
bDº ;d(^¹ º )
V
³
bDº ;d(^¹ º )
´
= a0
¹ §¹ ;da¹
a¹ =
µ
1; ¡Dº ;d(¹ º ) ¡
³
@D d (¹ º )
@¹ º
´0
¹ º ;
³
@D d (¹ º )
@¹ º
´0
¶0
:
^¾2
¹ ;d;nai ve =
³
1; ¡ bDº ;d(^¹ º )
´
b§¹ ;d
³
1; ¡ bDº ;d(^¹ º )
´0
:
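A sketch of the naive estimator under the additional assumption of Poisson sampling, where it reduces to the usual linearization variance estimator of a Hajek ratio applied to the residuals $I(\|y_i - \hat{\mu}\| \le d) - \hat{D}$:

```python
import numpy as np

def naive_variance_poisson(y, pi, d):
    """Naive variance of the Hajek distance-distribution estimator under
    Poisson sampling, ignoring the effect of estimating the center."""
    w = 1.0 / pi
    N_hat = w.sum()
    mu_hat = (w[:, None] * y).sum(axis=0) / N_hat
    ind = (np.linalg.norm(y - mu_hat, axis=1) <= d).astype(float)
    D_hat = np.sum(w * ind) / N_hat
    e = (ind - D_hat) / N_hat                     # linearized variable of the ratio
    return np.sum((1.0 - pi) / pi**2 * e**2), D_hat
```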
27/52
Estimate the gradient vector
by kernel smoothing
• Idea: estimate $D_d(\mu) = \lim_{\nu \to \infty} \frac{1}{N_\nu} \sum_{U_\nu} I(\|y_i - \mu\| \le d)$ by
  $\frac{1}{\hat{N}_\nu} \sum_{A_\nu} K\!\left( \frac{d - \|y_i - \mu\|}{h} \right) \frac{1}{\pi_i},$
  where $K(\cdot) = \int_{-\infty}^{\cdot} k(t)\, dt$ is an integrated kernel, e.g. the CDF of the standard normal.
• Kernel estimator of the gradient (sketch below):
  $\hat{\zeta}_{\nu,d}(\hat{\mu}_\nu) = \frac{1}{\hat{N}_\nu h} \sum_{A_\nu} k\!\left( \frac{d - \|y_i - \hat{\mu}_\nu\|}{h} \right) \psi(y_i - \hat{\mu}_\nu) \frac{1}{\pi_i}$
• Design consistent for $\frac{\partial D_d(\mu_\nu)}{\partial \mu_\nu}$ under mild assumptions.
• Jackknife variance estimation has been proposed for the mean-based case.
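A sketch of $\hat{\zeta}_{\nu,d}(\hat{\mu}_\nu)$ with a Gaussian kernel and the Euclidean norm, for which $\psi(y_i - \mu) = (y_i - \mu)/\|y_i - \mu\|$; the bandwidth `h` is user-chosen and the code is an illustration, not the thesis implementation:

```python
import numpy as np
from scipy.stats import norm

def gradient_estimator(y, pi, d, h):
    """Kernel estimate of the gradient of D_d(mu) with respect to mu,
    using a Gaussian kernel and the Euclidean norm (psi(u) = u / ||u||)."""
    w = 1.0 / pi
    N_hat = w.sum()
    mu_hat = (w[:, None] * y).sum(axis=0) / N_hat
    r = y - mu_hat
    dist = np.linalg.norm(r, axis=1)
    dist = np.where(dist < 1e-12, 1e-12, dist)
    psi = r / dist[:, None]                       # gradient of the Euclidean norm
    k = norm.pdf((d - dist) / h)                  # Gaussian kernel density
    return (w[:, None] * k[:, None] * psi).sum(axis=0) / (N_hat * h)
```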
28/52
Jackknife variance estimator
• Recall $\hat{D}_{\nu,d}(\hat{\mu}_\nu) = \frac{1}{\hat{N}_\nu} \sum_{A_\nu} \frac{1}{\pi_i} I(\|y_i - \hat{\mu}_\nu\| \le d)$.
• Recalculate the mean for each jackknife replicate? Inconsistent!
• Proposed idea: incorporate an estimated gradient vector in the replication estimation (sketch below).
• For the $l$-th replicate sample, calculate
  $\hat{D}^{(l)}(\hat{\mu}_\nu) = \hat{D}^{(l)}_{\nu,d}(\hat{\mu}_\nu) + \hat{\zeta}_{\nu,d}(\hat{\mu}_\nu)\left( \hat{\mu}^{(l)}_\nu - \hat{\mu}_\nu \right)$
  and use
  $\hat{V}_{JK}\left( \hat{D}_{\nu,d}(\hat{\mu}_\nu) \right) = \sum_{l=1}^{L} c_l \left( \hat{D}^{(l)}(\hat{\mu}_\nu) - \hat{D}_{\nu,d}(\hat{\mu}_\nu) \right)^2$
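A delete-one jackknife sketch with the gradient correction, reusing `weighted_mean`, `dist_distribution` and `gradient_estimator` from the earlier sketches; the single-stratum setup and the replicate factor $c_l = (n-1)/n$ are simplifying assumptions, not the survey's actual replication design:

```python
import numpy as np

def jackknife_variance(y, pi, d, h):
    """Delete-one jackknife with a gradient correction for the estimated center."""
    n = len(y)
    D_full = dist_distribution(y, pi, d)                 # uses the estimated center
    mu_full = weighted_mean(y, pi)
    zeta = gradient_estimator(y, pi, d, h)
    reps = np.empty(n)
    for l in range(n):
        keep = np.arange(n) != l
        y_l, pi_l = y[keep], pi[keep] * (n - 1) / n      # rescale replicate weights
        mu_l = weighted_mean(y_l, pi_l)
        # keep the full-sample center inside the indicator; correct with the gradient
        D_l = dist_distribution(y_l, pi_l, d, center=mu_full)
        reps[l] = D_l + zeta @ (mu_l - mu_full)
    return (n - 1) / n * np.sum((reps - D_full) ** 2)
```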
29/52
Simulation study
• Goals of the simulation study:
  • Assess the asymptotic properties of the estimators.
  • Compare the naive variance estimator with the kernel-based estimator.
• Simulation parameters
  • $p = 2$, $G = 5$.
  • Subpopulations 1-4 are elliptically contoured; subpopulation 5 is skewed.
  • Stratified SRS.
  • Norm: Euclidean norm.
30/52
Simulated population
31/52
Subpopulation distance
distribution functions
32/52
$N$ = 5000, $n$ = 1000, $G$ = 5: effect of estimating the center

                                                       Cluster 4              Cluster 5
  $d$                                              1.00   1.41   2.45     1.00   1.41   2.45
  $D_{\nu,d}(\mu_\nu)$                              .44    .54    .71      .31    .52    .85
  bias$(\hat{D}(\mu))$ / sd$(\hat{D}(\hat{\mu}))$  -0.11   0.00  -0.00    -0.00  -0.00   0.05
  bias$(\hat{D}(\hat{\mu}))$ / sd$(\hat{D}(\hat{\mu}))$  -0.01   0.00  -0.01     0.00   0.00   0.01
  sd$(\hat{D}(\hat{\mu}))$ / sd$(\hat{D}(\mu))$     1.03   1.00   1.00     1.30   1.13   1.00
33/52
$N$ = 5000, $n$ = 200, $G$ = 5: effect of estimating the center

                                                       Cluster 4              Cluster 5
  $d$                                              1.00   1.41   2.45     1.00   1.41   2.45
  $D_{\nu,d}(\mu_\nu)$                              .43    .53    .68      .35    .55    .88
  bias$(\hat{D}(\mu))$ / sd$(\hat{D}(\hat{\mu}))$  -0.28  -0.05   0.12     0.05   0.14   0.10
  bias$(\hat{D}(\hat{\mu}))$ / sd$(\hat{D}(\hat{\mu}))$   0.03   0.04   0.06     0.00  -0.01   0.02
  sd$(\hat{D}(\hat{\mu}))$ / sd$(\hat{D}(\mu))$     1.17   1.03   1.01     1.16   1.04   0.96
34/52
$N$ = 5000, $n$ = 1000, $G$ = 5: average estimated variance relative to the MC variance

                                                             Cluster 4              Cluster 5
  $d$                                                    1.00   1.41   2.45     1.00   1.41   2.45
  $D_{\nu,d}(\mu_\nu)$                                   0.44   0.54   0.71     0.31   0.52   0.85
  $\hat{\sigma}^2_{d,NV} / \hat{\sigma}^2_{d,MC}$        0.94   1.00   1.00     0.53   0.78   1.07
  $\hat{\sigma}^2_{d,SM} / \hat{\sigma}^2_{d,MC}$, h=0.1  1.21   1.15   1.12     1.00   1.01   1.04
  $\hat{\sigma}^2_{d,SM} / \hat{\sigma}^2_{d,MC}$, h=0.4  1.07   1.06   1.04     0.85   0.94   0.98
35/52
NRI application
• Introduction to the NRI.
• Outlier identification for a longitudinal survey.
• Strategy for initial partitioning in the NRI.
• How to define Mahalanobis distances.
• Analysis of identified points.
36/52
National Resources Inventory (1)
• The National Resources Inventory is a longitudinal survey of natural resources on non-Federal land in the U.S.
• Conducted by the USDA NRCS, in cooperation with CSSM at Iowa State University.
• Produces a longitudinal database containing numerous agro-environmental variables for scientific investigation and policy-making.
• Information was updated every 5 years before 1997 and is now updated annually through a partially overlapping subsampling design.
37/52
National Resources Inventory (2)
• Covers various aspects of land use, farming practice, and environmentally important variables such as wetland status and soil erosion.
• Measures both the level of and the change over time in these variables.
• The primary mode of data collection is a combination of aerial photography and field collection.
• Outliers arise from errors in data collection or processing, or from real points that genuinely behave abnormally.
38/52
Outlier identification for a
longitudinal survey
• Identify outliers in periodically updated data.
• Build outlier identification rules on previous years' data and use the rules to flag current observations.
• Observed years 2001-2005: training set = (2001, 2002, 2003), test set = (2003, 2004, 2005).
39/52
Target variables
• Non-pseudo core points with soil erosion in years 2001-2005.
• Variables: broad use, land use, USLE C factor, support practice factor, slope, slope length, and USLE loss.
• USLE loss represents the potential long-term soil loss in tons/acre:
  USLELOSS = R * K * LS * C * P
40/52
Point classification
  b.u.  Point type                b.u.  Point type
  1     Cultivated cropland       7     Urban and built-up land
  2     Noncultivated cropland    8     Rural transportation
  3     Pastureland               9     Small water areas
  4     Rangeland                 10    Large water areas
  5     Forest land               11    Federal land
  6     Minor land                12    CRP
41/52
Initial partitioning
• Initial partitioning uses geographical association and the broad use category.
  • Partition the national data into state-wise categories.
  • Collapse the northeastern states.
  • Partition each region based on the broad use sequence into (1,1,1), (2,2,2), (3,3,3), (12,12,12) and points with a broad use change.
  • Merge points with the same broad use change pattern, say (2,2,3), (1,1,12).
42/52
Defining distances
• Estimate the subpopulation mean vector and covariance matrix:
  $\hat{\mu}_\nu = \frac{1}{\hat{N}_\nu} \sum_{S_\nu} \frac{y_i}{\pi_i}, \qquad \hat{\Sigma}_\nu = \frac{1}{\hat{N}_\nu} \sum_{S_\nu} (y_i - \hat{\mu}_\nu)(y_i - \hat{\mu}_\nu)' \frac{1}{\pi_i}$
• Calculate the distance to the center:
  $\|y_i - \hat{\mu}_\nu\|_{\hat{\Sigma}_\nu} = \sqrt{(y_i - \hat{\mu}_\nu)' \hat{\Sigma}^{-}_\nu (y_i - \hat{\mu}_\nu)}$
• The inverse matrix $\hat{\Sigma}^{-}_\nu$ is defined through a principal value decomposition (sketch below).
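A sketch of the weighted Mahalanobis distances with the inverse taken through an eigendecomposition, dropping near-zero eigenvalues; the tolerance is an assumption for illustration:

```python
import numpy as np

def mahalanobis_distances(y, pi, tol=1e-10):
    """Design-weighted mean and covariance, and the Mahalanobis distance of each
    point, with the covariance inverted through its eigendecomposition."""
    w = 1.0 / pi
    N_hat = w.sum()
    mu = (w[:, None] * y).sum(axis=0) / N_hat
    r = y - mu
    cov = (w[:, None, None] * (r[:, :, None] * r[:, None, :])).sum(axis=0) / N_hat
    vals, vecs = np.linalg.eigh(cov)
    inv_vals = np.where(vals > tol, 1.0 / vals, 0.0)     # generalized inverse
    cov_inv = vecs @ np.diag(inv_vals) @ vecs.T
    dist = np.sqrt(np.einsum("ij,jk,ik->i", r, cov_inv, r))
    return dist, mu, cov_inv
```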
43/52
Source of outlyingness
• Flagged 1% of the points in the training set, and compared test distances with the 99% quantile of the training distances.
• Source of outlyingness:
  $\hat{e}_{\nu,i} = \frac{\hat{\Sigma}^{-1/2}_\nu (\hat{\mu}_\nu - y_i)}{\| \hat{\Sigma}^{-1/2}_\nu (\hat{\mu}_\nu - y_i) \|}$
44/52
Analysis of flagged points
• Agricultural specialists analyzed the identified points by suspicious variable.
• C factor: almost all flagged points were considered suspicious.
  • Data entry errors, e.g. (0.013, 0.13, 0.013, 0.013, 0.013)
  • Invalid entries: C factor = 1 for hayland, pastureland, or CRP
  • Unusual levels or trends in relation to land use, e.g. (0.011, 0.06, 0.11, 0.003, 0.003)
45/52
Analysis of flagged points
• P factor: all flagged points are candidates for review because of the change over time, e.g. (1.0, 1.0, 1.0, 0.6, 1.0).
• Slope length: all points were flagged because of the level, not the change over time.
• Land use: most points were flagged because of a change in the type of hayland or pastureland over time; not a major concern to NRCS reviewers.
46/52
Nondifferentiable survey
estimators
• The sample distance distribution is a nondifferentiable function of the estimated location parameter.
• A general class of survey estimators:
  $\hat{T}(\hat{\lambda}) = \frac{1}{N} \sum_{i \in S_\nu} \frac{1}{\pi_i} h(y_i, \hat{\lambda}),$
  with corresponding population quantity
  $T_N(\lambda_N) = \frac{1}{N} \sum_{i=1}^{N} h(y_i, \lambda_N),$
  where $h$ is not necessarily differentiable (e.g. $\hat{D}_{\nu,d}(\hat{\mu}_\nu)$).
• A direct Taylor linearization may not be applicable; again use a differentiable limiting function $T(\gamma) = \lim_{N \to \infty} T_N(\gamma)$, with derivative $\zeta(\gamma)$.
47/52
Asymptotics
• We provide a set of sufficient conditions on the limiting function and a number of population quantities under which
  $n^{*1/2} \left[ V(\hat{T}(\hat{\lambda})) \right]^{-1/2} \left( \hat{T}(\hat{\lambda}) - T_N(\lambda_N) \right) \Big| F \xrightarrow{d} N(0, 1),$
  where
  $V(\hat{T}(\hat{\lambda})) = \left( 1, [\zeta(\lambda_N)]^T \right) V(\bar{z}_\pi) \begin{pmatrix} 1 \\ \zeta(\lambda_N) \end{pmatrix}.$
• The extra variance due to estimating the unknown parameter may or may not be negligible, depending on the derivative.
• Propose a kernel estimator of the unknown derivative.
48/52
Estimating distribution function
using auxiliary information
• Ratio model: $y_i = R x_i + \epsilon_i$, $\epsilon_i \sim ID(0, x_i \sigma^2)$.
• Use $\hat{R} x_i$ as a substitute for $y_i$, where $\hat{R} = \frac{\sum_{S_\nu} y_i / \pi_i}{\sum_{S_\nu} x_i / \pi_i}$.
• Difference estimator (sketch below):
  $\hat{T}(\hat{R}) = \frac{1}{N} \left\{ \sum_{S_\nu} \frac{1}{\pi_i} I(y_i \le t) + \left[ \sum_{U} I(\hat{R} x_i \le t) - \sum_{S_\nu} \frac{1}{\pi_i} I(\hat{R} x_i \le t) \right] \right\}$
• The extra variance due to estimating the ratio is negligible (RKM, 1990).
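A sketch of the difference estimator, with `x_U` the auxiliary variable for the full population and (`y_s`, `x_s`, `pi`) the sample data (illustrative names):

```python
import numpy as np

def difference_estimator(y_s, x_s, pi, x_U, t):
    """Difference estimator of F_Y(t) under the ratio model y = R*x + e."""
    N = len(x_U)
    R_hat = np.sum(y_s / pi) / np.sum(x_s / pi)
    ht_term = np.sum((y_s <= t) / pi)
    correction = np.sum(R_hat * x_U <= t) - np.sum((R_hat * x_s <= t) / pi)
    return (ht_term + correction) / N
```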
49/52
Estimating a fraction below an
estimated quantity
• Estimate the fraction of households in poverty when the poverty line is drawn at 60% of the median income:
  $\hat{T}(\hat{q}) = \frac{1}{N} \sum_{S_\nu} \frac{1}{\pi_i} I(y_i \le 0.6 \hat{q}),$
  with population quantity
  $T_N(q_N) = \frac{1}{N} \sum_{i=1}^{N} I(y_i \le 0.6 q_N).$
• Assume that $\lim_{N \to \infty} T_N(\gamma) = F_Y(0.6\gamma)$; the extra variance depends on $\frac{\partial F_Y(0.6\gamma)}{\partial \gamma}$. (A computational sketch follows.)
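A sketch of the poverty-rate estimator; the weighted-quantile helper and the Hajek normalization (using $\hat{N} = \sum 1/\pi_i$ in place of $N$) are simplifications for illustration, not the thesis code:

```python
import numpy as np

def weighted_quantile(y, w, p):
    """Smallest y-value at which the weighted CDF reaches p."""
    order = np.argsort(y)
    cdf = np.cumsum(w[order]) / np.sum(w)
    return y[order][np.searchsorted(cdf, p)]

def poverty_rate(y, pi):
    """Weighted fraction of units with income below 60% of the estimated median."""
    w = 1.0 / pi
    q_hat = weighted_quantile(y, w, 0.5)
    line = 0.6 * q_hat
    return np.sum(w * (y <= line)) / np.sum(w)
```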
50/52
Nondifferentiable estimating
equations
• The sample $p$-th quantile can be defined through estimating equations:
  $\hat{S}(t) = \frac{1}{N} \sum_{i \in S} \frac{1}{\pi_i} I(y_i - t \le 0) - p, \qquad \hat{\xi} = \inf\{t : \hat{S}(t) \ge 0\},$
  with population counterparts
  $S_N(t) = \frac{1}{N} \sum_{i=1}^{N} I(y_i - t \le 0) - p, \qquad \xi_N = \inf\{t : S_N(t) \ge 0\}.$
• The usual practice is to linearize the estimating function, but this approach is not applicable due to nondifferentiability.
• Provide a set of sufficient conditions on the monotonicity and smoothness of $S_N(t)$ and its limit for the proof. (See the sketch below.)
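The sample quantile defined by $\hat{\xi} = \inf\{t : \hat{S}(t) \ge 0\}$ translates directly into code (a sketch; $N$ is taken as known):

```python
import numpy as np

def sample_quantile(y, pi, p, N):
    """Sample p-th quantile: smallest t with S_hat(t) >= 0, where
    S_hat(t) = (1/N) * sum_i (1/pi_i) * I(y_i <= t) - p."""
    order = np.argsort(y)
    y_sorted = y[order]
    S = np.cumsum(1.0 / pi[order]) / N - p    # S_hat evaluated at each sorted y_i
    if not np.any(S >= 0):
        return np.inf                         # p-th quantile not reached in the sample
    idx = np.argmax(S >= 0)                   # first index where S_hat >= 0
    return y_sorted[idx]
```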
51/52
Concluding remarks
• Proposed an estimator of the subpopulation distance distribution and demonstrated its statistical properties.
• Application in a large-scale longitudinal survey.
• Theoretical extensions to nondifferentiable survey estimators.
52/52
Thank you