Curse of Dimensionality in high dimensions.
Assume a unit cube in [0, 1]^3 with edges X, Y, Z. Volume = 1 and length(X) = length(Y) = length(Z) = 1.
Take a sub-cube containing s = 10% of the observations ➔ expected edge length = s ** (1 / 3) = 0.46 (for a 30% sub-cube, 0.67), while the edge length of each input is 1 ➔ we must cover 46% of the range of each input to capture 10% of the data, because 0.46 ** 3 is about 10%.
In general, the expected edge length needed to capture a fraction s of the data is s ** (1 / p), where p is the number of inputs. In multivariate studies we typically add features to enhance the study; the more we add, the less representative the sample becomes, and most actual data points lie outside the region covered by the sample as the number of variables grows. Note that the cardinality of each individual variable is typically maintained; it is the multivariate aspect that is cursed.
If n1 = 1000 is a dense sample for a single input X1, then n10 = 1000 ** 10 is the sample size required for the same sampling density with 10 inputs.
Sampling needs grow exponentially with the number of dimensions. Next slide: example with 6 binary variables (x, y, z, u, v, w) for an overall population of 10,000, and then random samples of 5%, 10% and 30%. Note that there are 2 ** 6 = 64 possible combinations of variable levels. Notice the missing patterns in the samples (partial output displayed due to space).
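A minimal SAS data-step sketch of the edge-length formula above (the names edge_length, frac, dims are illustrative, not from the deck): it tabulates s ** (1/p) for a few sample fractions s and dimensions p.

/* Expected edge length s**(1/p) needed to capture a fraction s of the data
   in a p-dimensional unit hypercube (illustrative sketch).                  */
data edge_length;
   do dims = 1, 2, 3, 5, 10;               /* number of inputs p             */
      do frac = 0.01, 0.10, 0.30;          /* fraction s of data to capture  */
         edge_needed = frac ** (1 / dims); /* required edge length per input */
         output;
      end;
   end;
run;

proc print data=edge_length noobs;
run;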
Curse of dimensionality (table): for each of the 2 ** 6 = 64 possible patterns of the six binary variables (x, y, z, u, v, w), the table lists the count and percentage in the full population (N = 10,000) and in the 5%, 10% and 30% random samples (e.g., pattern x = y = z = u = v = w = 0: 82 obs, 0.82% in the population; 5 obs, 0.96% in the 5% sample; 13 obs, 1.28% in the 10% sample; 26 obs, 0.87% in the 30% sample). Blank cells mark patterns present in the population but missing from a sample; only partial output is shown.
With 9 binary variables and samples of 5%, 10% and 30% of the population (100% = total population), the percentage of patterns captured in the samples declines steadily as the number of variables grows (already noticeable by # vars = 5 in this example).
Positive Definite Matrix and definitions.
A positive definite matrix is a symmetric matrix with all positive eigenvalues. Covariance and correlation matrices are symmetric, so all their eigenvalues are real numbers.
Correlation and covariance matrices must have positive eigenvalues; if any eigenvalue is zero they are not of full rank ➔ there are perfect linear dependencies among the variables.
For a (mean-centered) data matrix of predictors X, the sample covariance is v = X'X (up to the 1/(n − 1) scaling). The generalized sample variance = det(v) (not much used). Since the variables in X have different scales, the correlation matrix can be used instead, i.e., det(R).
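A small PROC IML sketch of these definitions, using the 2 x 2 correlation matrix of the two-variable example that appears later in this chapter (correlation 0.746); the names R, evals, evecs, genVar are illustrative.

proc iml;
   R = {1.000 0.746,
        0.746 1.000};               /* correlation matrix of x1 and x2              */
   call eigen(evals, evecs, R);     /* real eigenvalues/eigenvectors (R symmetric)  */
   genVar = det(R);                 /* generalized sample variance, det(R)          */
   posDef = all(evals > 0);         /* 1 if positive definite (full rank)           */
   print evals evecs genVar posDef;
quit;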
Principal Components Analysis (PCA)
A technique for forming new variables from a (typically) large-'p' data set; the new variables are linear composites of the original variables.
The aim is to reduce the dimension ('p') of the data set while minimizing the amount of information lost when we do not keep all the composites. The number of composites equals the number of original variables ➔ the problem becomes one of composite selection.
The other dimension of the data, 'n', is reduced via cluster analysis, presented later on.
Web example: 12 observations.
23.091 is the variance of X1, etc.
Note: the total variance, also called "overall" or "summative" variance, or total (multivariate) variability, is the sum of the variances of the individual variables.
Let’s create Xnew arbitrarily from x1 and x2.
Play with the angle to maximize the fitted variance.
Geometric Interpretation:
Note that in order to find Xnew, we have rotated X1 and X2.
Xnew accounts for 87.31% of the total variation. Ergo, it is possible to estimate a new vector, called the second eigenvector or principal component, that accounts for the variation not fit by the first vector, Xnew.
The axis of the second vector is orthogonal to the first one ➔ uncorrelated.
The new axes or variables derived this way are called principal components; the eigenvectors contain the weights by which the original variables are multiplied and then summed, and the resulting values are the PC scores.
The variance of Xnew, 38.576, is called the first eigenvalue. The estimated eigenvalues provide measures of the amount of the original total variation fit by each of the new derived variables.
The sum of all the eigenvalues equals the sum of the variances of the original variables. PCA rearranges the variance in the original variables so that it is concentrated in the first few new components.
It is debatable whether binary variables can be used in PCA, because binary variables do not have a true 'origin'. Their variance lies in [0, 0.25] and their mean in [0, 1], while other variables can have far larger variances and means.
Eigenvectors (characteristic vectors)
Eigenvectors are lists of coefficients or weights (cj) showing how much each original variable contributes to each new derived variable.
The eigenvectors are usually scaled so that the sum of squared coefficients for each eigenvector equals one, and the eigenvectors are orthogonal to one another.
The analysis can be done by decomposing either the covariance matrix (as in the example) or the correlation matrix. Covariance keeps the original units, and the method then tends to fit the variables with higher variances.
Important Message
Note to clarify the variable names in the table below:
If the model name ends in 'ORIGINAL', variable values are of the form 'variable name'_PrinX_, where X denotes an eigenvalue number.
If, in addition, the model name ends in COV_ORIGINAL, the variance corresponding to 'variable_name'_prinX_ is in column prinX, and the other columns contain the covariances.
If the model name ends in CORR_ORIGINAL, the prinX columns contain correlations.
If the model name ends in COV_PCA, the principal components were obtained from the covariance matrix; if in CORR_PCA, from the correlation matrix.
For space reasons, a maximum of 6 Prin variables is shown.
Var/Cov/Corr before and after PCA

Model Name                   Variable      prin1    prin2
M2_TRN_PCOMP_COV_ORIGINAL    x2_prin1_    21.091   16.455
                             x1_prin2_    16.455   23.091
                             Ovl. Var     44.182   44.182
M2_TRN_PCOMP_COV_PCA         Prin1        38.576    0.000
                             Prin2         0.000    5.606
                             Ovl. Var     44.182   44.182
M5_TRN_PCOMP_CORR_ORIGINAL   x2_prin1_     1.000    0.746
                             x1_prin2_     0.746    1.000
                             Ovl. Var      2.000    2.000
M5_TRN_PCOMP_CORR_PCA        Prin1         1.000    0.000
                             Prin2         0.000    1.000
                             Ovl. Var      2.000    2.000
From previous slide
(cov_original and cov_PCA)
X2_prin1_ means X2 variance appears in column prin1…
Ovl. Var is sum of diagonal elements. Prin1 is first eigenvector. Etc.
Notice that raw units were used in PCA in this case ➔ Variable with larger
variance (X1) has more influence in results.
Notice that overall variance (44.182) is same for original case as for PCA
results. Dimension reduction refers to selecting # of Principal components
that fits up to specific percentage of overall variance. But original variables
are no longer useful.
First pcomp fits 87.3% (38.576 * 100 / 44.182) of the overall variance, etc. In
the available data, PCA finds direction or dimension with the largest
variance out of the overall variance, which is 38.576.
Then, orthogonal to the first direction, finds direction of largest variance of
whatever is left of overall variance, i.e., 44.182 – 38.576, which in our simple
example is 5.606.
CORR_ORIGINAL and CORR_PCA: correlation-based PCA. Notice that the Prin1, etc., variances and covariances differ from the covariance-based case. Notice the orthogonality (also in the previous slide) between prin1 and prin2, given by the 0 covariances.
Also, PCA can be performed on [0; 1] rescaled data via covariance
(not shown).
Aim: find projections that summarize the (mean-centered) data.
Approaches:
1) Find projections/vectors of maximum total variance.
2) Find projections with the smallest average (mean-squared) distance between the original points and their projections, which is equivalent to 1). Thus, maximize the variance by choosing 'w' (w is the vector of coefficients, x the original mean-centered data matrix), where the variance of the projection is w'vw, with v the covariance matrix of x.
To maximize variance fitted by component w, requires w
to be a unit vector, and thus w’w = 1 as constraint. Thus
maximize with constraint, i.e., Lagrange multiplier
method.
$$L(w, \lambda) = \sigma^2_w - \lambda(w'w - 1) = w'vw - \lambda(w'w - 1)$$
$$\frac{\partial L}{\partial \lambda} = w'w - 1, \qquad \frac{\partial L}{\partial w} = 2vw - 2\lambda w$$
And setting them to 0, we obtain:
$$w'w = 1, \qquad vw - \lambda w = 0 \;\Rightarrow\; vw = \lambda w$$
Thus, 'w' is an eigenvector of v, and the maximizing w is the one associated with the largest eigenvalue. Since v (= x'x) is (p, p), there are at most p eigenvalues.
Since v is a covariance matrix ➔ it is symmetric ➔ all eigenvectors are orthogonal to each other. Since v is positive (semi-)definite ➔ all eigenvalues ≥ 0.
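A PROC IML sketch of this result for the deck's two-variable example: eigen-decomposing the covariance matrix (variances 23.091 and 21.091, covariance 16.455) reproduces the eigenvalues 38.576 and 5.606 quoted earlier. The variable names are illustrative.

proc iml;
   v = {23.091 16.455,
        16.455 21.091};             /* covariance matrix of x1, x2                 */
   call eigen(lambda, w, v);        /* lambda: eigenvalues; w: eigenvector columns */
   totVar = trace(v);               /* 44.182 = sum of the eigenvalues             */
   pctFit = 100 * lambda / totVar;  /* roughly 87.3% and 12.7%                     */
   print lambda w totVar pctFit;
quit;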
While these principal components represent or replace one or more of the original variables, keeping only a subset of them is not a one-to-one transformation ➔ the inverse transformation back to the original variables is not possible.
NB: PCA can be obtained without the w'w = 1 constraint, but then the standard PCA interpretation no longer holds.
Detour on Eigenvalues, etc.
Let A be (n, n), v (n, 1), λ a scalar. Note that A is not the typical rectangular data set but a square matrix, for instance a covariance or correlation matrix.
Problem: find λ such that A v = λ v has a nonzero solution. Note that A v is a vector, for instance the estimated predictions of a linear regression.
(For us, A is data, v is coefficients, λ v a linear transformation of the coefficients.) λ is called an eigenvalue if a nonzero vector v exists that satisfies the equation.
Since v ≠ 0 ➔ |A − λ I| = 0 ➔ an equation of degree n in λ, which determines the values of λ (notice that the roots of the equation could be complex).
Diagonalization.
Matrix A is diagonalizable if it has n distinct eigenvalues. Then S (n, n) is the diagonalizing matrix, with the eigenvectors of A as its columns, and D is the diagonal matrix with the eigenvalues of A as its elements:
S⁻¹AS = D ➔ A = SDS⁻¹, and
A² = (SDS⁻¹)(SDS⁻¹) = SD²S⁻¹
Aᵏ = (SDS⁻¹) … (SDS⁻¹) = SDᵏS⁻¹
Example: 30% of married women get divorced and 20% of single women get married each year. Start with 8000 married (M) and 2000 single (S), and a constant population. Find the number of M and S in 5 years.
v = (8000, 2000)'; A = {0.7 0.2, 0.3 0.8}
Eigenvalues: 1 and 0.5.
Eigenvectors: v1 = (2, 3)', v2 = (1, −1)'. Then A⁵v = SD⁵S⁻¹v = (4125, 5875)'.
As k → ∞, Dᵏ → diag(1, 0) ➔ Aᵏv → (4000, 6000)'.
Detour on Eigenvalues, etc (cont. 2).
Singular Value Decomposition (SVD)
Notice that only square matrices can be diagonalized. Typical data
sets, however, are rectangular. SVD provides necessary link.
A (m,n), m  n ➔ A = UV’,
U(m, m) orthogonal matrix (its columns are eigenvectors of AA’)
(AA’ = U  V’V  U’ = U 2U’)
V(n, n) orthogonal matrix (its columns are eigenvectors of A’A)
(A’A= V  U’U  V’ = V 2V’)
 (m,n) = diagonal ( 1, 0) ,
1 = diag( 1  2 ….  n)  0.
’s called singular values of A. 2’s are eigenvalues of A’A.
U and V: left and right singular matrices
(or matrices of singular vectors).
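A PROC IML sketch of the SVD call on a small, hypothetical mean-centered 4 x 2 matrix (the matrix itself is made up for illustration): the squared singular values coincide with the eigenvalues of A'A.

proc iml;
   A = { 1  2,
        -1  0,
         2 -1,
        -2 -1};                              /* hypothetical mean-centered data, m=4, n=2 */
   call svd(U, Q, V, A);                     /* A = U * diag(Q) * V`                      */
   sq = Q ## 2;                              /* squared singular values                   */
   call eigen(evalsAtA, evecsAtA, A` * A);   /* eigenvalues of A'A                        */
   print Q sq evalsAtA;                      /* sq and evalsAtA agree                     */
quit;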
Principal Components Analysis (PCA): Dimension Reduction
for interval-measure variables (for dummy variables, replace Pearson
correlations by polychoric correlations that assume underlying latent
variables. Continuous-dummy correlations are fine).
PCA creates linear combinations of original set of variables which explain
largest amount of variation.
First principal component explains largest amount of variation in original
set; second one explains second largest amount of variation subject to
being orthogonal to first one, etc.
[Figure: original axes X1, X2, X3 and rotated principal axes PC1, PC2, PC3; PCi = V1i·X1 + V2i·X2 + V3i·X3.]
PC scores for each observation created by product of
X and V, the set of eigenvectors.
$$XV = \begin{pmatrix}
\sum_{i=1}^{p} V_{i1}x_{1i} & \sum_{i=1}^{p} V_{i2}x_{1i} & \cdots & \sum_{i=1}^{p} V_{ip}x_{1i} \\
\sum_{i=1}^{p} V_{i1}x_{2i} & \sum_{i=1}^{p} V_{i2}x_{2i} & \cdots & \sum_{i=1}^{p} V_{ip}x_{2i} \\
\vdots & \vdots & & \vdots \\
\sum_{i=1}^{p} V_{i1}x_{ni} & \sum_{i=1}^{p} V_{i2}x_{ni} & \cdots & \sum_{i=1}^{p} V_{ip}x_{ni}
\end{pmatrix}$$

SVD of the Covariance/Correlation Matrix = USVᵀ
PCA computed by performing SVD/Eigenvalue
Decomp. on covariance or correlation matrix.
Eigenvalues and associated eigenvectors extracted
from covariance matrix ‘sequentially’.
Each successive eigenvalue is smaller (in absolute
value), and each associated eigenvector is orthogonal
to previous one.
The amount of variation fitted by the first k principal components can be computed in the following way, where the λᵢ are the eigenvalues of the covariance/correlation matrix:

$$\Lambda = \begin{pmatrix}
\lambda_1 & 0 & \cdots & 0 \\
0 & \lambda_2 & \cdots & 0 \\
\vdots & & \ddots & \vdots \\
0 & 0 & \cdots & \lambda_p
\end{pmatrix},
\qquad
\%\ \text{Variation fitted} = \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{j=1}^{p} \lambda_j} \times 100\%$$
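A one-line PROC IML sketch of this formula, using the eigenvalues of the two-variable example (38.576 and 5.606); cusum() produces the cumulative sums.

proc iml;
   lambda  = {38.576, 5.606};                   /* eigenvalues                      */
   pct_fit = 100 * cusum(lambda) / sum(lambda); /* cumulative % of variation fitted */
   print lambda pct_fit;                        /* 87.3% and 100%                   */
quit;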
Covariance or Correlation Matrix derivation?
Often overlooked point: the results are different. The correlation matrix is the covariance matrix of the same data in standardized form. Assume 3 variables x1 through x3. If Var(x1) = k·(Var(x2) + Var(x3)) for large k, then x1 will dominate the first eigenvalue and the others will be negligible.
The standardization implicit in the correlation matrix treats all variables equally, because each one has unit variance.
Recommendation: it depends on the focus of the study. There is a similar problem in clustering: outliers can badly affect the standard deviation and mean estimates
➔ standardized variables then do not reflect the behavior of the original variables.
SAS
/* PCA on the covariance matrix; PC scores are added to the output data set. */
proc princomp data = &indata. cov out = outp_princomp;
   var doctor_visits fraud member_duration
       no_claims optom_presc total_spend;
run;

/* Correlations between the first PC scores and the original variables. */
proc corr data = outp_princomp;
   var prin1 prin2 prin3 doctor_visits fraud
       member_duration no_claims optom_presc total_spend;
run;
Var/Cov/Corr before and after PCA

Model Name                  Variable                  prin1    prin2    prin3    prin4    prin5    prin6
M2_TRN_PCOMP_COV_ORIGINAL   no_claims_prin1_          1.236    0.001    0.394    2.550    0.139   -278.0
                            num_members_prin2_        0.001    0.994    0.115   -0.501   -0.024   -163.5
                            doctor_visits_prin3_      0.394    0.115   48.888   115.04   -0.859   8399.9
                            member_duration_prin4_    2.550   -0.501   115.04   6493.3   -12.79    93386
                            optom_presc_prin5_        0.139   -0.024   -0.859   -12.79    2.749   726.61
                            total_spend_prin6_       -278.0   -163.5   8399.9    93386   726.61    1.3E8
                            Ovl. Var                  1.3E8    1.3E8    1.3E8    1.3E8    1.3E8    1.3E8
M2_TRN_PCOMP_COV_PCA        Prin1                     1.3E8    0.000    0.000    0.000    0.000    0.000
                            Prin2                     0.000   6427.9    0.000    0.000    0.000    0.000
                            Prin3                     0.000    0.000   46.495    0.000    0.000    0.000
                            Prin4                     0.000    0.000    0.000    2.723    0.000    0.000
                            Prin5                     0.000    0.000    0.000    0.000    1.216    0.000
                            Prin6                     0.000    0.000    0.000    0.000    0.000    0.993
                            Ovl. Var                  1.3E8    1.3E8    1.3E8    1.3E8    1.3E8    1.3E8
M5_TRN_PCOMP_CORR_ORIGINAL  no_claims_prin1_          1.000    0.001    0.051    0.028    0.075   -0.022
                            num_members_prin2_        0.001    1.000    0.016   -0.006   -0.015   -0.014
                            doctor_visits_prin3_      0.051    0.016    1.000    0.204   -0.074    0.106
                            member_duration_prin4_    0.028   -0.006    0.204    1.000   -0.096    0.102
                            optom_presc_prin5_        0.075   -0.015   -0.074   -0.096    1.000    0.039
                            total_spend_prin6_       -0.022   -0.014    0.106    0.102    0.039    1.000
                            Ovl. Var                  6.000    6.000    6.000    6.000    6.000    6.000
M5_TRN_PCOMP_CORR_PCA       Prin1                     1.000    0.000    0.000    0.000    0.000    0.000
                            Prin2                     0.000    1.000    0.000    0.000    0.000    0.000
                            Prin3                     0.000    0.000    1.000    0.000    0.000    0.000
                            Prin4                     0.000    0.000    0.000    1.000    0.000    0.000
                            Prin5                     0.000    0.000    0.000    0.000    1.000    0.000
                            Prin6                     0.000    0.000    0.000    0.000    0.000    1.000
                            Ovl. Var                  6.000    6.000    6.000    6.000    6.000    6.000
For the COV principal components run "M2_TRN_PCOMP_COV_ORIGINAL": "total_spend" has a far larger variance, so the whole PCA analysis is dominated by this variable.
The eigenvalues are the variances of the corresponding eigenvectors (components).
PCA results differ between the COV and CORR analyses. Below, the correlation-based PCA results.
5 components fit 87% of the total variation. Eigenvalue X is the variance of principal component X. An eigenvalue is also called a characteristic root.
Eigenvalues table

Number   model_name             Eigenvalue   Difference   Proportion   Cumulative
1        M1_PCOMP_NO_S_COV            1.30         0.22         0.22         0.22
         M1_PCOMP_STAN_CORR           1.30         0.22         0.22         0.22
2        M1_PCOMP_NO_S_COV            1.07         0.05         0.18         0.39
         M1_PCOMP_STAN_CORR           1.07         0.05         0.18         0.39
3        M1_PCOMP_NO_S_COV            1.02         0.02         0.17         0.56
         M1_PCOMP_STAN_CORR           1.02         0.02         0.17         0.56
4        M1_PCOMP_NO_S_COV            1.00         0.17         0.17         0.73
         M1_PCOMP_STAN_CORR           1.00         0.17         0.17         0.73
5        M1_PCOMP_NO_S_COV            0.82         0.03         0.14         0.87
         M1_PCOMP_STAN_CORR           0.82         0.03         0.14         0.87
6        M1_PCOMP_NO_S_COV            0.79                      0.13         1.00
         M1_PCOMP_STAN_CORR           0.79                      0.13         1.00
All                                  12.00         1.00         2.00         7.55
Scree plot: a plot of the eigenvalues vs. the component number, looking for an obvious break or elbow. Here the scree plot indicates 4 components.
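A sketch of how the scree plot can be requested directly from PROC PRINCOMP, assuming ODS Graphics is available and the PLOTS=SCREE option of recent SAS releases; &indata. and the variable list follow the earlier code.

ods graphics on;
proc princomp data = &indata. plots = scree;   /* eigenvalues vs. component number */
   var doctor_visits fraud member_duration no_claims optom_presc total_spend;
run;
ods graphics off;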
P-comp1 mostly fitted by doctor_visits and member duration. # 2
(which fits residuals from step 1 ) by No_claims and optom_presc, etc.
The data set used is the Home Equity Loan (HMEQ) data. All variables are continuous except for JOB, REASON and the binary target BAD:
BAD(binary target) - Default or seriously delinquent
CLAGE - Age of oldest trade line in months
CLNO - Number of trade (credit) lines
DEBTINC - Debt to income ratio
DELINQ - Number of delinquent trade lines
DEROG - Number of major derogatory reports
JOB - Prof/exec, sales, manager, office, self, or other
LOAN - Amount of current loan request
MORTDUE - Amount due on existing mortgage
NINQ - Number of recent credit inquiries
REASON - Home improvement or debt consolidation
VALUE - Value of current property
YOJ - Years on current job.
Variables used in PCA are measured in interval scale.
Variables are:
LOAN - Amount of current loan request
MORTDUE - Amount due on existing mortgage
VALUE - Value of current property
YOJ - Years on current job
CLAGE - Age of oldest trade line in months
NINQ - Number of recent credit inquiries
CLNO - Number of trade (credit) lines
DEBTINC - Debt to income ratio.
The eigenvalues report indicates that the first four principal components fit 70.77% of the variation of the original variables.
The first principal component score for each observation is created by the following linear combination:
PC1 = .3179*LOAN + .6005*MORTDUE + .6054*VALUE + .0141*YOJ + .1827*CLAGE + .0606*NINQ + .3314*CLNO + .1574*DEBTINC
The eigenvectors report contains the V coefficients associated with each of the original variables for the first four principal components.
At this stage, it is customary to try to interpret the
eigenvectors in terms of the original variables.
The first vector had high relative loads in MORTDUE, VALUE and
CLNO that indicates a dimension of financial stress (remember that
there is no dependent variable, i.e., BADS does not play a role).
Given “financial stress”, the second vector is a measure of “time
effects” based on YOJ and CLAGE. And so on to the third and fourth
vectors. Notice that the interpretation is based on the magnitude of
the coefficients, without any guidelines as to what constitutes a high
relative load.
Therefore, with a large number of variables, interpretation is more
difficult because the loads do not necessarily distinguish themselves
as high or low.
In the next table, conditioning on VALUE and MORTDUE hardly affects the correlation between YOJ and CLAGE. Note that a full analysis would require 2nd order partial correlations (not done here).
1st component.
Correlations: zero-order, partial and semipartial correlations (Model M1)

With Var   Variable   Cond. Var   Zero-order   Partial   Semipartial
VALUE      YOJ        CLNO                      -0.010      -0.010
VALUE      YOJ        MORTDUE                    0.138       0.138
YOJ        CLAGE      (none)          0.202
YOJ        CLAGE      CLNO                       0.225       0.221
YOJ        CLAGE      MORTDUE                    0.236       0.231
YOJ        CLAGE      VALUE                      0.226       0.221
YOJ        CLNO       (none)          0.025
YOJ        CLNO       MORTDUE                    0.042       0.042
YOJ        CLNO       VALUE                      0.013       0.013
YOJ        MORTDUE    (none)         -0.088
YOJ        MORTDUE    CLNO                      -0.095      -0.095
YOJ        MORTDUE    VALUE                     -0.162      -0.162
YOJ        VALUE      (none)          0.008
YOJ        VALUE      CLNO                      -0.010      -0.010
YOJ        VALUE      MORTDUE                    0.138       0.138
YOJ        YOJ        (none)          1.000
Interpretation:
PCR1: high loadings for MORTDUE, VALUE and CLNO ➔ a financial aspect?
PCR2: given PCR1, YOJ and CLAGE ➔ a time aspect?, etc. Note: it would be nice to have inference on the component loadings; when p is large this is very difficult.
Also, when looking at PCR2 for interpretation, it is imperative to first remove the effects of the first component from all variables before looking at correlations.
[Figure: response Y plotted against predictors X1 and X2, and against the principal components PC1 and PC2.]
PCA advantage: collinearity is removed when regressing on principal components, which is called Principal Components Regression (PCR).
Principal Components Regression.
1) The resulting model still contains all the original variables.
2) Similar to ridge regression, but with truncation (due to the choice of vectors) instead of ridge's shrinkage.
3) "Look where there's light" fallacy: we are no longer looking at the original information.
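A sketch of PCR on the HMEQ data described later in this chapter (the data set name hmeq is an assumption): keep the first 4 correlation-based components as scores, then regress the binary target BAD on the orthogonal scores with PROC LOGISTIC.

/* Step 1: PC scores for the interval inputs; only the first 4 are kept. */
proc princomp data = hmeq n = 4 out = pc_scores;
   var loan mortdue value yoj clage ninq clno debtinc;
run;

/* Step 2: regress the target on the uncorrelated scores prin1-prin4 (PCR). */
proc logistic data = pc_scores descending;
   model bad = prin1-prin4;
run;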
Discussion on Principal Components (PCs).
• The dependent variable is not used ➔ no selection bias (i.e., the dep var does not affect PCA, which is 'good').
• Very often PCs are not interpretable in terms of the original variables.
• The dependent variable is not necessarily highly correlated with the vectors corresponding to the largest eigenvalues (in a variable-selection context, the tendency to select the eigenvectors related to the top eigenvalues is unwarranted).
• Sometimes the most highly correlated vector corresponds to a smaller eigenvalue.
• May be impossible to implement on present tera/giga-scale databases.
• If there is an error component in the data, PCA tends to choose too many components.
Discussion on Principal components (PCs) (cont. 1).
• Alternative to ‘ad-hoc’ PC selection, inference on
eigenvalues. See Johnstone (2001).
• Common practice: choose eigenvectors corresponding to high
eigenvalues ➔ vector selection problem in addition to third
point previous page (Ferre (1995) argues most methods fail. For
newer version, see Guo et al, (2002), and Minka (2000) for
Bayesian perspective). Foucart (2000) provides framework for
“dropping” principal components in regression. For robust
calculation, see Higuchi and Eguchi (2004). Li et al (2002)
analyze L1 for Principal Components. Song et al (2018) obtains
optimal number based on stability approach.
INTERVIEW Question:
Our data is mostly
binaries. PCA?
Factor Analysis.
Family of statistical techniques to reduce variables into small number of latent
factors. Main assumption: existence of unobserved latent variables or
factors among variables. If factors are partialled from observed vars, partial
corrs among existent variables should be zero (to be reviewed in BEDA).
Each observed var can be expressed as weighted sum of latent
components:
For instance, concept of frailty can be ascertained by testing strength,
weight, speed, agility, balance, etc. Want to explain the component of
frailty in terms of these measures.
Very popular in social sciences, such as psychology, survey analysis, sociology, etc.
Idea is that any correlation between pair of observed variables can be explained
in terms of their relationship with latent variables.
FA as generic term includes PCA, but they have different assumptions.
$$y_i = a_{i1}f_1 + a_{i2}f_2 + \cdots + a_{ik}f_k + e_i$$
Differences between FA and PCA.
Difference in definition of variance of the variable to be analyzed.
Variance of variable can be decomposed into common variance,
shared by other variables and thus caused by values of latent
constructs, and unique variance that include error component. Unique
unrelated to any latent construct.
"Common" Factor Analysis (FA; the notation CFA is used later for confirmatory FA, and EFA for exploratory FA) analyzes only the common variance; PCA considers total variance without distinguishing common from unique. In FA, the factors account for the inter-correlations among variables, to identify latent dimensions; in PCA, we account for the maximum portion of variation in the original set of variables. FA uses a notion of causality; PCA is free of that.
PCA better when vars measured relatively error free (age, nationality, etc.). If vars
are only indicators of latent constructs (test score, response to attitude scale,
or surveys of aptitudes) ➔ CFA.
PCs: composite variables computed from linear combinations of the measured
variables. CFs: linear combinations of “common” parts of measured
variables that capture underlying constructs.
EFA Rotations.
An infinite number of solutions that reproduce the same correlation matrix is possible, obtained by rotating the reference axes of the factor solution to simplify the factor structure and achieve a more meaningful and interpretable solution. IDEA BEHIND: rotate the factors simultaneously so as to have as many zero loadings on each factor as possible. "Meaningful" and "interpretable" demand the analyst's expertise.
Orthogonal rotation: the angles between the reference axes of the factors are maintained at 90 degrees; oblique rotations do not maintain them (used when the factors are assumed to be correlated).
In the FA case, negative eigenvalues ➔ the covariance matrix is NOT positive definite ➔ the cumulative fitted variation proportion can exceed 1. Note that PCA is not affected.
Assume 10 variables that we view in 2-factor space (Y and X axes). Each dot below is one observation. An orthogonal rotation (i.e., one that assumes the variables are uncorrelated) gets points closer to one of the axes (and away from the other).
From theanalysisfactor.com
If variables are correlated (say, education and income level), oblique rotations (axes at less than 90°) create a better fit.
From theanalysisfactor.com
Exploratory FA.
PCA gives unique solution, FA different solutions depending on method &
estimates of communality.
While PCA analyzes Corr (cov) matrix, FA replaces main diagonal corrs
by prior communality estimates: estimate of proportion of variance
of the variable that is both error-free and shared with other
variables in matrix (there are many methods to find estimates).
Determining optimal # factors: ultimately subjective. Some methods:
Kaiser-Guttman rule, % variance, scree test, size of residuals, and
interpretability.
Kaiser-Guttman: eigenvalues >= 1.
% variance of sum of communalities fitted by successive factors.
Scree test: plots rate of decline of successive eigenvalues.
Analysis of residuals: Predicted corr matrix similar to original corr.
Possibly, huge graphical output.
Differences between FA and PCA: communalities.
PCA analyzes the original correlation matrix with '1' in the main diagonal, i.e., total variance. FA analyzes communalities, given by the common variance: the main diagonal of the correlation matrix is replaced by prior communality estimates, with these options (SAS PRIORS=):
ASMC sets the prior communality estimates proportional to the squared multiple
correlations but adjusted so that their sum is equal to that of the maximum absolute
correlations (Cureton; 1968).
INPUT reads the prior communality estimates from the first observation with
either _TYPE_=’PRIORS’ or_TYPE_=’COMMUNAL’ in the DATA = data set (which cannot be
TYPE=DATA).
MAX sets the prior communality estimate for each variable to its maximum absolute
correlation with any other variable.
ONE sets all prior communalities to 1.0.
RANDOM sets the prior communality estimates to pseudo-random numbers uniformly
distributed between 0 and 1.
SMC sets the prior communality estimate for each variable to its squared multiple
correlation with all other variables.
Final communalities: proportion of the variance in each of the original variables retained
after extracting the factors.
FA properties (SAS)
Estimation method: PRINCIPAL (yields principal components),
MAXIMUM LIKELIHOOD
MINEIGEN: Smallest eigenvalue for retaining a factor.
Nfactors: Maximum number of factors to retain.
Scree: display scree plot.
Rotate
Priors.
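A sketch tying these PROC FACTOR options together on the HMEQ interval variables (the data set name hmeq is an assumption): principal-axis extraction with SMC priors, 4 retained factors, a scree plot, and a varimax rotation.

proc factor data = hmeq method = principal priors = smc
            nfactors = 4 scree rotate = varimax;
   var loan mortdue value yoj clage ninq clno debtinc;
run;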
Additional Factor Methods and comparisons.
EFA: explores possible underlying factor structure of set of observed
variables without imposing preconceived structure on outcome. Aim:
identify underlying factor structure.
Confirmatory factor analysis (CFA): a statistical technique used to verify the factor structure of a set of observed variables. CFA allows testing the hypothesis that a relationship between the observed variables and their underlying latent constructs exists. The researcher uses knowledge of theory, empirical research, or both, postulates the relationship pattern a priori and then tests the hypothesis statistically. In short, the NUMBER of factors, the type of rotation and which variables load on each factor are known. Rule of thumb: a variable's loading on its factor must be > 0.7.
Confirmatory factor models ( ≈ linear factor models)
Item response models ( ≈ nonlinear factor models).
Varimax Rotation.
VARIMAX: an orthogonal rotation that maximizes the sum of the variances of the squared loadings (squared correlations between variables and factors). Simple structure is intuitively achieved if (a) any given variable has a high loading on a single factor but near-zero loadings on the remaining factors, and (b) any given factor is constituted by only a few variables with very high loadings on it while the remaining variables have near-zero loadings on it. When (a) and (b) hold, the factor loading matrix is said to have "simple structure," and varimax rotation brings the loading matrix as close to such simple structure as the data allow.
Each variable can then be well described by a linear combination of only a few basis functions (Kaiser, 1958).
In the next slides, compare ORIGINAL with VARIMAX for the different factors (F_).
All variables,
And very messy.
From: https://www.linkedin.com/groups/4292855/4292855-6171016874768228353
Question about dimension reduction (factor analysis) for survey question set
What would be the interpretation for a set of survey questions where
rotation fails to converge in 25 iterations, and the non-rotated
solution shows 2 clear factors with Eigenvalues above 2, but the scree plot levels out right
at eigenvalue = 2 and the remaining (many) factors are quite close together?
Answers:
1. Well you first want to get the model to convergence. I usually increase the # of iterations to 500.
2. I would suggest its a one factor solution and the above 1 criteria is probably not appropriate. How many
items/questions were there in your survey?
Answer from poster: Thank you. I was able to do this with # iteration = 500. There are 14 factors with eigenvalues
above 1, accounting for a total of 66.7% of the variance. I am
still unsure what the interpretation would be - I've never had a dataset before
that had so many factors. Too much noise? A lot of variability in survey response?
3. I think that "14 eigenvalues make just 2/3 of the variance" is a warning. It means to my experience that there
are no large eigenvalues at all and that there are just "scree" eigenvalues.
This can be an effect of having too many variables (= too high dimension). In this case an "automatic"
dimensional reduction will necessarily fail and a visual dimensional reduction is due.
It can also mean that the data cloud is more or less "spherical". This would mean that there are many
columns (or rows) in the correlation matrix containing just values close to zero. One can easily "eliminate"
such "circular" variables as follows
a) copy the correlation matrix to an Excel sheet
b) For each column calculate (sum of elements - 1) = rowsum
c) sort the columns by descending rowsum
d) take just the "top 20" or so variables with the largest rowsum
e) do the analysis with the 20 variables and study the "scree plot"
Sorry, in step b) you should also calculate the maximal column element except the "1" on the diagonal. In step
d) you should also add variables with a small rowsum but a relatively large maximal correlation.
4. I agree with stated above. Just FYI, use Bartlett sphericity test to formally check low correlation issue. Try
also alternate the type of rotation.
5. Without knowing what you are measuring ... I can tell you about a similar situation I experienced ... it took a
high number of iteration to converge, only one eigen value above 2, and a dozen or more above 1 that made
no theoretical sense. I deleted all items with little response variability, and reran it ... and it came out more
clearly as a homogeneous measure (1 factor). Once accepted that I was dealing with one factor, I was able to
make some edits to the items, collected more data on the revised measure, and now have a fairly tight
homogeneous measure.... where I really thought there would be 5 or so factors!
➔ MESSAGE: Extremely ad-hoc solutions are typical, not
necessarily recommended ➔ think before you rush in..
Introduction
Different approaches to clustering (there are other taxonomies)
1) disjoint or partitioning (k-means);
2) non-disjoint hierarchical (agglomerative),
3) density-based, grid-based, fuzzy, soft method or overlapping
methods, constraint-based, or model-based clustering (EM algorithm).
Marketing prefers disjoint in order to separate customers completely
(assuming independent observations). Archaeology prefers agglomerative
because two nearby clusters might emerge from previous one in downward
tree hierarchy (e.g., fossils in evolutionary science).
Agglomerative or hierarchical methods: typically bottom-up. Start from individual observations and agglomerate upwards. Information on the # of clusters is not necessary, but the approach is impractical for large data sets. The end result is called a dendrogram, a tree structure. It is necessary to define a distance; different distances ➔ different methods.
Basic Introduction
Overlapping, fuzzy: methods that deal with data that cannot
be completely separated or with probability statement attached
to cluster membership.
Grid-based: very fast, need to determine finite grid of cells
along each dimension.
Constraint-based, constraint given by business strictures or
applications.
Won’t review top-down (divisive methods), overlapping or
fuzzy methods, etc..
Why so many methods?
If there are 'n' data points to be clustered into 'k' clusters, there are approximately k ** n / k! ways to do it ➔ brute-force methods are not adequate.
For instance, for k = 3 and n = 100, the number of ways is 8.5896253455335 * 10 ** 46. For n = 1000, a computer cannot enumerate them, and n = 1000 is a rather small data size at present.
➔ Heuristics are used, such as k-means especially.
Methods typically use Euclidean distance, but correlation distance is also possible.
Disjoint: K-means (McQueen, 1967) (most used clustering
method in business)
Key concept underlying cluster detection: similarity, homogeneity or
closeness of OBSERVATIONS. Resulting implementation is based on
similarity or dissimilarity of measures of distance. Methods typically greedy
(one observation at time).
Start with given number of requested clusters K, N data points and P
variables. Continuous variables only.
Algorithm determines K arbitrary seeds that become original location of
clusters in P dimensions (there is variety of ways to change starting seeds).
By using Euclidean distance function, allocate each observation to nearest
cluster given by original seeds.
Re-calculate centroid (cluster center of gravity). Re-allocate observations
based on minimal distance to newer centroids and repeat until convergence
given by maximum number of iterations, or until cluster boundaries remain
unchanged. K-means typically converges quickly.
Outliers can have negative effects because calculated centroid would be
affected. If outlier is itself chosen as initial seed, effect is paradoxical:
analyst may realize that relative scarcity of observations is indication of an
outlier. If outlier is not chosen initially, centroid is unavoidably affected. On
the other hand, the distortion introduced may be such as to make such a
conclusion difficult to reach.
Further disadvantage: method depends heavily on initial choice of
seeds ➔ recommended that more than one run be performed but then
difficult/impossible to combine results.
In addition, # of desired clusters must be specified, which is in many
situations the answer the analyst wants the algorithm to provide. #
iterations must also be given.
More importantly, search for clusters is based on Euclidean distances that
produce convex shapes. If ‘true’ cluster is not convex, K-means could not
find that solution.
Number of Clusters, determination.
Cubic clustering criterion (CCC) explained later with Ward method.
Elbow rule: for each 'k'-cluster solution, compute the ratio of between-cluster variation to total variation, and stop at the point where increasing K no longer improves the ratio significantly (can also be used for the WARD method later on). The elbow point sometimes cannot be fully distinguished.
WEB
Alternatives:
K-medoids replaces mean by data points. In this sense, more robust to
outliers but inefficient for large data sets (Rousseeuw and Kaufman,
1987).
Resulting clusters are disjoint: merging two clusters does not lead to
combined overall super-cluster. Since method is non-hierarchical,
impossible to determine closeness among clusters.
In addition to closeness issue, possible that some observations may
belong in more than one cluster and thus it would be important to report
a measure of the probability belonging in a cluster.
Originally created for continuous variables. Huang (1998)
among others, extended algorithm to nominal variables.
Next: Cluster graphs derived from canonical discriminant analysis.
Clustering solution (Fraud data set): variable means by cluster.

Cluster   # obs   Doctor visits   Fraud (yes/no)   Membership duration   No. of claims   No. of opticals   Total spent on opticals
1           503        1.81           -0.42              0.48               -0.18            -0.27             -0.011
2          1452       -0.40           -0.50             -0.56               -0.23            -0.12             -0.18
3           203        0.17           -0.18              0.32               -0.21            -0.14              2.79
4           165       -0.01            1.23              0.16                3.77             0.12             -0.17
5           491       -0.20            2.00             -0.43                0.01             0.04             -0.41
6           150       -0.18            0.57             -0.46               -0.08             3.57              0.45
7           662       -0.35           -0.49              1.19               -0.25            -0.27             -0.14

Is this great? Clustering solution, VALIDATION:

Cluster   # obs   Doctor visits   Fraud (yes/no)   Membership duration   No. of claims   No. of opticals   Total spent on opticals
1          2334       -0.00            0.01             -0.01                0.00            -0.02             -0.02

NO: look at the validation run; Fraud is a difficult variable to work with. (Fraud data set.)
HMEQ K-means: 3 clusters selected (ABC method, Aligned Box Criterion, SAS).
Rescaled variable means by cluster (statistical inference,
Parametric or otherwise, necessary to create profiles).
/* Estimate the number of clusters with the ABC (Aligned Box Criterion). */
ods output ABCResults = abcoutput;
proc hpclus data = training maxclusters = 8 maxiter = 100 seed = 54321
            NOC = ABC (B = 1 minclusters = 3 align = PCA);
   input DOCTOR_VISITS /* FRAUD */ MEMBER_DURATION NO_CLAIMS OPTOM_PRESC
         TOTAL_SPEND;   /* FRAUD omitted because it is binary */
run;

/* Store the selected number of clusters in the macro variable &abc_k. */
proc sql noprint;
   select k into :abc_k from abcoutput;
quit;

/* K-means on the training data; final seeds are saved for the validation run. */
proc fastclus data = training out = outdata maxiter = 100 converge = 0
              replace = random radius = 10 maxclusters = 7
              outseed = clusterseeds summary;
   var DOCTOR_VISITS /* FRAUD */ MEMBER_DURATION NO_CLAIMS OPTOM_PRESC
       TOTAL_SPEND;
run;

/* VALIDATION STEP: note the validation data and the clusterseeds seeds. */
proc fastclus data = validation out = outdata_val maxiter = 100
              seed = clusterseeds converge = 0 radius = 100
              maxclusters = 7 outseed = outseed summary;
   var DOCTOR_VISITS /* FRAUD */ MEMBER_DURATION NO_CLAIMS OPTOM_PRESC
       TOTAL_SPEND;
run;
k-means assumes:
1) the distribution of each attribute (variable) is spherical, i.e., E(x x') = σ² I;
2) all variables have the same variance;
3) the prior probability for all k clusters is the same, i.e., each cluster has a roughly equal number of observations.
These assumptions are almost never verified; what happens when they are violated?
Plus, it is difficult if not impossible to ascertain the best results.
Examples in two dimensions, X and Y, follow.
Non-spherical data.
K-means solutions. X: centroids of found clusters.
Instead, single linkage hierarchical clustering solution.
Additional problems:
Differently sized clusters.
➔NO FREE LUNCH – NFL- (Wolpert, MacReady, 1997)
“We have dubbed the associated results NFL theorems
because they demonstrate that if an algorithm performs
well on a certain class of problems then it necessarily pays
for that with degraded performance on the set of all
remaining problems”.
➔CANNOT USE SAME MOUSETRAP ALL
THE TIME.
➔ Hint: Verify assumptions, they ARE IMPORTANT.
Interview Question:
Aliens from far away prepare for an invasion of Earth. They need to find out whether intelligent creatures live here and plan to launch 1000 probes to random locations for that purpose. Unknown to them, oceans cover 71% of the Earth, and each probe sends back information about its landing site and surroundings. Let us assume that just (some) humans are intelligent.
The alien data scientist decides to use k-means on the data. Discuss how he/she would conclude whether there is intelligent life on Earth (no sarcastic answers allowed).
Agglomerative (hierarchical) clustering
methods:
Single linkage
Centroid,
Average Linkage
and Ward.
Agglomerative Clustering (standard in bio sciences).
In single-linkage clustering (aka nearest neighbor or neighbor joining
tree in genetics), distance between two clusters determined by single element
pair, namely those two elements (one in each cluster) that are closest to each
other. And later compounds defined by min distance (see example below).
Shortest of these links that remains at any step causes fusion of two clusters
whose elements are involved. Method also known as nearest neighbor
clustering: the distance between two clusters is the shortest possible distance among members of the clusters, or "best of friends". The result of the clustering can be visualized as a dendrogram: the sequence of cluster fusions and the distance at which each fusion took place.
The distance or linkage factor is given by
$$D(X, Y) = \min_{x \in X,\; y \in Y} d(x, y),$$
where X and Y are clusters and d is the distance between elements x and y.
In the centroid method (commonly used in biology), the distance between clusters "l" and "r" is given by the Euclidean distance between their centroids. The centroid method is more robust than the other linkage methods presented here, but has the drawback of inversions (clusters do not become steadily more dissimilar as we keep linking up).
In complete linkage (a.k.a. furthest neighbor) the distance between two
clusters is longest possible distance between the groups, or the worst among
the friends.
In the case of average linkage method, the distance is the average distance
between each pair of observations, one from each cluster. The method tends
to join clusters with small variances.
Ward's minimum variance method assumes that the data set is derived from a multivariate normal mixture and that the clusters have equal covariance matrices and sampling probabilities. It tends to produce clusters with roughly the same number of observations and is based on the notion of the information loss suffered when joining two clusters; the loss is quantified by an ANOVA-like Error Sum of Squares criterion.
Example of complete linkage: assume 5 observations with Euclidean distances given by:

      1   2   3   4   5
 1    0
 2    9   0
 3    3   7   0
 4    6   5   9   0
 5   11  10   2   8   0

Let's cluster the closest observations, 3 and 5 (as '35'), where the distance between 1 and 35 is given by max(d(1, 3), d(1, 5)). After a few more steps, all observations are clustered. Dendrograms (with distance as the height on the Y axis) show the agglomeration.

      35   1   2   4
35     0
 1    11   0
 2    10   9   0
 4     9   6   5   0
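A sketch of this 5-point example in SAS, assuming PROC CLUSTER's TYPE=DISTANCE input convention (the point names p1-p5 and data set names are illustrative): build the lower-triangular distance matrix, run complete linkage, and cut the tree at 2 clusters.

data dist5(type=distance);               /* lower-triangular distance matrix */
   input id $ p1-p5;
   datalines;
p1  0  .  .  .  .
p2  9  0  .  .  .
p3  3  7  0  .  .
p4  6  5  9  0  .
p5 11 10  2  8  0
;
run;

proc cluster data=dist5 method=complete nonorm outtree=tree;
   id id;                                /* agglomeration history for the dendrogram */
run;

proc tree data=tree nclusters=2 out=membership;   /* cut the dendrogram at 2 clusters */
   id id;
run;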
[Dendrograms: complete linkage vs. single linkage.]
How many clusters with agglomerative methods?
Cut the previous dendrogram with a horizontal line at a specific point; there is no prevailing method, however. E.g., 2 clusters here.
Next: comparison of cluster solutions (bar charts of variable means skipped).
Cubic clustering criterion (CCC; Sarle, 1983).
Assume the 3-cluster solution on the left and a reference distribution (right).
Reference distribution: a hyper-cube of uniformly distributed random points aligned with the principal components; the reference is typically such a hyper-cube. A heuristic formula calculates the error of the distance-based method for k = 1 up to the top # of clusters.
The CCC compares the clustering error at each k with the error expected under the reference distribution; the largest CCC ➔ the desired k. It fails when variables are highly correlated. The ABC method improves on CCC because it simulates multiple reference distributions instead of relying on the single heuristic used by CCC.
Example, and
Putting all
Methods together.
Fraud comparison of canonical discr. Vectors.
Notice missing ‘.’ cluster allocation.
Previous slides: HPCLUS (SAS proprietary cluster solution). The HPCLUS and k-means solutions are similar to each other and very different from the others.
How to compare? There is substantial disagreement, and since there is no initial (true) cluster membership, there is no basis for obtaining error rates. There are many proposed measures, such as the silhouette coefficient, the Adjusted Rand Index, etc.
Final issues:
The number of clusters can differ across methods.
The number of predictors, i.e., predictor selection, can also differ across methods.
Methods for Number of Clusters determination.
Ideal solution should minimize the “within cluster
variation” (WCV) and maximize the between cluster
variation (BCV). But WCV decreases and BCV increases with
increasing number of clusters.
Compromises:
CH index (Calinski et al., 1974):

$$CH(K) = \frac{BCV / (K - 1)}{WCV / (n - K)},$$

which is undefined for K = 1, i.e., the no-cluster case.
GAP statistic (Tibshirani et al., 2001): WCV ↓ as K ↑; evaluate the rate of decrease against uniformly distributed points.
Milligan and Cooper (1985) compared many methods; up until 1985, CH was best.
Applications
1) Marketing Segmentation and customer profiling,
even when supervised methods could be used.
2) Content based Recommender systems. E.g.:
recommend based on movie categories preferred. E.g.,
cluster movie database and recommend within clusters.
3) Establish hierarchy or evolutionary path of fossils in
archaeology/prehistory.
Especially in
marketing.
Clustering and Segmentation in Marketing (easily
Extrapolated to other applications).
Definition: Segmentation: “viewing a heterogeneous market as a number
of smaller homogeneous markets” (Wendell Smith, 1956).
Bad practices.
1) Segmentation is descriptive, not predictive. However, business
decisions made with eye to future (i.e., predictive). Business
decisions based on segmentation are subjective and inappropriate for
decision making, because segmentation only shows present
strengths and weaknesses of brand (in marketing research), but
doesn’t give and cannot give indications as to how to proceed.
2) CRM ISSUE: Segmentation assumes segment homogeneity, which
contradicts basic CRM tenet of customer segments of 1.
Clustering and Segmentation in Marketing
3) Competitors information and reactions are usually ignored at segment
level. When Coca-Cola analyzed introduction of sweeter drink, only
focused on Coca-Cola drinkers, forgetting customers’ perception of
Coca Cola image. About AT&T, just look as to where AT&T is after 2000,
after big mid-90 marketing failure based on segmentation, among other
horrors.
4) Segmentation always excludes significant numbers of real
prospects and conversely includes significant ones of non-
prospects. In typical marketing situation, best and worst customers are
easier to find, and the ones in between are non-easily classifiable. But
segmentation imposes such a classification, and users do not remind
themselves enough of the classification issues behind.
Clustering and Segmentation in Marketing
Really unfortunate bad practices.
1) Humans categorize information to make it into comprehensible
concepts. Thus, segments are typically labeled, and labels
become “the” segments, regardless of segment accuracy,
construction or stability of content, or changing market conditions.
Worse yet, could well be that segments do not properly exist but that
data derived clusters merely reflect normal variation (e.g., human
evolution studies area of conflict in this).
2) Segments thus constructed cannot foretell changing market
conditions, except once they have already taken place. Thus, you
either gained, lost or kept customers. No amount of labeling, re-
labeling or label tweaking can be basis of successful operation in
market place, since segments cannot predict behavior.
Clustering and Segmentation in Marketing.
Really unfortunate bad practices.
3) Segments also derived from attitudinal data. Attitudes of
customer base usually measured by way of survey opinions and/or
focus groups. Derived information (psychographics) not easy to
merge with created clusters from operational and demographic
information.
4) Immediate temptation is to view whether segments derived from
two very different sources have any affinity. This implies that it is
necessary to ‘score’ customer base with psycho-graphically derived
segments, in order to merge results. Accuracy of classification for
this application has been traditionally very low.
5) Better practice: encourage usage of original clusters based on
operational and demographic data as basis for obtaining psycho-
graphic information.
Clustering and Segmentation in Marketing
6) Finally, all models are based on available data. If aim is to
segment entire US population, and one feature is NY Times
readership (because that’s only subscription list available), useful
mostly in NorthEast, but not so much in Kansas probably. In
fact, it produces geographically based clustering, which may be
undesirable or unrecognized effect.
Good practice.
• It can be systematic way to enhance marketing creativity, if
possible.
Patting yourself on the back ➔
Important note on how to work: Confirmatory Bias.
Psychologists call ‘confirmatory bias’ the tendency to try to prove a new
idea correct instead of trying to prove it wrong.
This is a strong impediment to understanding randomness. Bacon
(1620) wrote: “the human understanding, once it has adopted an
opinion, collects any instances that confirm it, and though the
contrary instances may be more numerous and more weighty, it
either does not notice them or else rejects them, in order that this
opinion will remain unshaken.”
Thus, we confirm our stereotypes about minorities, for instance, by
focusing on events that prove our prior beliefs and dismissing opposing
ones. This seriously undermines the ability of experts to judge in
an unbiased fashion.
Thus, many times we see what we want to see. Instead, per Doyle’s
Sherlock Holmes: “One should always look for a possible alternative,
and provide against it.” (to prove your point; The Adventure of Black
Peter).
Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1389/8/2019
Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1399/8/2019
1. Cluster Analysis using the Jaccard Matching
Coefficient
2. Latent Class Analysis
3. CHAID analysis (class of tree methods, requires a
target variable).
4. Mutual Information and/or Entropy.
5. Multiple Correspondence Analysis MCA.
Not reviewed in this class.
Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1409/8/2019
Some Un-reviewed methods
Mean shift clustering: non-parametric mode-seeking algorithm.
Density based spatial clustering of applications with noise (DBSCAN)
BIRCH: balanced iterative reducing and clustering using hierarchies.
Gaussian Mixture
Power iteration clustering (PIC)
Latent Dirichlet allocation (LDA)
Bisecting k-means
Streaming k-means
Spectral Clustering
Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1419/8/2019
Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1429/8/2019
Many analytical methods require the presence of complete observations, that
is, if any feature/predictor/variable has a missing value, the entire observation is
not used in the analysis. For instance, regression analysis requires complete
observations. Failing to verify the completeness of a data set can lead to
serious error, especially if we rely only on univariate (UEDA) notions of missingness.
For instance, the table below shows a simulation in which, for a given
number of variables ‘p’ (100, 350, 600, 850) and number of observations in
the data set ‘n’ (1000, 1500, 2000), each variable has a probability of either 0.01
or 0.11 of being missing. A priori these probabilities seem too low to cause
much harm.
The table shows, however, that even for a modest ‘p’ = 100, at least 60.93% of
the observations have at least one missing value, and when ‘p’ reaches 350 almost
all observations do. When univariate missingness is 11%, every observation
contains at least one missing value.
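Not part of the original deck, but a quick way to see why: assuming, as in the simulation, that each of the p variables is missing independently with probability q, the expected share of observations with at least one missing value is 1 − (1 − q)^p. A minimal Python sketch:

```python
# Expected % of observations with at least one missing value, assuming each of
# p variables is missing independently with probability q.
for q in (0.01, 0.11):
    for p in (100, 350, 600, 850):
        pct_incomplete = 100 * (1 - (1 - q) ** p)
        print(f"q = {q:.2f}, p = {p:3d}: expected % incomplete = {pct_incomplete:6.2f}")
```

For p = 100 and q = 0.01 this gives roughly 63%, in line with the simulated 61–62% below; for p = 850, or for q = 0.11 at any p, it is essentially 100%.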
Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1439/8/2019
Missing value analysis
(# = observations with at least one missing value; % = share of the n observations)

Prob of    Num Vars in        Num Obs in Database (n)
missing    Database (p)       1000 (#)   1000 (%)    1500 (#)   1500 (%)    2000 (#)   2000 (%)
0.01            100                619      61.90         914      60.93        1224      61.20
                350                960      96.00        1449      96.60        1936      96.80
                600                998      99.80        1496      99.73        1995      99.75
                850               1000     100.00        1500     100.00        2000     100.00
0.11            100               1000     100.00        1500     100.00        2000     100.00
                350               1000     100.00        1500     100.00        2000     100.00
                600               1000     100.00        1500     100.00        2000     100.00
                850               1000     100.00        1500     100.00        2000     100.00
Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1449/8/2019
A more subtle complication arises when missingness is
not at random, unlike in the table above. That is, assume
that missingness in a variable of importance is related to its
own value, such as reported income being more likely to be
missing for high earners.
In this case, a study of occupation by income in which
observations with missing values are skipped would
provide a very distorted picture, because high earners’
occupations would be underrepresented.
In other cases, data bases are created by merging
different sources that were partially matched by some
key indicator that could be unreliable (e.g., customer
number) ➔ data-collection missingness.
Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1459/8/2019
Missing values taxonomy (Little and Rubin, 2002)
We looked at the importance of treatment of missing values in a
dataset. Now, let’s identify the reasons for occurrence of these
missing values. They may occur at two stages:
Data extraction: it is possible that there are problems with the extraction process. In
such cases, we should double-check for correct data with the data guardians. Some
hashing procedures can also be used to make sure the data extraction is correct.
Errors at the data extraction stage are typically easy to find and can be corrected
easily as well.
Data collection: these errors occur at the time of data collection and are
harder to correct. They can be categorized into three types:
Missing completely at random (MCAR): the probability that a value is
missing is the same for all observations. For example: respondents of a
data collection process decide whether to declare their earnings or weights after
tossing a fair coin. In this case, each observation has an equal chance of containing
a missing value.
Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1469/8/2019
Missing at random (MAR): a variable is missing at random
regardless of its underlying value, but missingness is induced by a
conditioning (observed) variable. For example: age is typically missing in higher
proportions for females than for males, regardless of the underlying age of
the individual. Thus, missingness is related only to the observed data.
Missing not at random (MNAR): the case of missing income above, that
is, a variable is missing due to its underlying value. It also covers
missingness that depends on unobserved predictors.
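A minimal Python sketch (with made-up data and probabilities, not from the deck) that simulates the three mechanisms; the variable names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000
df = pd.DataFrame({
    "sex": rng.choice(["F", "M"], size=n),
    "age": rng.normal(45, 12, size=n),
    "income": rng.lognormal(mean=10.8, sigma=0.6, size=n),
})

# MCAR: every observation has the same 10% chance of a missing income.
income_mcar = df["income"].mask(rng.random(n) < 0.10)

# MAR: age is missing more often for females (20%) than for males (5%);
# missingness depends only on an observed variable (sex), not on age itself.
p_mar = np.where(df["sex"] == "F", 0.20, 0.05)
age_mar = df["age"].mask(rng.random(n) < p_mar)

# MNAR: income is missing far more often when income itself is high.
p_mnar = np.where(df["income"] > df["income"].quantile(0.80), 0.40, 0.05)
income_mnar = df["income"].mask(rng.random(n) < p_mnar)

# The complete-case mean is nearly unbiased under MCAR but biased downward under MNAR.
print(df["income"].mean(), income_mcar.mean(), income_mnar.mean())
```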
Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1479/8/2019
Solving missingness in Data Bases.
Case deletion: it is of two types, list-wise deletion and pair-wise
deletion, and is appropriate in the MCAR case, because otherwise a
biased sample might result.
List-wise deletion removes all observations in which at least one
missing value occurs. As we saw above, the resulting sample size
could be seriously diminished.
Because of this disadvantage of list-wise deletion, pair-wise deletion
proceeds by analyzing all cases in which the variables of interest
are complete. Thus, if the interest centers on correlating variables A and B,
the analysis proceeds on those observations with
non-missing values of variables A and B, regardless of
missingness in other variables. If the study centers on different
pairs of variables, then different sample sizes may result.
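A minimal pandas sketch (toy numbers, assumed purely for illustration) contrasting the two deletion schemes:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1.0, 2.0, np.nan, 4.0, 5.0],
    "B": [2.1, np.nan, 3.3, 4.2, 5.1],
    "C": [np.nan, 1.0, 1.5, np.nan, 2.0],
})

# List-wise deletion: keep only fully observed rows (here only 1 of 5 rows survives).
listwise = df.dropna()

# Pair-wise deletion: each correlation uses all rows that are complete for that
# particular pair, so different entries may be based on different sample sizes.
pairwise_corr = df.corr()                 # pandas drops missing values pairwise
n_per_pair = df.notna().astype(int).T @ df.notna().astype(int)

print(listwise.shape)
print(pairwise_corr)
print(n_per_pair)                         # sample size behind each correlation
```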
Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1489/8/2019
Mean / mode / median imputation: imputation is a method to fill in the
missing values with estimated ones. The objective is to employ
known relationships that can be identified in the valid values of the
data set to assist in estimating the missing values. Mean / mode /
median imputation is one of the most frequently used methods. It
consists of replacing the missing data for a given attribute by the
mean or median (quantitative attribute) or mode (qualitative attribute)
of all known values of that variable. It can be of two types:
Generalized imputation: calculate the mean or median over
all the complete values of the variable and impute the
missing values with it.
Conditional imputation: if missingness is known to differ by a third
variable (or variables), obtain the mean/median/mode within the different
values of that variable and impute accordingly. Thus, in the case of missing age,
obtain the statistics separately for males and females, and impute.
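A minimal pandas sketch (hypothetical ages and sexes) of generalized vs. conditional mean imputation:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "sex": ["F", "F", "M", "M", "F", "M"],
    "age": [34.0, np.nan, 51.0, np.nan, 29.0, 47.0],
})

# Generalized imputation: one overall mean for every missing value.
df["age_generalized"] = df["age"].fillna(df["age"].mean())

# Conditional imputation: fill with the mean of the corresponding group (sex).
df["age_conditional"] = df["age"].fillna(df.groupby("sex")["age"].transform("mean"))

print(df)
```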
Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1499/8/2019
Prediction model:
A prediction model estimates values that will substitute for the missing
data. In this case, divide the data set into two sets: one with no
missing values for the variable in question and another with
missing values. The first data set becomes the training data set of the model,
while the second, with the missing values, is the test data set, and the
variable with missing values is treated as the target variable. Next,
create a model to predict the target variable based on the other attributes
of the training data set and populate the missing values of the test data set.
We can use regression, ANOVA, logistic regression and various other
modeling techniques to do this (a minimal sketch follows the two drawbacks below).
2 drawbacks:
▪ Model-estimated values are usually better behaved than the true
values, i.e., they have smaller variance.
▪ If there are no relationships between the other attributes in the data set and
the attribute with missing values, then the model will not be
precise for estimating the missing values.
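A minimal sketch of the idea (simulated data; a plain linear regression stands in for whatever model one prefers):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 500
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["y"] = 2 * df["x1"] - df["x2"] + rng.normal(scale=0.5, size=n)
df.loc[rng.random(n) < 0.2, "y"] = np.nan          # knock out ~20% of y

train = df[df["y"].notna()]                        # 'training' set: y observed
to_fill = df[df["y"].isna()]                       # 'test' set: y missing

model = LinearRegression().fit(train[["x1", "x2"]], train["y"])
df.loc[df["y"].isna(), "y"] = model.predict(to_fill[["x1", "x2"]])

# First drawback in action: the imputed values are fitted values, so the
# variance of y after imputation is smaller than that of the true, unobserved y.
print(df["y"].var())
```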
Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1509/8/2019
A more general drawback.
It is more likely that some information is MAR (lost data)
while other information is MNAR (high earners’ income), and this is very
difficult or impossible to identify. In practice, most
methods assume MAR or MCAR.
Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1519/8/2019
KNN imputation
In this method, the missing values of an observation are imputed using a
given number of other observations (neighbors) that are most similar to it;
the similarity of two observations is determined using a distance function
over the attributes (a minimal sketch follows the list). The method has
certain advantages and disadvantages.
Advantages:
k-nearest neighbors can predict both qualitative and quantitative
attributes.
Creation of a predictive model for each attribute with missing data is
not required.
Attributes with multiple missing values can be easily treated.
The correlation structure of the data is taken into consideration.
Disadvantages:
The KNN algorithm is very time-consuming on large databases: it
searches through the whole dataset looking for the most similar instances.
The choice of the k value is critical. A higher value of k includes neighbors
that are significantly different from the observation being imputed,
whereas a lower value of k ignores potentially informative neighbors.
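A minimal sketch using scikit-learn's KNNImputer (the data are simulated; scaling the variables before imputation is a practical suggestion, not something the deck prescribes):

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

rng = np.random.default_rng(2)
X = pd.DataFrame(rng.normal(size=(200, 4)), columns=list("abcd"))
X.loc[rng.random(200) < 0.15, "b"] = np.nan        # ~15% missing in one column

# Each missing value is replaced by the average of that column over the k most
# similar complete observations (similarity = Euclidean distance computed on
# the non-missing attributes). Scaling the variables first is usually advisable.
imputer = KNNImputer(n_neighbors=5)
X_imputed = pd.DataFrame(imputer.fit_transform(X), columns=X.columns)

print(X_imputed.isna().sum())                      # no missing values remain
```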
Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1529/8/2019
Dummy coding.
While not a solution in itself, it recognizes the existence of the missing value. For each
variable with a missing value anywhere in the data base, we create a dummy variable
with value 0 when the corresponding value is not missing, and 1 when it is missing.
The disadvantage is that the number of predictors can increase significantly. In some
contexts, researchers drop the variables with missing values and work with the dummies instead.
In most applied work, it is assumed that missingness is not of the MNAR type. In all
cases of imputation, we should note that the imputed values may shrink the variance
of the individual variables. Thus, it is appropriate to ‘contaminate’ these estimates
with a random component, for instance a normally distributed random error for a
continuous variable.
ISSUE
If we also want to transform the data, we must decide whether to transform first and then
impute, or to impute the raw data first and then work with the transformations.
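Referring to the dummy coding and variance ‘contamination’ above, a minimal sketch with toy incomes (values assumed for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
df = pd.DataFrame({"income": [52_000.0, np.nan, 61_500.0, np.nan, 48_200.0]})

# Dummy coding: 1 where the value is missing, 0 otherwise.
df["income_missing"] = df["income"].isna().astype(int)

# Mean-impute, then 'contaminate' the imputed values with normal noise so that
# imputation does not artificially shrink the variable's variance.
mean, std = df["income"].mean(), df["income"].std()
n_miss = int(df["income"].isna().sum())
df.loc[df["income"].isna(), "income"] = mean + rng.normal(scale=std, size=n_miss)

print(df)
```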
Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1539/8/2019
Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1549/8/2019
Modeling method based on trees (see the 4th lecture and come
back to this page).
Assume 3 variables have missing values, call them miss1,
miss2 and miss3, and denote by pct1, pct2, pct3 the % of
missing observations for each. Assume that pct1 < pct2 <
pct3.
Create trees and impute the ‘miss’ variables one at a time
in descending order of missingness, using each ‘miss’
variable as the dependent variable in a tree run.
It is also possible to use gradient boosting or random
forests instead of trees as the modeling tool.
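A minimal sketch of the sequential idea using scikit-learn decision trees (the helper tree_impute and the crude mean-fill of the predictors are illustrative assumptions, not the deck's exact procedure):

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

def tree_impute(df: pd.DataFrame, miss_cols: list[str]) -> pd.DataFrame:
    """Impute miss_cols one at a time with a regression tree; the columns are
    passed already sorted in the desired order of missingness."""
    df = df.copy()
    for col in miss_cols:
        predictors = [c for c in df.columns if c != col]
        known = df[col].notna()
        # The predictors themselves may have gaps; a crude mean-fill keeps this
        # sketch short (already-imputed 'miss' columns are used as-is).
        X = df[predictors].fillna(df[predictors].mean())
        tree = DecisionTreeRegressor(max_depth=4).fit(X[known], df.loc[known, col])
        df.loc[~known, col] = tree.predict(X[~known])
    return df

# Usage (hypothetical): imputed = tree_impute(data, ["miss1", "miss2", "miss3"])
```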
Leonardo Auslender –Ch. 1 Copyright 2004
Dataset ‘XXXX' with 8166 Obs

Variable   Model Name                    # Nonmiss Obs   % Missing      Mean   Std of Mean      Mode
Var_1      M1_INTERVAL_VARS                      1,430       82.49    64.681         3.545     3.990
Var_2      M1_INTERVAL_VARS                      8,166        0.00   136.898         2.329    49.990
Var_3      M1_INTERVAL_VARS                      5,014       38.60   520.282         9.850    49.990
           M1_IMPUTED_INTERVAL_VARS              8,166        0.00   447.642         6.156   273.053
Var_4      M1_INTERVAL_VARS                      3,741       54.19     0.261         0.029     0.000
           M1_IMPUTED_INTERVAL_VARS              8,166        0.00     0.135         0.014     0.000
Var_5      M1_INTERVAL_VARS                      4,358       46.63     0.207         0.022     0.000
           M1_IMPUTED_INTERVAL_VARS              8,166        0.00     0.122         0.012     0.000
Var_6      M1_INTERVAL_VARS                      7,344       10.07    11.770         0.849     0.000
           M1_IMPUTED_INTERVAL_VARS              8,166        0.00    10.823         0.765     0.000
Var_7      M1_INTERVAL_VARS                      8,164        0.02    12.207         0.246     5.000
           M1_IMPUTED_INTERVAL_VARS              8,166        0.00    12.207         0.246     5.000

Example: notice change in means and std (mean).
Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1569/8/2019
Distribution of imputed variables is the same as that of the original variables.
Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1579/8/2019
Simulating missingness at random in the Fraud data set.
Relatively low missingness. Imputed variables have statistics
similar to those of the original variables.
Basics and measures of centrality

Variable          Model Name                           # Nonmiss Obs   % Missing        Mean      Median        Mode
DOCTOR_VISITS     M1_TRN_INTERVAL_VARS                         5,960        0.00       8.941       8.000       9.000
MEMBER_DURATION   M1_TRN_INTERVAL_VARS                         5,851        1.83     179.757     178.000     180.000
                  M1_TRN_IMPUTED_TRN_INTERVAL_VARS             5,960        0.00     179.680     178.000     180.000
NO_CLAIMS         M1_TRN_INTERVAL_VARS                         5,875        1.43       0.404       0.000       0.000
                  M1_TRN_IMPUTED_TRN_INTERVAL_VARS             5,960        0.00       0.404       0.000       0.000
NUM_MEMBERS       M1_TRN_INTERVAL_VARS                         5,875        1.43       1.986       2.000       1.000
                  M1_TRN_IMPUTED_TRN_INTERVAL_VARS             5,960        0.00       1.985       2.000       1.000
OPTOM_PRESC       M1_TRN_INTERVAL_VARS                         5,960        0.00       1.170       1.000       0.000
TOTAL_SPEND       M1_TRN_INTERVAL_VARS                         5,960        0.00  18,607.970  16,300.000  15,000.000
Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1589/8/2019
Measures of Dispersion

Variable          Model Name                              Variance   Std Deviation   Std of Mean   Median Abs Dev   Nrmlzd MAD
DOCTOR_VISITS     M1_TRN_INTERVAL_VARS                       52.31            7.23          0.09             5.00         7.41
MEMBER_DURATION   M1_TRN_INTERVAL_VARS                     6,782.32          82.35          1.08            57.00        84.51
                  M1_TRN_IMPUTED_TRN_INTERVAL_VARS         6,674.76          81.70          1.06            56.00        83.03
NO_CLAIMS         M1_TRN_INTERVAL_VARS                         1.16           1.08          0.01             0.00         0.00
                  M1_TRN_IMPUTED_TRN_INTERVAL_VARS             1.15           1.07          0.01             0.00         0.00
NUM_MEMBERS       M1_TRN_INTERVAL_VARS                         1.00           1.00          0.01             1.00         1.48
                  M1_TRN_IMPUTED_TRN_INTERVAL_VARS             0.98           0.99          0.01             1.00         1.48
OPTOM_PRESC       M1_TRN_INTERVAL_VARS                         2.74           1.65          0.02             1.00         1.48
TOTAL_SPEND       M1_TRN_INTERVAL_VARS               125,607,617.29      11,207.48        145.17         6,000.00     8,895.60
Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1599/8/2019
Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1609/8/2019
Outliers and Variable Transformations.
Outlier and variable-transformation analysis are sometimes included as part
of EDA. Since both topics must be understood in the context of modeling a
dependent or target variable, we will only state some general issues.
It is wrongly asserted that the analyst should verify the existence of outliers and
then blindly remove them or impute them to more ‘accommodating’ values
without reference to the problem at hand. E.g., it is sometimes argued that a
registered data point for a man’s height of 8 feet must be wrong or an outlier,
except that there is historical evidence for that occurrence in antiquity. In
present times, there is a tendency to disregard income levels above, say, $50 million,
when the mean value in the sample is probably $50,000. However, extreme values
are real, and probably the most interesting.
Conversely, an age of 300 years or more is quite suspicious, unless we are
referring to a mummy. In cases when data points can and should be disputed
in reference to the model at hand, outliers can then be treated
as if they were missing values, most likely of the MNAR kind. Thus,
mean, median or mode imputation should not be considered the
immediate solution.
Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1619/8/2019
If we view the data bivariately, data points that otherwise would not
be considered outliers could be bivariate outliers. For instance, a
weight of 400 pounds and an age of 3 years are each possible when considered
univariately, but highly suspicious in one individual.
In the area of variable transformations, we already saw the
convenience of standardizing all variables in the case of principal
components and/or clustering. There are other cases when the
analyst implements single-variable transformations, such as taking
the log, which lowers the variance of the variable in question.
Again, it is important to reiterate that most information is not of uni-
variate but of multivariate importance. Further, there is no magic in
trying to obtain univariate or multivariate normality, as is sometimes thought
necessary for inference, since inference does not require that the variables
themselves be normally distributed.
Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1629/8/2019
Sources of outlierness:
• Data entry or data processing errors, such as errors in
programming when merging data from many different sources.
• Measurement error, due to a faulty measuring device, for instance.
• Experimental error. For example: in a 100m sprint of 7 runners, one
runner missed the ‘Go’ call and started late; his total run time can be
an outlier relative to the other runners.
• Intentional misreporting: adverse effects of pharmaceuticals are well
known to be under-reported.
• Sampling error, when information unrelated to the study at hand
is included; for instance, male individuals included in a study on
pregnancy.
Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1639/8/2019
Effects of Outliers.
Outliers may distort analysis, as is well known in the case of
linear and logistic regression. They also distort inference
since, by their very nature, they affect mean calculations, which
are the focus of inference in many instances. We will review
outlier detection while reviewing modeling methods.
There are ‘robust’ modeling methods that are (more)
impervious to outliers, such as robust regression. Tree
modeling methods and their derivatives are also largely impervious
to outliers (a small sketch follows).
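A minimal sketch (simulated data) of how a robust regression downweights a gross outlier that badly distorts ordinary least squares; HuberRegressor is one of several robust alternatives.

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

rng = np.random.default_rng(4)
x = np.linspace(0, 10, 100).reshape(-1, 1)
y = 3 * x.ravel() + rng.normal(scale=1.0, size=100)
y[-1] = 500.0                                    # one gross outlier at a high-leverage point

ols = LinearRegression().fit(x, y)
huber = HuberRegressor().fit(x, y)               # robust alternative to OLS

print("OLS slope:  ", round(ols.coef_[0], 2))    # pulled well away from the true 3
print("Huber slope:", round(huber.coef_[0], 2))  # stays close to 3
```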
Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1649/8/2019
Variable transformations and MEDA.
Raw data sets can have variables or features that are not directly
useful for analysis. For instance, age coded as “infant”, “teen”,
etc. does not convey the underlying ordering, and can be more easily
denoted by an ordered sequence of numbers. In a different
example, if we are studying volumes, variables that may affect them
might need to be raised to the third power to correspond to the cubic
nature of volumes. In short, variable transformation belongs in the
MEDA realm because we are interested in how the transformed
variable relates to others.
The topic of variable transformations is also called variable (or feature)
engineering, because different disciplines add more jargon to each
other.
Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1659/8/2019
Some prevailing practices in variable transformation (see previous
sections for more detail; a short sketch follows the list).
1) Standardization (as we saw in clustering and also principal
components): it does not alter the shape of the variable’s distribution but
rescales it to a common variance of 1.
2) Linearization via logs: if the underlying model is deemed to be
multiplicative, a log transformation turns it into an additive model.
Likewise for skewed distributions, as in the case of count variables.
And obviously, sometimes the required transformation is the square,
cube or square root of the original variable.
3) Binning: usually done by cutting the range of a continuous variable
into sub-ranges that are deemed to be uniform or more
representative, for instance in the case of age mentioned previously.
4) Dummying: used typically with a categorical variable, such as one
denoting color. Some modeling methods, such as regression-based
methods, require that if a categorical variable has k classes,
(k − 1) dummy (0/1) variables be derived. Tree methods do
not require this construction.
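A minimal pandas sketch (made-up values) of the four practices above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [32_000, 58_000, 41_000, 250_000, 75_000],
    "age":    [15, 34, 52, 68, 81],
    "color":  ["red", "blue", "green", "blue", "red"],
})

# 1) Standardization: rescale to mean 0 and variance 1; the shape is unchanged.
df["income_std"] = (df["income"] - df["income"].mean()) / df["income"].std()

# 2) Log transform for a skewed, multiplicative-looking variable.
df["income_log"] = np.log(df["income"])

# 3) Binning a continuous variable into ordered sub-ranges.
df["age_bin"] = pd.cut(df["age"], bins=[0, 18, 40, 65, 120],
                       labels=["minor", "young adult", "adult", "senior"])

# 4) Dummying: drop_first yields the (k - 1) dummies that regression-type
#    methods need; tree methods would not require this step.
dummies = pd.get_dummies(df["color"], prefix="color", drop_first=True)

print(pd.concat([df, dummies], axis=1))
```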
Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1669/8/2019
Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1679/8/2019
1. A baseball bat and a ball cost $1.10 together, and the bat costs $1 more
than the ball. What’s the cost of the ball?
2. In a 2007 blog, David Munger (as stated by Adrian Paenza of Pagina 12,
2012/12/21) proposes the following question: without thinking more than a
second, choose a number between 1 and 20 and ask friends to do the
same and tally all the answers. What is the distribution?
3. Three friends go out to dinner and the bill is $30. They each contribute $10
but the waiter brings back $5 in $1 dollar bills because there was an error
and the tab was just $25. They each take $1 and give $2 to the waiter. But
then, one of them says that they each paid $9 for a total of $27, plus the $2
tip to the waiter, which all adds up to $29 and not to the $30 that they
originally paid. Where is the missing dollar?
4. Explain the relevance of the central limit theorem to a class of freshmen in
the social sciences who barely have knowledge about statistics.
5. What can you say about statistical inference when sample is whole population?
6. What is the number of barber shops in NYC? (coined by Roberto Lopez of
Bed, Bath & Beyond, 2017).
Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1689/8/2019
7) In a very tiny city there are two cab companies, the Yellows (with 15
cars) and the Blacks (with 75 cars). The core of the problem is that there
was an accident during a drizzly night and that all cabs were on the streets.
A witness testifies that a yellow cab was guilty of the accident. The police
check his eyesight by showing him yellow and black cab pictures and he
identified them correctly 80% of the time. That is, in one case out of five he
confused a yellow cab for a black cab or vice-versa.
Knowing what we know so far, is it more likely that the cab involved in the
accident was yellow or black? The immediate unconditional answer (i.e.,
based on the direct evidence shown) is that there is 80% probability that the
cab was yellow.
State your reasoning, if any.
8) Can a random variable have infinite mean and/or variance?
9) State succinctly the differences between PCA, FA and clustering.
Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1699/8/2019
10) In WWII, bomber runs over Germany suffered many losses. As a
statistician for the government, your task is to recommend
improvements in aircraft armor, defense, strategic formation, etc.
The available data set comes from the damage suffered by the returning aircraft:
anti-aircraft damage, fighter plane damage, number of planes in formation,
etc.
State your recommendation once you have presented your line of
reasoning.
Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1709/8/2019
References
Bacon F. (1620): Novum Organum, XLVI.
Calinski T., Harabasz J. (1974): A dendrite method for cluster analysis, Communications in
Statistics.
Huang Z. (1998): Extensions to the k-Means Algorithm for Clustering Large
Data Sets with Categorical Values, Data Mining and Knowledge Discovery, 2.
Johnstone I. (2001): On the Distribution of the Largest Eigenvalue in Principal
Components Analysis, The Annals of Statistics, vol. 29, # 2.
Little R., Rubin D. (2002): Statistical Analysis with Missing Data, 2nd ed., Wiley.
MacQueen J. (1967): Some Methods for Classification and Analysis of
Multivariate Observations. In Proc. of Fifth Berkeley Symp. on Math. Statistics
and Probability, Vol. 1: Statistics, 281–297. Berkeley, Calif.: Univ. of California Press.
Milligan G., Cooper M. (1985): An examination of procedures for determining the
number of clusters in a data set, Psychometrika, 159–179.
Sarle W. (1983): Cubic Clustering Criterion, SAS Press.
Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1719/8/2019
References (cont. 1)
Song J., Shin S. (2018): Stability approach to selecting the number of principal
components, Computational Statistics, 33: 1923–1938.
Tibshirani R. et al. (2001): Estimating the number of clusters in a data set via the
gap statistic, J. R. Statist. Soc. B, pp. 411–423.
Leonardo Auslender –Ch. 1 Copyright 2004 Ch. 1.1-1729/8/2019
4 meda

  • 1. Leonardo Auslender –Ch. 1 Copyright 2004 Ch. 1.1-19/8/2019
  • 2. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-29/8/2019
  • 3. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-39/8/2019 Curse of Dimensionality in high dimensions. Assume cube in [0; 1], edges X, Y, Z. Volume = 1, and length (X) = length (Y) = length (Z) = 1. Take sub-cube of s = 10% observations ➔ expected edge length = s ** (1 / 3) = .46; 30% sub-cube .67, while edge length for each input = 1 ➔ must cover 46% of range of each input to capture 10% of data because 0.46 ** 3 is about 10%. Sampling density proportional to s ** (1 / p), p sample proportion. In Multivariate studies, typically add features to enhance study. The more we add, less representative the sample is. Most actual data point lie outside of sample larger number of variables. Note that individual variables cardinality typically maintained, it is multivariate aspect that is cursed. If n1 = 1000 is dense sample for X1, n10 = 1000 ** 10 is sample size required for same sampling density with 10 inputs. Sampling needs grow exponentially with dimensions. Next slide: example with 6 binary variables (X, Y, Z, T, U W) for overall population of 10,000 and then random samples of 5%, 10% and 30%. Note that that there are 2 ** 6 = 64 possible combinations of variable levels. Notice missing patterns in samples (partial output displayed due to space).
  • 4. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-49/8/2019 Curse of dimensionality N in pop % in pop N in 5% % in 5% N in 10% % in 10% N in 30% % in 30% x y z u v w 82 0.82 5 0.96 13 1.28 26 0.870 0 0 0 0 0 1 22 0.22 1 0.19 2 0.20 9 0.30 1 0 68 0.68 1 0.19 12 1.18 18 0.61 1 13 0.13 1 0.19 2 0.20 4 0.13 1 0 0 33 0.33 2 0.38 4 0.39 9 0.30 1 11 0.11 1 0.10 2 0.07 1 0 36 0.36 3 0.29 11 0.37 1 12 0.12 4 0.13 1 0 0 0 539 5.39 31 5.95 54 5.31 158 5.32 1 151 1.51 10 1.92 19 1.87 46 1.55 1 0 543 5.43 31 5.95 53 5.21 148 4.98 1 158 1.58 12 2.30 15 1.47 51 1.72 1 0 0 299 2.99 13 2.50 19 1.87 84 2.83 1 90 0.90 4 0.77 11 1.08 32 1.08 1 0 290 2.90 13 2.50 20 1.97 84 2.83 1 88 0.88 9 1.73 14 1.38 23 0.77 1 0 0 0 0 149 1.49 9 1.73 17 1.67 49 1.65 1 46 0.46 1 0.19 2 0.20 13 0.44 1 0 142 1.42 6 1.15 14 1.38 47 1.58 1 42 0.42 1 0.19 2 0.20 16 0.54 1 0 0 67 0.67 2 0.38 7 0.69 27 0.91 1 13 0.13 1 0.10 4 0.13 1 0 81 0.81 2 0.38 12 1.18 31 1.04 1 22 0.22 1 0.10 8 0.27 1 0 0 0 1318 13.18 56 10.75 115 11.31 372 12.52 1 309 3.09 16 3.07 41 4.03 91 3.06
  • 5. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-59/8/2019 9 binary variables, sampling 5%, 10%, 30% (100% total population). When # vars = 5 (in this example), % captured patterns in samples declines steadily.
  • 6. Leonardo Auslender –Ch. 1 Copyright 2004 Ch. 1.1-69/8/2019
  • 7. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-79/8/2019 Positive Definite Matrix and definitions. *** Symmetric matrix with all positive eigenvalues. In the case of covariance and correlation matrices (that are symmetrical), all eigenvalues are real numbers. Correlation and covariance matrices must have positive eigenvalues, otherwise they are not of full rank ➔ there are perfectly linear dependencies among the variables. For X data matrix of predictors, sample covariance = X’X = v. Generalized sample variance = det (v) (not much used). Since vars in X have different scales, could use instead correlation matrix, i.e., det (R).
  • 8. Leonardo Auslender –Ch. 1 Copyright 2004 Ch. 1.1-89/8/2019 Principal Components Analysis (PCA) Technique for forming new variables from (typically) large ‘p’ data set, which are linear composites of the original. Variables. Aim is to reduce dimension (‘p’) of the data set while minimizing the amount of information lost if we do not choose all the composites. Number of composites = number of original variables ➔ problem of composite selection. The other dimension of the data, ‘n’, is reduced via cluster analysis, presented later on.
  • 9. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-99/8/2019 Web example: 12 observations.
  • 10. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-109/8/2019 23.091 is variance of X1 …. Note: total variance also called “overall “ or “summative” variance, multivariate or total variability: It is the sum of the variance of the individual variables.
  • 11. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-119/8/2019 Let’s create Xnew arbitrarily from x1 and x2.
  • 12. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-129/8/2019
  • 13. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-139/8/2019 Play with the angle to maximize the fitted variance.
  • 14. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-149/8/2019
  • 15. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-159/8/2019 Geometric Interpretation: Note that in order to find Xnew, we have rotated X1 and X2. Xnew accounts for 87.31% of the total variation. Ergo, possible to estimate a new vector, called second eigenvector or principal component that accounts for Variation not fit by first vector, xnew. Axis of second vector is orthogonal to first one ➔ uncorrelated. Derivation of new axes or vars are called principal components, which refer to the weights by which Original variables are multiplied and then summed up, and their values are PC scores.
  • 16. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-169/8/2019 Variance of xnew, 38.576 is called first eigenvalue. Estimates of the eigenvalues provides measures of the amount of the original total variation fit by each of the new derived variables. The sum of all the eigenvalues equals the sum of the variances of the original variables. PCA rearranges the variance in the original variables so it is concentrated in the first few new components. Debatable whether binary variables can be used in PCA because binary variables do not have true ‘origin’. Thus, its variance in [0, .25] and mean lies in [0, 1], while other variables can have far larger variances and means.
  • 17. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-179/8/2019 Eigenvectors (characteristic vectors) Eigenvectors are lists of coefficients or weights (cj) showing how much each original variable contributes to each new derived variable, or eigenvector. The eigenvectors are usually scaled so that the sum of squared coefficients for each eigenvector equals one. Eigenvectors are orthogonal Analysis can be done by decomposing covariance (as in example) or correlation. Covariance keeps original units and method tends to fit variables with higher variances.
  • 18. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-18
  • 19. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-199/8/2019 Important Message N 1 Note to clarify Variable names in the table below: 1 . 1 If the model name ends up in 'ORIGINAL', Variable values are of the form 1 'variable name'_'PrinX_', where X denotes an eigenvalue number. 1 If in addition, the model name ends up in COV_ORIGINAL, 1 the corresponding variance to 'variable_name'_prinx_ is in column prinX. 1 And for values other than X, the columns represent the covariance. 1 .. 1 If the model name ends up in CORR_ORIGINAL, then the prinX columns denote 1 correlations. 1 ... 1 If the model name ends up in COV_PCA, the principal components 1 were obtained from the covariance matrix. If in CORR_PCA, 1 obviously correlations. 1 .... 1 For space reasons, a maximum of 6 Prin variables is shown. 1
  • 20. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-209/8/2019 Var/Cov/Corr before and after PCA prin1 prin2 Model Name Variable 21.091 16.455 M2_TRN_PCOMP_COV_ORIGI NAL x2_prin1_ x1_prin2_ 16.455 23.091 Ovl. Var 44.182 44.182 M2_TRN_PCOMP_COV_PCA Prin1 38.576 0.000 Prin2 0.000 5.606 Ovl Var 44.182 44.182 M5_TRN_PCOMP_CORR_ORIG INAL x2_prin1_ 1.000 0.746 x1_prin2_ 0.746 1.000 Ovl. Var 2.000 2.000 M5_TRN_PCOMP_CORR_PCA Prin1 1.000 0.000 Prin2 0.000 1.000 Ovl Var 2.000 2.000
  • 21. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-219/8/2019 From previous slide (cov_original and cov_PCA) X2_prin1_ means X2 variance appears in column prin1… Ovl. Var is sum of diagonal elements. Prin1 is first eigenvector. Etc. Notice that raw units were used in PCA in this case ➔ Variable with larger variance (X1) has more influence in results. Notice that overall variance (44.182) is same for original case as for PCA results. Dimension reduction refers to selecting # of Principal components that fits up to specific percentage of overall variance. But original variables are no longer useful. First pcomp fits 87.3% (38.576 * 100 / 44.182) of the overall variance, etc. In the available data, PCA finds direction or dimension with the largest variance out of the overall variance, which is 38.576.
  • 22. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-229/8/2019 Then, orthogonal to the first direction, finds direction of largest variance of whatever is left of overall variance, i.e., 44.182 – 38.576, which in our simple example is 5.606. CORR_Original and CORR_PCA Correlation based PCA. Notice that Prin1, etc Variance and covariance are different. Notice orthogonality (in previous slide also) between prin1 and prin2 given by the 0 covariances. Also, PCA can be performed on [0; 1] rescaled data via covariance (not shown).
  • 23. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-239/8/2019
  • 24. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-249/8/2019
  • 25. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-259/8/2019 Aim: find projections to summarize (mean centered) data. Approaches. 1) Find projections/vectors of maximum total variance. 2) Find projections with smallest avg (mean-squared) distance between original and projections, which is equivalent to 1). Thus, maximize the variance by choosing ‘w’ (w is the vector of coeffs, x the original data matrix), where variance is given by:
  • 26. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-269/8/2019 To maximize variance fitted by component w, requires w to be a unit vector, and thus w’w = 1 as constraint. Thus maximize with constraint, i.e., Lagrange multiplier method. 2 ( , ) ( ' 1) ' 1 2 2 wL w w w L w w L vw w w          − − = − = − And setting them to 0, we obtain:
  • 27. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-279/8/2019 ' 1 0 w w vw w vw w  = =  − = Thus, ‘w’ is eigenvector of v & maximizing w is the one associated with largest eigenvalue. Since v (= x’x) is (p, p), there are at most p eigenvalues. Since v is covariance ➔ it is symmetric ➔ all eigenvectors orthogonal to each other. Since v positive matrix ➔ all eigenvalues > 0.
  • 28. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-289/8/2019 While these principal factors represent or replace one or more of the original variables, they are not just a one-to-one transformation, ➔ Inverse transformations are not possible. NB: can obtain PCA without w’w = 1 constraint, but then standard PCA interpretation is not true.
  • 29. Leonardo Auslender –Ch. 1 Copyright 2004 Ch. 1.1-299/8/2019 Detour on Eigenvalues, etc. Let A (n,n) , v (n, 1),  scalar. Note that A is not the typical rectangular data set but a square matrix, for instance, a covariance or correlation matrix. Problem: find  / A v =  v has nonzero solution. Note that A v is a vector, for instance, the estimated predictions of a linear regression. (For us, A is data, v is coefficients,  v linear transformation of coefficients).  called eigenvalue if nonzero vector v exists that satisfies equation. Since v  0 ➔ |A -  I| = 0 ➔ equation of degree n in , determines values for  (notice that roots of equation could be complex).
  • 30. Leonardo Auslender –Ch. 1 Copyright 2004 Ch. 1.1-309/8/2019 Diagonalization. Matrix A diagonalizable  has n distinct eigenvalues. Then S (n,n) is diagonalizing matrix, with eigenvectors of A as elements, and D is diagonal matrix with eigenvalues of A as its elements. S–1AS = D ➔ A = SDS–1, and A2 = (SDS–1) (SDS–1) = SD2S–1 Ak = (SDS–1) …. (SDS–1) = SDkS–1 Example: 30% of married women get divorced; 20% of single get married each year. 8000 M and 2000 S, and constant population. Find number of M and S in 5 years. v = (8000, 2000)’; A = {0.7 0.2 0.3 0.8} Eigenvalues = 1; .05. Eigenvectors: v1 = (2; 3)’ v2 = (1; -1)’ ,,,,,,,, A5 = SDS–1 = (4125, 5875)’ As k → , eigenvalues → (1; 0) ➔ A  → (4000; 6000)’
  • 31. Leonardo Auslender –Ch. 1 Copyright 2004 Ch. 1.1-319/8/2019 Detour on Eigenvalues, etc (cont. 2). Singular Value Decomposition (SVD) Notice that only square matrices can be diagonalized. Typical data sets, however, are rectangular. SVD provides necessary link. A (m,n), m  n ➔ A = UV’, U(m, m) orthogonal matrix (its columns are eigenvectors of AA’) (AA’ = U  V’V  U’ = U 2U’) V(n, n) orthogonal matrix (its columns are eigenvectors of A’A) (A’A= V  U’U  V’ = V 2V’)  (m,n) = diagonal ( 1, 0) , 1 = diag( 1  2 ….  n)  0. ’s called singular values of A. 2’s are eigenvalues of A’A. U and V: left and right singular matrices (or matrices of singular vectors).
  • 32. Leonardo Auslender –Ch. 1 Copyright 2004 Ch. 1.1-329/8/2019 Principal Components Analysis (PCA): Dimension Reduction for interval-measure variables (for dummy variables, replace Pearson correlations by polychoric correlations that assume underlying latent variables. Continuous-dummy correlations are fine). PCA creates linear combinations of original set of variables which explain largest amount of variation. First principal component explains largest amount of variation in original set; second one explains second largest amount of variation subject to being orthogonal to first one, etc. X2 X1 X3 PC1 PC2PC3 PCi=V1i*X1+ V2i *X2+V3i*X3
  • 33. Leonardo Auslender –Ch. 1 Copyright 2004 Ch. 1.1-339/8/2019 PC scores for each observation created by product of X and V, the set of eigenvectors.                        === === === p 1i niip p 1i ni2i p 1i ni1i p 1i i2ip p 1i i22i p 1i i21i p 1i i1ip p 1i i12i p 1i i11i XVXVXV XVXVXV XVXVXV    XV = SVD of Covariance/Correlation Matrix = USVT Covariance/Correlation Matrix
  • 34. Leonardo Auslender –Ch. 1 Copyright 2004 Ch. 1.1-349/8/2019 PCA computed by performing SVD/Eigenvalue Decomp. on covariance or correlation matrix. Eigenvalues and associated eigenvectors extracted from covariance matrix ‘sequentially’. Each successive eigenvalue is smaller (in absolute value), and each associated eigenvector is orthogonal to previous one. X = Covariance/Correlation Matrix SVD of Covariance/Correlation Matrix = USVT
  • 35. Leonardo Auslender –Ch. 1 Copyright 2004 Ch. 1.1-359/8/2019 Amount of variation fitted by first k principal components can be computed in following way. i are eigenvalues of covariance/correlation matrix.  2 =                         p k 2 1 000 000 000 000       % Variation fitted = %100p 1j j k 1i i      = =
  • 36. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-369/8/2019
  • 37. Leonardo Auslender –Ch. 1 Copyright 2004 Ch. 1.1-379/8/2019 Covariance or Correlation Matrix derivation? Overlooked point: results are different. Correlation matrix is the covariance matrix of the same data but in standardized form. Assume 3 variables x1 through x3. If Var(x1) = k (var (x2) + var(x3)) for large k, then x1 will dominate the first eigenvalue and the others would be negligible. Standardization implicit in correlation matrix treats all variables equally, because of unitary variance of each one. Recommendation: depends on focus of study, similar problem in clustering: outliers can badly affect standard deviation and mean estimations ➔ standardized variables do not reflect behavior of original variable.
  • 38. Leonardo Auslender –Ch. 1 Copyright 2004 Ch. 1.1-389/8/2019 SAS Proc princomp data = &indata. Cov out = outp_princomp; Var doctor_visits fraud member_duration no_claims optom_presc total_spend; Run; Proc corr data = outp_princomp; Var prin1 prin2 prin3 doctor_visits fraud member_duration no_claims optom_presc total_spend; Run;
  • 39. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-399/8/2019
  • 40. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-409/8/2019 Var/Cov/Corr before and after PCA prin1 prin2 prin3 prin4 prin5 prin6 Model Name Variable 1.236 0.001 0.394 2.550 0.139 -278.0M2_TRN_PCOMP_COV_ORI GINAL no_claims_prin1_ num_members_prin2_ 0.001 0.994 0.115 -0.501 -0.024 -163.5 doctor_visits_prin3_ 0.394 0.115 48.888 115.04 -0.859 8399.9 member_duration_prin4_ 2.550 -0.501 115.04 6493.3 -12.79 93386 optom_presc_prin5_ 0.139 -0.024 -0.859 -12.79 2.749 726.61 total_spend_prin6_ -278.0 -163.5 8399.9 93386 726.61 1.3E8 Ovl. Var 1.3E8 1.3E8 1.3E8 1.3E8 1.3E8 1.3E8 M2_TRN_PCOMP_COV_PCA Prin1 1.3E8 0.000 0.000 0.000 0.000 0.000 Prin2 0.000 6427.9 0.000 0.000 0.000 0.000 Prin3 0.000 0.000 46.495 0.000 0.000 0.000 Prin4 0.000 0.000 0.000 2.723 0.000 0.000 Prin5 0.000 0.000 0.000 0.000 1.216 0.000 Prin6 0.000 0.000 0.000 0.000 0.000 0.993 Ovl Var 1.3E8 1.3E8 1.3E8 1.3E8 1.3E8 1.3E8 M5_TRN_PCOMP_CORR_O RIGINAL no_claims_prin1_ 1.000 0.001 0.051 0.028 0.075 -0.022 num_members_prin2_ 0.001 1.000 0.016 -0.006 -0.015 -0.014 doctor_visits_prin3_ 0.051 0.016 1.000 0.204 -0.074 0.106 member_duration_prin4_ 0.028 -0.006 0.204 1.000 -0.096 0.102 optom_presc_prin5_ 0.075 -0.015 -0.074 -0.096 1.000 0.039 total_spend_prin6_ -0.022 -0.014 0.106 0.102 0.039 1.000 Ovl. Var 6.000 6.000 6.000 6.000 6.000 6.000 M5_TRN_PCOMP_CORR_PC A Prin1 1.000 0.000 0.000 0.000 0.000 0.000 Prin2 0.000 1.000 0.000 0.000 0.000 0.000 Prin3 0.000 0.000 1.000 0.000 0.000 0.000 Prin4 0.000 0.000 0.000 1.000 0.000 0.000 Prin5 0.000 0.000 0.000 0.000 1.000 0.000 Prin6 0.000 0.000 0.000 0.000 0.000 1.000 Ovl Var 6.000 6.000 6.000 6.000 6.000 6.000
  • 41. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-419/8/2019 In the covariance-based run “M2_TRN_PCOMP_COV_ORIGINAL”, “total_spend” has a far larger variance than the other variables, so the entire PCA is dominated by it. The eigenvalues are the variances of the corresponding principal components (eigenvectors). PCA results differ between the covariance and correlation runs; the correlation-based PCA results follow below.
  • 42. Leonardo Auslender –Ch. 1 Copyright 2004 Five components fit 87% of total variation. Eigenvalue k is the variance of principal component k. An eigenvalue is also called a characteristic root.
  • 43. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-439/8/2019 Eigenvalues table

Number  Model name            Eigenvalue  Difference  Proportion  Cumulative
1       M1_PCOMP_NO_S_COV          1.30        0.22        0.22        0.22
1       M1_PCOMP_STAN_CORR         1.30        0.22        0.22        0.22
2       M1_PCOMP_NO_S_COV          1.07        0.05        0.18        0.39
2       M1_PCOMP_STAN_CORR         1.07        0.05        0.18        0.39
3       M1_PCOMP_NO_S_COV          1.02        0.02        0.17        0.56
3       M1_PCOMP_STAN_CORR         1.02        0.02        0.17        0.56
4       M1_PCOMP_NO_S_COV          1.00        0.17        0.17        0.73
4       M1_PCOMP_STAN_CORR         1.00        0.17        0.17        0.73
5       M1_PCOMP_NO_S_COV          0.82        0.03        0.14        0.87
5       M1_PCOMP_STAN_CORR         0.82        0.03        0.14        0.87
6       M1_PCOMP_NO_S_COV          0.79         .          0.13        1.00
6       M1_PCOMP_STAN_CORR         0.79         .          0.13        1.00
All     (both models)             12.00        1.00        2.00        7.55
  • 44. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-449/8/2019 The scree plot indicates ‘4’ components. Scree plot: a plot of the eigenvalues vs. the component number, looking for an obvious break or elbow.
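A minimal sketch of how a scree plot can be requested directly (assumes ODS Graphics is available and that PLOTS=SCREE is supported by the installed PROC PRINCOMP release; this is an illustration, not the run behind the slide):
ods graphics on;
proc princomp data = &indata. plots = scree;
   var doctor_visits member_duration no_claims optom_presc total_spend;
run;
ods graphics off;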
  • 45. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-459/8/2019
  • 46. Leonardo Auslender –Ch. 1 Copyright 2004 Ch. 1.1-469/8/2019 Principal component 1 is mostly fitted by doctor_visits and member_duration. Component # 2 (which fits the residuals from step 1) by no_claims and optom_presc, etc.
  • 47. Leonardo Auslender –Ch. 1 Copyright 2004 Ch. 1.1-479/8/2019
  • 48. Leonardo Auslender –Ch. 1 Copyright 2004 Ch. 1.1-489/8/2019 Data set used is Home Equity Loan (HMEQ). Variables (all continuous except BAD, JOB and REASON) are:
BAD (binary target) - Default or seriously delinquent
CLAGE - Age of oldest trade line in months
CLNO - Number of trade (credit) lines
DEBTINC - Debt to income ratio
DELINQ - Number of delinquent trade lines
DEROG - Number of major derogatory reports
JOB - Prof/exec, sales, manager, office, self, or other
LOAN - Amount of current loan request
MORTDUE - Amount due on existing mortgage
NINQ - Number of recent credit inquiries
REASON - Home improvement or debt consolidation
VALUE - Value of current property
YOJ - Years on current job
  • 49. Leonardo Auslender –Ch. 1 Copyright 2004 Ch. 1.1-499/8/2019 Variables used in the PCA are measured on an interval scale:
LOAN - Amount of current loan request
MORTDUE - Amount due on existing mortgage
VALUE - Value of current property
YOJ - Years on current job
CLAGE - Age of oldest trade line in months
NINQ - Number of recent credit inquiries
CLNO - Number of trade (credit) lines
DEBTINC - Debt to income ratio
  • 50. Leonardo Auslender –Ch. 1 Copyright 2004 Ch. 1.1-509/8/2019 The eigenvalues report indicates that the first four principal components fit 70.77% of the variation of the original variables.
  • 51. Leonardo Auslender –Ch. 1 Copyright 2004 Ch. 1.1-519/8/2019 The first principal component score for each observation is created by the following linear combination: PC1 = .3179*LOAN + .6005*MORTDUE + .6054*VALUE + .0141*YOJ + .1827*CLAGE + .0606*NINQ + .3314*CLNO + .1574*DEBTINC. The eigenvectors report contains the coefficients (loadings) associated with each of the original variables for the first four principal components.
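As an illustration only (not the author's code), the PC1 score can be reproduced by hand by applying the eigenvector coefficients above to the standardized inputs; data set names (hmeq, hmeq_std, hmeq_scores) are hypothetical, and standardization is assumed because the PCA was run on interval variables via the correlation matrix:
/* Standardize the inputs to mean 0, std 1. */
proc stdize data = hmeq out = hmeq_std method = std;
   var LOAN MORTDUE VALUE YOJ CLAGE NINQ CLNO DEBTINC;
run;

/* Apply the first eigenvector to obtain the PC1 score. */
data hmeq_scores;
   set hmeq_std;
   PC1 = 0.3179*LOAN + 0.6005*MORTDUE + 0.6054*VALUE + 0.0141*YOJ
       + 0.1827*CLAGE + 0.0606*NINQ  + 0.3314*CLNO  + 0.1574*DEBTINC;
run;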
  • 52. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-529/8/2019 At this stage, it is customary to try to interpret the eigenvectors in terms of the original variables. The first vector has high relative loads on MORTDUE, VALUE and CLNO, which indicates a dimension of financial stress (remember that there is no dependent variable, i.e., BAD does not play a role). Given “financial stress”, the second vector is a measure of “time effects” based on YOJ and CLAGE. And so on for the third and fourth vectors. Notice that the interpretation is based on the magnitude of the coefficients, without any guideline as to what constitutes a high relative load. Therefore, with a large number of variables, interpretation is more difficult because the loads do not necessarily distinguish themselves as high or low. In the next table, conditioning on VALUE and MORTDUE hardly affects the correlation between YOJ and CLAGE. Note that a full analysis would require 2nd-order partial correlations (not done here).
  • 53. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-539/8/2019 1st component.
  • 54. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-549/8/2019 Correlations: zero-order, partial and semipartial (model M1).

Pair            Conditioning var   Zero-order   Partial   Semipartial
YOJ - CLAGE     (none)                 0.202
                CLNO                               0.225       0.221
                MORTDUE                            0.236       0.231
                VALUE                              0.226       0.221
YOJ - CLNO      (none)                 0.025
                MORTDUE                            0.042       0.042
                VALUE                              0.013       0.013
YOJ - MORTDUE   (none)                -0.088
                CLNO                              -0.095      -0.095
                VALUE                             -0.162      -0.162
YOJ - VALUE     (none)                 0.008
                CLNO                              -0.010      -0.010
                MORTDUE                            0.138       0.138
YOJ - YOJ       (none)                 1.000
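A hedged sketch of how zero-order and first-order partial correlations like these can be obtained (the data set name hmeq is assumed; the PARTIAL statement conditions on the listed variable; semipartial correlations are not shown):
/* Zero-order correlation between YOJ and CLAGE. */
proc corr data = hmeq;
   var YOJ CLAGE;
run;

/* First-order partial correlation of YOJ and CLAGE, conditioning on VALUE. */
proc corr data = hmeq;
   var YOJ CLAGE;
   partial VALUE;
run;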
  • 55. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-559/8/2019 Interpretation: PC1: high loadings for MORTDUE, VALUE and CLNO ➔ a financial aspect? PC2: given PC1, YOJ and CLAGE ➔ a time aspect?, etc. Note: it would be nice to have inference on component loadings; when p is large, this is very difficult. Also, when looking at PC2 for interpretation, it is imperative to first remove the effects of the first component from all variables before looking at correlations.
  • 56. Leonardo Auslender –Ch. 1 Copyright 2004 Ch. 1.1-569/8/2019 [Figure: regressing Y on X1 and X2 vs. regressing Y on PC1 and PC2.] PCA advantage: co-linearity is removed when regressing on the principal components, which is called Principal Components Regression (PCR).
  • 57. Leonardo Auslender –Ch. 1 Copyright 2004 Ch. 1.1-579/8/2019 Principal Components Regression. 1) The resulting model still contains all the original variables. 2) Similar to ridge regression, but with truncation (due to the choice of vectors) instead of ridge's shrinkage. 3) “Look where there's light” fallacy: we are no longer looking at the original information.
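A minimal two-step sketch of PCR under these caveats (data set names are hypothetical): score the leading principal components, then regress the target on them.
/* Step 1: principal components of the predictors (correlation matrix by default). */
proc princomp data = hmeq out = hmeq_pcs n = 4;
   var LOAN MORTDUE VALUE YOJ CLAGE NINQ CLNO DEBTINC;
run;

/* Step 2: regress the target on the retained component scores. */
proc reg data = hmeq_pcs;
   model BAD = Prin1 Prin2 Prin3 Prin4;
run;
quit;
Since BAD is binary, PROC LOGISTIC on the same component scores would usually be preferred; PROC REG is shown only to keep the sketch short, and PROC PLS with METHOD=PCR is another route.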
  • 58. Leonardo Auslender –Ch. 1 Copyright 2004 Ch. 1.1-589/8/2019
  • 59. Leonardo Auslender –Ch. 1 Copyright 2004 Ch. 1.1-599/8/2019 Discussion on Principal components (PCs).
• The dependent variable is not used ➔ no selection bias (i.e., the dep var does not affect PCA, which is ‘good’).
• Very often PCs are not interpretable in terms of the original variables.
• The dependent variable is not necessarily highly correlated with the vectors corresponding to the largest eigenvalues (in a variable-selection context, the tendency to select the eigenvectors related to the top eigenvalues is unwarranted).
• Sometimes the most highly correlated vector corresponds to a smaller eigenvalue.
• May be impossible to implement on present tera/giga-scale databases.
• If there is an error component in the data, PC chooses too many components.
  • 60. Leonardo Auslender –Ch. 1 Copyright 2004 Ch. 1.1-609/8/2019 Discussion on Principal components (PCs) (cont. 1). • As an alternative to ‘ad-hoc’ PC selection, there is inference on eigenvalues; see Johnstone (2001). • Common practice: choose the eigenvectors corresponding to high eigenvalues ➔ a vector-selection problem in addition to the third point on the previous page (Ferre (1995) argues most methods fail; for a newer version, see Guo et al. (2002), and Minka (2000) for a Bayesian perspective). Foucart (2000) provides a framework for “dropping” principal components in regression. For robust calculation, see Higuchi and Eguchi (2004). Li et al. (2002) analyze L1 for principal components. Song et al. (2018) obtain the optimal number based on a stability approach.
  • 61. Leonardo Auslender –Ch. 1 Copyright 2004 Ch. 1.1-619/8/2019 INTERVIEW Question: Our data is mostly binaries. PCA?
  • 62. Leonardo Auslender –Ch. 1 Copyright 2004 Ch. 1.1-629/8/2019
  • 63. Leonardo Auslender –Ch. 1 Copyright 2004 Ch. 1.1-639/8/2019 Factor Analysis. A family of statistical techniques to reduce variables to a small number of latent factors. Main assumption: existence of unobserved latent variables or factors among the variables. If the factors are partialled out of the observed vars, the partial corrs among the remaining variables should be zero (to be reviewed in BEDA). Each observed variable can be expressed as a weighted sum of latent components:
\[ y_i = a_{i1} f_1 + a_{i2} f_2 + \cdots + a_{ik} f_k + e_i. \]
For instance, the concept of frailty can be ascertained by testing strength, weight, speed, agility, balance, etc.; we want to explain the component of frailty in terms of these measures. Very popular in the social sciences, such as psychology, survey analysis, sociology, etc. The idea is that any correlation between a pair of observed variables can be explained in terms of their relationship with the latent variables. FA as a generic term includes PCA, but they have different assumptions.
  • 64. Leonardo Auslender –Ch. 1 Copyright 2004 Ch. 1.1-649/8/2019 Differences between FA and PCA. The difference is in the definition of the variance of the variable to be analyzed. The variance of a variable can be decomposed into common variance, shared with other variables and thus caused by the values of the latent constructs, and unique variance, which includes an error component; the unique part is unrelated to any latent construct. “Common” factor analysis (FA; CFA notation is used for confirmatory FA later, EFA for exploratory) analyzes only the common variance, while PCA considers total variance without distinguishing common from unique. In FA, the factors account for the inter-correlations among the variables, to identify latent dimensions; in PCA, we account for the maximum portion of variation in the original set of variables. FA uses a notion of causality; PCA is free of that. PCA is better when the variables are measured relatively error-free (age, nationality, etc.). If the variables are only indicators of latent constructs (test scores, responses to attitude scales, or surveys of aptitudes) ➔ CFA. PCs: composite variables computed from linear combinations of the measured variables. CFs: linear combinations of the “common” parts of the measured variables that capture underlying constructs.
  • 65. Leonardo Auslender –Ch. 1 Copyright 2004 Ch. 1.1-659/8/2019 EFA Rotations. An infinite number of solutions that reproduce the same correlation matrix is possible; rotating the reference axes of the factor solution simplifies the factor structure and can achieve a more meaningful and interpretable solution. IDEA BEHIND: rotate the factors simultaneously so as to have as many zero loadings on each factor as possible. “Meaningful” and “interpretable” demand the analyst's expertise. Orthogonal rotation: the angles between the reference axes of the factors are maintained at 90 degrees; oblique rotations do not (used when the factors are assumed to be correlated). In the FA case, negative eigenvalues ➔ the covariance matrix is NOT positive definite ➔ the cumulative fitted variation proportion can exceed 1. Note that PCA is not affected.
  • 66. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-669/8/2019
  • 67. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-679/8/2019 Assume 10 variables that we view in a 2-factor space (Y and X axes). Each dot below is one observation. An orthogonal rotation (i.e., one that assumes the factors are uncorrelated) gets points closer to one of the axes (and away from the other). From theanalysisfactor.com
  • 68. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-689/8/2019 If the variables are correlated (say education and income level), oblique rotations (axes at less than 90°) create a better fit. From theanalysisfactor.com
  • 69. Leonardo Auslender –Ch. 1 Copyright 2004 Ch. 1.1-699/8/2019 Exploratory FA. PCA gives unique solution, FA different solutions depending on method & estimates of communality. While PCA analyzes Corr (cov) matrix, FA replaces main diagonal corrs by prior communality estimates: estimate of proportion of variance of the variable that is both error-free and shared with other variables in matrix (there are many methods to find estimates). Determining optimal # factors: ultimately subjective. Some methods: Kaiser-Guttman rule, % variance, scree test, size of residuals, and interpretability. Kaiser-Guttman: eigenvalues >= 1. % variance of sum of communalities fitted by successive factors. Scree test: plots rate of decline of successive eigenvalues. Analysis of residuals: Predicted corr matrix similar to original corr. Possibly, huge graphical output.
  • 70. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-709/8/2019 Differences between FA and PCA, communalities. PCA analyzes the original corr matrix with ‘1’ in the main diagonal, i.e., total variance. FA analyzes communalities, given by the common variance. The main diagonal of the corr matrix is then replaced, with options (SAS PRIORS=):
ASMC: sets the prior communality estimates proportional to the squared multiple correlations, but adjusted so that their sum equals that of the maximum absolute correlations (Cureton, 1968).
INPUT: reads the prior communality estimates from the first observation with either _TYPE_='PRIORS' or _TYPE_='COMMUNAL' in the DATA= data set (which cannot be TYPE=DATA).
MAX: sets the prior communality estimate for each variable to its maximum absolute correlation with any other variable.
ONE: sets all prior communalities to 1.0.
RANDOM: sets the prior communality estimates to pseudo-random numbers uniformly distributed between 0 and 1.
SMC: sets the prior communality estimate for each variable to its squared multiple correlation with all other variables.
Final communalities: the proportion of the variance in each of the original variables retained after extracting the factors.
  • 71. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-719/8/2019 FA properties (SAS PROC FACTOR; a sketch follows below):
METHOD: estimation method, e.g., PRINCIPAL (yields principal components) or MAXIMUM LIKELIHOOD.
MINEIGEN: smallest eigenvalue for retaining a factor.
NFACTORS: maximum number of factors to retain.
SCREE: display the scree plot.
ROTATE: rotation method.
PRIORS: prior communality estimates.
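A minimal PROC FACTOR sketch tying these options together (the data set and variable list are assumed for illustration from the HMEQ example, not the run behind the following slides):
proc factor data = hmeq
            method   = principal  /* principal factor extraction                           */
            priors   = smc        /* squared multiple correlations as prior communalities  */
            nfactors = 4
            rotate   = varimax
            scree;
   var LOAN MORTDUE VALUE YOJ CLAGE NINQ CLNO DEBTINC;
run;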
  • 72. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-729/8/2019 Additional Factor Methods and comparisons. EFA: explores the possible underlying factor structure of a set of observed variables without imposing a preconceived structure on the outcome. Aim: identify the underlying factor structure. Confirmatory factor analysis (CFA): a statistical technique used to verify the factor structure of a set of observed variables. CFA allows one to test the hypothesis that a relationship between the observed variables and their underlying latent constructs exists. The researcher uses knowledge of theory, empirical research, or both, postulates the relationship pattern a priori and then tests the hypothesis statistically. In short, the NUMBER of factors, the type of rotation and which variables load on each factor are known. Rule of thumb: a variable's loading on its factor should be > 0.7. Confirmatory factor models (≈ linear factor models); item response models (≈ nonlinear factor models).
  • 73. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-739/8/2019
  • 74. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-749/8/2019
  • 75. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-759/8/2019
  • 76. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-769/8/2019
  • 77. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-779/8/2019
  • 78. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-789/8/2019
  • 79. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-799/8/2019
  • 80. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-809/8/2019 Varimax Rotation. VARIMAX: an orthogonal rotation that maximizes the sum of the variances of the squared loadings (squared correlations between variables and factors). This is intuitively achieved if (a) any given variable has a high loading on a single factor but near-zero loadings on the remaining factors, and (b) any given factor is constituted by only a few variables with very high loadings while the remaining variables have near-zero loadings on it. (a) and (b) ➔ the factor loading matrix is said to have “simple structure,” and varimax rotation brings the loading matrix as close to such simple structure as the data allow. Each variable can then be well described by a linear combination of only a few basis functions (Kaiser, 1958). In the next slides, compare ORIGINAL with VARIMAX for the different factors (F_).
  • 81. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-819/8/2019
  • 82. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-829/8/2019
  • 83. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-839/8/2019
  • 84. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-849/8/2019
  • 85. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-859/8/2019
  • 86. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-869/8/2019 All variables, And very messy.
  • 87. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-879/8/2019
  • 88. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-889/8/2019
  • 89. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-899/8/2019
  • 90. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-909/8/2019 From: https://www.linkedin.com/groups/4292855/4292855-6171016874768228353 Question about dimension reduction (factor analysis) for survey question set What would be the interpretation for a set of survey questions where rotation fails to converge in 25 iterations, and the non-rotated solution shows 2 clear factors with Eigenvalues above 2, but the scree plot levels out right at eigenvalue = 2 and the remaining (many) factors are quite close together? Answers: 1. Well you first want to get the model to convergence. I usually increase the # of iterations to 500. 2. I would suggest its a one factor solution and the above 1 criteria is probably not appropriate. How many items/questions were there in your survey? Answer from poster: Thank you. I was able to do this with # iteration = 500. There are 14 factors with eigenvalues above 1, accounting for a total of 66.7% of the variance. I am still unsure what the interpretation would be - I've never had a dataset before that had so many factors. Too much noise? A lot of variability in survey response? 3. I think that "14 eigenvalues make just 2/3 of the variance" is a warning. It means to my experience that there are no large eigenvalues at all and that there are just "scree" eigenvalues. This can be an effect of having too many variables (= too high dimension). In this case an "automatic" dimensional reduction will necessarily fail and a visual dimensional reduction is due.
  • 91. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-919/8/2019 It can also mean that the data cloud is more or less "spherical". This would mean that there are many columns (or rows) in the correlation matrix containing just values close to zero. One can easily "eliminate" such "circular" variables as follows a) copy the correlation matrix to an Excel sheet b) For each column calculate (sum of elements - 1) = rowsum c) sort the columns by descending rowsum d) take just the "top 20" or so variables with the largest rowsum e) do the analysis with the 20 variables and study the "scree plot" Sorry, in step b) you should also calculate the maximal column element except the "1" on the diagonal. In step d) you should also add variables with a small rowsum but a relatively large maximal correlation. 4. I agree with stated above. Just FYI, use Bartlett sphericity test to formally check low correlation issue. Try also alternate the type of rotation. 5. Without knowing what you are measuring ... I can tell you about a similar situation I experienced ... it took a high number of iteration to converge, only one eigen value above 2, and a dozen or more above 1 that made no theoretical sense. I deleted all items with little response variability, and reran it ... and it came out more clearly as a homogeneous measure (1 factor). Once accepted that I was dealing with one factor, I was able to make some edits to the items, collected more data on the revised measure, and now have a fairly tight homogeneous measure.... where I really thought there would be 5 or so factors! ➔ MESSAGE: Extremely ad-hoc solutions are typical, not necessarily recommended ➔ think before you rush in..
  • 92. Leonardo Auslender –Ch. 1 Copyright 2004 Ch. 1.1-929/8/2019
  • 93. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-939/8/2019 Introduction. Different approaches to clustering (there are other taxonomies): 1) disjoint or partitioning (k-means); 2) non-disjoint hierarchical (agglomerative); 3) density-based, grid-based, fuzzy (soft or overlapping) methods, constraint-based, or model-based clustering (EM algorithm). Marketing prefers disjoint methods in order to separate customers completely (assuming independent observations). Archaeology prefers agglomerative because two nearby clusters might emerge from a previous one in a downward tree hierarchy (e.g., fossils in evolutionary science). Agglomerative or hierarchical methods: typically bottom-up; start from individual observations and agglomerate upwards. Information on the # of clusters is not necessary, but the approach is impractical for large data sets. The end result is called a dendrogram, a tree structure. It is necessary to define a distance; different distances ➔ different methods.
  • 94. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-949/8/2019 Basic Introduction. Overlapping, fuzzy: methods that deal with data that cannot be completely separated, or that attach a probability statement to cluster membership. Grid-based: very fast; need to determine a finite grid of cells along each dimension. Constraint-based: constraints given by business strictures or applications. We won't review top-down (divisive) methods, overlapping or fuzzy methods, etc.
  • 95. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-959/8/2019 Why so many methods? If there are ‘n’ data points to be clustered into ‘k’ clusters, there are approximately k ** n / k! ways to do it ➔ brute-force methods are not adequate. For instance, for k = 3 and n = 100, the number of ways is 8.5896253455335 * 10 ** 46. For n = 1000, a computer cannot calculate it, and n = 1000 is a rather small data size at present. ➔ Heuristics are used, k-means especially. Methods typically use Euclidean distance, but correlation distance is also possible.
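Checking the arithmetic of the \( k^{n}/k! \) approximation quoted above:
\[ \frac{k^{n}}{k!} = \frac{3^{100}}{3!} = \frac{5.15 \times 10^{47}}{6} \approx 8.59 \times 10^{46}. \]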
  • 96. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-969/8/2019
  • 97. Leonardo Auslender –Ch. 1 Copyright 2004 Ch. 1.1-979/8/2019 Disjoint: K-means (MacQueen, 1967) (most used clustering method in business). Key concept underlying cluster detection: similarity, homogeneity or closeness of OBSERVATIONS. The implementation is based on similarity or dissimilarity measures of distance. Methods are typically greedy (one observation at a time). Start with a given number of requested clusters K, N data points and P variables; continuous variables only. The algorithm determines K arbitrary seeds that become the original locations of the clusters in P dimensions (there is a variety of ways to change the starting seeds). Using a Euclidean distance function, allocate each observation to the nearest cluster given by the original seeds. Re-calculate the centroids (cluster centers of gravity). Re-allocate observations based on minimal distance to the newer centroids and repeat until convergence, given by a maximum number of iterations or until the cluster boundaries remain unchanged. K-means typically converges quickly.
  • 98. Leonardo Auslender –Ch. 1 Copyright 2004 Ch. 1.1-989/8/2019 Outliers can have negative effects because the calculated centroids would be affected. If an outlier is itself chosen as an initial seed, the effect is paradoxical: the analyst may realize that the relative scarcity of observations around it indicates an outlier. If the outlier is not chosen initially, the centroid is unavoidably affected, and the distortion introduced may be such that this conclusion is difficult to reach. A further disadvantage: the method depends heavily on the initial choice of seeds ➔ it is recommended that more than one run be performed, but it is then difficult/impossible to combine results. In addition, the # of desired clusters must be specified, which in many situations is the answer the analyst wants the algorithm to provide; the # of iterations must also be given. More importantly, the search for clusters is based on Euclidean distances, which produce convex shapes. If the ‘true’ cluster is not convex, K-means cannot find that solution.
  • 99. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-999/8/2019 Number of Clusters, determination. Cubic clustering criterion (CCC): explained later with the Ward method. Elbow rule: for each ‘k’-cluster solution, compute the % of between-cluster variation over total variation, and stop at the point where increasing K no longer improves the ratio significantly (can also be used for the WARD method later on). The elbow point sometimes cannot be fully distinguished.
  • 100. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1009/8/2019 Alternatives: K-medoids replaces the mean by actual data points. In this sense it is more robust to outliers but inefficient for large data sets (Rousseeuw and Kaufman, 1987). Resulting clusters are disjoint: merging two clusters does not lead to a combined overall super-cluster. Since the method is non-hierarchical, it is impossible to determine closeness among clusters. In addition to the closeness issue, it is possible that some observations belong in more than one cluster, and thus it would be important to report a measure of the probability of belonging to a cluster. Originally created for continuous variables; Huang (1998), among others, extended the algorithm to nominal variables. Next: cluster graphs derived from canonical discriminant analysis.
  • 101. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1019/8/2019
  • 102. Leonardo Auslender –Ch. 1 Copyright 2004 Ch. 1.1-1029/8/2019 Fraud data set. Training clustering solution: cluster means of (standardized) variables.

Cluster  # obs  Doctor visits  Fraud (y/n)  Membership duration  No. of claims  No. of opticals  Total optical spend
1          503          1.81        -0.42                 0.48          -0.18            -0.27               -0.011
2         1452         -0.40        -0.50                -0.56          -0.23            -0.12               -0.18
3          203          0.17        -0.18                 0.32          -0.21            -0.14                2.79
4          165         -0.01         1.23                 0.16           3.77             0.12               -0.17
5          491         -0.20         2.00                -0.43           0.01             0.04               -0.41
6          150         -0.18         0.57                -0.46          -0.08             3.57                0.45
7          662         -0.35        -0.49                 1.19          -0.25            -0.27               -0.14

Is this great? NO: in the VALIDATION clustering solution, cluster 1 has 2,334 observations with all variable means approximately 0. Look at the validation, but Fraud is a difficult variable to work with.
  • 103. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1039/8/2019 Hmeq K-means: 3 clusters selected (ABC method, Aligned Box Criterion, SAS).
  • 104. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1049/8/2019 Rescaled variable means by cluster (statistical inference, Parametric or otherwise, necessary to create profiles).
  • 105. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1059/8/2019
  • 106. Leonardo Auslender –Ch. 1 Copyright 2004 Ch. 1.1-1069/8/2019
ods output ABCResults = abcoutput;
proc hpclus data = training maxclusters = 8 maxiter = 100 seed = 54321
            NOC = ABC (B = 1 minclusters = 3 align = PCA);
   input DOCTOR_VISITS /* FRAUD */ MEMBER_DURATION NO_CLAIMS OPTOM_PRESC TOTAL_SPEND;
   /* FRAUD omitted because it is binary */
run;

proc sql noprint;
   select k into :abc_k from abcoutput;
quit;

proc fastclus data = training out = outdata maxiter = 100 converge = 0
              replace = random radius = 10 maxclusters = 7
              outseed = clusterseeds summary;
   var DOCTOR_VISITS /* FRAUD */ MEMBER_DURATION NO_CLAIMS OPTOM_PRESC TOTAL_SPEND;
run;

/* VALIDATION STEP, NOTICE validation AND clusterseeds. */
proc fastclus data = validation out = outdata_val maxiter = 100 seed = clusterseeds
              converge = 0 radius = 100 maxclusters = 7 outseed = outseed summary;
   var DOCTOR_VISITS /* FRAUD */ MEMBER_DURATION NO_CLAIMS OPTOM_PRESC TOTAL_SPEND;
run;
  • 107. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1079/8/2019
  • 108. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1089/8/2019 k-means assumes: 1) the distribution of each attribute (variable) is spherical, i.e., E(xx′) = σ²·I; 2) all variables have the same variance; 3) the prior probability of all k clusters is the same, i.e., each cluster has roughly an equal number of observations. These assumptions are almost never verified; what happens when they are violated? Plus, it is difficult if not impossible to ascertain the best results. Examples in two dimensions, X and Y, next.
  • 109. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1099/8/2019 Non-spherical data.
  • 110. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1109/8/2019 K-means solutions. X: centroids of found clusters.
  • 111. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1119/8/2019 Instead, single linkage hierarchical clustering solution.
  • 112. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1129/8/2019 Additional problems: differently sized clusters. ➔ NO FREE LUNCH (NFL) (Wolpert, Macready, 1997): “We have dubbed the associated results NFL theorems because they demonstrate that if an algorithm performs well on a certain class of problems then it necessarily pays for that with degraded performance on the set of all remaining problems”. ➔ CANNOT USE THE SAME MOUSETRAP ALL THE TIME. ➔ Hint: verify assumptions, they ARE IMPORTANT.
  • 113. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1139/8/2019 Interview Question: Aliens from far away prepare to invade Earth. They need to find out whether intelligent creatures live here and plan to launch 1000 probes to random locations for that purpose. Unknown to them, oceans cover 71% of the Earth, and each probe sends back information about its landing site and surroundings. Assume that just (some) humans are intelligent. The alien data scientist decides to use k-means on the data. Discuss how he/she would conclude whether there is intelligent life on Earth (no sarcastic answers allowed).
  • 114. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1149/8/2019 Agglomerative (hierarchical) clustering methods: Single linkage Centroid, Average Linkage and Ward.
  • 115. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1159/8/2019 Agglomerative Clustering (standard in the bio sciences). In single-linkage clustering (a.k.a. nearest neighbor, or neighbor-joining tree in genetics), the distance between two clusters is determined by a single element pair, namely the two elements (one in each cluster) that are closest to each other, and later compounds are defined by minimum distance (see example below). The shortest of the links that remains at any step causes the fusion of the two clusters whose elements are involved. The method is also known as nearest-neighbor clustering: the distance between two clusters is the shortest possible distance among members of the clusters, or “best of friends”. The result of the clustering can be visualized as a dendrogram: the sequence of cluster fusions and the distance at which each fusion took place. The distance or linkage is given by
\[ D(X, Y) = \min_{x \in X,\; y \in Y} d(x, y), \]
where X and Y are clusters and d is the distance between elements x and y.
  • 116. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1169/8/2019 In the centroid method (commonly used in biology), the distance between clusters “l” and “r” is given by the Euclidean distance between their centroids. The centroid method is more robust than the other linkage methods presented here, but has the drawback of inversions (clusters do not necessarily become more dissimilar as we keep on linking up). In complete linkage (a.k.a. furthest neighbor), the distance between two clusters is the longest possible distance between the groups, or the “worst among the friends”. In the average linkage method, the distance is the average distance between each pair of observations, one from each cluster; the method tends to join clusters with small variances. Ward's minimum variance method assumes that the data set is derived from a multivariate normal mixture and that the clusters have equal covariance matrices and sampling probabilities. It tends to produce clusters with roughly the same number of observations and is based on the notion of the information loss suffered when joining two clusters; the loss is quantified by an ANOVA-like error sums of squares criterion.
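A minimal SAS sketch of these agglomerative methods (METHOD= can be single, complete, average, centroid or ward; the data set and variables are assumed from the fraud example; PROC TREE then cuts the dendrogram at a chosen number of clusters):
proc cluster data = training method = ward std ccc outtree = tree_out;
   var doctor_visits member_duration no_claims optom_presc total_spend;
run;

proc tree data = tree_out nclusters = 3 out = cluster_members noprint;
run;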
  • 117. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1179/8/2019 Example of complete linkage: assume 5 observations with Euclidean distances given by:

      1    2    3    4    5
 1    0
 2    9    0
 3    3    7    0
 4    6    5    9    0
 5   11   10    2    8    0

Cluster the closest observations, 3 and 5 (as “35”), where the distance between 1 and 35 is given by the maximum of the distances (1–3, 1–5):

       35    1    2    4
 35     0
 1     11    0
 2     10    9    0
 4      9    6    5    0

After four merges, all observations are in a single cluster. Dendrograms (with distance as height on the Y axis) show the agglomeration.
  • 118. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1189/8/2019 [Dendrograms: complete linkage and single linkage.]
  • 119. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1199/8/2019 How many clusters with agglomerative methods? Cut the previous dendrogram with a horizontal line at a specific height; there is no prevailing method, however. E.g., 2 clusters. Next: cluster solutions comparison (skipped: bar charts of variable means).
  • 120. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1209/8/2019 Cubic clustering criterion (CCC) (Sarle, 1983). Assume the 3-cluster solution on the left and a reference distribution (right). Reference distribution: a hyper-cube of uniformly distributed random points aligned with the principal components. A heuristic formula calculates the error of distance-based methods for k = 1 up to the top # of clusters; CCC compares the observed error at each k against the error expected under the reference distribution, and the largest CCC ➔ the desired k. It fails when variables are highly correlated. The ABC method improves on CCC because it simulates multiple reference distributions, instead of the single heuristic reference used by CCC.
  • 121. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1219/8/2019 Example, and Putting all Methods together.
  • 122. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1229/8/2019 Fraud comparison of canonical discr. Vectors.
  • 123. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1239/8/2019 Notice the missing ('.') cluster allocation.
  • 124. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1249/8/2019
  • 125. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1259/8/2019
  • 126. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1269/8/2019
  • 127. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1279/8/2019 Previous slides: HPCLUS (SAS proprietary cluster solution). Similar solutions between HPCLUS and k-means, very different from the others. How to compare? Full disagreement. Since there is no initial (true) cluster membership, there is no basis to obtain error rates. There are many proposed measures, such as the silhouette coefficient, Adjusted Rand Index, etc. Final issues: the number of clusters could be different across methods; the number of predictors, i.e., predictor selection, could also be different across methods.
  • 128. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1289/8/2019 Methods for determining the number of clusters. An ideal solution should minimize the within-cluster variation (WCV) and maximize the between-cluster variation (BCV). But WCV decreases and BCV increases with an increasing number of clusters. Compromises: the CH index (Calinski and Harabasz, 1974),
\[ CH(K) = \frac{BCV/(K-1)}{WCV/(n-K)}, \]
which is undefined for K = 1, i.e., the no-cluster case. GAP statistic (Tibshirani et al., 2001): WCV ↓ as K ↑; evaluate the rate of decrease against uniformly distributed points. Milligan and Cooper (1985) compared many methods, and up until 1985 CH was best.
  • 129. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1299/8/2019
  • 130. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1309/8/2019 Applications 1) Marketing Segmentation and customer profiling, even when supervised methods could be used. 2) Content based Recommender systems. E.g.: recommend based on movie categories preferred. E.g., cluster movie database and recommend within clusters. 3) Establish hierarchy or evolutionary path of fossils in archaeology/prehistory.
  • 131. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1319/8/2019 Especially in marketing.
  • 132. Leonardo Auslender –Ch. 1 Copyright 2004 Ch. 1.1-1329/8/2019 Clustering and Segmentation in Marketing (easily Extrapolated to other applications). Definition: Segmentation: “viewing a heterogeneous market as a number of smaller homogeneous markets” (Wendell Smith, 1956). Bad practices. 1) Segmentation is descriptive, not predictive. However, business decisions made with eye to future (i.e., predictive). Business decisions based on segmentation are subjective and inappropriate for decision making, because segmentation only shows present strengths and weaknesses of brand (in marketing research), but doesn’t give and cannot give indications as to how to proceed. 2) CRM ISSUE: Segmentation assumes segment homogeneity, which contradicts basic CRM tenet of customer segments of 1.
  • 133. Leonardo Auslender –Ch. 1 Copyright 2004 Ch. 1.1-1339/8/2019 Clustering and Segmentation in Marketing 3) Competitors' information and reactions are usually ignored at the segment level. When Coca-Cola analyzed the introduction of a sweeter drink, it focused only on Coca-Cola drinkers, forgetting customers' perception of the Coca-Cola image. About AT&T, just look at where AT&T is after 2000, after the big mid-90s marketing failure based on segmentation, among other horrors. 4) Segmentation always excludes significant numbers of real prospects and conversely includes significant numbers of non-prospects. In the typical marketing situation, the best and worst customers are easy to find, and the ones in between are not easily classifiable. But segmentation imposes such a classification, and users do not remind themselves enough of the classification issues behind it.
  • 134. Leonardo Auslender –Ch. 1 Copyright 2004 Ch. 1.1-1349/8/2019 Clustering and Segmentation in Marketing Really unfortunate bad practices. 1) Humans categorize information to make it into comprehensible concepts. Thus, segments are typically labeled, and labels become “the” segments, regardless of segment accuracy, construction or stability of content, or changing market conditions. Worse yet, could well be that segments do not properly exist but that data derived clusters merely reflect normal variation (e.g., human evolution studies area of conflict in this). 2) Segments thus constructed cannot foretell changing market conditions, except once they have already taken place. Thus, you either gained, lost or kept customers. No amount of labeling, re- labeling or label tweaking can be basis of successful operation in market place, since segments cannot predict behavior.
  • 135. Leonardo Auslender –Ch. 1 Copyright 2004 Ch. 1.1-1359/8/2019 Clustering and Segmentation in Marketing. Really unfortunate bad practices. 3) Segments also derived from attitudinal data. Attitudes of customer base usually measured by way of survey opinions and/or focus groups. Derived information (psychographics) not easy to merge with created clusters from operational and demographic information. 4) Immediate temptation is to view whether segments derived from two very different sources have any affinity. This implies that it is necessary to ‘score’ customer base with psycho-graphically derived segments, in order to merge results. Accuracy of classification for this application has been traditionally very low. 5) Better practice: encourage usage of original clusters based on operational and demographic data as basis for obtaining psycho- graphic information.
  • 136. Leonardo Auslender –Ch. 1 Copyright 2004 Ch. 1.1-1369/8/2019 Clustering and Segmentation in Marketing 6) Finally, all models are based on available data. If the aim is to segment the entire US population, and one feature is NY Times readership (because that is the only subscription list available), it is useful mostly in the Northeast, but probably not so much in Kansas. In fact, it produces a geographically based clustering, which may be an undesirable or unrecognized effect. Good practice: • It can be a systematic way to enhance marketing creativity, if possible. Patting yourself on the back ➔
  • 137. Leonardo Auslender –Ch. 1 Copyright 2004 Ch. 1.1-1379/8/2019 Important note on how to work: Confirmatory Bias. Psychologists call ‘confirmatory bias’ the tendency to try to prove a new idea correct instead of searching to prove the new ideas wrong. This is a strong impediment to understanding randomness. Bacon (1620) wrote: “the human understanding, once it has adopted an opinion, collects any instances that confirm it, and though the contrary instances may be more numerous and more weighty, it either does not notice them or else rejects them, in order that this opinion will remain unshaken.” Thus, we confirm our stereotypes about minorities, for instance, by focusing on events that prove our prior beliefs and dismiss opposing ones. This is a serious contradiction to the ability of experts to judge in an unbiased fashion. Thus many times, we see what we want to see. Instead, per Doyle’s Sherlock Holmes: “One should always look for a possible alternative, and provide against it.” (to prove your point, The Adventure of Black Peter).
  • 138. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1389/8/2019
  • 139. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1399/8/2019
1. Cluster Analysis using the Jaccard Matching Coefficient
2. Latent Class Analysis
3. CHAID analysis (class of tree methods, requires a target variable)
4. Mutual Information and/or Entropy
5. Multiple Correspondence Analysis (MCA)
Not reviewed in this class.
  • 140. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1409/8/2019 Some un-reviewed methods:
Mean shift clustering: non-parametric mode-seeking algorithm
Density-based spatial clustering of applications with noise (DBSCAN)
BIRCH: balanced iterative reducing and clustering using hierarchies
Gaussian mixtures
Power iteration clustering (PIC)
Latent Dirichlet allocation (LDA)
Bisecting k-means
Streaming k-means
Spectral clustering
  • 141. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1419/8/2019
  • 142. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1429/8/2019 Many analytical methods require complete observations: if any feature/predictor/variable has a missing value, the entire observation is not used in the analysis. For instance, regression analysis requires complete observations. Failing to verify the completeness of a data set can lead to serious error, especially if we rely on UEDA notions of missingness. For instance, the table below shows a simulation in which, for a given number of variables ‘p’ (100, 350, 600, 850) and number of observations ‘n’ (1000, 1500, 2000), each variable has a probability of 0.01 or 0.11 of being missing. A priori these probabilities seem too low to cause much harm. The table shows, however, that for a modest p = 100 the resulting data sets have at least 60.93% of observations with missing values, and when p reaches 350 almost all observations have missing values. When univariate missingness is 11%, all observations have missing values.
  • 143. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1439/8/2019 Missing value analysis: # and % of observations with at least one missing value.

                               n = 1000           n = 1500           n = 2000
Prob. missing  p (# vars)    # obs      %       # obs      %       # obs      %
0.01              100          619    61.90       914    60.93      1224    61.20
0.01              350          960    96.00      1449    96.60      1936    96.80
0.01              600          998    99.80      1496    99.73      1995    99.75
0.01              850         1000   100.00      1500   100.00      2000   100.00
0.11              100         1000   100.00      1500   100.00      2000   100.00
0.11              350         1000   100.00      1500   100.00      2000   100.00
0.11              600         1000   100.00      1500   100.00      2000   100.00
0.11              850         1000   100.00      1500   100.00      2000   100.00
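These simulated percentages agree with the listwise-complete probability calculation (q = per-variable missingness, assumed independent across variables):
\[ P(\text{at least one missing value in an observation}) = 1 - (1-q)^{p}, \]
so for q = 0.01: \( 1 - 0.99^{100} \approx 0.63 \), \( 1 - 0.99^{350} \approx 0.97 \), \( 1 - 0.99^{600} \approx 0.998 \); and for q = 0.11, p = 100: \( 1 - 0.89^{100} \approx 1 \).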
  • 144. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1449/8/2019 A more subtle complication arises when missingness is not at random, unlike in the table above. That is, assume that missingness in a variable of importance is related to the information itself; for example, reported income is likely to be missing for high earners. In this case, a study of occupation by income in which observations with missing values are skipped would provide a very distorted picture, because high-earners' occupations would be underrepresented. In other cases, databases are created by merging different sources that were partially matched by some key indicator that could be unreliable (e.g., customer number) ➔ data-collection missingness.
  • 145. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1459/8/2019 Missing values taxonomy (Little and Rubin, 2002) We looked at the importance of treatment of missing values in a dataset. Now, let’s identify the reasons for occurrence of these missing values. They may occur at two stages: Data Extraction: It is possible that there are problems with extraction process. In such cases, we should double-check for correct data with data guardians. Some hashing procedures can also be used to make sure data extraction is correct. Errors at data extraction stage are typically easy to find and can be corrected easily as well. Data collection: These errors occur at time of data collection and are harder to correct. They can be categorized in three types: Missing completely at random (MCAR): This is a case when the probability of missing variable is the same for all observations. For example: respondents of a data collection process decide that they will declare their earnings or weights after tossing a fair coin. In this case, each observation has an equal chance of containing a missing value.
  • 146. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1469/8/2019 Missing at random (MAR): This is a case when a variable is missing at random regardless of the underlying value but probably induced by a conditioning variable. For example: age is typically missing in higher proportions for females than for males regardless of the underlying age of the individual. Thus, missingness is related only to the observed data. Missing not at random (MNAR): the case of missing income above, that is, a variable is missing due to its underlying value. It also involves missingness that depends on unobserved predictors.
  • 147. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1479/8/2019 Solving missingness in databases. Case deletion: it is of two types, list-wise deletion and pair-wise deletion, used in the MCAR case because otherwise a biased sample might result. List-wise deletion removes all observations in which at least one missing value occurs. As we saw above, the resulting sample size could be seriously diminished. Due to this disadvantage, pair-wise deletion analyzes all cases in which the variables of interest are complete. Thus, if the interest centers on correlating variables A and B, the analysis proceeds on those observations with non-missing values of A and B, regardless of missingness in other variables. If the study centers on different pairs of variables, different sample sizes may result.
  • 148. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1489/8/2019 Mean / Mode / Median Imputation: imputation is a method to fill in missing values with estimated ones. The objective is to employ known relationships that can be identified in the valid values of the data set to assist in estimating the missing values. Mean / mode / median imputation is one of the most frequently used methods: it replaces the missing data for a given attribute by the mean or median (quantitative attribute) or mode (qualitative attribute) of all known values of that variable. It can be of two types (see the sketch below). Generalized imputation: calculate the mean or median over all complete values of the variable and impute the missing values with it. Conditional imputation: if missingness is known to differ by a third variable (or variables), obtain the mean/median/mode within the different values of that variable and impute. Thus, in the case of missing age, obtain statistics separately for males and females, and impute.
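A minimal sketch of generalized and conditional mean imputation in SAS (the data set, variables and BY variable are hypothetical; PROC STDIZE with REPONLY replaces only the missing values and does not standardize):
/* Generalized imputation: replace missings with the overall mean. */
proc stdize data = have out = have_imp method = mean reponly;
   var age income;
run;

/* Conditional imputation: means within each level of a grouping variable. */
proc sort data = have out = have_sorted; by gender; run;
proc stdize data = have_sorted out = have_imp_cond method = mean reponly;
   by gender;
   var age income;
run;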
  • 149. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1499/8/2019 Prediction Model: a prediction model estimates the values that will substitute the missing data. Divide the data set into two sets: one with no missing values for the variable in question and another with missing values. The first data set becomes the training data set of the model, the second (with missing values) is the test data set, and the variable with missing values is treated as the target variable. Next, create a model to predict the target variable based on the other attributes of the training data set and populate the missing values of the test data set. Regression, ANOVA, logistic regression and various other modeling techniques can be used. Two drawbacks: ▪ Model-estimated values are usually better behaved than the true values, i.e., they have smaller variance. ▪ If there is no relationship between the other attributes in the data set and the attribute with missing values, the model will not be precise for estimating the missing values.
  • 150. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1509/8/2019 A more general drawback. More likely that some information is MAR (lost data) while other is MNAR (high-earners income). It is very difficult or impossible to identify. In practice, most methods tend toward MAR or MCAR.
  • 151. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1519/8/2019 KNN Imputation: missing values of an attribute are imputed using a given number of observations (nearest neighbors) that are most similar to the observation whose value is missing; similarity is determined using a distance function. Known advantages and disadvantages: Advantages: k-nearest neighbors can predict both qualitative and quantitative attributes; creation of a predictive model for each attribute with missing data is not required; attributes with multiple missing values can easily be treated; the correlation structure of the data is taken into consideration. Disadvantages: the KNN algorithm is very time-consuming on large databases, since it searches through the whole dataset looking for the most similar instances. The choice of the k value is critical: a higher value of k would include neighbors that are significantly different from what we need, whereas a lower value of k implies missing out on significant neighbors.
  • 152. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1529/8/2019 Dummy coding: while not a solution, it recognizes the existence of the missing value. For each variable with a missing value anywhere in the database, we create a dummy variable with value 0 when the corresponding value is not missing and 1 when it is missing. The disadvantage is that the number of predictors can increase significantly. In some contexts, researchers drop the variables with missings and work with the dummies instead. In most applied work, it is assumed that missingness is not of the MNAR type. In all cases of imputation, note that the imputed values may shrink the variance of the individual variables; thus it is appropriate to ‘contaminate’ these estimates with a random component, for instance a normally distributed random error for a continuous variable (see the sketch below). ISSUE: if the data will also be transformed, decide whether to transform and then impute, or instead to impute the raw data and then work with the transformations.
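A minimal data-step sketch of the missing-value dummies plus a simple imputation with the random ‘contamination’ term suggested above (the data set, variables, and the mean and standard deviation used are hypothetical placeholders):
data have_flagged;
   set have;                               /* 'have' is a hypothetical input data set   */
   if _n_ = 1 then call streaminit(12345); /* reproducible random contamination         */
   m_income = missing(income);             /* 0/1 dummy: 1 when income is missing       */
   m_age    = missing(age);
   /* mean imputation plus a normal perturbation to limit variance shrinkage;           */
   /* 50000 and 12000 are placeholders for the observed mean and std of income          */
   if missing(income) then income = 50000 + 12000*rand('normal');
run;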
  • 153. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1539/8/2019
  • 154. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1549/8/2019 Modeling method based on trees (see the 4th lecture and come back to this page). Assume 3 variables have missing values, call them miss1, miss2 and miss3, and denote by pct1, pct2, pct3 the % of missing observations for each, with pct1 < pct2 < pct3. Create trees and impute the ‘miss’ variables one at a time in descending order of missingness, using each ‘miss’ variable as the dependent variable in a tree run. It is also possible to use gradient boosting or random forests instead of trees as the modeling tool.
  • 155. Leonardo Auslender –Ch. 1 Copyright 2004 Example: dataset ‘XXXX’ with 8,166 obs. Notice the change in means and std (mean) after imputation.

Variable  Model name                  # nonmiss obs  % missing     Mean   Std of Mean     Mode
Var_1     M1_INTERVAL_VARS                    1,430      82.49    64.681        3.545     3.990
Var_2     M1_INTERVAL_VARS                    8,166       0.00   136.898        2.329    49.990
Var_3     M1_INTERVAL_VARS                    5,014      38.60   520.282        9.850    49.990
Var_3     M1_IMPUTED_INTERVAL_VARS            8,166       0.00   447.642        6.156   273.053
Var_4     M1_INTERVAL_VARS                    3,741      54.19     0.261        0.029     0.000
Var_4     M1_IMPUTED_INTERVAL_VARS            8,166       0.00     0.135        0.014     0.000
Var_5     M1_INTERVAL_VARS                    4,358      46.63     0.207        0.022     0.000
Var_5     M1_IMPUTED_INTERVAL_VARS            8,166       0.00     0.122        0.012     0.000
Var_6     M1_INTERVAL_VARS                    7,344      10.07    11.770        0.849     0.000
Var_6     M1_IMPUTED_INTERVAL_VARS            8,166       0.00    10.823        0.765     0.000
Var_7     M1_INTERVAL_VARS                    8,164       0.02    12.207        0.246     5.000
Var_7     M1_IMPUTED_INTERVAL_VARS            8,166       0.00    12.207        0.246     5.000
  • 156. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1569/8/2019 Distr of imputed Variables same as that of original variables.
  • 157. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1579/8/2019 Simulating missingness at random in the Fraud data set. Relatively low missingness; imputed variables have similar statistics to the original variables. Basics and measures of centrality:

Variable         Model name                          # nonmiss obs  % missing        Mean      Median        Mode
DOCTOR_VISITS    M1_TRN_INTERVAL_VARS                        5,960       0.00       8.941       8.000       9.000
MEMBER_DURATION  M1_TRN_INTERVAL_VARS                        5,851       1.83     179.757     178.000     180.000
MEMBER_DURATION  M1_TRN_IMPUTED_TRN_INTERVAL_VARS            5,960       0.00     179.680     178.000     180.000
NO_CLAIMS        M1_TRN_INTERVAL_VARS                        5,875       1.43       0.404       0.000       0.000
NO_CLAIMS        M1_TRN_IMPUTED_TRN_INTERVAL_VARS            5,960       0.00       0.404       0.000       0.000
NUM_MEMBERS      M1_TRN_INTERVAL_VARS                        5,875       1.43       1.986       2.000       1.000
NUM_MEMBERS      M1_TRN_IMPUTED_TRN_INTERVAL_VARS            5,960       0.00       1.985       2.000       1.000
OPTOM_PRESC      M1_TRN_INTERVAL_VARS                        5,960       0.00       1.170       1.000       0.000
TOTAL_SPEND      M1_TRN_INTERVAL_VARS                        5,960       0.00  18,607.970  16,300.000  15,000.000
  • 158. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1589/8/2019 Measures of dispersion:

Variable         Model name                               Variance    Std Deviation  Std of Mean  Median Abs Dev  Nrmlzd MAD
DOCTOR_VISITS    M1_TRN_INTERVAL_VARS                         52.31           7.23         0.09           5.00         7.41
MEMBER_DURATION  M1_TRN_INTERVAL_VARS                      6,782.32          82.35         1.08          57.00        84.51
MEMBER_DURATION  M1_TRN_IMPUTED_TRN_INTERVAL_VARS          6,674.76          81.70         1.06          56.00        83.03
NO_CLAIMS        M1_TRN_INTERVAL_VARS                          1.16           1.08         0.01           0.00         0.00
NO_CLAIMS        M1_TRN_IMPUTED_TRN_INTERVAL_VARS              1.15           1.07         0.01           0.00         0.00
NUM_MEMBERS      M1_TRN_INTERVAL_VARS                          1.00           1.00         0.01           1.00         1.48
NUM_MEMBERS      M1_TRN_IMPUTED_TRN_INTERVAL_VARS              0.98           0.99         0.01           1.00         1.48
OPTOM_PRESC      M1_TRN_INTERVAL_VARS                          2.74           1.65         0.02           1.00         1.48
TOTAL_SPEND      M1_TRN_INTERVAL_VARS                125,607,617.29      11,207.48       145.17       6,000.00     8,895.60
  • 159. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1599/8/2019
  • 160. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1609/8/2019 Outliers and Variable Transformations. Outlier and variable transformation analysis are sometimes included as part of EDA. Since both topics must be understood in the context of modeling a dependent or target variable, we only state some general issues. It is wrongly asserted that the analyst should verify the existence of outliers and then blindly remove them or impute them to more ‘accommodating’ values without reference to the problem at hand. E.g., it is sometimes argued that a registered data point for a man's height of 8 feet must be wrong or an outlier, except that there is historical evidence for that occurrence in antiquity. In present times, there is a tendency to disregard income levels above, say, $50 million when the mean value in the sample is probably $50,000. However, extreme values are real, and probably most interesting. Conversely, an age of 300 years or more is quite suspicious, unless we are referring to a mummy. In cases when data points can and should be disputed with reference to the model at hand, outliers can then be treated as if they were missing values, most likely of the MNAR kind. Thus, mean, median or mode imputation should not be considered the immediate solution.
  • 161. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1619/8/2019 If we view the data bi-variately, data points that otherwise would not be considered to be outliers, could be bi-variate outliers. For instance, weight of 400 pounds and 3 years of age are possible when considered univariately, but highly suspicious in one individual. In the area of variable transformations, we already saw the convenience of standardizing all variables in the case of principal components and/or clustering. There are other cases when the analyst implements single variable transformations, such as taking the log, which lowers the variance of the variable in question. Again, it is important to reiterate that most information is not of uni- variate but of multi-variate importance. Further, there is no magic in trying to obtain univariate or multi-variate normality, as it may be thought for the case of inference, since inference does not require that variable information be normally distributed.
  • 162. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1629/8/2019 Sources of outlierness:
• Data entry or data processing errors, such as errors in programming when merging data from many different sources.
• Measurement error, due to a faulty measuring device, for instance.
• Experimental error. For example: in a 100m sprint of 7 runners, one runner missed the ‘Go’ call and started late, so his run time was longer than the other runners'; his total run time can be an outlier.
• Intentional misreporting: adverse effects of pharmaceuticals are well known to be under-reported.
• Sampling error, when information unrelated to the study at hand is included. For instance, male individuals included in a study on pregnancy.
  • 163. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1639/8/2019 Effects of Outliers. Outliers may distort the analysis, as is well known for linear and logistic regression. They also distort inference since, by their very nature, they affect mean calculations, which are the focus of inference in many instances. We will review outlier detection while reviewing modeling methods. There are 'robust' modeling methods that are (more) impervious to outliers, such as robust regression. Tree-based methods and their derivatives are also largely insensitive to outliers in the predictors. A small comparison of ordinary and robust regression under contamination is sketched below.
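As an illustration of the distortion, the sketch below contaminates a simulated regression with a few gross outliers in the target and compares ordinary least squares with a robust (Huber) fit. It assumes scikit-learn is available, and the data are simulated for illustration, not taken from the deck's examples.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, HuberRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 1.0 + 2.0 * X.ravel() + rng.normal(0, 1, size=200)  # true intercept 1, slope 2
y[:10] += 100.0                                         # ten grossly contaminated responses

ols = LinearRegression().fit(X, y)
rob = HuberRegressor().fit(X, y)        # down-weights observations with large residuals

print("OLS   intercept, slope:", ols.intercept_, ols.coef_[0])  # pulled away by the outliers
print("Huber intercept, slope:", rob.intercept_, rob.coef_[0])  # typically close to (1, 2)
```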
  • 164. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1649/8/2019 Variable transformations and MEDA. Raw data sets can have variables or features that are not directly useful for analysis. For instance, age coded as "infant", "teen", etc. does not convey the underlying ordering, which is more easily expressed as an ordered sequence of numbers. In a different example, if we are studying volumes, the variables that may affect them might need to be raised to the third power to correspond to the cubic nature of volume. In short, variable transformation belongs in the MEDA realm because we are interested in how the transformed variable relates to the others. The topic is also called variable (or feature) engineering, because different disciplines attach their own jargon to the same idea. Both examples are sketched below.
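Both examples from this slide can be expressed in a few lines: mapping ordered age labels to integer codes that respect the ordering, and cubing a length-type predictor for a volume-type response. A minimal sketch; the column names and values are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"age_group": ["infant", "teen", "adult", "teen"],
                   "edge_cm":   [2.0, 3.5, 1.0, 4.0]})

# Ordered categories -> integer codes that preserve the ordering.
order = ["infant", "teen", "adult"]
df["age_group_ord"] = pd.Categorical(df["age_group"], categories=order,
                                     ordered=True).codes

# If the response behaves like a volume, a length-type predictor may enter cubed.
df["edge_cm_cubed"] = df["edge_cm"] ** 3
print(df)
```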
  • 165. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1659/8/2019 Some prevailing practices in variable transformation (see previous sections for more detail); a short sketch follows the list.
1) Standardization (as we saw in clustering and principal components): it does not alter the shape of the variable's distribution but rescales it to mean 0 and variance 1.
2) Linearization via logs: if the underlying model is deemed to be multiplicative, the log transformation turns it into an additive model. Likewise for skewed distributions, as in the case of count variables. And sometimes the required transformation is the square, cube or square root of the original variable.
3) Binning: usually done by cutting the range of a continuous variable into sub-ranges that are deemed to be uniform or more representative, as in the age example mentioned previously.
4) Dummying: typically used with a categorical variable, such as one denoting color. Some modeling methods, such as regression-based methods, require that a categorical variable with k classes be represented by (k - 1) dummy (0/1) variables. Tree methods do not require this construction.
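The four practices can be sketched as follows in Python (pandas plus scikit-learn); the toy frame, column names and bin edges are illustrative and not part of the deck's data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "claims": [0, 1, 1, 4, 30],                          # skewed count variable
    "age":    [2, 15, 34, 57, 80],
    "color":  ["red", "blue", "red", "green", "blue"],   # categorical, k = 3 classes
})

# 1) Standardization: mean 0, variance 1; the shape of the distribution is unchanged.
df["claims_std"] = StandardScaler().fit_transform(df[["claims"]]).ravel()

# 2) Log transform for skewed counts; log1p handles the zeros.
df["claims_log"] = np.log1p(df["claims"])

# 3) Binning a continuous variable into ordered sub-ranges.
df["age_bin"] = pd.cut(df["age"], bins=[0, 12, 19, 64, 120],
                       labels=["child", "teen", "adult", "senior"])

# 4) Dummying: k - 1 indicator (0/1) columns for a k-class categorical variable.
df = pd.concat([df, pd.get_dummies(df["color"], prefix="color", drop_first=True)],
               axis=1)
print(df)
```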
  • 166. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1669/8/2019
  • 167. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1679/8/2019
1. A baseball bat and a ball cost $1.10 together, and the bat costs $1 more than the ball. What is the cost of the ball?
2. In a 2007 blog post, David Munger (as cited by Adrian Paenza of Pagina 12, 2012/12/21) proposes the following question: without thinking for more than a second, choose a number between 1 and 20, ask friends to do the same, and tally all the answers. What does the distribution of answers look like?
3. Three friends go out to dinner and the bill is $30. They each contribute $10, but the waiter brings back $5 in $1 bills because there was an error and the tab was just $25. They each take $1 and give $2 to the waiter. But then one of them says that they each paid $9, for a total of $27, plus the $2 tip to the waiter, which adds up to $29 and not to the $30 that they originally paid. Where is the missing dollar?
4. Explain the relevance of the central limit theorem to a class of freshmen in the social sciences who barely have any knowledge of statistics.
5. What can you say about statistical inference when the sample is the whole population?
6. What is the number of barber shops in NYC? (posed by Roberto Lopez of Bed, Bath & Beyond, 2017).
  • 168. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1689/8/2019
7) In a very tiny city there are two cab companies, the Yellows (with 15 cars) and the Blacks (with 75 cars). There was an accident during a drizzly night, and all cabs were on the streets. A witness testifies that a yellow cab was guilty of the accident. The police check his eyesight by showing him pictures of yellow and black cabs, and he identifies them correctly 80% of the time; that is, in one case out of five he confuses a yellow cab with a black cab or vice versa. Knowing what we know so far, is it more likely that the cab involved in the accident was yellow or black? The immediate, unconditional answer (i.e., based only on the direct evidence shown) is that there is an 80% probability that the cab was yellow. State your reasoning, if any.
8) Can a random variable have an infinite mean and/or variance?
9) State succinctly the differences between PCA, FA and clustering.
  • 169. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1699/8/2019 10) In WWII, bomber runs over Germany suffered many losses. As a statistician for the government, your task is to recommend improvements in aircraft armor, defense, strategic formation, etc. The available data set consists of the damage observed on the returning aircraft: anti-aircraft damage, fighter-plane damage, number of planes in formation, etc. Present your line of reasoning and then state your recommendation.
  • 170. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1709/8/2019 References
Bacon F. (1620): Novum Organum, XLVI.
Calinski T., Harabasz J. (1974): A dendrite method for cluster analysis, Communications in Statistics.
Huang Z. (1998): Extensions to the k-means algorithm for clustering large data sets with categorical values, Data Mining and Knowledge Discovery, 2.
Johnstone I. (2001): On the distribution of the largest eigenvalue in principal components analysis, The Annals of Statistics, vol. 29, no. 2.
MacQueen J. (1967): Some methods for classification and analysis of multivariate observations. In Proc. of the Fifth Berkeley Symp. on Math. Statistics and Probability, Vol. 1: Statistics, 281-297. Berkeley, Calif.: Univ. of California Press.
Milligan G., Cooper M. (1985): An examination of procedures for determining the number of clusters in a data set, Psychometrika, 159-179.
Sarle W. (1983): Cubic Clustering Criterion, SAS Press.
  • 171. Leonardo Auslender –Ch. 1 Copyright 2004 1.1-1719/8/2019 References (cont. 1)
Song J., Shin S. (2018): Stability approach to selecting the number of principal components, Computational Statistics, 33: 1923-1938.
Tibshirani R. et al. (2001): Estimating the number of clusters in a data set via the gap statistic, J. R. Statist. Soc. B, pp. 411-423.
  • 172. Leonardo Auslender –Ch. 1 Copyright 2004 Ch. 1.1-1729/8/2019