2. Leonardo Auslender –Ch. 1 Copyright 2004 -2
2019-02-07
Contents
Definition
Data sets and Examples
UEDA: Univariate EDA
Definitions of central tendency and variability.
Data set descriptions.
Transformations.
Statistical Inference – Probability.
Univariate Data Distributions – Normality.
CLT – Statistical Tests.
Sampling (under construction)
Statistical Puzzles
Bayesian Inference (under construction).
BEDA
Continuous Variables – Correlations - Causation
Nominal Variables – Chi Square – Odds
Simpson’s Paradox
Contents (cont.)
MEDA
Principal Components
Factor Analysis
Clustering and Segmentation
Canonical Discrimination Analysis
Missing Value imputation
Outliers and Variable Transformations.
EDA: definition, purpose and usefulness.
EDA is typically (or should be) the initial step to view and try to comprehend a
data set, assumed to be of rectangular form: columns are called variables, rows
observations. The aim is comprehension of individual variables and of
relationships across variables.
Given the size of present databases, with hundreds if not thousands or more
variables or attributes and possibly millions of observations, it is hard or even
impossible to obtain a full conceptual understanding of the informational
content of the data. Since many applications lead to (some) modeling, it is also
possible and desirable to perform EDA on the outcome of these procedures.
We can thus envision EDA applied twice, prior to and after model creation,
possibly restarting the modeling effort.
Variables are random when their values cannot be known with certainty, i.e.,
they are not deterministic. For instance, it is not possible to know the
next outcome of a roulette bet, but we “know” the probability of success of such
an outcome.
When the probability is not knowable, we speak of uncertainty; otherwise, of risk.
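The roulette case can be sketched in a few lines of Python. This is only an illustration: the wheel layout (European, 37 pockets, 18 red) and the names `p_red` and `simulate_red_share` are assumptions, not part of the slides. The point is that under risk the probability is known before any spin, while each individual outcome remains unknown.

```python
import random
from fractions import Fraction

# Assumption: a European wheel with 37 pockets, 18 of which are red.
POCKETS = 37
RED_POCKETS = 18

# Risk: the probability of the outcome is known exactly before the spin.
p_red = Fraction(RED_POCKETS, POCKETS)


def simulate_red_share(n_spins, seed=0):
    """Share of simulated spins that land on red."""
    rng = random.Random(seed)
    # Pockets 0..17 stand in for the red ones; only their count matters here.
    hits = sum(1 for _ in range(n_spins) if rng.randrange(POCKETS) < RED_POCKETS)
    return hits / n_spins


print("known probability of red:", p_red)
print("observed share in 20,000 spins:", simulate_red_share(20_000))
```

Each single spin stays random, yet the observed share settles near the known probability. Under uncertainty there would be no `p_red` to write down in the first place.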
Data sets in a large-data setting contain hundreds (if not more) variables,
with taxonomy:
1) Numeric.
2) Character: considered nominal, i.e., not cardinal (no ordering or magnitude).
3) Ordinal.
4) Ratio: ratio of numeric variables.
5) Id:
   a. Id proper, nominal variables: patient id, SSN, etc.
   b. Indices: time index, patient visit number, etc.
Can compute                                             Nominal  Ordinal  Interval  Ratio
frequency distribution                                  Yes      Yes      Yes       Yes
median and percentiles                                  No       Yes      Yes       Yes
add or subtract                                         No       No       Yes       Yes
mean, standard deviation, standard error of the mean    No       No       Yes       Yes
ratio, or coefficient of variation                      No       No       No        Yes
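The table above can be encoded as a small lookup, which is handy for checking whether a summary statistic is meaningful for a given variable before computing it. This is a minimal Python sketch; the names `LEVELS`, `MIN_LEVEL`, `can_compute` and `allowed` are illustrative assumptions and do not come from the slides.

```python
# Measurement levels, from least to most structure.
LEVELS = ["nominal", "ordinal", "interval", "ratio"]

# Lowest level at which each computation is meaningful (rows of the table).
MIN_LEVEL = {
    "frequency distribution": "nominal",
    "median and percentiles": "ordinal",
    "add or subtract": "interval",
    "mean, standard deviation, standard error of the mean": "interval",
    "ratio, or coefficient of variation": "ratio",
}


def can_compute(statistic, level):
    """True if `statistic` is meaningful for a variable measured at `level`."""
    return LEVELS.index(level) >= LEVELS.index(MIN_LEVEL[statistic])


def allowed(level):
    """All computations from the table that are meaningful at `level`."""
    return [s for s in MIN_LEVEL if can_compute(s, level)]


print(allowed("ordinal"))
```

For an ordinal variable only the first two rows of the table apply, and every row applies to a ratio variable.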
Proposed EDA steps
Variables (attributes) can be analyzed in themselves (univariate), in relation to
other single variables (bivariate) and as a whole (multivariate).
Conceptualization becomes more difficult as we progress from one to many, and
requires answers to questions such as: which one is truly bigger?
We can envision 5 steps in EDA:
1) Data set definition: at least number of observations, variables and taxonomy.
2) UEDA, univariate: central location and dispersion measures.
3) BEDA, bivariate: mostly correlation and contingency tables.
4) MEDA, multivariate: missing values analysis and imputation, principal
   components and clustering.
5) Outliers and variable transformation or engineering, which involve UEDA, BEDA
   and MEDA; e.g., log(Var1) is a univariate transformation, PCA a multivariate
   transformation.
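The first step, data set definition, can be sketched as follows in Python. The function `define_data_set` and the toy rows are hypothetical and used only for illustration; the taxonomy here is deliberately crude (numeric vs. character), and telling ordinal, ratio or id variables apart still requires knowledge of the data.

```python
def define_data_set(rows):
    """Number of observations, number of variables and a crude taxonomy.

    `rows` is a rectangular data set given as a list of dicts
    (one dict per observation, one key per variable).
    """
    variables = list(rows[0].keys()) if rows else []
    taxonomy = {}
    for var in variables:
        # Missing values (None) do not decide the type of a variable.
        values = [row[var] for row in rows if row[var] is not None]
        is_numeric = bool(values) and all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
        )
        taxonomy[var] = "numeric" if is_numeric else "character"
    return {"n_obs": len(rows), "n_vars": len(variables), "taxonomy": taxonomy}


# Made-up toy values, not taken from any data set in these slides.
toy = [
    {"patient_id": "A01", "visits": 3, "spend": 120.5},
    {"patient_id": "A02", "visits": 1, "spend": None},
    {"patient_id": "A03", "visits": 7, "spend": 88.0},
]
summary = define_data_set(toy)
print(summary)
```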
Statistical Inference
Statistical inference embraces all of EDA and modeling, and allows for
comparisons and statements about similarity or dissimilarity in
many different situations.
All data sets are assumed to be representative samples from an
infinite population, which can be unrealistic. If we are interested in
inferences on the heights of first graders in a specific school at a
specific time, the data is the entire population, there is no
uncertainty about the recorded heights, and thus statistical inference
is not required.
If interest is in height changes across time (and we thus work with
samples), then we use statistical inference to infer information
about the overall population (perhaps ideal, not real).
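The difference between the two readings of the same heights can be illustrated with a short Python sketch. The heights below are hypothetical values in cm, and `mean_and_standard_error` is an illustrative helper, not something defined in the slides. Read as a full population, the mean is simply a fact; read as a sample, the same mean comes with a standard error that quantifies sampling variability.

```python
import math
import statistics


def mean_and_standard_error(sample):
    """Sample mean and standard error of the mean (sample std / sqrt(n))."""
    n = len(sample)
    mean = statistics.fmean(sample)
    se = statistics.stdev(sample) / math.sqrt(n)  # stdev uses the n - 1 divisor
    return mean, se


# Hypothetical heights in cm (made up for illustration).
heights_cm = [112.0, 118.0, 115.0, 121.0, 109.0]

# Reading 1: these are all first graders of interest, so the mean is exact.
population_mean = statistics.fmean(heights_cm)

# Reading 2: these are a sample from a larger population, so report mean and SE.
sample_mean, se = mean_and_standard_error(heights_cm)

print("population reading:", population_mean)
print("sample reading: mean", sample_mean, "standard error", round(se, 3))
```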
Data set 1: Definition by way of Example
• Health insurance company: Ophthalmologic insurance claims.
• Is the claim valid or fraudulent?
• Present operation: manual review of history and circumstances.
• Alternative: scoring analytical system.
• Data mining solution: use data on past claims to verify fraud.
Alphabetic List of Variables and Attributes
#  Variable         Type  Len  Format   Informat  Label
3  DOCTOR_VISITS    Num   8    BEST12.  F12.      Total visits to a doctor
1  FRAUD            Num   8    BEST12.  F12.      Fraudulent Activity yes/no
5  MEMBER_DURATION  Num   8                       Membership duration
4  NO_CLAIMS        Num   8    BEST12.  F12.      No of claims made recently
7  NUM_MEMBERS      Num   8                       Number of members covered
6  OPTOM_PRESC      Num   8    BEST12.  F12.      Number of opticals claimed
2  TOTAL_SPEND      Num   8    BEST12.  F12.      Total spent on opticals
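SAS example for the fraud data set:
ods html;                          /* send output to the HTML destination */
proc contents data = fraud.fraud;  /* list variables and their attributes */
run;
ods html close;                    /* close the HTML destination */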
Data set 2 (DS2): Babies’ deaths (health care example).
• Babies’ death or survivability.
• Determine basic statistics for mostly binary variables.
• Data anomalies?
• Is present data representative or anomalous?
Alphabetic List of Variables and Attributes
# Variable Type Len Label
16 Const Num 8
18 H Num 8
17 M1 Num 8
7 abort Num 8 past abortion
1 death Num 8 Death
9 dyslab Num 8 Labor progress
5 gestage Num 8 Gestational AGe
8 hydramnios Num 8 Too much amniotic fluid
6 isoimm Num 8 Iso immunization
15 malpres Num 8 Mal Presented
11 nomonit Num 8 No Monitor
2 nonwhite Num 8 Non-White
4 nullip Num 8 Null Parity
10 placord Num 8 Placental - cord anomaly
14 prerupt Num 8 PROM
3 teenages Num 8 Early Age
12 twint Num 8 Twin, Triplet
13 ward Num 8 Public Ward
Data Set HMEQ (DS 3)
Reports characteristics and delinquency information for 5,960
home equity loans. A home equity loan is a loan where the obligor uses
the equity of his or her home as the underlying collateral.
The data set has the following variables:
◾ BAD: 1 = applicant defaulted on loan or seriously delinquent; 0 = applicant paid loan
◾ LOAN: Amount of the loan request
◾ MORTDUE: Amount due on existing mortgage
◾ VALUE: Value of current property
◾ REASON: DebtCon = debt consolidation; HomeImp = home improvement
◾ JOB: Occupational categories
◾ YOJ: Years at present job
◾ DEROG: Number of major derogatory reports
◾ DELINQ: Number of delinquent credit lines
◾ CLAGE: Age of oldest credit line in months
◾ NINQ: Number of recent credit inquiries
◾ CLNO: Number of credit lines
◾ DEBTINC: Debt-to-income ratio
NEXT:
UEDA: Single variable analysis.
BEDA: Analysis of pairs of variables, typically correlation and
chi-square analysis.
MEDA: Focuses on dimension reduction. The dimensions studied
are variables and observations:
• Methods to group variables (PCA, FA).
• Methods to group observations (cluster analysis).
Plus:
• Missing value imputation.
• Multivariate transformations.
• Multivariate outlier detection.