DATA SCREENING
Wei-Jiun Shen, Ph.D.

Anything that can go wrong will go wrong

Why do we need to screen data?

Purpose
 Detect and correct data errors
 Detect and treat missing data
 Detect and handle insufficiently sampled variables
 Conduct transformations and standardizations
 Detect and handle outliers

First concern
 Accuracy of data file
 Descriptive statistics
 Graphic representations
 Honest correlations
 Missing data
 Pattern or amount
 Random or not
 Outliers

MISSING DATA
“Blank” parts of the data set

Why is missing data a problem?
 Systematic problem
 Biased sampling
 Demographic variables
 Inappropriate measuring procedure
 Behavioral items
 Insufficient amount for analysis
 Small sample
 Misleading research results
 Biased data in, _______ out

Probability distribution of missingness
 Consider the probability of missingness
 Are certain groups more likely to have missing values?
 Are female respondents less likely to report age?
 Are certain responses more likely to be missing?
 Are respondents with high SPA less likely to report anxiety?
 Certain analysis methods assume a particular probability distribution of missingness
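
The group question above ("are certain groups more likely to have missing values?") can be checked with a simple tabulation. The following is a minimal sketch, not a formal test: the record layout, the field names `sex` and `age`, and the sample values are all hypothetical, and `None` is assumed to mark a missing response.

```python
from collections import defaultdict

def missing_rate_by_group(records, group_key, value_key):
    """Return {group: share of records in that group whose value is missing}."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for row in records:
        group = row[group_key]
        totals[group] += 1
        if row.get(value_key) is None:  # None is the assumed missing-value marker
            missing[group] += 1
    return {g: missing[g] / totals[g] for g in totals}

# Hypothetical survey records, invented for illustration only.
survey = [
    {"sex": "F", "age": None},
    {"sex": "F", "age": 31},
    {"sex": "F", "age": None},
    {"sex": "F", "age": 27},
    {"sex": "M", "age": 45},
    {"sex": "M", "age": 38},
    {"sex": "M", "age": None},
    {"sex": "M", "age": 52},
]

rates = missing_rate_by_group(survey, "sex", "age")
print(rates)
```

A clear gap between the group rates suggests that missingness depends on group membership, which is worth following up with a formal test.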

Missing completely at random (MCAR)
 Missing data is independent of any other measured variable (y2) and independent of the variable itself (y1)
 E.g., SES = y2; depression = y1.
 If participants dropped out across the full range of SES levels, then missingness on depression would be independent of SES
 Little’s MCAR test in MVA (Missing Value Analysis) indicates whether the data are MCAR (want a nonsignificant result)

Missing at random (MAR)
 Missing data may be dependent on another measured variable (y2), but is independent of the variable itself (y1).
 E.g., SES = y2; depression = y1.
 If only participants from high levels of SES dropped out, then missingness on depression would be dependent on SES.
 MAR can be inferred if Little’s test is significant but missingness is predictable from other variables (other than the variable itself), as tested by the Separate Variance Test. MNAR (missing not at random) is indicated if this test reveals missingness related to the DV
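
The idea behind a separate variance test can be sketched as a Welch (unequal variance) t statistic: split the cases by whether y1 is missing, then compare the two groups on y2. This is an illustrative sketch only, not the output of any statistics package. The function name, field names, and data are hypothetical, and the predictor is assumed to be fully observed.

```python
import math
from statistics import mean, variance

def separate_variance_t(records, target, predictor):
    """Welch t and df comparing `predictor` for cases missing vs. observed on `target`."""
    missing = [r[predictor] for r in records if r[target] is None]
    observed = [r[predictor] for r in records if r[target] is not None]
    v_m = variance(missing) / len(missing)     # squared standard error, missing group
    v_o = variance(observed) / len(observed)   # squared standard error, observed group
    t = (mean(missing) - mean(observed)) / math.sqrt(v_m + v_o)
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = (v_m + v_o) ** 2 / (
        v_m ** 2 / (len(missing) - 1) + v_o ** 2 / (len(observed) - 1)
    )
    return t, df

# Hypothetical data: depression is missing mostly for high-SES participants.
data = [
    {"ses": 8, "depression": None},
    {"ses": 9, "depression": None},
    {"ses": 7, "depression": None},
    {"ses": 9, "depression": None},
    {"ses": 3, "depression": 12},
    {"ses": 4, "depression": 15},
    {"ses": 5, "depression": 9},
    {"ses": 4, "depression": 11},
    {"ses": 2, "depression": 14},
    {"ses": 6, "depression": 10},
]

t, df = separate_variance_t(data, target="depression", predictor="ses")
print(f"t = {t:.2f}, df = {df:.1f}")
```

A large |t| here means missingness on depression is related to SES, which is consistent with MAR rather than MCAR.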

Treatment for missing data
 Deleting cases or variables
 Descriptive statistics
 Estimating missing data
 Using missing data correlation matrix
 Treating missing data as data
 Repeating analyses with and without missing data
 Choosing among methods for dealing with missing data
 Pattern or amount

Deletion or preservation?
 Deletion
 <5%
 MCAR/MAR
 Preservation
 MNAR
 Small sample
 Replacement
 Mean (grand or group)
 Regression (predict missing value by other IVs)
 Expectation Maximization (form the missing-data correlation matrix under an assumed distribution)
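
Mean replacement (grand or group) from the list above can be sketched in a few lines. This is a minimal illustration under stated assumptions: `None` marks a missing score, the function name `impute_mean` and the sample data are hypothetical, and a group with no observed scores falls back to the grand mean.

```python
from statistics import mean

def impute_mean(values, groups=None):
    """Replace None with the grand mean, or with the group mean if groups are given."""
    grand = mean(v for v in values if v is not None)
    if groups is None:
        return [grand if v is None else v for v in values]

    # Collect the observed scores per group.
    observed = {}
    for v, g in zip(values, groups):
        if v is not None:
            observed.setdefault(g, []).append(v)

    filled = []
    for v, g in zip(values, groups):
        if v is None:
            # Assumption: fall back to the grand mean for an all-missing group.
            v = mean(observed[g]) if g in observed else grand
        filled.append(v)
    return filled

# Hypothetical scores for two groups, "a" and "b".
scores = [10, None, 14, 20, None, 24]
groups = ["a", "a", "a", "b", "b", "b"]

print(impute_mean(scores))          # grand-mean replacement
print(impute_mean(scores, groups))  # group-mean replacement
```

Mean replacement keeps the sample size but shrinks the variance of the variable, which is one reason to repeat the analysis with and without the replaced values.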

OUTLIER
Cases with extreme values on one or more variables

Why are outliers a problem?
 Systematic problem
 Biased sampling
 Wrong population
 Statistical problem
 ↑ error variance
 ↓ statistical power
 ↑ Type I and Type II error
 ↓normality
 Misleading research results
 Biased data in, _______ out

Influence of outlier
 Leverage × discrepancy

Treatment for outliers
 Detecting outliers
 Standardized score (|z| > 2, 2.5, or 3)
 Graphical methods (P-P, Q-Q plot)
 Mahalanobis distance (χ2 test)
 Deletion or transformation
 Critical to analysis or not
 Preservation
 Transformation
 Score alteration
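
The standardized-score rule above can be sketched as follows. This is an illustrative sketch for one variable at a time: the function name and the data are hypothetical, and the cutoff is left as a parameter because the choice among 2, 2.5, and 3 is a judgment call.

```python
from statistics import mean, stdev

def z_score_outliers(values, cutoff=3.0):
    """Return (index, value, z) for every case whose |z| exceeds the cutoff."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 in the denominator)
    flagged = []
    for i, v in enumerate(values):
        z = (v - m) / s
        if abs(z) > cutoff:
            flagged.append((i, v, z))
    return flagged

# Hypothetical scores with one extreme case at the end.
scores = [12, 14, 13, 15, 14, 13, 12, 15, 14, 60]

print(z_score_outliers(scores, cutoff=2.5))
print(z_score_outliers(scores, cutoff=3.0))
```

With only ten cases the extreme score is flagged at 2.5 but not at 3, because the outlier itself inflates the standard deviation. In a sample of size n, |z| cannot exceed (n - 1)/√n, so a stricter cutoff may never trigger in small samples.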

NORMALITY, LINEARITY & HOMOSCEDASTICITY
Basic assumptions

Key assumptions in GLM
 Normality
 Linearity
 Homogeneity of variance
 Interval level data
 Independence of observations

Normality
 Normal distribution

Test for normality
 Skewness & Kurtosis

Test for normality
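 Significance test (z) for skewness & kurtosis scores
 Kolmogorov-Smirnov test & Shapiro-Wilk test
$$Z = \frac{S - 0}{s_S} \qquad \text{or} \qquad Z = \frac{K - 0}{s_K}$$
where S (K) is the sample skewness (kurtosis) and s_S (s_K) is its standard error.
$$W = \frac{\left(\sum_{i=1}^{n} a_i x_{(i)}\right)^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
where x_(i) are the ordered observations, a_i are the Shapiro-Wilk coefficients, and x̄ is the sample mean.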
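
The z statistics for skewness and kurtosis can be computed by hand as a check. The sketch below assumes the bias-adjusted sample estimators and the standard-error formulas that common statistics packages report. The function name and data are hypothetical, and at least four cases are required.

```python
import math

def skew_kurtosis_z(values):
    """Sample skewness and excess kurtosis, their standard errors, and z = estimate / SE."""
    n = len(values)
    m = sum(values) / n
    s = math.sqrt(sum((x - m) ** 2 for x in values) / (n - 1))
    z3 = sum(((x - m) / s) ** 3 for x in values)
    z4 = sum(((x - m) / s) ** 4 for x in values)

    # Bias-adjusted estimators (skewness and excess kurtosis).
    skew = n / ((n - 1) * (n - 2)) * z3
    kurt = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * z4
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))

    # Standard errors under normality.
    se_skew = math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    se_kurt = 2 * se_skew * math.sqrt((n * n - 1) / ((n - 3) * (n + 5)))

    return {
        "skew": skew, "se_skew": se_skew, "z_skew": skew / se_skew,
        "kurt": kurt, "se_kurt": se_kurt, "z_kurt": kurt / se_kurt,
    }

# Hypothetical right-skewed scores.
result = skew_kurtosis_z([1, 1, 1, 2, 2, 3, 4, 8, 20])
print({name: round(value, 3) for name, value in result.items()})
```

Each z is compared with a normal critical value. In large samples even trivial departures from normality become significant, so the plots remain worth inspecting alongside the tests.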

Test for normality
 Plotting cumulative distribution function

Test for normality
 P-P plot (probability) & Q-Q plot (quantile)

Linearity
 Straight-line relationship between 2 variables

Homoscedasticity
 Homogeneity of variance
 Homogeneity of variance-covariance matrix

Data transformations
Direction  Skewness                Treatment
+          Moderate                New X = SQRT(X)
+          Substantial             New X = LG10(X)
+          Substantial with zero   New X = LG10(X + C)
+          Severe                  New X = 1/X
+          L-shaped with zero      New X = 1/(X + C)
-          Moderate                New X = SQRT(K - X)
-          Substantial             New X = LG10(K - X)
-          J-shaped                New X = 1/(K - X)
C = a constant added to each score so that the smallest score is 1.
K = a constant from which each score is subtracted so that the smallest score is 1; usually equal to the largest score + 1.
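
The transformation table can be sketched as a single helper. This is a minimal illustration, not a recommendation of any one transformation: the function name and the severity labels are hypothetical, C is applied only when the log or inverse would otherwise meet a zero or negative score, and K is taken as the largest score + 1.

```python
import math

def transform_skewed(values, severity, direction="+"):
    """Apply SQRT ('moderate'), LG10 ('substantial'), or inverse ('severe')."""
    if direction == "-":
        k = max(values) + 1                   # K: reflect so the smallest new score is 1
        base = [k - x for x in values]
    else:
        needs_shift = severity != "moderate" and min(values) <= 0
        c = 1 - min(values) if needs_shift else 0   # C: shift so the smallest score is 1
        base = [x + c for x in values]

    functions = {
        "moderate": math.sqrt,
        "substantial": math.log10,
        "severe": lambda v: 1 / v,
    }
    return [functions[severity](v) for v in base]

print(transform_skewed([1, 4, 9], "moderate"))        # SQRT(X)
print(transform_skewed([0, 9, 99], "substantial"))    # LG10(X + C), with C = 1
print(transform_skewed([1, 6, 9], "moderate", "-"))   # SQRT(K - X), with K = 10
```

Reflecting scores for negative skew reverses their order, so the direction of any later interpretation must be reversed as well.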

Check list
 Descriptive statistics
 Range
 Mean & SD
 Skewness & kurtosis
 Missing data (missing value analysis)
 Normal distribution
 Kolmogorov-Smirnov test (n>50)
 Shapiro-Wilk test (n<50)
 Skewness & kurtosis
 P-P plot
 Outlier (univariate/multivariate: z-score/Mahalanobis distance)
 Linearity
 Homoscedasticity
 Multicollinearity
