Applied Statistics Part 5: Transforming, Weighting, and Torturing Data
1. Applied Statistics
Part 5
By:
MM. H. Farjoo MD, PhD, Bioanimator
Shahid Beheshti University of Medical Sciences
Instagram: @bio_animation
2. Applied Statistics
Part 5
Outliers
Transforming Data
Normalizing Data
Weighting Data
Torturing Data
Robustness
Homoscedasticity and Heteroscedasticity
3. Outliers
When analyzing data, sometimes one value is far
from the others; such a value is called an outlier.
When you find an outlier, consider these questions:
Was the value entered into the computer correctly?
Is the outlier value scientifically impossible? (negative
height or weight, etc.)
Were there any experimental problems caused by a flaw in
the lab devices?
Could the outlier be caused by biological diversity? (This
may be the most exciting finding in your data!)
5. Outliers Cont'd
Don't throw out a value as an outlier until first thinking
about whether the finding is scientifically interesting.
You may have discovered a polymorphism in a gene, or a
new clinical syndrome.
It is especially important to beware of lognormal
distributions.
In a lognormal distribution you find very high values,
which can easily be mistaken for outliers.
Removing these values would be a mistake.
6. Outliers
Hands-on practice
To find outliers in SPSS:
Analyze => Descriptive Statistics => Explore... => Statistics...
=> Outliers check box
To find outliers in Prism:
Column statistics (from welcome screen) => frequency
distribution data and histogram => Analyze => Column
analysis => Identify outliers
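Outside these packages, outlier screening can be sketched in a few lines of Python. This example uses Tukey's 1.5 × IQR rule, which is one common screening rule (not the method Prism or SPSS applies internally); the height values are hypothetical. Note that the rule only flags candidates; the decision to remove a value still requires the scientific judgment described above.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lo or v > hi]

# Hypothetical heights in cm; 310 is scientifically impossible
heights = [170, 172, 168, 175, 171, 169, 174, 310]
print(iqr_outliers(heights))  # -> [310]
```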
7. Transforming Data
Data transformations are an important tool for the proper
statistical analysis of biological data.
A transformation is tried if a quantitative variable:
Does not fit a normal distribution
Has greatly different SDs in different groups
Transformation is NOT a form of playing around with your data
in order to get the answer you want!
So it is essential to be able to defend a data transformation.
8. Transforming Data Cont'd
For transforming, a mathematical operation is performed
on each observation, and the statistics are done on the
transformed numbers.
It is better to use a transformation that other researchers
commonly use in your field.
It is often better to use a more common but less effective
transformation, so people are not skeptical.
Data don't have to be perfectly normal; parametric tests
aren't very sensitive to this assumption.
10. Transforming Data Cont'd
It is NOT a good idea to report your results (means,
SD, CI, etc.) in transformed units.
You should back-transform the results by applying the
inverse of the function used for the transformation.
It is also important that you decide which
transformation to use before you do the statistical test.
Trying different transformations until you find one
that gives you a significant result is cheating.
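The back-transformation idea can be sketched in Python: compute the mean and CI on the log scale, then apply the inverse function (10 to the power x undoes log10) so the reported values are in the original units. The data are hypothetical; note that back-transforming a log-scale mean gives the geometric mean.

```python
import numpy as np
from scipy import stats

# Hypothetical lognormal-looking measurements
data = np.array([1.2, 3.5, 2.8, 10.4, 5.1, 7.9])
logs = np.log10(data)

# Statistics are done on the transformed numbers
mean_log = logs.mean()
sem_log = stats.sem(logs)
ci_log = stats.t.interval(0.95, df=len(logs) - 1, loc=mean_log, scale=sem_log)

# Back-transform: 10**x is the inverse of log10
geo_mean = 10 ** mean_log            # this is the geometric mean
ci = (10 ** ci_log[0], 10 ** ci_log[1])
print(f"geometric mean = {geo_mean:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```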
11. Transforming Data Cont'd
Log transformation:
Consists of taking the log of each observation.
You can use either base-10 or base-e logs; it
makes no difference for a statistical test.
You should specify which log you're using, as it
affects the slope and intercept in a regression.
Many variables in biology have lognormal
distributions.
This means that after log transformation, the values
are normally distributed.
12. Transforming Data Cont'd
Square-root transformation:
Consists of taking the square root of each observation.
Arcsine transformation:
Consists of taking the arcsine of the square root of a
number.
The numbers must be in the range 0 to 1.
This is used for proportions, which range from 0 to 1.
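The three transformations above are one-liners in Python; the counts and proportions here are hypothetical example values.

```python
import numpy as np

counts = np.array([4.0, 9.0, 16.0, 25.0])          # e.g. colony counts
proportions = np.array([0.10, 0.25, 0.50, 0.90])   # must lie in [0, 1]

log_t = np.log10(counts)                     # log transformation
sqrt_t = np.sqrt(counts)                     # square-root transformation
arcsine_t = np.arcsin(np.sqrt(proportions))  # arcsine of the square root

print(sqrt_t)  # -> [2. 3. 4. 5.]
```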
13. Transforming Data
Hands-on practice
To transform data in SPSS:
Transform => compute variables
To transform data in Prism:
Column statistics (from welcome screen) => frequency
distribution data and histogram => Analyze => Transform,
Normalize… => Transform => “Transform Y values using”
check box
14. Normalizing Data
We often want to compare data on different scales or
even different units.
To do so, we “eliminate” the scale of measurements,
and “constrain” them to predetermined restrictions.
This is called normalization, and puts different
variables into comparable units.
Investigators commonly normalize dose-response
curves so all curves begin and end at constant values
(usually 0% & 100%).
16. Normalizing Data Cont'd
To fit a curve to the normalized data, we “constrain”
the bottom and top plateaus to predetermined values
(usually 0% and 100%).
In this way all parameters of the curves are
comparable (e.g. EC50, slope, intercept, etc.)
If you normalize, don't also weight the data.
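Normalizing to a 0% to 100% scale is a simple rescaling. In this sketch the bottom and top plateaus are taken to be the smallest and largest observed responses, which is an assumption; in practice they may instead be fitted or predetermined values. The response values are hypothetical.

```python
import numpy as np

# Hypothetical raw responses along a dose-response curve (arbitrary units)
response = np.array([220.0, 260.0, 410.0, 730.0, 940.0, 980.0])

# Define 0% as the bottom plateau and 100% as the top plateau,
# then put every point onto that common percentage scale
bottom, top = response.min(), response.max()
percent = (response - bottom) / (top - bottom) * 100
print(percent)  # first point -> 0.0, last point -> 100.0
```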
17.
18. Normalizing Data
Hands-on practice
To normalize data in SPSS:
Transform => Analyze => Regression => Probit...
To normalize data in Prism:
Column statistics (from welcome screen) => frequency
distribution data and histogram => Analyze => Transform,
Normalize… => Normalize
19. Weighting Data
The US presidential candidates in 1936 were Alfred Landon
and Franklin Roosevelt.
The magazine “Literary Digest” had correctly predicted the
results in 1920, 1924, 1928, and 1932.
It surveyed 10 million people in 1936, and 2.4 million of them
responded (wow!).
Literary Digest predicted that Landon would be the winner, but
Landon failed and Roosevelt won (a louder wow!!).
The sampling error was 19%, the largest ever in the USA.
The magazine was so discredited that it folded two years later,
in 1938, after 48 years of brilliant circulation.
21. Weighting Data Cont'd
A sample may NOT be representative of its population.
This happens because of:
Non-response
Self-selection (in an online survey)
Sampling error (e.g. selection bias), or just bad luck!
A commonly applied correction technique is weighting.
Under-represented groups get a weight larger than 1, and
over-represented groups get a weight smaller than 1.
In the calculations, the weighted values are used instead
of the raw values.
22. Weighting Data Cont'd
A weighting adjustment can only be carried out if appropriate
and valid auxiliary variables are available.
Gallup's Institute of Public Opinion correctly predicted the
result of the 1936 election using a sample size of only 50,000.
The morals of the Literary Digest story are:
Making a bad sample bigger does NOT correct the sampling error
A badly chosen big sample is much worse than a well-chosen small
sample
Watch out for selection bias and nonresponse bias, and correct them
by weighting
The Gallup institute was better than Literary Digest!!
23. Weighting Data
Gender is an auxiliary variable

                    Male                Female
Population          50%                 50%
Sample              20%                 80%
What film?          Action (Western)    Drama (Indian)
Weight              2.5 (50 / 20)       0.625 (50 / 80)
Result in sample    50% (2.5 * 20%)     50% (0.625 * 80%)
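The table's arithmetic can be reproduced directly: each weight is the population share divided by the sample share, and multiplying the weights back in restores the population proportions.

```python
# Gender as the auxiliary variable, shares as fractions
population_share = {"male": 0.50, "female": 0.50}
sample_share = {"male": 0.20, "female": 0.80}

# Weight = population share / sample share
weights = {g: population_share[g] / sample_share[g] for g in population_share}
print(weights)  # -> {'male': 2.5, 'female': 0.625}

# Applying the weights restores the population shares in the sample
weighted = {g: weights[g] * sample_share[g] for g in sample_share}
print(weighted)  # -> {'male': 0.5, 'female': 0.5}
```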
25. Weighting Data
Hands-on practice
To weight data in SPSS:
Data => weight cases...
To weight data in Prism (note: Prism uses weighting
for nonlinear regression):
XY (from welcome screen) => Nonlinear regression =>
Analyze => XY analysis => Nonlinear regression (curve fit)
=> choose an equation from “Fit” tab => Weight tab =>
Weighting method
27. Torturing Data
When scientists don't get the results they want, they often
resort to tactics such as:
Change the definition of the outcome.
Use a different time scale.
Try different criteria for including or excluding a subject.
Arbitrarily decide which points to remove as outliers.
Try different ways to clump or separate subgroups.
Try different ways to normalize the data.
Try different algorithms for computing statistical tests.
Try different statistical tests.
If the results are still 'negative', then don't publish them!!
28. Robustness
Robustness in statistical tests means the test can
“resist” violation(s) of the test assumption(s).
If a test is robust, the result is not affected
considerably by the absence or alteration of the
condition (e.g. normal distribution).
It is similar to buffer solutions in chemistry, which
resist changes in pH.
29. Homoscedasticity & Heteroscedasticity
Homoscedasticity and heteroscedasticity are concepts
usually considered in ANOVA and the t test.
Parametric tests assume that data are homoscedastic (have
the same SD in different groups).
If the data are heteroscedastic, the probability of
obtaining a false positive is greater than the alpha level.
Heteroscedasticity is not a problem with balanced designs
(equal sample sizes in each group).
You should always compare the SDs of the groups for
heteroscedasticity (especially with unbalanced designs).
30. Homoscedasticity & Heteroscedasticity Cont'd
There is no agreement about when heteroscedasticity
is large enough to invalidate a test that assumes equal SDs.
To test homoscedasticity, Bartlett's test is used (here,
retaining H0, i.e. equal variances, is the desirable outcome).
Bartlett's is not a very good test, so do not panic if it
returns a significant P value.
When the SDs are different, the first remedy is data
transformation.
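Bartlett's test is available in SciPy; here is a minimal sketch on simulated groups, one of which has a much larger SD. A small P value rejects H0 (equal variances), flagging heteroscedasticity; the data are simulated, not from a real experiment.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(10, 1, 30)   # SD = 1
group_b = rng.normal(10, 1, 30)   # SD = 1
group_c = rng.normal(10, 5, 30)   # SD = 5: heteroscedastic

# H0: all groups have equal variances
stat, p = stats.bartlett(group_a, group_b, group_c)
print(f"Bartlett statistic = {stat:.1f}, P = {p:.2g}")
```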
31. Homoscedasticity & Heteroscedasticity Cont'd
Bartlett's test is often used to compare the effects of
various transformations, to find the one that gives the
biggest P value.
If data transformation is not successful, the Welch test
(correction) is used as an alternative.
The Welch test does not assume equal SDs.
Non-parametric tests do not assume a normal
distribution, but they do assume homoscedasticity.
So non-parametric tests are NOT a good solution for
heteroscedasticity.
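In SciPy, passing equal_var=False to scipy.stats.ttest_ind requests the Welch correction instead of Student's equal-variance t test. The groups below are simulated with unequal SDs and an unbalanced design, the situation where the correction matters most.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
treated = rng.normal(12, 4, 15)   # larger SD, smaller n (unbalanced)
control = rng.normal(10, 1, 40)

# equal_var=False gives Welch's t test, which does not assume equal SDs
t_welch, p_welch = stats.ttest_ind(treated, control, equal_var=False)
print(f"Welch t = {t_welch:.2f}, P = {p_welch:.4f}")
```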