Overview of statistical tests: Data handling and data quality (Part II)
1. Overview of Statistical Tests II: Data Handling and Data Quality
Presented by: Jeff Skinner, M.S.
Biostatistics Specialist
Bioinformatics and Computational Biosciences Branch
National Institute of Allergy and Infectious Diseases
Office of Cyber Infrastructure and Computational Biology
2. How Should I Handle My Data?
Three common problems:
• Building and testing a model with the same data
Stepwise model building procedures and similar methods
Not using cross-validation or similar methods
• Confusion between biological and technical replicates
Pseudo-replication
• Identification and handling of outliers
Outliers vs. high influence points
Outlier removal vs. robust statistical methods
3. Building and Testing a Model with the Same Data
• When do we encounter the problem?
Using simple tests to inform complicated tests
Using model selection techniques
• What are the negative effects?
Choosing poor models or “overfitting”
• How do we avoid these problems?
Using designed experiments
Training, Testing and Confirmation data sets
Cross-validation techniques
4. Simple Tests Inform Complex Tests
• Suppose you want to model the factors influencing the severity of some disease
• It seems sensible to test all the variables individually, then test a larger model of only the significant effects
• What are the potential problems with this method?

Variable          | Test                | P-value
------------------|---------------------|---------
Region (hospital) | Chi-Square Test     | 0.0001
Gender            | Chi-Square Test     | 0.073
Age               | Logistic Regression | 0.0043
Weight            | Logistic Regression | 0.1674
Percent Body Fat  | Logistic Regression | 0.0623
Sodium levels     | Logistic Regression | 0.1049
Cholesterol       | Logistic Regression | 0.000495
5. Over-fitting from Simple Tests

Individual Tests:
Variable          | P-value
------------------|--------
Region (hospital) | 0.4281
Gender            | 0.0367
Age               | 0.0043
Weight            | 0.1674
Percent Body Fat  | 0.2623
Sodium levels     | 0.1049
Cholesterol       | 0.0004

Multivariate Model:
Variable                   | P-value
---------------------------|--------
Gender                     | 0.0447
Age                        | 0.0106
Cholesterol                | 0.0032
Gender * Age               | 0.1872
Gender * Cholesterol       | 0.3388
Age * Cholesterol          | 0.6763
Gender * Age * Cholesterol | 0.8961

• Because the variables are significant in the individual tests, they should be significant in the multivariate model
• Some results from individual tests may be false positives
• Because we use the same data to test the multivariate model, the same false positives will be found in its results
6. Simpson’s Paradox

Individual Tests:
Variable          | P-value
------------------|--------
Region (hospital) | 0.4281
Gender            | 0.5367
Age               | 0.0043
Weight            | 0.1674
Percent Body Fat  | 0.2623
Sodium levels     | 0.1049
Cholesterol       | 0.0004

Multivariate Model:
Variable                   | P-value
---------------------------|--------
Gender                     | 0.0447
Age                        | 0.0106
Cholesterol                | 0.0032
Gender * Age               | 0.0229
Gender * Cholesterol       | 0.3388
Age * Cholesterol          | 0.6763
Gender * Age * Cholesterol | 0.8961

• Sometimes the relationship between two variables changes in the presence of a third variable. This is Simpson’s paradox.
• If individual tests are used to build a multivariate model, then sometimes important variables will be omitted because their significance was obscured by an interaction effect.
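The reversal can be seen with a small sketch. The two groups and their values below are invented for illustration, not taken from the hospital data on these slides: within each group X and Y rise together, but pooling the groups reverses the trend.

```python
def slope(points):
    """Ordinary least-squares slope of y on x."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x, _ in points)
    return sxy / sxx

group_a = [(1, 11), (2, 12), (3, 13)]   # positive trend within group A
group_b = [(11, 1), (12, 2), (13, 3)]   # positive trend within group B

print(slope(group_a))            # 1.0
print(slope(group_b))            # 1.0
print(slope(group_a + group_b))  # negative: the pooled trend reverses
```

A model that ignores the grouping variable would report a negative relationship that no individual group actually shows, which is exactly why omitting an important variable (or interaction) can be misleading.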
7. Model Selection Methods
• Goal is to identify the optimal number of variables and the best choice of variables for a multivariable model, using a data set with dozens of possible variables
• Step-wise selection methods
Backwards selection: start with all variables, then remove any unneeded
Forwards selection: start with no variables, then add the best variables
Mixed selection: variables can be added or removed from the model
• Best subsets or all subsets methods
Fit all possible models, then identify the best models by some criteria
8. Model Selection Criteria
• P-values of each potential X-variable
Individual p-values are highly sensitive to the other variables in the model
Individual p-values don’t really test the hypothesis of interest
• R² and adjusted R²
Represent the percent of variation explained by the model
Meaningless or misleading if model assumptions are not met
• Akaike’s Information Criterion (AIC)
Computed as AIC = 2k – 2ln(L)
Function of the log-likelihood and the number of parameters
• Mallow’s Cp
Computed as Cp = SSEp / MSEk – N + 2p
Intended to address the issue of model over-fitting
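The AIC formula above is simple to apply once a model's log-likelihood is known. The log-likelihoods below are hypothetical, just to show how two candidate models would be compared:

```python
def aic(log_likelihood, k):
    """AIC = 2k - 2*ln(L), where k counts the fitted parameters."""
    return 2 * k - 2 * log_likelihood

# Hypothetical fits: a 3-parameter model and a 6-parameter model.
small = aic(log_likelihood=-120.0, k=3)   # 246.0
large = aic(log_likelihood=-118.5, k=6)   # 249.0
print(small, large)
```

Lower AIC is better; here the larger model improves the likelihood slightly, but not enough to pay for its three extra parameters, which is how AIC penalizes over-fitting.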
9. Model Selection Methods
• Model selection methods find the optimal variables for a multivariate model
Optimal number of variables
Identity of the variables
• Model selection methods sometimes use p-values as selection criteria, but these p-values should not be used for hypothesis tests
10. Problems With Model Selection
• P-values do not test the real hypothesis of interest
Model selection seeks to identify the optimal number of variables
H0: k = 0 vs. Ha: k > 0, where k = # of variables
Individual p-values are computed for all possible combinations of variables, most of which are not in the final model
• Individual p-values are computed from multiple tests
Individual p-values would need a strict adjustment for multiple testing
Final p-values unlikely to be statistically significant
• Data driven hypotheses
It is unfair to peek at the data, then only test the largest differences
More likely to generate false positives
11. Data Mining Analyses
• Make predictions from VERY LARGE data sets
Microarray and next generation sequencing (NGS) data
Large databases of clinical or medical records
Credit, banking and financial data
• Special classification models used to accommodate large sample sizes or large numbers of variables
Classification and regression trees (CART)
K-nearest neighbors (KNN) methods
Neural nets, support vector machines (SVM), …
12. Training a Data Mining Model
• Researchers often want to compare several data mining methods to find the best classifier
CART methods versus KNN methods
SVM versus neural nets
• Many data mining models have parameters that must be optimized for each problem
How many branches or splits for a CART?
How many neighbors for KNN?
13. An Example from Data Mining
[Figure] Training Data: misclassifies 2 data points. Test Data: misclassifies 6 data points.
14. An Example from Data Mining
[Figure] Training Data: misclassifies 0 data points. Test Data: misclassifies 5 data points.
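The pattern on these two slides — a classifier flexible enough to drive training error to zero while test error stays high — can be reproduced with a minimal KNN sketch. The Gaussian clusters, cluster centers, and sample sizes below are invented for illustration:

```python
import random

def knn_predict(train, x, k):
    """Majority vote among the k training points nearest to x."""
    nearest = sorted(train, key=lambda p: (p[0][0] - x[0]) ** 2 +
                                          (p[0][1] - x[1]) ** 2)[:k]
    labels = [lab for _, lab in nearest]
    return max(set(labels), key=labels.count)

def error_count(model_data, eval_data, k):
    """Number of points in eval_data misclassified by k-NN on model_data."""
    return sum(knn_predict(model_data, x, k) != lab for x, lab in eval_data)

random.seed(0)
def sample(label, cx, cy, n=20):
    """Draw n noisy points around a class center (synthetic data)."""
    return [((cx + random.gauss(0, 1), cy + random.gauss(0, 1)), label)
            for _ in range(n)]

train = sample("A", 0, 0) + sample("B", 2, 2)
test = sample("A", 0, 0) + sample("B", 2, 2)

# 1-NN memorizes the training set: every training point is its own
# nearest neighbor, so training error is always zero.
print("1-NN training errors:", error_count(train, train, 1))  # 0
print("1-NN test errors:", error_count(train, test, 1))
print("5-NN test errors:", error_count(train, test, 5))
```

Zero training error with 1-NN says nothing about how the classifier generalizes, which is exactly why the slides insist on separate test data.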
15. How Do We Avoid Problems?
Divide our data into two or three groups:
• Training data
Build a model using individual tests or model selection
Train a data mining model to identify optimal parameters
• Test data
Evaluate the model built with the training data
Perform hypothesis tests
• Confirmation data
Evaluate the model built with the training data
Confirm findings from the Test data set
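The three-way division above can be sketched as a single shuffled split. The 50/25/25 fractions are an assumption for illustration, not prescribed by the slides:

```python
import random

def three_way_split(records, fractions=(0.5, 0.25, 0.25), seed=42):
    """Shuffle once, then cut into training / test / confirmation sets."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = int(fractions[0] * n)
    n_test = int(fractions[1] * n)
    training = shuffled[:n_train]
    testing = shuffled[n_train:n_train + n_test]
    confirmation = shuffled[n_train + n_test:]
    return training, testing, confirmation

train, test, confirm = three_way_split(list(range(100)))
print(len(train), len(test), len(confirm))  # 50 25 25
```

The shuffle happens once, before any model is fit, so no record can leak from the training set into the sets used for testing or confirmation.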
16. Cross-validation Methods
• Divide data into slices, then train and test models
Train model with slice #1, test with slices 2, …, 8
Train model with slice #2, test with slices 1, 3, …, 8
…
Train model with slice #8, test with slices 1, …, 7
• Compile results to evaluate the fit of all 8 models
[Figure: the data divided into 8 numbered slices, marked Train and Test]
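A minimal sketch of the scheme described above (train on one slice, test on the remaining slices, then compile all 8 results). The "model" here is a deliberately trivial stand-in — the training mean, scored by mean squared error — and the data are synthetic:

```python
import random

def make_slices(records, n_slices=8, seed=1):
    """Shuffle, then deal the records round-robin into n_slices slices."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n_slices] for i in range(n_slices)]

def cross_validate(records, fit, score, n_slices=8):
    """As on the slide: train on slice i, test on all the other slices,
    and compile one score per model."""
    slices = make_slices(records, n_slices)
    results = []
    for i, train_slice in enumerate(slices):
        held_out = [r for j, s in enumerate(slices) if j != i for r in s]
        model = fit(train_slice)
        results.append(score(model, held_out))
    return results

data = [float(v) for v in range(40)]
fit = lambda train: sum(train) / len(train)                      # "model" = mean
score = lambda m, test: sum((v - m) ** 2 for v in test) / len(test)  # MSE
scores = cross_validate(data, fit, score)
print(len(scores))  # 8 scores, one per model
```

The more common k-fold variant reverses the roles (train on k-1 slices, hold out one for testing), but either way the point is the same: every score is computed on data the model never saw during fitting.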
17. Biological or Technical Replicates?
• How do I analyze data if I pool samples?
• How do I analyze data if I use replicate samples?
• What if I take multiple measurements from the same patient or subject?
• What if I run experiments on a cell line?
18. Experimental Units vs. Sampling Units
• A treatment is a unique combination of all the factors and covariates studied in your experiment
• The experimental unit (EU) is the smallest entity that can receive or accept one treatment combination
• The sampling unit (SU) is the smallest entity that will be measured or observed in the experiment
• Experimental and sampling units are not always the same
19. Example: EU and SU are the Same
• Suppose 20 patients have the common cold
10 patients are randomly chosen to take a new drug
10 patients are randomly chosen for the placebo
Duration of their symptoms (hours) is the response variable
• EU and SU are the same in this experiment
Drug and placebo treatments are applied to each patient
Each patient is sampled to record their duration of symptoms
Therefore EU = patient and SU = patient
20. Example: EU and SU are Different
• 20 flowers are planted in individual pots
10 flowers are randomly chosen to receive dry fertilizer pellets
10 flowers are randomly chosen to receive liquid fertilizer
All six petals are harvested from each flower, and petal length is measured as the response variable
• EU and SU are different in this experiment
Fertilizer treatment is applied to the individual plant or pot
Measurements are taken from individual flower petals
Therefore EU = plant and SU = petal (pseudo-replication)
21. Pseudo-Replication
• Confusion between EU’s and SU’s can artificially inflate sample sizes and artificially decrease p-values
E.g. It is tempting to treat each flower petal as a unique sample (n = 6 x 20 = 120), but the petals are pseudo-replicates
“Pseudoreplication and the Design of Ecological Field Experiments” (Hurlbert 1984, Ecological Monographs)
• Pooling samples can create pseudo-replication problems
E.g. 12 fruit flies are available for a microarray experiment, but you must pool flies into 4 groups of 3 flies each to get enough RNA
Once data are pooled, it is not appropriate to analyze each individual separately in the statistical model
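One standard guard against pseudo-replication is to collapse the sampling units down to one summary value per experimental unit before any hypothesis test. The petal lengths below are hypothetical, echoing the flower example:

```python
from statistics import mean

# Hypothetical petal-length data keyed by plant (the EU); each plant
# contributes several petal measurements (the SUs).
petals = {
    "plant_1": [4.1, 4.3, 4.0, 4.2, 4.4, 4.1],
    "plant_2": [5.0, 5.2, 4.9, 5.1, 5.3, 5.0],
    "plant_3": [4.6, 4.5, 4.7, 4.4, 4.6, 4.5],
}

# Wrong: treat all 18 petals as independent samples (pseudo-replication).
n_pseudo = sum(len(lengths) for lengths in petals.values())

# Right: collapse the SUs to one value per EU before testing.
plant_means = {plant: mean(lengths) for plant, lengths in petals.items()}
n_correct = len(plant_means)

print(n_pseudo, n_correct)  # 18 3
```

Any subsequent t-test or ANOVA then runs on the plant means with n = number of plants, so the p-values reflect the true number of independent experimental units.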
22. Biological vs. Technical Replication
• Sometimes, experiments use multiple EU’s to investigate multiple sources of error with a statistical model
E.g. When measurements are inaccurate, you want to estimate variation between subjects and between multiple measurements
E.g. To evaluate the precision of 2 lie detector machines, you could test 6 subjects measured by 4 technicians each in repeated measurements
Subject and machine effects have EU = subject (biological replicates), but the technician effect has EU = measurement (technical replicates)
• These kinds of experiments must be analyzed with appropriate statistical methods
Split-plot methods evaluate multiple EU’s in one model
23. No Biological Replication?
• Sometimes experiments have no biological replicates
Experiments with cell lines (e.g. cancer cell lines)
Experiments with purified proteins, DNA, macromolecules
Experiments with bacteria, viruses or pathogens???
• Be very careful when you interpret results
Technical replicates represent the precision of your methods
Significant results apply to your specific sample
Results may not extend to larger populations
24. An Illustrative Example
[Figure: 4 batches of vaccine dumped into one “pool”; a single sample from the “pool” is tested in ten egg assays]
• Does the experiment have any replication?
Biological replication? No. Four batches dumped into one pool.
Technical replication? Yes. Ten assays used to detect contamination.
• What can we make inferences about?
Population of all vaccine batches? No. No biological replication.
Contamination of the single sample? Yes. Ten technical replicates used.
Contamination of this specific pool? Maybe.
Contamination of these specific batches? Maybe.
25. What Is An Outlier?
• An outlier is an observation (i.e. sampling unit) that does not belong to the population of interest
Outliers can and should be legitimately removed from the analysis
Identifying outliers is a biological question, not a statistical question
• A high influence point is an observation that has a large impact on the fit of your statistical model
High influence points might be outliers or legitimate data
Several methods exist to identify and handle high influence points
26. Examples of Outliers
• Errors, glitches, typos and “non-data”
Bubbles or bright spots on a microarray
Typos from a medical chart (e.g. age = 334)
• Legitimate samples, but out of scope
Patients with comorbidities or other conditions (e.g. diabetes patient in an AIDS study)
27. Examples of High Influence
• High leverage points
Observations with extreme combinations of predictor and response variables (i.e. outskirts of the design space)
Identified using leverage plots
• Large residuals
Represent a large difference between the predicted value from the model and the observed value from the sample
Large residual = poor model fit for that value
• Large influence on model fit
Remove the value and the model changes dramatically
28. High Leverage Points
Leverage: h_ii = X_i′(X′X)⁻¹X_i
• We expect no relationship between hat size and IQ
• A single observation can change the slope of the line
Hat size = 38, IQ = 190
• Extreme combinations of X and Y variables produce high influence over the analysis
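For simple linear regression with an intercept, the matrix formula h_ii = X_i′(X′X)⁻¹X_i reduces to a scalar expression that is easy to compute by hand. The hat sizes below are invented, echoing the slide's example; one point sits far from the rest:

```python
def leverages(x):
    """Leverage h_ii for simple linear regression (intercept + slope):
    h_ii = 1/n + (x_i - mean(x))^2 / sum((x_j - mean(x))^2),
    the scalar form of h_ii = X_i'(X'X)^-1 X_i."""
    n = len(x)
    mx = sum(x) / n
    sxx = sum((v - mx) ** 2 for v in x)
    return [1 / n + (v - mx) ** 2 / sxx for v in x]

# Hypothetical hat sizes; the size-38 observation sits far from the others.
hat_sizes = [21, 22, 22, 23, 23, 24, 24, 25, 38]
h = leverages(hat_sizes)
print(max(h))        # the hat-size-38 point has by far the highest leverage
print(sum(h))        # leverages always sum to p = 2 parameters here
```

Note that leverage depends only on the X values: a point can have high leverage (and thus the potential to tilt the fitted line) before any response is even observed.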
29. Leverage Plots
• Red “confidence curves” identify significant leverage
Curves that completely overlap the blue line are not significant
Curves that largely do not overlap the blue line have significant leverage
• If leverage is problematic, respond carefully
Identify and remove any outliers, if they exist
Consider alternative models, variable transformations, weighting, etc.
31. Residuals
• Residuals = Observed – Predicted
Also called “errors”: e_i = Y_i – Ŷ_i
• Represent the unexplained variation
Should be independent, identically distributed and random
Overall trend in residuals represents model fit
Large individual residuals may represent high influence
Several different computations for residuals exist
32. Residuals Plots
• Residuals vs. X variable
Evaluate model fit relative to one predictor variable
Use when you suspect one variable fits poorly in a multivariable model
• Residuals vs. predicted values
Evaluate model fit with respect to the entire model
Good if you want a single plot for a multivariable model
• Residuals vs. omitted X variable
Interesting trends appear if an important variable was omitted
33. Good Model Fit
• Expect a rectangular or oval shaped cloud of residuals
• Residuals vs. X variable is used to evaluate independence
E.g. Do we need to model a curved relationship with Age?
• Residuals vs. predicted is used to evaluate the assumption of identically distributed errors
E.g. Non-constant variance
E.g. Larger errors with higher response values
37. Alternative Residual Computations
• Studentized residuals
Divide each residual by the estimate of its standard deviation
Easier to identify high influence points (e.g. > 3 s.d. away from the mean)
• Deleted residuals
Compute the residual after deleting one observation
Evaluate the effect of one observation on model fit
• Deviance or Pearson residuals
Computed for categorical response models (e.g. logistic regression)
Often do not follow typical trends of residuals from linear models
40. Other Indicators of High Influence
• DFFITS
Influence of a single point on a single fitted value
Look for DFFITS > 1 for small n, or DFFITS > 2·sqrt(p/n) for large n
• DFBETAS
Influence of a single point on the regression coefficients
Look for DFBETAS > 1 for small n, or DFBETAS > 2 / sqrt(n) for large n
• Cook’s Distance
Influence of a single point on all fitted values
Compare against the F(p, n – p) distribution
See Kutner, Nachtsheim, Neter and Li. 2005. Applied Linear Statistical Models for more details
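Cook's distance can be computed by hand for a simple regression, combining the residual and leverage ideas from the earlier slides. The data below are fabricated so that one wild point clearly dominates:

```python
def cooks_distances(x, y):
    """Cook's distance for each point of a simple linear regression:
    D_i = e_i^2 * h_ii / (p * MSE * (1 - h_ii)^2), with p = 2 parameters."""
    n, p = len(x), 2
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((v - mx) ** 2 for v in x)
    b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    b0 = my - b1 * mx
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]   # e_i = Y_i - Yhat_i
    mse = sum(e ** 2 for e in resid) / (n - p)
    h = [1 / n + (xi - mx) ** 2 / sxx for xi in x]          # leverages
    return [e ** 2 * hi / (p * mse * (1 - hi) ** 2)
            for e, hi in zip(resid, h)]

# Hypothetical data: a clean y ~ 2x trend plus one wild point at x = 8.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 4.0, 6.2, 7.9, 10.1, 12.0, 14.1, 30.0]
d = cooks_distances(x, y)
print(d.index(max(d)))  # index 7: the (8, 30.0) point dominates
```

The wild point has both a large residual and high leverage, so its Cook's distance towers over the rest — the single-number summary of "influence on all fitted values" named on the slide.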
41. Solutions
• Remove high influence points if they may be outliers
• Fit a completely new model to the data
• Transform variables
Transform X to change the relationship between X and Y
Transform Y to change the distribution of model errors
• Use a weighting scheme to reduce their influence
Use w_i = 1 / sd_i for non-constant variance
Use w_i = 1 / Y_i² or w_i = 1 / X_i² to weight regions of the plot
42. Log-transform X
• Relationship between X and Y changes
• May reduce the impact of some high influence points
44. Weighting Schemes
• Use w_i = 1 / sd_i for non-constant variance
• Use w_i = 1 / Y_i² or w_i = 1 / X_i² to weight regions of the plot
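A weighted least-squares fit using the slide's w_i = 1 / X_i² scheme can be sketched in a few lines. The x and y values below are hypothetical:

```python
def wls_slope_intercept(x, y, w):
    """Weighted least squares for y = b0 + b1*x with weights w_i:
    minimize sum(w_i * (y_i - b0 - b1*x_i)^2)."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw   # weighted mean of x
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw   # weighted mean of y
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    b1 = sxy / sxx
    return my - b1 * mx, b1

# Hypothetical data, roughly y = 2x, with variance growing with x.
x = [1.0, 2.0, 4.0, 8.0, 16.0]
y = [2.2, 3.9, 8.3, 15.4, 33.0]
weights = [1 / xi ** 2 for xi in x]   # w_i = 1 / X_i^2, as on the slide
b0, b1 = wls_slope_intercept(x, y, weights)
print(round(b0, 3), round(b1, 3))
```

Down-weighting the large-x region keeps a few noisy high-influence points at the right edge of the plot from dominating the fitted slope, which is the intent of the weighting schemes listed above.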
45. Thank You
For questions or comments please contact:
ScienceApps@niaid.nih.gov
301.496.4455