Overview of Statistical Tests II:
Data Handling and Data Quality
Presented by: Jeff Skinner, M.S.
Biostatistics Specialist
Bioinformatics and Computational Biosciences Branch
National Institute of Allergy and Infectious Diseases
Office of Cyber Infrastructure and Computational Biology
How Should I Handle My Data?
Three common problems:
• Building and testing a model with the same data
 Stepwise model building procedures and similar methods
 Not using cross-validation or similar methods
• Confusion between biological and technical replicates
 Pseudo-replication
• Identification and handling of outliers
 Outliers vs. high influence points
 Outlier removal vs. robust statistical methods
Building and Testing a Model
with the Same Data
• When do we encounter the problem?
 Using simple tests to inform complicated tests
 Using model selection techniques
• What are the negative effects?
 Choosing poor models or “overfitting”
• How do we avoid these problems?
 Using designed experiments
 Training, Testing and Confirmation data sets
 Cross-validation techniques
Simple Tests Inform Complex Tests
• Suppose you want to model
the factors influencing the
severity of some disease
• It seems sensible to test all
the variables individually,
then test a larger model of
only the significant effects
• What are the potential
problems with this method?
Variable            Test                  P-value
Region (hospital)   Chi-Square Test       0.0001
Gender              Chi-Square Test       0.073
Age                 Logistic Regression   0.0043
Weight              Logistic Regression   0.1674
Percent Body Fat    Logistic Regression   0.0623
Sodium levels       Logistic Regression   0.1049
Cholesterol         Logistic Regression   0.000495
Over-fitting from Simple Tests
Individual Tests
Variable            P-value
Region (hospital)   0.4281
Gender              0.0367
Age                 0.0043
Weight              0.1674
Percent Body Fat    0.2623
Sodium levels       0.1049
Cholesterol         0.0004
Multivariate Model
Variable                      P-value
Gender                        0.0447
Age                           0.0106
Cholesterol                   0.0032
Gender * Age                  0.1872
Gender * Cholesterol          0.3388
Age * Cholesterol             0.6763
Gender * Age * Cholesterol    0.8961
• Because the variables are significant in the individual tests,
they should be significant in the multivariate model
• Some results from individual tests may be false positives
• Because we use the same data to test the multivariate model,
the same false positives will be found in its results
Simpson’s Paradox
Individual Tests
Variable            P-value
Region (hospital)   0.4281
Gender              0.5367
Age                 0.0043
Weight              0.1674
Percent Body Fat    0.2623
Sodium levels       0.1049
Cholesterol         0.0004
Multivariate Model
Variable                      P-value
Gender                        0.0447
Age                           0.0106
Cholesterol                   0.0032
Gender * Age                  0.0229
Gender * Cholesterol          0.3388
Age * Cholesterol             0.6763
Gender * Age * Cholesterol    0.8961
• Sometimes the relationship between two variables changes
in the presence of a third variable. This is Simpson’s
paradox
• If individual tests are used to build a multivariate model, then
sometimes important variables will be omitted because their
significance was obscured by an interaction effect
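A small numeric sketch can make the paradox concrete. The counts below are illustrative only (they are not from the slides), and the names `counts`, `rate` and `overall_rate` are hypothetical. Treatment A has the higher success rate within each severity group, yet treatment B looks better once the groups are pooled.

```python
# Illustrative counts (not from the slides): (successes, total) for two
# treatments, split by a third variable (case severity).
counts = {
    "A": {"mild": (81, 87), "severe": (192, 263)},
    "B": {"mild": (234, 270), "severe": (55, 80)},
}


def rate(successes, total):
    """Success rate within one group."""
    return successes / total


def overall_rate(strata):
    """Success rate after pooling all severity groups together."""
    successes = sum(s for s, _ in strata.values())
    total = sum(t for _, t in strata.values())
    return successes / total


for stratum in ("mild", "severe"):
    a = rate(*counts["A"][stratum])
    b = rate(*counts["B"][stratum])
    print(f"{stratum}: A = {a:.3f}, B = {b:.3f}")

print(f"pooled: A = {overall_rate(counts['A']):.3f}, "
      f"B = {overall_rate(counts['B']):.3f}")
```

The reversal arises because treatment A was given mostly to severe cases, so a test that ignores severity compares unlike groups.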
Model Selection Methods
• Goal is to identify the optimal number of variables
and the best choice of variables for a multivariable
model using a data set with dozens of possible
variables
• Step-wise selection methods
 Backwards selection: start with all variables, then remove any
unneeded
 Forwards selection: start with no variables, then add the best
variables
 Mixed selection: variables can be added or removed from model
• Best subsets or all subsets methods
 Fit all possible models, then identify the best models by some criteria
Model Selection Criteria
• P-values of each potential X-variable
 Individual p-values are highly sensitive to other variables
 Individual p-values don’t really test the hypothesis of interest
• R² and adjusted R²
 Represent the percent of variation explained by the model
 Meaningless or misleading if model assumptions are not met
• Akaike’s Information Criterion (AIC)
 Computed as $\mathrm{AIC} = 2k - 2\ln(L)$
 Function of the log-likelihood $\ln(L)$ and the number of parameters $k$
• Mallows’ Cp
 Computed as $C_p = SSE_p / MSE_k - N + 2p$, where $p$ is the number of
parameters in the candidate model, $MSE_k$ is the mean squared error of
the full model and $N$ is the sample size
 Intended to address the issue of model over-fitting
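The two formulas above can be sketched in a few lines. This is a minimal illustration, assuming a least-squares model with normal errors so that the log-likelihood can be written in terms of the residual sum of squares. The function names and all numbers are hypothetical.

```python
import math


def gaussian_aic(sse, n, k):
    """AIC = 2k - 2 ln(L) for a least-squares fit with normal errors."""
    # Maximized Gaussian log-likelihood, written in terms of the
    # residual sum of squares (SSE) and the sample size n.
    log_lik = -0.5 * n * (math.log(2 * math.pi) + math.log(sse / n) + 1)
    return 2 * k - 2 * log_lik


def mallows_cp(sse_p, mse_full, n, p):
    """Cp = SSE_p / MSE_full - n + 2p for a candidate model with p parameters."""
    return sse_p / mse_full - n + 2 * p


# Hypothetical numbers: n = 50 observations, a full model with 8 parameters
# (SSE = 120.0) and a candidate model with 4 parameters (SSE = 130.0).
n = 50
mse_full = 120.0 / (n - 8)
cp_candidate = mallows_cp(130.0, mse_full, n, 4)
cp_full = mallows_cp(120.0, mse_full, n, 8)
aic_candidate = gaussian_aic(130.0, n, 4)
aic_full = gaussian_aic(120.0, n, 8)
```

By construction, Cp for the full model equals its own parameter count. A candidate whose Cp is close to or below its parameter count, and whose AIC is lower than the full model's, is the kind of model these criteria favor.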
Model Selection Methods
• Model selection methods
find the optimal variables
for a multivariate model
 Optimal number of variables
 Identity of the variables
• Model selection methods
sometimes use p-values as
selection criteria, but these
p-values should not be
used for hypothesis tests
Problems With Model Selection
• P-values do not test the real hypothesis of interest
 Model selection seeks to identify the optimal number of variables
 H0: k = 0 Ha: k > 0 where k = # variables
 Individual p-values are computed for all possible combinations of
variables, most of which are not in the final model
• Individual p-values are computed from multiple tests
 Individual p-values would need a strict adjustment for multiple
testing
 Final p-values unlikely to be statistically significant
• Data driven hypotheses
 It is unfair to peek at the data, then only test the largest differences
 More likely to generate false positives
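As a rough illustration of how strict a multiple-testing adjustment can be, the sketch below applies a Bonferroni correction to the seven individual p-values from the disease-severity example. The choice of Bonferroni and of a 0.05 family-wise error rate are assumptions made here, not something the slides prescribe. A real model selection run performs far more than seven tests, so the true adjustment would be stricter still.

```python
# P-values from the individual tests in the disease-severity example.
p_values = {
    "Region (hospital)": 0.0001,
    "Gender": 0.073,
    "Age": 0.0043,
    "Weight": 0.1674,
    "Percent Body Fat": 0.0623,
    "Sodium levels": 0.1049,
    "Cholesterol": 0.000495,
}

alpha = 0.05  # assumed family-wise error rate
threshold = alpha / len(p_values)  # Bonferroni-adjusted cutoff

# Keep only the variables whose p-value clears the adjusted cutoff.
significant = [name for name, p in p_values.items() if p < threshold]
print(f"cutoff = {threshold:.5f}; still significant: {significant}")
```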
Data Mining Analyses
• Make predictions from VERY LARGE data sets
 Microarray and next generation sequencing (NGS) data
 Large databases of clinical or medical records
 Credit, banking and financial data
• Special classification models used to accommodate
large sample sizes or large numbers of variables
 Classification and regression trees (CART)
 K-nearest neighbors (KNN) methods
 Neural nets, support vector machines (SVM), …
Training a Data Mining Model
• Researchers often want to compare several data
mining methods to find the best classifier
 CART methods versus KNN methods
 SVM versus neural nets
• Many data mining models have parameters that
must be optimized for each problem
 How many branches or splits for a CART?
 How many neighbors for KNN?
An Example from Data Mining
Training Data: misclassifies 2 data points
Test Data: misclassifies 6 data points
An Example from Data Mining
Training Data: misclassifies 0 data points
Test Data: misclassifies 5 data points
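The gap between training and test error can be reproduced with a toy K-nearest-neighbors classifier. This is only a sketch on simulated one-dimensional data; the functions `knn_predict`, `misclassified` and `simulate` are hypothetical helpers, not the classifiers shown on the slides. With k = 1 every training point is its own nearest neighbor, so the training error is zero even though the test error is not.

```python
import random


def knn_predict(train, x, k):
    """Majority vote among the k training points nearest to x (1-D feature)."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0


def misclassified(train, data, k):
    """Count points in data whose predicted label differs from the true label."""
    return sum(knn_predict(train, x, k) != label for x, label in data)


def simulate(n, rng):
    """Hypothetical two-class data: class 1 is shifted upward, with overlap."""
    points = []
    for _ in range(n):
        label = rng.randint(0, 1)
        points.append((rng.gauss(2.0 * label, 1.0), label))
    return points


rng = random.Random(7)
train, test = simulate(40, rng), simulate(40, rng)

# Odd values of k avoid tied votes between the two classes.
for k in (1, 5, 15):
    print(k, misclassified(train, train, k), misclassified(train, test, k))
```

Choosing k from the training error alone would always favor k = 1, which is why the tuning parameter should be judged on data the model has not seen.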
How Do We Avoid Problems?
Divide our data into two or three groups:
• Training data
 Build a model using individual tests or model selection
 Train a data mining model to identify optimal parameters
• Test data
 Evaluate the model built with the training data
 Perform hypothesis tests
• Confirmation data
 Evaluate the model built with the training data
 Confirm findings from Test data set
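One way to carry out such a split is sketched below. The 50/25/25 proportions, the fixed random seed and the function name `three_way_split` are assumptions for illustration. The key property is that each subject lands in exactly one group.

```python
import random


def three_way_split(records, fractions=(0.5, 0.25, 0.25), seed=0):
    """Shuffle records and split them into training, test and confirmation sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = int(round(fractions[0] * len(shuffled)))
    n_test = int(round(fractions[1] * len(shuffled)))
    training = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    confirmation = shuffled[n_train + n_test:]  # whatever remains
    return training, test, confirmation


# Hypothetical subject IDs; each subject lands in exactly one set.
subjects = list(range(100))
training, test, confirmation = three_way_split(subjects)
print(len(training), len(test), len(confirmation))
```

Splitting by subject rather than by individual measurement keeps repeated measurements from the same subject out of both the training and the test sets.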
Cross-validation Methods
• Divide data into slices,
then train and test
models
 Train model with slice #1,
test with slices 2, …, 8
 Train model with slice #2,
test with slices 1, 3, …, 8
 …
 Train model with slice #8,
test with slices 1, …, 7
• Compile results to
evaluate the fit of all 8
models
[Figure: data divided into 8 numbered slices, labeled Train and Test]
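A minimal cross-validation sketch follows, using simulated data and a simple linear regression as the model. Note one assumption: the code follows the more common k-fold convention, which trains on all slices except one and tests on the held-out slice; the slide describes the reverse assignment. The function names and the simulated data are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Least-squares intercept and slope for a simple linear regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def k_fold_mse(xs, ys, k=8, seed=0):
    """Return the held-out mean squared error for each of k slices."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    slices = [indices[i::k] for i in range(k)]  # k nearly equal slices
    errors = []
    for held_out in slices:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        intercept, slope = fit_line([xs[i] for i in train],
                                    [ys[i] for i in train])
        sq_err = [(ys[i] - (intercept + slope * xs[i])) ** 2 for i in held_out]
        errors.append(sum(sq_err) / len(sq_err))
    return errors


# Simulated data (hypothetical): y = 2 + 3x plus normal noise.
rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(80)]
ys = [2 + 3 * x + rng.gauss(0, 1) for x in xs]
fold_errors = k_fold_mse(xs, ys, k=8)
cv_estimate = sum(fold_errors) / len(fold_errors)
print(f"cross-validated MSE: {cv_estimate:.3f}")
```

Averaging the eight held-out errors gives one estimate of how the model would perform on new data.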
Biological or Technical Replicates?
• How do I analyze data if I pool samples?
• How do I analyze data if I use replicate samples?
• What if I take multiple measurements from the
same patient or subject?
• What if I run experiments on a cell line?
Experimental Units vs. Sampling Units
• A treatment is a unique combination of all the factors and
covariates studied in your experiment
• The experimental unit (EU) is the smallest entity that
can receive or accept one treatment combination
• The sampling unit (SU) is the smallest entity that will be
measured or observed in the experiment
• Experimental and sampling units are not always the same
Example: EU and SU are the Same
• Suppose 20 patients have the common cold
 10 patients are randomly chosen to take a new drug
 10 patients are randomly chosen for the placebo
 Duration of their symptoms (hours) is the response variable
• EU and SU are the same in this experiment
 Drug and placebo treatments are applied to each patient
 Each patient is sampled to record their duration of symptoms
 Therefore EU = patient and SU = patient
Example: EU and SU are Different
• 20 flowers are planted in individual pots
 10 flowers are randomly chosen to receive dry fertilizer pellets
 10 flowers are randomly chosen to receive liquid fertilizer
 All six petals are harvested from each flower and petal length
is measured as the response variable
• EU and SU are different in this experiment
 Fertilizer treatment is applied to the individual plant or pot
 Measurements are taken from individual flower petals
 Therefore EU = plant and SU = petal (pseudo-replication)
Pseudo-Replication
• Confusion between EU’s and SU’s can artificially inflate
sample sizes and artificially decrease p-values
 E.g. It is tempting to treat each flower petal as a unique sample
(n = 6 x 20 = 120), but the petals are pseudo-replicates
 “Pseudoreplication and the Design of Ecological Field
Experiments” (Hurlbert 1984, Ecological Monographs)
• Pooling samples can create pseudo-replication problems
 E.g. 12 fruit flies are available for a microarray experiment, but
must pool flies into 4 groups of 3 flies each to get enough RNA
 Once data are pooled, it is not appropriate to analyze each
individual separately in the statistical model
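The effect of pseudo-replication on a test statistic can be sketched with simulated data for the flower example. Everything here is hypothetical: the petal lengths, the variance components and the helper names. No true fertilizer effect is simulated. The petal-level analysis claims 118 degrees of freedom and a larger t statistic than the plant-level analysis, which correctly uses 18.

```python
import math
import random
import statistics


def pooled_t(a, b):
    """Two-sample t statistic (equal variances) and its degrees of freedom."""
    df = len(a) + len(b) - 2
    pooled_var = ((len(a) - 1) * statistics.variance(a)
                  + (len(b) - 1) * statistics.variance(b)) / df
    se = math.sqrt(pooled_var * (1 / len(a) + 1 / len(b)))
    return (statistics.mean(a) - statistics.mean(b)) / se, df


rng = random.Random(42)


def simulate_group(n_plants=10, n_petals=6):
    """Petal lengths per plant: plant-to-plant variation plus petal noise."""
    plants = []
    for _ in range(n_plants):
        plant_mean = rng.gauss(20.0, 2.0)  # hypothetical mean length (mm)
        plants.append([rng.gauss(plant_mean, 0.5) for _ in range(n_petals)])
    return plants


dry = simulate_group()     # no true fertilizer effect is simulated
liquid = simulate_group()

# Wrong: every petal treated as an independent sample (n = 120).
t_petal, df_petal = pooled_t([p for plant in dry for p in plant],
                             [p for plant in liquid for p in plant])

# Better: one mean per plant, the experimental unit (n = 20).
t_plant, df_plant = pooled_t([statistics.mean(plant) for plant in dry],
                             [statistics.mean(plant) for plant in liquid])

print(f"petal level: t = {t_petal:.2f}, df = {df_petal}")
print(f"plant level: t = {t_plant:.2f}, df = {df_plant}")
```

Averaging within each plant is the simplest fix; a mixed model with a random plant effect is another option when the petal-level variation is itself of interest.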
Biological vs. Technical Replication
• Sometimes, experiments use multiple EU’s to
investigate multiple sources of error with a statistical
model
 E.g. When measurements are inaccurate, you want to estimate variation
between subjects and multiple measurements
 E.g. To evaluate the precision of 2 lie detector machines, you could test 6
subjects measured by 4 technicians each in repeated measurements
 Subject and machine effects have EU = subject (biological replicates),
but the technician effect has EU = measurement (technical replicates)
• These kinds of experiments must be analyzed with
appropriate statistical methods
 Split-plot methods evaluate multiple EU’s in one model
No Biological Replication?
• Sometimes experiments have no biological replicates
 Experiments with cell lines (e.g. cancer cell lines)
 Experiments with purified proteins, DNA, macromolecules
 Experiments with bacteria, viruses or pathogens???
• Be very careful when you interpret results
 Technical replicates represent the precision of your methods
 Significant results apply to your specific sample
 Results may not extend to larger populations
An Illustrative Example
4 batches of vaccine dumped into one “pool” → single sample from “pool” → tested in ten egg assays
• Does the experiment have any replication?
 Biological replication? No. Four batches dumped into one pool.
 Technical replication? Yes. Ten assays used to detect contamination.
• What can we make inferences about?
 Population of all vaccine batches? No. No biological replication.
 Contamination of the single sample? Yes. Ten technical replicates used.
 Contamination of this specific pool? Maybe.
 Contamination of these specific batches? Maybe.
What Is An Outlier?
• An outlier is an observation (i.e. sampling unit) that
does not belong to the population of interest
 Outliers can and should be legitimately removed from the analysis
 Identifying outliers is a biological question, not a statistical question
• A high influence point is an observation that has a
large impact on the fit of your statistical model
 High influence points might be outliers or legitimate data
 Several methods to identify and handle high influence points
Examples of Outliers
• Errors, glitches, typos and “non-data”
 Bubbles or bright spots on a microarray
 Typos from medical chart (e.g. age = 334)
• Legitimate samples, but out of scope
 Patients with comorbidities or other conditions (e.g. diabetes
patient in an AIDS study)
Examples of High Influence
• High Leverage points
 Observations with extreme combinations of predictor and
response variables (i.e. outskirts of the design space)
 Identified using leverage plots
• Large Residuals
 Represent large difference between predicted values from the
model and the observed value from the sample
 Large residual = poor model fit for that value
• Large influence on model fit
 Remove the value and the model changes dramatically
High Leverage Points
• We expect no relationship
between hat size and IQ
• A single observation can
change the slope of the line
 Hat size = 38, IQ = 190
• Extreme combinations of X
and Y variables produce
high influence over the
analysis
Leverage: $h_{ii} = X_i' (X'X)^{-1} X_i$
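For a simple linear regression with an intercept, the leverage formula reduces to $h_{ii} = 1/n + (x_i - \bar{x})^2 / \sum_j (x_j - \bar{x})^2$, which can be computed without matrix algebra. The sketch below uses hypothetical hat sizes, with the last value mimicking the extreme point on the slide. The cutoff of 2p/n is a common textbook rule of thumb, assumed here rather than taken from the slides.

```python
def leverages(xs):
    """Leverage h_ii for a simple linear regression with an intercept."""
    n = len(xs)
    x_bar = sum(xs) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return [1 / n + (x - x_bar) ** 2 / sxx for x in xs]


# Hypothetical hat sizes; the last value mimics the extreme point (38).
hat_sizes = [21.5, 22.0, 22.5, 22.5, 23.0, 23.0, 23.5, 24.0, 24.5, 38.0]
h = leverages(hat_sizes)

# A common rule of thumb flags points with h_ii > 2p/n (p = 2 parameters here).
cutoff = 2 * 2 / len(hat_sizes)
flagged = [x for x, h_i in zip(hat_sizes, h) if h_i > cutoff]
print(f"cutoff = {cutoff:.2f}; flagged hat sizes: {flagged}")
```

Leverage depends only on the predictor values; whether a high leverage point actually distorts the fit also depends on its response value.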
Leverage Plots
• Red “confidence curves” identify significant leverage
 Curves that completely overlap the blue line are not significant
 Curves that largely do not overlap the blue line have significant
leverage
• If leverage is problematic, respond carefully
 Identify and remove any outliers, if they exist
 Consider alternative models, variable transformations, weighting, etc.
Residuals
[Figure: simple linear regression plot]
• Residuals = Observed – Predicted
 Also called “errors”
 ei = Yi - Ŷi
• Represent the unexplained variation
 Should be independent, identically distributed and random
 Overall trend in residuals represents model fit
 Large individual residuals may represent high influence
 Several different computations for residuals exist
Residuals Plot
• Residuals vs. X variable
 Evaluate model fit relative to one predictor variable
 Suspect one variable fits poorly in multivariable model
• Residuals vs. Predicted values
 Evaluate model fit with respect to the entire model
 Good if you want a single plot for multivariable model
• Residuals vs. omitted X variable
 Interesting trends if important variable was omitted
Good Model Fit
• Expect a rectangular or oval
shaped cloud of residuals
• Residuals vs. X variable used
to evaluate independence
 E.g. Do we need to model a
curved relationship with Age?
• Residuals vs. predicted used
to evaluate assumption of
identically distributed errors
 E.g. Non-constant variance
 E.g. Larger errors with higher
response values
Errors are NOT Independent
Non-Constant Variance
[Figure: residuals plotted against “Eliza Units”]
Errors Are NOT Normal
Alternative Residual Computations
• Studentized residuals
 Divide each residual by the estimate of the standard deviation
 Easier to identify high influence points (e.g. > 3 s.d. away from mean)
• Deleted residuals
 Compute residual after deleting one observation
 Evaluate the effect of one observation on model fit
• Deviance or Pearson residuals
 Computed for categorical response models (e.g. logistic regression)
 Often do not follow typical trends of residuals from linear models
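The first two residual types can be sketched for a simple linear regression. The formulas used are the standard textbook ones: the studentized residual is $e_i / \sqrt{MSE\,(1 - h_{ii})}$ and the deleted residual is $e_i / (1 - h_{ii})$, which equals the prediction error obtained by refitting the model without observation i. The data and function names below are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Least-squares intercept and slope for one predictor."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    return y_bar - slope * x_bar, slope


def residual_table(xs, ys):
    """Raw, studentized and deleted residuals for each observation."""
    n, p = len(xs), 2  # p = 2 parameters: intercept and slope
    intercept, slope = fit_line(xs, ys)
    x_bar = sum(xs) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    raw = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    mse = sum(e ** 2 for e in raw) / (n - p)
    table = []
    for x, e in zip(xs, raw):
        h = 1 / n + (x - x_bar) ** 2 / sxx  # leverage of this point
        table.append({
            "raw": e,
            "studentized": e / math.sqrt(mse * (1 - h)),
            "deleted": e / (1 - h),  # equals the leave-one-out prediction error
        })
    return table


# Hypothetical data with one unusual response at x = 6.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 19.5, 13.8, 16.1, 18.0, 20.2]
table = residual_table(xs, ys)
for x, row in zip(xs, table):
    print(x, round(row["studentized"], 2), round(row["deleted"], 2))
```

The shortcut for the deleted residual avoids refitting the model once per observation, which matters for larger data sets.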
Studentized Residuals
Deleted Residuals
Other Indicators of High Influence
• DFFITS
 Influence of single point on single fitted value
 Look for |DFFITS| > 1 for small n or |DFFITS| > 2*sqrt(p/n) for large n
• DFBETAS
 Influence of single point on regression coefficients
 Look for |DFBETAS| > 1 for small n or |DFBETAS| > 2 / sqrt(n) for large n
• Cook’s Distance
 Influence of single point on all fitted values
 Compare against F(p, n – p) distribution
 See Kutner, Nachtsheim, Neter and Li. 2005. Applied Linear Statistical
Models for more details
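DFFITS and Cook's distance can both be computed from the ordinary residuals and leverages, without refitting the model. The sketch below does this for a simple linear regression, using the standard formulas $DFFITS_i = t_i \sqrt{h_{ii} / (1 - h_{ii})}$, where $t_i$ is the studentized deleted residual, and $D_i = e_i^2 h_{ii} / (p \cdot MSE \cdot (1 - h_{ii})^2)$. The hat size and IQ values are invented to echo the earlier example; only the extreme point (38, 190) comes from the slides.

```python
import math


def influence_measures(xs, ys):
    """DFFITS and Cook's distance for a simple linear regression (p = 2)."""
    n, p = len(xs), 2
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    intercept = y_bar - slope * x_bar
    resid = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    sse = sum(e ** 2 for e in resid)
    mse = sse / (n - p)
    results = []
    for x, e in zip(xs, resid):
        h = 1 / n + (x - x_bar) ** 2 / sxx  # leverage
        # Studentized deleted residual, computed without refitting the model.
        t = e * math.sqrt((n - p - 1) / (sse * (1 - h) - e ** 2))
        dffits = t * math.sqrt(h / (1 - h))
        cooks_d = e ** 2 * h / (p * mse * (1 - h) ** 2)
        results.append((dffits, cooks_d))
    return results


# Hypothetical data echoing the hat size example: the last point is extreme.
hat = [21.5, 22.0, 22.5, 22.5, 23.0, 23.0, 23.5, 24.0, 24.5, 38.0]
iq = [98, 105, 110, 95, 102, 99, 108, 101, 97, 190]
measures = influence_measures(hat, iq)

# Small-sample rule from the slide: flag |DFFITS| > 1.
flagged = [i for i, (dffits, _) in enumerate(measures) if abs(dffits) > 1]
print("flagged observations:", flagged)
```

Here the extreme point has a small ordinary residual because it pulls the fitted line toward itself, which is why influence measures are more informative than raw residuals for such points.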
Solutions
• Remove high influence points if they may be outliers
• Fit a completely new model to the data
• Transform variables
 Transform X to change relationship between X and Y
 Transform Y to change distribution of model errors
• Use a weighting scheme to reduce their influence
 Use $w_i = 1 / \mathrm{sd}_i$ for non-constant variance
 Use $w_i = 1 / Y_i^2$ or $w_i = 1 / X_i^2$ to weight regions of plot
Log-transform X
• Relationship between X and Y changes
• May reduce impact of some high influence points
Log-transform Y
Weighting Schemes
• Use $w_i = 1 / \mathrm{sd}_i$ for non-constant variance
• Use $w_i = 1 / Y_i^2$ or $w_i = 1 / X_i^2$ to weight regions of plot
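A weighted fit can be sketched for the one-predictor case using the closed-form weighted least-squares estimates. The example uses the $w_i = 1 / X_i^2$ scheme on hypothetical data whose scatter grows with X. The function name `weighted_line` and the data are assumptions for illustration.

```python
def weighted_line(xs, ys, weights):
    """Weighted least-squares intercept and slope for one predictor."""
    w_sum = sum(weights)
    x_bar = sum(w * x for w, x in zip(weights, xs)) / w_sum  # weighted means
    y_bar = sum(w * y for w, y in zip(weights, ys)) / w_sum
    sxx = sum(w * (x - x_bar) ** 2 for w, x in zip(weights, xs))
    sxy = sum(w * (x - x_bar) * (y - y_bar)
              for w, x, y in zip(weights, xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Hypothetical data where the scatter grows with X.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.3, 7.6, 10.9, 11.2, 15.4, 14.1]

unweighted = weighted_line(xs, ys, [1.0] * len(xs))
weighted = weighted_line(xs, ys, [1 / x ** 2 for x in xs])  # w_i = 1 / X_i^2
print("unweighted (intercept, slope):", unweighted)
print("weighted   (intercept, slope):", weighted)
```

Down-weighting the noisy high-X region lets the more precise low-X observations dominate the fit. With equal weights the function reduces to ordinary least squares.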
Thank You
For questions or comments please contact:
ScienceApps@niaid.nih.gov
301.496.4455
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
 

Overview of statistical tests: Data handling and data quality (Part II)

  • 1. Overview of Statistical Tests II: Data Handling and Data Quality
      Presented by: Jeff Skinner, M.S., Biostatistics Specialist
      Bioinformatics and Computational Biosciences Branch
      National Institute of Allergy and Infectious Diseases
      Office of Cyber Infrastructure and Computational Biology
  • 2. How Should I Handle My Data?
      Three common problems:
      • Building and testing a model with the same data
        ▪ Stepwise model building procedures and similar methods
        ▪ Not using cross-validation or similar methods
      • Confusion between biological and technical replicates
        ▪ Pseudo-replication
      • Identification and handling of outliers
        ▪ Outliers vs. high influence points
        ▪ Outlier removal vs. robust statistical methods
  • 3. Building and Testing a Model with the Same Data
      • When do we encounter the problem?
        ▪ Using simple tests to inform complicated tests
        ▪ Using model selection techniques
      • What are the negative effects?
        ▪ Choosing poor models or “overfitting”
      • How do we avoid these problems?
        ▪ Using designed experiments
        ▪ Training, Testing and Confirmation data sets
        ▪ Cross-validation techniques
  • 4. Simple Tests Inform Complex Tests
      • Suppose you want to model the factors influencing the severity of some disease
      • It seems sensible to test all the variables individually, then test a larger model of only the significant effects
      • What are the potential problems with this method?
      Variable            Test                  P-value
      Region (hospital)   Chi-Square Test       0.0001
      Gender              Chi-Square Test       0.073
      Age                 Logistic Regression   0.0043
      Weight              Logistic Regression   0.1674
      Percent Body Fat    Logistic Regression   0.0623
      Sodium levels       Logistic Regression   0.1049
      Cholesterol         Logistic Regression   0.000495
  • 5. Over-fitting from Simple Tests
      Individual Tests
      Variable            P-value
      Region (hospital)   0.4281
      Gender              0.0367
      Age                 0.0043
      Weight              0.1674
      Percent Body Fat    0.2623
      Sodium levels       0.1049
      Cholesterol         0.0004
      Multivariate Model
      Variable                     P-value
      Gender                       0.0447
      Age                          0.0106
      Cholesterol                  0.0032
      Gender * Age                 0.1872
      Gender * Cholesterol         0.3388
      Age * Cholesterol            0.6763
      Gender * Age * Cholesterol   0.8961
      • Because the variables are significant in the individual tests, they are expected to be significant in the multivariate model as well
      • Some results from individual tests may be false positives
      • Because we use the same data to test the multivariate model, the same false positives will be found in its results
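The false-positive risk from screening several variables one at a time can be put in numbers. The sketch below assumes the individual tests are independent and that every null hypothesis is true, which is a simplification rather than a description of the slide's data; the function name is illustrative only.

```python
def prob_any_false_positive(n_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across independent tests of true nulls."""
    return 1.0 - (1.0 - alpha) ** n_tests

# Seven variables are screened individually in the slide's table.
for m in (1, 7, 20):
    print(m, round(prob_any_false_positive(m), 3))
```

With seven screening tests at alpha = 0.05, the chance of at least one spurious "significant" variable is roughly 30%, and refitting on the same data cannot expose it.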
  • 6. Simpson’s Paradox
      Individual Tests
      Variable            P-value
      Region (hospital)   0.4281
      Gender              0.5367
      Age                 0.0043
      Weight              0.1674
      Percent Body Fat    0.2623
      Sodium levels       0.1049
      Cholesterol         0.0004
      Multivariate Model
      Variable                     P-value
      Gender                       0.0447
      Age                          0.0106
      Cholesterol                  0.0032
      Gender * Age                 0.0229
      Gender * Cholesterol         0.3388
      Age * Cholesterol            0.6763
      Gender * Age * Cholesterol   0.8961
      • Sometimes the relationship between two variables changes in the presence of a third variable. This is Simpson’s paradox
      • If individual tests are used to build a multivariate model, then sometimes important variables will be omitted because their significance was obscured by an interaction effect
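A tiny numeric illustration of how a third variable can change an apparent relationship. The data below are invented for the example (they are not from the slides): within each group the slope of y on x is positive, yet the pooled slope is negative.

```python
def slope(xs, ys):
    """Least-squares slope of y on x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

# Hypothetical data: two groups that differ in both x and y.
group_a = ([1, 2, 3], [10, 11, 12])
group_b = ([6, 7, 8], [2, 3, 4])
pooled = (group_a[0] + group_b[0], group_a[1] + group_b[1])

print("slope in group A:", slope(*group_a))
print("slope in group B:", slope(*group_b))
print("pooled slope:    ", round(slope(*pooled), 3))
```

Testing x on its own would suggest a negative effect, while a model that also includes the grouping variable recovers the positive within-group trend.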
  • 7. Model Selection Methods
      • Goal is to identify the optimal number of variables and the best choice of variables for a multivariable model using a data set with dozens of possible variables
      • Step-wise selection methods
        ▪ Backwards selection: start with all variables, then remove any unneeded
        ▪ Forwards selection: start with no variables, then add the best variables
        ▪ Mixed selection: variables can be added or removed from model
      • Best subsets or all subsets methods
        ▪ Fit all possible models, then identify the best models by some criteria
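To give a sense of scale for "fit all possible models", the sketch below only enumerates the candidate subsets; it does not fit anything. The variable names are taken from the earlier example table, and the helper name is an assumption made for illustration.

```python
from itertools import combinations

def all_subsets(variables):
    """Yield every non-empty subset of the candidate variables."""
    for size in range(1, len(variables) + 1):
        yield from combinations(variables, size)

candidates = ["Region", "Gender", "Age", "Weight",
              "Percent Body Fat", "Sodium", "Cholesterol"]
models = list(all_subsets(candidates))
print(len(models))  # 2**7 - 1 candidate main-effect models
```

The count doubles with each added variable, which is one reason all-subsets searches generate so many implicit comparisons.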
  • 8. Model Selection Criteria
      • P-values of each potential X-variable
        ▪ Individual p-values are highly sensitive to other variables
        ▪ Individual p-values do not really test the hypothesis of interest
      • R² and adjusted R²
        ▪ Represent the percent of variation explained by the model
        ▪ Meaningless or misleading if model assumptions are not met
      • Akaike’s Information Criterion (AIC)
        ▪ Computed as AIC = 2k - 2 ln(L)
        ▪ Function of the log-likelihood and number of parameters
      • Mallows’ Cp
        ▪ Computed as Cp = SSE_p / MSE_k - N + 2p
        ▪ Intended to address the issue of model over-fitting
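The two formulas on this slide translate directly into code. This is a minimal sketch: the function and argument names are illustrative, `log_likelihood` is ln(L) for the fitted model, and `mse_full` is the mean squared error of the largest model under consideration.

```python
def aic(k: int, log_likelihood: float) -> float:
    """AIC = 2k - 2 ln(L); smaller values indicate a better trade-off."""
    return 2 * k - 2 * log_likelihood

def mallows_cp(sse_p: float, mse_full: float, n: int, p: int) -> float:
    """Cp = SSE_p / MSE_full - N + 2p for a candidate model with p parameters."""
    return sse_p / mse_full - n + 2 * p

# Hypothetical numbers, chosen only to exercise the formulas.
print(aic(k=3, log_likelihood=-10.0))
print(mallows_cp(sse_p=120.0, mse_full=4.0, n=30, p=3))
```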
  • 9. Model Selection Methods
      • Model selection methods find the optimal variables for a multivariate model
        ▪ Optimal number of variables
        ▪ Identity of the variables
      • Model selection methods sometimes use p-values as selection criteria, but these p-values should not be used for hypothesis tests
  • 10. Problems With Model Selection
      • P-values do not test the real hypothesis of interest
        ▪ Model selection seeks to identify the optimal number of variables
        ▪ H0: k = 0 vs. Ha: k > 0, where k = number of variables
        ▪ Individual p-values are computed for all possible combinations of variables, most of which are not in the final model
      • Individual p-values are computed from multiple tests
        ▪ Individual p-values would need a strict adjustment for multiple testing
        ▪ Final p-values unlikely to be statistically significant
      • Data driven hypotheses
        ▪ It is unfair to peek at the data, then only test the largest differences
        ▪ More likely to generate false positives
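One example of a strict adjustment for multiple testing is the Bonferroni correction, sketched below. The slides do not name a specific method, so Bonferroni is an assumption here; the p-values are the individual test results from the "Simple Tests Inform Complex Tests" table.

```python
def bonferroni(p_values: dict) -> dict:
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return {name: min(1.0, p * m) for name, p in p_values.items()}

individual = {
    "Region (hospital)": 0.0001, "Gender": 0.073, "Age": 0.0043,
    "Weight": 0.1674, "Percent Body Fat": 0.0623,
    "Sodium levels": 0.1049, "Cholesterol": 0.000495,
}
adjusted = bonferroni(individual)
significant = {name for name, p in adjusted.items() if p < 0.05}
print(sorted(significant))
```

A full model-selection search performs far more than seven implicit tests, so the adjustment needed there would be stricter still.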
  • 11. Data Mining Analyses
      • Make predictions from VERY LARGE data sets
        ▪ Microarray and next generation sequencing (NGS) data
        ▪ Large databases of clinical or medical records
        ▪ Credit, banking and financial data
      • Special classification models used to accommodate large sample sizes or large numbers of variables
        ▪ Classification and regression trees (CART)
        ▪ K-nearest neighbors (KNN) methods
        ▪ Neural nets, support vector machines (SVM), …
  • 12. Training a Data Mining Model
      • Researchers often want to compare several data mining methods to find the best classifier
        ▪ CART methods versus KNN methods
        ▪ SVM versus neural nets
      • Many data mining models have parameters that must be optimized for each problem
        ▪ How many branches or splits for a CART?
        ▪ How many neighbors for KNN?
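The "how many neighbors" question can be illustrated with a toy one-dimensional KNN classifier. Everything below (data, labels, helper names) is hypothetical and written from scratch for the example; it is a sketch of the idea, not a production classifier.

```python
from collections import Counter

def knn_predict(train, x, k):
    """Classify x by majority vote among the k nearest training points (1-D)."""
    nearest = sorted(train, key=lambda pt: abs(pt[0] - x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def error_rate(train, data, k):
    """Fraction of points in `data` misclassified by a model built on `train`."""
    wrong = sum(knn_predict(train, x, k) != label for x, label in data)
    return wrong / len(data)

# Hypothetical training set with one noisy "B" among the "A" points.
train = [(1.0, "A"), (1.5, "A"), (2.0, "A"), (2.6, "B"),
         (3.0, "A"), (4.0, "B"), (4.5, "B"), (5.0, "B")]
test = [(1.2, "A"), (2.7, "A"), (3.4, "B"), (4.8, "B")]

for k in (1, 3, 5):
    print(k, error_rate(train, train, k), error_rate(train, test, k))
```

With k = 1 the classifier is perfect on its own training data but misclassifies half of the held-out points, which is why the parameter should be chosen using data that were not used for training.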
  • 13. An Example from Data Mining
      [Figure: Training Data (misclassifies 2 data points); Test Data (misclassifies 6 data points)]
  • 14. An Example from Data Mining
      [Figure: Training Data (misclassifies 0 data points); Test Data (misclassifies 5 data points)]
  • 15. How Do We Avoid Problems?
      Divide our data into two or three groups:
      • Training data
        ▪ Build a model using individual tests or model selection
        ▪ Train a data mining model to identify optimal parameters
      • Test data
        ▪ Evaluate the model built with the training data
        ▪ Perform hypothesis tests
      • Confirmation data
        ▪ Evaluate the model built with the training data
        ▪ Confirm findings from Test data set
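A minimal sketch of a random three-way split. The 50/25/25 proportions, the seed, and the function name are assumptions made for illustration; the slides do not prescribe any particular fractions.

```python
import random

def three_way_split(records, fractions=(0.5, 0.25, 0.25), seed=0):
    """Shuffle the records, then cut them into training, test and confirmation sets."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * fractions[0])
    n_test = int(len(shuffled) * fractions[1])
    training = shuffled[:n_train]
    testing = shuffled[n_train:n_train + n_test]
    confirmation = shuffled[n_train + n_test:]  # whatever remains
    return training, testing, confirmation

training, testing, confirmation = three_way_split(range(100))
print(len(training), len(testing), len(confirmation))
```

The split should be made once, before any modeling, so that the test and confirmation sets never influence which model is built.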
  • 16. Cross-validation Methods
      • Divide data into slices, then train and test models
        ▪ Train model with slice #1, test with slices 2, …, 8
        ▪ Train model with slice #2, test with slices 1, 3, …, 8
        ▪ …
        ▪ Train model with slice #8, test with slices 1, …, 7
      • Compile results to evaluate the fit of all 8 models
      [Figure: data divided into 8 numbered slices, labeled Train and Test]
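The slicing can be sketched as an index generator. Note that this sketch implements the widely used k-fold variant, which trains on k - 1 slices and tests on the single held-out slice; that is the reverse of the train/test roles as worded on the slide, and swapping the two returned lists gives the slide's version. In practice the rows would usually be shuffled before slicing.

```python
def k_fold_indices(n: int, k: int = 8):
    """Yield (train, test) index lists; each slice is held out exactly once."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(k_fold_indices(40, k=8))
for train_idx, test_idx in splits[:2]:
    print(len(train_idx), test_idx)
```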
  • 17. Biological or Technical Replicates?
      • How do I analyze data if I pool samples?
      • How do I analyze data if I use replicate samples?
      • What if I take multiple measurements from the same patient or subject?
      • What if I run experiments on a cell line?
  • 18. Experimental Units vs. Sampling Units
      • A treatment is a unique combination of all the factors and covariates studied in your experiment
      • The experimental unit (EU) is the smallest entity that can receive or accept one treatment combination
      • The sampling unit (SU) is the smallest entity that will be measured or observed in the experiment
      • Experimental and sampling units are not always the same
  • 19. Example: EU and SU are the Same
      • Suppose 20 patients have the common cold
        ▪ 10 patients are randomly chosen to take a new drug
        ▪ 10 patients are randomly chosen for the placebo
        ▪ Duration of their symptoms (hours) is the response variable
      • EU and SU are the same in this experiment
        ▪ Drug and placebo treatments are applied to each patient
        ▪ Each patient is sampled to record their duration of symptoms
        ▪ Therefore EU = patient and SU = patient
  • 20. Example: EU and SU are Different
      • 20 flowers are planted in individual pots
        ▪ 10 flowers are randomly chosen to receive dry fertilizer pellets
        ▪ 10 flowers are randomly chosen to receive liquid fertilizer
        ▪ All six petals are harvested from each flower and petal length is measured as the response variable
      • EU and SU are different in this experiment
        ▪ Fertilizer treatment is applied to the individual plant or pot
        ▪ Measurements are taken from individual flower petals
        ▪ Therefore EU = plant and SU = petal (pseudo-replication)
  • 21. Pseudo-Replication
      • Confusion between EUs and SUs can artificially inflate sample sizes and artificially decrease p-values
        ▪ E.g. It is tempting to treat each flower petal as a unique sample (n = 6 x 20 = 120), but the petals are pseudo-replicates
        ▪ “Pseudoreplication and the Design of Ecological Field Experiments” (Hurlbert 1984, Ecological Monographs)
      • Pooling samples can create pseudo-replication problems
        ▪ E.g. 12 fruit flies are available for a microarray experiment, but we must pool flies into 4 groups of 3 flies each to get enough RNA
        ▪ Once data are pooled, it is not appropriate to analyze each individual separately in the statistical model
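One simple way to avoid treating petals as independent samples is to collapse the measurements to one value per experimental unit before testing. The sketch below shows that aggregation step only; the petal lengths and plant labels are invented, and averaging is just one of several valid approaches (a mixed or split-plot model is another).

```python
from statistics import mean

def per_unit_means(measurements: dict) -> dict:
    """Collapse sampling-unit measurements to one mean per experimental unit."""
    return {unit: mean(values) for unit, values in measurements.items()}

# Hypothetical petal lengths (cm): two of the 20 plants, six petals each.
petals = {
    "plant_01": [4.0, 4.2, 3.8, 4.1, 3.9, 4.0],
    "plant_02": [5.1, 4.9, 5.0, 5.2, 4.8, 5.0],
}
plant_means = per_unit_means(petals)
print(len(plant_means), plant_means)
```

After aggregation the analysis has one observation per plant, so the full experiment would contribute n = 20 rather than n = 120.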
  • 22. Biological vs. Technical Replication
      • Sometimes, experiments use multiple EUs to investigate multiple sources of error with a statistical model
        ▪ E.g. When measurements are inaccurate, you want to estimate variation between subjects and multiple measurements
        ▪ E.g. To evaluate the precision of 2 lie detector machines, you could test 6 subjects measured by 4 technicians each in repeated measurements
        ▪ Subject and machine effects have EU = subject (biological replicates), but the technician effect has EU = measurement (technical replicates)
      • These kinds of experiments must be analyzed with appropriate statistical methods
        ▪ Split-plot methods evaluate multiple EUs in one model
  • 23. No Biological Replication?
      • Sometimes experiments have no biological replicates
        ▪ Experiments with cell lines (e.g. cancer cell lines)
        ▪ Experiments with purified proteins, DNA, macromolecules
        ▪ Experiments with bacteria, viruses or pathogens???
      • Be very careful when you interpret results
        ▪ Technical replicates represent the precision of your methods
        ▪ Significant results apply to your specific sample
        ▪ Results may not extend to larger populations
  • 24. An Illustrative Example
      [Figure: 4 batches of vaccine dumped into one “pool”; a single sample from the “pool” tested in ten egg assays]
      • Does the experiment have any replication?
        ▪ Biological replication? No. Four batches dumped into one pool.
        ▪ Technical replication? Yes. Ten assays used to detect contamination.
      • What can we make inferences about?
        ▪ Population of all vaccine batches? No. No biological replication.
        ▪ Contamination of the single sample? Yes. Ten technical replicates used.
        ▪ Contamination of this specific pool? Maybe.
        ▪ Contamination of these specific batches? Maybe.
  • 25. What Is An Outlier?
      • An outlier is an observation (i.e. sampling unit) that does not belong to the population of interest
        ▪ Outliers can and should be legitimately removed from the analysis
        ▪ Identifying outliers is a biological question, not a statistical question
      • A high influence point is an observation that has a large impact on the fit of your statistical model
        ▪ High influence points might be outliers or legitimate data
        ▪ Several methods to identify and handle high influence points
  • 26. Examples of Outliers
      • Errors, glitches, typos and “non-data”
        ▪ Bubbles or bright spots on a microarray
        ▪ Typos from medical chart (e.g. age = 334)
      • Legitimate samples, but out of scope
        ▪ Patients with comorbidities or other conditions (e.g. diabetes patient in an AIDS study)
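Data-entry typos such as age = 334 can be caught with a plausibility check before any modeling. The sketch below only flags records for review; the 0 to 120 range, the record layout, and the function name are assumptions chosen for illustration, and deciding whether a flagged value is truly an outlier remains a subject-matter judgment.

```python
def flag_implausible(records, field, low, high):
    """Return the records whose value for `field` falls outside [low, high]."""
    return [r for r in records if not (low <= r[field] <= high)]

# Hypothetical chart extract.
patients = [
    {"id": 1, "age": 34},
    {"id": 2, "age": 334},  # likely a typo
    {"id": 3, "age": 61},
]
flagged = flag_implausible(patients, "age", low=0, high=120)
print([r["id"] for r in flagged])
```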
  • 27. Examples of High Influence
      • High leverage points
        ▪ Observations with extreme combinations of predictor and response variables (i.e. outskirts of the design space)
        ▪ Identified using leverage plots
      • Large residuals
        ▪ Represent large difference between predicted values from the model and the observed value from the sample
        ▪ Large residual = poor model fit for that value
      • Large influence on model fit
        ▪ Remove the value and the model changes dramatically
  • 28. High Leverage Points
      Leverage: h_ii = X_i' (X'X)^(-1) X_i
      • We expect no relationship between hat size and IQ
      • A single observation can change the slope of the line
        ▪ Hat size = 38, IQ = 190
      • Extreme combinations of X and Y variables produce high influence over the analysis
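For a straight-line model with an intercept, the hat-matrix formula above reduces to h_ii = 1/n + (x_i - mean(x))^2 / Sxx, which needs no matrix library. In the sketch below only the value 38 comes from the slide; the other hat sizes are invented, and the 2p/n cutoff is a common rule of thumb rather than something stated in the slides.

```python
def leverages(xs):
    """Leverage h_ii for simple linear regression with an intercept."""
    n = len(xs)
    mx = sum(xs) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    return [1.0 / n + (x - mx) ** 2 / sxx for x in xs]

hat_sizes = [21, 22, 22, 23, 23, 24, 38]  # hypothetical, except the 38
h = leverages(hat_sizes)
cutoff = 2 * 2 / len(hat_sizes)           # rule of thumb: 2p/n with p = 2
for x, h_i in zip(hat_sizes, h):
    print(x, round(h_i, 3), "HIGH" if h_i > cutoff else "")
```

The leverages always sum to the number of model parameters (here 2), so a single point with leverage near 1 is dominating the fit.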
  • 29. Leverage Plots
      • Red “confidence curves” identify significant leverage
        ▪ Curves that completely overlap the blue line are not significant
        ▪ Curves that largely do not overlap the blue line have significant leverage
      • If leverage is problematic, respond carefully
        ▪ Identify and remove any outliers, if they exist
        ▪ Consider alternative models, variable transformations, weighting, etc.
  • 31. Residuals
      • Residuals = Observed - Predicted
        ▪ Also called “errors”
        ▪ e_i = Y_i - Ŷ_i
      • Represent the unexplained variation
        ▪ Should be independent, identically distributed and random
        ▪ Overall trend in residuals represents model fit
        ▪ Large individual residuals may represent high influence
        ▪ Several different computations for residuals exist
  • 32. Residuals Plot
      • Residuals vs. X variable
        ▪ Evaluate model fit relative to one predictor variable
        ▪ Suspect one variable fits poorly in multivariable model
      • Residuals vs. Predicted values
        ▪ Evaluate model fit with respect to the entire model
        ▪ Good if you want a single plot for multivariable model
      • Residuals vs. omitted X variable
        ▪ Interesting trends if important variable was omitted
  • 33. Good Model Fit
      • Expect a rectangular or oval shaped cloud of residuals
      • Residuals vs. X variable used to evaluate independence
        ▪ E.g. Do we need to model a curved relationship with Age?
      • Residuals vs. predicted used to evaluate assumption of identically distributed errors
        ▪ E.g. Non-constant variance
        ▪ E.g. Larger errors with higher response values
  • 34. Errors Are NOT Independent
  • 35. Non-Constant Variance
      [Figure: residuals plotted against Eliza Units]
  • 36. Errors Are NOT Normal
  • 37. Alternative Residual Computations
      • Studentized residuals
        ▪ Divide each residual by the estimate of the standard deviation
        ▪ Easier to identify high influence points (e.g. > 3 s.d. away from mean)
      • Deleted residuals
        ▪ Compute residual after deleting one observation
        ▪ Evaluate the effect of one observation on model fit
      • Deviance or Pearson residuals
        ▪ Computed for categorical response models (e.g. logistic regression)
        ▪ Often do not follow typical trends of residuals from linear models
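A sketch of raw and studentized residuals for a straight-line fit. It uses the common "internally studentized" definition, e_i / (s * sqrt(1 - h_ii)), which is one specific reading of "divide each residual by the estimate of the standard deviation"; the data and function names are hypothetical.

```python
import math

def studentized_residuals(xs, ys):
    """Return (raw residuals, internally studentized residuals) for y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    raw = [y - (a + b * x) for x, y in zip(xs, ys)]  # e_i = Y_i - fitted value
    s = math.sqrt(sum(e ** 2 for e in raw) / (n - 2))  # residual standard deviation
    lev = [1.0 / n + (x - mx) ** 2 / sxx for x in xs]  # leverage h_ii
    student = [e / (s * math.sqrt(1.0 - h)) for e, h in zip(raw, lev)]
    return raw, student

# Hypothetical data with one unusually large final response.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.1, 1.9, 3.2, 3.9, 5.1, 9.0]
raw, student = studentized_residuals(xs, ys)
for x, e, r in zip(xs, raw, student):
    print(x, round(e, 3), round(r, 3))
```

Studentized values are on a common scale, so they can be compared against a fixed threshold such as the 3 s.d. guideline mentioned above.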
  • 40. Other Indicators of High Influence
      • DFFITS
        ▪ Influence of single point on single fitted value
        ▪ Look for |DFFITS| > 1 for small n or |DFFITS| > 2*sqrt(p/n) for large n
      • DFBETAS
        ▪ Influence of single point on regression coefficients
        ▪ Look for |DFBETAS| > 1 for small n or |DFBETAS| > 2 / sqrt(n) for large n
      • Cook’s Distance
        ▪ Influence of single point on all fitted values
        ▪ Compare against F(p, n - p) distribution
        ▪ See Kutner, Nachtsheim, Neter and Li. 2005. Applied Linear Statistical Models for more details
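The cutoffs quoted on this slide are simple to compute. The helpers below only return the thresholds; they do not compute DFFITS or DFBETAS themselves, and the `small_n` flag is an assumption because the slide does not say where "small" ends and "large" begins.

```python
import math

def dffits_cutoff(p: int, n: int, small_n: bool = False) -> float:
    """Threshold for |DFFITS|: 1 for small n, else 2*sqrt(p/n)."""
    return 1.0 if small_n else 2.0 * math.sqrt(p / n)

def dfbetas_cutoff(n: int, small_n: bool = False) -> float:
    """Threshold for |DFBETAS|: 1 for small n, else 2/sqrt(n)."""
    return 1.0 if small_n else 2.0 / math.sqrt(n)

# Hypothetical model with p = 4 parameters fitted to n = 100 observations.
print(round(dffits_cutoff(p=4, n=100), 3), round(dfbetas_cutoff(n=100), 3))
```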
  • 41. Solutions
      • Remove high influence points if they may be outliers
      • Fit a completely new model to the data
      • Transform variables
        ▪ Transform X to change relationship between X and Y
        ▪ Transform Y to change distribution of model errors
      • Use a weighting scheme to reduce their influence
        ▪ Use w_i = 1 / sd_i for non-constant variance
        ▪ Use w_i = 1 / Y_i^2 or w_i = 1 / X_i^2 to weight regions of plot
  • 42. Log-transform X
      • Relationship between X and Y changes
      • May reduce impact of some high influence points
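A sketch of refitting after a log transform of X. The data are invented so that Y grows roughly linearly in log10(X); the base of the logarithm and the helper name are choices made for the example.

```python
import math

def line_fit(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

# Hypothetical data spanning several orders of magnitude in X.
xs = [1, 10, 100, 1000, 10000]
ys = [2.0, 5.1, 7.9, 11.2, 13.8]
log_xs = [math.log10(x) for x in xs]

print("raw X:  ", line_fit(xs, ys))
print("log10 X:", line_fit(log_xs, ys))
```

On the raw scale the point at X = 10000 sits far from the others and dominates the slope; on the log scale the X values are evenly spaced and no single point carries that much weight.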
  • 44. Weighting Schemes
      • Use w_i = 1 / sd_i for non-constant variance
      • Use w_i = 1 / Y_i^2 or w_i = 1 / X_i^2 to weight regions of plot
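A minimal weighted least squares sketch for a straight line, using the w_i = 1 / X_i^2 scheme from this slide. The data and function name are hypothetical; the closed-form solution below is the standard weighted analogue of the ordinary slope and intercept formulas.

```python
def weighted_line_fit(xs, ys, weights):
    """Weighted least squares for y = a + b*x; returns (a, b)."""
    sw = sum(weights)
    mx = sum(w * x for w, x in zip(weights, xs)) / sw  # weighted mean of x
    my = sum(w * y for w, y in zip(weights, ys)) / sw  # weighted mean of y
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(weights, xs, ys))
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(weights, xs))
    b = sxy / sxx
    return my - b * mx, b

# Hypothetical data in which the spread grows with X.
xs = [1.0, 2.0, 4.0, 8.0, 16.0]
ys = [2.1, 3.9, 8.3, 15.2, 35.0]
weights = [1.0 / x ** 2 for x in xs]

print("unweighted:", weighted_line_fit(xs, ys, [1.0] * len(xs)))
print("weighted:  ", weighted_line_fit(xs, ys, weights))
```

Down-weighting the large-X region keeps the noisiest observations from dominating the fitted line.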
  • 45. Thank You
      For questions or comments please contact: ScienceApps@niaid.nih.gov, 301.496.4455