ADLT 673: Teaching as Scholarship in
Medical Education
Tuesday, October 27, 2015
An Overview of Quantitative
Data Analysis
Outline of Today’s Class
 Brief Description of Statistical Thinking
 Analytic Methods
 Summary Measures
 Appropriate Research Questions
 Determining the appropriate Statistical Methodology
 Group Discussion
 Designing a Data Collection Plan
 Sources, Capture and Storage
 Sample Size Determination
 Group Discussion
 Additional Resources
Statistical Thinking
 Population
 All possible subjects
 EX: All US patients
 Sample
 Subjects you observe
 EX: Patients seen at VCU
 Sampling from Population
 Sample should be Microcosm
 Are other samples different?
 Small samples → rare events
 Larger samples are better
Population
Sample
Statistical Thinking
 If this is the population…  Does sample look like
this…
 …or this…
Statistical Thinking (Example)
 Sample:
 Experimental drug to reduce side effects from surgical procedure
 Historical rate: 10% experience no side effects
 New Trial: 33 successes (no side effects) in 200 patients
 Sample percentage: 𝑝=33/200 = 16.5%
 What does evidence imply about population?
 Does new drug show improvement?
 What happens if we run this experiment again?
 Is this sample “big” enough to represent population?
 If historical rate is truly 10%, what would samples look like?
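One way to explore the last question is by simulation. The sketch below uses only the Python standard library: it draws 1,000 samples of 200 patients under the assumed 10% historical rate and tallies the number of successes in each sample. The function name and the seed are illustrative choices, not part of the original slides, and the tallies will differ somewhat from any other run of the same experiment.

```python
import random
from collections import Counter

def simulate_success_counts(n_samples=1000, n_patients=200, p=0.10, seed=2015):
    """Return a Counter mapping '# of successes' to the number of samples with that count."""
    rng = random.Random(seed)  # fixed seed only so the sketch is reproducible
    counts = Counter()
    for _ in range(n_samples):
        # Each patient is a success (no side effects) with probability p.
        successes = sum(rng.random() < p for _ in range(n_patients))
        counts[successes] += 1
    return counts

counts = simulate_success_counts()
at_least_33 = sum(freq for k, freq in counts.items() if k >= 33)
print("samples with 33 or more successes:", at_least_33, "out of", sum(counts.values()))
```

Most simulated samples land near 20 successes (10% of 200); counts as large as 33 should be uncommon.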
Statistical Thinking (Example)
# of Successes   Frequency     Proportion     # of Successes   Frequency     Proportion
out of 200       out of 1000                  out of 200       out of 1000
 5                 1           0.001          21                75           0.075
 6                 0           0.000          22                78           0.078
 7                 1           0.001          23                71           0.071
 8                 0           0.000          24                54           0.054
 9                 2           0.002          25                33           0.033
10                 8           0.008          26                25           0.025
11                 9           0.009          27                27           0.027
12                20           0.020          28                11           0.011
13                25           0.025          29                 8           0.008
14                35           0.035          30                11           0.011
15                48           0.048          31                 4           0.004
16                65           0.065          32                 4           0.004
17                69           0.069          33                 1           0.001
18                99           0.099          34                 1           0.001
19                94           0.094          35                 1           0.001
20               119           0.119          36                 1           0.001
Simulation Study of Samples with 200 Dichotomous Observations with Known
10% Success Rate
Statistical Thinking (Example)
[Figure: Histogram of 1,000 Simulated Samples of 200 Dichotomous Outcomes with
Assumed p = 10% Success Rate; vertical axis ticks run from 0 to 140]
Statistical Thinking (Example)
 If the true proportion was really p = 10%…
 Then our event (exactly 33 successes) would be observed about 1 in every 1000
trials
 Estimated Probability: 1/1000 = 0.001 (33 or more successes: 4/1000 = 0.004 in the simulation)
 Two possible explanations for our sample:
 Rate really is 10% → we observed a rare event
 Our assumption (p = 10%) was incorrect
 Revised Statistical Thinking:
 If observed event is likely given our assumptions, then our assumptions
are probably correct
 If observed event is unlikely given our assumptions, then our
assumptions are probably NOT correct
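The simulated probability can be cross-checked with an exact binomial calculation. The sketch below sums the binomial probabilities for 33 or more successes in 200 patients under the assumed 10% rate. Using the upper tail ("at least 33") rather than "exactly 33" is a choice made here, in line with how p-values are usually defined; the function name is illustrative.

```python
from math import comb

def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed exactly term by term."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

p_value = binomial_upper_tail(33, 200, 0.10)
print(f"P(33 or more successes | p = 10%) = {p_value:.4f}")
```

A small value here supports the second explanation: the 10% assumption is probably not correct for the new drug.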
Analytic Methods: Summary Measures
 Representative Measures
 Reflect the most “typical” or “average” data value
 Continuous Measurements:
 Mean (Average), Median and Mode
 Categorical Measurements:
 Frequencies and Proportions
 Measures of Variability
 Reflect how much subjects differ from one another
 Continuous Measurements:
 Standard deviation, range, interquartile range
 Categorical Measurements:
 None that are meaningful (sorry!)
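As a minimal sketch, the summary measures listed above can be computed with the Python standard library. The blood pressure and screening values are hypothetical and exist only to make the example run.

```python
import statistics
from collections import Counter

# Hypothetical data for illustration only.
systolic_bp = [118, 124, 131, 109, 142, 124, 127, 135, 120, 124]  # continuous
screened = ["yes", "no", "yes", "yes", "no", "yes", "yes", "no", "yes", "yes"]  # categorical

# Representative measures for a continuous measurement
mean_bp = statistics.mean(systolic_bp)
median_bp = statistics.median(systolic_bp)
mode_bp = statistics.mode(systolic_bp)

# Measures of variability for a continuous measurement
sd_bp = statistics.stdev(systolic_bp)               # sample standard deviation
range_bp = max(systolic_bp) - min(systolic_bp)
q1, _, q3 = statistics.quantiles(systolic_bp, n=4)  # quartile cut points
iqr_bp = q3 - q1                                    # interquartile range

# Frequencies and proportions for a categorical measurement
freq = Counter(screened)
prop = {level: count / len(screened) for level, count in freq.items()}

print(mean_bp, median_bp, mode_bp, round(sd_bp, 1), range_bp, iqr_bp)
print(dict(freq), prop)
```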
Analytic Methods: Research Question
 Translating Research Question into Testable Hypotheses
 Research question must be in form that allows statistical method to be assigned
 Three components: # of groups, measurement type, # of measures
 1. Subjects
 Identify population under consideration
 Determine # of groups (and what distinguishes them)
 2. Measurements
 Identify the measurement Type (continuous or categorical)
 Determine the summary measure (e.g., mean or proportion)
 # of Times Measured (once, twice or more?)
 3. Statement (i.e., what are you trying to do?)
 Estimation (is something simply being measured?)
 Change (is something being tracked over time? Before and after an event?)
 Comparison (is something compared between groups?)
Analytic Methods: Research Question
 Who is under consideration?
 Identify population under consideration
 Determine # of groups (and what distinguishes them)
 Ex.: Is the proportion of patients up to date on their colon
cancer screening greater for those who receive an email
reminder from their physician than for those who do not
receive such email messages?
 Population?
 Number of Groups?
Analytic Methods: Research Question
 Is it measurable?
 Identify the measurement Type (continuous or categorical)
 Determine the summary measure (e.g., mean or proportion)
 # of Times Measured (once, twice or more?)
 Ex.: Is the proportion of patients up to date on their colon
cancer screening greater for those who receive an email
reminder from their physician than for those who do not
receive such email messages?
 Measurement Type?
 Summary Measures?
 Number of Times Measured?
Analytic Methods: Research Question
 Is the “question” clear?
 Estimation (is something simply being measured?)
 Change (is something being tracked over time? Before and
after an event?)
 Comparison (is something compared between groups?)
 Ex.: Is the proportion of patients up to date on their colon
cancer screening greater for those who receive an email
reminder from their physician than for those who do not
receive such email messages?
 Statement: Estimation, Change or Comparison? Combination?
Analytic Methods: Continuous Data
                 # of Measurements
# of Samples     Single                 Pre/Post         Repeated Measures
1 Sample         t-test                 Paired t-test    Repeated Measures ANOVA (RMA) /
                                                         Linear Mixed Model (LMM)*
2 Samples        Two-sample t-test      RMA / LMM*       RMA / LMM*
“k” Samples      Analysis of Variance   RMA / LMM*       RMA / LMM*
                 (ANOVA)
Adjusting for Covariates: Multiple Linear Regression*, Analysis of Covariance
(ANCOVA)*, Linear Mixed Models*
*Will likely require statistical assistance
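To illustrate the "1 Sample, Pre/Post" cell, here is a minimal sketch of the paired t statistic: the mean of the within-subject differences divided by its standard error. The scores are hypothetical. The standard library has no t distribution, so the sketch stops at the statistic and its degrees of freedom; a p-value would come from a t table or a statistics package.

```python
import math
import statistics

def paired_t_statistic(before, after):
    """Return (t, degrees of freedom) for a paired t-test on after - before."""
    if len(before) != len(after):
        raise ValueError("paired data need equal lengths")
    diffs = [a - b for b, a in zip(before, after)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    se_diff = statistics.stdev(diffs) / math.sqrt(n)  # standard error of the mean difference
    return mean_diff / se_diff, n - 1

# Hypothetical pre/post scores for six learners (illustration only).
pre = [62, 70, 58, 75, 66, 69]
post = [68, 74, 63, 75, 72, 71]
t_stat, df = paired_t_statistic(pre, post)
print(f"t = {t_stat:.2f} on {df} degrees of freedom")
```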
Analytic Methods: Categorical Data
                 # of Measurements
# of Samples     Single             Pre/Post          Repeated Measures
1 Sample         z-test             McNemar’s Test    Generalized Linear Mixed Models (GLMM)*
2 Samples        Chi-square Test    GLMM*             GLMM*
“k” Samples      Chi-square Test    GLMM*             GLMM*
Adjusting for Covariates: Multiple Logistic Regression*, Generalized Linear
Mixed Models*
*Will likely require statistical assistance
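For the "2 Samples, Single" cell, the email-reminder question could be analyzed with a chi-square test on a 2x2 table. The sketch below computes the Pearson statistic without a continuity correction; with one degree of freedom the p-value can be written with the complementary error function. The counts are hypothetical, not data from any study.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]].

    Returns (statistic, p_value). With 1 degree of freedom the upper tail
    probability equals erfc(sqrt(statistic / 2)).
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    return stat, math.erfc(math.sqrt(stat / 2))

# Hypothetical counts: rows = reminder / no reminder, columns = up to date / not up to date.
stat, p_value = chi_square_2x2(60, 40, 45, 55)
print(f"chi-square = {stat:.2f}, p = {p_value:.3f}")
```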
Analytical Methods: Examples
 Research Question: What are CD3 cell counts in
BMT recipients 60 days after transplantation?
 Research Question: Are CD3 cell counts in BMT
recipients 60 days after transplantation larger than
counts at baseline (day 0)?
 Research Question: Are CD3 cell counts in BMT
recipients receiving a 5.1-unit ATG dose as large as the
counts in recipients receiving a 7.5-unit dose?
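The two method tables can be read as a lookup keyed by measurement type, number of samples and number of measurements. The sketch below encodes them that way and applies the lookup to the three CD3 questions. The classification of each question is one reading of the slides, and the function and key names are illustrative.

```python
# Cells of the Continuous Data and Categorical Data tables that do not
# fall back to RMA / LMM* (continuous) or GLMM* (categorical).
METHODS = {
    ("continuous", "1", "single"): "t-test",
    ("continuous", "1", "pre/post"): "Paired t-test",
    ("continuous", "2", "single"): "Two-sample t-test",
    ("continuous", "k", "single"): "Analysis of Variance (ANOVA)",
    ("categorical", "1", "single"): "z-test",
    ("categorical", "1", "pre/post"): "McNemar's Test",
    ("categorical", "2", "single"): "Chi-square Test",
    ("categorical", "k", "single"): "Chi-square Test",
}

def suggest_method(measurement, samples, measurements):
    """Return the table entry; '*' marks methods likely to need statistical assistance."""
    if measurement not in ("continuous", "categorical"):
        raise ValueError("measurement must be 'continuous' or 'categorical'")
    if samples not in ("1", "2", "k"):
        raise ValueError("samples must be '1', '2' or 'k'")
    if measurements not in ("single", "pre/post", "repeated"):
        raise ValueError("measurements must be 'single', 'pre/post' or 'repeated'")
    fallback = "RMA / LMM*" if measurement == "continuous" else "GLMM*"
    return METHODS.get((measurement, samples, measurements), fallback)

# One reading of the three CD3 questions (cell counts treated as continuous):
print(suggest_method("continuous", "1", "single"))    # counts at day 60
print(suggest_method("continuous", "1", "pre/post"))  # day 60 vs. baseline
print(suggest_method("continuous", "2", "single"))    # 5.1-unit vs. 7.5-unit dose
```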
Group Discussions
 Please break into groups by table
 For the next 10-15 minutes, take turns discussing what
analytic approaches are appropriate for your proposed
study
 Is your outcome continuous or categorical?
 How many groups are you investigating?
 How many measurements are you taking?
 What statistical methodology should you use?
 If your study is qualitative, discuss how statistical
methodologies could be used (e.g. data summary,
association)
Data Collection Plan: Sources
 What information do you need to answer your research
question?
 Electronic Health Records (EHR):
 CERNER
 ONCORE
 Integrated Personal Health Record (IPHR):
 MyPreventiveCare
 Chart reviews
 Surveys
 Prospective biological measurements
 Need to know:
 Who will physically obtain/collect data?
 How often will it be done?
 If prospective biological measures: how will it be done?
Data Collection Plan: Capture
 How will you obtain the necessary information?
 EHR or IPHR extraction
 Chart audits
 Surveys
 In-Person, mail, email or online
 Prospective measurements
 Need to know:
 Who will do this?
 How often will it be done?
 If prospective biological measures: how will it be done?
Data Collection Plan: Storage
 Where will your data be stored?
 Paper records → No
 Microsoft Excel or Access
 REDCAP
 Collects and stores survey data…and much more
 SAS database (or SPSS, R, etc.)
 Work with your statistician to create dataset
 Need to know:
 How often will it be updated?
 Is it secure?
 Is it IRB/HIPAA compliant?
Data Collection Plan
 Helpful Suggestions:
 Consult a statistician or database manager before you start
collecting data
 Preferably the person who will be analyzing your data
 If you are collecting and storing data yourself:
 Record it directly into storage unit as you collect it
 E.g., Microsoft Excel
 Record it as it will be analyzed
 One row per subject per time point
• New row for each additional time point
 One column per measurement
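The layout described above (one row per subject per time point, one column per measurement) can be sketched as a small CSV file. The column names and values below are hypothetical and only loosely modeled on the CD3 example; an in-memory buffer stands in for a file on disk.

```python
import csv
import io

# Hypothetical measurements: each subject gets a new row for every time point.
rows = [
    {"subject_id": 1, "day": 0,  "cd3_count": 150, "atg_dose": 5.1},
    {"subject_id": 1, "day": 60, "cd3_count": 420, "atg_dose": 5.1},
    {"subject_id": 2, "day": 0,  "cd3_count": 180, "atg_dose": 7.5},
    {"subject_id": 2, "day": 60, "cd3_count": 390, "atg_dose": 7.5},
]

buffer = io.StringIO()  # stands in for open("data.csv", "w", newline="")
writer = csv.DictWriter(buffer, fieldnames=["subject_id", "day", "cd3_count", "atg_dose"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Data stored this way can usually be read directly by statistical software without reshaping.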
Data Collection Plan: Example
Sample Size Determination
 As a general rule, larger sample sizes:
 Lead to more representative samples
 Lead to better estimation of parameters (e.g., representative
measures)
 Provide estimators with lower variability
[Figure panels: N=9, N=36, N=100]
Sample Size Determination
Averages over 10,000 Simulations
Sample Size   Sample Mean   Sample Std. Dev.   Standard Error*
   9          204.4         36.5               12.3
  16          204.3         37.1                9.5
  25          204.2         37.2                7.8
  36          204.1         37.5                6.5
  49          204.1         37.6                5.5
  64          204.2         37.7                4.9
  81          204.1         37.7                4.2
 100          204.1         37.7                3.9
1000          204.1         37.7                1.2
*SE: explains variability in estimator; not the sample data
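A simulation of this kind can be sketched in a few lines. The population used on the slide is not stated, so the sketch assumes a normal population with mean 204.1 and standard deviation 37.7, values read off the table, and uses 1,000 rather than 10,000 simulations to keep the run short. The standard error is computed as the sample standard deviation divided by the square root of the sample size.

```python
import math
import random
import statistics

def average_summaries(n, n_sims=1000, mu=204.1, sigma=37.7, seed=1):
    """Average the sample mean, SD and standard error over n_sims samples of size n."""
    rng = random.Random(seed)
    means, sds, ses = [], [], []
    for _ in range(n_sims):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]  # assumed normal population
        sd = statistics.stdev(sample)
        means.append(statistics.mean(sample))
        sds.append(sd)
        ses.append(sd / math.sqrt(n))  # standard error of the sample mean
    return statistics.mean(means), statistics.mean(sds), statistics.mean(ses)

results = {n: average_summaries(n) for n in (9, 36, 100)}
for n, (m, sd, se) in results.items():
    print(f"n={n:4d}  mean={m:6.1f}  sd={sd:5.1f}  se={se:5.1f}")
```

The mean and standard deviation stay roughly constant as n grows, while the standard error shrinks in proportion to 1/sqrt(n).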
Sample Size Determination
 Possible Decisions
 Type I Error: find difference where there shouldn’t be one
 Type II Error: fail to find difference where there should be one
 Power = 1 - β
                     True State
Your Decision        H0 is “True”          HA is True
Reject H0            Type I Error (α)      Correct Decision
Fail to Reject H0    Correct Decision      Type II Error (β)
Sample Size Determination
 Determinants of Required Sample Size
 Significance Level (α): probability of rejecting H0 when it’s true
 Power (1-β): probability of rejecting H0 when it’s false
 These values are selected during design phase
 α = 5%
 1-β = 80% (sometimes 90%).
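These two quantities can be made concrete with the earlier drug example. The sketch below assumes a one-sided exact binomial test of H0: p = 10% with 200 patients and, purely for illustration, takes the observed 16.5% as the true rate under HA. It finds the smallest number of successes that keeps the Type I error at or below 5%, then computes the power at the alternative.

```python
from math import comb

def upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def critical_value(n, p0, alpha):
    """Smallest k such that rejecting H0 when X >= k keeps the Type I error <= alpha."""
    for k in range(n + 2):  # k = n + 1 gives an empty tail, so the loop always returns
        if upper_tail(k, n, p0) <= alpha:
            return k

n, p0, p_alt = 200, 0.10, 0.165        # p_alt: assumed true rate under HA (illustrative)
k_crit = critical_value(n, p0, alpha=0.05)
type_i = upper_tail(k_crit, n, p0)     # actual alpha of this one-sided test
power = upper_tail(k_crit, n, p_alt)   # P(reject H0 | HA true) = 1 - beta
print(f"reject H0 at {k_crit}+ successes; alpha = {type_i:.3f}, power = {power:.2f}")
```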
Sample Size Determination
 Determinants of Required Sample Size
 Measure of variability (usually standard deviation) inherent
in study population
 As measurement becomes more variable…
 Standard error of test statistic increases…
 p-value increases…
 Ability to reject H0 decreases…
 Power decreases
 Controlling variability:
 Better measurement methodology
 Homogeneous samples
Sample Size Determination
 Determinants of Required Sample Size
 Effect Size: smallest difference or change in outcome you hope to
find
 As difference you want to observe decreases…
 Test statistic decreases…
 p-value increases…
 Ability to reject H0 decreases…
 Power decreases
 Considerations:
 Clinical significance
 Clinical possibility
 Large differences are easier to detect but harder to find in practice
Sample Size Determination
 Calculating Required Sample Size
 Equations exist (involving α, β, variability and effect size) for
simple analytic methods (t-test, chi-square, etc.)
 Advanced methods require professional assistance
 Where do you find variability and effect size?
 Previous literature of similar populations
 Retrospective Study / Chart Audits
 Pilot study
 Guesstimate!
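As an example of such an equation, a common normal-approximation formula for comparing two group means is n per group = 2 (z_{1-α/2} + z_{1-β})² σ² / δ², where σ is the standard deviation and δ is the effect size. The sketch below implements that formula; it is one textbook approximation rather than the method used on the slides, and the σ and δ values are hypothetical. Exact t-based calculations give slightly larger numbers.

```python
import math
from statistics import NormalDist

def n_per_group(sigma, delta, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided, two-sample comparison of means.

    Normal approximation: n = 2 * (z_{1-alpha/2} + z_{power})^2 * sigma^2 / delta^2
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_power) ** 2 * sigma**2 / delta**2)

n_80 = n_per_group(sigma=10, delta=5)              # hypothetical SD and effect size
n_90 = n_per_group(sigma=10, delta=5, power=0.90)
print(n_80, n_90)  # 63 and 85 subjects per group
```

Halving the effect size or doubling the standard deviation roughly quadruples the required sample size.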
Sample Size Determination
 What if required sample size is too large?
 Consider a different outcome
 Continuous measures generally require smaller sample sizes than
categorical measures
 Consider fewer groups or add multiple sites
 Fewer Groups → More subjects per group
 Multiple Sites → Larger subject pool (maybe more representative…)
 Will require more sophisticated analytic methods
 Reconfigure study as a “pilot”
 Switch emphasis from “hypothesis testing” to “estimation”
 Goal: data summaries and confidence intervals
 Use to power larger study
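For an estimation-focused pilot, the main output is an estimate with a confidence interval. The sketch below computes a normal-approximation (Wald) interval for a proportion and applies it to the 33 successes in 200 patients from the earlier drug example. The Wald interval is one simple choice among several and can behave poorly for small samples or proportions near 0 or 1.

```python
import math
from statistics import NormalDist

def wald_ci(successes, n, level=0.95):
    """Normal-approximation (Wald) confidence interval for a proportion."""
    p_hat = successes / n
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

low, high = wald_ci(33, 200)
print(f"16.5% observed, 95% CI roughly {low:.1%} to {high:.1%}")
```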
Group Discussion
 Please break into groups by table
 For the next 10-15 minutes, take turns discussing:
 Data Management:
 Where will you get your data?
 How will you capture it?
 How will you store it?
 Sample Size Determination:
 Are you able to power your study?
 Where will (did) you find information for your power analysis?
Additional Resources
 VCU Department of Biostatistics
 15 full-time faculty
 Can assist with: study design, sample size determination, interim
and final analyses, dissemination
 Grant funding (or prospects of funding) usually required
 BIOS 516 Biostatistical Consulting: graduate students available for
FREE consultations
 Contact Russ Boyle (russell.boyle@vcuhealth.org)
 Provide a protocol
 Offer co-authorship
Additional Resources
 VCU Center for Clinical and Translational Research
 Research Incubator: study design, sample size determination,
and other resources (e.g. grant writing)
 Contact: Pam Dillon (pmdillon@vcu.edu)
 Biomedical Informatics: data management and storage (e.g.
REDCAP)
 Support requested online:
(http://www.cctr.vcu.edu/informatics/index.html)
Additional Resources
 Textbook (i.e., shameless plug):
 Statistical Research Methods: A Guide for Non-Statisticians
 Sabo and Boone, Springer, 2013
 Available on the web:

Statistics pres 10 27 2015 roy sabo
