4/12/2011 Data analysis and causal inference 1
Data analysis and causal inference – 1
Victor J. Schoenbach, PhD
Department of Epidemiology
Gillings School of Global Public Health
University of North Carolina at Chapel Hill
www.unc.edu/epid600/
Principles of Epidemiology for Public Health (EPID600)
The Physicist, the Chemist, and the Statistician
From “Science Jokes”, posted to Usenet groups by Joachim Verhagen
(verhagen@fys.ruu.nl); downloaded from Keith M. Gregg,
keith.gregg@stanford.edu, www-leland.stanford.edu/~keithg/humor.shtml
“Three professors (a physicist, a chemist,
and a statistician) are called in to see their
dean. Just as they arrive the dean is called
out of his office, leaving the three professors
there. The professors see with alarm that
there is a fire in the wastebasket.
“The physicist says, ‘I know what to do! We
must cool down the materials until their
temperature is lower than the ignition
temperature and then the fire will go out.’
“The chemist says, ‘No! No! I know what to
do! We must cut off the supply of oxygen so
that the fire will go out due to lack of one of
the reactants.’
“While the physicist and chemist debate
what course to take, they both are alarmed
to see the statistician running around the
room starting other fires. They both scream,
‘What are you doing?’
To which the statistician replies, ‘Trying to
get an adequate sample size.’”
Data management
• Managing epidemiologic data is “mass
production”
• A systematic, organized, professional
approach is critical for detecting and
avoiding problems
“You can never, never take
anything for granted.”
Noel Hinners, vice president for flight
systems at Lockheed Martin Astronautics,
whose engineering team reported
measurements in English units that the
Mars Climate Orbiter navigation team
assumed were metric units.
Without the documentation, the data may be
of little if any value (1995 NSFG)
00000000000003122222222402143041000
00000000000001144112131 070520310
00000000000003233112131 072331040
000000000000011163322227070350110
00000000000003133022221 02451121000
00000000000001111112131 02110041000
00000000000002111112131 07307131000
00000000000002122112131 01073041000
Data analysis and causal inference
• “Our data say nothing at all.”
(Epidemiology guru Sander Greenland, Congress of
Epidemiology 2001, Toronto)
• Data are observer notes, respondent
answers, biochemical measurements,
contents of medical records, machine
readable datasets, …
• What does one do with them?
Steps in data management
• Design the data collection process
• Write down all data collection procedures
• Train and supervise data collectors
• Monitor all data collection activities
• Document all data collection experiences
• Keep track of, document, and safeguard
data
Data processing
• Review, edit, and code data forms,
documenting exceptions and actions
• Convert to electronic form
• “Clean” data – check for illegal or
improbable values, combinations of values
• Prepare summaries
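The "clean" step above can be sketched in a few lines. This is only an illustration: the records, the field names, the legal codes, and the age range are all hypothetical, and a real study would take them from its own codebook.

```python
# Hypothetical records and rules, for illustration only.
records = [
    {"id": 1, "age": 34, "sex": "F", "pregnant": "Y"},
    {"id": 2, "age": 210, "sex": "M", "pregnant": "N"},  # improbable age
    {"id": 3, "age": 51, "sex": "X", "pregnant": "N"},   # illegal sex code
    {"id": 4, "age": 40, "sex": "M", "pregnant": "Y"},   # suspicious combination
]

LEGAL_SEX = {"F", "M"}   # assumed coding scheme
AGE_RANGE = (0, 110)     # assumed plausible range


def check_record(rec):
    """Return a list of problems found in one record."""
    problems = []
    if not AGE_RANGE[0] <= rec["age"] <= AGE_RANGE[1]:
        problems.append("age out of range")
    if rec["sex"] not in LEGAL_SEX:
        problems.append("illegal sex code")
    if rec["sex"] == "M" and rec["pregnant"] == "Y":
        problems.append("male recorded as pregnant")
    return problems


def clean_report(recs):
    """Map record id to its problems, for records with at least one problem."""
    report = {}
    for rec in recs:
        problems = check_record(rec)
        if problems:
            report[rec["id"]] = problems
    return report


report = clean_report(records)
for rec_id, problems in report.items():
    print(rec_id, problems)
```

A report like this only flags records for review; whether to correct, recode, or set a value to missing is a documented decision made afterwards.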
The case of the missing eights
• Cancer Prevention study II
(N=1.2 million)
• Contractor keyed 20,000
forms/wk; checked weekly.
• The 28 food frequency items had a
peculiar pattern of missing values
• Pulled original questionnaires to check
• Programmer checked code
• Cause: “O” instead of “0”
Steven D. Stellman. Am J Epidemiol
1989;129(4):857-860
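A keying error of the kind named on this slide (the letter "O" typed where the digit "0" was meant) can be caught by checking that fields meant to be numeric contain only digits. The sketch below is a hypothetical check, not the procedure used in the study; the field names and values are invented.

```python
def suspicious_numeric_fields(record, numeric_fields):
    """Return names of fields that should be numeric but contain non-digits.

    Blank fields are treated as missing and are not flagged.
    """
    flagged = []
    for name in numeric_fields:
        value = record.get(name, "")
        if value != "" and not value.isdigit():
            flagged.append(name)
    return flagged


# Invented keyed record: item02 contains the letter O, not a zero.
keyed = {"item01": "3", "item02": "1O", "item03": "", "item04": "08"}
flagged = suspicious_numeric_fields(
    keyed, ["item01", "item02", "item03", "item04"]
)
print(flagged)
```

Running a check like this before numeric conversion matters, because many programs silently turn an unreadable value into a missing one.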
Can you find the data management error?
* get non-hispanic white population in county for 2000, first by adding
  ages 15-24, 25-34, 35-44, and 45-64, then by excluding ages 45-64;

CWHITES=CST00609+CST00610+CST00611+CST00612;
CWHITES2=CWHITES-CST00612;

* get non-hispanic black population in county;

CBLACKS=CST00616+CST00617+CST00618+CST00619;
CBLACKS2=CBLACKS-CST00619;

* get hispanic or latino population in county;

CHISPS=CST00623+CST00624+CST00625+CST00626;
CHISPS2=CHISPS-CST00626;
CST00637 Female population white alone aged 15-24, 2000 – county
CST00638 Female population white alone aged 25-34, 2000 – county
CST00639 Female population white alone aged 35-44, 2000 – county
CST00640 Female population white alone aged 45-64, 2000 – county
CST00644 Female population black* alone aged 15-24, 2000 – county
CST00645 Female population black* alone aged 25-34, 2000 – county
CST00646 Female population black* alone aged 35-44, 2000 – county
CST00647 Female population black* alone aged 45-64, 2000 – county
CST00651 Female population Hispanic* aged 15-24, 2000 – county
CST00652 Female population Hispanic* aged 25-34, 2000 – county
CST00653 Female population Hispanic* aged 35-44, 2000 – county
CST00654 Female population Hispanic* aged 45-64, 2000 – county
* Full variable name: “black or African American”, “Hispanic or Latino”
* get non-hispanic white female population in county;

CWFEMALES=CST00637+CST00638+CST00639+CST00640;
CWFEMALES2=CWFEMALES-CST00640;

* get non-hispanic black female population in county;

CBFEMALES=CST00644+CST00645+CST00646+CST00647;
CBFEMALES2=CBFEMALES-CST00646;

* get hispanic female population in county;

CHFEMALES=CST00651+CST00652+CST00653+CST00654;
CHFEMALES2=CHFEMALES-CST00654;
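One way to look for the error is to check the code against the codebook by machine rather than by eye. The Python sketch below copies the female variable list and the three derived totals from the slides. It assumes that each "2" variable is meant to exclude ages 45-64, as the comment on the first code slide states for the overall totals. The dictionary layout and function name are illustrative only.

```python
# Age group for each county-level variable, copied from the variable list above.
CODEBOOK = {
    "CST00637": "15-24", "CST00638": "25-34", "CST00639": "35-44", "CST00640": "45-64",
    "CST00644": "15-24", "CST00645": "25-34", "CST00646": "35-44", "CST00647": "45-64",
    "CST00651": "15-24", "CST00652": "25-34", "CST00653": "35-44", "CST00654": "45-64",
}

# Each derived total: (variables summed, variable subtracted), as in the SAS code.
DERIVED = {
    "CWFEMALES2": (["CST00637", "CST00638", "CST00639", "CST00640"], "CST00640"),
    "CBFEMALES2": (["CST00644", "CST00645", "CST00646", "CST00647"], "CST00646"),
    "CHFEMALES2": (["CST00651", "CST00652", "CST00653", "CST00654"], "CST00654"),
}

EXCLUDED_AGE_GROUP = "45-64"  # assumption: the intended exclusion for every group


def find_mismatches(derived, codebook, excluded):
    """Return derived variables whose subtracted term is not the excluded age group."""
    bad = {}
    for name, (_summed, subtracted) in derived.items():
        if codebook[subtracted] != excluded:
            bad[name] = (subtracted, codebook[subtracted])
    return bad


mismatches = find_mismatches(DERIVED, CODEBOOK, EXCLUDED_AGE_GROUP)
print(mismatches)
```

Under that assumption, the check flags CBFEMALES2: it subtracts CST00646 (ages 35-44) where the pattern of the other lines calls for CST00647 (ages 45-64).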
Data exploration
• Examine the data – frequency
distributions, cross-tabulations,
scatterplots – be alert for surprises and
suspicious findings
• Examine means and prevalence for
factors of interest, overall and within
interesting subgroups
• Look at associations, prevalence ratios,
relative risks, odds ratios, correlations
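As a small illustration of the association measures listed above, the sketch below computes risks, a risk ratio, and an odds ratio from a 2×2 table. The counts are invented and the function name is an assumption made for this example.

```python
def two_by_two_measures(a, b, c, d):
    """Measures of association from a 2x2 table.

    a = exposed cases,   b = exposed noncases,
    c = unexposed cases, d = unexposed noncases.
    """
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    return {
        "risk_exposed": risk_exposed,
        "risk_unexposed": risk_unexposed,
        "risk_ratio": risk_exposed / risk_unexposed,
        "odds_ratio": (a * d) / (b * c),
    }


# Invented counts: 30 of 100 exposed and 10 of 100 unexposed develop the outcome.
measures = two_by_two_measures(a=30, b=70, c=10, d=90)
for name, value in measures.items():
    print(f"{name}: {value:.2f}")
```

With these counts the odds ratio is larger than the risk ratio, which is typical when the outcome is not rare.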
Carry out focused data analysis
• Desirable to have a written analysis plan
based on the research questions
• Typically carry out “crude” analyses and
analyses controlling for important
variables
• Methods of control: stratification,
mathematical modeling
[Figure: Distribution of U.S. household income, 2007 (CPS data); x-axis: income in $1000s/year]
Source: http://img55.imageshack.us/i/incomedistr07jo6.jpg/
Stratified analysis
• Divide the dataset into subsets according
to relevant covariables (e.g., age, sex,
smoking, …)
• Examine the estimates and associations
within each subset (unless there are too
many)
• Take averages across the subsets
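The three steps above can be sketched as follows. The slide does not say which kind of average to take; this example uses the Mantel-Haenszel summary risk ratio, one common weighted average, as an assumption. The stratum counts are invented so that the crude and stratified results differ.

```python
# Hypothetical cohort counts per stratum:
# (exposed cases, exposed noncases, unexposed cases, unexposed noncases)
strata = {
    "younger": (10, 90, 20, 380),
    "older": (160, 240, 20, 80),
}


def risk_ratio(a, b, c, d):
    """Risk in the exposed divided by risk in the unexposed."""
    return (a / (a + b)) / (c / (c + d))


def crude_risk_ratio(tables):
    """Risk ratio after collapsing all strata into one table."""
    a, b, c, d = (sum(col) for col in zip(*tables.values()))
    return risk_ratio(a, b, c, d)


def mantel_haenszel_risk_ratio(tables):
    """Mantel-Haenszel weighted average of stratum-specific risk ratios."""
    num = 0.0
    den = 0.0
    for a, b, c, d in tables.values():
        n = a + b + c + d
        num += a * (c + d) / n
        den += c * (a + b) / n
    return num / den


for name, table in strata.items():
    print(name, "RR =", round(risk_ratio(*table), 2))
print("crude RR =", round(crude_risk_ratio(strata), 2))
print("Mantel-Haenszel RR =", round(mantel_haenszel_risk_ratio(strata), 2))
```

In this made-up example the risk ratio is the same within each stratum, yet the crude risk ratio is about twice as large, because exposure is much more common in the older, higher-risk stratum.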
Mathematical modeling
• Express the outcome as some
mathematical function of the relevant
covariables
• “Fit” this function to the data, so that it
models the relations in the data
• Interpret the resulting model to draw
inferences about associations
Selecting a pattern to sew a pair of pants
• Want one that fits the need
• Can sew without a pattern, but takes
time and may not look good
• Select a pattern that will be well
received
• Have you seen anyone wearing it?
• Has it been featured in magazines?
The strategy of statistical data analysis
Look for an available statistical
model that will fit the situation (e.g.,
binomial, normal, chi-square, linear)
• Have others used it?
• Has it appeared in a methodology
article?
The strategy of statistical data analysis
Summarize the data in terms of the
statistical model
– Mean
– Standard deviation
– Other parameters
But should always look at the data
• Distributions can have the same mean
and standard deviation but look very
different – e.g., same mean:
[Figure: two differently shaped distributions, each with mean 5]
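A quick way to see this point is to build two samples with identical summary statistics and plot them. The two samples below are made up for this illustration and are not the distributions shown on the original slide. Both have mean 5 and population standard deviation 2.

```python
from collections import Counter
from statistics import mean, pstdev

# Two made-up samples chosen to share the same mean and standard deviation.
unimodal = [2, 4, 4, 4, 5, 5, 7, 9]
bimodal = [3, 3, 3, 3, 7, 7, 7, 7]


def text_histogram(values):
    """Return a crude one-line-per-value histogram as a string."""
    counts = Counter(values)
    lines = []
    for v in range(min(values), max(values) + 1):
        lines.append(f"{v:2d} " + "*" * counts[v])
    return "\n".join(lines)


for name, sample in [("unimodal", unimodal), ("bimodal", bimodal)]:
    print(name, "mean =", mean(sample), "SD =", pstdev(sample))
    print(text_histogram(sample))
```

The summary statistics agree, but the histograms show one sample clustered near the middle and the other split into two clumps with nothing in between.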
Regression models - Conceptual
• Suppose risk factors of:
Age 50 years
BP 130 mmHg systolic
CHL 220 mg/dL
SMK 30 pack-years
Regression models - Conceptual
Example of an additive model:
Risk of CHD =
Risk from Age (“Age_risk”)
+ Risk from BP (“BP_risk”)
+ Risk from CHL (“CHL_risk”)
+ Risk from SMK (“SMK_risk”)
Propose the model
Risk of CHD = Age_risk + BP_risk + CHL_risk + SMK_risk
Age_risk = Age in years × risk increase per year
BP_risk = BP in mmHg × risk increase per mmHg
CHL_risk = Cholest. in mg/dL × risk increase per mg/dL
SMK_risk = Pack-years × risk increase per pack-year
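To make the additive model concrete, the sketch below evaluates it for the person described earlier (age 50, BP 130, CHL 220, 30 pack-years). The per-unit risk increases are invented purely to show the arithmetic. They are not estimates from any dataset and should not be read as real effect sizes.

```python
# Values for one hypothetical person, taken from the risk-factor slide.
person = {"age": 50, "bp": 130, "chl": 220, "smk": 30}

# Made-up coefficients (risk increase per unit); NOT estimates from any data.
risk_per_unit = {"age": 0.0010, "bp": 0.0002, "chl": 0.0001, "smk": 0.0010}


def additive_risk(values, coefficients, baseline=0.0):
    """Return each factor's contribution and their sum plus a baseline."""
    components = {k: values[k] * coefficients[k] for k in values}
    return components, baseline + sum(components.values())


components, total = additive_risk(person, risk_per_unit)
for factor, contribution in components.items():
    print(f"{factor}_risk = {contribution:.3f}")
print(f"Risk of CHD = {total:.3f}")
```

The `baseline` argument is left at zero here because the proposed model has no baseline term; a fitted model would normally include one.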
Fit the model – estimate the coefficients
• Risk = β0 + β1Age + β2BP + β3CHL + β4SMK
β0 = baseline risk
β1 = risk increase per year
β2 = risk increase per mmHg
β3 = risk increase per mg/dL
β4 = risk increase per pack-year
• Use the data and statistical techniques to
estimate β1, β2, β3, β4.
P-values and Power
• P-value: “the probability of obtaining
an interesting-looking sample from a
boring population” (1 – specificity)
• Power: “the probability of obtaining
an interesting-looking sample from
an interesting population” (sensitivity)
The P-value
If my study observes 0.5 [e.g., ln(OR)]
[Figure: two sampling distributions, one at 0 (boring population) and one at 0.7 [ln(OR)] (interesting population)]
[Figure: the same two distributions, boring population (at 0) and interesting population (at 0.7), with the P-value marked]
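The P-value and power described in these slides can be computed once a sampling distribution is assumed. The sketch below uses the slide's values (boring population at 0, interesting population at 0.7, observed value 0.5) and adds three assumptions that are not in the slides: normal sampling distributions, a standard error of 0.25, and a one-sided 5% significance level.

```python
from statistics import NormalDist

NULL_MEAN = 0.0     # "boring" population: ln(OR) = 0
ALT_MEAN = 0.7      # "interesting" population: ln(OR) = 0.7
OBSERVED = 0.5      # what my study observed
ASSUMED_SE = 0.25   # assumption: not given in the slides
ALPHA = 0.05        # assumption: one-sided significance level

boring = NormalDist(mu=NULL_MEAN, sigma=ASSUMED_SE)
interesting = NormalDist(mu=ALT_MEAN, sigma=ASSUMED_SE)

# P-value: probability of a result at least as large as the observed one
# if the sample came from the boring population.
p_value = 1.0 - boring.cdf(OBSERVED)

# Power: probability of exceeding the significance cutoff
# if the sample came from the interesting population.
cutoff = boring.inv_cdf(1.0 - ALPHA)
power = 1.0 - interesting.cdf(cutoff)

print(f"one-sided P-value: {p_value:.3f}")
print(f"cutoff for alpha = {ALPHA}: {cutoff:.3f}")
print(f"power: {power:.3f}")
```

Both quantities are computed conditional on which population the sample came from. Neither says how likely it is that an interesting-looking sample came from a boring population.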
The Problem with the P-value
But the P-value does not tell me the
probability that what I observed was
due to chance
[Figure: sampling distributions at 0 (boring population) and at 0.7 (interesting population)]
If I study only boring populations
[Figure: distributions of samples from boring populations, at 0]
If I study only interesting populations
[Figure: distributions of samples from interesting populations, at 0.7]
Many boring populations
[Figure: boring populations at 0 and interesting populations at 0.7]
Many interesting populations
[Figure: boring populations at 0 and interesting populations at 0.7]
Do epidemiologists study boring populations?
The probability that an interesting-looking sample came
from a boring population depends on how many boring
populations there are. If we study
10 interesting populations
100 boring populations
with 90% power and a 5% significance level, we
expect to obtain 9 interesting samples from
the interesting populations and 5 from the
boring populations
P-values and predictive values
Results:
14 interesting samples
5 came from boring populations
Probability that an interesting sample
came from a boring population:
5/14 = 36% – not 5%!
Analogous to positive predictive value
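The arithmetic above can be reproduced directly. The function below is a sketch with a name chosen for this example; it uses the slide's numbers (10 interesting and 100 boring populations, 90% power, 5% significance level).

```python
def expected_interesting_samples(n_interesting, n_boring, power, alpha):
    """Expected counts of interesting-looking samples and their makeup.

    Returns (from interesting populations, from boring populations,
    share from boring populations, share from interesting populations).
    """
    true_positives = n_interesting * power
    false_positives = n_boring * alpha
    total = true_positives + false_positives
    return (
        true_positives,
        false_positives,
        false_positives / total,
        true_positives / total,
    )


tp, fp, share_boring, predictive_value = expected_interesting_samples(
    n_interesting=10, n_boring=100, power=0.90, alpha=0.05
)
print(f"interesting samples from interesting populations: {tp:.0f}")
print(f"interesting samples from boring populations: {fp:.0f}")
print(f"share from boring populations: {share_boring:.0%}")
print(f"positive predictive value: {predictive_value:.0%}")
```

Changing `n_boring` shows how strongly the result depends on the mix of populations studied. With 10 boring populations instead of 100, the share of interesting samples that come from boring populations falls to about 5%.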
Analogy to positive predictive value
                                        Populations
Samples                     Interesting        Boring             Total
                            (“cases”)          (“noncases”)
Interesting (“positive”)    9                  5                  14      PV+ = 64%
Boring (“negative”)         1                  95                 96
Total                       10                 100                110
                            (with 90%          (with 95%
                            sensitivity)       specificity)
Meta-analysis
• Literature reviews
• Systematic literature reviews
• Every study is an observation from a
population of possible studies
• The set of studies that have been
published may be a biased sample
from that population
What should guide data analysis
• What are the research questions?
– Estimate means (e.g., cholesterol)
and prevalences (e.g., HIV)
– Assess associations (e.g., Is blood
lead associated with elevated blood
pressure? Do prepaid health plans
provide more preventive care? Do
bednets protect against malaria?)
Association of helmet use with death in motorcycle
crashes: a matched-pair cohort study
(Daniel Norvell and Peter Cummings, AJE 2002;156:483-7)
• Data from the National Highway Traffic
Safety Administration’s Fatality Analysis
Reporting System
• Exposure: helmet use; Outcome: death
• Potential confounders: sex, seat position,
age, state helmet law
• 9,222 driver-passenger pairs after
exclusions
• Relative risk of death for a helmeted rider
was 0.65 (0.57-0.74), (0.61 adjusted for
seat position)
• Examined effect measure modification by
seat position and by type of crash.
When the proofreader takes a week off
[Figure dated 12/29/2009, B5; axis labels: Dec 22 23 24 25 28]
Dow Jones Industrial Average
Dec 2009   Close
28         10547.08
25         10520.10
24         10520.10
23         10466.44
22         10464.93
21         10414.14
18         10328.89
17         10308.26
Source: www.google.com/finance/historical?q=INDEXDJX:.DJI
I hope he’s having a good break!
[Figure dated 12/31/2009, B6; axis labels: Dec 23 24 25 28 29]
Dow Jones Industrial Average
Dec 2009   Close
29         10545.41
28         10547.08
25         10520.10
24         10520.10
23         10466.44
22         10464.93
21         10414.14
18         10328.89
17         10308.26
Source: www.google.com/finance/historical?q=INDEXDJX:.DJI
Thank you
• Arigato
• Asante
• Dhanyavaad
• Dumela
• Gracias
• Merci
• Obrigado
• Xie xie

Dr. Rabia Inam Gandapore
 
Best Ayurvedic medicine for Gas and Indigestion
Best Ayurvedic medicine for Gas and IndigestionBest Ayurvedic medicine for Gas and Indigestion
Best Ayurvedic medicine for Gas and Indigestion
Swastik Ayurveda
 
Efficacy of Avartana Sneha in Ayurveda
Efficacy of Avartana Sneha in AyurvedaEfficacy of Avartana Sneha in Ayurveda
Efficacy of Avartana Sneha in Ayurveda
Dr. Jyothirmai Paindla
 

Recently uploaded (20)

Netter's Atlas of Human Anatomy 7.ed.pdf
Netter's Atlas of Human Anatomy 7.ed.pdfNetter's Atlas of Human Anatomy 7.ed.pdf
Netter's Atlas of Human Anatomy 7.ed.pdf
 
Complementary feeding in infant IAP PROTOCOLS
Complementary feeding in infant IAP PROTOCOLSComplementary feeding in infant IAP PROTOCOLS
Complementary feeding in infant IAP PROTOCOLS
 
Tests for analysis of different pharmaceutical.pptx
Tests for analysis of different pharmaceutical.pptxTests for analysis of different pharmaceutical.pptx
Tests for analysis of different pharmaceutical.pptx
 
Vestibulocochlear Nerve by Dr. Rabia Inam Gandapore.pptx
Vestibulocochlear Nerve by Dr. Rabia Inam Gandapore.pptxVestibulocochlear Nerve by Dr. Rabia Inam Gandapore.pptx
Vestibulocochlear Nerve by Dr. Rabia Inam Gandapore.pptx
 
Top Effective Soaps for Fungal Skin Infections in India
Top Effective Soaps for Fungal Skin Infections in IndiaTop Effective Soaps for Fungal Skin Infections in India
Top Effective Soaps for Fungal Skin Infections in India
 
Part II - Body Grief: Losing parts of ourselves and our identity before, duri...
Part II - Body Grief: Losing parts of ourselves and our identity before, duri...Part II - Body Grief: Losing parts of ourselves and our identity before, duri...
Part II - Body Grief: Losing parts of ourselves and our identity before, duri...
 
CHEMOTHERAPY_RDP_CHAPTER 6_Anti Malarial Drugs.pdf
CHEMOTHERAPY_RDP_CHAPTER 6_Anti Malarial Drugs.pdfCHEMOTHERAPY_RDP_CHAPTER 6_Anti Malarial Drugs.pdf
CHEMOTHERAPY_RDP_CHAPTER 6_Anti Malarial Drugs.pdf
 
TEST BANK For An Introduction to Brain and Behavior, 7th Edition by Bryan Kol...
TEST BANK For An Introduction to Brain and Behavior, 7th Edition by Bryan Kol...TEST BANK For An Introduction to Brain and Behavior, 7th Edition by Bryan Kol...
TEST BANK For An Introduction to Brain and Behavior, 7th Edition by Bryan Kol...
 
TEST BANK For Community Health Nursing A Canadian Perspective, 5th Edition by...
TEST BANK For Community Health Nursing A Canadian Perspective, 5th Edition by...TEST BANK For Community Health Nursing A Canadian Perspective, 5th Edition by...
TEST BANK For Community Health Nursing A Canadian Perspective, 5th Edition by...
 
Histololgy of Female Reproductive System.pptx
Histololgy of Female Reproductive System.pptxHistololgy of Female Reproductive System.pptx
Histololgy of Female Reproductive System.pptx
 
Role of Mukta Pishti in the Management of Hyperthyroidism
Role of Mukta Pishti in the Management of HyperthyroidismRole of Mukta Pishti in the Management of Hyperthyroidism
Role of Mukta Pishti in the Management of Hyperthyroidism
 
Abortion PG Seminar Power point presentation
Abortion PG Seminar Power point presentationAbortion PG Seminar Power point presentation
Abortion PG Seminar Power point presentation
 
Integrating Ayurveda into Parkinson’s Management: A Holistic Approach
Integrating Ayurveda into Parkinson’s Management: A Holistic ApproachIntegrating Ayurveda into Parkinson’s Management: A Holistic Approach
Integrating Ayurveda into Parkinson’s Management: A Holistic Approach
 
Ketone bodies and metabolism-biochemistry
Ketone bodies and metabolism-biochemistryKetone bodies and metabolism-biochemistry
Ketone bodies and metabolism-biochemistry
 
CHEMOTHERAPY_RDP_CHAPTER 2 _LEPROSY.pdf1
CHEMOTHERAPY_RDP_CHAPTER 2 _LEPROSY.pdf1CHEMOTHERAPY_RDP_CHAPTER 2 _LEPROSY.pdf1
CHEMOTHERAPY_RDP_CHAPTER 2 _LEPROSY.pdf1
 
Top 10 Best Ayurvedic Kidney Stone Syrups in India
Top 10 Best Ayurvedic Kidney Stone Syrups in IndiaTop 10 Best Ayurvedic Kidney Stone Syrups in India
Top 10 Best Ayurvedic Kidney Stone Syrups in India
 
Osteoporosis - Definition , Evaluation and Management .pdf
Osteoporosis - Definition , Evaluation and Management .pdfOsteoporosis - Definition , Evaluation and Management .pdf
Osteoporosis - Definition , Evaluation and Management .pdf
 
Muscles of Mastication by Dr. Rabia Inam Gandapore.pptx
Muscles of Mastication by Dr. Rabia Inam Gandapore.pptxMuscles of Mastication by Dr. Rabia Inam Gandapore.pptx
Muscles of Mastication by Dr. Rabia Inam Gandapore.pptx
 
Best Ayurvedic medicine for Gas and Indigestion
Best Ayurvedic medicine for Gas and IndigestionBest Ayurvedic medicine for Gas and Indigestion
Best Ayurvedic medicine for Gas and Indigestion
 
Efficacy of Avartana Sneha in Ayurveda
Efficacy of Avartana Sneha in AyurvedaEfficacy of Avartana Sneha in Ayurveda
Efficacy of Avartana Sneha in Ayurveda
 

13a Data analysis and causal inference – 1

  • 1. 4/12/2011 Data analysis and causal inference 1 Data analysis and causal inference – 1 Victor J. Schoenbach, PhD home page Department of Epidemiology Gillings School of Global Public Health University of North Carolina at Chapel Hill www.unc.edu/epid600/ Principles of Epidemiology for Public Health (EPID600)
  • 2. 12/30/2001 Data analysis and causal inference 2 The Physicist, the Chemist, and the Statistician From “Science Jokes”, posted to Usenet groups by Joachim Verhagen (verhagen@fys.ruu.nl); downloaded from, Keith M. Gregg, keith.gregg@stanford.edu, www-leland.stanford.edu/~keithg/humor.shtml “Three professors (a physicist, a chemist, and a statistician) are called in to see their dean. Just as they arrive the dean is called out of his office, leaving the three professors there. The professors see with alarm that there is a fire in the wastebasket.
  • 3. 12/30/2001 Data analysis and causal inference 3 The Physicist, the Chemist, and the Statistician From “Science Jokes”, posted to Usenet groups by Joachim Verhagen (verhagen@fys.ruu.nl); downloaded from, Keith M. Gregg, keith.gregg@stanford.edu, www-leland.stanford.edu/~keithg/humor.shtml “The physicist says, ‘I know what to do! We must cool down the materials until their temperature is lower than the ignition temperature and then the fire will go out.’
  • 4. 12/30/2001 Data analysis and causal inference 4 The Physicist, the Chemist, and the Statistician From “Science Jokes”, posted to Usenet groups by Joachim Verhagen (verhagen@fys.ruu.nl); downloaded from, Keith M. Gregg, keith.gregg@stanford.edu, www-leland.stanford.edu/~keithg/humor.shtml “The chemist says, ‘No! No! I know what to do! We must cut off the supply of oxygen so that the fire will go out due to lack of one of the reactants.’
  • 5. 12/30/2001 Data analysis and causal inference 5 The Physicist, the Chemist, and the Statistician From “Science Jokes”, posted to Usenet groups by Joachim Verhagen (verhagen@fys.ruu.nl); downloaded from, Keith M. Gregg, keith.gregg@stanford.edu, www-leland.stanford.edu/~keithg/humor.shtml “While the physicist and chemist debate what course to take, they both are alarmed to see the statistician running around the room starting other fires. They both scream, ‘What are you doing?’ To which the statistician replies, ‘Trying to get an adequate sample size.’”
  • 6. 12/30/2001 Data analysis and causal inference 6 Data management • Managing epidemiologic data is “mass production” • A systematic, organized, professional approach is critical for detecting and avoiding problems
  • 7. 12/30/2001 Data analysis and causal inference 7 “You can never, never take anything for granted.” Noel Hinners, vice president for flight systems at Lockheed Martin Astronautics, whose engineering team reported measurements in English units that the Mars Climate Orbiter navigation team assumed were metric units.
  • 8. 12/30/2001 Data analysis and causal inference 8 Without the documentation, the data may be of little if any value (1995 NSFG) 00000000000003122222222402143041000 00000000000001144112131 070520310 00000000000003233112131 072331040 000000000000011163322227070350110 00000000000003133022221 02451121000 00000000000001111112131 02110041000 00000000000002111112131 07307131000 00000000000002122112131 01073041000
  • 9. 12/30/2001 Data analysis and causal inference 9 Data analysis and causal inference • “Our data say nothing at all.” (Epidemiology guru Sander Greenland, Congress of Epidemiology 2001, Toronto) • Data are observer notes, respondent answers, biochemical measurements, contents of medical records, machine readable datasets, … • What does one do with them?
  • 10. 11/13/2007 Data analysis and causal inference 10 Steps in data management • Design the data collection process • Write down all data collection procedures • Train and supervise data collectors • Monitor all data collection activities • Document all data collection experiences • Keep track of, document, and safeguard data
  • 11. 11/13/2007 Data analysis and causal inference 11 Data processing • Review, edit, and code data forms, documenting exceptions and actions • Convert to electronic form • “Clean” data – check for illegal or improbable values, combinations of values • Prepare summaries
  • 12. The case of the missing eights • Cancer Prevention study II (N=1.2 million) • Contractor keyed 20,000 forms/wk; checked weekly. • 28-item food frequency had peculiar pattern of missings • Pulled original QQs to check • Programmer checked code • Cause: “O” instead of “0” Steven D. Stellman. Am J Epidemiol 1989;129(4):857-860 4/12/2011 Data analysis and causal inference 12
  • 13. 4/12/2011 Data analysis and causal inference 13 Can you find the data management error?
        * get non-hispanic white population in county for 2000, first by adding
          ages 15-24, 25-34, 35-44, and 45-64, then by excluding ages 45-64;
        CWHITES=CST00609+CST00610+CST00611+CST00612;
        CWHITES2=CWHITES-CST00612;
        * get non-hispanic black population in county;
        CBLACKS=CST00616+CST00617+CST00618+CST00619;
        CBLACKS2=CBLACKS-CST00619;
        * get hispanic or latino population in county;
        CHISPS=CST00623+CST00624+CST00625+CST00626;
        CHISPS2=CHISPS-CST00626;
        (continues on next slide)
  • 14. 4/12/2011 Data analysis and causal inference 14 Can you find the data management error?
        CST00637 Female population white alone aged 15-24, 2000 – county
        CST00638 Female population white alone aged 25-34, 2000 – county
        CST00639 Female population white alone aged 35-44, 2000 – county
        CST00640 Female population white alone aged 45-64, 2000 – county
        CST00644 Female population black* alone aged 15-24, 2000 – county
        CST00645 Female population black* alone aged 25-34, 2000 – county
        CST00646 Female population black* alone aged 35-44, 2000 – county
        CST00647 Female population black* alone aged 45-64, 2000 – county
        CST00651 Female population Hispanic* aged 15-24, 2000 – county
        CST00652 Female population Hispanic* aged 25-34, 2000 – county
        CST00653 Female population Hispanic* aged 35-44, 2000 – county
        CST00654 Female population Hispanic* aged 45-64, 2000 – county
        * Full variable name: “black or African American”, “Hispanic or Latino”
        (continues on next slide)
  • 15. 4/12/2011 Data analysis and causal inference 15 Can you find the data management error?
        * get non-hispanic white female population in county;
        CWFEMALES=CST00637+CST00638+CST00639+CST00640;
        CWFEMALES2=CWFEMALES-CST00640;
        * get non-hispanic black female population in county;
        CBFEMALES=CST00644+CST00645+CST00646+CST00647;
        CBFEMALES2=CBFEMALES-CST00646;
        * get hispanic female population in county;
        CHFEMALES=CST00651+CST00652+CST00653+CST00654;
        CHFEMALES2=CHFEMALES-CST00654;
        (continues on next slide)
  • 16. 4/12/2011 Data analysis and causal inference 16 Can you find the data management error?
        * get non-hispanic white female population in county;
        CWFEMALES=CST00637+CST00638+CST00639+CST00640;
        CWFEMALES2=CWFEMALES-CST00640;
        * get non-hispanic black female population in county;
        CBFEMALES=CST00644+CST00645+CST00646+CST00647;
        CBFEMALES2=CBFEMALES-CST00646;
        * get hispanic female population in county;
        CHFEMALES=CST00651+CST00652+CST00653+CST00654;
        CHFEMALES2=CHFEMALES-CST00654;
  • 17. 12/30/2001 Data analysis and causal inference 17 Data exploration • Examine the data – frequency distributions, cross-tabulations, scatterplots – be alert for surprises and suspicious findings • Examine means and prevalence for factors of interest, overall and within interesting subgroups • Look at associations, prevalence ratios, relative risks, odds ratios, correlations
  • 18. 12/30/2001 Data analysis and causal inference 18 Carry out focused data analysis • Desirable to have a written analysis plan based on the research questions • Typically carry out “crude” analyses and analyses controlling for important variables • Methods of control: stratification, mathematical modeling
  • 19. Distribution of U.S. household income, 2007 (CPS data) 4/12/2011 Data analysis and causal inference 19 Income in $1000s/year Source: http://img55.imageshack.us/i/incomedistr07jo6.jpg/
  • 20. 12/30/2001 Data analysis and causal inference 20 Stratified analysis • Divide the dataset into subsets according to relevant covariables (e.g., age, sex, smoking, …) • Examine the estimates and associations within each subset (unless there are too many) • Take averages across the subsets
  • 21. 11/13/2007 Data analysis and causal inference 21 Mathematical modeling • Express the outcome as some mathematical function of the relevant covariables • “Fit” this function to the data, so that it models the relations in the data • Interpret the resulting model to draw inferences about associations
  • 22. 11/13/2007 Data analysis and causal inference 22 Selecting a pattern to sew a pair of pants • Want one that fits the need • Can sew without a pattern, but takes time and may not look good • Select a pattern that will be well received • Have you seen anyone wearing it? • Has it been featured in magazines
  • 23. 12/30/2001 Data analysis and causal inference 23 The strategy of statistical data analysis Look for an available statistical model that will fit the situation (e.g., binomial, normal, chi-square, linear) • Have others used it? • Has it appeared in a methodology article?
  • 24. 12/30/2001 Data analysis and causal inference 24 The strategy of statistical data analysis Summarize the data in terms of the statistical model – Mean – Standard deviation – Other parameters
  • 25. 4/22/2002 Data analysis and causal inference 25 But should always look at the data • Distributions can have same mean and standard deviation but look very different – e.g., same mean: 5 5
  • 26. 4/18/2006 Data analysis and causal inference 26 Regression models - Conceptual • Suppose risk factors of: Age 50 years; BP 130 mmHg systolic; CHL 220 mg/dL; SMK 30 pack-years
  • 27. 4/13/2010 Data analysis and causal inference 27 Regression models - Conceptual Example of an additive model: Risk of CHD = Risk from Age (“Age_risk”) + Risk from BP (“BP_risk”) + Risk from CHL (“CHL_risk”) + Risk from SMK (“SMK_risk”)
  • 28. 4/13/2010 Data analysis and causal inference 28 Propose the model Risk of CHD = Age_risk + BP_risk + CHL_risk + SMK_risk, where Age_risk = Age in years × risk increase per year; BP_risk = BP in mmHg × risk increase per mmHg; CHL_risk = Cholesterol in mg/dL × risk increase per mg/dL; SMK_risk = Pack-years × risk increase per pack-year
  • 29. 4/13/2010 Data analysis and causal inference 29 Fit the model – estimate the coefficients • Risk = β0 + β1Age + β2BP + β3CHL + β4SMK, where β0 = baseline risk; β1 = risk increase per year; β2 = risk increase per mmHg; β3 = risk increase per mg/dL; β4 = risk increase per pack-year • Use the data and statistical techniques to estimate β1, β2, β3, β4.
  • 30. 12/30/2001 Data analysis and causal inference 30 P-values and Power • P-value: “the probability of obtaining an interesting-looking sample from a boring population” (1 – specificity) • Power: “the probability of obtaining an interesting-looking sample from an interesting population” (sensitivity)
  • 31. 11/16/2004 Data analysis and causal inference 31 The P-value If my study observes 0.5 [e.g., ln(OR)] 0 Boring population 0.7 [ln(OR)] Interesting population
  • 32. 11/22/2005 Data analysis and causal inference 32 The P-value If my study observes 0.5 [e.g., ln(OR)] 0 Boring population 0.7 Interesting population P-value
  • 33. 11/16/2004 Data analysis and causal inference 33 The Problem with the P-value But the P-value does not tell me the probability that what I observed was due to chance 0 Boring population 0.7 Interesting population
  • 34. 11/16/2004 Data analysis and causal inference 34 If I study only boring populations 0 Distributions of samples from boring populations
  • 35. 11/16/2004 Data analysis and causal inference 35 If I study only interesting populations 0 0.7 Distributions of samples from interesting populations
  • 36. 11/22/2005 Data analysis and causal inference 36 Many boring populations 0 Boring populations 0.7 Interesting populations
  • 37. 11/22/2005 Data analysis and causal inference 37 Many interesting populations 0 Boring populations 0.7 Interesting populations
  • 38. 12/30/2001 Data analysis and causal inference 38 Do epidemiologists study boring populations? That probability depends on how many boring populations there are. If we study 10 interesting populations and 100 boring populations with 90% power and a 5% significance level, we expect to obtain 9 interesting samples from the interesting populations and 5 from the boring populations.
  • 39. 11/22/2005 Data analysis and causal inference 39 P-values and predictive values Results: 14 interesting samples, of which 5 came from boring populations. Probability that an interesting sample came from a boring population: 5/14 = 36% – not 5%! Analogous to positive predictive value
  • 40. 4/12/2011 Data analysis and causal inference 40 Analogy to positive predictive value
        Samples                    Populations
                                   Interesting (“cases”)    Boring (“noncases”)      Total
        Interesting (“positive”)   9                        5                        14   (PV+ 64%)
        Boring (“negative”)        1                        95                       96
        Total                      10                       100                      110
                                   (with 90% sensitivity)   (with 95% specificity)
  • 41. 4/12/2011 Data analysis and causal inference 41 Meta-analysis • Literature reviews • Systematic literature reviews • Every study is an observation from a population of possible studies • The set of studies that have been published may be a biased sample from that population
  • 42. 7/1/2009 Data analysis and causal inference 42 What should guide data analysis • What are the research questions? – Estimate means (e.g., cholesterol) and prevalences (e.g., HIV) – Assess associations (e.g., Is blood lead associated with elevated blood pressure?; Do prepaid health plans provide more preventative care? Do bednets protect against malaria?)
  • 43. 11/20/2007 Data analysis and causal inference 43 Association of helmet use with death in motorcycle crashes: a matched-pair cohort study (Daniel Norvell and Peter Cummings, AJE 2002;156:483-7) • Data from the National Highway Traffic Safety Administration’s Fatality Analysis Reporting System • Exposure: helmet use; Outcome: death • Potential confounders: sex, seat position, age, state helmet law
  • 44. 11/20/2007 Data analysis and causal inference 44 Association of helmet use with death in motorcycle crashes: a matched-pair cohort study (Daniel Norvell and Peter Cummings, AJE 2002;156:483-7) • 9,222 driver-passenger pairs after exclusions • Relative risk of death for a helmeted rider was 0.65 (0.57-0.74), (0.61 adjusted for seat position) • Examined effect measure modification by seat position and by type of crash.
  • 45.
  • 46. When the proofreader takes a week off 12/29/2009, B5 Dec 2009 Close 28 10547.08 25 10520.10 24 10520.10 23 10466.44 22 10464.93 21 10414.14 18 10328.89 17 10308.26 www.google.com/finance/historical?q=INDEXDJX:.DJI Dec 22 23 24 25 28
  • 47. I hope he’s having a good break! 12/31/2009, B6 Dec 23 24 25 28 29 Dec 2009 Close 29 10545.41 28 10547.08 25 10520.10 24 10520.10 23 10466.44 22 10464.93 21 10414.14 18 10328.89 17 10308.26 www.google.com/finance/historical?q=INDEXDJX:.DJI
  • 48. 4/12/2011 Data analysis and causal inference 48 Thank you • Arigato • Asanti • Dhanyavaad • Dumela • Gracias • Merci • Obrigato • Xie xie
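
The arithmetic on slides 38 to 40 (10 interesting and 100 boring populations, 90% power, 5% significance level) can be reproduced in a few lines. The following is a minimal Python sketch; the function name and the dictionary keys are illustrative and do not come from the lecture.

```python
def interesting_sample_breakdown(n_interesting, n_boring, power, alpha):
    """Expected numbers of 'interesting' samples, split by the kind of population they came from."""
    true_pos = n_interesting * power   # interesting samples from interesting populations
    false_pos = n_boring * alpha       # interesting samples from boring populations
    total_pos = true_pos + false_pos
    return {
        "true_positives": true_pos,
        "false_positives": false_pos,
        "prob_boring_given_interesting": false_pos / total_pos,
        "positive_predictive_value": true_pos / total_pos,
    }

# Numbers taken from the slides.
result = interesting_sample_breakdown(10, 100, power=0.90, alpha=0.05)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

Changing `n_boring` shows how strongly the share of interesting samples that come from boring populations depends on how many boring populations are studied, which is the point of slide 38.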

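The slides on the data management error (13 to 16) leave the answer as an exercise. One reading, based on the variable list on slide 14, is that the CBFEMALES2 statement subtracts CST00646 (ages 35-44), whereas the parallel statements subtract the 45-64 variable (CST00647 for this group). The Python sketch below (not the SAS of the slides) illustrates one way to make that kind of slip harder: select age bands by label from the codebook instead of by variable number. The counts, the `sum_bands` helper, and the group label are hypothetical.

```python
# Codebook entries copied from slide 14: variable -> (group, age band).
CODEBOOK = {
    "CST00644": ("black female", "15-24"),
    "CST00645": ("black female", "25-34"),
    "CST00646": ("black female", "35-44"),
    "CST00647": ("black female", "45-64"),
}

def sum_bands(row, group, bands):
    """Sum the counts for one group over exactly the requested age bands."""
    wanted = [var for var, (g, band) in CODEBOOK.items()
              if g == group and band in bands]
    if len(wanted) != len(bands):
        raise ValueError(f"codebook does not cover {bands} for {group}")
    return sum(row[var] for var in wanted)

# Hypothetical counts for one county (not real census data).
row = {"CST00644": 100, "CST00645": 200, "CST00646": 300, "CST00647": 400}

# The slide's approach: add all four bands, then subtract one by variable name.
all_ages = sum(row[v] for v in ("CST00644", "CST00645", "CST00646", "CST00647"))
slide_version = all_ages - row["CST00646"]   # removes ages 35-44

# Selecting by age band instead of by variable number.
intended = sum_bands(row, "black female", ["15-24", "25-34", "35-44"])

print(slide_version, intended)  # the two differ when the wrong band is removed
```

Comparing a derived variable against an independent calculation, as in the last two assignments, is a cheap check that would flag the discrepancy.
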
Editor's Notes

  1. Xin chao, Guten Tag, willkommen, karibuni, dumela, merhaba, shalom, huan-ying, bienvenidos, boa tarde. This two-part lecture is about data analysis and causal inference.
  2. As long as we’re talking about data analysis, let’s begin with a little story about statisticians (The Physicist, the Chemist, and the Statistician, in “Science Jokes”, posted to Usenet groups by Joachim Verhagen (verhagen@fys.ruu.nl); downloaded from, Keith M. Gregg, keith.gregg@stanford.edu, www-leland.stanford.edu/~keithg/humor.shtml) “Three professors (a physicist, a chemist, and a statistician) are called in to see their dean. Just as they arrive the dean is called out of his office, leaving the three professors there. The professors see with alarm that there is a fire in the wastebasket.”
  3. “The physicist says, ‘I know what to do! We must cool down the materials until their temperature is lower than the ignition temperature and then the fire will go out.’”
  4. “The chemist says, ‘No! No! I know what to do! We must cut off the supply of oxygen so that the fire will go out due to lack of one of the reactants.’”
  5. “While the physicist and chemist debate what course to take, they both are alarmed to see the statistician running around the room starting other fires. They both scream, ‘What are you doing?’ To which the statistician replies, ‘Trying to get an adequate sample size.’”
  6. The first thing we do with data is to manage them (note that epidemiologists usually regard the word “data” as a plural word, based on its Latin root; however, other fields often consider “data” to be singular). Since epidemiologic studies tend to have many – hundreds, thousands, or even millions – of observations and often tens or hundreds of data items for each observation, managing epidemiologic data involves “mass production”. Therefore a systematic, organized, professional approach is critical for detecting and avoiding problems with the data.
  7. Data management, including careful and thorough documentation, is one of those activities like sanitation, hygiene, laundry, maintenance, and the like that are critical to health and well-being but largely underappreciated. The consequences of lapses in managing data can be far-reaching, and one can never take anything for granted. One of the more dramatic consequences of a lapse in data management was the loss of the Mars Climate Orbiter, which approached Mars at too low an altitude while entering orbit. In the investigation of the loss, it turned out that the force data reported by the Lockheed Martin engineering team had been in English units, but the navigation team at NASA had assumed that they were in metric units. And so, as Noel Hinners, vice president for flight systems at Lockheed Martin Astronautics, said, “you can never, never take anything for granted.”
  8. Without proper documentation, data may be of little if any value. For example, on the slide is an excerpt of data from the 1995 National Survey of Family Growth that my colleague Dr. Adaora Adimora and I have been analyzing to study concurrent sexual partnerships among U.S. women. Sometimes people will go to great lengths to save their data for years and years, only to find that they never had or neglected to save the documentation for it. Without the documentation, the data are, essentially, useless.
  9. The preceding lecture began with Sander Greenland’s assertion that data say nothing at all. Data consist of observer notes, respondent answers, biochemical measurements, contents of medical records, machine readable datasets, and other kinds of information from which we attempt to derive meaning. So what does one do with them? Analysis and interpretation of the data create the meaning that we ascribe to the data.
  10. The steps in data management are to: (1) design the process by which data will be collected, writing down all data collection procedures; (2) train and supervise data collectors and monitor all data collection activities; (3) document all data collection experiences so that later it will be possible to reconstruct what happened or how issues that arose were resolved; and, very importantly, (4) keep track of, document, and safeguard the data and the documentation. It may seem superfluous to remind people not to lose their data. But as I said, data management is an under-appreciated activity, so people tend to be casual about it (“I’ll back it up ‘tomorrow’”). A project I worked on back in the mainframe era nearly lost an entire year’s work because of a disk crash. The person responsible for backing up the disk, my boss, had kept putting off the task. Fortunately we had shared a copy of the files with another organization, and they had not yet recycled the tape! The American College of Epidemiology had to recreate its membership database when it was lost. In November 2002 thieves stole the hard drives from 9 personal computers in the Epidemiological and Communicable Diseases unit at the Indian Council of Medical Research in New Delhi. There was apparently no backup. So you see, epidemiologists are as fallible in this area as the rest of us are.
  11. The next steps in data management are to review, edit, and code the data forms (e.g., questionnaires, abstracts of records, notes from observations). For example, the questionnaire may have instructed respondents to “mark one response”, but you may get questionnaires where two responses are circled, or a response is marked midway between two choices, and the like. Someone needs to decide how to handle these situations and to edit the forms accordingly. Questions about how these situations were handled may well arise. So it is important to document the coding decisions, the forms that had exceptions, and the actions taken. Occasionally it may be necessary to go back and revise all of the exceptions handled in a certain way, and it is much easier to work from a list than to have to go through all of the forms again. [For example, in a multi-site project in which I participated, the data center proposed to code intermediate responses (e.g., when “2” and “3” were both circled or a mark was made between them) as the higher number, a plan which was endorsed by the Data Analysis Committee. Later, though, the principal investigator at one of the sites persuaded the Steering Committee that the responses should have been coded with fractional values (e.g., “2.5”), necessitating re-review of thousands of forms to identify the exceptions.] After the forms are edited, the data are converted to electronic form, usually by keying into a computer, sometimes by optical scanning. Increasingly interview data are captured directly by computer, through CATI (Computer-Assisted Telephone Interview), CAPI (Computer-Assisted Personal Interview), and A-CASI (Audio Computer-Assisted Self-Interview) technology.
  12. Stellman reports an unusual data error encountered in the Cancer Prevention Study II, with 1.2 million questionnaires completed during fall 1982. Data were entered and key-verified under contract. The firm typically processed 20,000 forms/week, and researchers subjected each batch to an “exhaustive battery” of computer checks. As the researchers were beginning a factor analysis of the 28-item food frequency section, however, they examined the distribution of missing values and found to their surprise and puzzlement that there appeared to be no questionnaires with exactly 8 or 18 missing food items. After an intensive investigation, including pulling a sample of the original data forms they concluded that there was a programming problem: “The contractor’s lead programmer was asked to inspect all code related to the flag in question, but could find no errors. This was an exceptionally capable individual, whose word could be accepted as final. Seeking a possible (but unlikely) flaw in our own data logging process, we examined originally delivered data tapes . . ., but these proved to be identical in content to the system files. The problem simply had to originate with the contractor. At the time we were reaching this conclusion, the programmer called back again with a sheepish tone to say she had discovered the problem in her program. After all data items had been entered, the number of missing items was subtracted from 28 and the result was tested against zero; if the numbers were equal, the first item in the series was output as the flag character and the remaining 27 were output as blanks. But the line of code with the test contained a misprint: A letter “O” had been typed instead of a zero (one of the hardest programming errors to detect). 
In the machine level language of the contractor’s computer, this mistyped instruction was still a legal one, but it gave a test result of “true” for any number of missing items that ended with the digit 8.” (859-860) Steven D. Stellman. The case of the missing eights. Am J Epidemiol 1989;129(4):857-860, http://aje.oxfordjournals.org/content/129/4/857.full.pdf
  13. 48 * get non-hispanic white population in county for 2000, first by adding 49 ages 15-24, 25-34, 35-44, and 45-64, then by excluding ages 45-64; 50 51 CWHITES=CST00609+CST00610+CST00611+CST00612; 52 CWHITES2=CWHITES-CST00612; 53 54 * get non-hispanic black population in county; 55 56 CBLACKS=CST00616+CST00617+CST00618+CST00619; 57 CBLACKS2=CBLACKS-CST00619; 58 59 * get hispanic or latino population in county; 60 61 CHISPS=CST00623+CST00624+CST00625+CST00626; 62 CHISPS2=CHISPS-CST00626; 63 (continues on next slide)
  14. CST00637 Female population white alone aged 15-24, 2000 – county CST00638 Female population white alone aged 25-34, 2000 – county CST00639 Female population white alone aged 35-44, 2000 – county CST00640Female population white alone aged 45-64, 2000 – county CST00644Female population black* alone aged 15-24, 2000 – county CST00645Female population black* alone aged 25-34, 2000 – county CST00646Female population black* alone aged 35-44, 2000 – county CST00647Female population black* alone aged 45-64, 2000 – county CST00651 Female population Hispanic* aged 15-24, 2000 – county CST00652 Female population Hispanic* aged 25-34, 2000 – county CST00653 Female population Hispanic* aged 35-44, 2000 – county CST00654 Female population Hispanic* aged 45-64, 2000 – county * Full variable name: “black or African American”, “Hispanic or Latino” (continues on next slide)
  15. * get non-hispanic white female population in county;

      CWFEMALES=CST00637+CST00638+CST00639+CST00640;
      CWFEMALES2=CWFEMALES-CST00640;

      * get non-hispanic black female population in county;

      CBFEMALES=CST00644+CST00645+CST00646+CST00647;
      CBFEMALES2=CBFEMALES-CST00646;

      * get hispanic female population in county;

      CHFEMALES=CST00651+CST00652+CST00653+CST00654;
      CHFEMALES2=CHFEMALES-CST00654;

      (continues on next slide)
  16. * get non-hispanic white female population in county;

      CWFEMALES=CST00637+CST00638+CST00639+CST00640;
      CWFEMALES2=CWFEMALES-CST00640;

      * get non-hispanic black female population in county;

      CBFEMALES=CST00644+CST00645+CST00646+CST00647;
      CBFEMALES2=CBFEMALES-CST00646;

      * get hispanic female population in county;

      CHFEMALES=CST00651+CST00652+CST00653+CST00654;
      CHFEMALES2=CHFEMALES-CST00654;
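In the statements above, CBFEMALES2 subtracts CST00646 (ages 35-44), whereas the parallel statements for white and Hispanic females subtract the 45-64 variable. A slip like this is easy to miss by eye. One way to catch it mechanically, sketched here in Python rather than SAS, is to record each derivation as data and check that the variable subtracted is always the last one summed. The dictionary layout and function name are hypothetical, and the check assumes the oldest age group is listed last, as it is in the variable listing.

```python
# Each derived variable maps to (variables summed, variable subtracted),
# transcribed from the SAS statements above.
derivations = {
    "CWFEMALES2": (["CST00637", "CST00638", "CST00639", "CST00640"], "CST00640"),
    "CBFEMALES2": (["CST00644", "CST00645", "CST00646", "CST00647"], "CST00646"),
    "CHFEMALES2": (["CST00651", "CST00652", "CST00653", "CST00654"], "CST00654"),
}

def inconsistent_derivations(derivations):
    """Return names whose subtracted variable is not the last (oldest-age) summand."""
    return [
        name
        for name, (summed, subtracted) in derivations.items()
        if subtracted != summed[-1]
    ]

flagged = inconsistent_derivations(derivations)
print(flagged)
```

A check of this kind only flags a departure from the expected pattern; whether the flagged statement is actually wrong still has to be judged against the stated intent of the code.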
  17. After or in the process of “cleaning” the data (reviewing distributions for illegal or improbable values or combinations of values, such as pregnant males), it is important to examine the data to familiarize oneself with them, to be aware of how various factors are distributed, and, always, to be on the alert for surprises and suspicious findings. Analysts inspect frequency distributions, cross-tabulations, and scatterplots to “see” the data. One worthwhile practice is to make sure that the numbers of respondents in each table are what they should be, since respondents can be lost or duplicated when datasets are merged, variables are recoded, subgroups are examined, and so forth. Summary statistics, such as means, proportions (e.g., prevalence, incidence proportions), and rates (e.g., incidence rates), are examined in the dataset as a whole and within various subgroups. Even if groups will be combined for analysis, it is good practice to look at data by gender, age group, and various other dimensions relevant to the study population and type of data. The next step is to look at associations among factors, with such measures as prevalence ratios, relative risks, odds ratios, and correlation coefficients. Graphical analysis techniques can be revealing at all of these stages.
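The advice to confirm that respondent counts are what they should be after a merge or recode can be illustrated with a small check on respondent identifiers. This is a minimal sketch that assumes each respondent has a unique ID before the operation; the function name and the ID values are made up for illustration.

```python
from collections import Counter

def compare_ids(ids_before, ids_after):
    """Report IDs lost and IDs duplicated by a merge or recode step.

    Assumes ids_before contains each respondent exactly once.
    """
    lost = sorted(set(ids_before) - set(ids_after))
    duplicated = sorted(i for i, n in Counter(ids_after).items() if n > 1)
    return lost, duplicated

# Hypothetical IDs: respondent 103 was dropped and 102 was duplicated.
before = [101, 102, 103, 104]
after = [101, 102, 102, 104]

lost, duplicated = compare_ids(before, after)
print("lost:", lost, "duplicated:", duplicated)
```

Note that the total count is 4 both before and after in this toy example, so comparing totals alone would not have revealed the problem.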
  18. A thorough exploration of the data helps to catch problems resulting from errors or lapses of some kind and also helps to identify features of the data that have implications for the formal statistical analysis (e.g., outliers, skewed distributions). For example, some statistical analysis techniques assume that variables are normally distributed. If the exploration reveals that they are not, either a transformation must be applied or statistical analysis techniques must be employed that do not make this assumption. It is generally desirable to have a written analysis plan to guide the analysis of the research questions. Even if you have a clear idea of the study question and how to proceed to examine it, it is easy to become lost in the process of scanning distributions, examining hundreds of means and proportions, and staring at screens and printouts. So write down your plan with as much specificity as you can. Usually the data analysis plan will call for a formal assessment of the crude estimates and associations, followed by estimates and associations that control for important covariables identified in your analysis plan, such as potential confounders. There are two major methods of controlling for covariables: stratified analysis and mathematical modeling.
  19. For example, the distribution of U.S. household income is not a “normal” distribution.
  20. In stratified analysis, the dataset is divided into subsets according to one or more covariables to be controlled. For example, an overall dataset might be examined within subgroups formed by gender, age group, urban-rural, smoking status, blood pressure, etc., depending upon the factors being studied. Ideally the results will be inspected within each stratum, unless there are too many strata to make that practical. Then, by averaging the estimates across sets of strata and across all of them, the analyst obtains adjusted estimates that control for the stratification variables. Age standardization, considered earlier in the course, is an example of stratified analysis. Important advantages of stratified analysis are that it usually requires fewer assumptions about the distributions of variables and their relationships, and it shows all of the data. With mathematical modeling, it is easy to miss important features of the data because they are not in view. One disadvantage of stratified analysis is that, if there are several variables to control, the number of strata becomes large very quickly. Another is that variables must be categorized in order to form the strata. Besides the work involved in categorizing the variables, categorization can reduce available precision. But looking at stratified analyses is a good idea as an accompaniment to other analysis methods, even if one does not ultimately report the stratified analysis results.
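To make "averaging the estimates across strata" concrete, here is a minimal sketch using one common weighting scheme, the Mantel-Haenszel risk ratio. The notes above do not say which weights to use, so the choice of estimator is an assumption, and the counts are entirely hypothetical, chosen so that the risk ratio is 2.0 within each stratum while the crude risk ratio is 3.0.

```python
# Each stratum: (exposed cases, exposed total, unexposed cases, unexposed total).
# Hypothetical counts for illustration only.
strata = [
    (20, 100, 40, 400),   # stratum A: risks 0.20 vs 0.10
    (160, 400, 20, 100),  # stratum B: risks 0.40 vs 0.20
]

def crude_risk_ratio(strata):
    """Risk ratio after collapsing all strata into a single table."""
    a = sum(s[0] for s in strata)
    n1 = sum(s[1] for s in strata)
    c = sum(s[2] for s in strata)
    n0 = sum(s[3] for s in strata)
    return (a / n1) / (c / n0)

def mantel_haenszel_risk_ratio(strata):
    """Weighted average of the stratum-specific risk ratios (Mantel-Haenszel)."""
    numerator = sum(a * n0 / (n1 + n0) for a, n1, c, n0 in strata)
    denominator = sum(c * n1 / (n1 + n0) for a, n1, c, n0 in strata)
    return numerator / denominator

crude = crude_risk_ratio(strata)
adjusted = mantel_haenszel_risk_ratio(strata)
print(f"crude RR = {crude:.2f}, adjusted RR = {adjusted:.2f}")
```

The gap between the crude and adjusted estimates in this toy example arises because the stratification variable is related to both exposure and outcome, which is the situation stratified analysis is meant to handle.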
  21. Mathematical modeling is the second and now most widely used method for examining associations while controlling for important covariables. With mathematical modeling, we find a way to express the outcome, such as incidence, as a mathematical function of the covariables we consider important. For example, we might express the incidence of heart disease as a function of age, smoking status, blood pressure level, cholesterol level, presence of diabetes, and so on. We usually specify the form of this function (e.g., whether we add the effect of one factor to the effect of another, multiply the effect of one factor by another, and so forth) and then tailor the function to fit the data. An analogy is choosing a dress pattern and then adjusting it to fit the person for whom the dress is being made. Fitting the model involves statistical procedures that estimate parameters that indicate the quantitative contribution to the outcome of each of the factors in the model (contingent, of course, always on the assumptions that the model was based on and the model form). The model will have been chosen so that these parameters have a useful interpretation. For example, the parameters might estimate the difference in risk of the outcome attributable to a factor or the odds ratio relating a factor to the outcome. When the parameter is a ratio, the model usually works with it on the log scale. That is why we use logistic and log-binomial regression.
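For readers who want to see the log scale written out, the two models just named have the following standard forms. The symbols are notation introduced here, not taken from the slides: p is the probability (risk) of the outcome, the x's are the covariables, and the betas are the parameters estimated when the model is fitted.

```latex
\[
\ln\!\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k
\qquad \text{(logistic model: } e^{\beta_j} \text{ is an odds ratio)}
\]
\[
\ln(p) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k
\qquad \text{(log-binomial model: } e^{\beta_j} \text{ is a risk ratio)}
\]
```

In both cases the effects add on the log scale, which means they multiply on the original scale of odds or risk.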
  22. A possible analogy to statistical analysis of data, especially inferential statistics, where the analyst attempts to draw inferences about a population from a sample of data, is the way a seamstress or tailor might approach sewing a pair of pants. In selecting a pattern, s/he looks first for one that will suit the purpose for which the pants are intended (for example, dress pants, work pants, casual pants, shorts, athletic shorts, etc.). It’s possible to sew without a pattern, but the result may not look good. Also, s/he will want to choose a pattern that will be well received when the pants are worn. One consideration is whether s/he has seen anyone wearing pants in that style. Another might be whether the pattern has been featured in a fashion magazine.
  23. So in analyzing data, the analyst looks for an available statistical model that appears to fit the situation – for example, the binomial, normal, or chi-square distribution, or the linear or logistic model. If others have used that model (i.e., that pattern, essentially) with data of the type we are dealing with, the result is more likely to be well received by our peers. Similarly, if the model has been presented in a scientific journal (perhaps that is the statistician’s equivalent of a fashion magazine) the result is likely to be well-received.
  24. Having chosen the pattern, the person sewing a pair of pants will need to select the correct size pattern or adjust the pattern to fit the person who will wear the pants. Similarly, having chosen the type of statistical model, the data analyst will select the “size” model for the data, which involves estimating the parameters that the model uses. For example, a normal distribution is a family of distributions that can be wide or narrow and can be located anywhere on the real number line. The analyst selects a specific distribution that fits the location of the data on the real number line (the mean, essentially), the dispersion of the data around its mean (standard deviation or variance), and whether the data are skewed to the right or left, and so forth.
  25. Of course, just as the seamstress will want to see the person whom the pattern is to fit, the data analyst will want to look at the data before selecting the model. For example, two distributions can have the same mean and standard deviation but differ greatly in other respects. The two distributions in the diagram both have a mean of 5, but otherwise they are very different.
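The point that a mean and standard deviation do not pin down a distribution can be checked with two small made-up datasets. Both of the lists below are hypothetical and were constructed to have a mean of 5 and a population standard deviation of 1, yet one is split into two clusters and the other is concentrated at the center with two outlying values.

```python
from collections import Counter
from statistics import mean, pstdev

# Hypothetical values chosen so both sets share the same mean and spread.
two_clusters = [4, 4, 4, 4, 6, 6, 6, 6]
peaked_with_tails = [3, 5, 5, 5, 5, 5, 5, 7]

for name, values in [("two clusters", two_clusters),
                     ("peaked with tails", peaked_with_tails)]:
    # Print the summary statistics alongside the full frequency table.
    print(name, mean(values), pstdev(values), sorted(Counter(values).items()))
```

Only the frequency tables, not the summary statistics, reveal the difference, which is why looking at the data comes before choosing the model.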
  26. Epidemiology also uses mathematical models to determine expected outcomes. Suppose we want to estimate the risk of CHD for someone age 50 years, with systolic blood pressure of 130 mmHg and serum cholesterol of 220 mg/dL, who has smoked a pack of cigarettes/day for 30 years. Graphical examples of simple linear regression: http://www.sjsu.edu/faculty/gerstman/StatPrimer/regression.pdf http://cast.massey.ac.nz/core/index.html?book=biometric (more thorough)
  27. Regression models are the kind most often used in epidemiology. With a regression model we begin with a concept of how risk of the outcome relates to a set of risk factors.
  28. Each risk factor will need a multiplier (a “coefficient”) to translate its value into a risk-equivalent.
  29. Then we use the data and statistical techniques (regression analysis) to estimate the most likely values of the coefficients. That process is called “fitting the model”.
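As a minimal sketch of what "fitting the model" means in the simplest case, the code below estimates the intercept and the single coefficient of a straight-line model by ordinary least squares. The data points are hypothetical and the function name is invented for illustration; models used in epidemiology typically involve several risk factors and other fitting methods, such as maximum likelihood for logistic regression.

```python
def fit_simple_linear_regression(x, y):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sum of squares around the means.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data: x is a risk factor level, y is the observed outcome.
x = [1, 2, 3, 4]
y = [2, 3, 5, 6]

intercept, slope = fit_simple_linear_regression(x, y)
print(f"intercept = {intercept:.2f}, coefficient = {slope:.2f}")
```

The fitted coefficient is the multiplier that translates a one-unit change in the risk factor into a change in the predicted outcome.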
  30. P-values are ubiquitous in health research, though they are widely misunderstood as well. Here is an attempt to convey an intuitive sense of how they work. A p-value might be regarded as the probability of obtaining an interesting-looking sample from a boring population. We know that even if nothing is going on in a population, a particular sample just might appear to have an intriguing association. The p-value is computed to tell us the probability of obtaining an unusual sample even when no such association exists in the underlying population. Statistical power is in some respects the inverse of the p-value. Statistical power is the probability of obtaining an interesting-looking sample from an interesting population. Both the p-value and statistical power are probabilities of obtaining an interesting sample (i.e., one with an association of interest). We know that, by the vagaries of random sampling, a particular sample might not represent the population well. So the p-value and power tell us how likely it is that an interesting sample would arise in these two very different situations: the situation where there is no association in the population and the situation where there is an association in the population.
  31. The problem with p-values is not so much what they try to do as how we try to interpret them. This slide shows two possible populations. The one on the left is the boring one – it has no association. The one on the right is the interesting one – in this population there is an association we are interested in detecting. Suppose I conduct a sample survey – a cross-sectional study with a sample of people randomly selected from the population. I will use the OR as the measure of association, and to make the situation easier to diagram, I am going to show the OR on the log scale (the distribution of the log of the OR is symmetrical, whereas that of the OR is not). To orient you, an OR of 1.0 has a natural log of zero and would correspond to a “boring” population. An OR of 2.0 has a natural log of 0.7; an OR of 1.65 has a natural log of 0.5. It’s just a 1-for-1 transformation. [We will write the natural logarithm of the OR as ln(OR).] So now I draw a sample, compute the ln(OR), and the result happens to be, say, 0.5. That’s an OR of 1.65, represented by the vertical blue line. I do not know what the population really looks like, so I consider the possibilities. One possibility is that there is no association in the population – i.e., it’s boring (that’s the one on the left), and the true value of the association is ln(OR)=0, or the OR=1.0. Another possibility is that the population is not a boring one – it might, for example, be the interesting one on the right, where the true value of the association is ln(OR)=0.7, in other words, OR=2.0. We would like to know the probability that the sample I obtained came from one of the interesting possible populations rather than from the boring population.
  32. So we would like to know the probability that our sample came from an interesting population. But it’s much easier to figure out the reverse – the probability that a particular population would give rise to my sample. So the p-value is designed to provide the probability that, given the size of my study, the boring population would produce a sample as interesting as (or more interesting than) the one I obtained. In the diagram, the normal-looking curve on the left shows the distribution of values of the ln(OR) that would be observed if I repeated my study a large number of times in a boring population. Most of the time the sample I obtain would be boring, but sometimes it would be interesting. The pink area in the right tail of the graph shows the proportion of times that the sample I obtain would have a ln(OR) of 0.5 or greater. The pink area in the left tail of the distribution shows the probability that I would observe a ln(OR) of -0.5 or lower, which for now let’s think of as equally interesting. This is not quite the information I wanted to know, but it is nevertheless useful.
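The two pink tail areas can be computed directly if we treat the ln(OR) from repeated samples of a boring population as approximately normal around zero. The sketch below does this for the observed ln(OR) of 0.5. The standard error of 0.25 is a purely hypothetical value chosen for illustration (in practice it depends on the study size), and the function name is invented.

```python
import math

def two_sided_p_value(estimate, standard_error):
    """Probability, under a normal null centered at 0, of a value at least as extreme."""
    z = estimate / standard_error
    # erfc(|z| / sqrt(2)) equals the combined area of both tails beyond |z|.
    return math.erfc(abs(z) / math.sqrt(2))

# The log transformation used in the notes: OR 2.0 -> about 0.7, OR 1.65 -> about 0.5.
print(round(math.log(2.0), 1), round(math.log(1.65), 1))

ln_or_observed = 0.5
assumed_se = 0.25  # hypothetical standard error, not from the slides
p = two_sided_p_value(ln_or_observed, assumed_se)
print(f"two-sided p-value = {p:.3f}")  # roughly 0.046 with these assumed inputs
```

With a larger assumed standard error (a smaller study), the same observed ln(OR) of 0.5 would give a larger p-value, because a boring population would then produce such samples more often.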
  33. So the p-value tells me the probability that the boring population on the left would yield a sample as interesting or more so than the one I obtained. But the p-value does not tell me the probability that what I observed actually came from that boring population, i.e., that the association was due only to chance. The reason is that that probability – the probability that the sample I obtained came from the boring population – depends on how many boring populations I study and how many interesting populations I study.
  34. For example, if I study only boring populations, the probability that my samples come from boring populations is 100%. Even when by chance I observe an interesting sample, as I will from time to time, if I study only boring populations, then that sample must have come from one.
  35. In contrast, if I study only interesting populations, then all of my samples must come from interesting populations – even when by chance, as will happen, I get a boring sample (in other words, one that does not show an association).
  36. If I have been rather unsuccessful in identifying worthwhile hypotheses to test, most of the populations I am studying are boring. Thus, even when the p-value is less than 0.05, there is a substantial probability that the sample simply represents an atypical sample from a boring population.
  37. On the other hand, if you have been very successful in identifying worthwhile hypotheses to test, then most of the populations you study are interesting. Even if the p-value for a particular sample is greater than 0.05, there is a substantial probability that the sample simply represents an atypical sample from an interesting population.
  38. So the probabilities that a given sample we obtain came from a boring or an interesting population depend on the relative proportions of boring and interesting populations that we study – information that’s generally not possible to know. Suppose that every epidemiologist studies 10 interesting populations (or 10 true associations) and 100 boring populations (or 100 non-existent associations). If the statistical power (probability of obtaining an interesting sample from an interesting population) is 90%, then we expect that epidemiologists will obtain, on average, 9 interesting samples from the 10 interesting populations. Similarly, if our criterion for an “interesting sample” is a p-value less than 5% (that’s from our 5% significance level), then we expect epidemiologists to obtain, on average, 5 interesting samples from the 100 boring populations.
  39. So these epidemiologists have observed, on average, 14 interesting samples, 5 of which came from boring populations. All these interesting samples had a p-value less than 5%, by our definition of an interesting sample. But the probability that a given interesting sample came from a boring population is 5/14 = 36%, not 5%! You may be noticing a similarity to the concept of positive predictive value that we studied in the lecture on population screening. Indeed, just as the predictive value of a positive screening test depends especially on the prevalence of the condition for which we are screening, the predictive value of a “significant” association depends on the proportion of interesting populations under study. In that analogy, statistical power corresponds to sensitivity – the probability of observing a real association when there actually is one (i.e., classifying an interesting population as an “interesting” one). The significance level (alpha, the cutpoint for deciding what is a “significant” p-value) corresponds to the false positive rate (1 minus the specificity), the probability of classifying a boring population as an interesting one.
  40. The table on the slide displays the numbers from the previous example, in the form we used for evaluating screening tests: sensitivity of 9/10 (90%), specificity of 95/100 (from the false positive rate [significance level] of 5/100), and PPV of 9/14. So now you know that a p-value does not tell you the probability that a given result is due to chance (i.e., comes from a boring population) and that a “significant finding (p<0.05)” does not tell us that there is less than a 5% probability that the results were due to chance. We have to interpret a “significant” finding the way we would a positive result from a screening test with that false positive rate. Setting a more stringent significance level (e.g., p-values < 0.01) reduces the false positive rate (increases specificity), which increases the probability that a “significant” finding was not due to chance (i.e., that a “significant” finding does come from an interesting population). But the actual probability depends on the proportion of interesting populations being studied as well as the significance level, just as positive predictive value depends upon disease prevalence and specificity.
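The arithmetic of this example is easy to reproduce and to vary. The sketch below uses the numbers given above (10 interesting populations, 100 boring populations, 90% power, 5% significance level); the function name is invented for illustration.

```python
def significant_finding_ppv(n_interesting, n_boring, power, alpha):
    """Return expected (true positives, false positives, predictive value)."""
    true_positives = n_interesting * power   # interesting samples from interesting populations
    false_positives = n_boring * alpha       # interesting samples from boring populations
    ppv = true_positives / (true_positives + false_positives)
    return true_positives, false_positives, ppv

tp, fp, ppv = significant_finding_ppv(10, 100, power=0.90, alpha=0.05)
print(f"true positives = {tp:.0f}, false positives = {fp:.0f}")
print(f"P(interesting | significant) = {ppv:.0%}")
print(f"P(boring | significant)      = {1 - ppv:.0%}")

# A more stringent significance level raises the predictive value,
# provided power is held at 90% (which in practice needs a larger study).
_, _, ppv_strict = significant_finding_ppv(10, 100, power=0.90, alpha=0.01)
print(f"with alpha = 0.01: {ppv_strict:.0%}")
```

Changing the first two arguments shows how strongly the result depends on the mix of interesting and boring populations, the quantity that is generally not known.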
  41. I hope that the preceding discussion assists you in interpreting p-values you encounter in the literature. Let’s return now to the broader strategy of data analysis and interpretation, particularly attempts to infer causation from epidemiologic data. The analysis is directed by the research questions. One category of research question concerns the distribution of variables of interest. For example, we might be interested in conducting a study to estimate the distribution of serum cholesterol or blood lead levels in a population, or the prevalence of HIV or of use of well water. Another category of research question involves associations, such as whether blood lead level is associated with elevated blood pressure, whether prepaid health plans provide more preventive care than fee-for-service plans, or whether bednets protect against malaria.
  42. Here is an example of data analysis in a study with a causal hypothesis: “does motorcycle helmet use reduce risk of death?” Daniel Norvell and Peter Cummings (American Journal of Epidemiology 2002;156:483-7) used data from the National Highway Traffic Safety Administration’s Fatality Analysis Reporting System, which collects information for all crashes on US public roads in which a fatality occurs. The primary exposure was helmet use; the primary outcome was death. As you recall, the causal comparison that we would like to make contrasts the risk of death to motorcycle riders and passengers wearing helmets with the risk of death to the same riders and passengers not wearing helmets. Since that comparison involves a counterfactual, we use a substitute population. What should that substitute be? If we compare death risks for helmeted riders with death risks for unhelmeted riders, we would certainly be concerned about differences between riders who wear helmets and riders who do not in regard to driving behavior and crash characteristics. The authors circumvented this concern to some extent by comparing the death risk of the driver and passenger on the same motorcycle. That comparison tends to equalize driver and crash-related factors. The authors also identified and controlled for a number of potential confounders: sex, seat position, age, and presence of a state helmet law (since a law requiring helmet use might lead crash survivors to report helmet use falsely, which would make helmets appear to be more protective, because only the survivors are able to report use).
  43. The dataset included 9,222 driver-passenger pairs after exclusions. The primary analysis found a crude relative risk of 0.65 (95% confidence interval 0.57-0.74). When the association was adjusted for seat position, the relative risk estimate strengthened slightly, to 0.61. Note that an overall measure of association, whether crude or adjusted, does not tell us whether the association is the same in various important groups. Whether or not a factor, such as seat position, is a confounder, it can define groups in which the association being measured is stronger or weaker (or absent). The authors investigated the possibility of effect measure modification by seat position and found a small difference: an adjusted relative risk of death of 0.65 for helmeted compared to unhelmeted drivers, and a slightly stronger association of 0.58 for helmeted versus unhelmeted passengers, though the confidence intervals overlapped considerably. (The authors also tried examining both seat position and sex simultaneously, but the two factors were very strongly related: 97.4% of the women were passengers.) However, whether or not the crash involved a collision was indeed a powerful effect modifier. In the 88% of crashes involving a collision with a vehicle or object, the adjusted relative risk of death was 0.65 for a helmeted rider. By contrast, in crashes in which there was no collision (skidding, turning over), the adjusted relative risk was 0.36, so that helmet use appeared to be much more protective for non-collision crashes.