The document describes a joke about a physicist, a chemist, and a statistician finding a fire in a wastebasket. The physicist proposes cooling the materials to lower their temperature below the ignition point. The chemist proposes cutting off the oxygen supply to extinguish the fire. The statistician then starts other fires around the room to obtain an adequate sample size, alarming the other two. The document also provides information about data analysis and causal inference, including steps in data management, data processing, exploration, and focused analysis using methods like stratification and mathematical modeling.
Bill Howe discussed emerging topics in responsible data science for the next decade. He described how the field will focus more on what should be done with data rather than just what can be done. Specifically, he talked about incorporating societal constraints like fairness, transparency and ethics into algorithmic decision making. He provided examples of unfair outcomes from existing algorithms and discussed approaches to measure and achieve fairness. Finally, he discussed the need for reproducibility in science and potential techniques for more automatic scientific claim checking and deep data curation.
This document discusses the responsible use of data science techniques and technologies. It describes data science as answering questions using large, noisy, and heterogeneous datasets that were collected for unrelated purposes. It raises concerns about the irresponsible use of data science, such as algorithms amplifying biases in data. The work of the DataLab group at the University of Washington is presented, which aims to address these issues by developing techniques to balance predictive accuracy with fairness, increase data sharing while protecting privacy, and ensure transparency in datasets and methods.
This document discusses the importance of properly analyzing and visualizing data when conducting statistical tests and reporting results. It recommends displaying raw data through dot plots instead of bar graphs to avoid concealing variance. The document discusses how the mean may not always be the best descriptor of data and how providing confidence intervals around measures provides important context about uncertainty. It also emphasizes choosing statistical tests wisely based on the characteristics of the data and justifying choices. Overall, the document stresses the importance of exploring data visually and using appropriate analyses and reporting to avoid drawing incorrect conclusions.
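The recommendation above to report uncertainty, not just a mean, can be sketched numerically. The following is a minimal example (standard library only); the function name `confidence_interval` and the group data are illustrative assumptions, and it uses a normal approximation where a t-quantile would be more appropriate for very small samples:

```python
from statistics import mean, stdev, NormalDist
from math import sqrt

def confidence_interval(data, level=0.95):
    """Return (mean, lower, upper) using a normal approximation.
    For small samples, a t-quantile would widen the interval slightly."""
    m = mean(data)
    se = stdev(data) / sqrt(len(data))          # standard error of the mean
    z = NormalDist().inv_cdf(0.5 + level / 2)   # e.g. ~1.96 for 95%
    return m, m - z * se, m + z * se

# Two hypothetical groups with similar means but very different spread;
# a bar graph of the means alone would conceal this difference,
# while a dot plot with intervals would reveal it.
group_a = [4.8, 5.0, 5.1, 4.9, 5.2, 5.0]
group_b = [2.0, 8.1, 5.0, 7.9, 1.9, 5.1]

for name, data in [("A", group_a), ("B", group_b)]:
    m, lo, hi = confidence_interval(data)
    print(f"group {name}: mean={m:.2f}, 95% CI=({lo:.2f}, {hi:.2f})")
```

Group B's much wider interval is exactly the context a bare bar of the mean would hide.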
Data Science is an interdisciplinary approach that combines computational science, statistics, and domain knowledge to extract meaningful insights from large and complex data. It aims to address challenges posed by the data revolution characterized by big data from diverse sources. There is no single agreed-upon definition, but most definitions emphasize applying techniques from computer science, statistics, and the relevant domain area to discover patterns, make predictions, and support decision making from data. Key aspects include developing appropriate methodologies for knowledge discovery, forecasting, and decision making using large and diverse data from sources like surveys, social media, sensors, and more. The integration of domain knowledge representation with computational and statistical tools is seen as an important novelty that can enhance data analysis and interpretation.
The document discusses teaching data ethics in data science education. It provides context about the eScience Institute and a data science MOOC. It then presents a vignette on teaching data ethics using the example of an alcohol study conducted in Barrow, Alaska in 1979. The study had methodological and ethical issues in how it presented results to the community. The document concludes by discussing incorporating data ethics into all of the Institute's data science programs and initiatives like automated data curation and analyzing scientific literature visuals.
Data Curation and Debugging for Data-Centric AI - Paul Groth
It is increasingly recognized that data is a central challenge for AI systems - whether training an entirely new model, discovering data for a model, or applying an existing model to new data. Given this centrality of data, there is need to provide new tools that are able to help data teams create, curate and debug datasets in the context of complex machine learning pipelines. In this talk, I outline the underlying challenges for data debugging and curation in these environments. I then discuss our recent research that both takes advantage of ML to improve datasets but also uses core database techniques for debugging in such complex ML pipelines.
Presented at DBML 2022 at ICDE - https://www.wis.ewi.tudelft.nl/dbml2022
This document discusses the importance of statistics in astronomical research. It notes that while astronomers are well-trained in physics, many are not well-versed in statistical methodology and often misapply statistical methods. The document outlines the talk, covering the history of astronomy and statistics, current issues, and recommended steps for proper statistical analysis of scientific data. It emphasizes that modern statistical tools and computing environments like R can help astronomers better analyze the huge datasets now available and derive deeper scientific insights.
What is the reproducibility crisis in science and what can we do about it? - Dorothy Bishop
Talk given to the Rhodes Biomedical Association, 4th May 2016.
For references see: http://www.slideshare.net/deevybishop/references-on-reproducibility-crisis-in-science-by-dvm-bishop
This document provides an introduction to biostatistics. It defines biostatistics and explains its importance in biomedical research. Some key points covered include:
- Biostatistics is the application of statistics to medicine and health sciences. It involves the collection, organization, and analysis of numerical data.
- Understanding biostatistics is important for medical research, updating medical knowledge, and managing data and treatment.
- The document outlines the basic concepts of biostatistics like population and sample, and the different types of data. It also describes the typical steps involved in a research project and how biostatistics can be applied.
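The population/sample distinction outlined above can be illustrated with a short simulation. This is a sketch with simulated blood-pressure values, not real patient data; the population size, mean, and spread are all illustrative assumptions:

```python
import random
from statistics import mean

# Hypothetical "population": systolic blood pressure (mmHg) for 1,000
# patients, simulated from a normal distribution (not real data).
random.seed(42)
population = [random.gauss(120, 15) for _ in range(1000)]

# In practice we rarely observe the whole population; biostatistics
# draws a sample and uses it to estimate population quantities.
sample = random.sample(population, 50)

print(f"population mean: {mean(population):.1f}")
print(f"sample mean:     {mean(sample):.1f}")
```

The sample mean approximates the population mean, and the discrepancy between the two is what sampling theory and confidence intervals are built to quantify.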
Table of Contents - 16304_TTLX_Walker.indd 1 8312 1152.docx - mattinsonjanel
This document provides a table of contents for a book on statistics in criminology and criminal justice. It lists chapter titles and page numbers. It also includes copyright information, publishing details, and production credits for the book. The summary focuses on the purpose and key details rather than copying significant content.
The Emerging Discipline of Data Science: Principles and Techniques for Data-Intensive Analysis, Keynote, 2nd Swiss Workshop on Data Science – SDS|2015, Winterthur, Switzerland, 12 June 2015
Abstract and other presentations at: http://michaelbrodie.com/?page_id=17
Dichotomania and other challenges for the collaborating biostatistician - Laure Wynants
Conference presentation at ISCB 41 in the session "Biostatistical inference in practice: moving beyond false dichotomies".
A comment in Nature, signed by over 800 researchers, called for the scientific community to "retire statistical significance". The responses included a call to halt the use of the term "statistically significant", and changes in journals' author guidelines. The leading discourse among statisticians is that inadequate statistical training of clinical researchers and publishing practices are to blame for the misuse of statistical testing. In this presentation, we search our collective conscience by reviewing ethical guidelines for statisticians in light of the p-value crisis, examine what this implies for us when conducting analyses in collaborative work and teaching, and ask whether the ATOM principles (accept uncertainty; be thoughtful, open, and modest) can guide us.
Nicholas Jewell MedicReS World Congress 2014 - MedicReS
Teaching Medical Research Methodology: All modern medical and public health research now requires a considerable amount of biostatistics, computer science, data processing, and machine learning: Data Science.
Interactive Visualization Systems and Data Integration Methods for Supporting... - Don Pellegrino
This thesis explored developing new interactive visualization systems and data integration methods to support discovery in collections of scientific information. It addressed challenges of existing methods to support overviews and exploration as the volume of data increases. The work involved instantiating graph structures from real-world datasets, developing interactive visualizations, and using quantitative and semantic guidance to explore connections. It evaluated the methods on datasets from VAST challenges, open notebook science, and Pfizer drug discovery to demonstrate feasibility and identify future work opportunities at larger scales with these approaches.
Human resources, section 2b - Textbook on Public Health and Community Medicine - Prabir Chatterjee
Statistics are used extensively in public health and community medicine. Statistical methods allow public health administrators to understand population health trends and identify health issues at both the community and individual level. Descriptive statistics are used to summarize and present data in a meaningful way through tables, graphs, and summary measures. Inferential statistics are then used to draw conclusions and make decisions based on analyzing samples from the overall population. The appropriate use of statistics is important for public health planning, research, and evaluating health programs and treatments.
This document discusses issues with reproducibility in scientific research. It provides examples of studies that could not be reproduced, including a case where only 6 out of 53 landmark cancer studies could be validated. It advocates for more transparency through open data, open access, and open source policies to improve reproducibility and rebuild trust in science. Open and reproducible research practices like open notebook science are presented as ways to achieve faster, more reliable science.
D. G. Mayo: Your data-driven claims must still be probed severely - jemille6
In the session "Philosophy of Science and the New Paradigm of Data-Driven Science at the American Statistical Association Conference on Statistical Learning and Data Science/Nonparametric Statistics
This document appears to be a presentation on finding and understanding statistical sources. It discusses what statistics are, provides examples of data versus statistics, and covers the key elements of statistics including the unit of observation, space, and time period. It also addresses evaluating statistical sources and searching for statistics from various potential sources like government agencies, research organizations, and publications. The presentation aims to help students learn how to properly understand and evaluate statistical information found in their research.
Shelley Hurwitz MedicReS World Congress 2014 - MedicReS
Biostatistics and Ethics - Shelley Hurwitz, PhD, Brigham and Women's Hospital, Harvard Medical School; Fellow, American Statistical Association; Advisory Board on Ethics, International Statistical Institute
This document appears to be a presentation given by Tom Johnson at the Esri Health Conference in Scottsdale, Arizona on August 28, 2012. The presentation discusses how data and maps inform each other, with data being used to create maps and maps then guiding the collection of additional data. It also outlines four potential types of data/analytic variables that can be studied for any phenomenon: qualitative, quantitative, geographic, and timeline of change. The presentation argues that addressing complex health issues will require transdisciplinary collaboration and going beyond the traditional three-phase process of data in, analysis, and information out.
From Replication Crisis to Credibility Revolution - Koki Ikeda
The document discusses issues related to the replication crisis in psychology and potential solutions. It notes that questionable research practices like p-hacking and HARKing are common but unintentional. Solutions proposed include transparency through open science, pre-registration of studies including pre-reviews, direct replication of studies, and higher evidentiary standards. Institutional changes are also needed to incentivize practices like pre-registration and increasing acceptance of replication studies. While rigorous methods may initially lower productivity, they can increase it long-term by allowing easier reuse of materials and data and identifying reliable findings sooner.
This document provides an overview and introduction to an economics statistics course. It discusses key topics that will be covered in the course, including:
- Descriptive and inferential statistics
- Probability theory as the bridge between descriptive and inferential statistics
- The process of statistical investigation from designing experiments/surveys to making inferences and assessing reliability
- Examples of how statistics is used to analyze data and make decisions in various fields like government, business, and research.
DataONE Education Module 01: Why Data Management? - DataONE
Lesson 1 in a set of 10 created by DataONE on Best Practices for Data Management. The full module can be downloaded from the DataONE.org website at: http://www.dataone.org/educaiton-modules. Released under a CC0 license; attribution and citation requested.
- Data monitoring committees (DMCs) emerged in the 1970s to periodically review accumulating clinical trial data and monitor safety and efficacy. They issue judgments to stop or continue trials.
- DMCs have become reflexive institutions that engage in endogenous critical inquiry. They focus on issues like membership qualifications and attributes, and how much trials can be redefined based on interim data.
- Guidelines have been issued to provide standards around DMC roles, structures, and decision-making processes. There is debate around how independent DMCs should be and what level of access they should have to trial data and ability to influence trial parameters.
This chapter introduces communicable diseases and their epidemiology in Ethiopia. It defines key epidemiological terms used to describe diseases. Communicable diseases pose a major health burden in Ethiopia. Many factors contribute to their transmission, including poverty, poor sanitation and lack of access to health care. The major communicable diseases affecting Ethiopia are described.
This document is a manual published by the World Health Organization in 1997 on vector control methods for use by individuals and communities. It contains 10 chapters that describe the biology, public health importance, and control measures for various disease vectors, including mosquitoes, tsetse flies, triatomine bugs, fleas, lice, ticks, mites, cockroaches, houseflies, freshwater snails, and cyclops. For each vector, the manual provides details on its life cycle, disease transmission, and recommends methods for personal protection as well as community-based control strategies.
The 10-step approach to outbreak investigations involves:
1) Identifying an investigation team and resources.
2) Establishing the existence of an outbreak.
3) Verifying the diagnosis, constructing a case definition, and finding cases systematically.
Descriptive epidemiology is then used to develop hypotheses, which are evaluated through additional studies if needed, before implementing control measures, communicating findings, and maintaining surveillance to confirm the outbreak has ended. Being systematic and following these steps is key to determining the source and controlling outbreaks.
According to a new assessment by the UN Food and Agriculture Organization and the Famine Early Warning Systems Network, around 731,000 Somalis face acute food insecurity and 2.3 million more are at risk. This brings the total number of people in need of humanitarian assistance to 3 million. Malnutrition rates remain high, with nearly 203,000 children acutely malnourished. The humanitarian situation has improved in some areas due to above-average rainfall and increased aid, but concerns remain for 2015. The humanitarian response plan requests $863 million to address ongoing needs and prevent a major crisis from undoing Somalia's recent peace- and state-building progress.
This document discusses communicable diseases. It defines communicable diseases as diseases that can spread from one person to another through various modes of transmission like air, water, food, or contact. Some common communicable diseases mentioned include influenza, polio, typhoid, measles, mumps, chickenpox, tuberculosis, and AIDS. It also discusses immunity and how the body develops immunity to diseases either naturally after suffering from an illness or artificially through vaccination. Preventing the spread of communicable diseases requires measures like maintaining hygiene, immunization, and promptly treating illnesses.
This document outlines the Canadian Nurses Association's position on primary health care. The association believes primary health care is integral to improving health outcomes for Canadians and that its principles, such as accessibility, health promotion, and intersectoral collaboration, are the most effective way to provide equitable healthcare. The CNA also believes primary health care and nursing are closely connected, and that nursing standards and education should be grounded in primary health care principles. Adopting a primary health care approach could help address rising healthcare costs and improve Canada's performance on health indicators relative to other countries.
This document provides an overview of general nutrition concepts. It defines key terms like food, nutrition, diet, and malnutrition. It outlines the six major nutrients - carbohydrates, proteins, fats, vitamins, minerals, and water. The document discusses dietary guidelines and food groups. It explains that human beings need food to provide energy for essential physiological functions like respiration, circulation, digestion, metabolism, maintaining body temperature, growth, and repair of tissues. The most vulnerable groups who require adequate nutrition are infants, young children, pregnant women, and lactating mothers.
The document outlines a road map to accelerate HIV prevention efforts to meet the global target of reducing new HIV infections by 75% by 2020. It finds that while progress has been made, declines in new infections have been too slow: there were still 1.7 million new infections in 2016, only an 11% decline since 2010. Of 25 focus countries, only 3 saw declines of over 30%, while 8 had no decline or saw increases. No country met the 2015 target of a 50% reduction. Faster progress is needed to avoid increased treatment costs and continued mother-to-child transmission. The road map proposes intensified prevention programs, especially for adolescent girls, young women, and key populations.
This document discusses the key ethical issues that arise in public health surveillance programs. It begins with a brief history of public health surveillance and definitions of key terms. The main ethical problem discussed is the potential conflict between individual interests/rights and collective interests. While clinical ethics focuses on individual physician-patient relationships, public health ethics must consider the broader community. Some argue the ethics of public health and clinical practice are distinctly different given this shift from individual to collective interests. The document examines how tools and checklists can help evaluate the ethical acceptability of surveillance programs.
This document provides an overview of planning and management for health extension workers. It defines management as a process of reaching organizational goals through people and resources. The key functions of management are planning, organizing, staffing, directing, and controlling. Planning involves setting objectives and strategies, while evaluation assesses progress towards objectives. Communication and decision-making are also integral to the management process. Effective management applies principles like management by objectives and learning from experience. The roles of administration and management are also distinguished, with administration focusing more on policy and management on execution.
The document provides recommendations for surveillance of acute viral hepatitis. It defines clinical and laboratory criteria for diagnosing hepatitis A, B, and non-A/non-B. Surveillance is recommended to guide control measures such as ensuring blood and injection safety and immunization programs. Countries should monitor cases of acute jaundice and elevated liver enzymes to detect hepatitis outbreaks and evaluate prevention programs. Standardized case definitions and laboratory tests are important for comparable surveillance data.
This document provides an introduction to a module on the Expanded Program on Immunization (EPI) in Ethiopia. The module aims to train health center teams and other health professionals to increase immunization coverage and reduce morbidity and mortality from six childhood diseases. Despite initiatives over the years, immunization coverage remains low in Ethiopia due to factors like lack of transportation, ineffective cold chains, shortage of trained staff, poor collaboration, and inadequate community involvement. The module seeks to address this through training and bringing about significant changes in EPI coverage.
This document provides a handbook on water programming published by UNICEF in 1999. It aims to guide field professionals in implementing UNICEF's water, environment and sanitation strategies. The handbook covers topics such as water and sustainable development, community participation and management, cost effectiveness, appropriate water technologies, and maintenance of water supply systems. It emphasizes the importance of community-based management of water resources, cost-effective solutions, and involvement from all levels of government and communities in water sector issues.
This document provides an introduction to the Somali PHAST Step-by-Step Guide, which uses participatory methods to help communities improve hygiene behaviors, prevent diarrheal diseases, and encourage community management of water and sanitation facilities. The guide contains 7 steps to take communities through developing a plan for preventing diarrheal diseases. Section 2 provides background concepts, defining hygiene, sanitation, the link between the two, and that hygiene and sanitation promotion requires more than just asking people to change - it requires understanding disease transmission and being motivated to promote positive behaviors.
The development of this lecture note for training Health Extension Workers was an arduous assignment undertaken by Dr. Meseret Yazachew and Dr. Yihenew Alem at Jimma University.
This document was developed with inputs from many institutions and experts. Several individuals deserve special mention. Mary Arimond, Kathryn Dewey and Marie Ruel developed the analytical framework and provided technical oversight throughout the project. Eunyong Chung and Anne Swindale provided technical support. Nita Bhandari, Roberta Cohen, Hilary Creed de Kanashiro, Christine Hotz, Mourad Moursi, Helena Pachon and Cecilia C. Santos-Acuin conducted analysis of data sets. Chessa Lutter coordinated a working group to update the breastfeeding indicators. Mary Arimond and Megan Deitchler coordinated the working group that developed the Operational Guide on measurement issues which is a companion to this document. Bernadette Daelmans and José Martines coordinated the project throughout its phases. Participants in the consensus meetings held in Geneva 3–4 October 2006 and in Washington, DC 6–8 November 2007 provided invaluable inputs to formulate the recommendations put forward in this document.
POLICY MAKING PROCESS
Policy
• A statement of intent for achieving an objective; a deliberate statement aimed at achieving a specific objective.
• Policies are formulated by the Government in order to provide a guideline for attaining certain objectives for the benefit of the people.
Importance and objective of any policy
• To solve existing challenges/problems in any society.
• Used as a tool to safeguard and ensure better services to members of the society.
Reasons for formulating a policy
• Reforms (socio-economic, technological advancements, etc.) within and outside the country.
This document describes a case-control study conducted to determine the reason for many students failing an exam. The study found that students who did not attend lectures had an 80 times higher chance of failing compared to students who did attend, and that this result was statistically significant with a p-value less than 0.05, suggesting not attending lectures was the likely cause of failure.
Aim of nutritional assessment
• To identify nutritional problems of the community
• To find the underlying causes of malnutrition
• To plan and implement control of malnutrition
• To maintain good nutrition in the community
Ancylostomiasis, or hookworm infection, is an important global public health problem caused by parasitic hookworms that infect humans. It is transmitted when larvae penetrate the skin and enter the body, usually through walking barefoot on contaminated soil. In Libya, hookworm infection is very rare, with most cases found in farmers who come into contact with infected feces in soil. The hookworms live in the intestine and feed on blood, potentially causing iron deficiency anemia and related health issues if left untreated. Prevention relies on sanitary disposal of human waste and health education to avoid transmission.
1. 4/12/2011 Data analysis and causal inference 1
Data analysis and causal inference – 1
Victor J. Schoenbach, PhD home page
Department of Epidemiology
Gillings School of Global Public Health
University of North Carolina at Chapel Hill
www.unc.edu/epid600/
Principles of Epidemiology for Public Health (EPID600)
2. 12/30/2001 Data analysis and causal inference 2
The Physicist, the Chemist, and the Statistician
From “Science Jokes”, posted to Usenet groups by Joachim Verhagen
(verhagen@fys.ruu.nl); downloaded from Keith M. Gregg (keith.gregg@stanford.edu),
www-leland.stanford.edu/~keithg/humor.shtml
“Three professors (a physicist, a chemist,
and a statistician) are called in to see their
dean. Just as they arrive the dean is called
out of his office, leaving the three professors
there. The professors see with alarm that
there is a fire in the wastebasket.
3. 12/30/2001 Data analysis and causal inference 3
The Physicist, the Chemist, and the Statistician
“The physicist says, ‘I know what to do! We
must cool down the materials until their
temperature is lower than the ignition
temperature and then the fire will go out.’
4. 12/30/2001 Data analysis and causal inference 4
The Physicist, the Chemist, and the Statistician
“The chemist says, ‘No! No! I know what to
do! We must cut off the supply of oxygen so
that the fire will go out due to lack of one of
the reactants.’
5. 12/30/2001 Data analysis and causal inference 5
The Physicist, the Chemist, and the Statistician
“While the physicist and chemist debate
what course to take, they both are alarmed
to see the statistician running around the
room starting other fires. They both scream,
‘What are you doing?’
To which the statistician replies, ‘Trying to
get an adequate sample size.’”
6. 12/30/2001 Data analysis and causal inference 6
Data management
• Managing epidemiologic data is “mass
production”
• A systematic, organized, professional
approach is critical for detecting and
avoiding problems
7. 12/30/2001 Data analysis and causal inference 7
“You can never, never take
anything for granted.”
Noel Hinners, vice president for flight
systems at Lockheed Martin Astronautics,
whose engineering team reported
measurements in English units that the
Mars Climate Orbiter navigation team
assumed were metric units.
8. 12/30/2001 Data analysis and causal inference 8
Without the documentation, the data may be
of little if any value (1995 NSFG)
00000000000003122222222402143041000
00000000000001144112131 070520310
00000000000003233112131 072331040
000000000000011163322227070350110
00000000000003133022221 02451121000
00000000000001111112131 02110041000
00000000000002111112131 07307131000
00000000000002122112131 01073041000
9. 12/30/2001 Data analysis and causal inference 9
Data analysis and causal inference
• “Our data say nothing at all.”
(Epidemiology guru Sander Greenland, Congress of
Epidemiology 2001, Toronto)
• Data are observer notes, respondent
answers, biochemical measurements,
contents of medical records, machine
readable datasets, …
• What does one do with them?
10. 11/13/2007 Data analysis and causal inference 10
Steps in data management
• Design the data collection process
• Write down all data collection procedures
• Train and supervise data collectors
• Monitor all data collection activities
• Document all data collection experiences
• Keep track of, document, and safeguard
data
11. 11/13/2007 Data analysis and causal inference 11
Data processing
• Review, edit, and code data forms,
documenting exceptions and actions
• Convert to electronic form
• “Clean” data – check for illegal or
improbable values, combinations of values
• Prepare summaries
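The "cleaning" step above can be sketched as a set of automated checks for illegal values and for improbable combinations of values. A minimal Python illustration; the field names, ranges, and cutoffs here are invented, not from the lecture:

```python
# Sketch of data cleaning: scan records for illegal or improbable values and
# for improbable *combinations* of values. All checks are illustrative.

def check_record(rec):
    """Return a list of problems found in one record (a dict of field -> value)."""
    problems = []
    if not 0 <= rec.get("age", -1) <= 110:          # missing age also fails
        problems.append("age out of range")
    if rec.get("sex") not in ("M", "F"):
        problems.append("illegal sex code")
    # improbable combination: a young child recorded as a smoker
    if rec.get("age", 0) < 10 and rec.get("smoker") == "Y":
        problems.append("improbable age/smoking combination")
    return problems

records = [
    {"age": 34, "sex": "F", "smoker": "N"},
    {"age": 150, "sex": "M", "smoker": "Y"},   # illegal age
    {"age": 5, "sex": "F", "smoker": "Y"},     # improbable combination
]

for i, rec in enumerate(records):
    for p in check_record(rec):
        print(f"record {i}: {p}")
```

In practice such checks are run on every record and the exceptions logged, in keeping with the documentation steps listed earlier.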
12. The case of the missing eights
• Cancer Prevention Study II
(N=1.2 million)
• Contractor keyed 20,000
forms/wk; checked weekly.
• 28-item food frequency had
peculiar pattern of missings
• Pulled original QQs to check
• Programmer checked code
• Cause: “O” instead of “0”
Steven D. Stellman. Am J Epidemiol
1989;129(4):857-860
4/12/2011 Data analysis and causal inference 12
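The letter-O-for-zero problem in this story is mechanically detectable: any value in a supposedly numeric field that fails to parse should be flagged for review rather than silently coded as missing. A hypothetical sketch (the keyed values are invented):

```python
# Sketch: flag keyed values in a numeric field that do not parse as digits,
# instead of silently treating them as missing ("the case of the missing eights").
keyed_values = ["0", "3", "O", "12", "0", "O8"]  # "O" is the letter, not zero

bad = [v for v in keyed_values if not v.isdigit()]
print("values needing review:", bad)  # ['O', 'O8']
```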
13. 4/12/2011 Data analysis and causal inference 13
Can you find the data management error?
48 * get non-hispanic white population in county for 2000, first by adding
49 ages 15-24, 25-34, 35-44, and 45-64, then by excluding ages 45-64;
50
51 CWHITES=CST00609+CST00610+CST00611+CST00612;
52 CWHITES2=CWHITES-CST00612;
53
54 * get non-hispanic black population in county;
55
56 CBLACKS=CST00616+CST00617+CST00618+CST00619;
57 CBLACKS2=CBLACKS-CST00619;
58
59 * get hispanic or latino population in county;
60
61 CHISPS=CST00623+CST00624+CST00625+CST00626;
62 CHISPS2=CHISPS-CST00626;
63 (continues on next slide)
14. 4/12/2011 Data analysis and causal inference 14
Can you find the data management error?
CST00637 Female population white alone aged 15-24, 2000 – county
CST00638 Female population white alone aged 25-34, 2000 – county
CST00639 Female population white alone aged 35-44, 2000 – county
CST00640 Female population white alone aged 45-64, 2000 – county
CST00644 Female population black* alone aged 15-24, 2000 – county
CST00645 Female population black* alone aged 25-34, 2000 – county
CST00646 Female population black* alone aged 35-44, 2000 – county
CST00647 Female population black* alone aged 45-64, 2000 – county
CST00651 Female population Hispanic* aged 15-24, 2000 – county
CST00652 Female population Hispanic* aged 25-34, 2000 – county
CST00653 Female population Hispanic* aged 35-44, 2000 – county
CST00654 Female population Hispanic* aged 45-64, 2000 – county
* Full variable name: “black or African American”, “Hispanic or Latino”
(continues on next slide)
15. 4/12/2011 Data analysis and causal inference 15
Can you find the data management error?
64 * get non-hispanic white female population in county;
65
66 CWFEMALES=CST00637+CST00638+CST00639+CST00640;
67 CWFEMALES2=CWFEMALES-CST00640;
68
69 * get non-hispanic black female population in county;
70
71 CBFEMALES=CST00644+CST00645+CST00646+CST00647;
72 CBFEMALES2=CBFEMALES-CST00646;
73
74 * get hispanic female population in county;
75
76 CHFEMALES=CST00651+CST00652+CST00653+CST00654;
77 CHFEMALES2=CHFEMALES-CST00654;
(continues on next slide)
16. 4/12/2011 Data analysis and causal inference 16
Can you find the data management error?
64 * get non-hispanic white female population in county;
65
66 CWFEMALES=CST00637+CST00638+CST00639+CST00640;
67 CWFEMALES2=CWFEMALES-CST00640;
68
69 * get non-hispanic black female population in county;
70
71 CBFEMALES=CST00644+CST00645+CST00646+CST00647;
72 CBFEMALES2=CBFEMALES-CST00646; * <-- the error: subtracts ages 35-44 (CST00646) instead of ages 45-64 (CST00647);
73
74 * get hispanic female population in county;
75
76 CHFEMALES=CST00651+CST00652+CST00653+CST00654;
77 CHFEMALES2=CHFEMALES-CST00654;
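A generic consistency check, not part of the lecture, would have flagged this class of bug automatically: recompute each derived "ages 15-44" total directly from its three component age groups and compare with the value obtained by subtraction. A Python sketch with invented county counts:

```python
# Recompute each derived total independently and compare with the value the
# SAS program derives by subtraction. County counts are invented.

c = {                                   # CSTxxxxx variables, made-up values
    "CST00637": 100, "CST00638": 110, "CST00639": 120, "CST00640": 130,
    "CST00644": 200, "CST00645": 210, "CST00646": 220, "CST00647": 230,
    "CST00651": 300, "CST00652": 310, "CST00653": 320, "CST00654": 330,
}

# Derived exactly as in the SAS program (including its bug):
CWFEMALES2 = (c["CST00637"] + c["CST00638"] + c["CST00639"] + c["CST00640"]) - c["CST00640"]
CBFEMALES2 = (c["CST00644"] + c["CST00645"] + c["CST00646"] + c["CST00647"]) - c["CST00646"]
CHFEMALES2 = (c["CST00651"] + c["CST00652"] + c["CST00653"] + c["CST00654"]) - c["CST00654"]

# Independent recomputation: sum only ages 15-24, 25-34, 35-44.
checks = {
    "CWFEMALES2": (CWFEMALES2, c["CST00637"] + c["CST00638"] + c["CST00639"]),
    "CBFEMALES2": (CBFEMALES2, c["CST00644"] + c["CST00645"] + c["CST00646"]),
    "CHFEMALES2": (CHFEMALES2, c["CST00651"] + c["CST00652"] + c["CST00653"]),
}
for name, (derived, direct) in checks.items():
    status = "OK" if derived == direct else "MISMATCH"
    print(f"{name}: derived={derived} direct={direct} {status}")
```

Only CBFEMALES2 mismatches, because the SAS code subtracts the 35-44 group instead of the 45-64 group.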
17. 12/30/2001 Data analysis and causal inference 17
Data exploration
• Examine the data – frequency
distributions, cross-tabulations,
scatterplots – be alert for surprises and
suspicious findings
• Examine means and prevalence for
factors of interest, overall and within
interesting subgroups
• Look at associations, prevalence ratios,
relative risks, odds ratios, correlations
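The exploration steps above can be sketched with the standard library alone; the exposure-outcome records below are invented for illustration:

```python
# Frequency distribution, cross-tabulation, and a simple measure of
# association, using only the standard library. Data are invented.
from collections import Counter

records = [("smoker", "disease"), ("smoker", "no disease"),
           ("smoker", "disease"), ("nonsmoker", "no disease"),
           ("nonsmoker", "no disease"), ("nonsmoker", "disease")]

# frequency distribution of exposure
print(Counter(exposure for exposure, _ in records))

# 2x2 cross-tabulation of exposure by outcome
xtab = Counter(records)
for cell, n in sorted(xtab.items()):
    print(cell, n)

# risk ratio: risk of disease in smokers / risk in nonsmokers (3 of each here)
risk_smk = xtab[("smoker", "disease")] / 3
risk_non = xtab[("nonsmoker", "disease")] / 3
print("risk ratio:", risk_smk / risk_non)  # 2.0
```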
18. 12/30/2001 Data analysis and causal inference 18
Carry out focused data analysis
• Desirable to have a written analysis plan
based on the research questions
• Typically carry out “crude” analyses and
analyses controlling for important
variables
• Methods of control: stratification,
mathematical modeling
19. Distribution of U.S. household income, 2007
(CPS data)
4/12/2011 Data analysis and causal inference 19
Income in $1000s/year
Source: http://img55.imageshack.us/i/incomedistr07jo6.jpg/
20. 12/30/2001 Data analysis and causal inference 20
Stratified analysis
• Divide the dataset into subsets according
to relevant covariables (e.g., age, sex,
smoking, …)
• Examine the estimates and associations
within each subset (unless there are too
many)
• Take averages across the subsets
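A minimal sketch of the stratified approach, with invented counts: compute the risk ratio within each stratum of the covariable, then take a weighted average across strata (here the standard Mantel-Haenszel weighting):

```python
# Stratified analysis: risk ratio within each age stratum, then a
# Mantel-Haenszel summary across strata. Counts are invented:
# a = exposed cases out of n1 exposed; c = unexposed cases out of n0 unexposed.

strata = {
    "young": (10, 100, 5, 100),
    "old":   (30, 100, 16, 100),
}

for name, (a, n1, c, n0) in strata.items():
    print(name, "RR =", (a / n1) / (c / n0))

# Mantel-Haenszel summary risk ratio across strata
num = sum(a * n0 / (n1 + n0) for a, n1, c, n0 in strata.values())
den = sum(c * n1 / (n1 + n0) for a, n1, c, n0 in strata.values())
print("MH RR =", num / den)
```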
21. 11/13/2007 Data analysis and causal inference 21
Mathematical modeling
• Express the outcome as some
mathematical function of the relevant
covariables
• “Fit” this function to the data, so that it
models the relations in the data
• Interpret the resulting model to draw
inferences about associations
22. 11/13/2007 Data analysis and causal inference 22
Selecting a pattern to sew a pair of pants
• Want one that fits the need
• Can sew without a pattern, but takes
time and may not look good
• Select a pattern that will be well
received
• Have you seen anyone wearing it?
• Has it been featured in magazines?
23. 12/30/2001 Data analysis and causal inference 23
The strategy of statistical data analysis
Look for an available statistical
model that will fit the situation (e.g.,
binomial, normal, chi-square, linear)
• Have others used it?
• Has it appeared in a methodology
article?
24. 12/30/2001 Data analysis and causal inference 24
The strategy of statistical data analysis
Summarize the data in terms of the
statistical model
– Mean
– Standard deviation
– Other parameters
25. 4/22/2002 Data analysis and causal inference 25
But should always look at the data
• Distributions can have same mean
and standard deviation but look very
different – e.g., [Figure: two differently
shaped distributions, each with mean 5]
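A tiny illustration of the point, with invented data: two datasets with identical mean and standard deviation but very different shapes.

```python
# Same mean, same standard deviation, different shapes - which is why summary
# statistics are no substitute for looking at the data. Data are invented.
from statistics import mean, pstdev

spread_out = [3, 4, 5, 6, 7]                       # evenly spread
clustered = [5 - 5 ** 0.5, 5, 5, 5, 5 + 5 ** 0.5]  # mass at 5 plus two outliers

for data in (spread_out, clustered):
    print([round(x, 2) for x in data],
          "mean:", round(mean(data), 3), "sd:", round(pstdev(data), 3))
```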
26. 4/18/2006 Data analysis and causal inference 26
Regression models - Conceptual
• Suppose risk factors of:
Age 50 years
BP 130 mmHG systolic
CHL 220 mg/dL
SMK 30 pack-years
27. 4/13/2010 Data analysis and causal inference 27
Regression models - Conceptual
Example of an additive model:
Risk of CHD =
Risk from Age (“Age_risk”)
+ Risk from BP (“BP_risk”)
+ Risk from CHL (“CHL_risk”)
+ Risk from SMK (“SMK_risk”)
28. 4/13/2010 Data analysis and causal inference 28
Propose the model
Risk of CHD = Age_risk + BP_risk + CHL_risk + SMK_risk
Age_risk = Age in years x risk increase per year
BP_risk = BP in mmHG x risk increase per mmHG
CHL_risk = Cholest. in mg/dL x risk increase per mg/dL
SMK_risk = Pack-years x risk increase per pack-year
29. 4/13/2010 Data analysis and causal inference 29
Fit the model – estimate the coefficients
• Risk = β0 + β1Age + β2BP + β3CHL + β4SMK
β0 = baseline risk
β1 = risk increase per year
β2 = risk increase per mmHG
β3 = risk increase per mg/dL
β4 = risk increase per pack-year
• Use the data and statistical techniques to
estimate β1, β2, β3, β4.
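The fitting step can be shown in miniature for a single covariable, Risk = β0 + β1·Age, using ordinary least squares on invented data. (The lecture's model has four covariables; the same idea applies, and real analyses would use standard regression software.)

```python
# Estimate beta0 and beta1 for Risk = beta0 + beta1*Age by ordinary least
# squares. The age/risk pairs are invented to lie on a line for clarity.
from statistics import mean

ages = [40, 45, 50, 55, 60]
risks = [0.21, 0.22, 0.23, 0.24, 0.25]   # risk rises 0.002 per year here

xbar, ybar = mean(ages), mean(risks)
beta1 = (sum((x - xbar) * (y - ybar) for x, y in zip(ages, risks))
         / sum((x - xbar) ** 2 for x in ages))
beta0 = ybar - beta1 * xbar
print(f"beta0 = {beta0:.3f}, beta1 = {beta1:.4f} per year")
```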
30. 12/30/2001 Data analysis and causal inference 30
P-values and Power
• P-value: “the probability of obtaining
an interesting-looking sample from a
boring population” (1 – specificity)
• Power: “the probability of obtaining
an interesting-looking sample from
an interesting population” (sensitivity)
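These two definitions can be made concrete under a normal approximation. Assuming, purely for illustration, that the estimated ln(OR) is Normal around its true value with standard error 0.3, a "boring" population with true ln(OR) = 0, and an "interesting" one with true ln(OR) = 0.7:

```python
# P-value and power under a normal approximation for ln(OR).
# The standard error of 0.3 is an invented illustrative value.
from statistics import NormalDist

se = 0.3
boring = NormalDist(mu=0.0, sigma=se)       # true ln(OR) = 0
interesting = NormalDist(mu=0.7, sigma=se)  # true ln(OR) = 0.7

observed = 0.5
# P-value: probability a boring population yields a sample at least this
# interesting-looking (one-sided)
p_value = 1 - boring.cdf(observed)

# Power: probability an interesting population yields a sample beyond a
# one-sided 5% cutoff
cutoff = boring.inv_cdf(0.95)
power = 1 - interesting.cdf(cutoff)

print(f"one-sided P-value for observing 0.5: {p_value:.3f}")
print(f"5% cutoff: {cutoff:.3f}, power: {power:.2f}")
```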
31. 11/16/2004 Data analysis and causal inference 31
The P-value
If my study observes 0.5 [e.g., ln(OR)]
[Figure: two sampling distributions of ln(OR) – a boring population centered at 0 and an interesting population centered at 0.7]
32. 11/22/2005 Data analysis and causal inference 32
The P-value
If my study observes 0.5 [e.g., ln(OR)]
[Figure: the same two distributions – boring population at 0, interesting population at 0.7; the shaded tail of the boring distribution beyond the observed 0.5 is the P-value]
33. 11/16/2004 Data analysis and causal inference 33
The Problem with the P-value
But the P-value does not tell me the
probability that what I observed was
due to chance
[Figure: the two distributions again – boring population at 0, interesting population at 0.7]
34. 11/16/2004 Data analysis and causal inference 34
If I study only boring populations
[Figure: distributions of samples from boring populations, centered at 0]
35. 11/16/2004 Data analysis and causal inference 35
If I study only interesting populations
[Figure: distributions of samples from interesting populations, centered at 0.7]
36. 11/22/2005 Data analysis and causal inference 36
Many boring populations
[Figure: sampling distributions mostly from boring populations (centered at 0), with some interesting populations (centered at 0.7)]
37. 11/22/2005 Data analysis and causal inference 37
Many interesting populations
[Figure: sampling distributions mostly from interesting populations (centered at 0.7), with some boring populations (centered at 0)]
38. 12/30/2001 Data analysis and causal inference 38
Do epidemiologists study boring populations?
That probability depends on how many boring
populations there are. If we study
10 interesting populations
100 boring populations
with 90% power and 5% significance level, we
expect to obtain 9 interesting samples from
the interesting populations and 5 from the
boring populations
39. 11/22/2005 Data analysis and causal inference 39
P-values and predictive values
Results:
14 interesting samples
5 came from boring populations
Probability that an interesting sample
came from a boring population:
5/14 = 36% – not 5%!
Analogous to positive predictive value
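The arithmetic on this slide, spelled out:

```python
# 10 interesting and 100 boring populations, studied with 90% power and a
# 5% significance level (the numbers from the preceding slide).
interesting_pops, boring_pops = 10, 100
power, alpha = 0.90, 0.05

true_positives = power * interesting_pops    # 9 interesting samples
false_positives = alpha * boring_pops        # 5 "interesting" samples from boring populations

prob_boring = false_positives / (true_positives + false_positives)
ppv = true_positives / (true_positives + false_positives)
print(f"P(interesting sample came from a boring population) = {prob_boring:.0%}")  # 36%
print(f"positive predictive value = {ppv:.0%}")  # 64%
```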
40. 4/12/2011 Data analysis and causal inference 40
Analogy to positive predictive value
                                  Populations
Samples                    Interesting     Boring          Total
                           (“cases”)       (“noncases”)
Interesting (“positive”)        9               5             14    PV+ = 9/14 ≈ 64%
Boring (“negative”)             1              95             96
Total                          10             100            110
                           (with 90%       (with 95%
                            sensitivity)    specificity)
41. 4/12/2011 Data analysis and causal inference 41
Meta-analysis
• Literature reviews
• Systematic literature reviews
• Every study is an observation from a
population of possible studies
• The set of studies that have been
published may be a biased sample
from that population
42. 7/1/2009 Data analysis and causal inference 42
What should guide data analysis
• What are the research questions?
– Estimate means (e.g., cholesterol)
and prevalences (e.g., HIV)
– Assess associations (e.g., Is blood
lead associated with elevated blood
pressure?; Do prepaid health plans
provide more preventative care? Do
bednets protect against malaria?)
43. 11/20/2007 Data analysis and causal inference 43
Association of helmet use with death in motorcycle
crashes: a matched-pair cohort study
(Daniel Norvell and Peter Cummings, AJE 2002;156:483-7)
• Data from the National Highway Traffic
Safety Administration’s Fatality Analysis
Reporting System
• Exposure: helmet use; Outcome: death
• Potential confounders: sex, seat position,
age, state helmet law
44. 11/20/2007 Data analysis and causal inference 44
Association of helmet use with death in motorcycle
crashes: a matched-pair cohort study
(Daniel Norvell and Peter Cummings, AJE 2002;156:483-7)
• 9,222 driver-passenger pairs after
exclusions
• Relative risk of death for a helmeted rider
was 0.65 (0.57-0.74), (0.61 adjusted for
seat position)
• Examined effect measure modification by
seat position and by type of crash.
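In a matched-pair cohort analysis like this one, each driver-passenger pair is its own stratum; with one helmeted and one unhelmeted rider per pair, the Mantel-Haenszel risk ratio across pair strata reduces to the ratio of helmeted to unhelmeted deaths. A sketch with invented pair counts, chosen here so the result happens to match the published 0.65:

```python
# Matched-pair cohort sketch: each pair contributes one helmeted and one
# unhelmeted rider; the MH risk ratio over pair strata reduces to
# (helmeted deaths) / (unhelmeted deaths). Pair counts are invented.

# (helmeted_died, unhelmeted_died) -> number of pairs
pairs = {
    (True, True): 40,     # both died
    (True, False): 90,    # only the helmeted rider died
    (False, True): 160,   # only the unhelmeted rider died
    (False, False): 710,  # both survived
}

helmeted_deaths = sum(n for (h, u), n in pairs.items() if h)
unhelmeted_deaths = sum(n for (h, u), n in pairs.items() if u)
rr = helmeted_deaths / unhelmeted_deaths
print(f"matched-pair MH risk ratio = {rr:.2f}")  # 0.65
```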
45.
46. When the
proofreader takes a
week off
12/29/2009, B5
Dec 2009 Close
28 10547.08
25 10520.10
24 10520.10
23 10466.44
22 10464.93
21 10414.14
18 10328.89
17 10308.26
Source: www.google.com/finance/historical?q=INDEXDJX:.DJI
[Chart x-axis: Dec 22 23 24 25 28]
47. I hope he’s having
a good break!
12/31/2009, B6
[Chart x-axis: Dec 23 24 25 28 29]
Dec 2009 Close
29 10545.41
28 10547.08
25 10520.10
24 10520.10
23 10466.44
22 10464.93
21 10414.14
18 10328.89
17 10308.26
www.google.com/finance/historical?q=INDEXDJX:.DJI
48. 4/12/2011 Data analysis and causal inference 48
Thank you
• Arigato
• Asanti
• Dhanyavaad
• Dumela
• Gracias
• Merci
• Obrigado
• Xie xie
Editor's Notes
Xin chao, Guten tag, wilkommen, karibuni, dumela, merhaba, shalom, huan-ying, bienvenidos, boa tarde
This two-part lecture is about data analysis and causal inference.
As long as we’re talking about data analysis, let’s begin with a little story about statisticians (The Physicist, the Chemist, and the Statistician, in “Science Jokes”, posted to Usenet groups by Joachim Verhagen (verhagen@fys.ruu.nl); downloaded from, Keith M. Gregg, keith.gregg@stanford.edu, www-leland.stanford.edu/~keithg/humor.shtml)
“Three professors (a physicist, a chemist, and a statistician) are called in to see their dean. Just as they arrive the dean is called out of his office, leaving the three professors there. The professors see with alarm that there is a fire in the wastebasket.”
“The physicist says, ‘I know what to do! We must cool down the materials until their temperature is lower than the ignition temperature and then the fire will go out.’”
“The chemist says, ‘No! No! I know what to do! We must cut off the supply of oxygen so that the fire will go out due to lack of one of the reactants.’”
“While the physicist and chemist debate what course to take, they both are alarmed to see the statistician running around the room starting other fires. They both scream, ‘What are you doing?’
To which the statistician replies, ‘Trying to get an adequate sample size.’”
The first thing we do with data is to manage them (note that epidemiologists usually regard the word “data” as a plural word, based on its Latin root; however, other fields often consider “data” to be singular). Since epidemiologic studies tend to have many – hundreds, thousands, or even millions – of observations and often tens or hundreds of data items for each observation, managing epidemiologic data involves “mass production”. Therefore a systematic, organized, professional approach is critical for detecting and avoiding problems with the data.
Data management, including careful and thorough documentation, is one of those activities like sanitation, hygiene, laundry, maintenance, and the like that are critical to health and well-being but largely underappreciated.
The consequences of lapses in managing data can be far-reaching, and one can never take anything for granted. One of the more dramatic consequences of a lapse in data management was the loss of the Mars Climate Orbiter, which approached Mars at far too low an altitude and was destroyed in the Martian atmosphere. In the investigation of the loss, it turned out that the force data reported by the Lockheed Martin engineering team had been in English units, but the navigation team at NASA had assumed that they were in metric units.
And so, as Noel Hinners, vice president for flight systems at Lockheed Martin Astronautics said, “you can never, never take anything for granted.”
Without proper documentation, data may be of little if any value. For example, on the slide is an excerpt of data from the 1995 National Survey of Family Growth that my colleague Dr. Adaora Adimora and I have been analyzing to study concurrent sexual partnerships among U.S. women. Sometimes people will go to great lengths to save their data for years and years, only to find that they never had or neglected to save the documentation for it. Without the documentation, the data are, essentially, useless.
The preceding lecture began with Sander Greenland’s assertion that data say nothing at all. Data consist of observer notes, respondent answers, biochemical measurements, contents of medical records, machine readable datasets, and other kinds of information from which we attempt to derive meaning.
So what does one do with them? Analysis and interpretation of the data create the meaning that we ascribe to the data.
The steps in data management are to:
1. Design the process by which data will be collected, writing down all data collection procedures
2. Train and supervise data collectors and monitor all data collection activities
3. Document all data collection experiences so that later it will be possible to reconstruct what happened or how issues that arose were resolved
and very importantly,
4. Keep track of, document, and safeguard the data and the documentation
It may seem superfluous to remind people not to lose their data. But as I said, data management is an under-appreciated activity, so people tend to be casual about it (“I’ll back it up ‘tomorrow’”). A project I worked on back in the mainframe era nearly lost an entire year’s work because of a disk crash. The person responsible for backing up the disk – my boss – had kept putting off the task. Fortunately we had shared a copy of the files with another organization, and they had not yet recycled the tape! The American College of Epidemiology had to recreate its membership database when it was lost. In November 2002 thieves stole the hard drives from 9 personal computers in the Epidemiological and Communicable Diseases unit at the Indian Council of Medical Research in New Delhi. There was apparently no backup. So you see, epidemiologists are as fallible in this area as the rest of us are.
The next steps in data management are to review, edit, and code the data forms (e.g., questionnaires, abstracts of records, notes from observations). For example, the questionnaire may have instructed respondents to “mark one response”, but you may get questionnaires where two responses are circled, or a response is marked midway between two choices, and the like. Someone needs to decide how to handle these situations and to edit the forms accordingly. Questions about how these situations were handled may well arise. So it is important to document the coding decisions, the forms that had exceptions, and the actions taken. Occasionally it may be necessary to go back and revise all of the exceptions handled in a certain way, and it is much easier to work from a list than to have to go through all of the forms again. [For example, in a multi-site project in which I participated, the data center proposed to code intermediate responses (e.g., when “2” and “3” were both circled or a mark was made between them) as the higher number, a plan which was endorsed by the Data Analysis Committee. Later, though, the principal investigator at one of the sites persuaded the Steering Committee that the responses should have been coded with fractional values (e.g., “2.5”), necessitating re-review of thousands of forms to identify the exceptions.]
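The bracketed example above can be sketched in code. Here is a minimal, hypothetical illustration (in Python; the form ids, response codes, and recoding rule are invented for illustration) of coding intermediate responses as fractional values while keeping the list of exceptions that spares a later re-review of every form:

```python
# Hypothetical sketch: recode intermediate questionnaire responses
# (two adjacent choices circled, recorded here as "2/3") as fractional
# values, while logging each coding decision for later audit.
raw_responses = {"form_0412": "2/3", "form_0413": "4", "form_0414": "1/2"}

coded = {}
exception_log = []  # one entry per form that needed a coding decision
for form_id, response in raw_responses.items():
    if "/" in response:  # an intermediate response needing a decision
        lo, hi = (int(x) for x in response.split("/"))
        value = (lo + hi) / 2  # the fractional-value rule: e.g., 2.5
        exception_log.append((form_id, response, value))
    else:
        value = int(response)
    coded[form_id] = value

print(coded)          # form_0412 becomes 2.5, form_0414 becomes 1.5
print(exception_log)  # the list that makes re-review of all forms unnecessary
```

If the coding rule later changes, only the forms in `exception_log` need to be revisited, rather than every form in the study.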
After the forms are edited, the data are converted to electronic form, usually by keying into a computer, sometimes by optical scanning. Increasingly interview data are captured directly by computer, through CATI (Computer-Assisted Telephone Interview), CAPI (Computer-Assisted Personal Interview), and A-CASI (Audio Computer-Assisted Self-Interview) technology.
Stellman reports an unusual data error encountered in the Cancer Prevention Study II, with 1.2 million questionnaires completed during fall 1982. Data were entered and key-verified under contract. The firm typically processed 20,000 forms/week, and researchers subjected each batch to an “exhaustive battery” of computer checks.
As the researchers were beginning a factor analysis of the 28-item food frequency section, however, they examined the distribution of missing values and found, to their surprise and puzzlement, that there appeared to be no questionnaires with exactly 8 or 18 missing food items. After an intensive investigation, including pulling a sample of the original data forms, they concluded that there was a programming problem:
“The contractor’s lead programmer was asked to inspect all code related to the flag in question, but could find no errors. This was an exceptionally capable individual, whose word could be accepted as final. Seeking a possible (but unlikely) flaw in our own data logging process, we examined originally delivered data tapes . . ., but these proved to be identical in content to the system files. The problem simply had to originate with the contractor. At the time we were reaching this conclusion, the programmer called back again with a sheepish tone to say she had discovered the problem in her program. After all data items had been entered, the number of missing items was subtracted from 28 and the result was tested against zero; if the numbers were equal, the first item in the series was output as the flag character and the remaining 27 were output as blanks. But the line of code with the test contained a misprint: A letter “O” had been typed instead of a zero (one of the hardest programming errors to detect). In the machine level language of the contractor’s computer, this mistyped instruction was still a legal one, but it gave a test result of “true” for any number of missing items that ended with the digit 8.” (859-860)
Steven D. Stellman. The case of the missing eights. Am J Epidemiol 1989;129(4):857-860, http://aje.oxfordjournals.org/content/129/4/857.full.pdf
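The effect of the bug can be loosely reconstructed in modern code. The sketch below is in Python (the actual error was in machine-level code on the contractor's computer) and shows why questionnaires with exactly 8 or 18 missing items vanished from the dataset:

```python
# Loose reconstruction of Stellman's "missing eights" bug.
ITEMS = 28  # food-frequency items per questionnaire

def flag_all_missing_correct(n_missing):
    # Intended test: output the all-missing flag record only when
    # 28 minus the number of missing items equals zero, i.e. when
    # every one of the 28 items is missing.
    return ITEMS - n_missing == 0

def flag_all_missing_buggy(n_missing):
    # Per Stellman's account, the mistyped comparison (a letter "O"
    # typed for a zero) came out "true" for any number of missing
    # items ending in the digit 8.
    return n_missing % 10 == 8

# Questionnaires wrongly converted to all-blank flag records -- which is
# why no forms appeared with exactly 8 or 18 missing items:
wrongly_flagged = [n for n in range(ITEMS + 1)
                   if flag_all_missing_buggy(n) and not flag_all_missing_correct(n)]
print(wrongly_flagged)  # [8, 18]
```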
* get non-hispanic white population in county for 2000, first by adding
  ages 15-24, 25-34, 35-44, and 45-64, then by excluding ages 45-64;

CWHITES=CST00609+CST00610+CST00611+CST00612;
CWHITES2=CWHITES-CST00612;

* get non-hispanic black population in county;

CBLACKS=CST00616+CST00617+CST00618+CST00619;
CBLACKS2=CBLACKS-CST00619;

* get hispanic or latino population in county;

CHISPS=CST00623+CST00624+CST00625+CST00626;
CHISPS2=CHISPS-CST00626;
(continues on next slide)
CST00637 Female population white alone aged 15-24, 2000 – county
CST00638 Female population white alone aged 25-34, 2000 – county
CST00639 Female population white alone aged 35-44, 2000 – county
CST00640 Female population white alone aged 45-64, 2000 – county
CST00644 Female population black* alone aged 15-24, 2000 – county
CST00645 Female population black* alone aged 25-34, 2000 – county
CST00646 Female population black* alone aged 35-44, 2000 – county
CST00647 Female population black* alone aged 45-64, 2000 – county
CST00651 Female population Hispanic* aged 15-24, 2000 – county
CST00652 Female population Hispanic* aged 25-34, 2000 – county
CST00653 Female population Hispanic* aged 35-44, 2000 – county
CST00654 Female population Hispanic* aged 45-64, 2000 – county
* Full variable name: “black or African American”, “Hispanic or Latino” (continues on next slide)
* get non-hispanic white female population in county;

CWFEMALES=CST00637+CST00638+CST00639+CST00640;
CWFEMALES2=CWFEMALES-CST00640;

* get non-hispanic black female population in county;

CBFEMALES=CST00644+CST00645+CST00646+CST00647;
CBFEMALES2=CBFEMALES-CST00647;

* get hispanic female population in county;

CHFEMALES=CST00651+CST00652+CST00653+CST00654;
CHFEMALES2=CHFEMALES-CST00654;
After or in the process of “cleaning” the data (reviewing distributions for illegal or improbable values or combinations of values, such as pregnant males), it is important to examine the data to familiarize oneself with them, to be aware of how various factors are distributed, and, always, to be on the alert for surprises and suspicious findings. Analysts inspect frequency distributions, cross-tabulations, and scatterplots to “see” the data. One worthwhile practice is to make sure that the numbers of respondents in each table are what they should be, since respondents can be lost or duplicated when datasets are merged, variables are recoded, subgroups are examined, and so forth.
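The count-checking habit can be sketched with invented records (Python; the record layout and id values are made up). Matching counts alone are not enough, as this example shows:

```python
# Sketch: after merging two datasets, confirm that no respondents were
# lost or duplicated. All records here are invented for illustration.
baseline = [{"id": 1}, {"id": 2}, {"id": 3}]
followup = [{"id": 2, "visit": 2}, {"id": 3, "visit": 2}, {"id": 3, "visit": 2}]

merged = [{**b, **f} for b in baseline for f in followup if b["id"] == f["id"]]

# Counts match (3 records in, 3 out), yet the merge is wrong:
ids = [r["id"] for r in merged]
dupes = {i for i in ids if ids.count(i) > 1}   # respondent duplicated
lost = {b["id"] for b in baseline} - set(ids)  # respondent dropped
print(len(merged), sorted(dupes), sorted(lost))
```

Here respondent 3 appears twice and respondent 1 has vanished, even though the total count is unchanged; checking identities, not just totals, catches the problem.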
Summary statistics, such as means, proportions (e.g., prevalence, incidence proportions), and rates (e.g., incidence rates), are examined in the dataset as a whole and within various subgroups. Even if groups will be combined for analysis, it is good practice to look at data by gender, age group, and various other dimensions relevant to the study population and type of data.
The next step is to look at associations among factors, with such measures as prevalence ratios, relative risks, odds ratios, and correlation coefficients. Graphical analysis techniques can be revealing at all of these stages.
A thorough exploration of the data helps to catch problems resulting from errors or lapses of some kind and also helps to identify features of the data that have implications for the formal statistical analysis (e.g., outliers, skewed distributions). For example, some statistical analysis techniques assume that variables are normally distributed. If the exploration reveals that they are not, a transformation must be applied or statistical analysis techniques employed that do not make this assumption.
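As a small illustration, the sketch below (Python, with made-up right-skewed values) computes a sample skewness statistic before and after a log transformation:

```python
import math
import statistics

# Made-up, right-skewed illustration data.
values = [10, 13, 16, 20, 25, 32, 40, 50, 63, 100]

def sample_skewness(xs):
    # Adjusted Fisher-Pearson sample skewness: positive means a
    # long right tail, near zero means roughly symmetric.
    n = len(xs)
    m = statistics.mean(xs)
    s = statistics.stdev(xs)
    return (n / ((n - 1) * (n - 2))) * sum(((x - m) / s) ** 3 for x in xs)

logged = [math.log(x) for x in values]
print(round(sample_skewness(values), 2))  # strongly positive (right-skewed)
print(round(sample_skewness(logged), 2))  # much closer to zero after the log
```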
It is generally desirable to have a written analysis plan to guide the analysis of the research questions. Even if you have a clear idea of the study question and how to proceed to examine it, it is easy to become lost in the process of scanning distributions, examining hundreds of means and proportions, and staring at screens and printouts. So write down your plan with as much specificity as you can.
Usually the data analysis plan will call for a formal assessment of the crude estimates and associations, followed by estimates and associations that control for important covariables identified in your analysis plan, such as potential confounders. There are two major methods of controlling for covariables: stratified analysis and mathematical modeling.
For example, the distribution of U.S. household income is not a “normal” distribution.
In stratified analysis, the dataset is divided into subsets according to one or more covariables to be controlled. For example, an overall dataset might be examined within subgroups formed by gender, age group, urban-rural, smoking status, blood pressure, etc., depending upon the factors being studied. Ideally the results will be inspected within each stratum, unless there are too many strata to make that practical. Then, by averaging the estimates across sets of strata and across all of them, the analyst obtains adjusted estimates that control for the stratification variables. Age standardization, considered earlier in the course, is an example of stratified analysis.
Important advantages of stratified analysis are that it usually requires fewer assumptions about the distributions of variables and their relationships, and it shows all of the data. With mathematical modeling, it is easy to miss important features of the data because they are not in view. Disadvantages of stratified analysis are that if there are several variables to control, the number of strata becomes large very quickly. Also, variables must be categorized in order to form the strata. Besides the work involved in categorizing the variables, categorization can reduce available precision. But looking at stratified analyses is a good idea as an accompaniment to other analysis methods, even if one does not ultimately report the stratified analysis results.
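To make the idea concrete, here is a hypothetical stratified analysis in Python: the stratum-specific risk ratios are inspected first, then combined into a Mantel-Haenszel summary risk ratio. All counts are invented for illustration:

```python
# Each stratum holds (exposed cases, exposed total,
#                     unexposed cases, unexposed total).
strata = {
    "age < 50": (10, 100, 5, 100),
    "age >= 50": (30, 100, 20, 100),
}

# First inspect the risk ratio within each stratum, as the text advises.
for label, (a, n1, b, n0) in strata.items():
    print(label, round((a / n1) / (b / n0), 2))

# Mantel-Haenszel summary risk ratio: a weighted average over strata.
num = sum(a * n0 / (n1 + n0) for a, n1, b, n0 in strata.values())
den = sum(b * n1 / (n1 + n0) for a, n1, b, n0 in strata.values())
rr_mh = num / den
print(round(rr_mh, 2))  # the adjusted estimate controlling for age group
```

With these counts the stratum-specific risk ratios are 2.0 and 1.5, and the adjusted summary falls between them at 1.6.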
Mathematical modeling is the second and now most widely used method for examining associations while controlling for important covariables. With mathematical modeling, we find a way to express the outcome, such as incidence, as a mathematical function of the covariables we consider important. For example, we might express the incidence of heart disease as a function of age, smoking status, blood pressure level, cholesterol level, presence of diabetes, and so on. We usually specify the form of this function (e.g., whether we add the effect of one factor to the effect of another, multiply the effect of one factor by another, and so forth) and then tailor the function to fit the data. An analogy is choosing a dress pattern and then adjusting it to fit the person for whom the dress is being made.
Fitting the model involves statistical procedures that estimate parameters indicating the quantitative contribution to the outcome of each of the factors in the model (contingent, of course, always on the model form and the assumptions on which the model was based). The model will have been chosen so that these parameters have a useful interpretation. For example, the parameters might estimate the difference in risk of the outcome attributable to a factor or the odds ratio relating a factor to the outcome. When the parameter is a ratio, the model usually works with it on the log scale, which is why we use logistic and log-binomial regression.
A possible analogy to statistical analysis of data, especially inferential statistics, where the analyst attempts to draw inferences about a population from a sample of data, is the way a seamstress or tailor might approach sewing a pair of pants.
In selecting a pattern, s/he looks first for one that will suit the purpose for which the pants are intended (so for example, dress pants, work pants, casual pants, shorts, athletic shorts, etc.). It’s possible to sew without a pattern, but the result may not look good.
Also, s/he will want to choose a pattern that will be well received when the pants are worn. One consideration is whether s/he has seen anyone wearing pants in that style. Another might be whether the pattern has been featured in a fashion magazine.
So in analyzing data, the analyst looks for an available statistical model that appears to fit the situation – for example, the binomial, normal, or chi-square distribution, or the linear or logistic model.
If others have used that model (i.e., that pattern, essentially) with data of the type we are dealing with, the result is more likely to be well received by our peers. Similarly, if the model has been presented in a scientific journal (perhaps that is the statistician’s equivalent of a fashion magazine) the result is likely to be well-received.
Having chosen the pattern, the person sewing a pair of pants will need to select the correct size pattern or adjust the pattern to fit the person who will wear the pants. Similarly, having chosen the type of statistical model, the data analyst will select the “size” model for the data, which involves estimating the parameters that the model uses. For example, a normal distribution is a family of distributions that can be wide or narrow and can be located anywhere on the real number line. The analyst selects a specific distribution that fits the location of the data on the real number line (the mean, essentially), the dispersion of the data around its mean (standard deviation or variance), and whether the data are skewed to the right or left, and so forth.
Of course, just as the seamstress will want to see the person whom the pattern is to fit, the data analyst will want to look at the data before selecting the model. For example, two distributions can have the same mean and standard deviation but differ greatly in other respects. The two distributions in the diagram both have a mean of 5, but otherwise they are very different.
Epidemiology also uses mathematical models to determine expected outcomes. Suppose we want to estimate the risk of CHD for someone age 50 years, with systolic blood pressure of 130 mm Hg, serum cholesterol of 220 mg/dL, who has smoked a pack of cigarettes/day for 30 years.
Graphical examples of simple linear regression:
http://www.sjsu.edu/faculty/gerstman/StatPrimer/regression.pdf
http://cast.massey.ac.nz/core/index.html?book=biometric (more thorough)
Regression models are the kind most often used in epidemiology. With a regression model we begin with a concept of how risk of the outcome relates to a set of risk factors.
Each risk factor will need a multiplier (a “coefficient”) to translate its value into a risk-equivalent.
Then we use the data and statistical techniques (regression analysis) to estimate the most likely values of the coefficients. That process is called “fitting the model”.
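For simple linear regression, "fitting the model" can be shown in a few lines. The sketch below (Python; the x and y values are illustrative, not real study data) estimates the coefficient and intercept by least squares:

```python
x = [1, 2, 3, 4, 5]             # e.g., an exposure level
y = [2.1, 3.9, 6.2, 7.8, 10.1]  # e.g., a continuous outcome

xbar = sum(x) / len(x)
ybar = sum(y) / len(y)

# Least-squares slope: covariance of x and y over the variance of x.
slope = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
         / sum((xi - xbar) ** 2 for xi in x))
intercept = ybar - slope * xbar
print(round(slope, 2), round(intercept, 2))  # slope ~1.99, intercept ~0.05
```

The estimated slope is the "coefficient" of the text: the translation of a one-unit change in the risk factor into a change in the outcome.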
P-values are ubiquitous in health research, though they are widely misunderstood as well. Here is an attempt to convey an intuitive sense of how they work.
A p-value might be regarded as the probability of obtaining an interesting-looking sample from a boring population. We know that even if nothing is going on in a population, a particular sample just might appear to have an intriguing association. The p-value is computed to tell us the probability of obtaining an unusual sample even when no such association exists in the underlying population.
Statistical power is in some respects the inverse of the p-value. Statistical power is the probability of obtaining an interesting-looking sample from an interesting population.
Both the p-value and statistical power are the probability of obtaining an interesting sample (i.e., one with an association of interest). But we know that by the vagaries of random sampling a particular sample might not represent the population well. So the p-value and power tell us how likely an interesting sample could arise in these two very different situations, the situation where there is no association in the population and the situation where there is an association in the population.
The problem with p-values is not so much what they try to do as how we try to interpret them. This slide shows two possible populations. The one on the left is the boring one – it has no association. The one on the right is the interesting one – in this population there is an association we are interested in detecting.
Suppose I conduct a sample survey – a cross-sectional study with a sample of people randomly selected from the population. I will use the OR as the measure of association, and to make the situation easier to diagram, I am going to show the OR on the log scale (the distribution of the log of the OR is symmetrical, whereas that of the OR is not). To orient you, an OR of 1.0 has a natural log of zero and would correspond to a “boring” population. An OR of 2.0 has a natural log of 0.7; an OR of 1.65 has a natural log of 0.5. It’s just a 1-for-1 transformation. [We will write the natural logarithm of the OR as ln(OR).]
So now I draw a sample, compute the ln(OR), and the result happens to be, say, 0.5. That’s an OR of 1.65, represented by the vertical blue line. I do not know what the population really looks like, so I consider the possibilities. One possibility is that there is no association in the population – i.e., it’s boring (that’s the one on the left), and the true value of the association is ln(OR)=0, or the OR=1.0. Another possibility is that the population is not a boring one – it might, for example, be the interesting one on the right, where the true value of the association is ln(OR)=0.7, in other words, OR=2.0. We would like to know the probability that the sample I obtained came from one of the interesting possible populations rather than from the boring population.
So we would like to know the probability that our sample came from an interesting population. But it’s much easier to figure out the reverse – the probability that a particular population would give rise to my sample. So the p-value is designed to provide the probability that, given the size of my study, the boring population would produce a sample as interesting as (or more interesting than) the one I obtained.
In the diagram, the normal-looking curve on the left shows the distribution of values of the ln(OR) that would be observed if I repeated my study a large number of times in a boring population. Most of the times the sample I obtain would be boring, but sometimes it would be interesting. The pink in the right tail of the graph shows the proportion of times that the sample I obtain would have a ln(OR) of 0.5 or greater. The pink area on the left of the distribution shows the probability that I would observe a ln(OR) of -0.5 or lower, which for now let’s think of as equally interesting. This is not quite the information I wanted to know, but it is nevertheless useful.
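The two-tailed probability described here can be computed with a normal approximation. In the sketch below (Python), the observed ln(OR) of 0.5 comes from the running example, while the standard error is an assumed value chosen for illustration:

```python
import math

def normal_cdf(z):
    # Standard normal CDF via the error function.
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

observed_ln_or = 0.5  # an observed OR of about 1.65
se = 0.25             # hypothetical standard error of ln(OR)

# Probability that a "boring" population (true ln(OR) = 0) yields a
# sample at least this far from zero in either direction -- the two
# pink tails in the diagram.
z = observed_ln_or / se
p_value = 2 * (1 - normal_cdf(z))
print(round(p_value, 3))  # about 0.046 with these assumed numbers
```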
So the p-value tells me the probability that the boring population on the left would yield a sample as interesting or more so than the one I obtained. But the p-value does not tell me the probability that what I observed actually came from that boring population, i.e., that the association was due only to chance. The reason is that that probability – the probability that the sample I obtained came from the boring population – depends on how many boring populations I study and how many interesting populations I study.
For example, if I study only boring populations, the probability that my samples come from boring populations is 100%. Even when by chance I observe an interesting sample, as I will from time to time, if I study only boring populations, then that sample must have come from one.
In contrast, if I study only interesting populations, then all of my samples must come from interesting populations – even when by chance, as will happen, I get a boring sample (in other words, one that does not show an association).
If I have been rather unsuccessful in identifying worthwhile hypotheses to test, most of the populations I am studying are boring. Thus, even when the p-value is less than 0.05, there is a substantial probability that the sample simply represents an atypical sample from a boring population.
On the other hand, if you have been very successful in identifying worthwhile hypotheses to test, then most of the populations you study are interesting. Even if the p-value for a particular sample is greater than 0.05, there is a substantial probability that the sample simply represents an atypical sample from an interesting population.
So the probabilities that a given sample we obtain came from a boring or an interesting population depend on the relative proportions of boring and interesting populations that we study – information that’s generally not possible to know.
Suppose that every epidemiologist studies 10 interesting populations (or 10 true associations) and 100 boring populations (or 100 non-existent associations). If the statistical power (probability of obtaining an interesting sample from an interesting population) is 90%, then we expect that epidemiologists will obtain, on average, 9 interesting samples from the 10 interesting populations. Similarly, if our criterion for an “interesting sample” is a p-value less than 5% (that’s from our 5% significance level), then we expect epidemiologists to obtain, on average, 5 interesting samples from the 100 boring populations.
So these epidemiologists have observed, on average, 14 interesting samples, 5 of which came from boring populations. All these interesting samples had a p-value less than 5%, by our definition of an interesting sample. But the probability that a given interesting sample came from a boring population is 5/14 = 36%, not 5%!
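The arithmetic of this example is worth writing out (Python):

```python
# The worked example: 10 interesting populations studied with 90% power,
# 100 boring populations tested at a 5% significance level.
interesting_pops, boring_pops = 10, 100
power, alpha = 0.90, 0.05

true_positives = power * interesting_pops   # 9 interesting samples
false_positives = alpha * boring_pops       # 5 spurious "interesting" samples
total_significant = true_positives + false_positives  # 14 in all

prob_from_boring = false_positives / total_significant
print(round(prob_from_boring, 2))  # 0.36, not 0.05
```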
You may be noticing a similarity to the concept of positive predictive value that we studied in the lecture on population screening. Indeed, just as the predictive value of a positive screening test depends especially on the prevalence of the condition for which we are screening, the predictive value of a “significant” association depends on the proportion of interesting populations under study.
In that analogy, statistical power corresponds to sensitivity – the probability of observing a real association when there actually is one (i.e., classifying an interesting population as an “interesting” one). The significance level (alpha, the cutpoint for deciding what is a “significant” p-value) corresponds to the false positive rate (1 minus the specificity), the probability of classifying a boring population as an interesting one.
The table on the slide displays the numbers from the previous example, in the form we used for evaluating screening tests: sensitivity of 9/10 (90%), specificity of 95/100 (from the false positive rate [significance level] of 5/100), and PPV of 9/14.
So now you know that a p-value does not tell you the probability that a given result is due to chance (i.e., comes from a boring population) and that a “significant finding (p<0.05)” does not tell us that there is less than a 5% probability that the results were due to chance. We have to interpret a “significant” finding the way we would a positive result from a screening test with that false positive rate.
Setting a more stringent significance level (e.g., p-values < 0.01) reduces the false positive rate (increases specificity), which increases the probability that a “significant” finding was not due to chance (i.e., that a “significant” finding does come from an interesting population). But the actual probability depends on the proportion of interesting populations being studied as well as the significance level, just as positive predictive value depends upon disease prevalence and specificity.
I hope that the preceding discussion assists you in interpreting p-values you encounter in the literature. Let’s return now to the broader strategy of data analysis and interpretation, particularly attempts to infer causation from epidemiologic data.
The analysis is directed by the research questions. One category of research question is to gather information on the distribution of variables of interest. For example, we might be interested in conducting a study to estimate the distribution of serum cholesterol or blood lead levels in a population, or the prevalence of HIV or of use of well water.
Another category of research questions involves associations: Is blood lead level associated with elevated blood pressure? Do prepaid health plans provide more preventive care than fee-for-service plans? Do bednets protect against malaria?
Here is an example of data analysis in a study with a causal hypothesis: “does motorcycle helmet use reduce risk of death?” Daniel Norvell and Peter Cummings (American Journal of Epidemiology 2002;156:483-7) used data from the National Highway Traffic Safety Administration’s Fatality Analysis Reporting System, which collects information for all crashes on US public roads in which a fatality occurs. The primary exposure was helmet use; the primary outcome was death.
As you recall, the causal comparison that we would like to make contrasts the risk of death to motorcycle riders and passengers wearing helmets with the risk of death to motorcycle riders and passengers not wearing helmets. Since that comparison involves a counterfactual, we use a substitute population. What should that substitute be? If we compare death risks for helmeted riders with death risks for unhelmeted riders, we would certainly be concerned about differences between riders who wear helmets and riders who do not in regard to driving behavior and crash characteristics. The authors circumvented this concern to some extent by comparing the death risk of the driver and passenger on the same motorcycle. That comparison tends to equalize driver and crash-related factors. The authors also identified and controlled for a number of potential confounders: sex, seat position, age, and presence of a state helmet law (since a law requiring helmet use might lead crash survivors to report helmet use falsely, which would make helmets appear to be more protective, because only the survivors are able to report use).
The dataset included 9,222 driver-passenger pairs after exclusions. The primary analysis found a crude relative risk of 0.65 (95% confidence interval 0.57-0.74). When the association was adjusted for seat position, the relative risk estimate strengthened slightly, to 0.61.
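For readers who want to see the mechanics, the sketch below (Python) computes a crude relative risk with a Wald 95% confidence interval on the log scale, the scale on which such intervals are usually constructed. The counts are hypothetical, not the Norvell and Cummings data (the paper reports RR = 0.65, 95% CI 0.57-0.74):

```python
import math

# Hypothetical 2x2 counts chosen only to illustrate the calculation.
deaths_helmet, total_helmet = 650, 9222        # invented
deaths_nohelmet, total_nohelmet = 1000, 9222   # invented

rr = (deaths_helmet / total_helmet) / (deaths_nohelmet / total_nohelmet)

# Wald standard error of ln(RR) for cumulative-incidence data.
se = math.sqrt(1 / deaths_helmet - 1 / total_helmet
               + 1 / deaths_nohelmet - 1 / total_nohelmet)
lo = math.exp(math.log(rr) - 1.96 * se)
hi = math.exp(math.log(rr) + 1.96 * se)
print(round(rr, 2), round(lo, 2), round(hi, 2))
```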
Note that an overall measure of association, whether crude or adjusted, does not tell us whether the association is the same in various important groups. Whether or not a factor such as seat position is a confounder, it can define groups in which the association being measured is stronger or weaker (or absent). The authors investigated the possibility of effect measure modification by seat position and found a small difference: an adjusted relative risk of death of 0.65 for helmeted compared to unhelmeted drivers, and a slightly stronger association of 0.58 for helmeted versus unhelmeted passengers, though the confidence intervals overlapped considerably. (The authors also tried examining both seat position and sex simultaneously, but the two factors were very strongly related: 97.4% of the women were passengers.)
However, whether or not the crash involved a collision was indeed a powerful effect modifier. In the 88% of crashes involving a collision with a vehicle or object, the adjusted relative risk of death was 0.65 for a helmeted rider. By contrast, in crashes in which there was no collision (skidding, turning over), the adjusted relative risk was 0.36, so that helmet use appeared to be much more protective for non-collision crashes.