Lessons for humans from machine
learning: discovering potential drivers
of adolescent obesity
David H. Rehkopf
Associate Professor
Division of Primary Care and Population Health
Department of Medicine
Stanford University School of Medicine
University of Luxembourg
May 20, 2019
Broadest motivating question about
exposures of income, employment and
work
Coronary heart disease mortality, Massachusetts, 1989-1991, by
Census tract poverty (Krieger et al. Am J Public Health 2005)
International Journal of Epidemiology, 39: 97-106 (2009)
Objectives of talk
Introduction: understand what is meant by “machine
learning methods”
① Control for confounding: understand how machine
learning methods may be useful in certain situations
② Subgroup effects: understand how machine learning
methods may be useful in determining what subgroups
may be most important
③ Relative importance: understand how machine learning
methods may be useful in determining the relative
importance of a large number of predictors.
Synthesis: understand basic concerns with these
applications and a few guidelines for when they may
complement or be used instead of traditional regression
approaches.
the best machine learning algorithm
according to Andrew Ng…
[Started Machine learning MOOC at Stanford…over 8
million people have enrolled...]
logistic regression!!!!
Machine learning in general…
It generally means giving computers the ability to learn
without being explicitly programmed.
It generally means letting the computer specify the
model (or part of the model) rather than the human.
Machine learning for social and medical
scientists…
Some problems that the machine may
help to address…
① I believe the true form between my dependent and
independent variable is non-linear but I’m not sure exactly
the best function.
② I just want to explain as much variance as possible in my
dependent variable and I have a whole bunch of predictor
variables.
③ I am interested in identifying subgroups of individuals who
vary in the outcome rather than the strength of association
with a particular predictor variable.
④ I want to find the most parsimonious predictor model.
⑤ I have a whole bunch of potential control variables to use
and am not sure what the best model is.
Introduction to machine learning methods.
Data modeling (traditional, fit with estimation)
Algorithmic modeling (machine learning, fit with training)
http://cran.r-project.org/web/views/MachineLearning.html
Two types of machine learning algorithms
(among very many): recursive partitioning
and random forest
Recursive partitioning regression tree:
1. The algorithm examines each risk factor and picks the split point within a
variable that best differentiates the outcome of interest, dividing the
sample into two nodes.
2. It repeats this for each of the new nodes.
3. This process continues to expand the number of nodes in the tree as long
as a new split results in two groups where the difference in the outcome
is significant at the p<0.05 level (or many other options for this rule).
Random forest:
4. This procedure is repeated for a large number of trees (2500), with each
one created from a subset of the overall data.
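Step 1 above can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the function name best_split and the toy data are made up, and the split is scored by the summed squared error of the two nodes (a common regression-tree criterion) rather than by the p<0.05 rule mentioned in step 3.

```r
# Find the cut point of one predictor that minimizes the summed
# squared error of the two resulting nodes (assumed criterion).
# Assumes x has at least two distinct values.
best_split <- function(x, y) {
  values <- sort(unique(x))
  # Candidate cut points: midpoints between consecutive observed values
  cuts <- (head(values, -1) + tail(values, -1)) / 2
  sse <- sapply(cuts, function(cut) {
    left <- y[x <= cut]
    right <- y[x > cut]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  list(cut = cuts[which.min(sse)], sse = min(sse))
}

# Toy data (made up): the outcome jumps once x exceeds 3
x <- c(1, 2, 3, 4, 5, 6)
y <- c(1.0, 1.2, 0.9, 3.1, 2.9, 3.0)
result <- best_split(x, y)
result$cut  # 3.5
```

A full tree would apply the same search to every predictor, keep the best one, and then repeat within each new node.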
What makes this different?
Algorithm picks the best model from variables selected
by the investigator.
1. Can use a large number of predictor variables.
2. Best model is fit by creating models on internal subsets
of the data and testing fit on a separate part of the data.
3. Automated search for split points allows for best fit to
functional form of relationship between predictor and
outcome.
4. Tree based approach allows visualization of groups that
may differ greatly from overall population effects.
5. Random forest greatly increases robustness of process
by repeating multiple times on further subsets of the
data.
1. Control for confounding: understand
how machine learning methods may be
useful in certain situations
A confounder:
1. Causes exposure
2. Causes outcome
3. Is not on the causal pathway between exposure and outcome
4. Is not caused by exposure and outcome
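A small simulation can make these criteria concrete. The sketch below uses made-up data and made-up effect sizes (0.8 and 1.0): a single variable causes both exposure and outcome, while the exposure itself has no effect. The names confounder, exposure and outcome are illustrative only.

```r
set.seed(2019)
n <- 5000
confounder <- rnorm(n)                     # e.g. an area-level characteristic
exposure   <- 0.8 * confounder + rnorm(n)  # confounder causes exposure
outcome    <- 1.0 * confounder + rnorm(n)  # confounder causes outcome;
                                           # the true exposure effect is zero

# Crude model ignores the confounder; adjusted model controls for it
crude    <- unname(coef(lm(outcome ~ exposure))["exposure"])
adjusted <- unname(coef(lm(outcome ~ exposure + confounder))["exposure"])
round(c(crude = crude, adjusted = adjusted), 2)
```

Under these assumptions the crude coefficient is expected to be about 0.8 / 1.64 ≈ 0.49 even though the true effect is zero, while the adjusted coefficient should be close to zero.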
the problem…
An even bigger problem…
Example: Work context and
hypertension
Population. An occupational cohort obtained from 47
United States manufacturing plants (n=10,545).
Predictors. Aggregated plant characteristics: Job Satisfaction,
Feelings toward Management, Workplace Involvement,
Work Stress
Outcome. Incident hypertension
Confounders. We consider individual characteristics of
plant composition that may impact exposure and
outcome: gender, wages, race, grade, employment type.
Geographic variation in health
(Murray et al. Plos Med 2006)
Concern for confounding by county
characteristics - DAG
[Figure: DAG with nodes for county characteristics, plant characteristics, demographic characteristics of plant, and hypertension]
Problem of area level confounders – good
news/bad news.
Bad news: there are likely to be confounding variables that are not at
the individual level.
Bad news: because exposures may be ecological or area level,
individual-level covariates may not be useful in blocking all the
effects of more macro-social variables.
Good news: there are free and comprehensive data sources of
potential area-level confounders.
Bad news: there is not strong evidence for the choice of
which of these measures is most appropriate for a particular
outcome.
Bad news: the functional form of the relationship between each measure
and an outcome of interest is largely unknown.
Good news: most of this bad news can be addressed by machine
learning methods.
County characteristics (68)
'Census 2000 total resident population, 4/1/00', 'Percent population change, 4/1/00 to 7/1/05', 'Births 4/1/00 to 7/1/00', 'Deaths
4/1/00 to 7/1/00', 'Net international migration 4/1/00 to 7/1/00', 'Census 2000 housing units, 4/1/00', 'Percent housing unit
growth, base 4/1/00 to 7/1/05', 'Percent of resident population aged 0 to 14 years, 7/1/05', 'Percent of resident population
aged 15 to 64 years, 7/1/05', 'Percent of resident population aged 65 years and over, 7/1/05', 'Percent of resident population
aged 85 years and over, 7/1/05', 'Sex Ratio, 7/1/05', 'Median age of total resident population, 7/1/05', 'Median age of male
resident population, 7/1/05', 'Median age of female resident population, 7/1/05', 'Percent of resident population white alone,
7/1/05', 'Percent of resident population black alone, 7/1/05', 'Percent of resident population American Indian and Alaska
native alone, 7/1/05', 'Percent of resident population Asian alone, 7/1/05', 'Percent of resident population Native Hawaiian
and other Pacific islander alone, 7/1/05', 'Percent of resident population of two or more races, 7/1/05', 'Percent of resident
population non-Hispanic, 7/1/05', 'Labor force, annual average estimate, 2005', 'Unemployment rate, annual average
estimate, 2005', '2004 ERS Economic Type', '2004 ERS Policy Type: Housing stress', '2004 ERS Policy Type: Low-
education', '2004 ERS Policy Type: Low-employment', '2004 ERS Policy Type: Persistent poverty', '2004 ERS Policy Type:
Population loss', '2004 ERS Policy Type: Nonmetropolitan recreation', '2004 ERS Policy Type: Retirement destination', '2003
ERS Urban Influence Code', '2003 ERS Rural-Urban Continuum Code', 'Population (persons), 2005', 'Per capita personal
income (dollars), 2005', 'Contributions for government social insurance ($1,000s), 2005', 'Contributions for government social
insurance: Employee and self-employed contributions for government social insurance ($1,000s), 2005', 'Contributions for
government social insurance: Employer contributions for government social insurance ($1,000s), 2005', 'Adjustment for
residence ($1,000s), 2005', 'Net earnings by place of residence ($1,000s), 2005', 'Dividends, interest, and rent ($1,000s),
2005', 'Personal current transfer receipts ($1,000s), 2005', 'Per capita net earnings by place of residence, 2005', 'Per capita
personal current transfer receipts, 2005', 'Per capita income maintenance, 2005', 'Per capita unemployment insurance
benefits, 2005', 'Per capita retirement and other benefits, 2005', 'Per capita dividends, interest, and rent, 2005', 'Average
earnings per job, 2005', 'Average wage and salary disbursements per job, 2005', 'Average nonfarm proprietors'' income,
2005', 'Standardized score for mean temperature for January, 1941-1970', 'Standardized score for mean hours of sunlight for
January, 1941-1970', 'Standardized score for mean temperature for July, 1941-1970', 'Standardized score for mean relative
humidity for July, 1941-1970', 'Standardized score for land surface form topography code', 'Standardized score for natural log
of percent water area', 'ERS Natural Amenity Scale', 'ERS Natural Amenity Rank', '2004 IECC (supplement to 2003 IECC)
Climate Zone', '2004 IECC (supplement to 2003 IECC) warm-humid counties', 'Koppen classification corresponding to 2004
IECC Climate Zone', 'America Climate Region', 'Index crime rate (per 100,000 persons), 2004', '2004 presidential election:
Percent of votes for Bush', '2004 presidential election: Percent of votes for Kerry', '2004 presidential election: Percent of
votes for other candidates'
Data adaptive search for County Characteristics Predicting
Exposures and Outcome
(variable importance measures from Random Forest
Algorithm)
Relative importance   Job Environment Exposure        Hypertension
1.                    Social insurance expenditures   Income maintenance
2.                    Earnings                        Unemployment benefits
3.                    Population size                 Retirement benefits
4.                    Income maintenance              Earnings
5.                    % white                         Population size
6.                    Retirement benefits             Urbanicity
7.                    % pop age 0-14                  Urban influence
Estimates of Association between Workplace Social
Characteristics (baseline and change 2006-2008) and
Incident Hypertension (2006-2008)
Model 1: individual-level covariates. Model 2: also including 10 ecological confounders.
Entries are Estimate (SE).

                           Baseline                               Change
                           Model 1            Model 2             Model 1            Model 2
Job satisfaction           -0.013 (0.0067)    -0.060 (0.032)      0.002 (0.00093)    0.037 (0.041)
Perception of management   0.0016 (0.011)     -0.051 (0.042)      -0.054** (0.012)   -0.0019 (0.068)
Workplace involvement      -0.0261* (0.0077)  -0.040 (0.024)      -0.0086 (0.0088)   0.0030 (0.0040)
Work stress                0.0047 (0.010)     0.036 (0.034)       0.027* (0.014)     0.032 (0.054)
2. Subgroup effects: understand how
machine learning methods may be
useful in determining what subgroups
may be most important
Limitations to traditional approach to
subgroup analysis and interaction in a
regression framework
Traditional regression approach: A priori decide on
subgroups to examine or look for subgroup
effects ad hoc in the data.
Limitations: Literature may not be rich enough to
know what is important, multiple comparison
issues, can only examine a limited set of
covariates.
Regression trees
Recursive partitioning is an automated method for creating
a regression tree.
1. Splitting (partitioning)
2. When to stop (terminal nodes)
3. Pruning (optimized by 10-fold cross-validation)
Implemented using the rpart package in R, by Terry Therneau
and Beth Atkinson, based on the work of Leo Breiman.
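The three steps can be sketched with rpart on simulated data. This is an assumption-laden illustration, not the analysis from the talk: the data, the variable names (x1, x2, x3, y) and the starting complexity parameter cp = 0.001 are all made up. The tree is grown large, and the cross-validated error in the cptable is then used to choose how far to prune.

```r
library(rpart)

set.seed(1)
n <- 500
d <- data.frame(x1 = runif(n), x2 = runif(n), x3 = runif(n))
# Made-up outcome: a step change in x1 only; x2 and x3 are noise
d$y <- ifelse(d$x1 > 0.5, 2, 0) + rnorm(n, sd = 0.5)

# Steps 1 and 2: grow a deliberately large tree; xval = 10 requests
# 10-fold cross-validation of each candidate subtree
fit <- rpart(y ~ x1 + x2 + x3, data = d, method = "anova",
             control = rpart.control(cp = 0.001, xval = 10))

# Step 3: prune back to the subtree with the lowest cross-validated error
cp_tab  <- fit$cptable
best_cp <- cp_tab[which.min(cp_tab[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)
print(pruned)
```

Some analysts prefer the one-standard-error rule (the smallest tree within one xstd of the minimum xerror) over the minimum itself; either choice can be read off the same cptable.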
NHLBI Growth and Health Study, Research
question.
Question: What subgroups best predict change in BMI,
allowing for interactions and non-linear prediction?
NHLBI Growth and Health Study (1987-1997), age 9-10 at
baseline.
A total of 2379 girls (1213 black and 1166 white)
Detailed social, environmental, economic, psychological,
behavioral and dietary data
Social and individual factors predicting
change in BMI, girls, age 9-19
Model: rpart() (rpart package) and ctree() (party package) in R
Outcome: BMI change from age 9 to 19
Predictors:
Dietary intake and eating behaviors: total kcal, % kcal from fat, % kcal from protein, eats
breakfast, eats snack food, eats fast food, eats while watching television, eats with soda
on the table*, family eats dinner together*, eats dinner alone*, time to eat dinner*
Behavioral: physical activity
Psychological: body dissatisfaction (EDI)**, bulimia (EDI)**, distrust (EDI)**, drive for
thinness (EDI)**, ineffective (EDI)**, interoceptive awareness (EDI)**, maturity fears
(EDI)**, perfection (EDI)**, self-worth (Harter), physical appearance (Harter), social
acceptance (Harter), athletic competence (Harter), behavioral conduct (Harter),
cognitive restructuring (Tobin), express emotions (Tobin), self criticism (Tobin), anxiety
(Reynolds)**, perceived stress (Cohen)**, emotional eating index
Social: number of siblings, race (black vs. white), male currently in household, income,
education
Parent Health: BMI*, self-reported health*, physical activity (self-reported)*, importance of
exercise*, depression*
Second application: For whom is BMI
most affected by the EITC policy?
Research question: What is the impact of a large
poverty reduction policy (the Earned Income
Tax Credit) on child BMI?
1. We may care about subgroups of the
population who do not benefit.
2. It may give us insights into why or why not the
policy is effective.
3. We may want to change or supplement the
policy to reach all population groups.
Model based recursive partitioning
The “party” package in R, using the mob() function.
Data: NLSY79 Children, ages 2-18, years 1984-2008
Outcome: BMI percentile
Fit a structural part of the model with known confounders
and exposure of interest, and then scan over remaining
covariates to examine if effects differ by subgroups.
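library(party)  # mob(), mob_control(); linearModel is from its modeltools dependency
# Left of "|": structural model (exposure adjeitc plus sex).
# Right of "|": covariates scanned for subgroup differences.
bmimob1 <- mob(bmipctdif ~ adjeitc + sex | agemos + year + mar + div +
    hrswrk + dep + region + flchild + southbirth + southchild +
    rosen + rotter + pearlin + cesd + black + other + region1 +
    region2 + region3 + urbanR + urbanchildR + eduR + afqtR + momeduR +
    dadeduR + fbR + hisp + adjinc + healthySum + fastfoodSum +
    unhealthySum,
  data = eitcNLSY, model = linearModel,
  control = mob_control(minsplit = 26))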
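The idea of scanning covariates for subgroup differences can be illustrated in base R. The sketch below is a simplified stand-in, not the procedure mob() uses (mob() relies on parameter instability tests and splits recursively). Here each candidate covariate is tested once, by asking whether letting the structural model vary with it improves fit. The data, the variable names (exposure, z1, z2, outcome) and the helper scan_covariates are all hypothetical.

```r
set.seed(7)
n <- 1000
d <- data.frame(
  exposure = rnorm(n),
  z1 = rbinom(n, 1, 0.5),  # effect modifier by construction
  z2 = rbinom(n, 1, 0.5)   # unrelated covariate
)
# Made-up outcome: the exposure effect is 1 when z1 == 1, else 0
d$outcome <- d$exposure * d$z1 + rnorm(n)

# For each candidate, compare the structural model with a model whose
# intercept and exposure slope are allowed to differ by that candidate
scan_covariates <- function(data, candidates) {
  base <- lm(outcome ~ exposure, data = data)
  sapply(candidates, function(z) {
    terms <- c("exposure", z, paste0("exposure:", z))
    full <- lm(reformulate(terms, response = "outcome"), data = data)
    anova(base, full)[2, "Pr(>F)"]
  })
}

p_values <- scan_covariates(d, c("z1", "z2"))
names(which.min(p_values))  # candidate with the strongest evidence
```

In a real analysis the p-values would need adjustment for the number of candidates scanned, and the search would be repeated within each resulting subgroup.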
Significant subgroups of EITC
predicting BMI percentile, NLSY1979
Children, age 2-18
[Figure: model-based recursive partitioning tree; the plots inside each terminal node are not reproduced]
year ≤ 1998 (node 1, p < 0.001)
  agemos ≤ 84 (node 2, p < 0.001)
    agemos ≤ 31 (node 3, p < 0.001): Node 4 (n = 427)
    agemos > 31
      black ≤ 0 (node 5, p = 0.028): Node 6 (n = 1434)
      black > 0: Node 7 (n = 633)
  agemos > 84
    agemos ≤ 149 (node 8, p < 0.001)
      afqtR ≤ 7282 (node 9, p = 0.002)
        year ≤ 1994 (node 10, p = 0.004): Node 11 (n = 169)
        year > 1994: Node 12 (n = 35)
      afqtR > 7282
        black ≤ 0 (node 13, p < 0.001)
          dadeduR ≤ 4 (node 14, p = 0.012): Node 15 (n = 58)
          dadeduR > 4: Node 16 (n = 760)
        black > 0: Node 17 (n = 428)
    agemos > 149: Node 18 (n = 453)
year > 1998: Node 19 (n = 909)
3. Relative importance: understand
how machine learning methods may be
useful in determining the relative
importance of a large number of
predictors.
Random forest approach
Creates multiple decision trees based on random selection
of observations
Evaluates how good decision trees are in predicting
outcomes among those individuals not used to create the
decision tree
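Variable importance is measured as the average loss in prediction
accuracy, among individuals not used to create each tree, when a single
variable of interest is randomly permuted (an alternative measure
averages each variable's decrease in node impurity).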
Implemented using the randomForest package in R, an R port
by Andy Liaw and Matthew Wiener based on original
Fortran code by Leo Breiman and Adele Cutler.
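The logic of the approach can be sketched with bootstrapped rpart trees. This is an illustration under stated assumptions rather than the randomForest algorithm itself: it omits the random selection of candidate variables at each split, uses only 50 trees, and runs on made-up data in which x1 matters most and x3 is pure noise.

```r
library(rpart)

set.seed(42)
n <- 400
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 2 * d$x1 + 0.5 * d$x2 + rnorm(n)  # made-up outcome; x3 is noise

n_trees <- 50
predictors <- c("x1", "x2", "x3")
importance <- matrix(0, nrow = n_trees, ncol = length(predictors),
                     dimnames = list(NULL, predictors))

for (b in seq_len(n_trees)) {
  in_bag <- sample(n, replace = TRUE)      # random selection of observations
  oob <- setdiff(seq_len(n), in_bag)       # individuals not used for this tree
  tree <- rpart(y ~ x1 + x2 + x3, data = d[in_bag, ])
  oob_data <- d[oob, ]
  base_mse <- mean((oob_data$y - predict(tree, oob_data))^2)
  for (v in predictors) {
    shuffled <- oob_data
    shuffled[[v]] <- sample(shuffled[[v]])  # randomize one variable
    perm_mse <- mean((shuffled$y - predict(tree, shuffled))^2)
    importance[b, v] <- perm_mse - base_mse # loss in out-of-bag accuracy
  }
}

# Average over trees; larger values mean the variable matters more
var_importance <- sort(colMeans(importance), decreasing = TRUE)
print(round(var_importance, 3))
```

A variable whose permutation barely changes the out-of-bag error (here, x3 by construction) contributes little to prediction.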
Random forest
Social and individual predictors of
adolescent obesity
Model: randomForest() in the randomForest package in R
Outcome: BMI age-sex specific percentile change from age 9 to 19
Predictors:
Dietary intake and eating behaviors: total kcal, % kcal from fat, % kcal from protein, eats
breakfast, eats snack food, eats fast food, eats while watching television, eats with soda
on the table*, family eats dinner together*, eats dinner alone*, time to eat dinner*
Behavioral: physical activity
Psychological: body dissatisfaction (EDI)**, bulimia (EDI)**, distrust (EDI)**, drive for
thinness (EDI)**, ineffective (EDI)**, interoceptive awareness (EDI)**, maturity fears
(EDI)**, perfection (EDI)**, self-worth (Harter), physical appearance (Harter), social
acceptance (Harter), athletic competence (Harter), behavioral conduct (Harter),
cognitive restructuring (Tobin), express emotions (Tobin), self criticism (Tobin), anxiety
(Reynolds)**, perceived stress (Cohen)**, emotional eating index
Social: number of siblings, race (black vs. white), male currently in household, income,
education
Parent Health: BMI*, self-reported health*, physical activity (self-reported)*, importance of
exercise*, depression*
Synthesis 1: Control for confounding: understand
how machine learning methods may be useful
in certain situations
Limitations:
1. Must be careful not to include colliders
2. Could potentially reduce power
Strengths:
1. Decreasing bias
2. Efficient
3. Reproducible
When:
1. Causal model not well understood
2. Clear priors about causal ordering
3. Large number of potential confounders
Synthesis 2: Subgroup effects: understand how
machine learning methods may be useful in
determining what subgroups may be most
important
Limitations of Recursive partitioning based approaches:
1. Not completely hypothesis driven (but inputs are)
2. Limited in cross-validation
Strengths:
1. Hypothesis generating
2. Checking robustness of results
3. Explicit approach for examining heterogeneity
4. Considering subgroup and interactions as fundamental
When:
1. A priori concerns or interest in heterogeneity of effects
2. Unclear priors in literature about where subgroup effects may be
most important.
3. Results can be replicated in another dataset
Synthesis 3: Relative importance: understand how
machine learning methods may be useful in
determining the relative importance of a large
number of predictors.
Limitations:
1. Impacted by differential measurement error
2. Direction of cause ambiguous
Strengths:
1. Broader view of potential causes
2. Considering subgroup and interactions as fundamental
3. Multiple regression trees yield results that are typically more
stable than traditional regression results
When:
1. Prior knowledge of large number of potential risk factors
2. Large number of well-measured covariates
3. Combine with matching approaches for causal inference
Thank you.
drehkopf@stanford.edu
@drehkopf
Funding: NIA (2014-present) (K01 AG047280)

VIP Call Girls Pune Vani 9907093804 Short 1500 Night 6000 Best call girls Ser...
 
♛VVIP Hyderabad Call Girls Chintalkunta🖕7001035870🖕Riya Kappor Top Call Girl ...
♛VVIP Hyderabad Call Girls Chintalkunta🖕7001035870🖕Riya Kappor Top Call Girl ...♛VVIP Hyderabad Call Girls Chintalkunta🖕7001035870🖕Riya Kappor Top Call Girl ...
♛VVIP Hyderabad Call Girls Chintalkunta🖕7001035870🖕Riya Kappor Top Call Girl ...
 
Call Girls Colaba Mumbai ❤️ 9920874524 👈 Cash on Delivery
Call Girls Colaba Mumbai ❤️ 9920874524 👈 Cash on DeliveryCall Girls Colaba Mumbai ❤️ 9920874524 👈 Cash on Delivery
Call Girls Colaba Mumbai ❤️ 9920874524 👈 Cash on Delivery
 
Low Rate Call Girls Patna Anika 8250192130 Independent Escort Service Patna
Low Rate Call Girls Patna Anika 8250192130 Independent Escort Service PatnaLow Rate Call Girls Patna Anika 8250192130 Independent Escort Service Patna
Low Rate Call Girls Patna Anika 8250192130 Independent Escort Service Patna
 
Call Girls Ludhiana Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Ludhiana Just Call 9907093804 Top Class Call Girl Service AvailableCall Girls Ludhiana Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Ludhiana Just Call 9907093804 Top Class Call Girl Service Available
 
Bangalore Call Girls Majestic 📞 9907093804 High Profile Service 100% Safe
Bangalore Call Girls Majestic 📞 9907093804 High Profile Service 100% SafeBangalore Call Girls Majestic 📞 9907093804 High Profile Service 100% Safe
Bangalore Call Girls Majestic 📞 9907093804 High Profile Service 100% Safe
 
High Profile Call Girls Coimbatore Saanvi☎️ 8250192130 Independent Escort Se...
High Profile Call Girls Coimbatore Saanvi☎️  8250192130 Independent Escort Se...High Profile Call Girls Coimbatore Saanvi☎️  8250192130 Independent Escort Se...
High Profile Call Girls Coimbatore Saanvi☎️ 8250192130 Independent Escort Se...
 
VIP Russian Call Girls in Varanasi Samaira 8250192130 Independent Escort Serv...
VIP Russian Call Girls in Varanasi Samaira 8250192130 Independent Escort Serv...VIP Russian Call Girls in Varanasi Samaira 8250192130 Independent Escort Serv...
VIP Russian Call Girls in Varanasi Samaira 8250192130 Independent Escort Serv...
 
Escort Service Call Girls In Sarita Vihar,, 99530°56974 Delhi NCR
Escort Service Call Girls In Sarita Vihar,, 99530°56974 Delhi NCREscort Service Call Girls In Sarita Vihar,, 99530°56974 Delhi NCR
Escort Service Call Girls In Sarita Vihar,, 99530°56974 Delhi NCR
 
Call Girl Number in Panvel Mumbai📲 9833363713 💞 Full Night Enjoy
Call Girl Number in Panvel Mumbai📲 9833363713 💞 Full Night EnjoyCall Girl Number in Panvel Mumbai📲 9833363713 💞 Full Night Enjoy
Call Girl Number in Panvel Mumbai📲 9833363713 💞 Full Night Enjoy
 
VIP Service Call Girls Sindhi Colony 📳 7877925207 For 18+ VIP Call Girl At Th...
VIP Service Call Girls Sindhi Colony 📳 7877925207 For 18+ VIP Call Girl At Th...VIP Service Call Girls Sindhi Colony 📳 7877925207 For 18+ VIP Call Girl At Th...
VIP Service Call Girls Sindhi Colony 📳 7877925207 For 18+ VIP Call Girl At Th...
 
Lucknow Call girls - 8800925952 - 24x7 service with hotel room
Lucknow Call girls - 8800925952 - 24x7 service with hotel roomLucknow Call girls - 8800925952 - 24x7 service with hotel room
Lucknow Call girls - 8800925952 - 24x7 service with hotel room
 
Russian Call Girls in Pune Tanvi 9907093804 Short 1500 Night 6000 Best call g...
Russian Call Girls in Pune Tanvi 9907093804 Short 1500 Night 6000 Best call g...Russian Call Girls in Pune Tanvi 9907093804 Short 1500 Night 6000 Best call g...
Russian Call Girls in Pune Tanvi 9907093804 Short 1500 Night 6000 Best call g...
 
Artifacts in Nuclear Medicine with Identifying and resolving artifacts.
Artifacts in Nuclear Medicine with Identifying and resolving artifacts.Artifacts in Nuclear Medicine with Identifying and resolving artifacts.
Artifacts in Nuclear Medicine with Identifying and resolving artifacts.
 
Best Rate (Hyderabad) Call Girls Jahanuma ⟟ 8250192130 ⟟ High Class Call Girl...
Best Rate (Hyderabad) Call Girls Jahanuma ⟟ 8250192130 ⟟ High Class Call Girl...Best Rate (Hyderabad) Call Girls Jahanuma ⟟ 8250192130 ⟟ High Class Call Girl...
Best Rate (Hyderabad) Call Girls Jahanuma ⟟ 8250192130 ⟟ High Class Call Girl...
 
Call Girls Service In Shyam Nagar Whatsapp 8445551418 Independent Escort Service
Call Girls Service In Shyam Nagar Whatsapp 8445551418 Independent Escort ServiceCall Girls Service In Shyam Nagar Whatsapp 8445551418 Independent Escort Service
Call Girls Service In Shyam Nagar Whatsapp 8445551418 Independent Escort Service
 
CALL ON ➥9907093804 🔝 Call Girls Hadapsar ( Pune) Girls Service
CALL ON ➥9907093804 🔝 Call Girls Hadapsar ( Pune)  Girls ServiceCALL ON ➥9907093804 🔝 Call Girls Hadapsar ( Pune)  Girls Service
CALL ON ➥9907093804 🔝 Call Girls Hadapsar ( Pune) Girls Service
 
Call Girls Darjeeling Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Darjeeling Just Call 9907093804 Top Class Call Girl Service AvailableCall Girls Darjeeling Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Darjeeling Just Call 9907093804 Top Class Call Girl Service Available
 

19 lu_machine_learning_final

  • 1. Lessons for humans from machine learning: discovering potential drivers of adolescent obesity. David H. Rehkopf, Associate Professor, Division of Primary Care and Population Health, Department of Medicine, Stanford University School of Medicine. University of Luxembourg, May 20, 2019
  • 2. Broadest motivating question about exposures of income, employment and work. Coronary heart disease mortality, Massachusetts, 1989-1991, by Census tract poverty (Krieger et al. Am J Public Health 2005)
  • 3. International Journal of Epidemiology, 39: 97-106 (2009)
  • 5. Objectives of talk. Introduction: understand what is meant by “machine learning methods”. ① Control for confounding: understand how machine learning methods may be useful in certain situations. ② Subgroup effects: understand how machine learning methods may be useful in determining what subgroups may be most important. ③ Relative importance: understand how machine learning methods may be useful in determining the relative importance of a large number of predictors. Synthesis: understand basic concerns with these applications and a few guidelines for when they may complement or be used instead of traditional regression approaches.
  • 6. The best machine learning algorithm according to Andrew Ng… [Started machine learning MOOC at Stanford… over 8 million people have enrolled...]
  • 8. Machine learning in general… It generally means giving computers the ability to learn without being explicitly programmed. It generally means letting the computer specify the model (or part of the model) rather than the human. Machine learning for social and medical scientists…
  • 9. Some problems that the machine may help to address… ① I believe the true form of the relationship between my dependent and independent variables is non-linear, but I’m not sure exactly what the best function is. ② I just want to explain as much variance as possible in my dependent variable and I have a whole bunch of predictor variables. ③ I am interested in identifying subgroups of individuals who vary in the outcome rather than the strength of association with a particular predictor variable. ④ I want to find the most parsimonious predictor model. ⑤ I have a whole bunch of potential control variables to use and am not sure what the best model is.
  • 14. Two types of machine learning algorithms (among very many): recursive partitioning and random forest. Recursive partitioning (regression tree): 1. The algorithm examines each risk factor and picks the split point within a variable that best differentiates the outcome of interest, dividing the sample into two nodes. 2. It repeats this for each of the new nodes. 3. The process continues to expand the number of nodes in the tree as long as a new split results in two groups whose difference in the outcome is significant at the p<0.05 level (many other stopping rules are possible). Random forest: 4. This procedure is repeated for a large number of trees (2500), each one created from a subset of the overall data.
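The split search in step 1 can be sketched in a few lines. This is a minimal illustration in plain Python, not the talk's implementation (the talk used R packages). It assumes the split criterion is the reduction in squared error, which is a common choice but not the p<0.05 rule mentioned on the slide; the names `best_split`, `choose_variable`, and the toy data are hypothetical.

```python
def sse(values):
    """Sum of squared deviations from the mean (regression node impurity)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y, min_node_size=1):
    """Return (threshold, impurity_reduction) for the best binary split of y on x.

    Candidate thresholds are midpoints between consecutive distinct x values.
    Returns (None, 0.0) when no split improves on the unsplit node.
    """
    pairs = sorted(zip(x, y))
    parent = sse([v for _, v in pairs])
    best_threshold, best_gain = None, 0.0
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # cannot split between identical predictor values
        left = [v for _, v in pairs[:i]]
        right = [v for _, v in pairs[i:]]
        if len(left) < min_node_size or len(right) < min_node_size:
            continue
        gain = parent - sse(left) - sse(right)
        if gain > best_gain:
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_gain = gain
    return best_threshold, best_gain


def choose_variable(predictors, y):
    """Pick the predictor whose best split reduces impurity the most."""
    results = {name: best_split(col, y) for name, col in predictors.items()}
    name = max(results, key=lambda k: results[k][1])
    return name, results[name][0]


# Hypothetical toy data: the outcome drops sharply once activity exceeds 3.
activity = [1, 2, 3, 4, 5, 6]
siblings = [0, 1, 0, 1, 0, 1]
bmi_change = [5.0, 5.2, 4.9, 1.0, 1.1, 0.9]

split_variable, split_point = choose_variable(
    {"activity": activity, "siblings": siblings}, bmi_change
)
```

Steps 2 and 3 apply the same search recursively to each resulting node until the chosen stopping rule is met.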
  • 15. What makes this different? Algorithm picks the best model from variables selected by the investigator. 1. Can use a large number of predictor variables. 2. Best model is fit by creating models on internal subsets of the data and testing fit on a separate part of the data. 3. Automated search for split points allows for best fit to functional form of relationship between predictor and outcome. 4. Tree based approach allows visualization of groups that may differ greatly from overall population effects. 5. Random forest greatly increases robustness of process by repeating multiple times on further subsets of the data.
  • 16. 1. Control for confounding: understand how machine learning methods may be useful in certain situations. Criteria for a confounder: 1. Causes exposure 2. Causes outcome 3. Not on causal pathway between exposure and outcome 4. Not caused by exposure and outcome
  • 18. An even bigger problem…
  • 19. Example: Work context and hypertension. Population: An occupational cohort obtained from 47 United States manufacturing plants (n=10,545). Predictors: Aggregated plant characteristics: job satisfaction, feelings toward management, workplace involvement, work stress. Outcome: Incident hypertension. Confounders: We consider individual characteristics of plant composition that may impact exposure and outcome: gender, wages, race, grade, employment type.
  • 20. Geographic variation in health (Murray et al. PLoS Med 2006)
  • 21. Concern for confounding by county characteristics - DAG. [Figure: DAG with nodes for county characteristics, demographic characteristics of plant, plant characteristics, and hypertension]
  • 22. Problem of area level confounders: good news/bad news. The bad news is that there are likely to be confounding variables that are not at the individual level. More bad news: because exposures may be ecological or area level, individual level covariates may not be useful in blocking all the effects of more macro-social variables. The good news is that free and comprehensive data sources of potential area level confounders are available. The bad news is that there is not strong evidence for choosing which of these measures is most appropriate for a particular outcome, and that the functional form of the dependence of each on an outcome of interest is largely unknown. The good news is that most of this bad news can be addressed by machine learning methods.
  • 23. County characteristics (68): 'Census 2000 total resident population, 4/1/00', 'Percent population change, 4/1/00 to 7/1/05', 'Births 4/1/00 to 7/1/00', 'Deaths 4/1/00 to 7/1/00', 'Net international migration 4/1/00 to 7/1/00', 'Census 2000 housing units, 4/1/00', 'Percent housing unit growth, base 4/1/00 to 7/1/05', 'Percent of resident population aged 0 to 14 years, 7/1/05', 'Percent of resident population aged 15 to 64 years, 7/1/05', 'Percent of resident population aged 65 years and over, 7/1/05', 'Percent of resident population aged 85 years and over, 7/1/05', 'Sex Ratio, 7/1/05', 'Median age of total resident population, 7/1/05', 'Median age of male resident population, 7/1/05', 'Median age of female resident population, 7/1/05', 'Percent of resident population white alone, 7/1/05', 'Percent of resident population black alone, 7/1/05', 'Percent of resident population American Indian and Alaska native alone, 7/1/05', 'Percent of resident population Asian alone, 7/1/05', 'Percent of resident population Native Hawaiian and other Pacific islander alone, 7/1/05', 'Percent of resident population of two or more races, 7/1/05', 'Percent of resident population non-Hispanic, 7/1/05', 'Labor force, annual average estimate, 2005', 'Unemployment rate, annual average estimate, 2005', '2004 ERS Economic Type', '2004 ERS Policy Type: Housing stress', '2004 ERS Policy Type: Low-education', '2004 ERS Policy Type: Low-employment', '2004 ERS Policy Type: Persistent poverty', '2004 ERS Policy Type: Population loss', '2004 ERS Policy Type: Nonmetropolitan recreation', '2004 ERS Policy Type: Retirement destination', '2003 ERS Urban Influence Code', '2003 ERS Rural-Urban Continuum Code', 'Population (persons), 2005', 'Per capita personal income (dollars), 2005', 'Contributions for government social insurance ($1,000s), 2005', 'Contributions for government social insurance: Employee and self-employed contributions for government social insurance ($1,000s), 2005', 'Contributions for government social insurance: Employer contributions for government social insurance ($1,000s), 2005', 'Adjustment for residence ($1,000s), 2005', 'Net earnings by place of residence ($1,000s), 2005', 'Dividends, interest, and rent ($1,000s), 2005', 'Personal current transfer receipts ($1,000s), 2005', 'Per capita net earnings by place of residence, 2005', 'Per capita personal current transfer receipts, 2005', 'Per capita income maintenance, 2005', 'Per capita unemployment insurance benefits, 2005', 'Per capita retirement and other benefits, 2005', 'Per capita dividends, interest, and rent, 2005', 'Average earnings per job, 2005', 'Average wage and salary disbursements per job, 2005', 'Average nonfarm proprietors' income, 2005', 'Standardized score for mean temperature for January, 1941-1970', 'Standardized score for mean hours of sunlight for January, 1941-1970', 'Standardized score for mean temperature for July, 1941-1970', 'Standardized score for mean relative humidity for July, 1941-1970', 'Standardized score for land surface form topography code', 'Standardized score for natural log of percent water area', 'ERS Natural Amenity Scale', 'ERS Natural Amenity Rank', '2004 IECC (supplement to 2003 IECC) Climate Zone', '2004 IECC (supplement to 2003 IECC) warm-humid counties', 'Koppen classification corresponding to 2004 IECC Climate Zone', 'America Climate Region', 'Index crime rate (per 100,000 persons), 2004', '2004 presidential election: Percent of votes for Bush', '2004 presidential election: Percent of votes for Kerry', '2004 presidential election: Percent of votes for other candidates'
  • 24. Data adaptive search for county characteristics predicting exposures and outcome (variable importance measures from random forest algorithm). Relative importance, listed as rank: job environment exposure / hypertension. 1: Social insurance expenditures / Income maintenance. 2: Earnings / Unemployment benefits. 3: Population size / Retirement benefits. 4: Income maintenance / Earnings. 5: % white / Population size. 6: Retirement benefits / Urbanicity. 7: % pop age 0-14 / Urban influence.
  • 25. Estimates of association between workplace social characteristics (baseline and change 2006-2008) and incident hypertension (2006-2008). Model 1 includes individual level covariates; Model 2 also includes 10 ecological confounders. Values are estimate (SE), listed as Baseline Model 1; Baseline Model 2; Change Model 1; Change Model 2. Job satisfaction: -0.013 (0.0067); -0.060 (0.032); 0.002 (0.00093); 0.037 (0.041). Perception of management: 0.0016 (0.011); -0.051 (0.042); -0.054** (0.012); -0.0019 (0.068). Workplace involvement: -0.0261* (0.0077); -0.040 (0.024); -0.0086 (0.0088); 0.0030 (0.0040). Work stress: 0.0047 (0.010); 0.036 (0.034); 0.027* (0.014); 0.032 (0.054).
  • 26. 2. Subgroup effects: understand how machine learning methods may be useful in determining what subgroups may be most important
  • 27. Limitations to traditional approach to subgroup analysis and interaction in a regression framework Traditional regression approach: A priori decide on subgroups to examine or look for subgroup effects ad hoc in the data. Limitations: Literature may not be rich enough to know what is important, multiple comparison issues, can only examine a limited set of covariates.
  • 28. Regression trees. Recursive partitioning is an automated method for creating a regression tree. 1. Splitting (partitioning) 2. When to stop (terminal nodes) 3. Pruning (optimized by 10-fold cross-validation). Implemented using the rpart package in R, by Terry Therneau and Beth Atkinson, based on the work of Leo Breiman.
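The role of cross-validation in choosing tree size can be sketched as follows. This is a simplified stand-in written in plain Python: it picks a maximum tree depth by k-fold cross-validation, whereas rpart prunes using a cost-complexity parameter. The functions `fit_tree`, `cv_error`, and `select_depth` and the step-shaped toy data are hypothetical and single-predictor only.

```python
def sse(rows):
    """Sum of squared deviations of y from its mean within a node."""
    ys = [y for _, y in rows]
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)


def fit_tree(rows, depth):
    """Grow a one-predictor regression tree on (x, y) rows up to `depth` splits deep.

    Returns a float (leaf mean) or a tuple (threshold, left_subtree, right_subtree).
    """
    rows = sorted(rows)
    leaf = sum(y for _, y in rows) / len(rows)
    if depth == 0:
        return leaf
    best = None
    for i in range(1, len(rows)):
        if rows[i][0] == rows[i - 1][0]:
            continue  # no threshold separates identical x values
        cost = sse(rows[:i]) + sse(rows[i:])
        if best is None or cost < best[0]:
            best = (cost, (rows[i - 1][0] + rows[i][0]) / 2, rows[:i], rows[i:])
    if best is None:
        return leaf
    _, thr, left, right = best
    return (thr, fit_tree(left, depth - 1), fit_tree(right, depth - 1))


def predict(tree, x):
    """Follow thresholds down to a leaf; x below the threshold goes left."""
    while isinstance(tree, tuple):
        thr, left, right = tree
        tree = left if x < thr else right
    return tree


def cv_error(rows, depth, k=10):
    """Mean squared prediction error over k held-out folds."""
    total = 0.0
    for fold in range(k):
        train = [r for i, r in enumerate(rows) if i % k != fold]
        test = [r for i, r in enumerate(rows) if i % k == fold]
        tree = fit_tree(train, depth)
        total += sum((predict(tree, x) - y) ** 2 for x, y in test)
    return total / len(rows)


def select_depth(rows, depths, k=10):
    """Smallest depth among those with the lowest cross-validated error."""
    return min(depths, key=lambda d: cv_error(rows, d, k))


# Hypothetical step-shaped data: one split is enough, deeper trees add nothing.
rows = [(x, 0.0 if x < 10 else 10.0) for x in range(20)]
chosen_depth = select_depth(rows, [0, 1, 2])
```

Because `min` keeps the first of several tied candidates, listing depths in increasing order favors the simpler tree when extra splits do not lower held-out error.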
  • 29. NHLBI Growth and Health Study, Research question. Question: What subgroups best predict change in BMI, allowing for interactions and non-linear prediction? NHLBI Growth and Health Study (1987-1997), age 9-10 at baseline. A total of 2379 girls (1213 black and 1166 white) Detailed social, environmental, economic, psychological, behavioral and dietary data
  • 30. Social and individual factors predicting change in BMI, girls, age 9-19. Model: rpart() and ctree() functions in R. Outcome: BMI change from 9-19. Predictors: Dietary intake and eating behaviors: total kcal, % kcal from fat, % kcal from protein, eats breakfast, eats snack food, eats fast food, eats while watching television, eats with soda on the table*, family eats dinner together*, eats dinner alone*, time to eat dinner*. Behavioral: physical activity. Psychological: body dissatisfaction (EDI)**, bulimia (EDI)**, distrust (EDI)**, drive for thinness (EDI)**, ineffective (EDI)**, interoceptive awareness (EDI)**, maturity fears (EDI)**, perfection (EDI)**, self-worth (Harter), physical appearance (Harter), social acceptance (Harter), athletic competence (Harter), behavioral conduct (Harter), cognitive restructuring (Tobin), express emotions (Tobin), self criticism (Tobin), anxiety (Reynolds)**, perceived stress (Cohen)**, emotional eating index. Social: number of siblings, race (black vs. white), male currently in household, income, education. Parent health: BMI*, self-reported health*, physical activity (self-reported)*, importance of exercise*, depression*.
  • 31. Social and individual factors predicting change in BMI, girls, age 9-19
  • 32. Second application: For whom is BMI most affected by the EITC policy? Research question: What is the impact of a large poverty reduction policy (the Earned Income Tax Credit) on child BMI? 1. We may care about subgroups of the population who do not benefit. 2. It may give us insights into why or why not the policy is effective. 3. We may want to change or supplement the policy to reach all population groups.
  • 33. Model-based recursive partitioning: the party package in R, using the mob function. Data: NLSY79 Children, ages 2-18, years 1984-2008. Outcome: BMI percentile. Fit a structural part of the model with known confounders and the exposure of interest, and then scan over the remaining covariates to examine whether effects differ by subgroup. bmimob1 <- mob(bmipctdif ~ adjeitc + sex | agemos + year + mar + div + hrswrk + dep + region + flchild + southbirth + southchild + rosen + rotter + pearlin + cesd + black + other + region1 + region2 + region3 + urbanR + urbanchildR + eduR + afqtR + momeduR + dadeduR + fbR + hisp + adjinc + healthySum + fastfoodSum + unhealthySum, control = mob_control(minsplit = 26), data = eitcNLSY, model = linearModel)
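The idea of fitting a model in each subgroup and scanning a partitioning covariate can be sketched as below. This is a loose illustration in plain Python, not what mob does internally: mob selects splits using parameter instability tests, while this sketch simply picks the threshold that minimizes the combined residual sum of squares of two separate regressions. The helpers `ols` and `best_model_split`, the variable `z`, and the toy records are hypothetical.

```python
def ols(points):
    """Least squares of y on x with an intercept; returns (slope, intercept, rss)."""
    n = len(points)
    xbar = sum(x for x, _ in points) / n
    ybar = sum(y for _, y in points) / n
    sxx = sum((x - xbar) ** 2 for x, _ in points)
    sxy = sum((x - xbar) * (y - ybar) for x, y in points)
    slope = sxy / sxx if sxx > 0 else 0.0
    intercept = ybar - slope * xbar
    rss = sum((y - intercept - slope * x) ** 2 for x, y in points)
    return slope, intercept, rss


def best_model_split(records, min_size=4):
    """records: list of (exposure, outcome, z).

    Scan thresholds on the partitioning variable z, fit outcome ~ exposure
    separately on each side, and keep the threshold with the smallest
    combined residual sum of squares.
    Returns (threshold, slope_left, slope_right) or None.
    """
    records = sorted(records, key=lambda r: r[2])
    best = None
    for i in range(min_size, len(records) - min_size + 1):
        if records[i][2] == records[i - 1][2]:
            continue  # threshold must fall between distinct z values
        left = [(e, y) for e, y, _ in records[:i]]
        right = [(e, y) for e, y, _ in records[i:]]
        fit_left, fit_right = ols(left), ols(right)
        total = fit_left[2] + fit_right[2]
        if best is None or total < best[0]:
            threshold = (records[i - 1][2] + records[i][2]) / 2
            best = (total, threshold, fit_left[0], fit_right[0])
    return best[1:] if best else None


# Hypothetical data: the exposure effect is +2 when z < 5 and -1 otherwise.
records = [
    (e, (2.0 * e if z < 5 else -1.0 * e), z)
    for z in range(10)
    for e in range(4)
]
threshold, slope_low, slope_high = best_model_split(records)
```

The returned slopes are the subgroup-specific exposure effects, which is the quantity of interest when asking for whom a policy matters most.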
  • 34. Significant subgroups of EITC predicting BMI percentile, NLSY1979 Children, age 2-18. [Figure: model-based partitioning tree. Splits: node 1 year (p < 0.001; ≤ 1998 vs > 1998); node 2 agemos (p < 0.001; ≤ 84 vs > 84); node 3 agemos (p < 0.001; ≤ 31 vs > 31); node 5 black (p = 0.028; ≤ 0 vs > 0); node 8 agemos (p < 0.001; ≤ 149 vs > 149); node 9 afqtR (p = 0.002; ≤ 7282 vs > 7282); node 10 year (p = 0.004; ≤ 1994 vs > 1994); node 13 black (p < 0.001; ≤ 0 vs > 0); node 14 dadeduR (p = 0.012; ≤ 4 vs > 4). Terminal nodes: 4 (n = 427), 6 (n = 1434), 7 (n = 633), 11 (n = 169), 12 (n = 35), 15 (n = 58), 16 (n = 760), 17 (n = 428), 18 (n = 453), 19 (n = 909).]
  • 35. 3. Relative importance: understand how machine learning methods may be useful in determining the relative importance of a large number of predictors.
  • 36. Random forest approach. Creates multiple decision trees based on random selection of observations. Evaluates how good the decision trees are at predicting outcomes among those individuals not used to create the decision tree. The variable importance measure is the average change in node impurity comparing the final model with a model in which the single variable of interest is randomized. Implemented using the randomForest package in R, R port by Andy Liaw and Matthew Wiener based on original Fortran code by Leo Breiman and Adele Cutler.
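The out-of-sample evaluation and randomized-variable comparison can be sketched with bagged trees and permutation importance. This is a simplified illustration in plain Python under stated assumptions: it measures the increase in out-of-bag mean squared error after shuffling one variable, it omits the random selection of candidate variables at each split that a full random forest uses, and every name and data value here is hypothetical.

```python
import random


def sse(ys):
    """Sum of squared deviations from the mean."""
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)


def fit_tree(rows, names, depth):
    """rows: list of (features_dict, y). Returns a leaf mean or (name, thr, left, right)."""
    ys = [y for _, y in rows]
    leaf = sum(ys) / len(ys)
    if depth == 0:
        return leaf
    parent, best = sse(ys), None
    for name in names:
        ordered = sorted(rows, key=lambda r: r[0][name])
        for i in range(1, len(ordered)):
            lo, hi = ordered[i - 1][0][name], ordered[i][0][name]
            if lo == hi:
                continue
            gain = (parent - sse([y for _, y in ordered[:i]])
                    - sse([y for _, y in ordered[i:]]))
            # Split only when impurity actually falls; earlier variables win ties.
            if gain > 1e-12 and (best is None or gain > best[0]):
                best = (gain, name, (lo + hi) / 2, ordered[:i], ordered[i:])
    if best is None:
        return leaf
    _, name, thr, left, right = best
    return (name, thr, fit_tree(left, names, depth - 1), fit_tree(right, names, depth - 1))


def predict(tree, features):
    while isinstance(tree, tuple):
        name, thr, left, right = tree
        tree = left if features[name] < thr else right
    return tree


def mse(tree, rows):
    return sum((predict(tree, f) - y) ** 2 for f, y in rows) / len(rows)


def permutation_importance(rows, names, n_trees=25, depth=2, seed=0):
    """Average increase in out-of-bag MSE when one variable is shuffled."""
    rng = random.Random(seed)
    totals = {name: 0.0 for name in names}
    used = 0
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # bootstrap sample
        in_bag = set(idx)
        oob = [rows[i] for i in range(len(rows)) if i not in in_bag]
        if not oob:
            continue
        tree = fit_tree([rows[i] for i in idx], names, depth)
        base = mse(tree, oob)
        for name in names:
            shuffled = [f[name] for f, _ in oob]
            rng.shuffle(shuffled)
            permuted = [({**f, name: v}, y) for (f, y), v in zip(oob, shuffled)]
            totals[name] += mse(tree, permuted) - base
        used += 1
    return {name: total / max(used, 1) for name, total in totals.items()}


# Hypothetical data: the outcome depends on x1 only; x2 is pure noise.
data_rng = random.Random(1)
rows = []
for i in range(60):
    x1 = i / 60
    rows.append(({"x1": x1, "x2": data_rng.random()}, 10.0 if x1 > 0.5 else 0.0))
importance = permutation_importance(rows, ["x1", "x2"])
```

A variable that the trees never rely on gets an importance near zero, since shuffling it leaves the out-of-bag predictions unchanged.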
  • 38. Social and individual predictors of adolescent obesity. Model: randomForest package in R. Outcome: BMI age-sex specific percentile change from 9-19. Predictors: Dietary intake and eating behaviors: total kcal, % kcal from fat, % kcal from protein, eats breakfast, eats snack food, eats fast food, eats while watching television, eats with soda on the table*, family eats dinner together*, eats dinner alone*, time to eat dinner*. Behavioral: physical activity. Psychological: body dissatisfaction (EDI)**, bulimia (EDI)**, distrust (EDI)**, drive for thinness (EDI)**, ineffective (EDI)**, interoceptive awareness (EDI)**, maturity fears (EDI)**, perfection (EDI)**, self-worth (Harter), physical appearance (Harter), social acceptance (Harter), athletic competence (Harter), behavioral conduct (Harter), cognitive restructuring (Tobin), express emotions (Tobin), self criticism (Tobin), anxiety (Reynolds)**, perceived stress (Cohen)**, emotional eating index. Social: number of siblings, race (black vs. white), male currently in household, income, education. Parent health: BMI*, self-reported health*, physical activity (self-reported)*, importance of exercise*, depression*.
  • 43. Synthesis 1: Control for confounding: understand how machine learning methods may be useful in certain situations. Limitations: 1. Must be careful not to include colliders 2. Could potentially reduce power. Strengths: 1. Decreasing bias 2. Efficient 3. Reproducible. When: 1. Causal model not well understood 2. Clear priors about causal ordering 3. Large number of potential confounders.
  • 44. Synthesis 2: Subgroup effects: understand how machine learning methods may be useful in determining what subgroups may be most important. Limitations of recursive partitioning based approaches: 1. Not completely hypothesis driven (but inputs are) 2. Limited in cross-validation. Strengths: 1. Hypothesis generating 2. Checking robustness of results 3. Explicit approach for examining heterogeneity 4. Considering subgroups and interactions as fundamental. When: 1. A priori concerns or interest in heterogeneity of effects 2. Unclear priors in literature about where subgroup effects may be most important 3. Results can be replicated in another dataset.
  • 45. Synthesis 3: Relative importance: understand how machine learning methods may be useful in determining the relative importance of a large number of predictors. Limitations: 1. Impacted by differential measurement error 2. Direction of cause ambiguous. Strengths: 1. Broader view of potential causes 2. Considering subgroups and interactions as fundamental 3. Multiple regression trees produce results that are typically more stable than traditional regression results. When: 1. Prior knowledge of large number of potential risk factors 2. Large number of well measured covariates 3. Combine with matching approaches for causal inference.