An introduction to machine learning for social scientists and epidemiologists, focusing on three ways these approaches may improve inference in these fields
1. Lessons for humans from machine
learning: discovering potential drivers
of adolescent obesity
David H. Rehkopf
Associate Professor
Division of Primary Care and Population Health
Department of Medicine
Stanford University School of Medicine
University of Luxembourg
May 20, 2019
2. Broadest motivating question about
exposures of income, employment and
work
[Figure: Coronary heart disease mortality, Massachusetts, 1989-1991, by
census tract poverty (Krieger et al. Am J Public Health 2005)]
5. Objectives of talk
Introduction: understand what is meant by “machine
learning methods”
① Control for confounding: understand how machine
learning methods may be useful in certain situations
② Subgroup effects: understand how machine learning
methods may be useful in determining what subgroups
may be most important
③ Relative importance: understand how machine learning
methods may be useful in determining the relative
importance of a large number of predictors.
Synthesis: understand basic concerns with these
applications and a few guidelines for when they may
complement or be used instead of traditional regression
approaches
6. The best machine learning algorithm
according to Andrew Ng…
[Started the machine learning MOOC at Stanford; over 8
million people have enrolled.]
8. Machine learning in general…
It generally means giving computers the ability to learn
without being explicitly programmed.
Machine learning for social and medical
scientists…
It generally means letting the computer specify the
model (or part of the model) rather than the human.
9. Some problems that the machine may
help to address…
① I believe the true form of the relationship between my dependent and
independent variable is non-linear, but I’m not sure exactly what
the best function is.
② I just want to explain as much variance as possible in my
dependent variable and I have a whole bunch of predictor
variables.
③ I am interested in identifying subgroups of individuals who
vary in the outcome rather than the strength of association
with a particular predictor variable.
④ I want to find the most parsimonious predictor model.
⑤ I have a whole bunch of potential control variables to use
and am not sure what the best model is.
14. Two types of machine learning algorithms
(among very many): recursive partitioning
and random forest
Recursive partitioning (regression tree):
1. The algorithm examines each risk factor and picks the split point within a
variable that best differentiates the outcome of interest, dividing the
sample into two nodes.
2. It repeats this for each of the new nodes.
3. The process continues to expand the number of nodes in the tree as long
as a new split results in two groups whose difference in the outcome
is significant at the p<0.05 level (many other stopping rules are possible).
Random forest:
4. This procedure is repeated for a large number of trees (2500), with each
one created from a subset of the overall data.
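The first step above, picking the split point within one variable, can be sketched in base R. This is a minimal illustration rather than the code behind the talk: it assumes that "best differentiates the outcome" means the lowest within-node sum of squared errors, and the function name best_split and the toy data are hypothetical.

```r
# Find the split point on one numeric predictor that minimises the
# combined within-node sum of squared errors (SSE) of the outcome.
best_split <- function(x, y) {
  sse <- function(v) sum((v - mean(v))^2)
  xs <- sort(unique(x))
  if (length(xs) < 2) return(NULL)             # nothing to split on
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2    # midpoints between values
  total <- vapply(cuts, function(s) {
    sse(y[x <= s]) + sse(y[x > s])             # SSE of the two child nodes
  }, numeric(1))
  list(cut = cuts[which.min(total)], sse = min(total))
}

# Hypothetical toy data: the outcome jumps once x exceeds 5.
x <- 1:10
y <- c(1, 1, 1, 1, 1, 4, 4, 4, 4, 4)
res <- best_split(x, y)
print(res$cut)
```

A full tree would apply the same search to every predictor, keep the best variable and cut, and then repeat within each of the two resulting nodes until a stopping rule is met.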
15. What makes this different?
Algorithm picks the best model from variables selected
by the investigator.
1. Can use a large number of predictor variables.
2. Best model is fit by creating models on internal subsets
of the data and testing fit on a separate part of the data.
3. Automated search for split points allows for best fit to
functional form of relationship between predictor and
outcome.
4. Tree based approach allows visualization of groups that
may differ greatly from overall population effects.
5. Random forest greatly increases robustness of process
by repeating multiple times on further subsets of the
data.
16. 1. Control for confounding: understand
how machine learning methods may be
useful in certain situations
A confounder:
1. Causes exposure
2. Causes outcome
3. Not on causal pathway between exposure and outcome
4. Not caused by exposure and outcome
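A small simulation may make the role of such a variable concrete. The sketch below uses invented data and coefficients (none come from the talk): z causes both the exposure x and the outcome y, and the true effect of x on y is set to 0.5.

```r
set.seed(2019)
n <- 5000
z <- rnorm(n)                       # confounder: causes both x and y
x <- 0.8 * z + rnorm(n)             # exposure
y <- 0.5 * x + 1.0 * z + rnorm(n)   # outcome; true effect of x is 0.5

naive    <- coef(lm(y ~ x))["x"]        # omits the confounder
adjusted <- coef(lm(y ~ x + z))["x"]    # conditions on the confounder
print(round(c(naive = unname(naive), adjusted = unname(adjusted)), 2))
```

Under these assumptions the unadjusted estimate is pulled well above 0.5, while the estimate that conditions on z lands close to it.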
19. Example: Work context and
hypertension
Population: An occupational cohort obtained from 47
United States manufacturing plants (n=10,545).
Predictors: Aggregated plant characteristics: job satisfaction,
feelings toward management, workplace involvement,
work stress.
Outcome: Incident hypertension.
Confounders: We consider individual characteristics of
plant composition that may impact exposure and
outcome: gender, wages, race, grade, employment type.
21. Concern for confounding by county
characteristics - DAG
[Figure: DAG with nodes for county characteristics, plant
characteristics, demographic characteristics of plant, and hypertension]
22. Problem of area level confounders – good
news/bad news.
The bad news is that there are likely to be confounding variables that are not at
the individual level.
The bad news is that, because exposures may be ecological or area level,
individual level covariates may not be useful in blocking all the
effects of more macro-social variables.
The good news is that there are free and comprehensive data sources of
potential area level confounders that are available.
The bad news is that there is not strong evidence for the choice of
which of these measures is most appropriate for a particular
outcome.
The bad news is that the functional form of the dependence of each on an
outcome of interest is primarily unknown.
The good news is that most of this bad news can be addressed by machine
learning methods.
23. County characteristics (68)
'Census 2000 total resident population, 4/1/00', 'Percent population change, 4/1/00 to 7/1/05', 'Births 4/1/00 to 7/1/00', 'Deaths
4/1/00 to 7/1/00', 'Net international migration 4/1/00 to 7/1/00', 'Census 2000 housing units, 4/1/00', 'Percent housing unit
growth, base 4/1/00 to 7/1/05', 'Percent of resident population aged 0 to 14 years, 7/1/05', 'Percent of resident population
aged 15 to 64 years, 7/1/05', 'Percent of resident population aged 65 years and over, 7/1/05', 'Percent of resident population
aged 85 years and over, 7/1/05', 'Sex Ratio, 7/1/05', 'Median age of total resident population, 7/1/05', 'Median age of male
resident population, 7/1/05', 'Median age of female resident population, 7/1/05', 'Percent of resident population white alone,
7/1/05', 'Percent of resident population black alone, 7/1/05', 'Percent of resident population American Indian and Alaska
native alone, 7/1/05', 'Percent of resident population Asian alone, 7/1/05', 'Percent of resident population Native Hawaiian
and other Pacific islander alone, 7/1/05', 'Percent of resident population of two or more races, 7/1/05', 'Percent of resident
population non-Hispanic, 7/1/05', 'Labor force, annual average estimate, 2005', 'Unemployment rate, annual average
estimate, 2005', '2004 ERS Economic Type', '2004 ERS Policy Type: Housing stress', '2004 ERS Policy Type: Low-
education', '2004 ERS Policy Type: Low-employment', '2004 ERS Policy Type: Persistent poverty', '2004 ERS Policy Type:
Population loss', '2004 ERS Policy Type: Nonmetropolitan recreation', '2004 ERS Policy Type: Retirement destination', '2003
ERS Urban Influence Code', '2003 ERS Rural-Urban Continuum Code', 'Population (persons), 2005', 'Per capita personal
income (dollars), 2005', 'Contributions for government social insurance ($1,000s), 2005', 'Contributions for government social
insurance: Employee and self-employed contributions for government social insurance ($1,000s), 2005', 'Contributions for
government social insurance: Employer contributions for government social insurance ($1,000s), 2005', 'Adjustment for
residence ($1,000s), 2005', 'Net earnings by place of residence ($1,000s), 2005', 'Dividends, interest, and rent ($1,000s),
2005', 'Personal current transfer receipts ($1,000s), 2005', 'Per capita net earnings by place of residence, 2005', 'Per capita
personal current transfer receipts, 2005', 'Per capita income maintenance, 2005', 'Per capita unemployment insurance
benefits, 2005', 'Per capita retirement and other benefits, 2005', 'Per capita dividends, interest, and rent, 2005', 'Average
earnings per job, 2005', 'Average wage and salary disbursements per job, 2005', 'Average nonfarm proprietors'' income,
2005', 'Standardized score for mean temperature for January, 1941-1970', 'Standardized score for mean hours of sunlight for
January, 1941-1970', 'Standardized score for mean temperature for July, 1941-1970', 'Standardized score for mean relative
humidity for July, 1941-1970', 'Standardized score for land surface form topography code', 'Standardized score for natural log
of percent water area', 'ERS Natural Amenity Scale', 'ERS Natural Amenity Rank', '2004 IECC (supplement to 2003 IECC)
Climate Zone', '2004 IECC (supplement to 2003 IECC) warm-humid counties', 'Koppen classification corresponding to 2004
IECC Climate Zone', 'America Climate Region', 'Index crime rate (per 100,000 persons), 2004', '2004 presidential election:
Percent of votes for Bush', '2004 presidential election: Percent of votes for Kerry', '2004 presidential election: Percent of
votes for other candidates'
24. Data adaptive search for County Characteristics Predicting
Exposures and Outcome
(variable importance measures from Random Forest
Algorithm)
Relative importance   Job Environment Exposure        Hypertension
1.                    Social insurance expenditures   Income maintenance
2.                    Earnings                        Unemployment benefits
3.                    Population size                 Retirement benefits
4.                    Income maintenance              Earnings
5.                    % white                         Population size
6.                    Retirement benefits             Urbanicity
7.                    % pop age 0-14                  Urban influence
25. Estimates of Association between Workplace Social
Characteristics (baseline and change 2006-2008) and
Incident Hypertension (2006-2008)
Model 1: individual level covariates
Model 2: also including 10 ecological confounders
Entries are Estimate (SE)
                           Baseline                              Change
                           Model 1            Model 2            Model 1            Model 2
Job satisfaction           -0.013 (0.0067)    -0.060 (0.032)     0.002 (0.00093)    0.037 (0.041)
Perception of management   0.0016 (0.011)     -0.051 (0.042)     -0.054** (0.012)   -0.0019 (0.068)
Workplace involvement      -0.0261* (0.0077)  -0.040 (0.024)     -0.0086 (0.0088)   0.0030 (0.0040)
Work stress                0.0047 (0.010)     0.036 (0.034)      0.027* (0.014)     0.032 (0.054)
26. 2. Subgroup effects: understand how
machine learning methods may be
useful in determining what subgroups
may be most important
27. Limitations to traditional approach to
subgroup analysis and interaction in a
regression framework
Traditional regression approach: A priori decide on
subgroups to examine or look for subgroup
effects ad hoc in the data.
Limitations: Literature may not be rich enough to
know what is important, multiple comparison
issues, can only examine a limited set of
covariates.
28. Regression trees
Recursive partitioning is an automated method for creating
a regression tree.
1. Splitting (partitioning)
2. When to stop (terminal nodes)
3. Pruning (optimized by 10-fold cross-validation)
Implemented using the rpart package in R, by Terry Therneau
and Beth Atkinson, based on the work of Leo Breiman
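The three stages can be sketched with rpart on simulated data. This is an illustration under stated assumptions, not the analysis from the talk: the data frame d, its variables and the control values (cp = 0.001, minsplit = 20) are invented for the example, and pruning at the lowest cross-validated error is one common choice among several.

```r
library(rpart)

set.seed(1)  # cross-validation folds are assigned at random
# Simulated stand-in data (hypothetical): y depends on a threshold in x1
n <- 400
d <- data.frame(x1 = runif(n), x2 = runif(n), x3 = runif(n))
d$y <- ifelse(d$x1 > 0.5, 2, 0) + rnorm(n, sd = 0.5)

# 1-2. Grow a deliberately large tree; cp and minsplit govern when to stop
fit <- rpart(y ~ x1 + x2 + x3, data = d, method = "anova",
             control = rpart.control(cp = 0.001, minsplit = 20, xval = 10))

# 3. Prune back to the complexity parameter with the lowest 10-fold
#    cross-validated error (the "xerror" column of the cp table)
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned <- prune(fit, cp = best_cp)
print(pruned)
```

Calling printcp(fit) shows the full table of complexity parameters and cross-validated errors that the pruning step reads from.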
29. NHLBI Growth and Health Study, Research
question.
Question: What subgroups best predict change in BMI,
allowing for interactions and non-linear prediction?
NHLBI Growth and Health Study (1987-1997), age 9-10 at
baseline.
A total of 2379 girls (1213 black and 1166 white)
Detailed social, environmental, economic, psychological,
behavioral and dietary data
30. Social and individual factors predicting
change in BMI, girls, age 9-19
Model: rpart() and ctree() functions in R
Outcome: BMI change from age 9 to 19
Predictors:
Dietary intake and eating behaviors: total kcal, % kcal from fat, % kcal from protein, eats
breakfast, eats snack food, eats fast food, eats while watching television, eats with soda
on the table*, family eats dinner together*, eats dinner alone*, time to eat dinner*
Behavioral: physical activity
Psychological: body dissatisfaction (EDI)**, bulimia (EDI)**, distrust (EDI)**, drive for
thinness (EDI)**, ineffective (EDI)**, interoceptive awareness (EDI)**, maturity fears
(EDI)**, perfection (EDI)**, self-worth (Harter), physical appearance (Harter), social
acceptance (Harter), athletic competence (Harter), behavioral conduct (Harter),
cognitive restructuring (Tobin), express emotions (Tobin), self criticism (Tobin), anxiety
(Reynolds)**, perceived stress (Cohen)**, emotional eating index
Social: number of siblings, race (black vs. white), male currently in household, income,
education
Parent Health: BMI*, self-reported health*, physical activity (self-reported)*, importance of
exercise*, depression*
32. Second application: For whom is BMI
most affected by the EITC policy?
Research question: What is the impact of a large
poverty reduction policy (the Earned Income
Tax Credit) on child BMI?
1. We may care about subgroups of the
population who do not benefit.
2. It may give us insights into why or why not the
policy is effective.
3. We may want to change or supplement the
policy to reach all population groups.
33. Model based recursive partitioning
The “party” package in R, using the mob() function.
Data: NLSY79 Children, ages 2-18, years 1984-2008
Outcome: BMI percentile
Fit a structural part of the model with known confounders
and exposure of interest, and then scan over remaining
covariates to examine if effects differ by subgroups.
library(party)  # provides mob() and mob_control()
# Left of "|": structural part of the model.
# Right of "|": covariates scanned for subgroups in which effects differ.
bmimob1 <- mob(bmipctdif ~ adjeitc + sex | agemos + year + mar + div +
    hrswrk + dep + region + flchild + southbirth + southchild +
    rosen + rotter + pearlin + cesd + black + other + region1 +
    region2 + region3 + urbanR + urbanchildR + eduR + afqtR + momeduR +
    dadeduR + fbR + hisp + adjinc + healthySum + fastfoodSum +
    unhealthySum,
  data = eitcNLSY, control = mob_control(minsplit = 26),
  model = linearModel)
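The idea of fitting a structural model and then scanning covariates for subgroups can be illustrated in base R. This sketch is a simplification and not what mob() does internally: mob() selects partitioning variables with parameter instability tests, whereas the code below simply searches for the cut that most reduces the residual sum of squares. The data, the variable names (exposure, age, other) and the helper scan_covariate are all hypothetical; min_n = 26 only echoes the minsplit value used above.

```r
set.seed(7)
# Simulated stand-in data: the effect of the exposure on the outcome is
# 0 for age <= 10 and 1 for age > 10; "other" is unrelated to the effect.
n <- 1000
d <- data.frame(exposure = rnorm(n),
                age = sample(2:18, n, replace = TRUE),
                other = sample(1:20, n, replace = TRUE))
d$outcome <- ifelse(d$age > 10, 1, 0) * d$exposure + rnorm(n)

# Residual sum of squares of the structural model within a subset
rss <- function(dat) sum(resid(lm(outcome ~ exposure, data = dat))^2)

# For one candidate covariate, try every cut point and keep the one with
# the lowest combined RSS from fitting the model separately on each side
scan_covariate <- function(var, dat, min_n = 26) {
  best <- list(var = var, cut = NA, rss = Inf)
  for (s in head(sort(unique(dat[[var]])), -1)) {
    left <- dat[dat[[var]] <= s, ]
    right <- dat[dat[[var]] > s, ]
    if (nrow(left) < min_n || nrow(right) < min_n) next
    total <- rss(left) + rss(right)
    if (total < best$rss) best <- list(var = var, cut = s, rss = total)
  }
  best
}

results <- lapply(c("age", "other"), scan_covariate, dat = d)
best <- results[[which.min(sapply(results, function(r) r$rss))]]
left_fit <- lm(outcome ~ exposure, data = d[d[[best$var]] <= best$cut, ])
right_fit <- lm(outcome ~ exposure, data = d[d[[best$var]] > best$cut, ])
print(c(coef(left_fit)["exposure"], coef(right_fit)["exposure"]))
```

Comparing the exposure coefficient on each side of the chosen split shows whether the estimated effect differs between the two subgroups, which is the question the slide poses for the EITC.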
35. 3. Relative importance: understand
how machine learning methods may be
useful in determining the relative
importance of a large number of
predictors.
36. Random forest approach
Creates multiple decision trees based on random selection
of observations
Evaluates how good decision trees are in predicting
outcomes among those individuals not used to create the
decision tree
The variable importance measure is the average change in node
impurity when the final model is compared with a model in which the
single variable of interest has been randomized.
Implementation using randomForest package in R, R port
by Andy Liaw and Matthew Weiner based on original
Fortran code by Leo Breiman and Adele Cutler
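The ingredients listed above (trees built on random selections of observations, evaluation on the individuals left out, and randomizing one variable at a time) can be sketched with rpart alone. This is an assumption-laden illustration, not the randomForest implementation: it bags whole trees without the random subset of predictors that a random forest draws at each split, it scores importance as the increase in out-of-bag mean squared error rather than a change in node impurity, and the data and the choice of 50 trees are invented.

```r
library(rpart)

set.seed(42)
# Simulated stand-in data (hypothetical): only x1 drives the outcome
n <- 300
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 2 * d$x1 + rnorm(n)
predictors <- c("x1", "x2", "x3")

n_trees <- 50
increase <- matrix(0, n_trees, length(predictors),
                   dimnames = list(NULL, predictors))

for (b in seq_len(n_trees)) {
  in_bag <- sample(n, replace = TRUE)          # bootstrap sample of rows
  oob <- d[setdiff(seq_len(n), in_bag), ]      # rows this tree never saw
  tree <- rpart(y ~ x1 + x2 + x3, data = d[in_bag, ], method = "anova")
  base_mse <- mean((oob$y - predict(tree, oob))^2)
  for (v in predictors) {
    shuffled <- oob
    shuffled[[v]] <- sample(shuffled[[v]])     # break the link with y
    perm_mse <- mean((shuffled$y - predict(tree, shuffled))^2)
    increase[b, v] <- perm_mse - base_mse      # loss in accuracy for v
  }
}

importance <- sort(colMeans(increase), decreasing = TRUE)
print(round(importance, 2))
```

Variables whose shuffling barely changes the out-of-bag error contribute little to prediction, which is how a ranking of many candidate predictors is obtained.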
38. Social and individual predictors of
adolescent obesity
Model: randomForest package in R
Outcome: BMI age-sex specific percentile change from age 9 to 19
Predictors:
Dietary intake and eating behaviors: total kcal, % kcal from fat, % kcal from protein, eats
breakfast, eats snack food, eats fast food, eats while watching television, eats with soda
on the table*, family eats dinner together*, eats dinner alone*, time to eat dinner*
Behavioral: physical activity
Psychological: body dissatisfaction (EDI)**, bulimia (EDI)**, distrust (EDI)**, drive for
thinness (EDI)**, ineffective (EDI)**, interoceptive awareness (EDI)**, maturity fears
(EDI)**, perfection (EDI)**, self-worth (Harter), physical appearance (Harter), social
acceptance (Harter), athletic competence (Harter), behavioral conduct (Harter),
cognitive restructuring (Tobin), express emotions (Tobin), self criticism (Tobin), anxiety
(Reynolds)**, perceived stress (Cohen)**, emotional eating index
Social: number of siblings, race (black vs. white), male currently in household, income,
education
Parent Health: BMI*, self-reported health*, physical activity (self-reported)*, importance of
exercise*, depression*
43. Synthesis 1: Control for confounding: understand
how machine learning methods may be useful
in certain situations
Limitations:
1. Must be careful not to include colliders
2. Could potentially reduce power
Strengths:
1. Decreasing bias
2. Efficient
3. Reproducible
When:
1. Causal model not well understood
2. Clear priors about causal ordering
3. Large number of potential confounders
44. Synthesis 2: Subgroup effects: understand how
machine learning methods may be useful in
determining what subgroups may be most
important
Limitations of Recursive partitioning based approaches:
1. Not completely hypothesis driven (but inputs are)
2. Limited in cross-validation
Strengths:
1. Hypothesis generating
2. Checking robustness of results
3. Explicit approach for examining heterogeneity
4. Considering subgroup and interactions as fundamental
When:
1. A priori concerns or interest in heterogeneity of effects
2. Unclear priors in literature about where subgroup effects may be
most important.
3. Results can be replicated in another dataset
[Figure: recursive partitioning tree; node plots not reproduced]
Node 1: year (p < 0.001)
  year ≤ 1998 -> Node 2: agemos (p < 0.001)
    agemos ≤ 84 -> Node 3: agemos (p < 0.001)
      agemos ≤ 31 -> Node 4 (n = 427)
      agemos > 31 -> Node 5: black (p = 0.028)
        black ≤ 0 -> Node 6 (n = 1434)
        black > 0 -> Node 7 (n = 633)
    agemos > 84 -> Node 8: agemos (p < 0.001)
      agemos ≤ 149 -> Node 9: afqtR (p = 0.002)
        afqtR ≤ 7282 -> Node 10: year (p = 0.004)
          year ≤ 1994 -> Node 11 (n = 169)
          year > 1994 -> Node 12 (n = 35)
        afqtR > 7282 -> Node 13: black (p < 0.001)
          black ≤ 0 -> Node 14: dadeduR (p = 0.012)
            dadeduR ≤ 4 -> Node 15 (n = 58)
            dadeduR > 4 -> Node 16 (n = 760)
          black > 0 -> Node 17 (n = 428)
      agemos > 149 -> Node 18 (n = 453)
  year > 1998 -> Node 19 (n = 909)
45. Synthesis 3: Relative importance: understand how
machine learning methods may be useful in
determining the relative importance of a large
number of predictors.
Limitations:
1. Impacted by differential measurement error
2. Direction of cause ambiguous
Strengths:
1. Broader view of potential causes
2. Considering subgroup and interactions as fundamental
3. Multiple regression trees yield results that are typically more
stable than traditional regression results
When:
1. Prior knowledge of large number of potential risk factors
2. Large number of well measured covariates
3. Combine with matching approaches for causal inference