Organised by the Bioinformatics group at the BRCMH, IoP, SLaM and Maudsley Digital, this symposium showcased talks regarding the important roles of big data in mental health biomedical research and treatments.
2. Overview
• Epidemiology
• A life course perspective
– 1265 used to be ‘big’
• Administrative data sets
– Birth records
– Driver’s licences
– Mortality
• Geographic and neighbourhood indices
3. Epidemiology is the study of the distribution and
determinants of health-related states or events
(including disease), and the application of this study to
the control of diseases and other health problems.
World Health Organisation
4. Some key child/adolescent/young adult outcomes
• Illicit drug use
• Teen pregnancy
• Overweight/obesity
• Psychiatric disorders
• Tobacco use
• Alcohol use
6. Prenatal Influences
• Family history measures – e.g. parental
alcoholism, parental smoking, parental
obesity, maternal teen childbirth;
• Prenatal exposures – e.g. maternal
smoking during pregnancy; inadequate
prenatal care;
• Perinatal outcomes – e.g. low birth
weight/prematurity; congenital
abnormalities.
7. Major Environmental Exposures during
Childhood
• Parental separation
• Inter-partner conflict, violence
• Childhood physical/sexual abuse or neglect
• Blended families: Step-parent, step-sibling,
half-sibling presence in the home
8. Socioeconomic Influences
• Maternal, paternal educational levels
• Family income, household poverty
• Neighborhood socioeconomic factors:
– median household income
– household poverty
– educational levels
– occupational statuses
10. Neighborhood influences
• Age structure and race/ethnic structure
• Frequency of overweight/obesity
• Frequency of teen childbirth
• Frequency of severe alcoholism
• Frequency of smoking
• Frequency of single parent households
12. Imperfect indicators of many of these variables can be
derived from administrative data-bases:
• Birth/ marriage/ mortality records
• Driver’s licenses
• Crime statistics
• US Census/American Community Survey data.
14. Missouri Birth Record Data
• Mother’s age, educational level, marital status, race/
ethnicity.
• Number of previous births
• If father named, father’s educational level, father’s age.
• Maternal pre-pregnancy weight.
• Maternal smoking during pregnancy (now, by trimester;
also smoking prior to pregnancy).
• Birth weight & prematurity, birth complications
• Maternal state/country of birth, paternal if listed.
15. First Application: MOAFTS
Missouri Adolescent Female Twin Study
• Identify all female-female twin pairs born in Missouri
between July 1st 1975 and June 30th 1985.
• Baseline interviews at ages 13, 15, 17, or 19 in the
mid-1990s, median age 15 (Wave 1)
• Five-year follow-up, at minimum age 18 (Wave 4)
• Eight-year follow-up (Wave 5)
• Ten(+)-year follow-up (Wave 7) – younger cohort
born 1979-1985.
16. Rates of Participation and Predictors of Non-Participation
93.1% of entire cohort participated in at least one
wave of the study
Geocoding data available for:
– 92.7% of AA families
– 94.3% of EA families
• No neighborhood factors predictive of AA family non-
participation
• Neighborhood poverty, family disruption and low
income/household poverty associated with EA family
non-participation
• BUT, differences in participation rates modest
17. Further Possible Applications
• Between-family matched controls for prenatal
tobacco exposure, prematurity, neighborhood
environment etc.
• Full sib pairs discordant for:
– low birth weight/ prematurity
– Maternal smoking during pregnancy
• Maternal half-sib pairs suggesting blended families.
18. Linking family members and data sets
• For birth records, we assign a unique ID for every child,
and mother IDs and, where possible, father IDs, to allow
reconstruction of sibships and pedigrees
– Not always possible: for older births, parental DOB is
not listed.
• We cross-link birth records and driver’s license records
• Allows identification of e.g.:
- DUI parents;
- Sibships discordant for young adult BMI
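The linkage idea above can be sketched in a few lines. This is only an illustration, assuming each birth record has already been given hypothetical `child_id`, `mother_id` and `father_id` fields (with `father_id` set to `None` when no father is named); the field names and the classification labels are not taken from the Missouri data.

```python
from collections import defaultdict
from itertools import combinations

def classify_sib_pairs(births):
    """Group births by mother and label each sibling pair.

    births: iterable of dicts with hypothetical keys
    'child_id', 'mother_id', 'father_id' (None if father not named).
    """
    by_mother = defaultdict(list)
    for rec in births:
        by_mother[rec["mother_id"]].append(rec)

    pairs = []
    for sibs in by_mother.values():
        for a, b in combinations(sibs, 2):
            if a["father_id"] is None or b["father_id"] is None:
                kind = "unknown"          # cannot tell full from half sibs
            elif a["father_id"] == b["father_id"]:
                kind = "full"
            else:
                kind = "maternal_half"    # same mother, different fathers
            pairs.append((a["child_id"], b["child_id"], kind))
    return pairs

births = [
    {"child_id": "C1", "mother_id": "M1", "father_id": "F1"},
    {"child_id": "C2", "mother_id": "M1", "father_id": "F1"},
    {"child_id": "C3", "mother_id": "M1", "father_id": "F2"},
    {"child_id": "C4", "mother_id": "M2", "father_id": None},
]
pairs = classify_sib_pairs(births)
```

Pairs labelled "unknown" correspond to the caveat above: when parental details are missing, the sibship cannot be fully reconstructed.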
19. Missouri Driver’s License/ State ID Data-base
• 1996-2012
• ~ 9-10 million unique individuals, 22 million records
• ~ 400,000 individuals with a drunk-driving
conviction, ~ 70,000 with recurrent convictions.
• NOTE: ~ 1/3 of traffic fatalities are alcohol-related
(used to be ~ ½).
20. Driver’s license/state ID data-base
• Height and weight, at first application for license
– good correlation (~.80) with self-report in young adulthood
• History of drunk-driving and other convictions, license
suspensions, BAC for each DUI conviction.
• Interval censored data on change in residential address
over time (5 year renewal).
• Sometimes updated for moves out of state, or death
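To illustrate what interval censoring means for address changes: a move is only known to have happened somewhere between two consecutive license records. A minimal sketch, assuming each person's records are available as hypothetical `(record_date, address)` pairs:

```python
from datetime import date

def move_intervals(records):
    """Return (earliest, latest, old_address, new_address) for each observed change.

    records: list of (record_date, address) tuples for one person.
    The move happened at some unknown time after `earliest` and on or
    before `latest`; moves made and reversed between records are not seen.
    """
    ordered = sorted(records, key=lambda r: r[0])
    intervals = []
    for (d0, a0), (d1, a1) in zip(ordered, ordered[1:]):
        if a0 != a1:
            intervals.append((d0, d1, a0, a1))
    return intervals

history = [
    (date(2006, 5, 1), "12 oak st"),
    (date(1996, 5, 1), "12 oak st"),
    (date(2001, 5, 1), "12 oak st"),
    (date(2011, 5, 1), "7 elm ave"),
]
moves = move_intervals(history)
```

With a 5 year renewal cycle, each interval is typically about five years wide, which limits how precisely exposure to a given neighborhood can be dated.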
21. Driver’s License Data-base:
Data accuracy?
• Reasonable
• But socioeconomic factors, via access to a good
lawyer, play an important role in DUI convictions
• There may be long delays between incident and
conviction.
• As of 2012, the state still had records with e.g. missing
gender information.
22. Applications?
• Clustering of DUI drivers in neighborhoods
• Mortality in DUI men, women
• Morbidity studies through medical record merges.
• Sampling and case identification. Matched control
identification.
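Matched control identification can be sketched as exact matching within strata. This is an illustrative approach only, not the procedure used in the study; the record keys (`id`, `sex`, `birth_year`) and the function name are hypothetical.

```python
import random
from collections import defaultdict

def match_controls(cases, pool, keys, n_controls=1, seed=0):
    """Pick up to n_controls per case, matched exactly on `keys`.

    Controls are drawn without replacement, so no control is reused.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for person in pool:
        strata[tuple(person[k] for k in keys)].append(person)
    for members in strata.values():
        rng.shuffle(members)              # random order within each stratum

    matches = {}
    for case in cases:
        candidates = strata[tuple(case[k] for k in keys)]
        n_take = min(n_controls, len(candidates))
        matches[case["id"]] = [candidates.pop()["id"] for _ in range(n_take)]
    return matches

cases = [{"id": "case1", "sex": "F", "birth_year": 1980}]
pool = [
    {"id": "p1", "sex": "F", "birth_year": 1980},
    {"id": "p2", "sex": "F", "birth_year": 1980},
    {"id": "p3", "sex": "M", "birth_year": 1980},
]
matched = match_controls(cases, pool, keys=("sex", "birth_year"))
```

In practice the matching keys might also include a neighborhood or zip code variable, depending on the question being asked.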
23. Neighborhood Research
– Availability/ sophistication of Geographic
Information Systems (GIS) has facilitated increasing
research focus on neighborhood/ built
environment and aspects of physical and mental
health
• Exercise/ obesity
• Crime
• Mental health
• Drug use – especially alcohol
24. Neighborhoods and Alcohol
– Aspects of neighborhood environment
(disadvantage, outlet density) have been shown to
be independently associated with measures of
alcohol consumption and alcohol related harm:
• Injury/ accident
• Violence
– Associations assumed to be causal & neighborhood
interventions have been proposed as possible
means to reduce alcohol consumption/ harm
25. Mapping Alcohol Outlets
• We obtained the street address of all alcohol
outlets in Missouri licensed by the Division of
Alcohol and Tobacco Control for each year from
1992-2011
• Approx. 10,500 each year
• Calculated the road network distance between
the study participant’s address of residence and
the nearest alcohol outlet using a GIS.
• Less than 0.1% of outlets could not be geocoded.
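A simplified sketch of the exposure measures, for illustration only. Note the substitution: the study used road network distance computed in a GIS, whereas the code below uses straight-line (great-circle) distance, which will generally understate travel distance. The function names and the coordinates are hypothetical.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles (NOT road network distance)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

def outlet_exposure(home, outlets, radius_miles=3.0):
    """Return (distance to nearest outlet, number of outlets within radius)."""
    distances = [haversine_miles(home[0], home[1], lat, lon) for lat, lon in outlets]
    nearest = min(distances) if distances else None
    count = sum(d <= radius_miles for d in distances)
    return nearest, count

home = (38.63, -90.20)                       # made-up residence
outlets = [(38.63, -90.20), (39.63, -90.20)]  # made-up outlet locations
nearest, count = outlet_exposure(home, outlets)
```

The same helper yields both measures discussed in this section: distance to the nearest outlet and the number of outlets within a chosen radius.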
27. Mean number of alcohol outlets within a 3 mile radius & distance to nearest outlet, by level of alcohol consumption
Alcohol consumption | Mean number of alcohol outlets (3 miles) | Distance (mi) to nearest outlet
1 (Heaviest drinking 25%) | 63.5 | 0.75
2 | 48.6 | 0.98
3 (Lightest drinking 25%) | 39.9 | 1.09
28. Alcohol outlet density and alcohol consumption in breast cancer survivors.
• In a statewide sample of Missouri breast cancer
survivors, 18.4% reported consuming more than one
drink on average per day.
• Women who lived within 3 miles of an outlet had
higher adjusted odds of alcohol use (OR: 2.09; 95% CI:
1.08 – 4.05) than those who lived at least 3 miles
from the nearest outlet.
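As a reading aid for the odds ratio above: if the reported interval is a standard Wald interval (an assumption; the slide does not say how it was computed), it is symmetric around the OR on the log scale, and the standard error of the log odds ratio can be recovered from the limits. A small sketch:

```python
import math

def log_se_from_ci(lower, upper, z=1.96):
    """Standard error of log(OR) implied by a Wald confidence interval."""
    return (math.log(upper) - math.log(lower)) / (2 * z)

def wald_ci(odds_ratio, se, z=1.96):
    """Wald interval: exp(log(OR) -/+ z * SE)."""
    log_or = math.log(odds_ratio)
    return math.exp(log_or - z * se), math.exp(log_or + z * se)

se = log_se_from_ci(1.08, 4.05)        # limits as reported on the slide
ci_low, ci_high = wald_ci(2.09, se)    # rebuild the interval from OR and SE
```

Rebuilding the interval from the reported OR and the implied SE reproduces the published limits to within rounding, which is consistent with the Wald assumption. The lower limit sits close to 1, so the estimate is fairly imprecise.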
29. “Place affects health—neighborhood and
community environments exert their own
health influences, independent of the risk
factors associated with individuals and
households.”
Institute of Medicine, 2011, page 74
30. Population level Geocoding
• Birth records:
– 70,000 a year
– 2.5-3 million total.
• Driver’s license records including same
individual over time: 22 million
• Approx. 9 million unique individuals
– including 400,000 with DUIs.
31. Population level Geocoding
• We aim to cross-link family members (both
parents, children when old enough)
• Address standardization software is important
– the number of addresses is more manageable than
the number of individuals.
• Estimate we can geocode 80-85% of addresses
at high-throughput
– use small sub samples for quality control.
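Why standardization reduces the geocoding workload can be shown with a toy example. This is a deliberately crude sketch, not the software used in the project; the abbreviation table and function names are hypothetical, and real address standardization handles far more cases (unit numbers, directional prefixes, misspellings).

```python
import re

# Tiny illustrative abbreviation table; real tools use much larger ones.
ABBREVIATIONS = {
    "street": "st", "avenue": "ave", "road": "rd",
    "drive": "dr", "boulevard": "blvd",
}

def standardize(address):
    """Lower-case, strip punctuation, and abbreviate common street types."""
    text = re.sub(r"[^\w\s]", " ", address.lower())
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in text.split())

def unique_addresses(records):
    """Map each standardized address to the person IDs recorded there."""
    index = {}
    for person_id, address in records:
        index.setdefault(standardize(address), []).append(person_id)
    return index

records = [
    ("P1", "123 Main Street."),
    ("P2", "123 MAIN ST"),
    ("P3", "45 Oak Avenue"),
]
index = unique_addresses(records)
```

Each distinct standardized address needs to be geocoded only once, and the result is then shared by everyone recorded at that address.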
32. Population level Geocoding
• Ultimate goal is to achieve scalability
– combine across many states birth record data (most use
variations on a standard CDC format) and state ID/driver's
license data
• Even working at the zip code level (a crude
categorization), there is considerable geographic
variation
– E.g., percentage of women with a DUI ranges from 0-14%
– E.g., zip code explains 2.9% of the variance in self-report BMI
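One common way to express "variance explained by zip code" is the ratio of between-group to total sum of squares (eta squared from a one-way ANOVA). The slides do not state which method produced the 2.9% figure, so the sketch below is an assumption about one plausible approach; the data are made up.

```python
from collections import defaultdict

def variance_explained(groups, values):
    """Between-group sum of squares divided by total sum of squares."""
    grand_mean = sum(values) / len(values)
    by_group = defaultdict(list)
    for g, v in zip(groups, values):
        by_group[g].append(v)

    ss_total = sum((v - grand_mean) ** 2 for v in values)
    ss_between = sum(
        len(vs) * (sum(vs) / len(vs) - grand_mean) ** 2
        for vs in by_group.values()
    )
    return ss_between / ss_total

zips = ["A", "A", "B", "B"]       # hypothetical zip code labels
bmi = [22.0, 24.0, 26.0, 28.0]    # hypothetical BMI values
share = variance_explained(zips, bmi)
```

With many small zip codes this ratio is biased upwards, so a mixed model with a zip code random effect may be preferable on real data.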
33. Conclusions
• Traditional epidemiological research can be
enhanced:
– Using administrative data sets to identify informative
cases
– Linking administrative data sets to address important
questions
– Linking epidemiological and administrative data sets to
geographic/ neighbourhood information
34. Challenges
• Data quality (divorce records in Missouri,
although available, are of poor quality, precluding
their use)
• Changes to procedures across time
– e.g. Missouri birth records
• Access
• Computing