Presentation by Sue Ranney of Revolution Analytics at JSM 2012, San Diego CA, Aug 1 2012.
The Centers for Disease Control and Prevention recently issued a report, widely cited in the popular press, on the increased incidence of multiple births in the United States over the last 30 years. To examine this trend, the report extracted twin birth rates from annual birth data, broken down by a variety of maternal characteristics. Our research extends this analysis by applying multivariate analysis to individual-level data obtained from public-use data sets on all births in the United States from 1985 to 2009. We combine the data into a single, multi-year data file (an .xdf file easily accessed by R) containing over 100 million birth records. To analyze the relationship between parental characteristics and multiple-birth pregnancies, we first change the unit of observation from the baby to the pregnancy in order to remove replicated observations of parents of multiples. Then, estimating a logistic regression on all of the remaining observations, we show that the trends in increased multiple births are more strongly associated with the father's age than with the mother's age, and that, controlling for parental ages, the relative incidence of multiple births for black mothers has been declining.
The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R
1. Revolution Confidential
The Rise in Multiple Births in the U.S.:
An Analysis of a Hundred-Million Birth Records with R
Susan I. Ranney, Ph.D.
JSM 2012
2. The Tools
Open Source R
Flexible, powerful, great graphics
Great for prototyping, not very scalable
RevoScaleR (with Revolution R Enterprise)
Efficient file format (.xdf)
Functions for accessing/importing external data sets
(fixed format & delimited text, SPSS, SAS, ODBC)
Very fast, parallelized, distributed analysis functions
(summary stats, crosstabs/cubes, linear models,
kmeans clustering, logistic regression, glm)
3. The Data
Public-use data sets containing information
on all births in the United States for each
year from 1985 to 2009 are available to
download:
http://www.cdc.gov/nchs/data_access/Vitalstatsonline.htm
“These natality files are gigantic; they’re
approximately 3.1 GB uncompressed.
That’s a little larger than R can easily
process” – Joseph Adler, R in a Nutshell
4. The U.S. Birth Data (continued)
Data for each year are contained in compressed,
fixed-format text files
Typically 3 to 4 million records per file
Variables and structure of the files sometimes change
from year to year, with birth certificates revised in 1989
and 2003. Warnings:
NOTE: THE RECORD LAYOUT OF THIS FILE HAS
CHANGED SUBSTANTIALLY. USERS
SHOULD READ THIS DOCUMENT
CAREFULLY.
Reporting can differ state-to-state for any given year
5. The Process
Basic “life cycle of analysis”
Import and combine data
25 years
Over 100 million obs.
Check and clean data
Basic variable summaries
Big data logistic/glm regressions
All on my laptop
Option to distribute computations
to cluster
6. The Question
CDC Report in Jan. 2012
7. The Question
What accounts for the increase in multiple
births in the United States?
Can we separate out effects of mother’s and
father’s ages, race/ethnicity?
Examine time trends by sub-group, assumed
to be associated with fertility treatment
CDC finding: The older age of women at
childbirth accounts for only about 1/3 of the rise
in twinning over 30 years (but this mixes in the
increased rate of “twinning” for older women)
9. Example of Differences for Different Years
To create common variables across years, use
common names and new factor levels for ‘colInfo’ in
rxImport function. For example:
For 1985:
SEX = list(type="factor", start=35, width=1,
levels=c("1", "2"),
newLevels = c("Male", "Female"),
description = "Sex of Infant")
For 2003:
SEX = list(type="factor", start=436, width=1,
levels=c("M", "F"),
newLevels = c("Male", "Female"),
description = "Sex of Infant")
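The same harmonization idea can be sketched in base R, without RevoScaleR. This is only an illustration: the two-record samples, the column positions, and the helper `read_sex` are hypothetical and do not reflect the real natality file layouts.

```r
# Hypothetical fixed-format records; the real files are far wider.
rec1985 <- c("A1", "B2")    # sex coded "1"/"2" in column 2 (made-up position)
rec2003 <- c("XXM", "YYF")  # sex coded "M"/"F" in column 3 (made-up position)

# Pull one single-character field and map its year-specific codes
# onto a common set of labels.
read_sex <- function(lines, start, levels, new_levels) {
  raw <- substr(lines, start, start)
  factor(raw, levels = levels, labels = new_levels)
}

sex85 <- read_sex(rec1985, 2, c("1", "2"), c("Male", "Female"))
sex03 <- read_sex(rec2003, 3, c("M", "F"), c("Male", "Female"))

# Both years now share one factor definition and can be stacked.
sex_all <- factor(c(as.character(sex85), as.character(sex03)),
                  levels = c("Male", "Female"))
table(sex_all)
```

Keeping the position and the raw codes as per-year parameters means only the small layout description changes from year to year, while the output variable stays the same.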
10. Creating Transformed Variables on Import
Use standard R syntax to create transformed
variables.
For example, create a factor for Mom’s Age
using the imported MAGER integer variable:
MomAgeR7 = cut(MAGER,
breaks =c(0, 19, 24, 29, 34, 39, 98, 99),
labels = c("Under 20", "20-24", "25-29",
"30-34", "35-39", "Over 39", "Missing"))
Create binary variable for “IsMultiple”
IsMultiple = DPLURAL_REC != 'Single'
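To see how these two transforms behave at the boundaries, here is a small base-R sketch on a made-up data frame. The five records and the plurality labels "Twin" and "Triplet" are assumptions for illustration; only `MAGER`, `DPLURAL_REC`, the breaks, and the labels come from the slide.

```r
# Hypothetical records; MAGER and DPLURAL_REC mimic the imported variables.
births <- data.frame(
  MAGER = c(18, 19, 20, 40, 99),
  DPLURAL_REC = factor(c("Single", "Twin", "Single", "Triplet", "Single"))
)

# cut() uses right-closed intervals by default, so 19 falls in (0, 19]
# and the code 99 falls in (98, 99], which is labelled "Missing".
births$MomAgeR7 <- cut(births$MAGER,
  breaks = c(0, 19, 24, 29, 34, 39, 98, 99),
  labels = c("Under 20", "20-24", "25-29",
             "30-34", "35-39", "Over 39", "Missing"))

# Comparing a factor with a string gives a logical vector.
births$IsMultiple <- births$DPLURAL_REC != "Single"

births
```

Eight break points define seven intervals, which is why the call needs exactly seven labels.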
11. Steps for Complete Import
Lists for column information and transforms are
created for 3 base years: 1985, 1989, 2003
when there were very large changes in the
structure of the input files
Changes to these lists are made where
appropriate for in-between years
A test script is run, importing only 1000
observations per year for a subset of years
Full script is run, importing each year, sorting
according to key demographic characteristics,
and appending to a master .xdf file
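The test-then-full-run pattern described above might look roughly like this in base R. It is a sketch under several assumptions: tiny CSV files stand in for the fixed-format yearly files, a plain text file stands in for the master .xdf file, and `import_year` is a hypothetical helper.

```r
# Hypothetical helper: read one year's file, optionally only the first n rows.
import_year <- function(path, n = -1) {
  read.csv(path, nrows = n, stringsAsFactors = FALSE)
}

# Build two tiny stand-in yearly files in a temporary directory.
tmp <- tempfile("births")
dir.create(tmp)
for (yr in c(1985, 1986)) {
  write.csv(data.frame(DOB_YY = yr, MAGER = c(31, 22, 27)),
            file.path(tmp, paste0(yr, ".csv")), row.names = FALSE)
}
files <- list.files(tmp, full.names = TRUE)

# Test pass: one row per year to check that the layout is read correctly.
test <- do.call(rbind, lapply(files, import_year, n = 1))

# Full pass: sort each year by a key variable, then append to the master.
master <- file.path(tmp, "master.txt")
for (f in files) {
  d <- import_year(f)
  d <- d[order(d$MAGER), ]
  first <- !file.exists(master)   # write the header only once
  write.table(d, master, sep = ",", row.names = FALSE,
              col.names = first, append = !first)
}

all_births <- read.csv(master)
```

Running the cheap test pass first catches layout mistakes before the long full import is started.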
13. Examining Basic Information
Basic file information
> rxGetInfo(birthAll)
File name:
C:RevolutionDataCDCBirthUS85to09S.xdf
Number of observations: 100672041
Number of variables: 50
Number of blocks: 215
Use rxSummary to compute summary statistics
for continuous data and category counts for
each of the factors (about 4 minutes on my
laptop)
rxSummary(~., data=birthAll, blocksPerRead = 10)
14. Example of Summary Statistics
MomAgeR7    Counts      DadAgeR8    Counts
Under 20    11918891    Under 20     3226602
20-24       25975642    20-24       15304803
25-29       28701398    25-29       23805056
30-34       22341530    30-34       23179418
35-39        9788753    35-39       13289015
Over 39      1945827    40-44        4984146
Missing            0    Over 44      2140207
                        Missing     14742794
15. Histograms by Year
Easily check for basic errors in data import
(e.g. wrong position in file) by creating
histograms by year – very fast (just seconds
on my laptop)
Example: Distribution of mother’s age by
year. Use F() to have the integer year treated
as a factor.
rxHistogram(~MAGER| F(DOB_YY),
data=birthAll, blocksPerRead = 10,
layout=c(5,5))
16. Age of Mother Over Time
17. Drill Down and Extract Subsamples
Take a quick look at “older” fathers:
rxSummary(~F(UFAGECOMB),
data=birthAll,
blocksPerRead = 10)
Dad’s Age   Counts
80             141
81             108
82              81
83              74
84              56
85              43
86              43
87              26
88              27
89            3327
What’s going on with 89-year-old Dads? Extract a data frame:
dad89 <- rxDataStep(
inData = birthAll,
rowSelection = UFAGECOMB == 89,
varsToKeep = c("DOB_YY", "MAGER",
"MAR", "STATENAT", "FRACEREC"),
blocksPerRead = 10)
18. Year and State for 89-Year-Old Fathers
rxCube(~F(DOB_YY):STATENAT,
data=dad89, removeZeroCounts=TRUE)
F_DOB_YY STATENAT Counts
1990 California 1
1999 California 1
2000 California 1
1996 Hawaii 1
1997 Louisiana 1
1986 New Jersey 1
1995 New Jersey 1
1996 Ohio 1
1989 Texas 3316
1990 Texas 1
2001 Texas 1
1985 Washington 1
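The drill-down shown on these two slides (count by age, extract the odd value, cross-tabulate by year and state) can be imitated in base R. The seven records below are invented for illustration and are not taken from the birth files. Only the variable names follow the slides.

```r
# Hypothetical miniature of the birth records; values are invented.
births <- data.frame(
  UFAGECOMB = c(88, 89, 89, 89, 89, 87, 89),
  DOB_YY    = c(1990, 1989, 1989, 1989, 1990, 1996, 1989),
  STATENAT  = c("Ohio", "Texas", "Texas", "Texas", "Texas", "Hawaii", "Texas")
)

# Counts per age: a spike at a single value deserves a closer look.
age_counts <- table(births$UFAGECOMB)

# Extract the suspicious rows and cross-tabulate year by state,
# dropping empty cells, in the spirit of removeZeroCounts = TRUE.
dad89 <- subset(births, UFAGECOMB == 89, select = c(DOB_YY, STATENAT))
cube <- as.data.frame(table(DOB_YY = dad89$DOB_YY,
                            STATENAT = dad89$STATENAT),
                      responseName = "Counts")
cube <- cube[cube$Counts > 0, ]
cube
```

In the real table above, 3316 of the 3327 records with an 89-year-old father come from Texas in 1989, which is the kind of concentration this cross-tabulation is meant to expose.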
20. Unit of Observation: Delivery
Appropriate unit of observation is the
delivery (resulting in 1 or more live births)
rather than the individual birth.
Use 1/Plurality as probability weight
Alternative: Look at nearby records to
compute the “Reported Delivery Birth Order”
(RDPO), then select on only the 1st born in
the delivery for the analysis
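A small worked example may help show why 1/Plurality works as a weight. The six birth records below are hypothetical (one singleton, one pair of twins, one set of triplets). The column name `PluralWeight` matches the weight variable named in the regression output.

```r
# Hypothetical birth-level records: six babies from three deliveries.
births <- data.frame(
  Plurality = c(1, 2, 2, 3, 3, 3)
)
births$IsMultiple   <- births$Plurality > 1
births$PluralWeight <- 1 / births$Plurality

# Unweighted: share of babies who are multiples.
baby_rate <- mean(births$IsMultiple)

# Weighted: the babies of each delivery sum to weight 1, so the total
# weight counts deliveries and the weighted mean is the share of
# deliveries that are multiple.
n_deliveries  <- sum(births$PluralWeight)
delivery_rate <- weighted.mean(births$IsMultiple, births$PluralWeight)

c(baby_rate = baby_rate, n_deliveries = n_deliveries,
  delivery_rate = delivery_rate)
```

Here five of six babies are multiples, but only two of three deliveries are, so counting babies overstates the rate that the delivery-level question asks about.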
25. Logistic Regression
Logistic Regression Results for: IsMultiple ~ DadAgeR8 + MomAgeR7 +
FRaceEthn + MRaceEthn +
DadAgeR8:FRaceEthn:MNTHS_SINCE_JAN85 +
MomAgeR7:MRaceEthn:MNTHS_SINCE_JAN85
File name: C:RevolutionDataCDCBirthUS85to09R.xdf
Probability weights: PluralWeight
Dependent variable(s): IsMultiple
Total independent variables: 118 (Including number dropped: 12)
Number of valid observations: 100672041
Number of missing observations: 0
-2*LogLikelihood: 14555447.8514
(Residual deviance on 100671935 degrees of freedom)
About 6 minutes on my laptop
26. Counts of Deliveries by All Demographic Combinations by Year
rxCube(~DadAgeR8:MomAgeR7:FRaceEthn:MRaceEthn:F(DOB_YY),
data = birthAllR, pweights = "PluralWeight",
blocksPerRead = 10)
Under 10 seconds to compute on my laptop
Resulting data frame has 50,400 rows
representing all the demographic
combinations for each of the 25 years
44,661 have counts <= 1000
Provides input data for predictions and
weights for aggregating predictions
27. Predict and Aggregate by Year
predOut <- rxPredict(logitObj,
data = catAllDF)
Create predictions for each detailed
demographic group
Aggregate for each year using population
percentages for each detailed group for each
year
Compare with actuals
Perform “What if?” scenarios
Scenario 1: Only change in demographics; no
“fertility-treatment” time trend
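The predict-then-aggregate step and the first scenario can be illustrated with base R's `glm` on a toy cell-level table. Everything here is an assumption for illustration: the two age groups, the three years, and all counts are invented, and the model is far smaller than the one in the talk.

```r
# Hypothetical cell-level data: deliveries and multiples by age group and year.
cells <- expand.grid(MomAge = c("Under 35", "35 plus"), Year = 0:2)
cells$MomAge <- factor(cells$MomAge, levels = c("Under 35", "35 plus"))
cells$N    <- c(900, 100, 800, 200, 700, 300)  # deliveries per cell
cells$Mult <- c(18, 3, 17, 7, 16, 12)          # multiple deliveries per cell

# Logistic regression with an age effect and an age-specific time trend.
fit <- glm(cbind(Mult, N - Mult) ~ MomAge + MomAge:Year,
           family = binomial, data = cells)

# Predicted probability for each cell, then a population-weighted
# average per year.
cells$pred <- predict(fit, newdata = cells, type = "response")
by_year <- sapply(split(cells, cells$Year),
                  function(d) weighted.mean(d$pred, d$N))

# "What if?" scenario: keep the demographic shift but freeze the
# time trend at its starting value.
frozen <- transform(cells, Year = 0)
cells$pred0 <- predict(fit, newdata = frozen, type = "response")
by_year0 <- sapply(split(cells, cells$Year),
                   function(d) weighted.mean(d$pred0, d$N))

rbind(with_trend = by_year, demographics_only = by_year0)
```

The gap between the two rows in later years is the part of the predicted rise that the model attributes to the time trend rather than to the changing age mix.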
29. Drill Down: Look at Predictions for Specific Groups
Select out predicted values from prediction
data frame for detailed groups:
Age of mother and father
Race/ethnicity of mother and father
32. Conclusions from Logistic Regression
Dads matter: Use of fertility treatment is
associated with older dads as well as older
moms
Hispanics show a relatively small increase in
multiples for both younger and older couples
Asian pattern is similar to that of whites, but lower for all
age groups
Blacks show a similar increase in multiples for both
younger and older couples
Other Factors Related to Multiples? (Diet, high BMI,
other genetic factors)
34. GLM Tweedie Model: How Many Babies?
When power parameter is between 1 and 2,
Tweedie is a compound Poisson distribution
Appropriate for positive data that
also include a ‘clump’ of exact zeros
Dependent variable: Number of additional
babies (Plurality – 1)
Same independent variables
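The "clump of exact zeros" property can be seen by simulating a compound Poisson-gamma variable in base R: a Poisson number of events, each contributing a gamma-distributed amount. This is a generic illustration of the distribution, not the model fitted in the talk, and the rate, shape, and scale values are arbitrary assumptions.

```r
set.seed(1)  # arbitrary seed for reproducibility
n_obs <- 1000

# Number of "events" per observation (hypothetical rate).
n_events <- rpois(n_obs, lambda = 0.5)

# The total is a sum of n_events gamma draws. A sum of k gammas with a
# common scale is gamma with shape k * shape, and zero events give an
# exact zero. pmax() only keeps the shape valid where the draw is unused.
y <- ifelse(n_events > 0,
            rgamma(n_obs, shape = 2 * pmax(n_events, 1), scale = 1),
            0)

# Share of exact zeros, and a summary of the strictly positive part.
mean(y == 0)
summary(y[y > 0])
```

The result mixes a point mass at zero with a continuous positive part, which is the shape described on the slide.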
36. Further Research
Data management
Further cleaning of data
Import more variables
Import more years (additional use of weights)
Multiple Births Analysis
More variables (e.g., proxies for fertility treatment
trends)
Investigation of sub-groups (e.g., young blacks)
Improved computation of number of births per
delivery
Other analyses with birth data
37. Summary of Approach
“Unpredictability” of multiple births requires
large data set to have the power to capture
effects
Significant challenges in importing and cleaning
the data – using R and .xdf files makes it
possible
Even with a huge data set, “cells” of tables looking
at multiple factors can be small
Using a combined .xdf file, we can use
individual-level analysis to examine conditional effects of
a variety of factors
38. References
Martin JA, Hamilton BE, Osterman MJK. Three decades of twin births in the United States, 1980-2009. NCHS data brief, no. 80. Hyattsville, MD: National Center for Health Statistics. 2012.
Blondel B, Kaminski M. Trends in the occurrence, determinants, and consequences of multiple births. Seminars in Perinatology. 2002; 26(4):239-49.
Vahratian A. Utilization of fertility-related services in the United States. Fertil Steril. 2008 October; 90(4):1317-1319.
39. Thank you!
R-Core Team
R Package Developers
R Community
Revolution R Enterprise Customers and Beta
Testers
Colleagues at Revolution Analytics
Contact:
sue@revolutionanalytics.com