The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R



Presentation by Sue Ranney of Revolution Analytics at JSM 2012, San Diego CA, Aug 1 2012.

The Centers for Disease Control and Prevention recently issued a report, widely cited in the popular press, on the increased incidence of multiple births in the United States over the last 30 years. In that report, twin birth rates were extracted from annual birth data and broken down by a variety of maternal characteristics in order to examine the trend. Our research extends this analysis by applying multivariate analysis to individual-level data obtained from public-use data sets on all births in the United States from 1985 to 2009. We combine the data into a single, multi-year data file (an .xdf file easily accessed by R) containing over 100 million birth records. To analyze the relationship between parental characteristics and multiple-birth pregnancies, we first change the unit of observation from the baby to the pregnancy in order to remove replicated observations of parents of multiples. Then, estimating a logistic regression on all of the remaining observations, we show that the trend toward more multiple births is more strongly associated with the father's age than with the mother's age, and that, controlling for ages, the relative incidence of multiple births for black mothers has been declining.

Published in: Technology


  1. Revolution Confidential. The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R. Susan I. Ranney, Ph.D. JSM 2012
  2. The Tools
     • Open Source R
         • Flexible, powerful, great graphics
         • Great for prototyping, not very scalable
     • RevoScaleR (with Revolution R Enterprise)
         • Efficient file format (.xdf)
         • Functions for accessing/importing external data sets (fixed-format and delimited text, SPSS, SAS, ODBC)
         • Very fast, parallelized, distributed analysis functions (summary statistics, crosstabs/cubes, linear models, k-means clustering, logistic regression, glm)
  3. The Data
     • Public-use data sets containing information on all births in the United States for each year from 1985 to 2009 are available to download.
     • "These natality files are gigantic; they're approximately 3.1 GB uncompressed. That's a little larger than R can easily process" (Joseph Adler, R in a Nutshell)
  4. The U.S. Birth Data (continued)
     • Data for each year are contained in compressed, fixed-format text files
     • Typically 3 to 4 million records per file
     • Variables and structure of the files sometimes change from year to year, with birth certificates revised in 1989 and 2003
     • Warnings in the documentation: "NOTE: THE RECORD LAYOUT OF THIS FILE HAS CHANGED SUBSTANTIALLY. USERS SHOULD READ THIS DOCUMENT CAREFULLY."
     • Reporting can differ state-to-state for any given year
  5. The Process
     • Basic "life cycle of analysis"
         • Import and combine data (25 years, over 100 million observations)
         • Check and clean data
         • Basic variable summaries
         • Big data logistic/glm regressions
     • All on my laptop
     • Option to distribute computations to a cluster
  6. The Question: CDC Report in Jan. 2012
  7. The Question
     • What accounts for the increase in multiple births in the United States?
     • Can we separate out effects of mother's and father's ages, race/ethnicity?
     • Examine time trends by sub-group, assumed to be associated with fertility treatment
     • CDC finding: The older age of women at childbirth accounts for only about 1/3 of the rise in twinning over 30 years (but this mixes in the increased rate of "twinning" for older women)
  8. Importing the U.S. Birth Data for Use in R
  9. Example of Differences for Different Years
     To create common variables across years, use common names and new factor levels for 'colInfo' in the rxImport function. For example:
     For 1985:
         SEX = list(type="factor", start=35, width=1, levels=c("1", "2"), newLevels = c("Male", "Female"), description = "Sex of Infant")
     For 2003:
         SEX = list(type="factor", start=436, width=1, levels=c("M", "F"), newLevels = c("Male", "Female"), description = "Sex of Infant")
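The recoding idea on this slide can be sketched in base R, without RevoScaleR. This is only an illustration: the helper name `recode_sex` and the toy input vectors are hypothetical, and only the raw codes ("1"/"2" for 1985, "M"/"F" for 2003) come from the slide.

```r
# Hypothetical helper: map year-specific raw codes onto common labels.
recode_sex <- function(raw, levels) {
  factor(raw, levels = levels, labels = c("Male", "Female"))
}

sex_1985 <- recode_sex(c("1", "2", "2"), levels = c("1", "2"))  # toy values
sex_2003 <- recode_sex(c("F", "M"), levels = c("M", "F"))       # toy values

# Both years now share one set of levels, so they can be combined safely.
combined <- factor(c(as.character(sex_1985), as.character(sex_2003)),
                   levels = c("Male", "Female"))
table(combined)
```

Giving every year the same labels at import time is what allows the yearly files to be appended into a single data set later.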
  10. Creating Transformed Variables on Import
     Use standard R syntax to create transformed variables. For example, create a factor for Mom's Age using the imported MAGER integer variable:
         MomAgeR7 = cut(MAGER, breaks = c(0, 19, 24, 29, 34, 39, 98, 99), labels = c("Under 20", "20-24", "25-29", "30-34", "35-39", "Over 39", "Missing"))
     Create a binary variable for "IsMultiple":
         IsMultiple = DPLURAL_REC != "Single"
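To see how these two transforms behave, here is a small base R sketch on toy values. The ages and the plurality levels other than "Single" are made up for illustration; the breaks and labels are the ones on the slide.

```r
# Toy ages (illustrative only); 99 is the missing-value code used on the slide.
MAGER <- c(17, 19, 20, 24, 29, 34, 39, 45, 99)

# cut() uses right-closed intervals by default: (0,19], (19,24], ..., (98,99]
MomAgeR7 <- cut(MAGER,
                breaks = c(0, 19, 24, 29, 34, 39, 98, 99),
                labels = c("Under 20", "20-24", "25-29", "30-34",
                           "35-39", "Over 39", "Missing"))

# Toy plurality recode; the level names other than "Single" are assumed.
DPLURAL_REC <- factor(c("Single", "Twin", "Single"),
                      levels = c("Single", "Twin", "Triplet or higher"))
IsMultiple <- DPLURAL_REC != "Single"

data.frame(MAGER, MomAgeR7)
IsMultiple
```

Because the intervals are closed on the right, age 19 falls in "Under 20" and age 20 in "20-24", and the code 99 lands in its own "Missing" level.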
  11. Steps for Complete Import
     • Lists for column information and transforms are created for 3 base years (1985, 1989, 2003), when there were very large changes in the structure of the input files
     • Changes to these lists are made where appropriate for in-between years
     • A test script is run, importing only 1000 observations per year for a subset of years
     • The full script is run, importing each year, sorting according to key demographic characteristics, and appending to a master .xdf file
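One way to keep a base-year layout and adjust it for in-between years is base R's `modifyList`, sketched below. This is an assumption about how such lists could be managed, not the presenter's script; the 1987 start column is invented purely for illustration.

```r
# Base layout for 1985, using the SEX entry shown on slide 9.
layout_1985 <- list(
  SEX = list(type = "factor", start = 35, width = 1,
             levels = c("1", "2"), newLevels = c("Male", "Female"))
)

# Hypothetical in-between year: same layout except for an assumed start column.
layout_1987 <- modifyList(layout_1985, list(SEX = list(start = 40)))

layout_1987$SEX$start   # changed field
layout_1987$SEX$levels  # inherited unchanged from the base year
```

`modifyList` merges nested lists recursively, so only the fields that differ need to be restated for each year.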
  12. Examining and Cleaning the Big Data File
  13. Examining Basic Information
     Basic file information:
         > rxGetInfo(birthAll)
         File name: C:RevolutionDataCDCBirthUS85to09S.xdf
         Number of observations: 100672041
         Number of variables: 50
         Number of blocks: 215
     Use rxSummary to compute summary statistics for continuous data and category counts for each of the factors (about 4 minutes on my laptop):
         rxSummary(~., data=birthAll, blocksPerRead = 10)
  14. Example of Summary Statistics
         MomAgeR7   Counts        DadAgeR8   Counts
         Under 20   11918891      Under 20    3226602
         20-24      25975642      20-24      15304803
         25-29      28701398      25-29      23805056
         30-34      22341530      30-34      23179418
         35-39       9788753      35-39      13289015
         Over 39     1945827      40-44       4984146
         Missing           0      Over 44     2140207
                                  Missing    14742794
  15. Histograms by Year
     Easily check for basic errors in data import (e.g. wrong position in file) by creating histograms by year. This is very fast (just seconds on my laptop).
     Example: Distribution of mother's age by year. Use F() to have the integer year treated as a factor.
         rxHistogram(~MAGER | F(DOB_YY), data=birthAll, blocksPerRead = 10, layout=c(5,5))
  16. Age of Mother Over Time (figure)
  17. Drill Down and Extract Subsamples
     Take a quick look at "older" fathers:
         rxSummary(~F(UFAGECOMB), data=birthAll, blocksPerRead = 10)

         Dad's Age   Counts
         80             141
         81             108
         82              81
         83              74
         84              56
         85              43
         86              43
         87              26
         88              27
         89            3327
     What's going on with 89-year-old dads? Extract a data frame:
         dad89 <- rxDataStep(inData = birthAll, rowSelection = UFAGECOMB == 89, varsToKeep = c("DOB_YY", "MAGER", "MAR", "STATENAT", "FRACEREC"), blocksPerRead = 10)
  18. Year and State for 89-Year-Old Fathers
         rxCube(~F(DOB_YY):STATENAT, data=dad89, removeZeroCounts=TRUE)

         F_DOB_YY   STATENAT     Counts
         1990       California        1
         1999       California        1
         2000       California        1
         1996       Hawaii            1
         1997       Louisiana         1
         1986       New Jersey        1
         1995       New Jersey        1
         1996       Ohio              1
         1989       Texas          3316
         1990       Texas             1
         2001       Texas             1
         1985       Washington        1
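The same drill-down pattern (filter on a suspicious value, then cross-tabulate the subset) can be sketched in base R on an in-memory data frame. The data frame below is a made-up toy, not the birth data, and `subset`/`table` stand in for rxDataStep/rxCube.

```r
# Toy data frame standing in for the birth file (values are made up).
births <- data.frame(
  DOB_YY    = c(1989, 1989, 1989, 1990, 1996, 2001),
  STATENAT  = c("Texas", "Texas", "Texas", "California", "Ohio", "Texas"),
  UFAGECOMB = c(89, 89, 89, 89, 34, 89)
)

# Keep only records with a reported father's age of 89.
dad89 <- subset(births, UFAGECOMB == 89, select = c(DOB_YY, STATENAT))

# Cross-tabulate year by state and drop empty cells,
# similar in spirit to removeZeroCounts = TRUE.
counts <- as.data.frame(table(dad89$DOB_YY, dad89$STATENAT),
                        stringsAsFactors = FALSE)
names(counts) <- c("DOB_YY", "STATENAT", "Counts")
counts <- counts[counts$Counts > 0, ]
counts
```

A single year-state cell that dominates the counts, as Texas 1989 does on the slide, suggests a coding artifact in one state's reporting for that year rather than real ages.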
  19. Demographics of Multiple Births 1985-2009
  20. Unit of Observation: Delivery
     • The appropriate unit of observation is the delivery (resulting in 1 or more live births) rather than the individual birth
     • Use 1/Plurality as a probability weight
     • Alternative: Look at nearby records to compute the "Reported Delivery Birth Order" (RDPO), then select only the first-born in the delivery for the analysis
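A short worked example shows why 1/Plurality turns baby-level records into delivery-level quantities. The six toy records below are illustrative only.

```r
# Toy birth records: one singleton, one set of twins, one set of triplets.
plurality <- c(1, 2, 2, 3, 3, 3)
is_multiple <- plurality > 1

# Each baby carries weight 1/Plurality, so every delivery sums to 1.
plural_weight <- 1 / plurality

n_deliveries     <- sum(plural_weight)                         # 3 deliveries
share_births     <- mean(is_multiple)                          # 5 of 6 babies
share_deliveries <- weighted.mean(is_multiple, plural_weight)  # 2 of 3 deliveries
```

Without the weights, multiples are over-represented (5/6 of babies), whereas with them the rate refers to deliveries (2/3), which is the unit the parents' characteristics belong to.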
  24. Multiple Births: Logistic Regression & Predictions
  25. Logistic Regression
         Logistic Regression Results for: IsMultiple ~ DadAgeR8 + MomAgeR7 + FRaceEthn + MRaceEthn + DadAgeR8:FRaceEthn:MNTHS_SINCE_JAN85 + MomAgeR7:MRaceEthn:MNTHS_SINCE_JAN85
         File name: C:RevolutionDataCDCBirthUS85to09R.xdf
         Probability weights: PluralWeight
         Dependent variable(s): IsMultiple
         Total independent variables: 118 (Including number dropped: 12)
         Number of valid observations: 100672041
         Number of missing observations: 0
         -2*LogLikelihood: 14555447.8514 (Residual deviance on 100671935 degrees of freedom)
     About 6 minutes on my laptop
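As a rough illustration of fitting a logistic regression with 1/Plurality weights, the sketch below uses base R's `glm` on simulated data. It is not the rxLogit call behind the output above: the two binary predictors, the simulation coefficients, and the sample size are all assumptions chosen for the example.

```r
set.seed(42)
n_deliveries <- 5000

# Simulated deliveries with two hypothetical binary predictors.
deliveries <- data.frame(
  DadOlder = rbinom(n_deliveries, 1, 0.3),
  MomOlder = rbinom(n_deliveries, 1, 0.3)
)
p <- plogis(-3 + 0.4 * deliveries$DadOlder + 0.6 * deliveries$MomOlder)
deliveries$Plurality <- 1 + rbinom(n_deliveries, 1, p)  # 1 = single, 2 = twins

# One record per baby, as in the birth files.
babies <- deliveries[rep(seq_len(n_deliveries), deliveries$Plurality), ]
babies$IsMultiple   <- as.integer(babies$Plurality > 1)
babies$PluralWeight <- 1 / babies$Plurality

# quasibinomial() avoids the non-integer-weights warning from binomial().
fit <- glm(IsMultiple ~ DadOlder + MomOlder, data = babies,
           family = quasibinomial(), weights = PluralWeight)
coef(fit)
```

In this simulation the weighted baby-level fit gives the same coefficients as an unweighted fit on one row per delivery, which is the point of the probability weight.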
  26. Counts of Deliveries by All Demographic Combinations by Year
         rxCube(~DadAgeR8:MomAgeR7:FRaceEthn:MRaceEthn:F(DOB_YY), data = birthAllR, pweights = "PluralWeight", blocksPerRead = 10)
     • Under 10 seconds to compute on my laptop
     • Resulting data frame has 50,400 rows representing all the demographic combinations for each of the 25 years
     • 44,661 have counts <= 1000
     • Provides input data for predictions and weights for aggregating predictions
  27. Predict and Aggregate by Year
         predOut <- rxPredict(logitObj, data = catAllDF)
     • Create predictions for each detailed demographic group
     • Aggregate for each year using population percentages for each detailed group for each year
     • Compare with actuals
     • Perform "What if?" scenarios
         • Scenario 1: Only change in demographics; no "fertility-treatment" time trend
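The aggregation step amounts to a population-weighted average of cell-level predictions. The base R sketch below uses a made-up four-row table (two groups, two years) to show the arithmetic; none of the numbers come from the presentation, and the scenario is one simplified reading of "demographics change, no time trend".

```r
# Toy prediction table: one row per demographic cell and year (values made up).
cells <- data.frame(
  DOB_YY    = c(1985, 1985, 2009, 2009),
  Group     = c("younger", "older", "younger", "older"),
  Counts    = c(800, 200, 600, 400),         # deliveries in the cell
  Predicted = c(0.010, 0.020, 0.012, 0.030)  # predicted P(multiple)
)

# Population-weighted average prediction for each year.
by_year <- sapply(split(cells, cells$DOB_YY),
                  function(d) weighted.mean(d$Predicted, d$Counts))

# "What if?" sketch: 2009 demographic mix with 1985 predicted rates.
# Relies on the groups appearing in the same order in both years.
scenario <- weighted.mean(cells$Predicted[cells$DOB_YY == 1985],
                          cells$Counts[cells$DOB_YY == 2009])
```

Here the aggregate rate is 0.012 in 1985 and 0.0192 in 2009, while the scenario gives 0.014, so in this toy case the shift in demographic mix alone explains only part of the rise.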
  29. Drill Down: Look at Predictions for Specific Groups
     Select out predicted values from the prediction data frame for detailed groups:
     • Age of mother and father
     • Race/ethnicity of mother and father
  32. Conclusions from Logistic Regression
     • Dads matter: use of fertility treatment is associated with older dads as well as older moms
     • Hispanics show a relatively small increase in multiples for both younger and older couples
     • Asian pattern is similar to whites, but lower for all age groups
     • Blacks show a similar increase in multiples for both younger and older couples
         • Other factors related to multiples? (Diet, high BMI, other genetic factors)
  33. But How Many Babies?
  34. GLM Tweedie Model: How Many Babies?
     • When the power parameter is between 1 and 2, the Tweedie is a compound Poisson distribution
     • Appropriate for positive data that also include a 'clump' of exact zeros
     • Dependent variable: Number of additional babies (Plurality – 1)
     • Same independent variables
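The "clump of exact zeros" can be seen by simulating a compound Poisson-gamma variable in base R: a Poisson number of gamma-distributed pieces is summed for each observation. The rate and gamma parameters below are arbitrary assumptions, and this simulation only illustrates the distribution's shape; it is not the model fitted in the presentation.

```r
set.seed(123)
n <- 10000
lambda <- 0.5   # assumed Poisson rate (illustrative)

# Number of gamma-distributed pieces in each observation.
n_pieces <- rpois(n, lambda)

# Sum n_pieces gamma draws per observation; zero pieces gives an exact zero.
y <- vapply(n_pieces,
            function(k) sum(rgamma(k, shape = 2, rate = 1)),
            numeric(1))

mean(y == 0)   # close to exp(-lambda), the Poisson probability of zero pieces
exp(-lambda)
```

The result is continuous and positive whenever at least one piece is drawn, and exactly zero otherwise, which is the mixed shape the slide describes.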
  36. Further Research
     • Data management
         • Further cleaning of data
         • Import more variables
         • Import more years (additional use of weights)
     • Multiple Births Analysis
         • More variables (e.g., proxies for fertility treatment trends)
         • Investigation of sub-groups (e.g., young blacks)
         • Improved computation of number of births per delivery
     • Other analyses with birth data
  37. Summary of Approach
     • "Unpredictability" of multiple births requires a large data set to have the power to capture effects
     • Significant challenges in importing and cleaning the data; using R and .xdf files makes it possible
     • Even with a huge data set, "cells" of tables looking at multiple factors can be small
     • Using a combined .xdf file, we can use individual-level analysis to examine conditional effects of a variety of factors
  38. References
     • Martin JA, Hamilton BE, Osterman MJK. Three decades of twin births in the United States, 1980-2009. NCHS data brief, no. 80. Hyattsville, MD: National Center for Health Statistics. 2012.
     • Blondel B, Kaminski M. Trends in the occurrence, determinants, and consequences of multiple births. Seminars in Perinatology. 2002; 26(4):239-49.
     • Vahratian A. Utilization of fertility-related services in the United States. Fertil Steril. 2008 October; 90(4):1317-1319.
  39. Thank you!
     • R-Core Team
     • R Package Developers
     • R Community
     • Revolution R Enterprise Customers and Beta Testers
     • Colleagues at Revolution