SlideShare a Scribd company logo
1 of 39
Download to read offline
Revolution Confidential




T he R is e in Multiple
  B irths in the U.S .:
  A n A nalys is of a
   Hundred-Million
B irth R ec ords with R
   S us an I. R anney, P h.D.

           J S M 2012
T he Tools                                        Revolution Confidential




 Open Source R
   Flexible, powerful, great graphics
   Great for prototyping, not very scalable
 RevoScaleR (with Revolution R Enterprise)
   Efficient file format (.xdf)
   Functions of accessing/importing external data sets
    (fixed format & delimited text, SPSS, SAS, ODBC)
   Very fast, parallelized, distributed analysis functions
    (summary stats, crosstabs/cubes, linear models,
    kmeans clustering, logistic regression, glm)

                                                                    2
T he Data                                        Revolution Confidential




 Public-use data sets containing information
  on all births in the United States for each
  year from 1985 to 2009 are available to
  download:
 http://www.cdc.gov/nchs/data_access/Vitalstatsonline.htm
 “These natality files are gigantic; they’re
  approximately 3.1 GB uncompressed.
  That’s a little larger than R can easily
  process” – Joseph Adler, R in a Nutshell

                                                                   3
T he U.S . B irth Data (c ontinued)                Revolution Confidential




 Data for each year are contained in a compressed,
  fixed-format, text files
 Typically 3 to 4 million records per file
 Variables and structure of the files sometimes change
  from year to year, with birth certificates revised in1989
  and 2003. Warnings:
  NOTE: THE RECORD LAYOUT OF THIS FILE HAS
      CHANGED SUBSTANTIALLY. USERS
      SHOULD READ THIS DOCUMENT
      CAREFULLY.
 Reporting can differ state-to-state for any given year

                                                                     4
T he P roc es s                               Revolution Confidential




      Basic “life cycle of analysis”
         Import and combine data
           25 years
           Over 100 million obs.
         Check and clean data
         Basic variable summaries
         Big data logistic/glm regressions
      All on my laptop
      Option to distribute computations
       to cluster
                                                                5
T he Ques tion            Revolution Confidential



CDC Report in Jan. 2012




                                            6
T he Ques tion                                  Revolution Confidential




   What accounts for the increase in multiple
   births in the United States?
     Can we separate out effects of mother’s and
     father’s ages, race/ethnicity?
     Examine time trends by sub-group, assumed
     to be associated with fertility treatment
     CDC finding: The older age of women at
     childbirth accounts for only about 1/3 of the rise
     in twinning over 30 years (but this mixes in the
     increased rate of “twinning” for older women)


                                                                  7
Revolution Confidential




Importing the U.S . B irth Data for
            Us e in R




                                                  8
E xample of Differenc es for Different Years  Revolution Confidential




To create common variables across years, use
  common names and new factor levels for ‘colInfo’ in
  rxImport function. For example:
 For 1985:
  SEX = list(type="factor", start=35, width=1,
     levels=c("1", "2"),
     newLevels = c("Male", "Female"),
     description = "Sex of Infant“)
 For 2003:
  SEX = list(type="factor", start=436, width=1,
      levels=c("M", "F"),
      newLevels = c("Male", "Female"),
      description = "Sex of Infant”)


                                                                9
C reating Trans formed Variables on Import  Revolution Confidential




Use standard R syntax to create transformed
  variables.
 For example, create a factor for Mom’s Age
  using the imported MAGER integer variable:
   MomAgeR7 = cut(MAGER,
     breaks =c(0, 19, 24, 29, 34, 39, 98, 99),
     labels = c("Under 20", "20-24", "25-29",
     "30-34", "35-39", "Over 39", "Missing"))
 Create binary variable for “IsMultiple”
   IsMultiple = DPLURAL_REC != 'Single'


                                                             10
S teps for C omplete Import                Revolution Confidential




 Lists for column information and transforms are
  created for 3 base years: 1985, 1989, 2003
  when there were very large changes in the
  structure of the input files
 Changes to these lists are made where
  appropriate for in-between years
 A test script is run, importing only 1000
  observations per year for a subset of years
 Full script is run, importing each year, sorting
  according to key demographic characteristics,
  and appending to a master .xdf file

                                                            11
Revolution Confidential




E xamining and C leaning the B ig
           Data F ile




                                               12
E xamining B as ic Information            Revolution Confidential




 Basic file information
>rxGetInfo(birthAll)
File name:
  C:RevolutionDataCDCBirthUS85to09S.xdf
  Number of observations: 100672041
  Number of variables: 50
  Number of blocks: 215
Use rxSummary to compute summary statistics
 for continuous data and category counts for
 each of the factors (about 4 minutes on my
 laptop)
rxSummary(~., data=birthAll, blocksPerRead = 10)


                                                           13
E xample of S ummary S tatis tic s     Revolution Confidential




MomAgeR7   Counts     DadAgeR8   Counts
Under 20   11918891   Under 20    3226602
20-24      25975642   20-24      15304803
25-29      28701398   25-29      23805056
30-34      22341530   30-34      23179418
35-39       9788753   35-39      13289015
Over 39     1945827   40-44       4984146
Missing           0   Over 44     2140207
                      Missing    14742794



                                                        14
His tograms by Year                     Revolution Confidential




Easily check for basic errors in data import
  (e.g. wrong position in file) by creating
  histograms by year – very fast (just seconds
  on my laptop)
 Example: Distribution of mother’s age by
  year. Use F() to have the integer year treated
  as a factor.
rxHistogram(~MAGER| F(DOB_YY),
  data=birthAll, blocksPerRead = 10,
  layout=c(5,5))

                                                         15
A ge of Mother Over Time   Revolution Confidential




                                            16
Drill Down and E xtrac t S ubs amples             Revolution Confidential




 Take a quick look at “older” fathers:   Dad’s
rxSummary(~F(UFAGECOMB),                  Age     Counts
  data=birthAll,                          80       141
  blocksPerRead = 10)                     81       108
                                          82        81
 What’s going on with 89-year old        83        74
  Dads? Extract a data frame:             84        56
dad89 <- rxDataStep(                      85        43
  inData = birthAll,                      86        43
  rowSelection = UFAGECOMB == 89,         87        26
  varsToKeep = c("DOB_YY", "MAGER",       88        27
  "MAR", "STATENAT", "FRACEREC"),         89      3327
  blocksPerRead = 10)

                                                                   17
Year and S tate for 89-Year-Old F athers
                                    Revolution Confidential




 rxCube(~F(DOB_YY):STATENAT,
   data=dad89, removeZeroCounts=TRUE)
       F_DOB_YY STATENAT   Counts
       1990 California     1
       1999 California     1
       2000 California     1
       1996 Hawaii         1
       1997 Louisiana      1
       1986 New Jersey     1
       1995 New Jersey     1
       1996 Ohio           1
       1989 Texas       3316
       1990 Texas          1
       2001 Texas          1
       1985 Washington     1
                                                     18
Revolution Confidential



Demographic s of Multiple B irths
         1985-2009




                                                19
Unit of Obs ervation: Delivery          Revolution Confidential




 Appropriate unit of observation is the
  delivery (resulting in 1 or more live births)
  rather than the individual birth.
 Use 1/Plurality as probability weight
 Alternative: Look at nearby records to
  compute the “Reported Delivery Birth Order”
  (RDPO), then select on only the 1st born in
  the delivery for the analysis


                                                         20
Revolution Confidential




                 21
Revolution Confidential




                 22
Revolution Confidential




                 23
Revolution Confidential




 Multiple B irths L ogis tic
R egres s ion & P redic tions




                                                 24
L ogis tic R egres s ion                                  Revolution Confidential



Logistic Regression Results for: IsMultiple ~ DadAgeR8 + MomAgeR7 +
   FRaceEthn + MRaceEthn +
   DadAgeR8:FRaceEthn:MNTHS_SINCE_JAN85 +
   MomAgeR7:MRaceEthn:MNTHS_SINCE_JAN85

  File name: C:RevolutionDataCDCBirthUS85to09R.xdf
  Probability weights: PluralWeight
  Dependent variable(s): IsMultiple
  Total independent variables: 118 (Including number dropped: 12)
  Number of valid observations: 100672041
  Number of missing observations: 0 -2*LogLikelihood: 14555447.8514
    (Residual deviance on 100671935 degrees of freedom)


About 6 minutes on my laptop



                                                                           25
C ounts of Deliveries by all Demographic
C ombinations by Year                               Revolution Confidential




rxCube(~DadAgeR8:MomAgeR7:FRaceEthn:MRaceEthn:F(DOB_YY),
   data = birthAllR, pweights = "PluralWeight",
   blocksPerRead = 10)

 Under 10 seconds to compute on my laptop
 Resulting data frame has 50,400 rows
  representing all the demographic
  combinations for each of the 25 years
 44,661 have counts <= 1000
 Provides input data for predictions and
  weights for aggregating predictions
                                                                     26
P redic t and A ggregate by Year             Revolution Confidential




predOut <- rxPredict(logitObj,
  data = catAllDF)
 Create predictions for each detailed
  demographic group
 Aggregate for each year using population
  percentages for each detailed group for each
  year
 Compare with actuals
 Perform “What if?” scenarios
   Scenario 1: Only change in demographic; no
    “fertility-treatment” time trand

                                                              27
Revolution Confidential




                 28
Drill Down: L ook at P redic tions for S pec ific
G roups                                   Revolution Confidential




 Select out predicted values from prediction
  data frame for detailed groups:
   Age of mother and father
   Race/ethnicity of mother and father




                                                           29
Revolution Confidential




                 30
Revolution Confidential




                 31
Revolution Confidential




C onc lus ions from L ogis tic R egres s ion     Revolution Confidential




 Dads matter: Use of fertility treatment
  associated with older dads as well as older
  Mom’s
 Hispanics show relatively small increase in
  multiples for both younger and older couples
 Asian pattern similar to whites, but lower for all
  age groups
 Black show similar increase in multiples for both
  younger and older couples
   Other Factors Related to Multiples? (Diet, high BMI,
    other genetic factors)
                                                                  32
Revolution Confidential




B ut How Many B abies ?                   Revolution Confidential




                                                           33
Revolution Confidential




G L M Tweedie Model: How Many B abies ?      Revolution Confidential




 When power parameter is between 1 and 2,
  Tweedie is a compound Poisson distribution
 Appropriate for data with positive data that
  also includes a ‘clump’ of exact zeros
 Dependent variable: Number of additional
  babies (Plurality – 1)
 Same independent variables



                                                              34
Revolution Confidential




                          Revolution Confidential




                                           35
Revolution Confidential




F urther R es earc h                               Revolution Confidential




 Data management
   Further cleaning of data
   Import more variables
   Import more years (additional use of weights)
 Multiple Births Analysis
   More variables (e.g., proxies for fertility treatment
    trends)
   Investigation of sub-groups (e.g., young blacks)
   Improved computation of number of births per
    delivery
 Other analyses with birth data
                                                                    36
Revolution Confidential




S ummary of A pproac h
                                               Revolution Confidential




 “Unpredictability” of multiple births requires
  large data set to have the power to capture
  effects
 Significant challenges in importing and cleaning
  the data – using R and .xdf files makes it
  possible
 Even with a huge data, “cells” of tables looking
  at multiple factors can be small
 Using combined.xdf file, we can use individual-
  level analysis to examine conditional effects of
  a variety of factors

                                                                37
Revolution Confidential




R eferenc es
                                                Revolution Confidential




 Martin JA, Hamilton BE, Osterman MJK. Three
  decades of twin births in the United States, 1980-
  2009. NCHS data brief, no. 80. Hyattsville, MD:
  National Center for Health Statistics. 2012.
 Blondel B, Kaminiski, M. Trends in the Occurrence,
  Determinants, and Consequences of Multiple
  Births. Seminars in Perinatology. 26(4):239-49,
  2002.
 Vahratian, Anjel. Utilization of Fertility-Related
  Services in the United States. Fertil Steril. 2008
  October; 90(4):1317-1319.
                                                                 38
Revolution Confidential




T hank you!
                                            Revolution Confidential




 R-Core Team
 R Package Developers
 R Community
 Revolution R Enterprise Customers and Beta
  Testers
 Colleagues at Revolution Analytics
Contact:
sue@revolutionanalytics.com
                                                             39

More Related Content

Similar to The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R

FormalWriteupTornado_1
FormalWriteupTornado_1FormalWriteupTornado_1
FormalWriteupTornado_1Katie Harvey
 
Open Analytics Environment
Open Analytics EnvironmentOpen Analytics Environment
Open Analytics EnvironmentIan Foster
 
New Advances in High Performance Analytics with R: 'Big Data' Decision Trees ...
New Advances in High Performance Analytics with R: 'Big Data' Decision Trees ...New Advances in High Performance Analytics with R: 'Big Data' Decision Trees ...
New Advances in High Performance Analytics with R: 'Big Data' Decision Trees ...Revolution Analytics
 
Data mining
Data miningData mining
Data miningScilab
 
Consistency, Availability, Partition: Make Your Choice
Consistency, Availability, Partition: Make Your ChoiceConsistency, Availability, Partition: Make Your Choice
Consistency, Availability, Partition: Make Your ChoiceAndrea Giuliano
 
Internship_Presentation
Internship_PresentationInternship_Presentation
Internship_PresentationSourabh Gujar
 
Data management Final Project
Data management Final ProjectData management Final Project
Data management Final ProjectRutuja Gangane
 
Explore, Analyze and Present your data
Explore, Analyze and Present your dataExplore, Analyze and Present your data
Explore, Analyze and Present your datagcalmettes
 
Stata time-series-fall-2011
Stata time-series-fall-2011Stata time-series-fall-2011
Stata time-series-fall-2011Feryanto W K
 
Final Report
Final ReportFinal Report
Final ReportBrian Wu
 
Accelerating Genomics SNPs Processing and Interpretation with Apache Spark
Accelerating Genomics SNPs Processing and Interpretation with Apache SparkAccelerating Genomics SNPs Processing and Interpretation with Apache Spark
Accelerating Genomics SNPs Processing and Interpretation with Apache SparkDatabricks
 
A Comparison Of Fitness Scallng Methods In Evolutionary Algorithms
A Comparison Of Fitness Scallng Methods In Evolutionary AlgorithmsA Comparison Of Fitness Scallng Methods In Evolutionary Algorithms
A Comparison Of Fitness Scallng Methods In Evolutionary AlgorithmsTracy Hill
 
A Genome Sequence Analysis System Built with Hypertable
A Genome Sequence Analysis System Built with HypertableA Genome Sequence Analysis System Built with Hypertable
A Genome Sequence Analysis System Built with HypertableDATAVERSITY
 
Logistic Regression in Case-Control Study
Logistic Regression in Case-Control StudyLogistic Regression in Case-Control Study
Logistic Regression in Case-Control StudySatish Gupta
 
Project 1FINA 415-15BGroup of 5.Due by 18092015..docx
Project 1FINA 415-15BGroup of 5.Due by 18092015..docxProject 1FINA 415-15BGroup of 5.Due by 18092015..docx
Project 1FINA 415-15BGroup of 5.Due by 18092015..docxwkyra78
 

Similar to The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R (20)

FormalWriteupTornado_1
FormalWriteupTornado_1FormalWriteupTornado_1
FormalWriteupTornado_1
 
Bioinformatica 27-10-2011-t4-alignments
Bioinformatica 27-10-2011-t4-alignmentsBioinformatica 27-10-2011-t4-alignments
Bioinformatica 27-10-2011-t4-alignments
 
Bill howe 3_mapreduce
Bill howe 3_mapreduceBill howe 3_mapreduce
Bill howe 3_mapreduce
 
Open Analytics Environment
Open Analytics EnvironmentOpen Analytics Environment
Open Analytics Environment
 
New Advances in High Performance Analytics with R: 'Big Data' Decision Trees ...
New Advances in High Performance Analytics with R: 'Big Data' Decision Trees ...New Advances in High Performance Analytics with R: 'Big Data' Decision Trees ...
New Advances in High Performance Analytics with R: 'Big Data' Decision Trees ...
 
Sofi Presentation
Sofi PresentationSofi Presentation
Sofi Presentation
 
Power DataViz
Power DataVizPower DataViz
Power DataViz
 
Data mining
Data miningData mining
Data mining
 
Consistency, Availability, Partition: Make Your Choice
Consistency, Availability, Partition: Make Your ChoiceConsistency, Availability, Partition: Make Your Choice
Consistency, Availability, Partition: Make Your Choice
 
Internship_Presentation
Internship_PresentationInternship_Presentation
Internship_Presentation
 
Data management Final Project
Data management Final ProjectData management Final Project
Data management Final Project
 
Explore, Analyze and Present your data
Explore, Analyze and Present your dataExplore, Analyze and Present your data
Explore, Analyze and Present your data
 
Stata time-series-fall-2011
Stata time-series-fall-2011Stata time-series-fall-2011
Stata time-series-fall-2011
 
Final Report
Final ReportFinal Report
Final Report
 
Major project.pptx
Major project.pptxMajor project.pptx
Major project.pptx
 
Accelerating Genomics SNPs Processing and Interpretation with Apache Spark
Accelerating Genomics SNPs Processing and Interpretation with Apache SparkAccelerating Genomics SNPs Processing and Interpretation with Apache Spark
Accelerating Genomics SNPs Processing and Interpretation with Apache Spark
 
A Comparison Of Fitness Scallng Methods In Evolutionary Algorithms
A Comparison Of Fitness Scallng Methods In Evolutionary AlgorithmsA Comparison Of Fitness Scallng Methods In Evolutionary Algorithms
A Comparison Of Fitness Scallng Methods In Evolutionary Algorithms
 
A Genome Sequence Analysis System Built with Hypertable
A Genome Sequence Analysis System Built with HypertableA Genome Sequence Analysis System Built with Hypertable
A Genome Sequence Analysis System Built with Hypertable
 
Logistic Regression in Case-Control Study
Logistic Regression in Case-Control StudyLogistic Regression in Case-Control Study
Logistic Regression in Case-Control Study
 
Project 1FINA 415-15BGroup of 5.Due by 18092015..docx
Project 1FINA 415-15BGroup of 5.Due by 18092015..docxProject 1FINA 415-15BGroup of 5.Due by 18092015..docx
Project 1FINA 415-15BGroup of 5.Due by 18092015..docx
 

More from Revolution Analytics

Speeding up R with Parallel Programming in the Cloud
Speeding up R with Parallel Programming in the CloudSpeeding up R with Parallel Programming in the Cloud
Speeding up R with Parallel Programming in the CloudRevolution Analytics
 
Migrating Existing Open Source Machine Learning to Azure
Migrating Existing Open Source Machine Learning to AzureMigrating Existing Open Source Machine Learning to Azure
Migrating Existing Open Source Machine Learning to AzureRevolution Analytics
 
Speed up R with parallel programming in the Cloud
Speed up R with parallel programming in the CloudSpeed up R with parallel programming in the Cloud
Speed up R with parallel programming in the CloudRevolution Analytics
 
Predicting Loan Delinquency at One Million Transactions per Second
Predicting Loan Delinquency at One Million Transactions per SecondPredicting Loan Delinquency at One Million Transactions per Second
Predicting Loan Delinquency at One Million Transactions per SecondRevolution Analytics
 
The Value of Open Source Communities
The Value of Open Source CommunitiesThe Value of Open Source Communities
The Value of Open Source CommunitiesRevolution Analytics
 
Building a scalable data science platform with R
Building a scalable data science platform with RBuilding a scalable data science platform with R
Building a scalable data science platform with RRevolution Analytics
 
The Business Economics and Opportunity of Open Source Data Science
The Business Economics and Opportunity of Open Source Data ScienceThe Business Economics and Opportunity of Open Source Data Science
The Business Economics and Opportunity of Open Source Data ScienceRevolution Analytics
 
Taking R Analytics to SQL and the Cloud
Taking R Analytics to SQL and the CloudTaking R Analytics to SQL and the Cloud
Taking R Analytics to SQL and the CloudRevolution Analytics
 
The Network structure of R packages on CRAN & BioConductor
The Network structure of R packages on CRAN & BioConductorThe Network structure of R packages on CRAN & BioConductor
The Network structure of R packages on CRAN & BioConductorRevolution Analytics
 
The network structure of cran 2015 07-02 final
The network structure of cran 2015 07-02 finalThe network structure of cran 2015 07-02 final
The network structure of cran 2015 07-02 finalRevolution Analytics
 
Simple Reproducibility with the checkpoint package
Simple Reproducibilitywith the checkpoint packageSimple Reproducibilitywith the checkpoint package
Simple Reproducibility with the checkpoint packageRevolution Analytics
 

More from Revolution Analytics (20)

Speeding up R with Parallel Programming in the Cloud
Speeding up R with Parallel Programming in the CloudSpeeding up R with Parallel Programming in the Cloud
Speeding up R with Parallel Programming in the Cloud
 
Migrating Existing Open Source Machine Learning to Azure
Migrating Existing Open Source Machine Learning to AzureMigrating Existing Open Source Machine Learning to Azure
Migrating Existing Open Source Machine Learning to Azure
 
R in Minecraft
R in Minecraft R in Minecraft
R in Minecraft
 
The case for R for AI developers
The case for R for AI developersThe case for R for AI developers
The case for R for AI developers
 
Speed up R with parallel programming in the Cloud
Speed up R with parallel programming in the CloudSpeed up R with parallel programming in the Cloud
Speed up R with parallel programming in the Cloud
 
The R Ecosystem
The R EcosystemThe R Ecosystem
The R Ecosystem
 
R Then and Now
R Then and NowR Then and Now
R Then and Now
 
Predicting Loan Delinquency at One Million Transactions per Second
Predicting Loan Delinquency at One Million Transactions per SecondPredicting Loan Delinquency at One Million Transactions per Second
Predicting Loan Delinquency at One Million Transactions per Second
 
Reproducible Data Science with R
Reproducible Data Science with RReproducible Data Science with R
Reproducible Data Science with R
 
The Value of Open Source Communities
The Value of Open Source CommunitiesThe Value of Open Source Communities
The Value of Open Source Communities
 
The R Ecosystem
The R EcosystemThe R Ecosystem
The R Ecosystem
 
R at Microsoft (useR! 2016)
R at Microsoft (useR! 2016)R at Microsoft (useR! 2016)
R at Microsoft (useR! 2016)
 
Building a scalable data science platform with R
Building a scalable data science platform with RBuilding a scalable data science platform with R
Building a scalable data science platform with R
 
R at Microsoft
R at MicrosoftR at Microsoft
R at Microsoft
 
The Business Economics and Opportunity of Open Source Data Science
The Business Economics and Opportunity of Open Source Data ScienceThe Business Economics and Opportunity of Open Source Data Science
The Business Economics and Opportunity of Open Source Data Science
 
Taking R Analytics to SQL and the Cloud
Taking R Analytics to SQL and the CloudTaking R Analytics to SQL and the Cloud
Taking R Analytics to SQL and the Cloud
 
The Network structure of R packages on CRAN & BioConductor
The Network structure of R packages on CRAN & BioConductorThe Network structure of R packages on CRAN & BioConductor
The Network structure of R packages on CRAN & BioConductor
 
The network structure of cran 2015 07-02 final
The network structure of cran 2015 07-02 finalThe network structure of cran 2015 07-02 final
The network structure of cran 2015 07-02 final
 
Simple Reproducibility with the checkpoint package
Simple Reproducibilitywith the checkpoint packageSimple Reproducibilitywith the checkpoint package
Simple Reproducibility with the checkpoint package
 
R at Microsoft
R at MicrosoftR at Microsoft
R at Microsoft
 

Recently uploaded

Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsAndrey Dotsenko
 
Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Neo4j
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentationphoebematthew05
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDGMarianaLemus7
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 

Recently uploaded (20)

Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentation
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDG
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 

The Rise in Multiple Births in the U.S.: An Analysis of a Hundred-Million Birth Records with R

  • 1. Revolution Confidential T he R is e in Multiple B irths in the U.S .: A n A nalys is of a Hundred-Million B irth R ec ords with R S us an I. R anney, P h.D. J S M 2012
  • 2. T he Tools Revolution Confidential  Open Source R  Flexible, powerful, great graphics  Great for prototyping, not very scalable  RevoScaleR (with Revolution R Enterprise)  Efficient file format (.xdf)  Functions of accessing/importing external data sets (fixed format & delimited text, SPSS, SAS, ODBC)  Very fast, parallelized, distributed analysis functions (summary stats, crosstabs/cubes, linear models, kmeans clustering, logistic regression, glm) 2
  • 3. T he Data Revolution Confidential  Public-use data sets containing information on all births in the United States for each year from 1985 to 2009 are available to download: http://www.cdc.gov/nchs/data_access/Vitalstatsonline.htm  “These natality files are gigantic; they’re approximately 3.1 GB uncompressed. That’s a little larger than R can easily process” – Joseph Adler, R in a Nutshell 3
  • 4. T he U.S . B irth Data (c ontinued) Revolution Confidential  Data for each year are contained in a compressed, fixed-format, text files  Typically 3 to 4 million records per file  Variables and structure of the files sometimes change from year to year, with birth certificates revised in1989 and 2003. Warnings: NOTE: THE RECORD LAYOUT OF THIS FILE HAS CHANGED SUBSTANTIALLY. USERS SHOULD READ THIS DOCUMENT CAREFULLY.  Reporting can differ state-to-state for any given year 4
  • 5. T he P roc es s Revolution Confidential  Basic “life cycle of analysis”  Import and combine data  25 years  Over 100 million obs.  Check and clean data  Basic variable summaries  Big data logistic/glm regressions  All on my laptop  Option to distribute computations to cluster 5
  • 6. T he Ques tion Revolution Confidential CDC Report in Jan. 2012 6
  • 7. T he Ques tion Revolution Confidential What accounts for the increase in multiple births in the United States? Can we separate out effects of mother’s and father’s ages, race/ethnicity? Examine time trends by sub-group, assumed to be associated with fertility treatment CDC finding: The older age of women at childbirth accounts for only about 1/3 of the rise in twinning over 30 years (but this mixes in the increased rate of “twinning” for older women) 7
  • 8. Revolution Confidential Importing the U.S . B irth Data for Us e in R 8
  • 9. E xample of Differenc es for Different Years Revolution Confidential To create common variables across years, use common names and new factor levels for ‘colInfo’ in rxImport function. For example:  For 1985: SEX = list(type="factor", start=35, width=1, levels=c("1", "2"), newLevels = c("Male", "Female"), description = "Sex of Infant“)  For 2003: SEX = list(type="factor", start=436, width=1, levels=c("M", "F"), newLevels = c("Male", "Female"), description = "Sex of Infant”) 9
  • 10. C reating Trans formed Variables on Import Revolution Confidential Use standard R syntax to create transformed variables.  For example, create a factor for Mom’s Age using the imported MAGER integer variable: MomAgeR7 = cut(MAGER, breaks =c(0, 19, 24, 29, 34, 39, 98, 99), labels = c("Under 20", "20-24", "25-29", "30-34", "35-39", "Over 39", "Missing"))  Create binary variable for “IsMultiple” IsMultiple = DPLURAL_REC != 'Single' 10
  • 11. S teps for C omplete Import Revolution Confidential  Lists for column information and transforms are created for 3 base years: 1985, 1989, 2003 when there were very large changes in the structure of the input files  Changes to these lists are made where appropriate for in-between years  A test script is run, importing only 1000 observations per year for a subset of years  Full script is run, importing each year, sorting according to key demographic characteristics, and appending to a master .xdf file 11
  • 12. Revolution Confidential E xamining and C leaning the B ig Data F ile 12
  • 13. E xamining B as ic Information Revolution Confidential  Basic file information >rxGetInfo(birthAll) File name: C:RevolutionDataCDCBirthUS85to09S.xdf Number of observations: 100672041 Number of variables: 50 Number of blocks: 215 Use rxSummary to compute summary statistics for continuous data and category counts for each of the factors (about 4 minutes on my laptop) rxSummary(~., data=birthAll, blocksPerRead = 10) 13
  • 14. E xample of S ummary S tatis tic s Revolution Confidential MomAgeR7 Counts DadAgeR8 Counts Under 20 11918891 Under 20 3226602 20-24 25975642 20-24 15304803 25-29 28701398 25-29 23805056 30-34 22341530 30-34 23179418 35-39 9788753 35-39 13289015 Over 39 1945827 40-44 4984146 Missing 0 Over 44 2140207 Missing 14742794 14
  • 15. His tograms by Year Revolution Confidential Easily check for basic errors in data import (e.g. wrong position in file) by creating histograms by year – very fast (just seconds on my laptop)  Example: Distribution of mother’s age by year. Use F() to have the integer year treated as a factor. rxHistogram(~MAGER| F(DOB_YY), data=birthAll, blocksPerRead = 10, layout=c(5,5)) 15
  • 16. A ge of Mother Over Time Revolution Confidential 16
  • 17. Drill Down and E xtrac t S ubs amples Revolution Confidential  Take a quick look at “older” fathers: Dad’s rxSummary(~F(UFAGECOMB), Age Counts data=birthAll, 80 141 blocksPerRead = 10) 81 108 82 81  What’s going on with 89-year old 83 74 Dads? Extract a data frame: 84 56 dad89 <- rxDataStep( 85 43 inData = birthAll, 86 43 rowSelection = UFAGECOMB == 89, 87 26 varsToKeep = c("DOB_YY", "MAGER", 88 27 "MAR", "STATENAT", "FRACEREC"), 89 3327 blocksPerRead = 10) 17
  • 18. Year and S tate for 89-Year-Old F athers Revolution Confidential rxCube(~F(DOB_YY):STATENAT, data=dad89, removeZeroCounts=TRUE) F_DOB_YY STATENAT Counts 1990 California 1 1999 California 1 2000 California 1 1996 Hawaii 1 1997 Louisiana 1 1986 New Jersey 1 1995 New Jersey 1 1996 Ohio 1 1989 Texas 3316 1990 Texas 1 2001 Texas 1 1985 Washington 1 18
  • 19. Revolution Confidential Demographic s of Multiple B irths 1985-2009 19
  • 20. Unit of Obs ervation: Delivery Revolution Confidential  Appropriate unit of observation is the delivery (resulting in 1 or more live births) rather than the individual birth.  Use 1/Plurality as probability weight  Alternative: Look at nearby records to compute the “Reported Delivery Birth Order” (RDPO), then select on only the 1st born in the delivery for the analysis 20
  • 24. Revolution Confidential Multiple B irths L ogis tic R egres s ion & P redic tions 24
  • 25. L ogis tic R egres s ion Revolution Confidential Logistic Regression Results for: IsMultiple ~ DadAgeR8 + MomAgeR7 + FRaceEthn + MRaceEthn + DadAgeR8:FRaceEthn:MNTHS_SINCE_JAN85 + MomAgeR7:MRaceEthn:MNTHS_SINCE_JAN85 File name: C:RevolutionDataCDCBirthUS85to09R.xdf Probability weights: PluralWeight Dependent variable(s): IsMultiple Total independent variables: 118 (Including number dropped: 12) Number of valid observations: 100672041 Number of missing observations: 0 -2*LogLikelihood: 14555447.8514 (Residual deviance on 100671935 degrees of freedom) About 6 minutes on my laptop 25
  • 26. C ounts of Deliveries by all Demographic C ombinations by Year Revolution Confidential rxCube(~DadAgeR8:MomAgeR7:FRaceEthn:MRaceEthn:F(DOB_YY), data = birthAllR, pweights = "PluralWeight", blocksPerRead = 10)  Under 10 seconds to compute on my laptop  Resulting data frame has 50,400 rows representing all the demographic combinations for each of the 25 years  44,661 have counts <= 1000  Provides input data for predictions and weights for aggregating predictions 26
  • 27. P redic t and A ggregate by Year Revolution Confidential predOut <- rxPredict(logitObj, data = catAllDF)  Create predictions for each detailed demographic group  Aggregate for each year using population percentages for each detailed group for each year  Compare with actuals  Perform “What if?” scenarios  Scenario 1: Only change in demographic; no “fertility-treatment” time trand 27
  • 29. Drill Down: L ook at P redic tions for S pec ific G roups Revolution Confidential  Select out predicted values from prediction data frame for detailed groups:  Age of mother and father  Race/ethnicity of mother and father 29
  • 32. Revolution Confidential C onc lus ions from L ogis tic R egres s ion Revolution Confidential  Dads matter: Use of fertility treatment associated with older dads as well as older Mom’s  Hispanics show relatively small increase in multiples for both younger and older couples  Asian pattern similar to whites, but lower for all age groups  Black show similar increase in multiples for both younger and older couples  Other Factors Related to Multiples? (Diet, high BMI, other genetic factors) 32
  • 33. Revolution Confidential B ut How Many B abies ? Revolution Confidential 33
  • 34. Revolution Confidential G L M Tweedie Model: How Many B abies ? Revolution Confidential  When power parameter is between 1 and 2, Tweedie is a compound Poisson distribution  Appropriate for data with positive data that also includes a ‘clump’ of exact zeros  Dependent variable: Number of additional babies (Plurality – 1)  Same independent variables 34
  • 35. Revolution Confidential Revolution Confidential 35
  • 36. Revolution Confidential F urther R es earc h Revolution Confidential  Data management  Further cleaning of data  Import more variables  Import more years (additional use of weights)  Multiple Births Analysis  More variables (e.g., proxies for fertility treatment trends)  Investigation of sub-groups (e.g., young blacks)  Improved computation of number of births per delivery  Other analyses with birth data 36
  • 37. Revolution Confidential S ummary of A pproac h Revolution Confidential  “Unpredictability” of multiple births requires large data set to have the power to capture effects  Significant challenges in importing and cleaning the data – using R and .xdf files makes it possible  Even with a huge data, “cells” of tables looking at multiple factors can be small  Using combined.xdf file, we can use individual- level analysis to examine conditional effects of a variety of factors 37
  • 38. Revolution Confidential R eferenc es Revolution Confidential  Martin JA, Hamilton BE, Osterman MJK. Three decades of twin births in the United States, 1980- 2009. NCHS data brief, no. 80. Hyattsville, MD: National Center for Health Statistics. 2012.  Blondel B, Kaminiski, M. Trends in the Occurrence, Determinants, and Consequences of Multiple Births. Seminars in Perinatology. 26(4):239-49, 2002.  Vahratian, Anjel. Utilization of Fertility-Related Services in the United States. Fertil Steril. 2008 October; 90(4):1317-1319. 38
  • 39. Revolution Confidential T hank you! Revolution Confidential  R-Core Team  R Package Developers  R Community  Revolution R Enterprise Customers and Beta Testers  Colleagues at Revolution Analytics Contact: sue@revolutionanalytics.com 39