Secondary Data Analysis

17 Years Super Bowl Viewership

What do these time series have in common?

All wrong!
- All these time series are fabrications
- All these time series are "random walks"
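The point of this slide can be reproduced with a short simulation. The following is a minimal sketch, assuming Gaussian steps; the function name `random_walk`, the seeds, and the series length are illustrative choices, not taken from the original slides.

```python
import random

def random_walk(n_steps, seed, start=0.0):
    """Return n_steps + 1 values; each value is the previous one plus Gaussian noise."""
    rng = random.Random(seed)  # seeded so the fabricated series is reproducible
    values = [start]
    for _ in range(n_steps):
        values.append(values[-1] + rng.gauss(0.0, 1.0))
    return values

# Three independent walks of 17 steps each (length chosen arbitrarily).
walks = [random_walk(17, seed=s) for s in (1, 2, 3)]
for i, walk in enumerate(walks, start=1):
    print(f"walk {i}: start={walk[0]:.2f}, end={walk[-1]:.2f}")
```

Plotted against time, series like these can look like meaningful trends even though each step is pure noise.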

Welcome to secondary data analysis.

Secondary Data Analysis
B. Rey de Castro, Sc.D.
Guest Researcher, CDC National Center for Health Statistics
University of Maryland College Park School of Public Health
FMSC 720: Study Design in MCH Epidemiology
November 30, 2010

Secondary Data Analysis
- Data that you did not collect yourself
- Both the data and study design are givens
- The statistical analysis is up to you

Uses for Secondary Data
- Hypothesis generation/testing
- Pilot data for grant proposals
- Expanding knowledge
- Publications

National Health and Nutrition Examination Survey (NHANES)
http://www.cdc.gov/nchs/nhanes.htm
- Population: children and adults nationwide
- Method: face-to-face interview; physical exams
- Content: chronic and infectious disease; mental health and cognitive functioning; energy balance; reproductive history and sexual behavior; respiratory disease
- Data: N ~ 5,000 annually; initiated in the 1960s; annual since 1999; on-line tutorial

National Health Interview Survey (NHIS)
http://www.cdc.gov/nchs/nhis.htm
- Population: households, families, adults, and children nationwide
- Method: face-to-face interview
- Content: health conditions and behaviors; access to and use of health services
- Cancer Control Module (1987, 1992, 2000, 2003, and 2005): energy balance; cancer screening; sun avoidance; tobacco use and control; genetic testing
- Data: N ~ 40,000 households (~87,000 individuals) annually; initiated in 1957

Other Federal Surveys
- National Longitudinal Mortality Study: http://www.census.gov/nlms/
- National Health Care Survey: http://www.cdc.gov/nchs/nhcs.htm
- National Ambulatory Medical Care Survey: http://www.cdc.gov/nchs/about/major/ahcd/ahcd1.htm
- Medical Expenditure Panel Survey: http://www.meps.ahrq.gov/
- Medicare Current Beneficiary Survey: http://www.cms.hhs.gov/MCBS/
- Medicare Health Outcomes Survey: http://www.hosonline.org/
- National Survey on Drug Use and Health: http://www.oas.samhsa.gov/nhsda.htm
- National Survey of Family Growth: http://www.cdc.gov/nchs/about/major/nsfg/nsfgbiblio.htm

Strengths
- Inexpensive data collection and design costs
- More statistical power: larger samples
- Broader geographic area
- Generalizable to national population
- Improves understanding of hypothesis
- Test trends over time
- Potential for linkage: by person, geographically

Limitations 1
- Substantial time spent on statistical analysis
- Cross-sectional
- Recall bias
- Mismatch: ideal and feasible hypothesis
- Mismatch: hypothesis and original purpose
- Generalizability to small areas impossible
- Specialized statistical techniques

Limitations 2
- Quality
- Validity & reliability
- Changes to survey over time
- Poor documentation
- Restricted/conditional access
- Confidentiality

Recap
- Just a few examples of publicly available data
- Most are cross-sectional
- All employ a complex sampling design
- Many use multi-stage sampling
- Requires special software to analyze, e.g., SUDAAN
- Use of weighting, clustering, and stratification
- Differences in variance estimation methods

Statistical Weight
The statistical weight of a sampled person is the number of people in the population that the person represents.
Weights derived from:
- Selection probabilities
- Response rates
- Post-stratification adjustment, e.g., gender, education, income, region
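The definition above can be illustrated with a small calculation. This is a minimal sketch, assuming the weight is the inverse of the selection probability inflated for nonresponse; post-stratification is omitted, and the respondents, probabilities, and the helper `base_weight` are hypothetical.

```python
def base_weight(selection_prob, response_rate):
    """Inverse-probability weight, inflated to compensate for nonresponse."""
    return 1.0 / (selection_prob * response_rate)

# Hypothetical respondents: (selection probability, response rate, has_condition)
respondents = [
    (0.001, 0.8, 1),
    (0.002, 0.5, 0),
    (0.0005, 1.0, 1),
    (0.001, 0.8, 0),
]

weights = [base_weight(p, r) for p, r, _ in respondents]

# Each weight is the number of people in the population a respondent stands for.
population_represented = sum(weights)

weighted_prevalence = (
    sum(w * y for w, (_, _, y) in zip(weights, respondents)) / population_represented
)
unweighted_prevalence = sum(y for _, _, y in respondents) / len(respondents)

print(f"people represented: {population_represented:.0f}")
print(f"weighted prevalence: {weighted_prevalence:.3f}")
print(f"unweighted prevalence: {unweighted_prevalence:.3f}")
```

The weighted and unweighted prevalences differ because respondents with small selection probabilities stand for more people.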

Stratification
- Population divided before sampling into disjoint, exhaustive groups (strata)
- Members termed primary sampling units (PSUs)
- Independent samples are taken in each stratum
- Strata formed by similar geographic areas
- e.g., NHANES: partition US counties into 49 strata based on region and economic/racial characteristics
- Sample 2 counties (PSUs) from each stratum
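The sampling step described above can be sketched as follows. This assumes a simple random sample of PSUs within each stratum, which is a simplification (real surveys may select PSUs with unequal probabilities); the stratum and county names and the function `sample_psus` are hypothetical.

```python
import random

def sample_psus(strata, psus_per_stratum=2, seed=0):
    """Draw an independent simple random sample of PSUs from every stratum."""
    rng = random.Random(seed)
    return {
        name: rng.sample(psus, psus_per_stratum)
        for name, psus in strata.items()
    }

# Hypothetical strata: disjoint, exhaustive groups of counties (the PSUs).
strata = {
    "stratum_A": ["county_01", "county_02", "county_03"],
    "stratum_B": ["county_04", "county_05", "county_06", "county_07"],
    "stratum_C": ["county_08", "county_09"],
}

selected = sample_psus(strata)
for name, psus in selected.items():
    print(name, psus)
```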

Clustering
- Persons residing in a small area may have similar characteristics
- Thus, responses of subjects in a small area are potentially correlated
- Correlation must be accounted for in the analysis
- Survey analysis programs do this through strata/PSU information

Variance Estimation for Surveys
Linearization:
- Uses a Taylor series expansion to estimate variance of non-linear estimators
- Default method for most programs
- Requires stratification and PSU information
Replication:
- Calculates parameter estimates for each replicate and combines them to estimate variance
- Jackknife with replicate weights available for SUDAAN, STATA, SAS, and WesVar
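To make the linearization idea concrete, here is a minimal sketch for one non-linear estimator, the weighted mean (a ratio of two weighted totals). It assumes PSUs are sampled with replacement within strata and uses made-up records; the function name `linearized_se_of_mean` is illustrative and is not the routine of any survey package.

```python
from collections import defaultdict

def linearized_se_of_mean(records):
    """records: iterable of (stratum, psu, weight, y). Returns (mean, standard error)."""
    records = list(records)
    total_w = sum(w for _, _, w, _ in records)
    mean = sum(w * y for _, _, w, y in records) / total_w

    # First-order Taylor expansion of the ratio: each record contributes
    # w * (y - mean) / total_w; contributions are summed within each PSU.
    psu_totals = defaultdict(lambda: defaultdict(float))
    for stratum, psu, w, y in records:
        psu_totals[stratum][psu] += w * (y - mean) / total_w

    # Between-PSU variability within each stratum, summed over strata.
    variance = 0.0
    for psus in psu_totals.values():
        n_h = len(psus)
        if n_h < 2:
            raise ValueError("each stratum needs at least two PSUs")
        z_bar = sum(psus.values()) / n_h
        variance += n_h / (n_h - 1) * sum((z - z_bar) ** 2 for z in psus.values())
    return mean, variance ** 0.5

# Hypothetical data: 2 strata, 2 PSUs each, binary outcome y.
records = [
    ("s1", "p1", 100, 1), ("s1", "p1", 100, 0),
    ("s1", "p2", 200, 1), ("s1", "p2", 200, 1),
    ("s2", "p1", 150, 0), ("s2", "p1", 150, 0),
    ("s2", "p2", 50, 1), ("s2", "p2", 50, 0),
]
mean, se = linearized_se_of_mean(records)
print(f"weighted mean = {mean:.3f}, linearized SE = {se:.4f}")
```

Note that the calculation needs only the weight, stratum, and PSU of each record, which is why these design variables must be available for linearization.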

Replication vs. Linearization
- If the survey doesn't have replicate weights, use the full sample weights and linearization
- If the survey has replicate weights, use them with the jackknife procedure
- Most software uses the linearization method
- Only SUDAAN, STATA, SAS, and WesVar can incorporate replicate weights
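The replication approach can be sketched with a delete-one-PSU jackknife. This sketch builds its own replicate weights from stratum and PSU identifiers for illustration only; a survey that ships replicate weights supplies them (and the matching variance multipliers) in the data file, and those should be used instead. The data and the function names are hypothetical.

```python
def weighted_mean(weights, values):
    return sum(w * y for w, y in zip(weights, values)) / sum(weights)

def jackknife_se_of_mean(records):
    """records: iterable of (stratum, psu, weight, y). Returns (mean, standard error)."""
    records = list(records)
    values = [y for _, _, _, y in records]
    estimate = weighted_mean([w for _, _, w, _ in records], values)

    psus_by_stratum = {}
    for stratum, psu, _, _ in records:
        psus_by_stratum.setdefault(stratum, set()).add(psu)

    variance = 0.0
    for stratum, psus in psus_by_stratum.items():
        n_h = len(psus)
        if n_h < 2:
            raise ValueError("each stratum needs at least two PSUs")
        for dropped in sorted(psus):
            # Replicate weights: zero for the dropped PSU, inflated for the
            # remaining PSUs in the same stratum, unchanged elsewhere.
            replicate_weights = []
            for s, p, w, _ in records:
                if s != stratum:
                    replicate_weights.append(w)
                elif p == dropped:
                    replicate_weights.append(0.0)
                else:
                    replicate_weights.append(w * n_h / (n_h - 1))
            replicate_estimate = weighted_mean(replicate_weights, values)
            variance += (n_h - 1) / n_h * (replicate_estimate - estimate) ** 2
    return estimate, variance ** 0.5

# Hypothetical data: 2 strata, 2 PSUs each, binary outcome y.
records = [
    ("s1", "p1", 100, 1), ("s1", "p1", 100, 0),
    ("s1", "p2", 200, 1), ("s1", "p2", 200, 1),
    ("s2", "p1", 150, 0), ("s2", "p1", 150, 0),
    ("s2", "p2", 50, 1), ("s2", "p2", 50, 0),
]
estimate, jk_se = jackknife_se_of_mean(records)
print(f"weighted mean = {estimate:.3f}, jackknife SE = {jk_se:.4f}")
```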

Complex Survey Design
- Correct variance estimates
- Proper hypothesis testing
- Standard errors will tend to be larger
- Less likely to make Type I error

Statistical Software for Analyzing Health Surveys
Specifically designed for analyzing data utilizing complex sampling designs:
- SUDAAN
- WesVar
Others that can be used:
- SAS
- STATA
- SPSS
- Mplus

Data/Research Resources
- University of Michigan Inter-university Consortium for Political and Social Research (ICPSR): http://www.icpsr.umich.edu/
- UCLA Statistical Computing: http://www.ats.ucla.edu/stat/
- BRFSS Maps: http://apps.nccd.cdc.gov/gisbrfss/default.aspx
- State Cancer Profiles: http://statecancerprofiles.cancer.gov/

References
- Korn, E.L. and Graubard, B.I. (1999). Analysis of Health Surveys. New York: John Wiley.
- State Cancer Profiles: http://statecancerprofiles.cancer.gov/
- SUDAAN: http://www.rti.org/SUDAAN/
- SAS: http://www.sas.com/
- SPSS: http://www.spss.com/
- STATA: http://www.stata.com/
- WesVar: http://www.westat.com/wesvar/
- Mplus: http://www.statmodel.com/

Other Data Sources
- State registries: birth, death, cancer
- Emergency room admissions: acute outcomes

Secondary Data Analysis
- Data that you did not collect yourself
- Both the data and study design are givens
- The statistical analysis is up to you

Dirty Data
- Key-punch errors
- Invalid data
- Missing data
- Mislabeled variables
- Unknown variables

Processing Data
- Recode data
- Label variables
- Format data

Investigation
- Reality checks
- Out-of-range values
- Descriptive statistics
- Ranges: out-of-range or improbable values
- Frequencies: missing values or classes
- Simple graphical display
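The range and frequency checks listed above can be sketched in a few lines. This is a minimal illustration with made-up values; the helper names `out_of_range` and `frequencies` are hypothetical, and the valid age range of 3 to 17 is borrowed from the case study's age group.

```python
from collections import Counter

def out_of_range(values, low, high):
    """Return (index, value) pairs that are non-missing and fall outside [low, high]."""
    return [
        (i, v) for i, v in enumerate(values)
        if v is not None and not (low <= v <= high)
    ]

def frequencies(values):
    """Tabulate every class, counting None as 'missing'."""
    return Counter("missing" if v is None else v for v in values)

# Hypothetical raw columns containing a key-punch error, a negative age,
# a missing value, and an inconsistently coded class.
ages = [3, 17, 12, None, 170, 8, -1]
sex = ["F", "M", "M", None, "f", "F", "M"]

bad_ages = out_of_range(ages, 3, 17)
sex_counts = frequencies(sex)

print("out-of-range ages:", bad_ages)
print("sex frequencies:", dict(sex_counts))
```

The frequency table exposes both the missing value and the stray lowercase class, which would otherwise be treated as a separate category.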

Imputing Missing Values
- Increases available data
- Statistically more complex
- Defensibility
- Useful

Lesson Two
Spend time up-front being sure about your data
- Foundation of sand or stone?
- Crystal clear case definition & recodes
- More time preparing than analyzing
- Prevents problems
- Simplifies analysis

Diagnostics
- Independence
- Homoskedasticity
- Skewness
- Influential observations

Lesson Three
Plan, then execute the plan
- Conform statistical technique to outcome and design
- Diagnostics

Case Study
- Ongoing spatial epidemiology project
- Complex survey
- Cross-sectional
- Data linkage
- Childhood asthma episodes
- Air pollution exposure

Case Study
- Air pollutant: acrolein
- EPA attributes >90% of non-cancer respiratory health effects to acrolein
- No epidemiology to date

National Health Interview Survey
- Health outcome: asthma episode in last 12 months
- 2000 – 2004
- Children 3 – 17 years old
- Parents of ~66,000 kids surveyed
- Nationally representative sample
- Complex survey weighting

National Health Interview Survey
Potential Confounders
- Smoking household
- Acrolein industry household
- Age, sex, race
- Education, income, single-parent family
- Access to care, insurance
- Urban/rural
- Census regional division

National Air Toxics Assessment
- Air pollutant: acrolein
- Strong respiratory irritant
- Cigarette smoke; industrial emissions
- 2002
- Modeled exposure assessment
- Census tracts nationwide

How would you link these two databases?

But this requires access to confidential NHIS data.

NCHS Says
- Orient to data structure and contents
- Locate variables
- Download data
- Append & merge data
- Clean & recode data
- Format & label variables

NHIS Data Processing
- Extract and compile data by year: multiple files; 2004 redesign
- Compile data 2000 – 2004: formatting and variable names a pain
- Identify records with complete data
- Link to NATA: done confidentially by NCHS staff
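The linkage step can be sketched as a keyed lookup. This is only an illustration: it assumes each survey record carries a census tract identifier (the confidential part, which is why the real linkage was done by NCHS staff), and the field names, tract IDs, and exposure values are all made up.

```python
def link_exposure(survey_records, exposure_by_tract):
    """Attach a tract-level exposure estimate to each survey record.

    Returns (linked, unmatched) so that records without an exposure
    estimate are reported rather than silently dropped.
    """
    linked, unmatched = [], []
    for record in survey_records:
        tract = record["tract_id"]
        if tract in exposure_by_tract:
            linked.append({**record, "acrolein": exposure_by_tract[tract]})
        else:
            unmatched.append(record)
    return linked, unmatched

# Hypothetical modeled exposure by census tract (arbitrary units).
exposure_by_tract = {"00000000001": 0.02, "00000000002": 0.11}

# Hypothetical survey records with a placeholder tract identifier.
survey_records = [
    {"child_id": 1, "tract_id": "00000000001", "asthma_episode": 0},
    {"child_id": 2, "tract_id": "00000000002", "asthma_episode": 1},
    {"child_id": 3, "tract_id": "00000000009", "asthma_episode": 0},
]

linked, unmatched = link_exposure(survey_records, exposure_by_tract)
print(f"linked: {len(linked)}, unmatched: {len(unmatched)}")
```

Tract identifiers are kept as strings so that leading zeros survive the merge.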

Analysis Plan
- Hypothesis: "Childhood asthma episodes are associated with census-tract-level estimates of acrolein exposure"
- Descriptive statistics
- Logistic regression
- Complex weighted variance estimation
- SAS-callable SUDAAN

Wisdom
- Network
- Cultivate relationships: front-line staff, principal investigators
