Secondary Data Analysis


UMd lecture November 30, 2010 on secondary data analysis.

Published in: Technology
  • Stage 1: Primary sampling units (PSUs) are selected with probability proportional to a measure of size (PPS). These are mostly single counties or, in a few cases, groups of contiguous counties.
  • Stage 2: The PSUs are divided into segments (generally city blocks or their equivalent). As with the PSUs, sample segments are selected with PPS.
  • Stage 3: Households within each segment are listed, and a sample is randomly drawn. In geographic areas where the proportion of age, ethnic, or income groups selected for oversampling is high, the probability of selection for those groups is greater than in other areas.
  • Stage 4: Individuals are chosen to participate in NHANES from a list of all persons residing in selected households. Individuals are drawn at random within designated age-sex-race/ethnicity screening subdomains. On average, 1.6 persons are selected per household.
  • Secondary Data Analysis

    1. 52-Week Biotech Stock Price
    2. 100 Years of “Emma”
    3. 17 Years of Super Bowl Viewership
    4. What is common among these time series data?
    5. All wrong!
        • All these time series are fabrications
        • All these time series are “random walks”
    6. Welcome to secondary data analysis.
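The point of the opening slides, that a random walk can pass for a real price or viewership series, can be reproduced with a few lines of simulation. This is a minimal sketch and not the lecturer's code: the function name `random_walk`, the starting value, the step size, and the seed are all assumptions made for illustration.

```python
import random

def random_walk(n_steps, start=100.0, step=1.0, seed=None):
    """Return n_steps + 1 values; each move is +step or -step with equal chance."""
    rng = random.Random(seed)
    path = [start]
    for _ in range(n_steps):
        path.append(path[-1] + rng.choice((-step, step)))
    return path

# Hypothetical example: 52 "weekly" moves that may look like a trend
# even though every move is a coin flip.
walk = random_walk(52, seed=2010)
print(" ".join(f"{value:.0f}" for value in walk))
```

Plotting several such series side by side, each with a different seed, is one way to see how easily apparent trends and cycles arise from pure noise.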
    7. Secondary Data Analysis
        • B. Rey de Castro, Sc.D.
        • Guest Researcher, CDC National Center for Health Statistics
        • University of Maryland College Park, School of Public Health
        • FMSC 720 Study Design in MCH Epidemiology
        • November 30, 2010
    8. Secondary Data Analysis
        • Data that you did not collect yourself
        • Both the data and study design are givens
        • The statistical analysis is up to you
    9. Uses for Secondary Data
        • Hypothesis generation/testing
        • Pilot data for grant proposals
        • Expanding knowledge
        • Publications
    10. National Health and Nutrition Examination Survey (NHANES)
        • Population: children and adults nationwide
        • Method: face-to-face interview; physical exams
        • Content: chronic and infectious disease; mental health and cognitive functioning; energy balance; reproductive history and sexual behavior; respiratory disease
        • Data: N ~ 5,000 annually; initiated in the 1960s; annual since 1999
        • Online tutorial
    11. National Health Interview Survey (NHIS)
        • Population: households, families, adults, children nationwide
        • Method: face-to-face interview
        • Content: health conditions and behaviors; access to and use of health services
        • Cancer Control Module (1987, 1992, 2000, 2003, and 2005): energy balance; cancer screening; sun avoidance; tobacco use and control; genetic testing
        • Data: N ~ 40,000 households (~87,000 individuals) annually; initiated in 1957
    12. Other Federal Surveys
        • National Longitudinal Mortality Study
        • National Health Care Survey
        • National Ambulatory Medical Care Survey
        • Medical Expenditure Panel Survey
        • Medicare Current Beneficiary Survey
        • Medicare Health Outcomes Survey
        • National Survey on Drug Use and Health
        • National Survey of Family Growth
    13. Strengths
        • Inexpensive data collection and design costs
        • More statistical power: larger samples
        • Broader geographic area
        • Generalizable to national population
        • Improves understanding of hypothesis
        • Test trends over time
        • Potential for linkage: by person; geographically
    14. Limitations 1
        • Substantial time spent on statistical analysis
        • Cross-sectional
        • Recall bias
        • Mismatch between ideal and feasible hypothesis
        • Mismatch between hypothesis and original purpose
        • Generalizability to small areas impossible
        • Specialized statistical techniques
    15. Limitations 2
        • Quality
        • Validity & reliability
        • Changes to survey over time
        • Poor documentation
        • Restricted/conditional access
        • Confidentiality
    16. Recap
        • Just a few examples of publicly available data
        • Most are cross-sectional
        • All employ a complex sampling design
        • Many use multi-stage sampling
        • Requires special software to analyze (e.g., SUDAAN)
        • Use of weighting, clustering, and stratification
        • Differences in variance estimation methods
    17. Complex Surveys
    18. Statistical Weight
        • The statistical weight of a sampled person is the number of people in the population that the person represents.
        • Weights derived from: selection probabilities; response rates; post-stratification adjustment (e.g., gender, education, income, region)
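How the three ingredients of a statistical weight combine can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the procedure of any particular survey: the function names and all numbers are hypothetical, and the post-stratification step is reduced to a single multiplicative factor.

```python
def base_weight(selection_prob):
    """Base weight: the inverse of the probability of selection."""
    if not 0 < selection_prob <= 1:
        raise ValueError("selection probability must be in (0, 1]")
    return 1.0 / selection_prob

def final_weight(selection_prob, response_rate, poststrat_factor=1.0):
    """Inflate the base weight for nonresponse, then apply a post-stratification factor."""
    if not 0 < response_rate <= 1:
        raise ValueError("response rate must be in (0, 1]")
    return base_weight(selection_prob) / response_rate * poststrat_factor

def weighted_mean(values, weights):
    """Estimate a population mean from sampled values and their weights."""
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)

# Hypothetical person: sampled with probability 1 in 1,000, in a group with an
# 80% response rate, with a post-stratification factor of 1.1.
w = final_weight(selection_prob=0.001, response_rate=0.8, poststrat_factor=1.1)
print(f"This respondent represents about {w:.0f} people")
```

The weighted mean shows why weights matter for estimates: a respondent with a weight of 3 counts three times as much as a respondent with a weight of 1.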
    19. Stratification
        • Population divided before sampling into disjoint, exhaustive groups (strata)
        • Members termed primary sampling units (PSUs)
        • Independent samples are taken in each stratum
        • Strata formed by similar geographic areas
        • e.g., NHANES: partition US counties into 49 strata based on region and economic/racial characteristics
        • Sample 2 counties (PSUs) from each stratum
    20. Clustering
        • Persons residing in a small area may have similar characteristics
        • Thus, responses of subjects in a small area are potentially correlated
        • Correlation must be accounted for in the analysis
        • Survey analysis programs do this through strata/PSU information
    21. Variance Estimation for Surveys
        • Linearization: uses a Taylor series expansion to estimate the variance of non-linear estimators
        • Default method for most programs
        • Requires stratification and PSU information
        • Replication: calculates parameter estimates for each replicate and combines them to estimate variance
        • Jackknife with replicate weights available for SUDAAN, Stata, SAS, and WesVar
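To make the linearization idea concrete, the sketch below estimates the standard error of a weighted mean from stratum and PSU identifiers. It assumes PSUs are treated as sampled with replacement within strata (the usual "ultimate cluster" simplification) and is not the code of SUDAAN or any other package; the record layout, function names, and data are hypothetical.

```python
import math
from collections import defaultdict

def weighted_mean(records):
    """Weighted mean of y using the sampling weights w."""
    total_w = sum(r["w"] for r in records)
    return sum(r["w"] * r["y"] for r in records) / total_w

def linearized_se(records):
    """Taylor-linearized SE of the weighted mean, PSUs nested within strata."""
    total_w = sum(r["w"] for r in records)
    ybar = weighted_mean(records)
    # Sum the linearized values z = w * (y - ybar) / W within each PSU.
    psu_totals = defaultdict(lambda: defaultdict(float))
    for r in records:
        psu_totals[r["stratum"]][r["psu"]] += r["w"] * (r["y"] - ybar) / total_w
    variance = 0.0
    for psus in psu_totals.values():
        n_h = len(psus)
        if n_h < 2:
            raise ValueError("each stratum needs at least two PSUs")
        mean_z = sum(psus.values()) / n_h
        # Between-PSU variability within the stratum.
        variance += n_h / (n_h - 1) * sum((z - mean_z) ** 2 for z in psus.values())
    return math.sqrt(variance)

# Hypothetical data: 2 strata, 2 PSUs per stratum, y = 1 if the outcome is present.
sample = [
    {"stratum": 1, "psu": 1, "w": 1200.0, "y": 1},
    {"stratum": 1, "psu": 1, "w": 1100.0, "y": 0},
    {"stratum": 1, "psu": 2, "w": 900.0, "y": 0},
    {"stratum": 1, "psu": 2, "w": 950.0, "y": 0},
    {"stratum": 2, "psu": 1, "w": 1500.0, "y": 1},
    {"stratum": 2, "psu": 1, "w": 1400.0, "y": 1},
    {"stratum": 2, "psu": 2, "w": 1300.0, "y": 0},
    {"stratum": 2, "psu": 2, "w": 1250.0, "y": 1},
]
print(f"weighted mean = {weighted_mean(sample):.3f}, "
      f"linearized SE = {linearized_se(sample):.3f}")
```

Because the variance is built from differences between PSU totals rather than between individuals, correlation among people in the same small area is reflected in the standard error.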
    22. Replication vs. Linearization
        • If the survey doesn’t have replicate weights, use the full-sample weights and linearization
        • If the survey has replicate weights, use them with the jackknife procedure
        • Most software uses the linearization method
        • Only SUDAAN, Stata, SAS, and WesVar can incorporate replicate weights
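The replication approach can be sketched with a delete-one-PSU jackknife. The assumption here is that replicate weights are built on the fly by dropping one PSU at a time and inflating the remaining PSUs in the same stratum; when a survey file already ships replicate weights, those would be used instead. Function names, record layout, and data are hypothetical.

```python
import math
from collections import defaultdict

def weighted_mean(pairs):
    """Weighted mean from (weight, value) pairs."""
    pairs = list(pairs)
    return sum(w * y for w, y in pairs) / sum(w for w, _ in pairs)

def jackknife_se(records):
    """Delete-one-PSU jackknife SE of the weighted mean for a stratified design."""
    theta = weighted_mean((r["w"], r["y"]) for r in records)
    psus_by_stratum = defaultdict(set)
    for r in records:
        psus_by_stratum[r["stratum"]].add(r["psu"])
    variance = 0.0
    for stratum, psus in psus_by_stratum.items():
        n_h = len(psus)
        if n_h < 2:
            raise ValueError("each stratum needs at least two PSUs")
        for dropped in psus:
            replicate = []
            for r in records:
                if r["stratum"] != stratum:
                    replicate.append((r["w"], r["y"]))  # other strata unchanged
                elif r["psu"] != dropped:
                    # Remaining PSUs in this stratum stand in for the dropped one.
                    replicate.append((r["w"] * n_h / (n_h - 1), r["y"]))
            theta_rep = weighted_mean(replicate)
            variance += (n_h - 1) / n_h * (theta_rep - theta) ** 2
    return math.sqrt(variance)

# Hypothetical data: 2 strata, 2 PSUs per stratum.
sample = [
    {"stratum": 1, "psu": 1, "w": 1200.0, "y": 1},
    {"stratum": 1, "psu": 1, "w": 1100.0, "y": 0},
    {"stratum": 1, "psu": 2, "w": 900.0, "y": 0},
    {"stratum": 1, "psu": 2, "w": 950.0, "y": 0},
    {"stratum": 2, "psu": 1, "w": 1500.0, "y": 1},
    {"stratum": 2, "psu": 1, "w": 1400.0, "y": 1},
    {"stratum": 2, "psu": 2, "w": 1300.0, "y": 0},
    {"stratum": 2, "psu": 2, "w": 1250.0, "y": 1},
]
print(f"jackknife SE = {jackknife_se(sample):.3f}")
```

Each replicate re-estimates the statistic with one PSU removed; the spread of the replicate estimates around the full-sample estimate gives the variance.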
    23. Complex Survey Design
        • Correct variance estimates
        • Proper hypothesis testing
        • Standard errors will tend to be larger
        • Less likely to make Type I error
    24. Statistical Software for Analyzing Health Surveys
        • Specifically designed for analyzing data from complex sampling designs: SUDAAN, WesVar
        • Others that can be used: SAS, Stata, SPSS, Mplus
    25. Data/Research Resources
        • Univ. of Michigan Consortium for social research
        • UCLA Statistical Computing
        • BRFSS Maps
        • State Cancer Profiles
    26. References
        • Korn, E.L. and Graubard, B.I. (1999). Analysis of Health Surveys. New York: John Wiley.
        • State Cancer Profiles
        • SUDAAN
        • SAS
        • SPSS
        • Stata
        • WesVar
        • Mplus
    27. Other Data Sources
        • State registries: birth, death, cancer
        • Emergency room admissions: acute outcomes
    28. Intermission
    29. Secondary Data Analysis
        • Data that you did not collect yourself
        • Both the data and study design are givens
        • The statistical analysis is up to you
    30. Lesson One
        • Integrity
    31. Dirty Data
        • Key-punch errors
        • Invalid data
        • Missing data
        • Mislabeled variables
        • Unknown variables
    32. Preparing Data
    33. Processing Data
        • Recode data
        • Label variables
        • Format data
    34. Investigation
        • Reality checks
        • Out-of-range values
        • Descriptive statistics
        • Ranges: out-of-range or improbable values
        • Frequencies: missing values or classes
        • Simple graphical display
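The reality checks listed above (ranges, frequencies, missing values) can be sketched as three small helpers. The variable names, plausibility limits, and records are hypothetical; real limits should come from the survey documentation.

```python
from collections import Counter

# Hypothetical plausibility limits for illustration only.
VALID_RANGES = {"age_years": (3, 17), "height_cm": (80, 210)}

def out_of_range(records, valid_ranges):
    """Return (row index, variable, value) for each non-missing value outside its range."""
    problems = []
    for i, rec in enumerate(records):
        for var, (low, high) in valid_ranges.items():
            value = rec.get(var)
            if value is not None and not (low <= value <= high):
                problems.append((i, var, value))
    return problems

def frequencies(records, var):
    """Count each class of a variable; missing values are tallied under None."""
    return Counter(rec.get(var) for rec in records)

def describe(records, var):
    """Basic descriptive statistics for a numeric variable, ignoring missing values."""
    values = [rec[var] for rec in records if rec.get(var) is not None]
    return {
        "n": len(values),
        "missing": len(records) - len(values),
        "min": min(values),
        "max": max(values),
        "mean": sum(values) / len(values),
    }

records = [
    {"age_years": 5, "height_cm": 110, "sex": "F"},
    {"age_years": 12, "height_cm": 15, "sex": "M"},      # possible key-punch error
    {"age_years": 170, "height_cm": 150, "sex": None},   # improbable age, missing sex
    {"age_years": None, "height_cm": 140, "sex": "m"},   # unexpected class "m"
]
print(out_of_range(records, VALID_RANGES))
print(frequencies(records, "sex"))
print(describe(records, "age_years"))
```

The frequency table surfaces both missing values and unexpected classes (here a lowercase "m"), which range checks alone would not catch.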
    35. Normal Ranges
    36. Imputing Missing Values
        • Increases available data
        • Statistically more complex
        • Defensibility
        • Useful
    37. Lesson Two
        • Spend time up-front being sure about your data
        • Foundation of sand or stone?
        • Crystal clear case definition & recodes
        • More time preparing than analyzing
        • Prevents problems
        • Simplifies analysis
    38. Statistical Analysis Plan
    39. Outcome
    40. Design
    41. Clustered Data
    42. Longitudinal
    43. Hierarchical
    44. Diagnostics
        • Independence
        • Homoskedasticity
        • Skewness
        • Influential observations
    45. Lesson Three
        • Plan, then execute the plan
        • Conform statistical technique to outcome and design
        • Diagnostics
    46. Case Study
        • Ongoing spatial epidemiology project
        • Complex survey
        • Cross-sectional
        • Data linkage
        • Childhood asthma episodes
        • Air pollution exposure
    47. Case Study
        • Air pollutant: acrolein
        • EPA attributes >90% of non-cancer respiratory health effects to acrolein
        • No epidemiology to date
    48. Data Linkage
    49. National Health Interview Survey
        • Health outcome: asthma episode in last 12 months
        • 2000 – 2004
        • Children 3 – 17 years old
        • Parents of ~66,000 kids surveyed
        • Nationally representative sample
        • Complex survey weighting
    50. National Health Interview Survey
        • Potential confounders
        • Smoking household
        • Acrolein industry household
        • Age, sex, race
        • Education, income, single-parent family
        • Access to care, insurance
        • Urban/rural
        • Census regional division
    51. National Air Toxics Assessment
        • Air pollutant: acrolein
        • Strong respiratory irritant
        • Cigarette smoke; industrial emissions
        • 2002
        • Modeled exposure assessment
        • Census tracts nationwide
    52. How would you link these two databases?
    53. Geographic Linkage
    54. But requires access to confidential NHIS data.
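A geographic linkage of this kind amounts to a lookup of each survey record's census tract in the tract-level exposure table. The sketch below assumes each record carries a tract identifier, which is the confidential piece; the field names, tract codes, and exposure values are hypothetical.

```python
def link_by_tract(survey_records, tract_exposure):
    """Attach tract-level exposure to each survey record; report records with no match."""
    linked, unmatched = [], []
    for rec in survey_records:
        exposure = tract_exposure.get(rec["tract_id"])
        if exposure is None:
            unmatched.append(rec)
        else:
            linked.append({**rec, "acrolein": exposure})
    return linked, unmatched

# Hypothetical modeled exposure by census tract. Tract codes are kept as
# strings so that leading zeros are not lost.
tract_exposure = {
    "24033800100": 0.12,
    "24033800200": 0.05,
    "06037100100": 0.31,
}

# Hypothetical survey records with a tract identifier attached.
survey_records = [
    {"child_id": 1, "tract_id": "24033800100", "asthma_episode": 1},
    {"child_id": 2, "tract_id": "06037100100", "asthma_episode": 0},
    {"child_id": 3, "tract_id": "99999999999", "asthma_episode": 0},  # no exposure estimate
]

linked, unmatched = link_by_tract(survey_records, tract_exposure)
print(f"{len(linked)} linked, {len(unmatched)} unmatched")
```

Keeping the unmatched records in a separate list makes it easy to check how many children drop out of the analysis because their tract has no exposure estimate.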
    55. NCHS Says
        • Orient to data structure and contents
        • Locate variables
        • Download data
        • Append & merge data
        • Clean & recode data
        • Format & label variables
    56. NHIS Data Processing
        • Extract and compile data by year
        • Multiple files
        • 2004 redesign
        • Compile data 2000 – 2004
        • Formatting and variable names a pain
        • Identify records with complete data
        • Link to NATA
        • Done confidentially by NCHS staff
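The compile-by-year steps above can be sketched as: rename each year's variables to a common set, append the years, then keep complete records. The variable names and rename maps below are hypothetical placeholders, not actual NHIS variable names; the real names for each year have to be looked up in the survey documentation.

```python
# Hypothetical per-year rename maps (old name -> common name).
RENAME_BY_YEAR = {
    2003: {"ASTHMA_EP": "asthma_episode", "AGE": "age"},
    2004: {"AST_EPI": "asthma_episode", "AGE_P": "age"},
}

def harmonize(record, year):
    """Rename one record's variables to the common names and tag it with its year."""
    out = {new: record.get(old) for old, new in RENAME_BY_YEAR[year].items()}
    out["year"] = year
    return out

def compile_years(files_by_year):
    """Append the harmonized records from every year into one list."""
    combined = []
    for year in sorted(files_by_year):
        combined.extend(harmonize(rec, year) for rec in files_by_year[year])
    return combined

def complete_cases(records, required):
    """Keep only records with non-missing values for every required variable."""
    return [r for r in records if all(r.get(v) is not None for v in required)]

# Hypothetical extracts, one list of records per survey year.
files_by_year = {
    2003: [{"ASTHMA_EP": 1, "AGE": 9}, {"ASTHMA_EP": None, "AGE": 14}],
    2004: [{"AST_EPI": 0, "AGE_P": 6}],
}

combined = compile_years(files_by_year)
analytic = complete_cases(combined, ["asthma_episode", "age"])
print(f"{len(combined)} records compiled, {len(analytic)} complete")
```

Keeping the rename maps in one place documents every recode decision, which helps when a redesign changes names partway through the period.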
    57. Analysis Plan
        • Hypothesis: “Childhood asthma episodes are associated with census-tract-level estimates of acrolein exposure”
        • Descriptive statistics
        • Logistic regression
        • Complex weighted variance estimation
        • SAS-callable SUDAAN
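The point-estimate half of a survey-weighted logistic regression can be sketched with a small Newton-Raphson fit. This is not SUDAAN and not the study's analysis: it assumes a single covariate, uses hypothetical data, and produces coefficients only. Design-based standard errors would still require linearization or replication using the strata and PSU information.

```python
import math

def weighted_logistic(x, y, w, n_iter=25, tol=1e-10):
    """Fit logit P(y=1) = b0 + b1*x by Newton-Raphson on the weighted log-likelihood."""
    b0 = b1 = 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi, wi in zip(x, y, w):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            resid = wi * (yi - p)        # weighted score contribution
            info = wi * p * (1.0 - p)    # weighted information contribution
            g0 += resid
            g1 += resid * xi
            h00 += info
            h01 += info * xi
            h11 += info * xi * xi
        det = h00 * h11 - h01 * h01
        # Solve the 2x2 system (information matrix) * step = score.
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0 += step0
        b1 += step1
        if abs(step0) + abs(step1) < tol:
            break
    return b0, b1

# Hypothetical data: x = 1 for high exposure, y = 1 for an asthma episode,
# w = sampling weight of each respondent.
x = [0, 0, 1, 1]
y = [0, 1, 0, 1]
w = [300.0, 100.0, 200.0, 200.0]

b0, b1 = weighted_logistic(x, y, w)
print(f"weighted odds ratio = {math.exp(b1):.2f}")
```

With a single binary exposure the fitted odds ratio equals the odds ratio of the weighted 2x2 table, which gives a quick check on the fit.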
    58. Wisdom
        • Network
        • Cultivate relationships
        • Front-line staff
        • Principal investigators
    59. Wisdom
        • No one cares more about your problem than you
        • Or, you should
    60. Wisdom
        • Teach yourself
        • Learn to learn
    61. Contact
        • B. Rey de Castro, Sc.D.