ICPSR Data Sharing


This is Part II of a workshop presented by ICPSR at IASSIST 2011. This section focuses on data sharing of publicly available data.

Published in: Education, Technology
  • DOJ, BJS, SAMHSA, National Institute on Aging; Robert Wood Johnson, Fenway Institute; NCAA. International Archive of Education Data taken down: several years ago ICPSR discontinued support for the International Archive of Education Data (IAED) Web site. The site was sponsored through a contract with the National Center for Education Statistics, which was not renewed. All ICPSR data collections included in this site are still available to the membership (via the ICPSR Web site), and we will continue to respond to user questions about these data.
  • Why might you organize your collections in this way?
  • Note – image does not capture all projects. Our data collections tend to have a unique purpose and sometimes unique ways of serving their particular audiences. They share a common hub, ICPSR, whose primary benefit is a common approach to processing, preserving, and, very importantly, surrounding data with metadata, including citations, the SSVD, and documentation. What is important to note is that as you consider how you might serve your data community and the particular audiences within it, you need to think not only about processing and tagging data in a common way, but also about how you will staff your collection and what your outreach will look like. Let’s take a look at what this means...
  • If you as an archive or collection desire to assist non-researchers in this way, remember you will need staffing strategies that will provide hands-on service/training to novice data users (staff who not only understand the data, but who are patient instructors on the use of data among analysts that may not be well-trained). Dedicated User Support!
  • DSDR Partners: Eunice Kennedy Shriver National Institute of Child Health and Human Development (NICHD) Demographic & Behavioral Sciences (DBS) Branch; Carolina Population Center (CPC); Hopkins Population Center (HPC); Michigan Population Studies Center (PSC); Minnesota Population Center (MPC); RAND Population Research Center. Resource Explanation: DSDR provides resources to demographic data producers and users, including confidentiality and disclosure review, restricted data contract development and data dissemination, a searchable index of important demography and population study data, and a catalogue of publications using the indexed data.
  • Research Connections is funded by the Office of Child Care and the Office of Planning, Research and Evaluation, Administration for Children and Families, in the U.S. Department of Health and Human Services. In addition to data, RC provides reports (tagged with metadata), research opportunities (grant announcements), recurring training opportunities, and announcements relevant to this particular research and policy community. “Community Engagement” is RC’s goal, and staff for this project must actively engage the community to meet it.
  • This collection has a specific goal in sharing its data – policy development. It is another collection whose core audience includes individuals who are not research scientists (including administrators and the media) and who will require more instruction on data use.
  • This is a foreshadowing (teaser!) of the last part of this workshop – Data Management – where we’ll provide insights on data sharing in secure environments and administering a growing number of restricted data contracts. Let’s take a look at some of these collections.
  • Data preparation and processing are largely handled offsite. ICPSR serves as the web host, and Fenway uses our processes and infrastructure as well as our distribution capabilities, including the ability to share its data with over 700 member institutions.
  • An initiative of the Office of Applied Studies, Substance Abuse and Mental Health Services Administration (SAMHSA), of the United States Department of Health and Human Services. The audience here prompted the desire to upgrade our online analysis approach from a simple data exploration tool to one that could handle more advanced statistical analysis, like that of SPSS or SAS. It also prompted the development of an online tool that prohibits display when a sample size is too small and might, for example, lead to a greater level of disclosure risk. This is known as Secure SDA. Currently, this collection is leading our efforts to develop the VDE, the virtual data enclave (more on that soon!), whereby a research scientist can analyze sensitive data virtually rather than, for example, coming physically to our enclave.
  • National Institute on Drug Abuse. Another automated system, the RCS (another topic to be discussed shortly), has become necessary because the data in this archive, and much coming in via DSDR, are restricted. That is, a researcher must submit his/her research team composition, get IRB approvals, provide data security plans, etc. before we can release these data to the researcher, either within the VDE or via removable media. To date, the administration of our restricted contracts, which require periodic tracking of the research team up until the data have been destroyed, has been manual. With the increase in the volume of contracts, an automated system needed to be developed.
  • IFSS offers data and tools for examining issues related to families and fertility in the United States spanning five decades. IFSS encompasses the Growth of American Families (GAF), National Fertility Surveys (NFS), and National Surveys of Family Growth (NSFG), as well as a single dataset of harmonized variables across all ten surveys. Analytic tools make it possible to quickly and easily explore the data and obtain information about changes in behaviors and attitudes across time. Funder: the Eunice Kennedy Shriver National Institute of Child Health and Human Development (NICHD).
  • ICPSR Data Sharing

    1. 1. ICPSR AT 50: Facilitating Research and Data Sharing<br />Part II: Data Sharing<br />IASSIST Vancouver, BC<br />May 31, 2011<br />
    2. 2. “Public” Data Sharing<br />begins at 10:45<br />
    3. 3. ICPSR’s Public Data<br />Sharing Public Data - Agenda<br />2010 US Census<br />ICPSR’s “Public” Archives<br />
    4. 4. DISSEMINATION OF DATA - MICRODATA<br />From the Office of Management and Budget (OMB) Policy Directive published in the Federal Register, Vol. 73, No. 46, Friday, March 7, 2008, Notices, pp. 12622-12626:<br /> “When appropriate to facilitate in-depth research, and feasible in the presence of resource constraints, statistical agencies should provide public access to microdata files with secure safeguards to protect the confidentiality of individually-identifiable responses and with readily accessible documentation, metadata, or other means to facilitate user access to and manipulation of the data.”<br />
    5. 5.
    6. 6. U.S. CENSUS DATA – 2010: KEY DATES<br />National Census Day: 1 April 2010<br />April - July 2010: Census takers visit households that did not return a form by mail<br />December 2010: By law, the Census Bureau delivers population information to the President for apportionment<br />March 2011: By law, the Census Bureau completes delivery of redistricting data to states<br />
    7. 7.
    8. 8. U.S. CENSUS DATA – 2010: DISSEMINATION OF RESULTS <br /> American FactFinder (AFF) is an online source for population, housing, economic and geographic data that presents the results from four key data programs:<br />Decennial Census of Housing and Population - 1990 and 2000<br />Economic Census - 1997, 2002, and 2007<br />American Community Survey - 1-Year Estimates and 3-Year Estimates<br />Population Estimates Program - July 1, 2006 to July 1, 2009<br />Results from each of these data programs are provided in the form of data sets, tables, thematic maps, and reference maps. <br />
    12. 12.
    13. 13.
    14. 14. U.S. CENSUS DATA – 2010: DISSEMINATION OF DATA <br />Direct File Access through Download FTP Center at Census Bureau<br />Free Access to all PUBLIC-USE DATA FILES<br />First Release of Data (February – March 2011) <br />2010 Census Redistricting Data Summary File (P.L. 94-171):<br />State and sub-state population counts to the block level for the total population and the population 18 years and over for 63 race groups; and not Hispanic or Latino origin by 63 race groups<br />State and sub-state housing unit counts down to the block level by occupancy status (occupied units, vacant units)<br />Quickly followed by (April 2011):<br />National Summary File of Redistricting Data: Contains the same data tables as the state files, but the geographic levels include the U.S., regions, divisions, other areas that cross state boundaries, and a small subset of the geographic areas shown in the state files.<br />
    15. 15. U.S. CENSUS DATA – 2010: DISSEMINATION OF DATA <br />SUMMARY FILE 1 (SF 1):<br /> This file shows detailed tables on age, sex, households, families, relationship to householder, housing units, detailed race and Hispanic or Latino origin groups, and group quarters. Most tables are shown down to the block or census tract level. Some tables are repeated for nine race/Hispanic or Latino origin groups. The nine groups are (1) White alone, (2) Black or African American alone, (3) American Indian and Alaska Native alone, (4) Asian alone, (5) Native Hawaiian and Other Pacific Islander alone, (6) Some Other Race alone, (7) Two or More Races, (8) Hispanic or Latino; (9) White alone, Not Hispanic or Latino. (Release: June-August 2011)<br />
    16. 16. U.S. CENSUS DATA – 2010: DISSEMINATION OF DATA <br />SUMMARY FILE 1 (SF 1 CONTINUED):<br />The SF 1 National Update File contains the same data tables as the state files, but the geographic levels include the U.S., regions, divisions, and other areas that cross state boundaries. (Release: November 2011) <br />The SF 1 Urban/Rural Update File provides users with urban/rural population and housing unit counts (down to block) and characteristics for urbanized areas and urban clusters. (Release: October 2012)<br />The SF 1 Redefined Core Based Statistical Areas Update File contains the same data tables as the state files for redefined CBSAs as defined by OMB following the 2010 Census. (Release: August 2013)<br />
    17. 17. U.S. CENSUS DATA – 2010: DISSEMINATION OF DATA <br />SUMMARY FILE 2 (SF 2):<br /> This file shows detailed tables on age, sex, households, families, relationship to householder, housing units, and group quarters. Most tables are shown down to the census tract level. Tables are repeated by 141 race groups, 98 American Indian and Alaska Native tribes/tribal groupings, and 39 Hispanic or Latino origin groups. In order for any of the tables for a specific group to be shown in SF 2, the data must meet a minimum population threshold. The tables in SF 2 will be repeated for each group if there are 100 or more people of that specific group in a particular geographic area. (Release: December 2011-April 2012)<br />
    18. 18. U.S. CENSUS DATA – 2010: DISSEMINATION OF DATA <br />SUMMARY FILE 2 (SF 2 CONTINUED):<br />The SF 2 National Update File contains the same data tables as the state files, but the geographic levels include the U.S., regions, divisions, and other areas that cross state boundaries. (Release: May 2012)<br />The SF 2 Urban/Rural Update File provides users with urban/rural population and housing unit counts (down to census tract) and characteristics for urbanized areas and urban clusters. (Release: January 2013)<br />
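The minimum population threshold described for SF 2 can be read as a simple filter over group counts within one geographic area. A minimal sketch in Python follows; only the threshold of 100 comes from the slide. The function name, the group labels, and the counts are hypothetical and are not part of any Census Bureau tool.

```python
SF2_MIN_GROUP_POPULATION = 100  # threshold stated on the SF 2 slide


def groups_with_tables(counts_by_group, minimum=SF2_MIN_GROUP_POPULATION):
    """Return the groups whose tables would be repeated for one geographic area."""
    return sorted(
        group for group, count in counts_by_group.items() if count >= minimum
    )


# Hypothetical counts for a single census tract (not real data).
tract_counts = {"Group A": 250, "Group B": 99, "Group C": 100}
print(groups_with_tables(tract_counts))
```

A group with exactly 100 people meets the threshold, while a group with 99 does not, so its tables would be absent for that area.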
    19. 19. U.S. CENSUS DATA – 2010: DISSEMINATION OF DATA <br />Congressional District Summary File – This file is a re-tabulation of Summary File 1 for newly redistricted Congressional Districts for the 113th Congress. State-based files will be released in January 2013 and every 2 years thereafter for states where congressional redistricting occurs.<br />State Legislative District Summary File – This file is a re-tabulation of Summary File 1 for State Legislative Districts drawn following the 2010 Census. State-based files will be released in June 2013 and every 2 years thereafter for states where legislative redistricting occurs.<br />
    20. 20. U.S. CENSUS DATA – 2010: DISSEMINATION OF DATA <br />American Indian and Alaska Native (AIAN) Summary File – This is a national-level file showing the same content as Summary File 2. Tables are repeated for the total population, the total AIAN population, the total American Indian population, the total Alaska Native population, and for numerous American Indian and Alaska Native tribes. In order for any of the tables for a specific group to be shown, the data must meet a minimum population threshold of at least 100 or more people of that specific group in a particular geographic area. (Release: April 2013)<br />
    21. 21. U.S. CENSUS DATA – 2010: DISSEMINATION OF DATA <br />Public Use Microdata Sample (PUMS) Files – The PUMS files contain state-level 2010 Census data containing individual records of characteristics for a 10 percent sample of people and housing units. Data will be included for age, sex, race, Hispanic or Latino origin, household type and relationship, and tenure data with identifying information removed, for PUMAs of 100,000 or more population. (Release: TBD) <br />Of lesser importance than 2000?<br />
    22. 22. Decennial Census <br />In Census 2000, the census used 2 forms<br />“short” form – asked for basic demographic and housing information, such as age, sex, race, how many people lived in the housing unit, and if the housing unit was owned or rented by the resident<br />“long” form – collected the same information as the short form but also collected more in-depth information such as income, education, and language spoken at home<br />Only a small portion of the population, called a sample, received the long form.<br />
    23. 23. 2010 Census and American Community Survey<br />2010 Census will focus on counting the U.S. population<br />The sample data are now collected in the ACS<br />Puerto Rico is the only U.S. territory where the ACS is conducted<br />2010 Census will have a long form for U.S. territories such as Guam and U.S. Virgin Islands<br />Same “short form” questions on the ACS<br />
    American Community Survey 2008 Content Changes<br />Three new questions:<br />Health Insurance Coverage<br />Veteran’s Service-connected Disability<br />Marital History<br />Deletion of one question:<br />Time and main reason for staying at the address<br />Changes in some wording and format<br />
    American Community Survey Methodology<br />Sample includes about 3 million addresses each year<br />Three modes of data collection:<br />mail<br />phone<br />personal visit<br />Data are collected continuously throughout the year<br />
    American Community Survey Target Population<br />Resident population of the United States and Puerto Rico<br />Living in housing units and group quarters<br />Current residents at the selected address<br />“Two month” rule<br />
    29. 29. American Community Survey Group Quarters <br />Place where people live or stay that is normally owned or managed by an entity or organization providing housing or services for the residents.<br />2 categories of group quarters:<br />Institutional<br />Non-institutional<br />
    31. 31. American Community Survey Period Estimates<br />ACS estimates are period estimates, describing the average characteristics over a specified period<br />Contrast with point-in-time estimates that describe the characteristics of an area on a specific date<br />1-year, 3-year, and 5-year estimates will be released for geographic areas that meet specific population thresholds<br />
    American Community Survey Data Products Release Schedule<br />* Five-year estimates will be available for areas as small as census tracts and block groups.<br />Source: US Census Bureau<br />
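The population thresholds behind the 1-year, 3-year, and 5-year estimates can be sketched as a small lookup. The slides do not state the numbers. The values below (65,000 or more for 1-year estimates, 20,000 or more for 3-year estimates, all areas for 5-year estimates) are the thresholds the Census Bureau published for the ACS at the time and are included here as an assumption. The function name is hypothetical.

```python
# Thresholds are not on the slides; they are assumed from Census Bureau
# guidance of the period.
ONE_YEAR_MIN_POPULATION = 65_000
THREE_YEAR_MIN_POPULATION = 20_000


def available_estimates(population):
    """Return which ACS period estimates an area of this size would receive."""
    products = []
    if population >= ONE_YEAR_MIN_POPULATION:
        products.append("1-year")
    if population >= THREE_YEAR_MIN_POPULATION:
        products.append("3-year")
    # Five-year estimates cover all areas, down to tracts and block groups.
    products.append("5-year")
    return products


for size in (1_500, 25_000, 70_000):
    print(size, available_estimates(size))
```

Under these assumptions a census tract of 1,500 people appears only in the 5-year product, which is why small-area users depend on the longest period.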
    34. 34. American Community Survey Data Products<br />Profiles<br />Data Profiles<br />Narrative Profiles<br />Comparison Profiles<br />Selected Population Profiles<br />Tables<br />Detailed Tables<br />Subject Tables<br />Ranking Tables<br />Geographic Comparison Tables<br />Thematic Maps<br />Public Use Microdata Sample (PUMS) Files<br />
    35. 35. American Community Survey Similarities with Census 2000<br />Same questions and many of the same basic statistics<br />5-year estimates will be produced for the same broad set of geographic areas including census tracts and block groups<br />
    American Community Survey Key Differences from Census 2000<br />Beginning in 2010, data for small geographic areas will be produced every year versus once every 10 years<br />Data for larger areas are available now and data for mid-sized areas will be available in December 2008<br />Census 2000 data described the population and housing as of April 1, 2000 while ACS data describe a period of time and require data for 12 months, 36 months, or 60 months<br />
    American Community Survey Key Differences from Census 2000<br />The goal of ACS is to produce data comparable to the Census 2000 long form data<br />These estimates will cover the same small areas as Census 2000 but with smaller sample sizes<br />Smaller sample sizes for 5-year ACS estimates result in reductions in the reliability of estimates<br />
    Cooperative Agreements<br />Close collaboration with the Bureau over the years in making data available to the academic research community.<br />Since the 1980s ICPSR has sought outside funding to deal with Census data and entered into joint statistical agreements with the Bureau to facilitate its distribution and use.<br />Importance in 1990: High cost of raw data ($175 per reel of tape; the entire Census comprised about 2,000 tapes, or roughly $350,000).<br />
    41. 41. Cooperative Agreements<br />Data available at no cost to member institutions without any rights to redistribute or resell.<br />Joint annual summer workshops to offer training on the new Census data products.<br />One-week training sessions held in 1991-1994 and 2001-2004<br />Census Bureau staff participated extensively in these courses<br />Attracted both researchers and ICPSR Official Representatives who attended to learn how to provide assistance to faculty and students on their campuses<br />
    42. 42. The Decennial In(di)gestion<br />Census Data: Collected regularly since the 1960s.<br />Number of files and bytes have grown exponentially with every new Census.<br />Main reason for the rapid growth in the numbers of data files archived and disseminated by ICPSR.<br />How much and how rapid?<br />
    43. 43. The Decennial In(di)gestion<br />
    44. 44. U.S. CENSUS DATA: DISSEMINATION OF DATA AND ICPSR<br />Another access point, focused on the social science research community, to Census data and documentation<br />Original Census data available from the 1960s onward as well as special samples created for earlier years<br />TIGER Line Files<br />American Community Survey <br />Many of the newer files are available in a variety of formats:<br />SAS<br />SPSS<br />Stata<br />Ascii text files<br />Tab-delimited <br />
    45. 45. Special Census Subsets<br />These files report population and housing data for national and specific sub-national geographical entities, for example: <br />The entire nation <br />Each individual state<br />Counties<br />Metropolitan Areas<br />Places <br />Census Tracts<br />
    46. 46. Contextual File<br />Based largely on Census data<br />Provides information at the ‘county’ level in the U.S. (subunits of states numbering more than 3,100 in all)<br />Contains data from other government and private sources at the same geographic level<br />Under certain circumstances, can be merged with survey data <br />
    47. 47. Contextual File - 2<br />Population by age, sex, race, and Hispanic origin<br />Labor force size and unemployment<br />Personal income<br />Earnings and employment by industry<br />Land surface form topography<br />Climate<br />Government revenue and expenditures<br />Crimes reported to police<br />Presidential election results <br />Housing authorized by building permits<br />Medicare enrollment<br />Health profession shortage areas<br />
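Merging the county-level Contextual File with survey data amounts to a join on a county identifier. A minimal sketch in Python follows, assuming the survey records carry a county code. The field names (`county_fips`, `unemployment_rate`, `respondent_id`) and all values are hypothetical and do not reflect the actual file layout.

```python
def attach_county_context(respondents, county_context, key="county_fips"):
    """Left-join county-level context onto survey records by county code."""
    merged = []
    for record in respondents:
        # Counties missing from the contextual data contribute no extra fields.
        context = county_context.get(record[key], {})
        # Survey fields take precedence if a field name collides.
        merged.append({**context, **record})
    return merged


# Hypothetical contextual rows keyed by county code (not real data).
context_by_county = {
    "26161": {"unemployment_rate": 5.1},
    "26163": {"unemployment_rate": 9.8},
}
survey = [
    {"respondent_id": 1, "county_fips": "26161"},
    {"respondent_id": 2, "county_fips": "99999"},
]
print(attach_county_context(survey, context_by_county))
```

Every respondent is kept whether or not a matching county is found, so the merge never silently drops survey cases.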
    48. 48. Preservation<br />ICPSR provides another location to preserve data and documentation files produced by the Census Bureau<br />ICPSR keeps multiple copies of these files both at its home location at the University of Michigan and at other sites in the United States<br />Copies are continually checked and updated when necessary<br />Considerable interest in historical Census data by demographers, historians, and economists.<br />
    49. 49. Current Happenings with ACS and Plans for Census 2010<br />Consulted with Collection Development Committee of ICPSR Council:<br />Advised to continue ICPSR precedent of acquiring Census 2010 since the membership and the research community in general have traditionally come to ICPSR for their Census data needs. <br />Suggestion that the data files need not be archived right away since all public-use data will be available directly from the Census Bureau.<br />Emphases should center on archiving the most important Census data products when it could be best determined that final versions were created. <br />The Committee also suggested that ICPSR consider holding training workshops on Census data once again as they did during the last decade and decide how best to finance them within the context of the Summer Program.<br />
    50. 50. Current Happenings with ACS and Plans for Census 2010<br />Suggestion to study possibility that SDA functionality might work to produce subsets for Census data instead of creating specific data products to do so. <br />Emphasis placed on partnerships and as an example working with the University of Minnesota Population Center and their National Historical Geographic Information System (NHGIS) which is expected to be able to produce subsets of 2010 Census data.<br />Determine in general from membership and user community what value-added features might make sense for academic researchers as greater amounts of Census 2010 data become available.<br />
    51. 51. Current Happenings with ACS and Plans for Census 2010<br />Select files archived at ICPSR beginning with 1996 ACS:<br />Emphasis on PUMS files at first<br />Greater interest in Summary Files as more data is released and, in particular, with the recent appearance of the first 5-year Estimates File covering calendar years 2005-2009<br />
    52. 52. Current Happenings with ACS and Plans for Census 2010<br />TIGER files (Topologically Integrated Geographic Encoding and Referencing System)<br />2010 extracts containing geographic and cartographic information from the Census Bureau's MAF/TIGER® (Master Address File/Topologically Integrated Geographic Encoding and Referencing) database. <br />These files support the 2010 Census Redistricting Data (P. L. 94-171) and the National Summary File of Redistricting Data/Summary File 1 releases. <br />The files provide the digital map base for a Geographic Information System or mapping software. The files do not contain any mapping software. <br />
    53. 53. Current Happenings with ACS and Plans for Census 2010<br />TIGER files (Topologically Integrated Geographic Encoding and Referencing System)<br />All legal boundaries and names are as of January 1, 2010. The boundaries shown are for Census Bureau statistical data collection and tabulation purposes only; their depiction and designation for statistical purposes does not constitute a determination of jurisdictional authority or rights of ownership or entitlement. <br />The geographic entity codes needed to link the Census Bureau's demographic data to the geography are included in the files. The TIGER/Line Shapefiles do not contain any demographic or economic data; data can be downloaded separately using American FactFinder. <br />
    54. 54. Current Happenings with ACS and Plans for Census 2010<br />TIGER files (Topologically Integrated Geographic Encoding and Referencing System)<br />Differences between shape files and line files<br />Data stored at ICPSR through designated Web site<br />Maintain archival copies as older versions of TIGER files cease to be distributed by Census Bureau<br />http://www.icpsr.umich.edu/TIGER/index.html<br />
    55. 55. ICPSR’s Public Archives<br />
    56. 56. ICPSR’s Public Archives<br />Three Differentiating Characteristics of a “Public Archive”<br />Funding Sources<br />Access<br />Search<br />
    57. 57. Funding Sources & Long Term Access<br />ICPSR’s public archives are funded by entities including:<br />Government agencies<br />Foundations<br />Other Organizations<br />And if the funding ceases:<br />ICPSR commitment to support access<br />Access generally reverts to membership-only after some time period<br />
    58. 58. Why are Funders using ICPSR? <br />An Archive’s Reasons for Being<br />Dissemination Infrastructure<br />Systems & Search = technology, security, & metadata <br />Data Community Base (700 immediate members to share with)<br />Community Outreach/engagement expertise<br />Preservation<br />Fulfillment of Data Management Plan (Grant) Requirements<br />Ability to Measure & Report Dissemination Statistics<br />
    59. 59. Data Search within our Public Archives<br />A search for data/documents from within a public archive defaults to searches of materials (data) within that archive<br />A strategy to help one narrow their scope<br />All materials are publicly available<br />
    60. 60. The Relationship Visual<br />A common hub, yet each unique <br />
    61. 61. NACJD: National Archive of Criminal Justice Data<br />Study topic: criminal justice<br />Funders: BJS, OJJDP, NIJ<br />Unique attribute: staff routinely assist non-researchers (police departments) in data use<br />
    62. 62. DSDR: Data Sharing for Demographic Research<br />Study topic: demography<br />Partnership of several institutions<br />Unique attribute: as much a resource for data producers as a mechanism for dissemination<br />
    63. 63. NACDA: National Archive of Computerized Data on Aging<br />Study topic: Aging – gerontological research<br />Funder: National Institute on Aging<br />Unique attribute: largest library of electronic data on aging in the US<br />
    64. 64. Research Connections: Child Care and Early Education <br />Study topic: early education<br />Funder: US Dept. of Health & Human Services<br />Unique attribute: goal is more than data – to be the destination for child care & early education research<br />
    65. 65. NCAA Student-Athlete Experiences Data Archive<br />Study topic: intercollegiate athletics and higher education<br />Funder: NCAA<br />Unique attribute: to assist in the development of national athletics policies<br />
    66. 66. Health and Mental Health Collections<br />Enhanced sensitivity in the area of disclosure risk<br />From ingest of data to storage of data to analysis of data<br />Has driven ICPSR, as the hub, to heighten its computing and data sharing environments<br />Increasing demand has led to a need to automate – in a secured manner<br />
    67. 67. Center for Population Research in LGBT Health<br />Partner: Fenway Institute<br />Unique attribute: data is processed offsite – ICPSR acts as the host<br />
    68. 68. SAMHDA: Substance Abuse & Mental Health Data Archive<br />Funder: SAMHSA<br />Unique attribute: driving our online services and virtual analysis capabilities<br />
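The speaker notes for this collection describe Secure SDA as a tool that prohibits display when a sample size is too small. A minimal sketch of that idea in Python follows. The threshold of 10 and all names are hypothetical, since the notes do not give the actual rule, and this is not the SDA implementation.

```python
MIN_CELL_SIZE = 10  # hypothetical; the actual Secure SDA rule is not given


def suppress_small_cells(table, min_cell_size=MIN_CELL_SIZE):
    """Replace counts below the threshold with None so they are not displayed."""
    return {
        cell: (count if count >= min_cell_size else None)
        for cell, count in table.items()
    }


# Hypothetical cross-tabulation counts (not real data).
crosstab = {("male", "18-25"): 42, ("female", "18-25"): 3}
print(suppress_small_cells(crosstab))
```

A production disclosure-control system would likely need more than this, for example suppressing additional cells so that a hidden count cannot be recovered from row or column totals.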
    69. 69. NAHDAP: National Addiction & HIV Data Archive Program<br />Funder: NIDA<br />Unique attribute: driving restricted contract system<br />
    70. 70. IFSS: Integrated Fertility Survey Series<br />Funder: NICHD<br />Unique attribute: data harmonization<br />
    71. 71. Let’s Take a Break<br />Return at 11:45<br />