Welcome to the Life of a Dataset Webinar Series. This is the second in the series, Process. In this webinar, you will learn how ICPSR works to protect data and the confidentiality of study respondents. We will talk about the tools we use to secure data, the methods employed to ensure confidentiality, and the steps taken with the data and associated files to bring a codebook, a data file, and data in multiple usable formats to our website. We will be joined by David Thomas, Archive Manager of the Resource Center for Minority Data (RCMD).
The name of this webinar series is derived from a tour of ICPSR’s home during the OR Meeting in 2011. This meeting is held every other year on the campus of the University of Michigan in Ann Arbor. ICPSR invites its campus representatives from around the world to visit for training. These representatives are known as Official Representatives, or ORs for short. The Life of a Dataset Tour was transformed into a poster session for the 50th Anniversary Open House held last year at ICPSR. The staff from each of the units in the tour created posters on their unit’s role in the data life cycle. At the request of the ORs, a web site has been built with a short description of the Life of a Dataset and the posters, which are downloadable there. The brochure from the tour is also available on the site. This series of webinars was also created in response to requests from ORs who wish to share this information with faculty and students on their campuses. When completed, the webinars will reside on the ICPSR YouTube Channel and will be linked from the Life of a Dataset page. Each of the webinars focuses on one of the three stages of the data life cycle: Deposit, Process, and Dissemination. The second page of the brochure, which is downloadable from this page, shows the pipeline through which data travel from deposit to dissemination. We’ll be looking at the processing piece of it shortly. If you need help with this page, please feel free to contact me. My contact information will be on a slide at the end of the webinar.
The goal of data processing and enhancement is to render data usable to researchers interested in accessing them after they are deposited in a repository. In the social sciences, archives add value to data by making them easier to use for secondary analysis. There is wide variation in archival practices, often depending on the condition of the data to be archived and the goals of a particular repository or discipline. At ICPSR, once data are submitted in a submission information package, the data pass through a "pipeline" for processing and enhancement. But before we get into the details of processing, let’s take a look at two overarching considerations that affect every step in the process: Data Security and Confidentiality.
ICPSR conducts a disclosure risk review of every dataset to determine whether any data items could be used to identify individual respondents. ICPSR protects respondent confidentiality by removing, masking, or collapsing variables within public-use versions of the datasets. The vast majority of ICPSR data holdings are public-use files with no restrictions on their access.
Levels of Access
Sometimes data cannot be modified to protect confidentiality without significantly compromising their research potential. In these cases, ICPSR provides access to restricted-use versions that retain confidential data, and imposes stringent requirements for accessing them. ICPSR has established two categories of restricted access.
• Restricted-Use Data are made available for research purposes to investigators who agree to stringent conditions for the use of the data and their physical safekeeping.
• Enclave Data are those datasets that present especially acute disclosure risks. They can be accessed only on-site in ICPSR's physical data enclave in Ann Arbor. Investigators must be approved, and their notes and analytic output are reviewed by ICPSR staff.
We will talk more about these types of data and how to access them in the next webinar, on Dissemination.
The success of social science research relies on the willingness of research participants to take part in surveys. Thus, it is critically important to ensure that the identities of research subjects are not revealed in archived data. Disclosure risk is a term that is often used for the possibility that a data record in a study can be linked to a specific person, thereby revealing information about that person that otherwise would not be known. Concerns about disclosure risk have grown as more datasets have become available online, and as it has become easier to link datasets with publicly available external databases.
ICPSR employs stringent procedures to protect the confidentiality of individuals whose personal information may be part of archived data. These include:
• Rigorous review of all datasets to assess disclosure risk
• Modifying data if necessary to protect respondents' confidentiality
• Limiting access to datasets where risk of disclosure remains high
• Training of staff and consultation with data producers in methods of disclosure risk management
Removing Direct and Indirect Identifiers
Two kinds of variables often found in social science datasets present problems that could endanger the privacy of research subjects. Some variables point explicitly to particular individuals or units. Examples of direct identifiers include:
• Names
• Addresses, including ZIP codes
• Telephone numbers, including area codes
• Social Security numbers
• Other linkable numbers such as driver license numbers, certification numbers, etc.
Data depositors should carefully consider the analytic role that such variables play and should remove any identifiers not necessary for analysis.
Variables that can also be problematic are the indirect identifiers that may be used in conjunction with other information to identify individual respondents.
Examples of indirect identifiers include:
• Detailed geographic information (e.g., state, county, or census tract of residence)
• Organizations (to which the respondent belongs)
• Educational institutions (from which the respondent graduated and year of graduation)
• Exact occupations
• Place where respondent grew up
• Exact dates of events (birth, death, marriage, divorce)
• Detailed income
• Offices or posts held by respondent
ICPSR may recode data to remove the threat of disclosure. Recoding can include converting dates to time intervals, exact dates of birth to age groups, state of residence to regional codes, and income to income ranges or categories.
ICPSR staff work closely with data depositors to resolve confidentiality issues. ICPSR strongly recommends that data producers remove all respondent identifiers before they deposit their data in the archive.
Confidentiality, Informed Consent, and Data Sharing
Protection of the confidentiality of respondents is a core tenet of responsible research practice that begins with obtaining their informed consent.
Informed consent is a process of communication between a subject and researcher to enable the person to voluntarily decide whether or not to participate as a research subject. Human subjects involved in a project must participate willingly and be adequately informed about the research. The informed consent must include a statement describing how the confidentiality of subject records will be maintained. However, it is also important that informed consent is written in a way that does not unduly limit an investigator's discretion to share data with the research community.
Confidentiality and IRBs
IRBs take different approaches to secondary analysis of public-use datasets such as those distributed on ICPSR's Web site. Some require IRB review of proposals to analyze these data. Others exempt projects involving secondary analysis of existing datasets from preapproved sources, such as ICPSR.
Others are establishing their own policies to address these issues.
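To make the recoding approaches mentioned above more concrete (exact dates of birth to age groups, state of residence to regional codes, income to income ranges), here is a minimal Python sketch. It is an illustration only and not ICPSR's actual procedure: the cut points, the state-to-region lookup, and the field names (`birth_date`, `interview_date`, `state`, `income`) are all assumptions made for the example.

```python
from datetime import date

# Hypothetical, partial state-to-region lookup; a real recode would cover every state.
STATE_TO_REGION = {"MI": "Midwest", "OH": "Midwest", "NY": "Northeast",
                   "CA": "West", "TX": "South"}


def age_group(birth: date, interview: date) -> str:
    """Replace an exact date of birth with a coarse age group (assumed cut points)."""
    # Subtract one year if the birthday has not yet occurred in the interview year.
    had_birthday = (interview.month, interview.day) >= (birth.month, birth.day)
    age = interview.year - birth.year - (0 if had_birthday else 1)
    if age < 18:
        return "under 18"
    if age < 35:
        return "18-34"
    if age < 65:
        return "35-64"
    return "65+"


def income_range(income: float) -> str:
    """Replace a detailed income with a broad category (assumed cut points)."""
    if income < 25000:
        return "under 25,000"
    if income < 75000:
        return "25,000-74,999"
    return "75,000 or more"


def recode(record: dict) -> dict:
    """Build a public-use record that keeps only the recoded variables.

    Direct identifiers such as a name are dropped simply by not copying them.
    """
    return {
        "age_group": age_group(record["birth_date"], record["interview_date"]),
        "region": STATE_TO_REGION.get(record["state"], "Other"),
        "income_range": income_range(record["income"]),
    }
```

For example, a respondent born June 15, 1980, interviewed March 1, 2012, living in "MI" with an income of 52,000 would be released only as age group "18-34", region "Midwest", and income range "25,000-74,999".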
To continue forward, let’s take a step back to the Deposit form. At ICPSR, all data are deposited via an electronic Deposit Form. Files uploaded via this secure system are given unique deposit IDs and moved to the appropriate area for further processing. Physical submissions that arrive at ICPSR via removable media like CD-ROMs or DVDs are immediately copied to a secure location and the copy process is verified to ensure that all files transferred properly.
ICPSR processes with the assumption that all data deposited are restricted use until they are processed and it is determined that they can be disseminated publicly through our web site. Each processor has virtual machine client software installed on their desktop computer. The virtual machines reside on a server housed at the University of Michigan Information Technology Services Virtual Desktop Infrastructure (VDI). The virtual desktop has all the software that processors need to perform their jobs, such as SAS, SPSS, Stat Transfer, Adobe Acrobat, Windows 7, and MS Office. It has no internet connection, no email, and no capacity to do anything in that environment that would let them remove files or documents from the virtual machine. For example, access to the USB ports and CD/DVD drives is blocked while in the virtual machine, and the cut/copy and paste function is disabled between the virtual machine and the processor’s desktop. Each machine has a set allocation of resources: 4 GB of RAM, 50 GB of storage, and two processors. To use the virtual machine, the processor logs into the client and connects to the virtual machine via a secure and encrypted connection. Key strokes, mouse clicks, and the refresh of the monitor are the only things that are not encrypted. Data files and working documents reside on the virtual machine and are never transferred to the processor’s desktop. When the processor accesses a data file, the virtual machine connects to a secure and dedicated server at ICPSR where the data reside. This particular server is accessible only through virtual machines. All communication between the virtual machine and this server is secure and encrypted.
If a situation arises in which a file needs to be removed from the virtual machine, such as resolving an issue that requires the PI’s input, there is an option through a secure webapp called “airlock.” Inside the Secure Data Environment (SDE), the processor identifies the file that they wish to move out of the SDE and who their supervisor is. The webapp sends an email to the supervisor with the request. Inside the SDE, the supervisor checks the request and either approves or denies it. If permission is granted, the processor receives an email outside the SDE with a URL for the file. The transfer of the file happens automatically when the supervisor approves the request. The Virtual Data Enclave (VDE), which behaves the same way as the SDE, will allow researchers to analyze restricted-use data from their own computers via the client software. However, before files can be removed from that environment through a “dropbox,” which is similar to the processor’s “airlock,” a designated staff member of the archive where the study is held will need to review them in light of the terms of the data use agreement. Removal requests that are deemed to comply with the agreement will be released. The client software can be installed on any desktop or laptop, Mac or PC. Some VDE data contracts have restrictions that don’t allow data to be accessed from a laptop, but from a technical standpoint, the software can work on any computer.
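The “airlock” request flow described above (the processor names a file and a supervisor, the supervisor approves or denies, and the file is released only on approval) can be sketched as a small state machine. This is a hypothetical illustration in Python, not the actual webapp: the class, field, and method names are invented for the example, and the email and file-transfer steps are left out.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    DENIED = "denied"


@dataclass
class ReleaseRequest:
    """A request to move one file out of the secure environment (hypothetical model)."""
    file_name: str
    requester: str
    supervisor: str
    status: Status = Status.PENDING

    def review(self, reviewer: str, approve: bool) -> None:
        """Record the supervisor's decision; only the named supervisor may decide."""
        if reviewer != self.supervisor:
            raise PermissionError("only the named supervisor may review this request")
        if self.status is not Status.PENDING:
            raise ValueError("this request has already been reviewed")
        self.status = Status.APPROVED if approve else Status.DENIED

    def can_release(self) -> bool:
        """The file may leave the environment only after approval."""
        return self.status is Status.APPROVED
```

The point of the design is that nothing leaves by default: a new request starts as pending, and release depends on an explicit decision by a second person.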
Overview of the process steps: The Processing Supervisor reviews submitted materials, determines a suitable level of processing, and assigns a processor. The Supervisor oversees progress, authorizes release of the study, and releases the study in SDA, if appropriate. Once the study is assigned, the Processor reviews submitted materials, plans processing, builds the dataset, and, in cooperation with a Metadata Reviewer and Electronic Document Conversion staff, constructs the study description and documentation. The Processor may also prepare the study for release to the online analysis tool, SDA. Finally, the Processor and the Processing Supervisor assemble, review, and turn over the study for release to the Web site.
The Metadata Reviewers work with the Processors to build the study description and copyedit the metadata.
Electronic Document Conversion staff organize and check documents.
ICPSR currently maintains six copies of its data (and requires that any off-site backup be encrypted):
• Two copies held locally
• One copy put on tape once a month onsite
• One copy elsewhere in Michigan
• One copy in Amazon's storage cloud
• One copy with the DuraSpace cloud storage system known as DuraCloud
About archiving at this step: ICPSR currently maintains multiple copies of data, the data formats, and documentation at various locations to ensure that the data will always be available.
Processing Supervisors are generally very experienced in processing data at ICPSR. Their duties include assessing a data deposit’s completeness, reviewing all confidentiality issues, and ensuring the data’s quality and accurate processing. They also perform final reviews of all files, authorize “turnover,” authorize release of SDA datasets, and receive notification of the successful release of study files or SDA datasets.
Turnover is an ICPSR-created batch process that gathers all of the files for a study to be released to the website, checks them, and moves them to a location where they can be accessed by the website. Quality Checks (QCs) are reviews for accuracy and consistency of all data and documentation files. There are generally two checks, and there may be more.
The processor is the person who touches every bit of data and documentation to ensure that it protects the confidentiality of the respondents, is complete and consistent, and has documentation that supports the data. In the first step, producing a clean processed version of the data, the processor uses a Linux-based version of SPSS (remember that all of the processing is done in the SDE) and creates a syntax file to add variable names, variable labels, and value labels; standardize missing data and other missing values; remove or blank sensitive variables or values; and code outlying values as needed. This is known as the “pre-Hermes” step because the ICPSR-designed batch processor, “Hermes,” relies on the clean copy of the data to render a full suite of formats for preservation and dissemination. The more complete and correct the pre-Hermes file is, the more complete and correct the files that are output by Hermes. The “full suite” contains data in ASCII format, set-up files for SAS, SPSS, and Stata, and ready-to-go files for each of these statistical packages. Hermes also generates a codebook with a data completeness report, frequencies of all variables, and other detailed information on each variable, including the variable’s type and position. About SDA: Survey Documentation and Analysis (SDA) is a product of UC Berkeley’s Computer-assisted Survey Methods Program. It is a set of programs that is connected to data. Often SDA is added to the data after the Hermes step. The SDA version of the data must also be tested and checked prior to release.
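As a rough illustration of two of the ideas above, standardizing missing values before the batch step and reporting labeled frequencies in a codebook, here is a short Python sketch. It is not Hermes and not the SPSS syntax processors actually write: the missing-data codes, the standardized code `-9`, and the function names are assumptions chosen for the example.

```python
from collections import Counter

# Hypothetical codes a depositor might have used for missing data,
# and a hypothetical standardized code to replace them with.
RAW_MISSING = {None, "", "NA", 99, 999}
STANDARD_MISSING = -9


def standardize_missing(values, raw_missing=RAW_MISSING, code=STANDARD_MISSING):
    """Replace every depositor-specific missing code with one standard code."""
    return [code if v in raw_missing else v for v in values]


def frequencies(values, value_labels):
    """Count each value and report it under its value label, codebook-style."""
    counts = Counter(values)
    return {value_labels.get(v, str(v)): n for v, n in sorted(counts.items())}
```

A small usage example: applying `standardize_missing` to `[1, 2, 2, 99, None, 1, 2]` gives `[1, 2, 2, -9, -9, 1, 2]`, and `frequencies` with the labels `{1: "Yes", 2: "No", -9: "Missing"}` then reports 2 "Missing", 2 "Yes", and 3 "No". This also shows why a clean input matters: any missing code left unstandardized would appear in the frequencies as if it were a real response.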
The reviewer is the person who copyedits and reviews every bit of documentation to ensure that it complies with the DDI specification and ICPSR’s naming conventions and is complete and consistent. We’ll talk more about the duties of the reviewers in our next webinar, but for the purpose of processing the dataset for release, these are their duties. ICPSR has produced study descriptions, or metadata records, describing its holdings since the organization began in 1962. In the 1980s the metadata records were converted to a standardized, fielded text format. In 1999, ICPSR received an NSF Infrastructure in the Social Sciences award (SES-9977984), which enabled us to enrich the content of the records and to convert them to XML. The first XML records were produced to be compliant with the Data Documentation Initiative (DDI) Version 1.0 tag set; the records now comply with DDI Version 2.1. In 2006, ICPSR, in partnership with the University of Michigan Library, produced MARC records for its collection.
Please note that 'metadata records' does not refer to the larger PDF documentation files associated with each study, nor to the actual data files.
ICPSR also encourages users to ensure good use of the metadata records:
• When incorporating the records into a local catalog or Web site, the records must contain links back to the ICPSR site to ensure proper branding and to familiarize users with ICPSR. Please note that all ICPSR metadata records, whether XML or MARC, contain the URL for the resource.
• The catalog or site must contain the date of download of the records from ICPSR so that users can determine how recent the information is.
• Every effort should be made to keep the records up to date.
Processing Supervisors and many Archive Managers had their start at ICPSR as processors. Other former processors have gone on to careers in departments at ISR and other research institutions. At times, processors are hired on a temporary basis, but for the most part they are full-time permanent employees of the University of Michigan and enjoy the benefits that come with that employment, including being able to pursue up to a Master’s degree with tuition support from the university. If any of you attended the OR Meeting in 2011 and went on the Life of a Dataset tour, you probably met a few of our processors as they hosted the tour groups around the building.
LOAD Webinar on Process
Welcome! Life of a Dataset: Process Wednesday, January 23, 2013
Program Outline
• Life of a Dataset, short introduction
• Process
  – Goal of Data Processing
  – Security
    • Secure Data Environment (SDE)
    • Virtual Data Enclave (VDE)
  – Confidentiality
  – Steps in Processing Data
    • Processing Supervisor
    • Processors
    • Reviewers
Life of a Dataset - the Web site
• www.icpsr.umich.edu/icpsrweb/content/datamanagement/life-of-dataset.html
Goal of Data Processing
• The goal of data processing and enhancement is to render data usable to researchers interested in accessing them after they are deposited.
Data Security
• Public-Use - available through ICPSR Web site
• Restricted-Use - available to vetted researchers through a data use agreement
• Onsite ‘physical’ data enclave
Confidentiality
• Remove Direct and Indirect Identifiers
  – Direct Identifiers include
    • Respondent’s Name
    • Respondent’s Address
  – Indirect Identifiers include
    • Detailed geographic info, e.g., census tract of residence
    • Exact occupation
• Informed Consent
• IRB
From Deposit to Process
Data deposited through the Data Deposit form go directly into a secure data server.
Processing Supervisor’s Duties
• Oversee the work process of a number of processors with varying levels of experience
  – Review submitted data for confidentiality issues and completeness
  – Assign data to Processor
  – Work with Processor to develop processing plan and complete the processing of data
  – Perform Quality Checks (QC) on data to be released
Processor’s Duties
• Prepares data to be released through the Web site, SDA, and/or restricted data use agreements.
• Produce clean processed version of data
• Using Hermes, render full suite of formats for dissemination and preservation
• Produce complete and processed documentation set
• Prepare data for SDA, if appropriate
Reviewer’s Duties
• Prepares study description and metadata records, and ensures that the documentation complies with DDI specifications.
  – Works cooperatively with the Processor to complete the tasks above
  – Performs Quality Checks on all documentation
  – Digitizes any hard copy documentation and converts it to .pdf format
  – Assembles and conveys documentation to Processor
Who are the Processors?
• Generally, Processors are early career individuals with a background in a social science and some experience with statistics.
  – They comprise the largest group of employees at ICPSR
  – They receive routine training on matters related to data processing
  – They have the opportunity to serve on committees or activities at ICPSR
Contacts
• David Thomas, Archive Manager, RCMD, email@example.com, 734-936-5784
• Sue Hodge, Instructional Resources, firstname.lastname@example.org, 734-615-7850
Thank you! Life of a Dataset: Process Wednesday, January 23, 2013