These slides cover evolving federal research requirements for sharing scientific data. Provided are updates on federal agency responses to the 2013 OSTP memo, guidance on data management plans, resources for data management and curation training for staff/researchers, and tips for evaluating public data-sharing services. ICPSR's public data-sharing service, openICPSR, is also presented. Recording of this presentation is here: https://www.youtube.com/watch?v=2_erMkASSv4&feature=youtu.be
Meeting Federal Research Requirements for Data Management Plans, Public Access, and Preservation
1. Meeting Federal
Data Sharing Requirements for
Data Management Plans, Public
Access, and Preservation
June 9, 2015
ICPSR – University of Michigan
Hashtag: #icpsr
3. Possible New Forms of Data
and Big & Bigger Data
• Continuous location information from cell phones or Fastlane
transponders.
• Product radio-frequency identification (RFIDs), online product
searches and purchases, and device fingerprinting.
• Electronic medical records, and new devices for continuous
monitoring, passive heart beat measurement, movement indicators,
skin conductivity
• Satellite imagery.
• Social everything—networking, bookmarking, highlighting,
commenting, product reviewing, recommending, and annotating.
• Online games and virtual worlds.
(http://www.sciencemag.org/content/331/6018/719.full)
4. Data with Risk: Identifiers and/or
Sensitive Topics
Direct identifiers
• Addresses, including ZIP and other postal codes
• Telephone numbers, including area codes
Indirect identifiers
• Exact dates of events (birth, death, marriage)
• Detailed income
• Detailed geographic information (e.g., county)
6. What We Will Cover
• Evolving status of federal research data-sharing
requirements
• Review of data management best practices
• Counsel on developing data management plans
• Data curation (management) training
opportunities
• Public-access data sharing services at ICPSR that
meet data-sharing requirements
– Tips to evaluate public-access data services
7. Data Stewardship
is ICPSR’s Mission
• 50+ years of data sharing
• Consortium approach
• Data stewardship
• Data management
• Data curation
• Data preservation
9. “Recent” Federal Data Sharing Initiatives
• NIH: 2003 – data sharing plans
• NSF: 2011 – data management plans
• OSTP: 2013 – Memo with subject “Increasing
Access to the Results of Federally Funded
Scientific Research”
12. Data Portion of Memo - 13 Elements
The elements are also summarized online within
ICPSR’s Web site:
http://icpsr.umich.edu/content/datamanagement/ostp.html
13. Elements Summarized in 5:
1.Maximize access
2.Protect confidentiality and privacy
3.Appropriate attribution
4.Long term preservation and sustainability
5.Data management planning & compliance
15. UK results on data sharing attitudes
• In 2011 survey, 85% of researchers said they
thought their data would be of interest to
others.
• Only 41% said they would be happy to make
their data available.
• Only a third had previously published data.
Source: DaMaRO Project, University of Oxford
http://www.slideshare.net/DigCurv/15-meriel-patrick
16. Data Sharing Status – Still Work to Do
Federal Agency | Shared Formally, Archived (n=111) | Shared Informally, Not Archived (n=415) | Not Shared (n=409)
NSF (27.3%) | 22.4% | 43.7% | 33.9%
NIH (72.7%) | 7.4% | 45.0% | 47.6%
Total | 11.5% | 44.6% | 43.9%
Pienta, Gutmann, & Lyle (2009). “Research Data in The Social Sciences: How Much is Being Shared?”
http://ori.hhs.gov/content/research-research-integrity-rri-conference-2009
See also: Pienta, Gutmann, Hoelter, Lyle, & Donakowski (2008). “The LEADS Database at ICPSR:
Identifying Important ‘At Risk’ Social Science Data.”
http://www.data-pass.org/sites/default/files/Pienta_et_al_2008.pdf
Pienta, Alter, & Lyle (2010). “The Enduring Value of Social Science Research: The
Use and Reuse of Primary Research Data”. http://hdl.handle.net/2027.42/78307
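The Total row of the table is consistent with a weighted average of the NSF and NIH rows, using each agency's share of the sample (27.3% and 72.7%) as the weights. A minimal Python check, assuming that reading of the table (the variable and function names are illustrative and do not come from the study):

```python
# Each agency's share of the sample, as printed next to its name in the table.
# Treating these shares as weights is an assumption about how Total was built.
agency_share = {"NSF": 0.273, "NIH": 0.727}

# Row percentages: (shared formally, shared informally, not shared).
sharing_pct = {
    "NSF": (22.4, 43.7, 33.9),
    "NIH": (7.4, 45.0, 47.6),
}


def weighted_totals(shares, rows):
    """Return the share-weighted average of each column, rounded to one decimal."""
    n_cols = len(next(iter(rows.values())))
    return tuple(
        round(sum(shares[agency] * rows[agency][col] for agency in rows), 1)
        for col in range(n_cols)
    )


totals = weighted_totals(agency_share, sharing_pct)
print(totals)  # (11.5, 44.6, 43.9)
```

The result matches the Total row, which shows how strongly the larger NIH group pulls the overall formal-sharing rate toward its own 7.4%.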
19. Hundreds of Pages in Two Slides: 1 of 2
• Data management plans (DMP) required
– If not already, many by October 2015
• Awards January 2016 and beyond to require data
sharing (adherence to DMPs)
• Appropriate (direct) costs for data management and
access allowable in applications and proposals
• Agencies interested in use of existing public
(commercial) repositories; discipline-specific with
discipline-based data standards noted
20. Hundreds of Pages in Two Slides: 2 of 2
• Non-compliance addressed through withholding future funding
• An emphasis on attribution (credit) to principal investigators and
the agency
– Recognition of data citation and persistent identifiers
• An emphasis on metadata development
• Goals to develop best practices
– Selecting what data and supporting documentation to preserve
– Sharing restricted-use (sensitive) data
• Training of agency staff to understand best practices and for
evaluation and compliance
– Suggests training of research grant support teams needed
21. What is good data management? In
summary:
• Maximize data access
– Discoverable & interpretable
– In formats the public can use
• Achieve appropriate attribution
• Protect confidentiality and privacy
• Provide long term preservation and
sustainability
26. In Formats the Public Can Use
• Various common
statistical programs
• Current versions (2.0
versus 22.0!)
• Online analysis
capability
• Codebook
documentation
28. Appropriate Attribution
• Properly citing data encourages the replication of
scientific results, improves research standards, guarantees
persistent reference, and gives proper credit to data
producers.
• Citing data is straightforward. Each citation must include
the basic elements that allow a unique dataset to be
identified over time: title, author, date, version, and
persistent identifier.
• Resources: ICPSR's Data Citations page, IASSIST's Quick
Guide to Data Citation, DataCite.
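As a rough illustration of the five basic elements listed above, the sketch below assembles them into a single citation string and refuses to build one if any element is blank. The class name, the ordering, and the punctuation are assumptions made for this example; they are not an official ICPSR, IASSIST, or DataCite format, and the sample values are placeholders.

```python
from dataclasses import dataclass, fields


@dataclass
class DataCitation:
    """The five basic citation elements named on the slide."""
    title: str
    author: str
    date: str
    version: str
    persistent_id: str  # for example, a DOI

    def missing_elements(self):
        """Return the names of any elements that were left blank."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

    def format(self):
        """Build a citation string; ordering and punctuation are assumptions."""
        missing = self.missing_elements()
        if missing:
            raise ValueError("citation is missing: " + ", ".join(missing))
        return (f"{self.author} ({self.date}). {self.title}. "
                f"{self.version}. {self.persistent_id}")


# Placeholder values for illustration only; not a real dataset or DOI.
example = DataCitation(
    title="Example Survey of Data Sharing",
    author="Doe, Jane",
    date="2015",
    version="Version 1",
    persistent_id="doi:10.0000/EXAMPLE",
)
print(example.format())
```

Keeping the version and persistent identifier as required fields reflects the slide's point that a citation must identify one unique dataset over time.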
29. Protect Confidentiality and Privacy
• It is critically important to protect the identities of
research subjects
• Disclosure risk is a term that is often used for the
possibility that a data record from a study could be
linked to a specific person
• Sensitive data are data with highly sensitive personal
information
30. Common Objection/Misperception:
“My data are too sensitive to share. . .”
• ICPSR has been sharing restricted-use data for
over a decade. Three methods are used:
– Secure Download
– Virtual Data Enclave
– Physical Enclave
• ICPSR, for example, stores & shares over 6,400
restricted-use datasets associated with over
2,000 ‘active’ restricted-use data agreements
31. Reality: Restricted-use data can be
effectively shared with the public
• Through the use of a virtual data enclave where
the data never leave the server
• Where there is a process (and understanding!)
to garner IRB approval from the requesting
scientist’s university
• Where there is a system, technology, data
professionals, and collaboration space in place
to disseminate
• Because agencies allow for an incremental
charge to the data requestor to offset marginal
costs
32. Long Term Preservation and Sustainability
“Digital information lasts forever or five years,
whichever comes first”.
-Jeff Rothenberg
34. Sustained Long Term Preservation –
Still a Key Funding Question
• Sustainable funding for free public access remains a
challenge and even an unfunded mandate
• Good data management requires paid professionals
• Significant progress has been made as agencies
allow for direct costs for data management to
achieve public data sharing (in the short term)
• Funding for long term sharing is still nebulous
– Data sharing services thus forced to offer disclaimers
36. Purpose of Data Management Plans
• Data management plans describe how researchers
will provide for long-term preservation of, and
access to, scientific data in digital formats.
• Data management plans provide opportunities for
researchers to manage and curate their data more
actively from project inception to completion.
• See ICPSR's resource: Guidelines for Effective Data
Management Plans
37. ICPSR Acquisitions Can Help!
• Elements of a data management plan and the
role of ICPSR
– It should answer: what is the scope of data,
how will public access be achieved,
confidentiality addressed, and content
preserved
• Review data management plan prior to project
38. ICPSR Acquisitions Can Help!
• How to plan for the costs of data management
– Better earlier rather than later
– Doing it yourself vs. paying someone else to do
it
– Tools available to help (e.g., Colectica)
– Our curation cost model is based on: number
of files, number of variables, file format,
confidentiality review
39. ICPSR Acquisitions Can Help!
• ICPSR willing to write a letter of support for
archiving with ICPSR.
– Requires some lead time
– Not a requirement, but helps clarify the roles
of the project vs. the archive
44. And still more guidelines after the
project is awarded:
• Guide emphasizes
preparation for data
sharing throughout
the project
• Available online and
via download (pdf)
46. The Concept of “Data Curation”
• Curation, from the Latin "to care," is the process used to add value to
data, maximize access, and ensure long-term preservation
• Data curation is akin to work performed by an art or museum curator.
– Data are organized, described, cleaned, enhanced, and preserved for
public use, much like the work done on paintings or rare books to make
the works accessible to the public now and in the future
• Curation provides meaningful and enduring access to data
• Data curation is the foundation for effective, long-term data sharing
47. Practical Data Curation Training
• Training for working knowledge or for functional
knowledge
• For greater understanding in building DMPs,
understanding agency requirements, assessing or
enhancing data repositories
• The Curriculum = The Process
– Data Appraisal
– Reviewing Data
– Confidential Data
– Cleaning Data
– Describing Data – the Metadata
– Deposit
– Dissemination (Sharing)
– Sustainability
49. ICPSR Data Curation Training Workshops
• 1-5 day workshops on data curation/data
repository management decisions
– Participants learn about best practices and
tools for data curation, from selecting and
preparing data for archiving to optimizing and
promoting data for reuse
• Available via ICPSR Summer Program (Ann
Arbor – July 27-31, 2015) or onsite at your
institution
51. First: Tips for Evaluating a Data Sharing Service
Questions to consider when evaluating a data sharing service:
• How will the service sustain itself? Does it have a long term funding
stream?
• How will the service care for my data in the long term should the service
fail? Is there a plan? A safety net?
• Can the service quickly maximize discoverability of my data? Does it
explain how it will do so? Is it connected to a data catalog?
• Does the service have a network of interested researchers & students
seeking data? Will my data get used?
• Does the service have knowledge of international archiving standards?
• Does the service provide a DOI, data citation, and version control should I
need to update my files?
• I have sensitive data or data with some disclosure risk to deposit. Does
the service understand how to secure it upon intake and when sharing?
Does it have experience in this area?
52. Understanding openICPSR
• What is openICPSR
• Who might use the hosted data-sharing
service
• What features/benefits does openICPSR offer
data depositors; what is unique about the
service
• What is openICPSR for Institutions and
Journals
53. What is openICPSR?
openICPSR is a research data-sharing service for the
social and behavioral sciences. It enables the public to
access research data without charge—or in the case of
restricted-use data, for nominal charge.
54. Who might use openICPSR?
openICPSR has been developed for use in the social
and behavioral sciences. This includes:
• Researchers required to share data freely with the
public to comply with grant/contract requirements
• Authors required to share data for replication
purposes to comply with journal requirements
• Researchers required to share sensitive data or data
with disclosure risk with the public from a secure
digital environment
• Researchers, including students, who want to share
data publicly as good practice or for the purposes of
replication
56. Why did ICPSR develop openICPSR?
1. The environment changed
• Sharing government- and foundation-funded research data with
the public without charge is becoming the norm
• Data management plans are required as part of the
proposal package
• Grant/contract proposals allow for budget line items related
to sharing and preserving data
2. Institutions, journals, and individual
researchers desired a public data sharing
service that is trustworthy and economical
without the need for additional technical
staffing or equipment (hosted)
57. How is openICPSR unique?
openICPSR is the only public data-sharing service:
• Where the deposit is reviewed by professional data
curators who are experts in developing metadata (tags)
for the social and behavioral sciences
• With an immediate distribution network of over 750
institutions looking for research data, that has powerful
search tools, and a data catalog indexed by major
search engines
• Sustained by ICPSR - over 50 years of experience in
reliably protecting research data
• Prepared to accept and disseminate sensitive and/or
restricted-use data in the public-access environment
59. Why is there a charge for openICPSR
deposits and how should the fee be paid?
• openICPSR charges a fee to sustain the service such that
data deposits will be available both now and into the future
• Effective data curation carries costs. Fees are charged to
cover costs including:
– Curation professionals who review metadata and catalog the
data
– Technology professionals who maintain functionality and
security of the website
– Costs for multiple copies of the deposit and the metadata
(preservation) to ensure the safety of the deposits (storage and
servers)
• Deposit fees can and should be written into the grant
proposal
60. I would like to curate my data. Can I
share it via openICPSR?
There are two openICPSR package types:
1. Self Deposit: Enables research scientists to deposit
data & documentation on demand and provide
immediate public access. Depositors receive a DOI and
data citation upon publishing and a metadata review
shortly after publishing. The cost is $600 per project.
2. Professional Curation: Enables a research scientist to
tap all aspects of ICPSR’s curation services. The fee
depends on the complexity of the data and the curation
services desired. Scientists must call for a quote,
preferably during the time the grant proposal (specifically
the data management plan) is being prepared.
61. I’m still trying to understand the
difference between a self-deposit and a
curated deposit. Can you explain?
Let’s show you the difference!
• A nicely described self-deposit in openICPSR
• A professionally-curated deposit in the
membership archive of ICPSR
62. Does openICPSR accept and
disseminate restricted-use data?
• The deposit of sensitive data or data with
disclosure risk (restricted-use) is similar to the
deposit of non-sensitive data except that the
depositor will indicate that the data should be for
restricted-use only
• Dissemination of restricted-use data is through
ICPSR’s virtual data enclave; in this environment,
data never leave the secure server and analysis
takes place in the virtual space
• Analysts desiring to access the data will need to
apply for the data and will pay an access fee
64. openICPSR for organizations was built to:
• Fulfill the organization’s governmental grant & journal
replication requirements
• Brand the data-sharing service with a logo, colors, and a
unique URL
• Provide DOIs & data citations upon deposit
• Increase exposure & reach of the organization’s research
via inclusion in ICPSR’s data catalog & integration with its
social media
• Administer the fully-hosted (cloud) service economically
without the need for costly technical staff or equipment
• Share and preserve restricted-use data
• Provide confidence that the data and service are safe &
available for the long term
65. Branded and Hosted
Your logo. Your colors.
A unique URL.
On-demand deposit.
On-demand reports.
66. What are the fees for openICPSR for
Institutions and Journals?
67. A Firehose of Information!
• Evolving status of federal research data-sharing
requirements (guidelines)
• Review of good data management practices
• Counsel on developing data management plans
• Data curation (management) training
opportunities
• Public-access data sharing services at ICPSR that
meet data-sharing requirements
– Tips to evaluate public-access data services
68. Copies of these Slides & Use
• Feel free to share it; present
it; cite it!
• Webinar recording and
slides available soon
• Find copies of these slides
on Slideshare.net
– Several notes and
additional links are found in
the notes view
69. Get More Information
• Contact us:
– netmail@icpsr.umich.edu
– (734) 647-2200
• More on Assuring Access to Scientific Data: white paper
– “Sustaining Domain Repositories for Digital Data”
Editor's Notes
Federal agencies are requiring data management plans as part of research proposals to increase public access to results (including research data) of federally funded scientific research. Join us for a session on sustainable data sharing models, including models for sharing restricted-use data. Demos of these models and tips for accessing hosted public data access services will be provided as well as resources for creating data management plans for grant applications.
We’re going to jump right in here and talk about what we mean when we talk data! If only data were uniform and always well-documented.
Just a few questions, many unanswered, regarding data:
How do you store, preserve, and share all these types of data?
Will one process (above) fit all data?
Should all these different types of data live together?
What data should you keep? And what documentation related to the data should you keep? All of it? Should you? Could you?
A need to share data with varying levels of security to ensure the protection of respondents.
Why We Care!
ICPSR exists to preserve and share research data to support researchers who:
Write research articles, books, and papers
Teach or utilize quantitative methods
Write grant/contract proposals (require data management plans)
Current archives/collections/repositories already meeting public access requirements regarding data
NACDA – NACJD: examples of long term sustainability
NAHDAP – DSDR: examples of sharing of confidential data
NACJD – example of depository/researcher compliance (holding 10% of funding to PI)
LGBT – MET: unique infrastructure and dissemination
Research Connections: reports and data dissemination; audiences including policymakers
In January 2011, the National Science Foundation released a new requirement for proposal submissions regarding the management of data generated using NSF support. All proposals must now include a data management plan (DMP). (NIH has similar DMP requirements.)
The plan is to be short, no more than two pages, and is submitted as a supplementary document. The plan needs to address two main topics:
What data are generated by your research?
What is your plan for managing the data?
The OSTP Memo
This memo directed funding agencies with an annual R&D budget over $100 million to develop a public access plan for disseminating the results of their research
A concern for investment: “Policies that mobilize these publications and data for re-use through preservation and broader public access also maximize the impact and accountability of the Federal research investment.”
Federal agencies with over $100 M annually in R&D expenditures to develop plans to support increased public access to the results of research funded by the Federal Government
The OSTP Memo – Overview
Released February 22, 2013
“Maximize access, by the general public and without charge, to digitally formatted scientific data created with Federal funds…”
ICPSR’s response: http://www.icpsr.umich.edu/files/ICPSR/ICPSRComments.pdf
Elements can be summarized by:
Maximize access
Protect confidentiality and privacy
Appropriate attribution
Long term preservation and sustainability
Data management planning
An old idea but still new in practice – globally not fully practiced.
http://www.slideshare.net/DigCurv/15-meriel-patrick
4,883 NIH & NSF PIs emailed a survey
1,217 responses (24.9% response rate)
1,003 valid (collected data, not dissertation)
We attempted to invite all 4,883 of these PIs.
The PI survey consisted of questions about research data collected, various methods for sharing research data, attitudes about data sharing, and demographic information. PIs were also asked about publications tied to the research project, including information about their own publications, research team publications, and publications outside the research team. We received 1,217 responses (24.9% response rate). For the analytic sample, we selected PIs and their research data if (1) they confirmed they collected research data (86.6% of the responses), (2) they did not collect data for a dissertation award (n=33), and (3) they were not missing data on the dependent variable.
The responses vary widely in detail, length, and policies. This slide represents the commonalities we found.
Great list of existing domain repositories: http://www.re3data.org/
To understand the development of a good data management plan, we’ll talk about good data management.
Note that our summarized view supports what we are seeing so far in agency responses to the OSTP memo.
ICPSR views discovery of data much like discovery of other products and services. How do retailers assemble their online catalogs so that you can find what you are looking for?
Retailers study how you search, and how others like you AND unlike you search. They tag items in a multitude of ways and offer a variety of search tools so that you and others can narrow down to what you are interested in quickly (because if we don’t do it quickly, we know you move on!).
A good data catalog, like an online retailer’s catalog, understands how you search: the words you use and the questions you ask.
This is potentially part of the reason we see interest in domain-specific repositories: the people doing the tagging (creating the metadata) are experts in their field, and the web developers understand the domain in ways that general catalog developers may not.
This is the ultimate litmus goal -- it “contains information intended to be complete and self-explanatory” for future users. [Quote is from the National Longitudinal Survey of Youth’s explanation of its documentation (see: http://www.nlsinfo.org/nlsy97/97guide/chap3.htm#threethree).]
Why does this matter? 1) Others will be able to independently use/understand data, 2) Data will be readable (i.e., in useable formats) in the future, 3) It makes your life less complicated once you’re finished with the data collection -- you don’t need to continually explain, reformat, revise, etc.
Sensitive personal information isn’t about names, addresses, credit card numbers, or other direct identifying information. Research scientists should never, never, ever submit this type of information to any hosted service – ever. What we’re talking about is highly personal information (topics) within research data that may include past/present drug use, illegal activities, or perhaps sexual habits.
We’re currently adding about 50 new agreements each month.
EBCDIC format
A collection of resources (links) to assist in data management plans for grant proposals
Tools to prepare plans (templates & sample plans)
Contact information for plan advice
Repository list: http://www.re3data.org/
https://dmp.cdlib.org/
Puts together the basic structure & form for your DMP. Note that it isn’t plug and go – the reasoning behind your management plans based on the discipline and/or data being collected should be added.
22 pages of guidelines and references even including a sample plan (boilerplate!) available for download.
Link to pdf document: http://www.icpsr.umich.edu/files/datamanagement/DataManagementPlans-All.pdf
Pdf link to the data prep guide: http://www.icpsr.umich.edu/files/deposit/dataprep.pdf
More information on data preparation for archiving: http://www.icpsr.umich.edu/icpsrweb/content/deposit/guide/
Links to images are imbedded in image
Summer Program fees: $1500 to $3000 per participant, depending on membership affiliation
Onsite workshop fees: per diem and travel costs for ICPSR staff.
openICPSR is ICPSR’s public-access data collection.
openICPSR assists researchers in meeting requirements for public access to federally funded research data. It ensures that data depositors fulfill public-access requirements of grant and contract RFPs. It is available to any researchers (from member and non-member institutions) who wish to make their data available to the public via a sustainable archive.
Why is ICPSR venturing into this space? Because we were hearing from agencies that ICPSR might not be ‘open enough’ for some of our research scientists.
It is sometimes easier to identify what openICPSR does not accept than what it does. openICPSR is not appropriate for the natural or hard sciences (bio-medical). It is also not appropriate for huge datasets (multiple GBs of data). Our metadata experts and our catalog are focused on a very broadly defined area known as the social and behavioral sciences.
For repositories outside ICPSR’s domain, see Stanford’s list: http://library.stanford.edu/research/data-management-services/share-and-preserve-research-data/domain-specific-data-repositories
Another great list is the Registry of Research Data Repositories: http://www.re3data.org/
What about file-types? Currently openICPSR does not accept video or audio files or image files like jpeg, tif, etc. These types of files present size and confidentiality concerns.
Each project is limited to 1GB maximum storage and less than 1,000 files per project.
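The limits above (no audio, video, or image files; 1GB maximum storage; fewer than 1,000 files per project) lend themselves to a quick check before depositing. The sketch below is a hypothetical helper, not an openICPSR tool: the function name, the list of blocked extensions, and the reading of "1GB" as 10^9 bytes are all assumptions made for illustration.

```python
from pathlib import PurePath

MAX_TOTAL_BYTES = 10**9  # "1GB" read as 10^9 bytes; the exact definition is an assumption
MAX_FILE_COUNT = 999     # "less than 1,000 files per project"
# Illustrative audio/video/image extensions; not an official openICPSR list.
BLOCKED_SUFFIXES = {".jpg", ".jpeg", ".tif", ".tiff", ".mp3", ".wav", ".mp4", ".mov"}


def check_deposit(files):
    """Check (file name, size in bytes) pairs against the stated project limits.

    Returns a list of human-readable problems; an empty list means none were found.
    """
    files = list(files)
    problems = []
    if len(files) > MAX_FILE_COUNT:
        problems.append(f"too many files: {len(files)} (limit is {MAX_FILE_COUNT})")
    total_bytes = sum(size for _, size in files)
    if total_bytes > MAX_TOTAL_BYTES:
        problems.append(f"project is too large: {total_bytes} bytes")
    blocked = sorted(
        name for name, _ in files
        if PurePath(name).suffix.lower() in BLOCKED_SUFFIXES
    )
    if blocked:
        problems.append("unsupported file types: " + ", ".join(blocked))
    return problems


# Example: a small project with one image file that would need to be removed.
print(check_deposit([("survey.csv", 5_000_000), ("site_photo.JPG", 300_000)]))
```

Passing name and size pairs rather than reading a directory keeps the sketch independent of any particular file layout.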
openICPSR reserves the right to remove files and projects, published and unpublished, where use is determined not to be for the purposes of research or research-oriented sharing/preservation.
Calendar Year 2013: Over 185,200 searches for data conducted on ICPSR’s data search tools
Note that bit-level deposits to openICPSR by individuals affiliated with ICPSR member institutions are free as openICPSR is a benefit of membership; curated deposits (highly recommended) require a fee.
It is important to note that fees for openICPSR are intended to be written into grant and contract proposals. Fees should be funded by the project, not outright by an institution or members of ICPSR. Funding agencies are encouraging such fees be included as part of the data management plan included with the RFQ. For more information on data management plans, please see our website http://www.icpsr.umich.edu/icpsrweb/content/datamanagement/dmp/index.html and/or contact our data archivists at deposit@icpsr.umich.edu
There is significant administrative burden required for the dissemination of restricted-use data. This includes the completion and review of restricted-use contracts that include IRB approval, data protection issues, placement of the data into the VDE and monitoring of progress and results with a disclosure review of results as well as server time. This is what the access fee to the data user will cover. An access fee to cover costs is allowable per federal grants/contracts guidelines.
openICPSR for Institutions and Journals is powered by all of the features included in openICPSR. It is part of the same cloud, but it also gives organizations branding, reporting, and administrative controls within a stable, trustworthy service.
Financial sustainability of openICPSR for Institutions & Journals will require an annual fee paid for by the organization. The fee will be graduated based upon the projected number of new deposits per year and a maximum total GB of storage.