Introduction to Data Management and Sharing


A growing number of research sponsors and journals ask scholars and researchers to outline how they will manage and share their research data. This introduction covers data management and sharing practices, with some information specific to Columbia University researchers.

Published in: Education, Technology

  1. 1. Introduction to Data Management and Sharing. University Libraries/Information Services; Office of Research Compliance and Training
  2. 2. <ul><li>Why is there a new focus on data management and sharing? </li></ul>
  3. 3. Data sharing is not widely practiced… <ul><li>Lack of time </li></ul><ul><ul><li>for data clean up, user questions </li></ul></ul><ul><li>Lack of recognition </li></ul><ul><ul><li>not valued in promotion/tenure </li></ul></ul><ul><li>Lack of control </li></ul><ul><ul><li>worries about scooping, misinterpretation </li></ul></ul><ul><li>Legal concerns </li></ul><ul><ul><li>copyright, patents </li></ul></ul><ul><li>Inadequate infrastructure </li></ul>
  4. 4. … yet its value is recognized. Data sharing was a key element of: the Human Genome Project; an NIH-funded Alzheimer’s study published in April 2011; the Sloan Digital Sky Survey
  5. 5. There are new possibilities… <ul><li>Networked digital technology creates new potential for: </li></ul><ul><li>data collection </li></ul><ul><li>data analysis </li></ul><ul><li>data “mash ups” </li></ul><ul><li>collaboration </li></ul><ul><li>citizen science </li></ul>National Science Foundation
  6. 6. … and science is in the spotlight “The impact of science on people’s lives, and the implications of scientific assessments for society and the economy are now so great that people won’t just believe scientists when they say ‘trust me, I’m an expert.’ … Science has to adapt.” - Geoffrey Boulton, chair of the working group for the study Science as a public enterprise: opening up scientific information, 5.13.11
  7. 7. These factors have changed the conversation, resulting in…
  8. 8. Calls for data accessibility… “It is obvious that making data widely available is an essential element of scientific research.” - Science editorial “Making Data Maximally Available,” 2.11.11
  9. 9. … and new data management policies <ul><li>NSF and other research sponsors are strengthening their data management and sharing policies to help: </li></ul><ul><li>increase the accessibility of data </li></ul><ul><li>create standards and protocols </li></ul><ul><li>develop interoperable data repositories </li></ul><ul><li>encourage transparency of research </li></ul>
  10. 10. Submitting a proposal to the NSF? <ul><li>You must: </li></ul><ul><li>Submit a two-page data management plan with your proposal. </li></ul><ul><li>Share your research data (or justify why you should not share it). </li></ul>
  11. 11. Publishing in a Nature journal? “… authors are required to make materials, data and associated protocols promptly available to readers.”
  12. 12. More than ever, researchers are expected to make their data accessible to—and usable by—others.
  13. 13. <ul><li>This means… </li></ul><ul><li>Having a data management plan is more important than ever. </li></ul>Library of Congress
  14. 14. Data management plan (DMP) <ul><li>A data management plan outlines how you will collect, organize, manage, store, secure, back up, preserve, and share your data. </li></ul>Academic Commons
  15. 15. Other DMP elements <ul><li>Designating who is responsible for data management </li></ul><ul><li>Tools or software needed to create/process/visualize the data </li></ul><ul><li>Compliance with policies and regulations </li></ul>NIST
  16. 16. Columbia DMP Template <ul><li>Columbia provides a DMP template. </li></ul><ul><li>Though it was created in response to NSF requirements, you can use it as a guide for creating any DMP. </li></ul><ul><li>You can find the template on the NSF Data Management Requirements page of this website. </li></ul>
  17. 17. <ul><li>Some points to consider when creating your DMP </li></ul>
  18. 18. Your data storage needs <ul><li>Data formats and size </li></ul><ul><li>Retention period </li></ul><ul><li>Privacy or security requirements </li></ul><ul><li>Backup plan </li></ul><ul><li>Access requirements </li></ul>Pittsburgh Supercomputing Center
  19. 19. Data storage planning <ul><li>Plan for the entire life-cycle. </li></ul><ul><li>Establish a baseline and project the rate of growth for the duration of the project. </li></ul>CDC/Dorothy Roland
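The baseline-and-growth projection above can be sketched as a small calculation. This is a minimal illustration, assuming a constant compounding monthly growth rate; the function name `project_storage_gb` and the example figures are hypothetical and do not come from the slides.

```python
def project_storage_gb(baseline_gb: float, monthly_growth_rate: float, months: int) -> list[float]:
    """Project storage needs at the end of each month of a project.

    Assumption: the data set grows by a constant fraction each month
    (e.g. 0.10 means 10% growth per month). Real projects may grow in
    bursts, so treat the result as a planning estimate only.
    """
    if baseline_gb < 0 or months < 0:
        raise ValueError("baseline_gb and months must be non-negative")
    projected = []
    size = baseline_gb
    for _ in range(months):
        size *= 1 + monthly_growth_rate   # compound the growth
        projected.append(round(size, 2))  # round only for reporting
    return projected


# Hypothetical example: 100 GB baseline, 10% growth per month, 2 months.
estimate = project_storage_gb(100, 0.10, 2)
```

The last value in the returned list is the capacity to plan for at the end of the project, before adding room for backups and archival copies.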
  20. 20. Two types of storage <ul><li>Active </li></ul><ul><ul><ul><li>Frequent additions and updates </li></ul></ul></ul><ul><li>Archival </li></ul><ul><ul><ul><li>In fixed form; only need periodic access </li></ul></ul></ul>CDC
  21. 21. Active storage at Columbia <ul><li>School/department/division servers </li></ul><ul><ul><ul><li>Many researchers use servers managed by “local” IT groups. </li></ul></ul></ul><ul><li>CUIT </li></ul><ul><ul><ul><li>20-80 MB personal storage </li></ul></ul></ul><ul><ul><ul><li>Central LAN service </li></ul></ul></ul><ul><li>Center for Digital Research & Scholarship </li></ul><ul><ul><ul><li>Consultation available </li></ul></ul></ul>
  22. 22. Archival storage at Columbia <ul><li>Digital </li></ul><ul><ul><ul><li>Academic Commons is Columbia’s online research repository. </li></ul></ul></ul><ul><li>Physical </li></ul><ul><ul><ul><li>Consult the appropriate Columbia University Libraries archive. </li></ul></ul></ul>
  23. 23. Best archival file formats <ul><li>Nonproprietary file formats </li></ul><ul><li>Uncompressed and unencrypted files </li></ul><ul><li>Consider ease of migration going forward </li></ul><ul><li>May need to archive software as well as data </li></ul>INL
  24. 24. Data retention requirements
  25. 25. Other important retention policies <ul><li>NIH </li></ul><ul><ul><li>3 years </li></ul></ul><ul><li>NSF </li></ul><ul><ul><li>Check with individual NSF directorates </li></ul></ul><ul><li>Health Insurance Portability and Accountability Act (HIPAA) </li></ul><ul><ul><li>At least 6 years </li></ul></ul>USGS
  26. 26. Data security and integrity <ul><li>Security </li></ul><ul><ul><li>Protect data from unauthorized access or accidental disclosure. </li></ul></ul><ul><li>Integrity </li></ul><ul><ul><li>Ensure that data remains unaltered before, during, and after analysis and presentation. </li></ul></ul>NPS
  27. 27. Data security requirements <ul><li>Your data may be subject to laws and policies such as: </li></ul><ul><ul><li>HIPAA (Health Insurance Portability and Accountability Act) </li></ul></ul><ul><ul><li>IRB (Institutional Review Board) </li></ul></ul><ul><ul><li>Columbia computing policies </li></ul></ul><ul><ul><ul><li>See the Computing and Technology section of the Columbia Administrative Policy Library </li></ul></ul></ul>
  28. 28. Physical security best practices <ul><ul><li>Restricted access to research facilities, computers, data </li></ul></ul><ul><ul><li>Only trusted individuals troubleshoot computer problems </li></ul></ul><ul><ul><li>Lab notebooks, samples in locked cabinets </li></ul></ul>Lawrence Berkeley National Laboratory
  29. 29. Digital security best practices <ul><ul><li>Sensitive data on computers not connected to Internet </li></ul></ul><ul><ul><li>Virus protection up to date </li></ul></ul><ul><ul><li>No confidential data via e-mail or FTP </li></ul></ul><ul><ul><li>Passwords to access files and computers </li></ul></ul><ul><ul><li>Proper data disposal at end of retention period </li></ul></ul>Lawrence Livermore National Laboratory
  30. 30. Data backup best practices <ul><li>Make 3 copies </li></ul><ul><ul><li>Original </li></ul></ul><ul><ul><li>External/local </li></ul></ul><ul><ul><li>External/remote – different geographic area </li></ul></ul><ul><li>Verify recovery is possible </li></ul><ul><ul><li>Checksum validation </li></ul></ul><ul><ul><li>Test file restore after initial setup </li></ul></ul><ul><ul><li>Periodically thereafter </li></ul></ul>
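Checksum validation of a backup copy can be sketched as follows. This is a minimal example, assuming SHA-256 as the checksum algorithm (the slides do not name one); the helper names `sha256_of` and `backup_matches` are hypothetical.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks so that
    large data files do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_matches(original_path: str, backup_path: str) -> bool:
    """True when the backup copy is byte-for-byte identical to the original."""
    return sha256_of(original_path) == sha256_of(backup_path)
```

Recording the digest alongside each file when the backup is first made allows later restore tests to detect silent corruption, even if the original is no longer available for comparison.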
  31. 31. Data backup options <ul><li>Hard drive </li></ul><ul><li>Tape back-up </li></ul><ul><li>Server </li></ul><ul><li>Cloud storage </li></ul><ul><ul><li>Amazon S3 </li></ul></ul><ul><li>Subject repositories/data centers </li></ul><ul><ul><li>Examples: PubChem, Dryad, IRI/LDEO </li></ul></ul>NIH
  32. 32. Sharing requirements <ul><li>How, when, and what you share depends on: </li></ul><ul><ul><ul><li>Data format </li></ul></ul></ul><ul><ul><ul><li>Restrictions on data </li></ul></ul></ul><ul><ul><ul><li>Funder and publisher guidelines </li></ul></ul></ul><ul><ul><ul><li>Customary embargo periods </li></ul></ul></ul><ul><ul><ul><li>Availability of appropriate repositories or other vehicles for sharing </li></ul></ul></ul>NIH
  33. 33. Sample data sharing guidelines
  34. 34. Sharing restrictions <ul><ul><li>Under HIPAA (Health Insurance Portability and Accountability Act), you cannot share information that compromises the confidentiality or privacy of human subjects. Any data resulting from studies using human subjects must be scrubbed of identifying information. </li></ul></ul>
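As a rough illustration of one step in scrubbing identifying information, the sketch below drops named identifier columns from tabular data before sharing. It is only a starting point under stated assumptions: the function name `drop_identifier_columns` and the column names are hypothetical, and removing direct identifiers alone does not by itself satisfy HIPAA or IRB requirements, since combinations of other fields (dates, locations, rare values) can also identify people.

```python
import csv
import io


def drop_identifier_columns(csv_text: str, identifier_columns: set[str]) -> str:
    """Return CSV text with the listed identifier columns removed.

    Assumption: the first row is a header naming each column.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    kept = [name for name in (reader.fieldnames or []) if name not in identifier_columns]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=kept, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Copy only the columns that are safe to keep.
        writer.writerow({name: row[name] for name in kept})
    return out.getvalue()


# Hypothetical example data; "name" is treated as a direct identifier.
raw = "name,age,score\nAnn,34,0.9\nBob,41,0.7\n"
shared = drop_identifier_columns(raw, {"name"})
```

Consult the IRB and the applicable Columbia policies before treating any data set as de-identified.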
  35. 35. Sharing restrictions <ul><ul><li>You may have other reasons that justify not sharing your data, and you can detail these in your data management plan. Funders may allow exceptions to data sharing policies. </li></ul></ul>
  36. 36. Don’t forget metadata <ul><li>Metadata is structured information that describes, explains, locates, and otherwise makes it easier to retrieve and use an information resource. </li></ul>BLM NTSC
  37. 37. Metadata facilitates use of your data <ul><ul><li>“The metadata accompanying your data should be written for a user 20 years into the future -- what does that person need to know to use your data properly? Prepare the metadata for a user who is unfamiliar with your project, methods, or observations.” </li></ul></ul><ul><ul><li>Oak Ridge National Laboratory </li></ul></ul><ul><ul><li>Distributed Active Archive Center </li></ul></ul>
  38. 38. Major metadata standards <ul><li>Darwin Core (biology) </li></ul><ul><li>DDI (Data Documentation Initiative, for social and behavioral sciences data) </li></ul><ul><li>DIF (Directory Interchange Format, for scientific data) </li></ul><ul><li>EML (Ecological Metadata Language) </li></ul><ul><li>FGDC/CSDGM (geographic data) </li></ul><ul><li>NBII (National Biological Information Infrastructure) </li></ul>
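A simple way to make sure a data set travels with the descriptive information a future user would need is to check a metadata record for completeness before writing it to a sidecar file. The sketch below is illustrative only: the field list in `REQUIRED_FIELDS` and the function names are assumptions, and they are not drawn from any of the standards listed above. A discipline-specific standard, where one exists, should take precedence.

```python
import json

# Hypothetical minimum set of descriptive fields for a data set.
REQUIRED_FIELDS = (
    "title", "creator", "date_collected",
    "description", "methods", "units", "license",
)


def missing_metadata_fields(record: dict) -> list[str]:
    """List required fields that are absent or blank in the record."""
    return [name for name in REQUIRED_FIELDS if not str(record.get(name, "")).strip()]


def to_sidecar_json(record: dict) -> str:
    """Serialize a complete metadata record as JSON text, suitable for
    saving next to the data file (for example as `dataset.json`)."""
    missing = missing_metadata_fields(record)
    if missing:
        raise ValueError("metadata record is missing: " + ", ".join(missing))
    return json.dumps(record, indent=2, sort_keys=True)
```

Plain-text JSON is used here because it is nonproprietary and readable without special software, which fits the archival format guidance earlier in the deck.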
  39. 39. Repositories for data sharing <ul><li>Online data repositories </li></ul><ul><ul><li>organized around institutions or subjects </li></ul></ul><ul><ul><li>often open access </li></ul></ul><ul><ul><li>archival, not active </li></ul></ul><ul><ul><li>may offer: </li></ul></ul><ul><ul><ul><ul><li>long-term preservation and access </li></ul></ul></ul></ul><ul><ul><ul><ul><li>search engine optimization </li></ul></ul></ul></ul><ul><ul><ul><ul><li>permanent URL or DOI </li></ul></ul></ul></ul>
  40. 40. Columbia’s repository Academic Commons accepts materials from faculty, students, and staff. <ul><ul><li>secure replicated storage </li></ul></ul><ul><ul><li>accurate metadata </li></ul></ul><ul><ul><li>globally accessible repository </li></ul></ul><ul><ul><li>contextual linking between data and publications </li></ul></ul><ul><ul><li>a permanent URL </li></ul></ul>
  41. 41. Some subject-based repositories: space science mission repository; cryospheric data repository; macromolecular structural data repository; marine data repository; biological activities of small molecules data repository
  42. 42. More subject-based repositories: deep-sea core samples repository housed at LDEO; data repository for archeology and related disciplines; basic and applied biosciences data repository; geodesy data repository; social science data repository
  43. 43. Licensing your data <ul><li>Copyright issues around data can be complex </li></ul><ul><li>These groups offer “ready-made” licenses for data that help clarify any restrictions on reuse </li></ul>
  44. 44. For more information <ul><li>Data Management section of Scholarly Communication Program website </li></ul><ul><li>Sponsored Projects Administration </li></ul><ul><li>Office of Research Compliance and Training </li></ul><ul><li>Center for Digital Research and Scholarship </li></ul><ul><li>CUIT </li></ul><ul><li>Computing and Technology section of Columbia Administrative Policy Library </li></ul>