Responding to data management questions
Presentation Transcript
Responding to SCIENTIFIC DATA MANAGEMENT Questions: A Guide for UC Berkeley Library Personnel. Jeffery Loo, August 2012
Motivation for this class: Researchers may have questions about data management. Library personnel may help researchers find answers.
Class goals: 1. Identify the questions our patrons may have about data management. 2. Examine ways of helping patrons find answers.
3-part agenda: 1. Why is data management important? 2. Responding to core data management questions (Planning to collect data; Saving a data file; Describing and documenting data for re-use; Sharing data files; Using data ethically). 3. Open discussion.
Why is data management important?
0. What is data management? Activities for: generating high-quality data; safely storing the data; reducing limitations to data sharing.
NSF data management plan: a requirement as of January 18, 2011. Your plans to organize, store, and share data. http://www.nsf.gov/bfa/dias/policy/dmp.jsp
“My Data Management Plan – a satire” by Dr. C. Titus Brown, Assistant Professor, Michigan State University
Dear NSF,
I am happy to respond to your request for a 2-page Data Management Plan.
First of all, let me say how enthusiastic I am that you have embraced this new field of "large scale data analysis". Ever since I started working with large Avida data sets in 1993, […] I have seen the need for a systematic plan to manage the data. It is nice to see NSF stepping up to the plate in such a timely manner, and I am happy to comply.
Now, as to my actual data management plan, here is how I plan to deal with research data in the future.
I will store all data on at least one, and possibly up to 50, hard drives in my lab.
The directory structure will be custom, not self-explanatory, and in no way documented or described.
Students working with the data will be encouraged to make their own copies and modify them as they please, in order to ensure that no one can ever figure out what the actual real raw data is.
Backups will rarely, if ever, be done.
When required to make the data available by my program manager, my collaborators, and ultimately by law, I will grudgingly do so by placing the raw data on an FTP site, named with UUIDs like 4e283d36-61c4-11df-9a26-edddf420622d. I will under no circumstances make any attempt to provide analysis source code, documentation for formats, or any metadata with the raw data. When requested (and ONLY when requested), I will provide an Excel spreadsheet linking the names to data sets with published results. This spreadsheet will likely be wrong -- but since no one will be able to analyze the data, that won't matter.
[….]
Note, we didn't use a version control system, either. […] And our repository is not publicly available - you have to beg for permission. Note, I only answer e-mail on every other Tuesday.
Any design notes on the data analysis are in our private e-mail, and we will fight to the death -- up to and including ignoring FOIA requests -- to prevent you from obtaining them.
Meanwhile we will continue publishing exciting sounding (but irreproducible) analyses, and submitting grants based on them, because that's the only thing that the reviewers care about.
sincerely yours,
--titus
(representing every computational scientist in the world.)
Data challenges: informal data management practices; distributed, uncoordinated effort; concerns about data re-use. “Can’t you ever relax?” Data management may be ad lib.
Lots to do for data management! Ensure long-term access. Facilitate sharing. Prepare for future re-use.
Data activities in the research workflow. Source: http://www2.lib.virginia.edu/brown/data/lifecycle.html
Lots of different research products: models and computational simulations; images, photographs, audio, and video; instrument readings; maps; software; artifacts and samples; physical collections; and more …
Summary: The importance of data management. It is a requirement. It is beneficial. It is evolving and complex.
Responding to basic data management questions
Planning to collect data; Saving a data file; Describing and documenting data for re-use; Sharing data files; Using data ethically
Planning to collect data
1a. What are research data? Data collected via observations, experiments, or simulations, or derived or compiled: models and computational simulations; images, photographs, audio, and video; instrument readings; maps; software; artifacts and samples; physical collections; and many other research products and inputs …
1b. What is a data management plan? A plan for organizing, storing, and sharing data.
1c. How do I write a data management plan? 1. Know the requirements. 2. Find examples and templates. 3. Try the DMP Tool.
Overview of data management requirements and guidelines
Have data management requirements: National Science Foundation (NSF); National Oceanic and Atmospheric Administration (NOAA); National Institutes of Health (NIH); National Endowment for the Humanities (NEH): Office of Digital Humanities
Listing of requirements and guidelines
Have recommendations and guidelines only: ICPSR for social sciences data; National Aeronautics and Space Administration (NASA); Institute of Museum and Library Services (IMLS); Environmental Protection Agency (EPA)
NSF requirements: A data management plan of ≤ 2 pages that describes how data will be managed, disseminated, and shared. The plan undergoes peer review.
Writing an NSF data management plan: Specific requirements vary by NSF division. In general, describe: types of research data and materials produced; standards for data format, content, and metadata; policies for access and sharing; policies for re-use, re-distribution, and derivatives; plans for archiving and preserving. You can explain why data will not be shared. Example DMP
NIH requirements: Timely data sharing encouraged. If requesting ≥ $500k per year, a plan is required. Describe how data will be shared or why sharing is not possible. In the final progress report, describe data sharing actions taken.
Writing an NIH data sharing plan: A brief paragraph. Suggested topics: schedule for sharing; format of the data; documentation of the data; analytic tools provided; data-sharing agreements (criteria and conditions); mode of data sharing.
NIH plan example 1
The proposed research will involve a small sample (less than 20 subjects) recruited from clinical facilities in the New York City area with Williams syndrome. This rare craniofacial disorder is associated with distinguishing facial features, as well as mental retardation. Even with the removal of all identifiers, we believe that it would be difficult if not impossible to protect the identities of subjects given the physical characteristics of subjects, the type of clinical data (including imaging) that we will be collecting, and the relatively restricted area from which we are recruiting subjects. Therefore, we are not planning to share the data.
NIH plan example 2
This application requests support to collect public-use data from a survey of more than 22,000 Americans over the age of 50 every 2 years. Data products from this study will be made available without cost to researchers and analysts. https://ssl.isr.umich.edu/hrs/
User registration is required in order to access or download files. As part of the registration process, users must agree to the conditions of use governing access to the public release data, including restrictions against attempting to identify study participants, destruction of the data after analyses are completed, reporting responsibilities, restrictions on redistribution of the data to third parties, and proper acknowledgement of the data resource. Registered users will receive user support, as well as information related to errors in the data, future releases, workshops, and publication lists. The information provided to users will not be used for commercial purposes, and will not be redistributed to third parties.
Find examples and templates: guides, templates, examples. http://www.lib.berkeley.edu/sciences/data/guide
Online service for building data plans: Step-by-step instructions for meeting funding agency requirements. https://dmp.cdlib.org/
Activity: Create a data management plan with the DMPTool
Saving a data file
Hall of fame anecdote http://www.youtube.com/watch?v=J6HtRWyiL98
2a. Where can I store data safely? Traditional storage is not always sufficient: personal computers; departmental/university servers. Two additional types of storage: archives and repositories; cloud storage (storing files in an online site).
Archives and repositories: Special types of online storage sites. Long-term storage, management, and preservation. Search, download, and analytic functionalities.
Institutional archives and repositories: Merritt http://merritt.cdlib.org/; Data repository management services at UCB http://ist.berkeley.edu/ds
Public archive and repository: Long-term access, open to the public. GenBank http://www.ncbi.nlm.nih.gov/genbank/; ICPSR http://www.icpsr.umich.edu/icpsrweb/ICPSR/
3rd party cloud storage: Amazon S3, Google Drive, Dropbox. Beware of posting sensitive data/files.
Deciding on storage. Consider: permanence, oversight, security.
Summary of storage options: Traditional storage (personal computers; departmental/university servers); Archives and repositories (institutional; public); Cloud storage.
2b. How do I ensure long-term access to my data?
Recommended file formats:
• Non-proprietary
• Uncompressed and unencrypted (okay to encrypt sensitive data)
• Common usage by your research community
• Standard representation (e.g., ASCII text, Unicode)
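One way to follow these recommendations for tabular data is to save it as a plain-text CSV file in Unicode (UTF-8). The sketch below is only an illustration: the function name save_as_csv and its parameters are hypothetical, and a research community may well prefer a different open format.

```python
import csv
from pathlib import Path


def save_as_csv(rows: list[list[str]], destination: Path) -> Path:
    """Write rows to an uncompressed, non-proprietary CSV file encoded as UTF-8."""
    # newline="" lets the csv module control line endings itself.
    with destination.open("w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(rows)
    return destination
```

A file written this way can be opened with any text editor or spreadsheet program, which is the point of choosing a non-proprietary format.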
2c. How do I back up my data? Keep three copies: 1. Original master. 2. Local external storage. 3. Remote external storage: UC Berkeley IST backup services; 3rd party services (Amazon S3, Elephant Drive, Jungle Disk, Mozy, Carbonite Free, Dropbox); email a copy to yourself.
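The three-copy idea can be sketched as a small script that copies the original master into a local and a remote backup folder and checks each copy against the master. This is a minimal illustration only: the function names are hypothetical, the "remote" location is assumed to be reachable as an ordinary folder (for example a mounted network drive), and none of the services listed above are called.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(master: Path, backup_dirs: list[Path]) -> list[Path]:
    """Copy the master file into each backup folder and verify every copy."""
    expected = sha256_of(master)
    copies = []
    for directory in backup_dirs:
        directory.mkdir(parents=True, exist_ok=True)
        copy = Path(shutil.copy2(master, directory / master.name))
        # A mismatch means the copy is corrupt or incomplete.
        if sha256_of(copy) != expected:
            raise IOError(f"Backup at {copy} does not match the master file")
        copies.append(copy)
    return copies
```

Calling back_up(master, [local_dir, remote_dir]) leaves the master untouched and returns the paths of the two verified copies.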
Activity: Save a file in Merritt
Describing and documenting data (metadata) to prepare for re-use
What countries have a five-pointed star on their national flag?
DOI: 10.1126/science.1207745 “outsourcing” our memory “we don’t remember information as well, when we expect to find it on a computer later”
If we outsource our memory to computers … We need good organization structures to: find data from the past quickly and completely; understand data from the past. It helps to document and describe data: “Assign metadata”
3a. What are metadata? Documentation and descriptions about data. Metadata help us find, understand, and know how to use the files in the future.
3c. What about the data do we document?
Descriptive metadata elements: Title; Creator or contact; Date; Experimental conditions; Methodology; Version
Administrative metadata elements: Dictionary or codebook to explain the data variables; Tools and software needed for processing or visualizing the data
Structural metadata elements: File formats; File names
3d. How do I record metadata? Option 1: Write metadata, save as readme.txt, and store in the file folder with the data.
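Option 1 could be sketched as follows: metadata fields are written one per line to readme.txt inside the folder that holds the data. The function name write_readme and the "Field: value" layout are assumptions made for illustration, not a required format.

```python
from pathlib import Path


def write_readme(data_folder: Path, metadata: dict[str, str]) -> Path:
    """Write metadata as 'Field: value' lines to readme.txt in the data folder."""
    lines = [f"{field}: {value}" for field, value in metadata.items()]
    readme = data_folder / "readme.txt"
    readme.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return readme
```

For example, write_readme(folder, {"Title": "Effect of salt on ice cream production efficiency", "Methodology": "..."}) stores the description next to the data files it describes.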
Option 2: Metadata form/file in an archive/repository. http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html
Option 3: Annotate. <title>Effect of salt on ice cream production efficiency</title> XML, a popular system for annotating data. http://www.w3schools.com/xml/
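An annotation like the <title> element above can be generated with Python's standard xml.etree.ElementTree module. In this sketch the wrapping <metadata> element and the function name build_metadata_xml are hypothetical choices; a real project would normally follow the metadata schema of its discipline or repository.

```python
import xml.etree.ElementTree as ET


def build_metadata_xml(fields: dict[str, str]) -> str:
    """Wrap each field in its own XML element under a <metadata> root."""
    root = ET.Element("metadata")
    for tag, text in fields.items():
        # Each key must be a valid XML element name (no spaces, for instance).
        ET.SubElement(root, tag).text = text
    return ET.tostring(root, encoding="unicode")
```

build_metadata_xml({"title": "Effect of salt on ice cream production efficiency"}) returns <metadata><title>Effect of salt on ice cream production efficiency</title></metadata>.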
Australia Brazil United States of America Cape Verde Ethiopia
4a. Why share data?
Historic data sharing: Galileo, Newton, Huygens, Hooke. Anagrams to secure discoveries, versus the “open science revolution” of journals today.
Open science: Share research data, products, and communications openly
Potential benefits:
• Protects unique data that cannot be readily replicated
• Reinforces open scientific inquiry
• Encourages diversity of analysis and opinion
• Promotes new lines of research
• Makes possible the testing of new or alternative hypotheses and methods of analysis
• Supports studies on data collection methods and measurement
• Facilitates the education of new researchers
• Enables the exploration of topics not envisioned by the initial investigators
• Permits the creation of new datasets when data from multiple sources are combined
• Provides content for scientific education
Data sharing examples Crystal structure of M-PMV retroviral protease
Increased citation rate
Funding agency policies: NIH Data Sharing Policy; NSF Data Sharing Policy; data management plan for grant applications.
Journal expectations: Data sharing as a term of publication
Data sharing as a sacred act
Summary: Why share data? Open science; culture of sharing; increased citations; funding agency requirements; condition for publications.
4b. How do I share data? Personal sharing. Share-upon-request: Email me for a copy! Self-archive: Download from my personal website!
Institutional archive or repository: UC3 Merritt repository
Public archive or repository
Ideal characteristics: popular with national/global coverage; specific to your discipline; offers long-term preservation.
Find an archive/repository: ask colleagues; search http://databib.org/
The Ancient Agora of Athens
4c. Public versus institutional archives and repositories – which to use?
Public archives/repositories: Create comprehensive dataset for a larger research problem space. Domain-specific archives/repositories may provide better support.
Institutional archives/repositories: May restrict to a smaller audience. May offer greater control of your data.
Summary: How to share data. Data sharing: Personal (share upon request; self-archive on a personal website); Public (publish in a journal); Archives and repositories (institutional; public).
4d. What do I share? Be selective. Recognize restrictions (privacy and confidentiality). Online services for sharing among your team: Research Hub; 3rd party services.
4e. How do I help colleagues find my data, even after I move them?
Help others find your data. Berkeley: www.berkeley.edu/mystuff/super-data.csv (old URL is kaput). File moves to Stanford: www.stanford.edu/mystuff/super-data.cs
Try permanent identifiers. DOI: Digital object identifier. Resolve a DOI by visiting http://dx.doi.org/ followed by the DOI. The file can move, but the DOI remains the same. The DOI record stores location details.
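Resolving a DOI as described, by appending it to http://dx.doi.org/, can be sketched in a few lines. The function name doi_to_url is hypothetical; the sketch only builds the address and does not contact the resolver.

```python
from urllib.parse import quote

RESOLVER = "http://dx.doi.org/"  # resolver address given on the slide


def doi_to_url(doi: str) -> str:
    """Build the resolver URL for a DOI, accepting an optional 'doi:' prefix."""
    doi = doi.strip()
    if doi.lower().startswith("doi:"):
        doi = doi[4:].strip()
    # Percent-encode unusual characters but keep the '/' between prefix and suffix.
    return RESOLVER + quote(doi, safe="/")
```

For the study cited in this class, doi_to_url("doi:10.1038/435737a") gives http://dx.doi.org/10.1038/435737a.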
Generate permanent identifiers: http://n2t.net/ezid Subscription through the UCB Library. Request your free account by emailing firstname.lastname@example.org
Activity: Generate a DOI in EZID
Using data ethically
3247 respondents. 0.3% admitted to falsification or “cooking” research data. About 1 in 3 confessed to committing at least one of 10 serious misbehaviors. Study by Martinson et al., 2005. Source: doi:10.1038/435737a. Motivated by increasing pressure to publish papers and win
5a. How do I use data ethically in an evolving data management landscape?
Prevent distortions and manipulations: Keep raw original data. Log all changes made.
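Both practices can be sketched together: work on a copy so the raw file is never edited, and append a timestamped line to a change log for every modification. The function names and the tab-separated log layout are assumptions for illustration only.

```python
import shutil
from datetime import datetime, timezone
from pathlib import Path


def start_working_copy(raw_file: Path, work_dir: Path) -> Path:
    """Copy the raw file into work_dir so edits never touch the original."""
    work_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(raw_file, work_dir / raw_file.name))


def log_change(log_file: Path, data_file: Path, description: str) -> str:
    """Append one timestamped line describing a change, and return that line."""
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    entry = f"{stamp}\t{data_file.name}\t{description}"
    with log_file.open("a", encoding="utf-8") as handle:
        handle.write(entry + "\n")
    return entry
```

Each edit to the working copy is followed by a log_change call, so the log records what was changed and when while the raw original stays intact.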
Stay current with data requirements: Review for changes to policies by funding agencies, university regulations, and federal and state governments.
5b. How do I set permissions and restrictions on the use of my data? Data license: a legal instrument. Permits second parties to do things with the data (or not).
Preparing a data license: Write your own license OR use a standard license. Creative Commons http://wiki.creativecommons.org/Data; Open Data Commons http://opendatacommons.org/ Public domain options: • Creative Commons Zero (CC0) • Open Data Commons Public Domain Dedication and License (PDDL)
Mechanism for licensing data: Attach the license to the data by including: • a statement that the data is released under a license • a mechanism for retrieving the license full text
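A minimal sketch of this mechanism: a short notice file placed beside the data carries the statement of release and a link for retrieving the full license text. The function name attach_license, the file name LICENSE.txt, and the wording of the notice are assumptions for illustration, not legal language.

```python
from pathlib import Path


def attach_license(data_folder: Path, license_name: str, license_url: str) -> Path:
    """Write a notice naming the license and where to retrieve its full text."""
    statement = (
        f"The data in this folder are released under {license_name}.\n"
        f"Full license text: {license_url}\n"
    )
    notice = data_folder / "LICENSE.txt"
    notice.write_text(statement, encoding="utf-8")
    return notice
```

For example, attach_license(folder, "Creative Commons Zero (CC0)", "https://creativecommons.org/publicdomain/zero/1.0/") covers both items: the statement and the retrieval mechanism.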
Haiku summary
Data are precious
Safely store and share widely
Good for all research
“What’s the takeaway on all this?”
Part 3: Discussion
1. Which of the data management literacies/tools are the key ones for the library to support and facilitate?
2. How may we include these literacies/tools in our outreach, instruction, reference, collection development, systems, or ________? Any new directions?