Responding to data management questions
Usage Rights

CC Attribution-NonCommercial-ShareAlike License

  • Show cartoon. Write on the board: Merritt username and password. Set up speakers. Distribute handouts and exercise worksheets. Write on the board: About our class: introductory; covers a lot of ground; provides hands-on practice. This class will help you: respond to patrons’ data management questions; foster data management literacy among our patrons.
  • Good morning. Thank you for coming. My name is Jeffery Loo, and I’m the Chemical Informatics Librarian at the Chemistry & Chemical Engineering Library. What is the motivation for this class? Jean McKenzie, Brian Quigley, Mary Ann Mahoney, and I are members of the ARL/DLF E-science Institute. For this program, we interviewed Berkeley faculty, campus administrators, and researchers about their data management experiences and needs. We learned that researchers see the changing landscape of data management and have many questions about it. And in the Spring, I gave a presentation for Berkeley’s Responsible Conduct of Research Seminar Series (http://rac.berkeley.edu/rcr.html). I learned from the post-docs and graduate students in the audience that they had a lot of questions about the core requirements and practices of data management. So the motivation for this class is that researchers may have lots of data management questions, and library personnel may help them find answers.
  • Today we are going to discuss data management questions and then examine ways of helping patrons find answers, with a focus on UC resources. While our focus is on scientific data management, many of the principles and issues discussed will apply to the social sciences and humanities. Our discussion is in three parts.
  • First we’ll explore why data management is so important. Afterwards, we will look at responding to data management questions. These questions are organized along the “workflow” of a data file from collection through to ethical use. We are omitting data analysis because our researchers are the experts in that arena. There will be hands-on exercises, and I thank Elliott and Lisa, who will be available to help during the activities. Your handout lists the questions we’ll explore today. All the responses are outlined in a website. And finally, we will have an optional discussion about promoting data management literacy and services.
  • Part 1. Let’s look at why data management is so important.
  • But first, what is data management? It is the set of activities that help us generate high-quality data, store it safely, and reduce limitations on sharing it.
  • I want to share an interesting case that highlights some important issues and motivations behind data management. On January 18, 2011, the National Science Foundation instituted a new requirement. NSF grant applications need a document known as a data management plan. The plan is no longer than 2 pages, and it describes how researchers will organize, store, and share their research data.
  • This is Dr. C. Titus Brown, an Assistant Professor at the home of the Spartans, Michigan State University, in Computer Science and Engineering and Microbiology and Molecular Genetics. In anticipation of the NSF requirement, he wrote a satirical data management plan.
  • While his plan is humorous, Professor Brown points out some real problems with research data. Data management may not be as formal or as standardized a process as other parts of the research workflow – like grant applications and article publication. So data could be lost and may not be treated according to their value.
  • In addition to these challenges, data management includes a lot of different activities: ensuring long-term access to data, facilitating data sharing, and preparing data for future re-use by other researchers.
  • These three goals of data management can get overwhelming very quickly. As you can see, there are data activities for every step of the research process. Source: http://www2.lib.virginia.edu/brown/data/lifecycle.html
  • Another complicating factor is the variety of research data. It is more than just text and numbers on spreadsheets. You want to protect as many of your research products as possible, including:
  • To summarize why data management is important: It is a requirement of funding agencies. It is beneficial because it protects research works. It is evolving and complex, so there are a large number of activities to be supported and conducted.
  • In Part 2, let’s now explore answers to data management questions.
  • These questions will be organized along the “workflow” of a data file from planning data collection through to using data ethically.
  • Patrons may ask: What are research data? Data can be defined as a variety of research products and outputs. They can be collected via observations, experiments, or simulations, or be derived or compiled. Again, they include the research products listed here. In general, data are any form of research output or input.
  • What is a data management plan? It is a written plan on how you will organize, store, and share your data. It is important because large funding agencies like the NSF and NIH are requiring these plans for grant applications.
  • How do I write a data management plan? First, know the requirements set by the funding agency. Then look at examples and templates for ideas on managing your data. You may find the DMP Tool helpful in writing a plan. Let me explain these three points.
  • The requirements for data management vary by funding agency and discipline. Some agencies have requirements, while others simply have guidelines and recommendations for data management. Tufts University has this helpful listing: http://researchguides.library.tufts.edu/content.php?pid=167647&sid=1412586 ICPSR has guidelines for the social sciences. Probably, as time moves on, more agencies will set data management requirements. But I want to focus on two data management requirements for the sciences: those of the National Science Foundation and the National Institutes of Health. The issues and procedures for these formal requirements can be helpful for other disciplines.
  • Let’s look at the NSF requirement first. The data management plan is a supplementary document no more than two pages long. It should describe how the proposal will conform to the NSF policy on the dissemination and sharing of research results. This plan undergoes peer review. How is it reviewed? http://www.nsf.gov/pubs/policydocs/pappguide/nsf11001/gpg_2.jsp#IIC2j
  • What do you write in an NSF data management plan? The specific content requirements vary among the NSF divisions and groups. In general, your plan discusses: the types of data and research materials produced; the standards used for the data format and content (and its metadata); policies for access and sharing; policies for re-use, re-distribution, and derivatives; and plans for archiving and preserving data and research products. You can state that the data will not be publicly shared, but you need to explain the reason. This rationale then undergoes peer review. Here is an example NSF data management plan from a UCSD geosciences research group: http://rci.ucsd.edu/_files/DMP%20Example%20SIO%20OCE.pdf As you see, it doesn’t have to be long. It lists the different types of data generated, the collaborator responsible for the data, and the database or websites where the data set will be shared.
  • Let’s look at the NIH requirements for data management. A plan is required if you are requesting $500,000 or more in any single year from NIH. You describe plans for sharing the final research data, or state why data sharing is not possible. And in the final progress report, you report the data sharing actions taken. http://grants.nih.gov/grants/policy/data_sharing/data_sharing_guidance.htm
  • Your plan is a brief paragraph. It is suggested that you briefly describe the expected schedule for data sharing, the format of the final dataset, the documentation to be provided, whether or not any analytic tools will be provided, whether or not a data-sharing agreement will be required (and, if so, its criteria and conditions), and the mode of data sharing.
  • Here’s example 1. It doesn’t need to be long. The researchers explain that they do not plan to share the data. They believe it would be too difficult to protect the identities of subjects.
  • Here in example 2, the researchers are sharing their data online. User registration is required with the conditions that the data are not used for commercial purposes and they will not be distributed to third parties.
  • If you’d like more details on data management plans, check out our Sciences Library webpage on the subject: http://www.lib.berkeley.edu/sciences/data/guide You’ll find guides, templates, and examples of data management plans.
  • To help you write a plan, there is an online service that provides step-by-step instructions and guidance for meeting specific funding agency requirements. It is the DMPTool. It is developed by UC3 (the University of California Curation Center) at the California Digital Library along with partner universities and research centers. It’s free, and you can log in with your CalNet ID.
  • Exercise time. Let’s create a data management plan in the DMPTool.
  • Now, let’s look at some questions about saving a data file.
  • Here’s an article in the Chronicle of Higher Education documenting grad students who have had their computers stolen and therefore lost large amounts of their intellectual blood, sweat, and tears. The stories are gut-wrenching. One Berkeley grad student had her computer and her backup hard drive stolen. As a result, she was afraid she could not finish her dissertation in time for a post-doc position she was offered. Source: http://chronicle.com/article/When-a-Thief-Takes-Your/126679/=
  • Anjali Adukia is an education graduate student in the hall of fame of anecdotes about data loss. Her experience is so colorful and so unlikely that Harvard University’s Graduate School of Education asked her to re-enact it for a video – which I have to show you now. What she left out was that she assaulted the monkey near a Gandhi ashram – a shrine honoring a man of peace.
  • So where can you store your data safely? We have seen that traditional storage is not always sufficient. With personal computers and departmental or university servers, you need to secure your computer from theft and damage, and make sure there is a backup. Let’s consider two additional types of storage. The first is archives and repositories. The second is cloud-based storage, that is, storing digital files in an online site on the Internet. The advantages are that you can access your files wherever you have online access, and that the organization that runs the online storage site is responsible for backups and server maintenance.
  • Let’s look at archives and repositories first. These are a special type of online storage site. They provide long-term storage, management, and preservation of digital files, and they usually have functionalities for searching, downloading, and using the stored data.
  • You can store data in an archive or repository run by an institution. The University of California community can use the Merritt repository. Additionally, UC Berkeley's IST provides data repository management support services.
  • You can also store your data in a public archive or repository – it’s public, so your data will be available for long-term access across a wide audience. The reason for this openness is to share the data. A popular example is the GenBank database of nucleotide sequences and their protein translations. A prominent example in the social sciences is ICPSR. An institutional repository is different from a public repository. An institutional repository is run by a research or academic institution, which can determine how the content is shared, if at all. A public repository is not limited to a particular institution and stores data from a number of different groups – and the data is publicly accessible for free.
  • You can also store your data in an online storage service provided by a private company. For example, there are Amazon S3, Google Drive, and Dropbox. Typically, you buy storage space (sometimes the first few MBs are free) and then upload the data. An important consideration: beware of loading sensitive and confidential data into 3rd party services.
  • As you’re deciding which option is right for you, consider three issues: How long do you want to keep the data? – an issue of permanence. Who manages the storage system? – an issue of oversight. Is it safe from unauthorized access? – an issue of security.
  • To summarize: You have traditional storage options like personal computers and departmental and university servers. Next, there are institutional and public archives and repositories. And then, there is cloud storage.
  • How do I ensure long-term access to my data? So that you can open and read your data files in the future, it is recommended that you use file formats that are: non-proprietary; uncompressed and unencrypted; commonly used by your research community; and in a standard representation (e.g., ASCII text, Unicode). This table outlines examples of ideal and not so ideal file formats.
  • So that you don’t get profiled in the Chronicle of Higher Education for data loss or theft, backup is very important. How do you back up? Try to have 3 copies: an original master copy; a copy on a local external storage device, like an external hard drive in your lab or office; and a copy on remote external storage, like an online storage site or a hard drive in a physically removed location. UC Berkeley IST offers storage and backup services. There are third party companies that do backups for you, and some do them automatically. But at the very least, email a copy of your files to yourself.
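The three-copy rule can be sketched in a few lines of Python. This is only an illustration, not a campus service: the function names `sha256_of` and `back_up` and the folder names are hypothetical, and the "remote" folder stands in for whatever online or off-site storage a researcher actually uses. Each copy is checked against the master with a checksum so that a damaged copy is noticed right away.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 checksum of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(master: Path, destinations: list) -> list:
    """Copy the master file into each destination folder and verify each copy.

    `destinations` is a list of folders (hypothetical examples: an external
    drive and a folder synced to remote storage).
    """
    expected = sha256_of(master)
    copies = []
    for folder in destinations:
        folder = Path(folder)
        folder.mkdir(parents=True, exist_ok=True)
        target = folder / master.name
        shutil.copy2(master, target)  # copy2 also keeps file timestamps
        if sha256_of(target) != expected:
            raise IOError(f"Backup at {target} does not match the master copy")
        copies.append(target)
    return copies
```

For example, `back_up(Path("results.csv"), [Path("/Volumes/LabDrive"), Path("remote_sync")])` would leave a master plus two verified copies, assuming those two folders exist on your machine.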
  • Exercise time. Let’s save a file in the Merritt repository.
  • Let’s look at questions about preparing your data for re-use by describing and documenting files.
  • Before we discuss the questions, let me ask you: What countries have a five-pointed star on their national flag? Chances are you won’t be thinking through a list of countries and their flags – instead, you’re probably thinking about where the nearest Web connection is and what search terms to use in Google.
  • This study published in Science shows us that as information is continually and instantaneously available online, our cognitive habits are changing. We are outsourcing our memory to machines. We don’t remember information as well when we expect to find it later on a computer.
  • So if we are not going to remember information as well, we need to make sure that we can find data from the past quickly and completely, and that we can understand this data when we open it. So outsourcing our memory means that we need good organizational structures. The simple act of labeling something helps you to find the file and understand how to use it. The process of documenting, describing, and labeling your data has a fancy name – you assign metadata. The metadata are the labels you add to data.
  • To answer the question, what are metadata? Metadata are the documentation and descriptions you write about your data. And why are they important? Metadata help us find, understand, and know how to use the files in the future.
  • So what about the data do you document? There are different types of information and characteristics about your data, known as metadata elements. There are descriptive metadata elements like… For some disciplines, there are standards as to what metadata elements to use and how to apply them (check our guide).
  • So how do you record this metadata? The simplest way is to create a text file (.txt) and write down your metadata elements. You store this file in the same folder as your data.
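The text-file approach could be automated along these lines. This is a minimal sketch: the function name `write_metadata_file`, the `_metadata.txt` naming pattern, and the element names in `EXAMPLE_ELEMENTS` are illustrative assumptions, not a disciplinary standard.

```python
from pathlib import Path

# Hypothetical metadata elements, for illustration only; use the elements
# your discipline's metadata standard calls for.
EXAMPLE_ELEMENTS = {
    "Title": "Super data",
    "Creator": "J. Doe",
    "Date collected": "2012-08-01",
    "Description": "Instrument readings from the summer field season",
}


def write_metadata_file(data_file: Path, elements: dict) -> Path:
    """Write one 'name: value' line per element to a .txt file that sits
    in the same folder as the data file."""
    metadata_file = data_file.with_name(data_file.stem + "_metadata.txt")
    lines = [f"{name}: {value}" for name, value in elements.items()]
    metadata_file.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return metadata_file
```

Calling `write_metadata_file(Path("super-data.csv"), EXAMPLE_ELEMENTS)` would create `super-data_metadata.txt` next to the data file.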
  • Here’s another option. If you upload your data to an archive or repository, you will likely complete a form with metadata details for the file. This metadata is then stored in a database record with your data file.
  • Finally, you can annotate your data. Here is an example. These bracketed parts are known as tags, and they let us know that the enclosed text is the title of the file. A popular system for annotating your data is XML. You can learn more about XML at http://www.w3schools.com/xml/
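To make the idea of tags concrete, here is a small sketch using Python's standard `xml.etree.ElementTree` module. The tag names `dataset`, `title`, and `creator` and the function name `annotate_dataset` are made up for illustration; a real project would follow the metadata standard of its discipline.

```python
import xml.etree.ElementTree as ET


def annotate_dataset(title: str, creator: str) -> str:
    """Wrap descriptive details in XML tags and return the XML as text."""
    record = ET.Element("dataset")            # outer tag for the whole record
    ET.SubElement(record, "title").text = title
    ET.SubElement(record, "creator").text = creator
    return ET.tostring(record, encoding="unicode")


print(annotate_dataset("Super data", "J. Doe"))
# <dataset><title>Super data</title><creator>J. Doe</creator></dataset>
```

The `<title>` and `</title>` pair tells a reader, or a program, that the enclosed text is the title.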
  • In case you were wondering, here are a few countries with 5-pointed stars on their national flags.
  • Let’s look at sharing data files now.
  • Why share data? This is a fundamental and longstanding question … so it deserves a long answer!
  • In the past, scientific communication wasn’t as open as it is today. Scientists like Galileo, Newton, Huygens, and Hooke were sending each other communications about their research with parts written in indecipherable anagrams. This allowed the scientists to discuss their discoveries without giving away their methods and results. If another scientist claimed the same discovery, you would just unscramble the anagram to secure your claim on a finding. Our journal system today represents a relative “open science revolution.” Source: http://www.dailykos.com/story/2012/03/12/1073780/-Book-Review-Nielsen-s-Reinventing-Discovery
  • Currently, there is a drive for a new movement of “open science” – sharing scientific data and publications freely with all. Here is a list of the potential benefits of open science. If patrons are interested in the topic of open science, direct them to the book Reinventing Discovery: The New Era of Networked Science. The author, physicist Michael Nielsen, explores some interesting and unexpected outcomes of open science.
  • There are prominent cases of successful data sharing for research collaborations, such as Foldit, an online puzzle video game about protein folding. In 2011, players helped to decipher the crystal structure of an AIDS-causing monkey virus. This research problem stumped scientists for 15 years, but it took just 10 days for the players to solve it. All the players of this game had their name on the final publication. Rumor has it, this project may win a Nobel Prize. So if you want to be a Nobel Laureate to get your parking spot, sign up and start playing now.
  • And according to this study, sharing detailed research data is associated with an increased citation rate for your study.
  • Furthermore, funding agencies – such as the NIH and NSF – are expecting and requiring data sharing practices.
  • And prominent journals like Science, Nature, and PLoS have policies that require data sharing as a term of publication. http://www.nature.com/authors/policies/availability.html http://www.sciencemag.org/site/feature/contribinfo/prep/gen_info.xhtml#dataavail http://www.plosone.org/static/policies.action#sharing
  • Data sharing may even be a sacred act. In 2010, the Missionary Church of Kopimism formed. Adherents hold the philosophical belief that all information should be freely distributed and unrestricted, and a central dogma is that file sharing is sacred. The name comes from a Swedish spelling of the words “copy me.” They even registered as a “bona fide” religion with the Swedish government. They have missionaries that we could invite for Open Access Week, and they have even held weddings.
  • So as you can see, there are many forces driving research data sharing.
  • So how do I share data? One option is share-upon-request. You announce that your data is available, and you send a copy to any requestors. Another approach is self-archiving. You post your data to your website and then allow downloads.
  • Alternatively, you could publish your data in a journal, in the supplementary online materials section.
  • You could also deposit your data into an institutional archive or repository and then allow others to download the file.
  • Alternatively, deposit your data into a public archive or repository. In order for your data to be widely accessible and highly used by your research peers, you want a repository that: is popular with your research community and has national or global coverage; is specific to your discipline or research domain; and offers reliable long-term preservation. If patrons are looking to find a repository for their data, recommend they ask their colleagues where they share their data – you want to be where your colleagues are. Alternatively, search Databib. Databib is a searchable catalog of online repositories of research data. It is funded by IMLS with support from Purdue University Libraries. Demonstration: You can browse alphabetically or by subject. Inside a record, you can see the comprehensive cataloging details. Databib is currently being beta-tested and the search functionalities are still in development. You may have to do some term mapping yourself. Another catalogue of online data sets is the Data Hub, http://thedatahub.org/. It is community-run, and is an openly editable data catalogue in the style of Wikipedia.
  • There are public repositories and there are institutional repositories. So which to use? With a public repository, research colleagues across the world are sharing data to build a comprehensive dataset for a larger research problem space. And if the public repository is domain-specific – like a devoted genetics or chemical database – researchers may receive enhanced data support among a community of experienced peer researchers. By comparison, posting data into an institutional repository may restrict the data to a smaller audience – but it may also offer greater control of how the data is stored, organized, and represented.
  • To summarize, there are different ways to share data that can suit researchers’ needs, from personal options to very public ways.
  • What data do I share? Be selective – maybe reconsider sharing preliminary analyses. Next, recognize the restrictions on sharing. Beware of private and confidential data. A few more tips. There are online services and software for sharing data and files just among a research team – and not the world. At UCB, there is the Research Hub service, https://hub.berkeley.edu/, developed by campus IST. It lets you load files for storage and sharing, and fosters group collaboration with blogs, wikis, group documents, and social media features. There are plans for data management and curation services to be added. There are also 3rd party services like Google Drive, which I list in a link you’ll find in the class website.
  • How can I help colleagues find my data? Even after I move to a different institution?
  • Suppose a researcher posted a data file at www.berkeley.edu/mystuff/super-data.csv. If the researcher moves to Stanford and transfers the files over, the new URL could be www.stanford.edu/mystuff/super-data.csv. But then old friends at Cal who have the old URL will get an error message when they try to view that data.
  • To fix this problem of dead links, try assigning permanent identifiers to digital files. For example, DOIs. A DOI is a unique identification number for your file. Think of it like a Social Security Number. If you have a DOI, you can find the file by going to a web browser and entering dx.doi.org/ followed by the DOI (e.g., 10.1126/science.331.6018.692). Every DOI has a database record that stores the location details of a file. You can update the DOI record with a new location if your file moves. So a file can move to a different location, but the DOI remains the same.
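Both halves of this idea can be sketched in Python: turning a DOI into a resolver address, and a toy model of the record that maps an identifier to a location. Everything here is an illustration under stated assumptions. The `https://` scheme is added for convenience, the class `ToyResolver` is not how the real DOI system is implemented, and the identifier `10.9999/example.super-data` is invented.

```python
def doi_to_url(doi: str, resolver: str = "https://dx.doi.org/") -> str:
    """Build the web address you would enter in a browser for a DOI."""
    cleaned = doi.strip()
    if cleaned.lower().startswith("doi:"):
        cleaned = cleaned[4:].strip()
    if not cleaned.startswith("10."):
        raise ValueError(f"Not a DOI (DOIs start with '10.'): {doi!r}")
    return resolver + cleaned


class ToyResolver:
    """Toy stand-in for a DOI database: identifier -> current file location."""

    def __init__(self):
        self._records = {}

    def register(self, identifier: str, location: str) -> None:
        self._records[identifier] = location

    def update(self, identifier: str, new_location: str) -> None:
        if identifier not in self._records:
            raise KeyError(identifier)
        self._records[identifier] = new_location  # the identifier itself never changes

    def resolve(self, identifier: str) -> str:
        return self._records[identifier]


resolver = ToyResolver()
resolver.register("10.9999/example.super-data", "www.berkeley.edu/mystuff/super-data.csv")
resolver.update("10.9999/example.super-data", "www.stanford.edu/mystuff/super-data.csv")
print(resolver.resolve("10.9999/example.super-data"))
# www.stanford.edu/mystuff/super-data.csv
```

Colleagues who saved the identifier rather than the Berkeley URL still reach the file after the move, which is the whole point of a permanent identifier.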
  • At UC Berkeley, researchers can use the EZID service to generate DOIs. EZID also lets you generate an ARK, which is another type of permanent identifier. EZID is brought to you by the University of California Curation Center at the California Digital Library. The UC Berkeley Library pays for a subscription to this service. If Berkeley researchers want a free account, they can email data-consult@lists.berkeley.edu to request one.
  • Activity time! Let’s generate a DOI in EZID.
  • Data sharing is on the rise, and this is changing ideas of how to responsibly use data.
  • Here are some interesting numbers. In a 2005 study reported in Nature, 3247 researchers funded by the NIH responded to a survey about scientific misbehaviors. 0.3% admitted to falsifying or “cooking” research data. And about 1 in 3 confessed to committing at least one of 10 relatively serious acts of professional misbehavior. The researchers proposed that a main cause of this behavior might be the increasing pressure to publish papers and win grants. Source: http://www.nature.com/nature/journal/v435/n7043/full/435737a.html
  • How does a researcher avoid becoming a statistic? How do you use data ethically in an evolving data management landscape? Here are some tips.
  • If you use another researcher’s dataset, cite the data or the study’s paper. Here are some sample citations for data. (Source: http://www.dcc.ac.uk/resources/how-guides/cite-datasets#x1-5000 ) The format varies by the citation style you use, but citations typically include DOIs.
  • To prevent data distortions and manipulations, keep a copy of the raw original data and keep a log of all the changes made to the data.
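A change log can be as simple as a text file with one dated line per change. The sketch below is one possible way to keep such a log, not a prescribed format; the function name `log_change`, the tab-separated layout, and the example file names are assumptions.

```python
import datetime
from pathlib import Path


def log_change(log_file: Path, data_file: str, description: str, when=None) -> str:
    """Append one line (timestamp, file name, what changed) to a change log."""
    when = when or datetime.datetime.now()
    entry = f"{when.isoformat(timespec='seconds')}\t{data_file}\t{description}"
    with log_file.open("a", encoding="utf-8") as handle:  # "a" appends, never overwrites
        handle.write(entry + "\n")
    return entry
```

For example, `log_change(Path("CHANGELOG.txt"), "super-data.csv", "Removed two blank rows")` adds a line to the log; the raw original file stays untouched in a separate folder while edits are made to a working copy.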
  • A dataset may have a license with restrictions on its use. So check for a data license and its terms of use.
  • Periodically review the data expectations and requirements of funding agencies, university regulations, and federal and state governments. They may change over time.
  • How do I set permissions and restrictions on the use of my data? You could prepare a data license. A license is a legal instrument. It allows a rights holder to permit a second party to do things with that data (or not) and may impose restrictions on use. (Paraphrased from DCC, 2012, “How to License Research Data,” http://www.dcc.ac.uk/resources/how-guides/license-research-data#x1-20002)
  • How do you prepare a data license? You could write your own license. Alternatively, there are standard licenses you could apply to data: Creative Commons, http://wiki.creativecommons.org/Data and Open Data Commons, http://opendatacommons.org/ They also offer public domain options. (Paraphrased from DCC, 2012, “How to License Research Data,” http://www.dcc.ac.uk/resources/how-guides/license-research-data#x1-20002)
  • The mechanism for licensing data is simple. You attach the license to the data. What you typically do is include in your data file: a statement that the data is released under a particular license or public domain dedication, and a mechanism for the reader to retrieve the full text of the license itself. (Paraphrased from DCC, 2012, “How to License Research Data,” http://www.dcc.ac.uk/resources/how-guides/license-research-data#x1-20002)
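The two parts of that mechanism, a release statement and a way to retrieve the full license text, can be sketched as a short notice stored with the data. This is an illustration only and not legal advice; the function names, the `LICENSE.txt` file name, and the wording of the notice are assumptions. The example license is the CC BY-NC-SA 3.0 license used for this presentation.

```python
from pathlib import Path


def license_notice(dataset_name: str, license_name: str, license_url: str) -> str:
    """Build a notice with (1) the release statement and (2) where to find
    the full text of the license."""
    return (
        f"{dataset_name} is released under the {license_name} license.\n"
        f"Full license text: {license_url}\n"
    )


def attach_license(data_folder: Path, notice: str) -> Path:
    """Save the notice as LICENSE.txt in the same folder as the data."""
    target = data_folder / "LICENSE.txt"
    target.write_text(notice, encoding="utf-8")
    return target


notice = license_notice(
    "Super data",
    "Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported",
    "http://creativecommons.org/licenses/by-nc-sa/3.0/",
)
print(notice)
```

Keeping the notice in its own file leaves the data file itself unchanged, which matters for formats that cannot hold comments; for formats that can, the same two lines could go at the top of the data file instead.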
  • To summarize this presentation succinctly, here is a haiku poem. = = = = = = Creative Commons License: Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0), http://creativecommons.org/licenses/by-nc-sa/3.0/ Special thanks to all the creators of the following resources for making it possible to build upon the materials that they created! MIT Libraries' Data Management and Publishing subject guide and Managing Research Data 101 class; University of Minnesota Libraries' Managing Your Data guide and Data Life-Cycle Management workshop; UK Data Archive's Managing and Sharing Data: Best Practice for Researchers.
  • We just spent 90 minutes together. I know we are all busy, so if you need to get back, please feel free. But if you are interested, you could stay for an open discussion. === Think, pair, share exercise.

Responding to data management questions: Presentation Transcript

  • Responding to SCIENTIFIC DATA MANAGEMENT Questions: A Guide for UC Berkeley Library Personnel. Jeffery Loo, August 2012
  • Motivation for this class: Researchers may have questions about data management. Library personnel may help researchers find answers.
  • Class goals: 1. Identify the questions our patrons may have about data management. 2. Examine ways of helping patrons find answers.
  • 3-part agenda: 1. Why is data management important? 2. Responding to core data management questions (Planning to collect data; Saving a data file; Describing and documenting data for re-use; Sharing data files; Using data ethically). 3. Open discussion.
  • Why is data management important?
  • 0. What is data management? Activities for: generating high-quality data; safely storing the data; reducing limitations to data sharing.
  • NSF data management plan: Requirement as of January 18, 2011. Your plans to organize, store, and share data. http://www.nsf.gov/bfa/dias/policy/dmp.jsp
  • “My Data Management Plan – a satire” Dr. C. Titus Brown Assistant Professor Michigan State University Source
  • Dear NSF, I am happy to respond to your request for a 2-page Data Management Plan. First of all, let me say how enthusiastic I am that you have embraced this new field of "large scale data analysis". Ever since I started working with large Avida data sets in 1993, […] I have seen the need for a systematic plan to manage the data. It is nice to see NSF stepping up to the plate in such a timely manner, and I am happy to comply. Now, as to my actual data management plan, here is how I plan to deal with research data in the future. I will store all data on at least one, and possibly up to 50, hard drives in my lab. The directory structure will be custom, not self-explanatory, and in no way documented or described. Students working with the data will be encouraged to make their own copies and modify them as they please, in order to ensure that no one can ever figure out what the actual real raw data is.
  • Backups will rarely, if ever, be done. When required to make the data available by my program manager, my collaborators, and ultimately by law, I will grudgingly do so by placing the raw data on an FTP site, named with UUIDs like 4e283d36-61c4-11df-9a26-edddf420622d. I will under no circumstances make any attempt to provide analysis source code, documentation for formats, or any metadata with the raw data. When requested (and ONLY when requested), I will provide an Excel spreadsheet linking the names to data sets with published results. This spreadsheet will likely be wrong -- but since no one will be able to analyze the data, that won't matter. [….] Note, we didn't use a version control system, either. […] And our repository is not publicly available - you have to beg for permission. Note, I only answer e-mail on every other Tuesday.
  • Any design notes on the data analysis are in our private e-mail, and we will fight to the death -- up to and including ignoring FOIA requests -- to prevent you from obtaining them. Meanwhile we will continue publishing exciting sounding (but irreproducible) analyses, and submitting grants based on them, because that's the only thing that the reviewers care about. sincerely yours, --titus (representing every computational scientist in the world.)
  • Data challenges Informal data management practices Distributed, uncoordinated effort Concerns about data re-use“Can’t you ever relax?” Data management may be ad lib
  • Lots to do for data management! Ensure long-term access Facilitate sharing Prepare for future re-use
  • Data activities in the research workflowSource:http://www2.lib.virginia.edu/brown/data/lifecycle.html
  • Lots of different research products Models and computational simulations Images, photographs, audio, and video Instrument readings Maps Software Artifacts and samples Physical collections And more …
  • Summary: The importance of data management. It is a requirement; it is beneficial; it is evolving and complex
  • Responding to basic data management questions
  • Planning to collect data; Saving a data file; Describing and documenting data for re-use; Sharing data files; Using data ethically
  • Planning to collect data
  • 1a. What are research data? Collected via observations, experiments, or simulations, or derived or compiled Models and computational simulations Images, photographs, audio, and video Instrument readings Maps Software Artifacts and samples Physical collections And many other research products and input …
  • 1b. What is a data management plan? A plan for organizing, storing, and sharing data
  • 1c. How do I write a data management plan? 1. Know the requirements 2. Find examples and templates 3. Try the DMP Tool
  • Overview of data management requirements and guidelines. Have data management requirements: National Science Foundation (NSF); National Oceanic and Atmospheric Administration (NOAA); National Institutes of Health (NIH); National Endowment for the Humanities (NEH): Office of Digital Humanities. Listing of requirements and guidelines. Have recommendations and guidelines only: ICPSR for social sciences data; National Aeronautics and Space Administration (NASA); Institute of Museum and Library Services (IMLS); Environmental Protection Agency (EPA)
  • NSF requirements: Data management plan ≤ 2 pages; describes how data will be managed, disseminated, and shared; plan undergoes peer review
  • Writing an NSF data management plan. Specific requirements vary by NSF division. In general, describe: Types of research data and materials produced; Standards for data format, content, and metadata; Policies for access and sharing; Policies for re-use, re-distribution, and derivatives; Plans for archiving and preserving. You can explain why data will not be shared. Example DMP
  • NIH requirements: Timely data sharing encouraged. If requesting ≥ $500k per year, a plan is required. Describe how data will be shared or why sharing is not possible. In the final progress report, describe data sharing actions taken.
  • Writing an NIH data sharing plan: A brief paragraph. Suggested topics: Schedule for sharing; Format of the data; Documentation of the data; Analytic tools provided; Data-sharing agreements (criteria and conditions); Mode of data sharing
  • NIH plan example 1: The proposed research will involve a small sample (less than 20 subjects) recruited from clinical facilities in the New York City area with Williams syndrome. This rare craniofacial disorder is associated with distinguishing facial features, as well as mental retardation. Even with the removal of all identifiers, we believe that it would be difficult if not impossible to protect the identities of subjects given the physical characteristics of subjects, the type of clinical data (including imaging) that we will be collecting, and the relatively restricted area from which we are recruiting subjects. Therefore, we are not planning to share the data.
  • NIH plan example 2: This application requests support to collect public-use data from a survey of more than 22,000 Americans over the age of 50 every 2 years. Data products from this study will be made available without cost to researchers and analysts. https://ssl.isr.umich.edu/hrs/ User registration is required in order to access or download files. As part of the registration process, users must agree to the conditions of use governing access to the public release data, including restrictions against attempting to identify study participants, destruction of the data after analyses are completed, reporting responsibilities, restrictions on redistribution of the data to third parties, and proper acknowledgement of the data resource. Registered users will receive user support, as well as information related to errors in the data, future releases, workshops, and publication lists. The information provided to users will not be used for commercial purposes, and will not be redistributed to third parties.
  • Find examples and templates Guides, templates, examples http://www.lib.berkeley.edu/sciences/data/guide
  • Online service for building data plans Step-by-step instructions for meeting funding agency requirements https://dmp.cdlib.org/
  • Activity: Create a data management plan with the DMPTool
  • Saving a data file
  • Hall of fame anecdote http://www.youtube.com/watch?v=J6HtRWyiL98
  • 2a. Where can I store data safely? Traditional storage not always sufficient: Personal computers; Departmental/university servers. Two additional types of storage: Archives and repositories; Cloud storage (storing files in an online site)
  • Archives and repositories: Special types of online storage sites. Long-term storage, management, and preservation. Search, download, and analytic functionalities
  • Institutional archives and repositories: Merritt http://merritt.cdlib.org/ Data repository management services at UCB http://ist.berkeley.edu/ds
  • Public archive and repository Long-term access, open to the public GenBank http://www.ncbi.nlm.nih.gov/genbank/ ICPSR http://www.icpsr.umich.edu/icpsrweb/ICPSR/
  • 3rd party cloud storage: Amazon S3, Google Drive, Dropbox. Beware of posting sensitive data/files
  • Deciding on storage Consider:Permanence Oversight Security
  • Summary of storage options: Traditional storage (personal computers; departmental/university servers); Archives and repositories (institutional; public); Cloud storage
  • 2b. How do I ensure long-term access to my data? Recommended file formats: • Non-proprietary • Uncompressed and unencrypted (okay to encrypt sensitive data) • Common usage by your research community • Standard representation (e.g., ASCII text, Unicode)
  • 2c. How do I back up my data? 1. Original master 2. Local external storage 3. Remote external storage. UC Berkeley IST backup services; 3rd party services (Amazon S3, Elephant Drive, Jungle Disk, Mozy, Carbonite Free, Dropbox); Email a copy to yourself
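The three-copy idea on this slide can be sketched in Python. This is only an illustration under stated assumptions: the helper names `sha256_of` and `back_up` are hypothetical, and two temporary folders stand in for the local and remote external storage. Each copy is checked against the master file with a SHA-256 checksum.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(master: Path, destinations: list[Path]) -> list[Path]:
    """Copy the master file into each destination folder and verify each copy."""
    expected = sha256_of(master)
    copies = []
    for folder in destinations:
        folder.mkdir(parents=True, exist_ok=True)
        copy = Path(shutil.copy2(master, folder / master.name))
        if sha256_of(copy) != expected:
            raise IOError(f"Backup at {copy} does not match the master file")
        copies.append(copy)
    return copies


# Demo: temporary folders stand in for local and remote external storage
workspace = Path(tempfile.mkdtemp())
master_file = workspace / "readings.csv"
master_file.write_text("sample,temperature_c\nA,21.5\n", encoding="utf-8")
backup_copies = back_up(
    master_file,
    [workspace / "local_external", workspace / "remote_external"],
)
```

In practice the remote copy would go to a backup service rather than a folder on the same disk; the checksum comparison is the part that carries over.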
  • Activity: Save a file in Merritt
  • Describing and documenting data (metadata) to prepare for re-use
  • What countries have a five-pointed star on their national flag?
  • DOI: 10.1126/science.1207745 “outsourcing” our memory “we don’t remember information as well, when we expect to find it on a computer later”
  • If we outsource our memory to computers … We need good organization structures to: Find data from the past quickly and completely; Understand data from the past. It helps to: Document and describe data; “Assign metadata”
  • 3a. What are metadata? Documentation and descriptions about data Metadata help us find, understand, and know how to use the files in the future.
  • 3c. What about the data do we document? Descriptive metadata elements: Title; Creator or contact; Date; Experimental conditions; Methodology; Version. Administrative metadata elements: Dictionary or codebook to explain the data variables. Structural metadata elements: File formats; File names; Tools and software needed for processing or visualizing the data
  • 3d. How do I record metadata? Option 1: write metadata; save as readme.txt; store in file folder with data
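A minimal sketch of Option 1 in Python, assuming a simple "Field: value" layout for readme.txt. The `write_readme` helper and the metadata values are hypothetical and shown for illustration only; a temporary folder stands in for the data folder.

```python
import tempfile
from pathlib import Path


def write_readme(data_folder: Path, metadata: dict[str, str]) -> Path:
    """Save metadata as 'Field: value' lines in readme.txt inside the data folder."""
    lines = [f"{field}: {value}" for field, value in metadata.items()]
    readme = data_folder / "readme.txt"
    readme.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return readme


# Hypothetical metadata values, for illustration only
example_metadata = {
    "Title": "Effect of salt on ice cream production efficiency",
    "Creator": "A. Researcher",
    "Date": "2012-10-01",
    "File format": "CSV (UTF-8 text)",
}
data_folder = Path(tempfile.mkdtemp())  # stands in for the folder holding the data
readme_path = write_readme(data_folder, example_metadata)
readme_text = readme_path.read_text(encoding="utf-8")
```

Plain text keeps the description readable without special software, which fits the recommendation for non-proprietary formats.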
  • Option 2: Metadata form/file in an archive/repository http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html
  • Option 3: Annotate <title>Effect of salt on ice cream production efficiency</title> XML, a popular system for annotating data http://www.w3schools.com/xml/
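As a rough illustration of XML annotation, the sketch below builds a small record around the slide's `<title>` element with Python's standard `xml.etree.ElementTree` module and reads it back. The element names `dataset`, `creator`, and `format` are assumptions made for this example, not a standard metadata schema.

```python
import xml.etree.ElementTree as ET

# Element names other than <title> are hypothetical, not a standard schema
record = ET.Element("dataset")
ET.SubElement(record, "title").text = "Effect of salt on ice cream production efficiency"
ET.SubElement(record, "creator").text = "A. Researcher"
ET.SubElement(record, "format").text = "text/csv"

# Serialize the record to a string that could be saved next to the data
xml_text = ET.tostring(record, encoding="unicode")

# Read the annotation back, as a later user of the data might
parsed = ET.fromstring(xml_text)
title = parsed.findtext("title")
```

A research community's own metadata standard, where one exists, would normally dictate the element names.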
  • Australia Brazil United States of America Cape Verde Ethiopia
  • Sharing data
  • 4a. Why share data?
  • Historic data sharing: Galileo, Newton, Huygens, Hooke. Anagrams to secure discoveries. Versus the “open science revolution” of journals today
  • Open science: Share research data, products, and communications openly. Potential benefits: Protects unique data that cannot be readily replicated; Reinforces open scientific inquiry; Encourages diversity of analysis and opinion; Promotes new lines of research; Makes possible the testing of new or alternative hypotheses and methods of analysis; Supports studies on data collection methods and measurement; Facilitates the education of new researchers; Enables the exploration of topics not envisioned by the initial investigators; Permits the creation of new datasets when data from multiple sources are combined; Provides content for scientific education
  • Data sharing examples Crystal structure of M-PMV retroviral protease
  • Increased citation rate
  • Funding agency policies: NIH Data Sharing Policy; NSF Data Sharing Policy; Data management plan for grant applications
  • Journal expectations: Data sharing as a term of publication
  • Data sharing as a sacred act 
  • Summary: Why share data? Open science; Culture of sharing; Increased citations; Funding agency requirements; Condition for publications
  • 4b. How do I share data? Personal sharing. Share-upon-request: Email me for a copy! Self-archive: Download from my personal website!
  • Journal publishing
  • Institutional archive or repository UC3 Merritt repository
  • Public archive or repository. Ideal characteristics: Popular with national/global coverage; Specific to your discipline; Offers long-term preservation. Find an archive/repository: Ask colleagues; Search http://databib.org/ The Ancient Agora of Athens
  • 4c. Public versus institutional archives and repositories – which to use? Public archives/repositories: Create comprehensive dataset for a larger research problem space; Domain-specific archives/repositories may provide better support. Institutional archives/repositories: May restrict to a smaller audience; May offer greater control of your data
  • Summary: How to share data. Data sharing: Personal (share upon request; self-archive on a personal website); Public (publish in a journal; archives and repositories, institutional or public)
  • 4d. What do I share? Be selective Recognize restrictions (privacy and confidentiality) Online services for sharing among your team Research Hub 3rd party services
  • 4e. How do I help colleagues find my data, even after I move them?
  • Help others find your data. Berkeley: www.berkeley.edu/mystuff/super-data.csv File moves to Stanford: www.stanford.edu/mystuff/super-data.cs Old URL is kaput
  • Try permanent identifiers: DOI (digital object identifier). Resolve a DOI by visiting http://dx.doi.org/ followed by the DOI. The file can move, but the DOI remains the same. The DOI record stores location details
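The resolution rule on this slide (resolver address followed by the DOI) can be sketched as a small Python function. The function name `doi_to_url` is hypothetical; the resolver address is the one given on the slide, and the example DOI is the Martinson et al. article cited elsewhere in this deck.

```python
from urllib.parse import quote

RESOLVER = "http://dx.doi.org/"  # resolver address given on the slide


def doi_to_url(doi: str) -> str:
    """Build the resolver URL for a DOI, tolerating a leading 'doi:' prefix."""
    cleaned = doi.strip()
    if cleaned.lower().startswith("doi:"):
        cleaned = cleaned[4:].strip()
    # Percent-encode unusual characters but keep the '/' between prefix and suffix
    return RESOLVER + quote(cleaned, safe="/")


url = doi_to_url("doi:10.1038/435737a")
```

Because the link is built from the identifier rather than from the file's current location, it keeps working after the file moves, provided the DOI record is updated.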
  • Generate permanent identifiers http://n2t.net/ezid Subscription through the UCB Library; request your free account by emailing data-consult@lists.berkeley.edu
  • Activity: Generate a DOI in EZID
  • Using data ethically
  • 3247 respondents. 0.3% admitted to falsification or “cooking” research data. About 1 in 3 confessed to committing at least one of 10 serious misbehaviors. Study by Martinson et al., 2005. Source: doi:10.1038/435737a. Motivated by increasing pressure to publish papers and win …
  • 5a. How do I use data ethically in an evolving data management landscape?
  • Citing data
  • Prevent distortions and manipulations Keep raw original data Log all changes made
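One way to follow both points on this slide is to edit only a working copy and to append every change to a log file. The Python sketch below assumes a hypothetical project layout (`raw/`, `working/`, `changes.log`) and hypothetical helper names; the sample data and the cleaning step are invented for illustration.

```python
import shutil
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def start_working_copy(raw: Path, work_folder: Path) -> Path:
    """Copy the raw file so that edits never touch the original."""
    work_folder.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(raw, work_folder / raw.name))


def log_change(log_file: Path, description: str) -> None:
    """Append a timestamped description of a change to the log."""
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    with log_file.open("a", encoding="utf-8") as handle:
        handle.write(f"{stamp}\t{description}\n")


# Demo in a temporary project folder with invented sample data
project = Path(tempfile.mkdtemp())
raw_file = project / "raw" / "readings.csv"
raw_file.parent.mkdir()
raw_file.write_text("sample,temperature_c\nA,21.5\nB,-999\n", encoding="utf-8")

# Clean the working copy only, then record what was done and why
working = start_working_copy(raw_file, project / "working")
cleaned = working.read_text(encoding="utf-8").replace("B,-999\n", "")
working.write_text(cleaned, encoding="utf-8")
log_change(project / "changes.log", "Removed sample B: -999 is a missing-value code")
```

A version control system would serve the same purpose with more rigor; the sketch only shows the minimum habit of separating raw from modified data.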
  • Data licensing: Restrictions on data use, for example: No for-profit use; No re-sharing; Give attribution. Check for license/terms of use
  • Stay current with data requirements: Review for changes to policies by: Funding agencies; University regulations; Federal and state governments
  • 5b. How do I set permissions and restrictions on the use of my data? Data license: A legal instrument. Permits second parties to do things with the data (or not)
  • Preparing a data license: Write your own license OR use a standard license. Creative Commons http://wiki.creativecommons.org/Data Open Data Commons http://opendatacommons.org/ Public domain options • Creative Commons Zero (CC0) • Open Data Commons Public Domain Dedication and License (PDDL)
  • Mechanism for licensing data Attach the license to the data by including: • a statement that the data is released under a license • a mechanism for retrieving the license full text
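The two parts of the mechanism (a statement naming the license, and a way to retrieve its full text) could be combined in a short notice stored with the data. The `license_notice` helper below is a hypothetical Python sketch; the dataset name is borrowed from the earlier URL example, and the choice of CC0 is only an illustration.

```python
def license_notice(dataset_name: str, license_name: str, license_url: str) -> str:
    """Return a notice stating the license and where to retrieve its full text."""
    return (
        f"The dataset '{dataset_name}' is released under {license_name}.\n"
        f"Full license text: {license_url}\n"
    )


notice = license_notice(
    "super-data.csv",  # file name borrowed from the earlier URL example
    "Creative Commons Zero (CC0) 1.0",
    "https://creativecommons.org/publicdomain/zero/1.0/",
)
```

The notice could be saved as a text file in the data folder or pasted into the readme.txt described earlier.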
  • Haiku summary: Data are precious / Safely store and share widely / Good for all research
  • “What’s the takeaway on all this?”
  • Part 3: Discussion 1. Which of the data management literacies/tools are the key ones for the library to support and facilitate? 2. How may we include these literacies/tools in our outreach, instruction, reference, collection development, systems, or ________? Any new directions?