Scientific data management
  • The Responsible Conduct of Research Seminar Series will give graduate students, postdoctoral scholars, faculty, and staff the knowledge and tools to guide them through the increasingly complex ethical issues that they will face during their careers. Attendees will become aware of established professional norms and ethical principles for responsible scholarly conduct and learn how to apply them to their own work and conduct their research with integrity. Each session will be led by an expert on the seminar's topic and will include case studies and time for discussion. http://rac.berkeley.edu/rcr.html
  • My name is Jeffery Loo. I’m a librarian at the Chemistry and Chemical Engineering Library. It is an honor to speak with you about data management today.
  • Let’s start at the beginning!
  • On January 18, 2011, the National Science Foundation instituted a new requirement. NSF grant applications need a document known as a data management plan. No more than 2 pages long, this plan describes how you will organize, store, and share your research data.
  • This is Dr. C. Titus Brown, who graduated from Caltech with a PhD in developmental molecular biology and is now an Assistant Professor at the home of the Spartans, Michigan State University. In anticipation of the NSF requirement for a data management plan, he wrote a satirical plan. As is the form of satire, it’s fictional but with a kernel of truth. (Story from the presentation by Joel Herndon & Paolo Mangiafico, Research Data Management at Duke: Trends and New Developments, January 2012, http://lgdata.s3-website-us-east-1.amazonaws.com/docs/217/382205/ResearchDataMgtDuke-Jan2012.pdf )
  • While humorous, Professor Brown points out some real problems with research data. Data management may not be as formal or as standardized a process as other parts of the research workflow, like grant applications and article publication. So data could be lost and may not be treated according to their value. Other challenges include: distributed effort among team members; concerns about data re-use; and research that changes direction and evolves rapidly, so data management may be an afterthought and addressed ad lib.
  • In addition to these challenges, data management includes a lot of different activities. The goals are to: ensure long-term access to data; facilitate data sharing; and prepare data for future re-use by other researchers.
  • These three goals of data management can get overwhelming very quickly. As you can see, there are data activities for every step of the research process. Source: http://www2.lib.virginia.edu/brown/data/lifecycle.html
  • Another complicating factor is the variety of research data. It is more than just text and numbers on spreadsheets. You want to protect as many of your research products as possible, including:
  • The goal of this presentation is to make data management easier by reviewing some of the first steps. As a librarian, I am going to focus on activities related to the organization of your data and its sharing. I’m leaving out data analysis because you are the experts in that arena.
  • Sometimes common sense isn’t common practice (e.g., we know we should finish that major project as soon as possible, but we end up procrastinating; and we know we shouldn’t talk back to our mothers, but we do). As I review these data management tips, many will seem as obvious as common sense, but I’m reviewing them as a reminder to put them into common practice.
  • Here’s an article in the Chronicle of Higher Education documenting grad students who have had their computers stolen and therefore lost large amounts of their intellectual blood, sweat, and tears. The stories are gut-wrenching. Our research data are very important, and I’m sure each of us has data that we could not afford to lose. Source: http://chronicle.com/article/When-a-Thief-Takes-Your/126679/=
  • Anjali Adukia is an education graduate student in the hall of fame of anecdotes about data loss. Her experience is so colorful and so unlikely that Harvard University’s Graduate School of Education asked her to re-enact it for a video, which I have to show you now. What she left out was that she assaulted the monkey near a Gandhi ashram, a shrine honoring a man of peace.
  • Saving your data and your research products safely is very important. So where do you store your data safely? We have seen that traditional storage is not always sufficient. With personal computers and departmental or university servers, you need to secure your computer from theft and damage, and make sure there is a backup. Let’s consider two additional types of storage. The first is archives and repositories. The second is cloud-based storage. Increasingly, people are saving their data on the cloud, that is, storing their digital files in an online site on the Internet. The advantages are that you can access your files wherever you have online access, and that the organization that runs the online storage site can maintain its servers and make backups for you as part of its service.
  • Let’s look at archives and repositories first. These are a special type of online storage site. They provide long-term storage, management, and preservation of digital files, and they usually have functionalities for searching, downloading, and using the stored data.
  • You can store data in an archive or repository run by an institution, like the University of California, a research organization, or your scientific society. The University of California community can use Merritt. Merritt lets you manage, archive, and share digital content with an easy-to-use interface for deposits and updates, offers persistent URLs for access, and provides tools for long-term management and permanent storage. http://merritt.cdlib.org/ You do have to buy storage space on Merritt, but the costs are competitive with private sector services. Additionally, the Research and Content Technologies team at UC Berkeley's IST (Information Services and Technology) provides data repository management support services (http://ist.berkeley.edu/ds )
  • You can also store your data in a public archive or repository. It’s public, so your data will be available for long-term access across a wide audience. The reason for this openness is to share the data, and we’ll talk a lot more about this later. A popular example is the GenBank sequence database. This open access database offers an annotated collection of all publicly available nucleotide sequences and their protein translations.
  • You can also store your data in an online storage service provided by a private company. For example, there are: Amazon S3; Google Docs; Dropbox. You buy storage space (sometimes the first few MBs are free), then upload the data for storage and get online access to your files. Some of these services also allow collaborative editing so you and your team can make changes to the data and share. An important consideration: beware of loading sensitive and confidential data in 3rd party services.
  • As you’re deciding which storage option is right for you, consider three issues in your evaluation. Permanence: how long do you want to keep the data? Oversight: who manages the storage system and the data? Security: is it safe from unauthorized access, especially for private, personal, and confidential data?
  • When you save your data files, try saving them in a format that you will be able to open and read in the future. It is recommended that you use file formats that are: non-proprietary; uncompressed and unencrypted (okay to encrypt sensitive data); commonly used by your research community; and in a standard representation (e.g., ASCII text, Unicode). This table outlines examples of ideal and not so ideal file formats.
  • If you make lots of changes to your data files over and over again, try version control software/services. They record the changes to your files and save them separately as different file versions, so you can revert to previous versions.
  • For sensitive and confidential data, you may encrypt or password protect your data. Try to record your passwords and encryption keys on paper and then lock them in a file cabinet (2 copies). You could also use software like Password Safe, http://passwordsafe.sourceforge.net/, for storing passwords.
  • So that you don’t get profiled in the Chronicle of Higher Education for data loss or theft, backup is very important. Try to have 3 copies: an original master copy; a copy on a local external storage device, like an external hard drive in your lab or office; and a copy on remote external storage, like an online storage site or a hard drive in a physically removed location. For example, UC Berkeley IST offers storage and backup services, http://ist.berkeley.edu/services/catalog/storage There are third-party companies that do backups for you, and some do them automatically. Here’s a listing: http://www.cdlib.org/services/uc3/datamanagement/managing.html#secure I like Dropbox, which gives 2 GB free. Whenever you’re online, it automatically uploads a copy of your files for online storage, and it can then sync those files across the different computers you’re using. Finally, at the very least, email a copy of your files to yourself.
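The three-copy habit can be partly automated. What follows is only a minimal sketch, not a tool mentioned in the talk: it copies a master file into a local and a remote folder and checks each copy against the master with a checksum. The function names and the idea of treating the "remote" location as a mounted or synced folder are assumptions made for illustration.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(master: Path, destinations: list[Path]) -> list[Path]:
    """Copy the master file into each destination folder and verify each copy.

    `destinations` is a hypothetical list such as an external drive folder
    and a folder synced to remote storage.
    """
    expected = sha256_of(master)
    copies = []
    for folder in destinations:
        folder.mkdir(parents=True, exist_ok=True)
        copy = Path(shutil.copy2(master, folder / master.name))
        # A mismatch means the copy is corrupt or incomplete.
        if sha256_of(copy) != expected:
            raise IOError(f"Backup at {copy} does not match the master copy")
        copies.append(copy)
    return copies
```

Calling `back_up(Path("readings.csv"), [Path("/mnt/external"), Path("/mnt/remote")])` (paths hypothetical) would give the master plus two verified copies.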
  • If I asked you, “What countries have a five-pointed star on their national flag?”, chances are you won’t be thinking through a list of countries and their flags. Instead, you’re probably thinking about where the nearest Web connection is and what search terms to use in Google.
  • This research study published in Science shows us that as information is continually and instantaneously available online, our cognitive habits are changing. We are outsourcing our memory to machines. For example, research subjects in the study were asked to type facts into a computer, such as “The space shuttle Columbia disintegrated during re-entry over Texas in February 2003”. Half of the subjects were told that their work would be saved; the other half were told the work would not be saved. Those who believed their work would be saved by the computer did not recall details of what they typed as well as the other subjects. So … we don’t remember information as well when we expect to be able to find it again later on a computer.
  • So if we are not going to remember information as well when we save it on a computer, we need to make sure that we can find data from the past quickly and completely, and that we can understand this data when we open it. So outsourcing our memory means that we need good organization structures. To meet these goals, it helps to document and describe (i.e., label) your data as you create it. The simple act of labeling something helps you to find it and understand how to use it. The process of documenting, describing, and labeling your data has a fancy name: you assign metadata. The metadata are the labels you add to data.
  • So what in your data do you document? There are different types of information and characteristics about your data, known as metadata elements. There are descriptive metadata elements like … For some disciplines, there are standards as to what metadata elements to use and how to apply them (check our guide).
  • So how do you record this metadata? The simplest way is to create a text file (.txt) and record your metadata elements. You store this file in the same folder as your data. Therefore, when you are in that data folder, you can read the text file to understand the data.
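As an illustration of this readme approach, here is a minimal sketch that writes one "Element: value" line per metadata element into a readme.txt stored beside the data. The function name and the line layout are assumptions for this example; the element names you pass in would come from your own project or your discipline's standard.

```python
from pathlib import Path


def write_readme(folder: Path, metadata: dict[str, str]) -> Path:
    """Write one 'Element: value' line per metadata element to readme.txt."""
    folder.mkdir(parents=True, exist_ok=True)
    readme = folder / "readme.txt"
    lines = [f"{element}: {value}" for element, value in metadata.items()]
    readme.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return readme
```

For example, `write_readme(Path("trial-1"), {"Title": "Effect of salt on ice cream production efficiency", "Creator": "J. Loo", "Date": "2012-04-16"})` would leave a short, human-readable description in the trial-1 folder (the values here are placeholders).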
  • If you upload your data to an archive or repository, you will likely complete a form of metadata details for the file. This metadata is then stored in a database record with your data file. For example, here is a GenBank record for a genetic sequence, here are the metadata elements on the left, and here are instructions on how you assign the metadata. (http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html) If you plan to deposit into a repository or archive, you might want to check the metadata requirements before you collect your data.
  • Finally, you can annotate your data. Here is an example. These bracketed parts are known as tags, and they let us know that the enclosed text is the title of the file. And here 0 is a temperature data point. A popular system for annotating your data is XML. You can learn more about XML at http://www.w3schools.com/xml/
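To make the tagging idea concrete, here is a small sketch using Python's standard xml.etree.ElementTree module. The title and temperature tags and their values come from the slide; the enclosing dataset element is an assumption added so that the two tags sit inside one well-formed document.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper element; the slide shows only <title> and <temperature>.
record = ET.Element("dataset")
ET.SubElement(record, "title").text = "Effect of salt on ice cream production efficiency"
ET.SubElement(record, "temperature").text = "0"

# Serialize the annotated record to a string of XML.
xml_text = ET.tostring(record, encoding="unicode")

# Reading the tags back recovers the labelled values.
parsed = ET.fromstring(xml_text)
title = parsed.findtext("title")
temperature = float(parsed.findtext("temperature"))
```

Because each value travels with its tag, a later reader (or program) can tell that 0 is a temperature without consulting a separate key.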
  • A final point: give descriptive names to your data files and folders so you can find them quickly in your storage system. Avoid ambiguity by considering these elements for a file/folder name: project title; experimental conditions and group; trial numbers; file version number indicating data modifications; date or time stamps; author initials.
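One way to apply these naming elements consistently is to build file names with a small helper. This is a sketch of one possible convention, not a standard: the function names, the element order, the underscore and hyphen separators, and the zero-padded numbers are all assumptions.

```python
import re
from datetime import date


def slug(text: str) -> str:
    """Lower-case the text and replace runs of other characters with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def data_file_name(project, condition, trial, version, day, initials, ext="csv"):
    """Join the naming elements into one descriptive file name."""
    parts = [
        slug(project),            # project title
        slug(condition),          # experimental conditions and group
        f"trial{trial:02d}",      # trial number
        f"ver{version:03d}",      # file version number
        day.strftime("%Y%m%d"),   # date stamp
        slug(initials),           # author initials
    ]
    return "_".join(parts) + "." + ext


example = data_file_name(
    project="Salt ice cream",
    condition="75 celsius control",
    trial=1,
    version=2,
    day=date(2012, 4, 16),
    initials="JL",
)
print(example)  # salt-ice-cream_75-celsius-control_trial01_ver002_20120416_jl.csv
```

Zero-padding the trial and version numbers keeps the files in order when a folder is sorted by name.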
  • In case you were wondering, here are a few countries with 5-pointed stars on their national flags.
  • Some studies have shown that researchers may be reluctant to share their research data openly, citing a variety of reasons. Source: http://chronicle.com/article/Too-Many-Researchers-Are/123749
  • In the past, scientific communication wasn’t as open as it is today. Scientists like Galileo, Newton, Huygens, and Hooke were sending each other communications about their research with parts written in indecipherable anagrams. This allowed the scientists to discuss their discoveries without giving away their methods and results. If another scientist claimed the same discovery, you would just unscramble the anagram to secure your claim on a finding. Our journal system today represents a relative “open science revolution”, overcoming the initial resistance to its transparency. Source: http://www.dailykos.com/story/2012/03/12/1073780/-Book-Review-Nielsen-s-Reinventing-Discovery
  • Currently, there is a drive for a new movement of “open science”: sharing scientific data and publications freely with all. Sharing your research materials freely may foster a number of potential benefits. There are many, but my favorites are that it: protects unique data that cannot be readily replicated; promotes new lines of research; enables the exploration of topics not envisioned by the initial investigators; and provides content for scientific education. If you’re interested in the topic of open science, check out the book Reinventing Discovery: The New Era of Networked Science. The author, physicist Michael Nielsen, explores some interesting and unexpected outcomes of open science.
  • There are prominent cases of successful data sharing for research collaborations, such as: Galaxy Zoo, an online astronomy project that invites members of the public to assist in the morphological classification of large numbers of galaxies; and Foldit, an online puzzle video game about protein folding. In 2011, Foldit players helped to decipher the crystal structure of the Mason-Pfizer monkey virus (M-PMV) retroviral protease, from an AIDS-causing monkey virus. This research problem stumped scientists for 15 years, but it took just 10 days for the players to solve it.
  • The private sector is invested in data sharing as well. The drug company Eli Lilly has its “open innovation models”, https://openinnovation.lilly.com/dd/ They give up some knowledge about their internal processes and problems in order to recruit external researchers to help attack those problems. Another case is from Alzheimer’s research. In 2003, a group of scientists and executives from government, academic, and private groups joined to share data on the biological markers for Alzheimer’s disease. This data sharing hastened the progress of Alzheimer’s research, and the data is publicly available at http://www.adni-info.org/
  • And according to this study, sharing detailed research data is associated with an increased citation rate. http://www.plosone.org/article/info:doi%2F10.1371%2Fjournal.pone.0000308
  • Furthermore, funding agencies such as the NIH and NSF are increasingly expecting and requiring data sharing practices.
  • And prominent journals like Science, Nature, and PLoS have policies that require data sharing as a term of publication. http://www.nature.com/authors/policies/availability.html http://www.sciencemag.org/site/feature/contribinfo/prep/gen_info.xhtml#dataavail http://www.plosone.org/static/policies.action#sharing As you can see, there are many forces moving towards research data sharing.
  • If you do share your research data, you can share your data in a very personal manner. One option is share-upon-request. You announce that your data is available, and you send a copy to any requestors. Another approach is self-archiving. You post your data to your website and then allow downloads.
  • You could also publish your data in a journal, in the supplementary online materials section.
  • You could also deposit your data into an archive or repository at your research institution, like UC’s Merritt repository.
  • You could also deposit your data into an archive or repository that is open to the public, which may increase the impact of your work. In order for your data to be widely accessible and highly used by your research peers, you want an archive/repository that: is popular with your research community and has national or global coverage; is specific to your discipline or research domain; and offers reliable long-term preservation. To find an archive or repository, ask your colleagues where they share their data, because you want to be where your colleagues are. Alternatively, search this directory of public archives and repositories: http://databib.lib.purdue.edu/
  • Public and institutional archives have different purposes. With a public archive or repository, you and your research colleagues across the world are sharing data to build a comprehensive dataset for a larger research problem space. And if the public archive or repository is domain-specific, like a devoted genetics or chemical database, you may receive enhanced data support among a community of experienced peer researchers. By comparison, posting your data into an institutional repository may restrict your data to a smaller audience, but it may also offer greater control of how your data is stored, organized, and represented.
  • When you share your data, you want your fellow scientists to find it easily. Suppose you’ve posted a data file at www.berkeley.edu/mystuff/super-data.csv If you move to Stanford and transfer your files over, the new URL could be www.stanford.edu/mystuff/super-data.csv Then your old friends at Cal who have the old URL will get an error message when they try to view your data.
  • To fix this problem of dead links, try assigning permanent identifiers to your digital files, for example DOIs. A DOI is a unique identification number for your file. Think of it like a Social Security Number. If you have a DOI, you can find the file by going to a web browser and entering dx.doi.org/ followed by the DOI (e.g., 10.1126/science.331.6018.692 ). Voilà, the DOI takes you to the file location, and it’s a different URL. Every DOI has a database record that stores the location details of a file. You can update the DOI record with a new location if your file moves. So a file can move to a different location but the DOI remains the same.
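The "resolver address plus DOI" recipe can be written down in a few lines. This sketch only builds the link and does not contact the network; the function name and the tolerance for a leading "doi:" label are assumptions for illustration, while the dx.doi.org address and the sample DOI are the ones given in the talk.

```python
from urllib.parse import quote

RESOLVER = "http://dx.doi.org/"  # resolver address given in the talk


def doi_url(doi: str) -> str:
    """Build the resolver URL for a DOI, tolerating a leading 'doi:' label."""
    doi = doi.strip()
    if doi.lower().startswith("doi:"):
        doi = doi[4:].strip()
    # Every DOI begins with the directory indicator "10."
    if not doi.startswith("10."):
        raise ValueError(f"Not a DOI: {doi!r}")
    # Percent-encode unusual characters but keep the prefix/suffix slash.
    return RESOLVER + quote(doi, safe="/")
```

A citation that stores `doi_url("10.1126/science.331.6018.692")` keeps working when the file moves, because only the DOI record's target changes.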
  • At UC Berkeley, researchers can use the EZID service to generate DOIs. EZID is brought to you by the University of California Curation Center at the California Digital Library. The UC Berkeley Library pays for a subscription to this service. Let me demonstrate. Visit the EZID website at http://n2t.net/ezid/help You enter the metadata details, including the URL location. Save, and you have a DOI. You can generate DOIs through this web interface, and you can also generate them via API. To get your free account, email data-consult@lists.berkeley.edu and request an account. A final point: EZID also lets you generate an ARK, which is another type of permanent identifier.
  • Some final tips on data sharing. Be selective in what you share; maybe reconsider preliminary analyses. Next, recognize the restrictions on sharing. Beware of sharing private and confidential data. Finally, there are online services and software for sharing data/files just among your team, and not the world. At UCB, there is the Research Hub service, https://hub.berkeley.edu There are also 3rd party services like Google Docs, and many others listed here: http://www.readwriteweb.com/biz/2011/06/8-simple-ways-to-share-data-on.php
  • What is a data management plan? It is a written plan on how you will organize, store, and share your data.
  • The act of planning has been associated with greater self-control for engaging in positive activities like: exercise; medical adherence; self-health exams; sunscreen use; schoolwork; and refraining from a negative behavior. So perhaps planning might help with data management as well.
  • Writing a data management plan can be preparation for efficient and quality collection of data that is safe and shareable. And you’ll meet NSF and NIH requirements.
  • The NSF Data Management Plan is a required part of your proposal application. It is a supplementary document of no more than two pages, and it should describe how the proposal will conform to the NSF policy on the dissemination and sharing of research results. The data management plan undergoes peer review and is an integral part of the proposal. How is it reviewed? http://www.nsf.gov/pubs/policydocs/pappguide/nsf11001/gpg_2.jsp#IIC2j
  • What do you write in a data management plan? The specific content requirements vary among the NSF divisions and groups, with details available at: http://www.nsf.gov/bfa/dias/policy/dmp.jsp In general, your plan discusses the: types of data and research materials produced; standards used for the data format and content (and its metadata); policies for access and sharing; policies for re-use, re-distribution, and derivatives; and plans for archiving and preserving data and research products. In your plan, you can state that the data will not be publicly shared, and explain the reason. This rationale will undergo peer review. Here are example NSF data management plans from UCSD Geosciences research groups. Example 1 - http://rci.ucsd.edu/_files/DMP%20Example%20Jennifer%20MacKinnon.pdf As you see, it doesn’t have to be long. It lists the types of data generated. As for standards, the data will be shared in the MATLAB MAT format. They also specify who is responsible for the different data analyses and management. Example 2 - http://rci.ucsd.edu/_files/DMP%20Example%20SIO%20OCE.pdf It lists the different types of data generated, the collaborator responsible for the data, and the database or websites where the data set will be shared.
  • NIH expects timely data sharing, no later than the acceptance for publication of the main findings. A plan is required if you are requesting $500,000 or more in any single year from NIH. You describe your plans for sharing the final research data, or state why data sharing is not possible. And in the final progress report, you report the data sharing actions taken. http://grants.nih.gov/grants/policy/data_sharing/data_sharing_guidance.htm (Reviewers will not factor the proposed data-sharing plan into the determination of scientific merit or priority score.)
  • Your plan is a brief paragraph immediately after the Letters of Support in your application form. (It does not count towards the application page limit.) It is suggested that you briefly describe: the expected schedule for data sharing; the format of the final dataset; the documentation to be provided; whether or not any analytic tools will be provided; whether or not a data-sharing agreement will be required and what the criteria and conditions are; and the mode of data sharing.
  • Here’s example 1. It doesn’t need to be long. The researchers explain that they do not plan to share the data. They believe it would be too difficult to protect the identities of subjects.
  • Here in example 2, the researchers are sharing their data online. User registration is required with the conditions that the data are not used for commercial purposes and they will not be distributed to third parties.
  • If you’d like more details on data management plans, check out our Library webpage on the subject: http://www.lib.berkeley.edu/sciences/data/guide You’ll find guides, templates, and examples of data management plans.
  • To help you write a plan, there is an online service that provides step-by-step instructions and guidance for meeting specific funding agency requirements. It is the DMPTool, and I like to think of it as the TurboTax for data management plans. https://dmp.cdlib.org/ It’s free, and you can log in with your CalNet ID to begin creating a data management plan. The DMPTool is developed by UC3 (the University of California Curation Center) at the California Digital Library along with partner universities and research centers across the country and internationally.
  • Data sharing is on the rise, and this is changing ideas of how to responsibly use data and our shared research efforts. So I want to briefly discuss data ethics. This allows me to end on a good note by discussing goodness.
  • Here are some interesting numbers. In a 2005 study reported in Nature, 3247 researchers funded by the NIH responded to a survey about scientific misbehaviors. 0.3% admitted to falsifying or “cooking” research data, and about 1 in 3 confessed to committing at least one of 10 relatively serious acts of professional misbehavior. The researchers proposed that a main cause of this behavior might be the increasing pressure to publish papers and win grants. So I’m going to quickly review data ethics to help you avoid becoming a data point in these statistics.
  • If you use another researcher’s dataset, cite the data or the study’s paper. Here are some sample citations for data: http://www.dcc.ac.uk/resources/how-guides/cite-datasets#x1-5000 The format varies by the citation style you use, but they all typically include DOIs.
  • To prevent data distortions and manipulations, keep a copy of the raw original data and keep a log of all the changes made to the data.
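As a rough sketch of this habit, the two helpers below set aside a read-only copy of the original file and append a timestamped line to a change log each time the working file is modified. The raw folder name, the changes.log file name, and the tab-separated log layout are assumptions chosen for illustration.

```python
import shutil
from datetime import datetime, timezone
from pathlib import Path


def preserve_raw(data_file: Path) -> Path:
    """Copy the original file into a raw/ folder once and mark it read-only."""
    raw_dir = data_file.parent / "raw"
    raw_dir.mkdir(exist_ok=True)
    raw_copy = raw_dir / data_file.name
    if not raw_copy.exists():  # never overwrite an existing raw copy
        shutil.copy2(data_file, raw_copy)
        raw_copy.chmod(0o444)
    return raw_copy


def log_change(data_file: Path, author: str, description: str) -> Path:
    """Append one timestamped line describing a change to changes.log."""
    log = data_file.parent / "changes.log"
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    with log.open("a", encoding="utf-8") as handle:
        handle.write(f"{stamp}\t{data_file.name}\t{author}\t{description}\n")
    return log
```

Calling `preserve_raw` before the first edit and `log_change` after each later edit leaves both the untouched original and a record of what was done to it.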
  • A data set may have a license with restrictions on its use. You may be forbidden to use that data for for-profit use or you might have to give attribution in a particular way. So check for a data license and its terms of use.
  • The last point. Periodically review the data expectations and requirements of funding agencies, university regulations, and federal and state governments. They may change over time.
  • To summarize this presentation succinctly, here is a haiku poem. = = = = = = Creative Commons License: Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0) http://creativecommons.org/licenses/by-nc-sa/3.0/ Special thanks to all the creators of the following resources for making it possible to build upon the materials that they created: MIT Libraries' Data Management and Publishing subject guide and Managing Research Data 101 class; University of Minnesota Libraries' Managing Your Data guide and Data Life-Cycle Management workshop; UK Data Archive's Managing and Sharing Data: Best Practice for Researchers.

Scientific data management Presentation Transcript

  • Scientific data management. Responsible Conduct of Research Seminar Series, April 16, 2012
  • Who are you? Jeffery Loo, PhD. “Flying books” Installation by J. Ignacio Diaz de Rabago, UC Berkeley Library
  • NSF data management plan Requirement as of January 18, 2011 Your plans to organize, store, and share data http://www.nsf.gov/bfa/dias/policy/dmp.jsp
  • “My Data Management Plan – a satire” Dr. C. Titus Brown Assistant Professor Michigan State University Source
  • Dear NSF, I am happy to respond to your request for a 2-page Data Management Plan. First of all, let me say how enthusiastic I am that you have embraced this new field of "large scale data analysis". Ever since I started working with large Avida data sets in 1993, […] I have seen the need for a systematic plan to manage the data. It is nice to see NSF stepping up to the plate in such a timely manner, and I am happy to comply. Now, as to my actual data management plan, here is how I plan to deal with research data in the future. I will store all data on at least one, and possibly up to 50, hard drives in my lab. The directory structure will be custom, not self-explanatory, and in no way documented or described. Students working with the data will be encouraged to make their own copies and modify them as they please, in order to ensure that no one can ever figure out what the actual real raw data is.
  • Backups will rarely, if ever, be done. When required to make the data available by my program manager, my collaborators, and ultimately by law, I will grudgingly do so by placing the raw data on an FTP site, named with UUIDs like 4e283d36-61c4-11df-9a26-edddf420622d. I will under no circumstances make any attempt to provide analysis source code, documentation for formats, or any metadata with the raw data. When requested (and ONLY when requested), I will provide an Excel spreadsheet linking the names to data sets with published results. This spreadsheet will likely be wrong -- but since no one will be able to analyze the data, that won't matter. [….] Note, we didn't use a version control system, either. […] And our repository is not publicly available - you have to beg for permission. Note, I only answer e-mail on every other Tuesday.
  • Any design notes on the data analysis are in our private e-mail, and we will fight to the death -- up to and including ignoring FOIA requests -- to prevent you from obtaining them. Meanwhile we will continue publishing exciting sounding (but irreproducible) analyses, and submitting grants based on them, because that's the only thing that the reviewers care about. sincerely yours, --titus (representing every computational scientist in the world.)
  • Data challenges: Informal data management practices; Distributed, uncoordinated effort; Concerns about data re-use. “Can’t you ever relax?” Data management may be ad lib
  • Lots to do! Ensure long-term access; Facilitate sharing; Prepare for future re-use
  • Data activities in the research workflow. Source: http://www2.lib.virginia.edu/brown/data/lifecycle.html
  • Lots of different research products: Models and computational simulations; Images, photographs, audio, and video; Instrument readings; Maps; Software; Artifacts and samples; Physical collections; And more …
  • Goal for this lunch hour: Review “first steps” in data management: Saving data; Describing/documenting data; Sharing data; Data management planning; Data ethics
  • Common sense versus common practice
  • Saving data
  • Hall of fame anecdote http://www.youtube.com/watch?v=J6HtRWyiL98
  • Where do you store data safely? Traditional storage not always sufficient: Personal computers; Departmental/university servers. Two additional types of storage: Archives and repositories; Cloud storage (storing files in an online site)
  • Archives and repositories: Special types of online storage sites; Long-term storage, management, and preservation; Search, download, and analytic functionalities
  • Institutional archives and repositories: Merritt http://merritt.cdlib.org/ Data repository management services at UCB http://ist.berkeley.edu/ds
  • Public archive and repository Long-term access, open to the public GenBank http://www.ncbi.nlm.nih.gov/genbank/
  • 3rd party cloud storage: Amazon S3; Google Docs; Dropbox. Beware of posting sensitive data/files
  • Deciding on storage. Consider: Permanence; Oversight; Security
  • Save for long-term access. Recommended file formats: • Non-proprietary • Uncompressed and unencrypted (okay to encrypt sensitive data) • Common usage by your research community • Standard representation (e.g., ASCII text, Unicode)
  • Backup: 3 copies. 1: Original master. 2: Local external storage. 3: Remote external storage: UC Berkeley IST backup services; 3rd party services (Amazon S3, Elephant Drive, Jungle Disk, Mozy, Carbonite Free, Dropbox); Email a copy to yourself
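The three-copy idea on this slide can be sketched in a few lines of Python. This is only a minimal illustration, assuming the local and remote storage are both reachable as ordinary folders (for example, an external drive and a mounted network share). The function names `sha256_of` and `make_backup_copies` are made up for this example and do not belong to any backup service named on the slide.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_backup_copies(master, backup_dirs):
    """Copy the master file into each backup folder and verify every copy.

    Assumption: each entry in backup_dirs is a folder path that is already
    reachable from this computer (hypothetical local and remote locations).
    """
    master = Path(master)
    expected = sha256_of(master)
    copies = []
    for directory in backup_dirs:
        directory = Path(directory)
        directory.mkdir(parents=True, exist_ok=True)
        target = directory / master.name
        shutil.copy2(master, target)  # copy2 also keeps file timestamps
        if sha256_of(target) != expected:
            raise IOError(f"Checksum mismatch for {target}")
        copies.append(target)
    return copies
```

Comparing checksums after each copy is one simple way to confirm that a backup is an exact duplicate of the original master rather than a damaged or partial file.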
  • Describing and documenting data (metadata)
  • What countries have a five-pointed star on their national flag?
  • DOI: 10.1126/science.1207745 “outsourcing” our memory “we don’t remember information as well, when we expect to find it on a computer later”
  • If we outsource our memory to computers … We need good organization structures to: Find data from the past quickly and completely; Understand data from the past. It helps to: Document and describe data; “Assign metadata”
  • What do you document? Descriptive metadata elements: Title; Creator or contact; Date; Experimental conditions; Methodology; Version. Administrative metadata elements: Dictionary or codebook to explain the data variables. Structural metadata elements: File formats; File names; Tools and software needed for processing or visualizing the data
  • How to record metadata. Option 1: write metadata; save as readme.txt; store in file folder with data
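Option 1 could look something like the sketch below, which writes a few metadata fields as plain "Field: value" lines into a readme.txt stored next to the data. The function name `write_readme`, the field layout, and the placeholder values are assumptions for illustration only. Use whatever fields your research community expects.

```python
from pathlib import Path

# Example metadata. The title is borrowed from the XML slide; every other
# value is a placeholder to be replaced with real project details.
EXAMPLE_METADATA = {
    "Title": "Effect of salt on ice cream production efficiency",
    "Creator or contact": "(your name and e-mail)",
    "Date": "(date of data collection)",
    "Experimental conditions": "(e.g., temperature, sample group)",
    "Methodology": "(how the data were collected)",
    "Version": "(file version)",
}


def write_readme(folder, metadata):
    """Write metadata as 'Field: value' lines to readme.txt inside folder."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    readme = folder / "readme.txt"
    lines = [f"{field}: {value}" for field, value in metadata.items()]
    readme.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return readme
```

Calling `write_readme("my-data-folder", EXAMPLE_METADATA)` would create the folder if needed and place the readme.txt inside it, so the description travels with the data.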
  • Metadata form/file in an archive/repository. Option 2: http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html
  • Annotate. Option 3: <title>Effect of salt on ice cream production efficiency</title> <temperature>0</temperature> XML, a popular system for annotating data http://www.w3schools.com/xml/
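As a rough sketch of Option 3, Python's standard library can write and read this kind of XML annotation. The `title` and `temperature` elements come from the slide. The wrapping `dataset` element and the function names are assumptions added here, because an XML document needs a single root element.

```python
import xml.etree.ElementTree as ET


def build_annotation(title, temperature):
    """Return an XML string holding the title and temperature annotations."""
    root = ET.Element("dataset")  # hypothetical root element, not from the slide
    ET.SubElement(root, "title").text = title
    ET.SubElement(root, "temperature").text = str(temperature)
    return ET.tostring(root, encoding="unicode")


def read_annotation(xml_text):
    """Parse the XML string back into a dictionary of tag -> text."""
    root = ET.fromstring(xml_text)
    return {child.tag: child.text for child in root}
```

Because the tags name what each value means, both people and programs can later find the temperature without guessing which number in a file it was.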
  • Assign descriptive names. Descriptive file names: data1.csv versus 75-celsius-trial_control_ver002.csv. Descriptive folder names: Data > 1 > raw >> part A >> 110904 > readings versus Project-title > Trial 1 >> Experimental >> Control > Trial 2 > Trial 3. Consider these elements: • project title • experimental conditions and group • trial numbers • file version number indicating data modifications • date or time stamps
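A naming pattern like the one on this slide can be generated consistently with a small helper. The sketch below is one possible convention, not a standard: it assumes names are built from the experimental conditions, the group, and a three-digit version number, and the function names `slugify` and `descriptive_name` are made up for this example.

```python
import re


def slugify(text):
    """Lower-case text and replace runs of other characters with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def descriptive_name(conditions, group, version, extension="csv"):
    """Build a file name such as 75-celsius-trial_control_ver002.csv."""
    return f"{slugify(conditions)}_{slugify(group)}_ver{version:03d}.{extension}"
```

For example, `descriptive_name("75 celsius trial", "control", 2)` gives the same name shown on the slide, and zero-padded version numbers keep files in order when sorted by name.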
  • Australia Brazil United States of America Cape Verde Ethiopia
  • Sharing data
  • Historic data sharing: Galileo, Newton, Huygens, Hooke: Anagrams to secure discoveries. Versus the “open science revolution” of journals today
  • Open science: Share research data, products, and communications openly. Potential benefits: Protects unique data that cannot be readily replicated; Reinforces open scientific inquiry; Encourages diversity of analysis and opinion; Promotes new lines of research; Makes possible the testing of new or alternative hypotheses and methods of analysis; Supports studies on data collection methods and measurement; Facilitates the education of new researchers; Enables the exploration of topics not envisioned by the initial investigators; Permits the creation of new datasets when data from multiple sources are combined; Provides content for scientific education
  • Data sharing examples Crystal structure of M-PMV retroviral protease
  • Private sector too! Cross-sector data sharing for Alzheimer’s research http://www.adni-info.org (News story)
  • Increased citation rate
  • Funding agency policies: NIH Data Sharing Policy; NSF Data Sharing Policy: Data management plan for grant applications
  • Journal expectations: Data sharing as a term of publication
  • How do I share? Personal sharing. Share-upon-request: Email me for a copy! Self-archive: Download from my personal website!
  • Journal publishing
  • Institutional archive or repository UC3 Merritt repository
  • Public archive or repository. Ideal characteristics: Popular with national/global coverage; Specific to your discipline; Offers long-term preservation. Find an archive/repository: Ask colleagues; Search http://databib.lib.purdue.edu/ The Ancient Agora of Athens
  • Public versus institutional archives and repositories. Public archives/repositories: Create comprehensive dataset for a larger research problem space; Domain-specific archives/repositories may provide better support. Institutional archives/repositories: May restrict to a smaller audience; May offer greater control of your data
  • Help others find your data: Berkeley www.berkeley.edu/mystuff/super-data.csv (old URL is kaput) file moves to Stanford www.stanford.edu/mystuff/super-data.cs
  • Try permanent identifiers. DOI: Digital object identifier. Resolve DOI by visiting http://dx.doi.org/ followed by DOI. File can move, but DOI remains the same. The DOI record stores location details
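The resolving rule on this slide (the resolver address followed by the DOI) is simple enough to write down as a short helper. This is a minimal sketch. The function name `doi_to_url` is made up for illustration, and the optional "doi:" prefix handling is an assumption about how DOIs are often written in citations.

```python
from urllib.parse import quote

RESOLVER = "http://dx.doi.org/"  # resolver address given on the slide


def doi_to_url(doi):
    """Return the resolver URL for a DOI, accepting an optional 'doi:' prefix."""
    doi = doi.strip()
    if doi.lower().startswith("doi:"):
        doi = doi[4:].strip()
    # Keep the slash between prefix and suffix; escape any unusual characters.
    return RESOLVER + quote(doi, safe="/")
```

For instance, `doi_to_url("10.1126/science.1207745")` builds the link for the memory study cited earlier in this talk. The link keeps working even if the publisher moves the file, because only the DOI record needs updating.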
  • Generate permanent identifiers: http://n2t.net/ezid Subscription through the UCB Library: request your free account by emailing data-consult@lists.berkeley.edu
  • Final tips for sharing Be selective Recognize restrictions (privacy and confidentiality) Online services for sharing among your team Research Hub 3rd party services
  • Data management planning
  • What is a data management plan? A plan for organizing, storing, and sharing data
  • Planning associated with greater self-control for: exercise; medical adherence; self-health exams; sunscreen use; schoolwork; refraining from a negative behavior. Source: Townsend and Liu, 2012. Perhaps planning helps for data management
  • Why have a plan? Prepare for efficient and quality data collection that is safe and shareable NSF and NIH requirements
  • NSF requirements: Data management plan ≤ 2 pages describes how data will be managed, disseminated, and shared. Plan undergoes peer review
  • Writing an NSF data management plan. Specific requirements vary by NSF divisions. In general, describe: Types of research data and materials produced; Standards for data format, content, and metadata; Policies for access and sharing; Policies for re-use, re-distribution, and derivatives; Plans for archiving and preserving. You can explain why data will not be shared. Examples 1 and 2
  • NIH requirements: Timely data sharing encouraged. If requesting ≥ $500k per year, a plan is required. Describe how data will be shared or why sharing is not possible. In the final progress report, describe data sharing actions taken
  • Writing an NIH data sharing plan: A brief paragraph. Suggested topics: Schedule for sharing; Format of the data; Documentation of the data; Analytic tools provided; Data-sharing agreements (criteria and conditions); Mode of data sharing
  • NIH plan example 1: The proposed research will involve a small sample (less than 20 subjects) recruited from clinical facilities in the New York City area with Williams syndrome. This rare craniofacial disorder is associated with distinguishing facial features, as well as mental retardation. Even with the removal of all identifiers, we believe that it would be difficult if not impossible to protect the identities of subjects given the physical characteristics of subjects, the type of clinical data (including imaging) that we will be collecting, and the relatively restricted area from which we are recruiting subjects. Therefore, we are not planning to share the data.
  • NIH plan example 2: This application requests support to collect public-use data from a survey of more than 22,000 Americans over the age of 50 every 2 years. Data products from this study will be made available without cost to researchers and analysts. https://ssl.isr.umich.edu/hrs/ User registration is required in order to access or download files. As part of the registration process, users must agree to the conditions of use governing access to the public release data, including restrictions against attempting to identify study participants, destruction of the data after analyses are completed, reporting responsibilities, restrictions on redistribution of the data to third parties, and proper acknowledgement of the data resource. Registered users will receive user support, as well as information related to errors in the data, future releases, workshops, and publication lists. The information provided to users will not be used for commercial purposes, and will not be redistributed to third parties.
  • Library guidance: Guides, templates, examples http://www.lib.berkeley.edu/sciences/data/guide
  • Online service for building data plans: Step-by-step instructions for meeting funding agency requirements https://dmp.cdlib.org/
  • Data ethics
  • 3247 respondents. 0.3% admitted to falsification or “cooking” research data. About 1 in 3 confessed to committing at least one of 10 serious misbehaviors. Study by Martinson et al., 2005. Source - doi:10.1038/435737a Motivated by increasing pressure to publish papers and win
  • Citing data
  • Prevent distortions and manipulations Keep raw original data Log all changes made
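One lightweight way to "log all changes" is to append a line to a plain-text log each time a data file is modified. The sketch below is an assumption about what such a log might hold (a UTC timestamp, the file name, a checksum of the file as it stands, and a short description). The function name `log_change` and the tab-separated layout are illustrative choices, not an established standard.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def log_change(log_path, data_file, description):
    """Append one tab-separated entry describing a change to data_file."""
    data_file = Path(data_file)
    # The checksum records exactly which version of the file the entry refers to.
    digest = hashlib.sha256(data_file.read_bytes()).hexdigest()
    timestamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    entry = f"{timestamp}\t{data_file.name}\t{digest}\t{description}\n"
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(entry)  # append mode, so earlier entries are never overwritten
    return entry
```

Keeping the untouched raw file alongside a log like this lets someone else retrace each step from the original data to the published result.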
  • Data licensing. Restrictions on data use, for example: No for-profit use; No re-sharing; Give attribution. Check for license/terms of use
  • Stay current with data requirements. Review for changes to policies by: Funding agencies; University regulations; Federal and state governments
  • Haiku summary: Data is precious / Safely store and share widely / Good for your career