Writing a successful data management plan with the DMPTool
1. Writing a Data Management Plan with the DMPTool
Kathleen Fear
January 15, 2015
2. Goals of this session
⢠Learn how the DMPTool can help you
generate a DMP
⢠Learn the basic components of a DMP
⢠Understand how good data management
practices translate to a good DMP
3. What is a DMP?
A formal plan outlining how you will handle your data throughout and after your project…
…which is now required by many funders…
…and which is a good idea anyhow, even if it's not required.
13. You can only add someone if they've logged into the DMPTool before
16. Data Products
Describe the kind of data you're collecting or using, whether it's digital…
…or physical.
(or all of the above)
17. Data Products: What to specify
⢠What are your data products, both primary
and derived?
⢠When will you collect / produce each data
product?
⢠How much data will you generate?
19. Documenting data
⢠Data are machine readable, but must also be
understandable to humans
⢠What information would someone else (or
you, long in the future) need to understand
the data?
20. Data and Metadata Standards: What to specify
⢠File formats:
â Open or proprietary?
⢠If you need special software to open a file, how will you
ensure its accessibility over time?
â Standard or non-standard?
21. Data and Metadata Standards: What to specify
⢠Naming standards:
â Can you tell what a file is and what it contains
without opening it? How do your files relate to
one another?
22. Data and Metadata Standards: What to specify
⢠Metadata: Contextualizing information about
an object, physical or digital
⢠Some fields have defined standards; some
repositories ask for a specific set of metadata
24. A DMP does NOT:
Require that you share all data with anyone who wants it
"at no more than incremental cost and within a reasonable time" (NSF)
"indicate the criteria for deciding who can receive your data" (NIH)
25. Access and sharing: What to specify
• What data products will you share freely? When? How?
– Data necessary for replication of public results
– Other data?
• What data products won't you share freely? Why not?
• How will you resolve ethical or privacy issues?
• Consider restrictions, embargo, etc. for data that can't be immediately shared freely
26. Access and sharing: What to specify
⢠Backup:
â Where? (and what?)
⢠Local (hard drive, dept/local server, personal laptop, flash
drive) vs. distant (PDC, hard drive at home)
⢠Central (PDC, UR Research) vs. cloud (Amazon, Box,
CrashPlan, Google Drive)
â How often?
â Whoâs responsible?
⢠Security: Locked cabinets? Password-protected
computer? Non-networked storage?
27. Access and sharing: Placing data in a repository
• Long-term commitment to data preservation
• Higher visibility for your data
• Permanent URL / DOI enables data citation
• Reuse tracking and usage statistics
28. Access and sharing: Placing data in a repository
• UR Research: https://urresearch.rochester.edu/home.action
– Example: STOP-ROP Clinical Trial
30. Access and sharing: Placing data in a repository
• UR Research: https://urresearch.rochester.edu/home.action
• Repository directories: re3data.org; biosharing.org
32. ⢠Integration with journal submission
processes
⢠Link to data held elsewhere
⢠Not free: $80/submissionâŚ
⢠âŚbut talk to us about a voucher
33. Reuse and distribution
⢠Who is the audience for your data?
⢠What possible uses might someone make of
your data?
⢠Are there any permissions restrictions
necessary?
34. Plans for archiving and preservation
⢠How long should data be retained for?
⢠Where will the data be placed for long-term
preservation? What policies are in place there
to guarantee its preservation?
⢠How will you ensure accessibility and usability
over the long term?
â Data transformations?
â Archiving associated information?
36. Revisiting Metadata and Documentation
• Information about data processing, collection details: the "story" of the data
(…but it's all in the paper!)
• Are your variable names meaningful? Is it clear how different parts of the dataset relate to each other? Is it in a format others can use?
38. One size does not fit all…
• But we'll cover general guidelines
39. A little help: UR Data Management website
library.rochester.edu/data
40. A little help: consultation
⢠Call me! (Or email, or drop by.)
5-6882
Carlson 313E
kfear@library.rochester.edu
⢠DMP consultation & review; trainings; data
archiving support; etc.
41. A request
⢠When you get a grant funded, send me your
DMP.
⢠If youâre comfortable, if you get negative
feedback on your DMP, share it with me.
Starting from the very beginning: what is a data management plan?
Well, it's a plan that spells out what you do with your data, during and after a project. Many funders require these now (the NSF has asked for a data management plan as part of all grant proposals since 2011), but they can be a useful exercise even if your funder doesn't require one.
The NSF requirements are among the most comprehensive, and those are what we'll focus on today. The NSF in some ways led the way on data management plans, so many funders' requirements take their cue from the NSF; if you know how to write an NSF plan, you're in great shape in general.
I'm going to switch to a live demo now, but please note that screenshots of all the pages I go through will be captured in the slides, so if you download the PowerPoint, you'll have a walkthrough of the demo as well.
So the first part is data products. This is the easiest piece: what data are you dealing with? When I say research data, what I mean is the factual, recorded material necessary to support and validate your findings. This can mean a range of things, depending on your research: measurements from instruments, observations recorded in a lab notebook, video or audio recordings, transcripts of the same, survey results, and so on.
In this section, you'll want to account for all those things, as well as any data products derived from your raw data. There's often a time element here: at the start of the project you'll generate raw data, and then at a later point process it and generate something new. Maybe you're doing multiple waves of data collection. Be specific about what you're collecting and when, as well as how much. You don't need to be super specific, but a ballpark estimate (10 GB, or 100 GB, or terabytes) is important.
(back to browser)
Next you'll specify data and metadata standards: how your data is described, documented, and organized.
The main idea here is understandability: how will you ensure that your data is meaningful for the duration of your project and beyond?
In part, this is a technical question. An important data standard to specify is the file format you'll use, and especially whether it is open or proprietary. If you are using proprietary software, you need to plan ahead for any problems that may arise when new versions of the software come out, especially if the software you're using is not only proprietary but also not widely adopted. This goes for custom binaries as well. Think about circumstances under which you might lose access to your data, and how you might prevent that from happening: by using an open format, by ensuring that all team members continue to have access to the software, and so on. I'll show an example of one strategy a little later on.
There's also a policy piece here, in the actions you take to make sure the data are documented appropriately. An extremely easy but effective tool is a naming standard: if you don't already do this, it's good practice to name your files in a way that you know what's in them without needing to open them. This is especially useful if you aren't the only one touching the files.
As an example here, you might have a sample or reagent or some sort of material you're working with, the output from an instrument used in an experiment, the lab notebook where you recorded the work, and another document where you did some kind of analysis. If this is my stuff, I know that all these go together, but it's not obvious to anyone else. And six months down the road, it might not be obvious to me anymore, either.
But if you set a standard early on, in this case taking an identifier and carrying it through, you can head off these kinds of problems.
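To make that concrete, here is a minimal sketch of such a convention (the project name, the sample ID "S042", and the file types are hypothetical illustrations, not from the talk): one identifier is carried through every file that belongs to the same sample.

```python
# A minimal sketch of a file-naming convention (scheme and names are
# hypothetical): <project>_<sampleID>_<content>_<YYYYMMDD>.<ext>
def make_filename(project, sample_id, content, date_str, ext):
    return f"{project}_{sample_id}_{content}_{date_str}.{ext}"

# The same sample identifier, "S042", ties the related files together:
for content, ext in [("reagent-log", "csv"),
                     ("instrument-output", "csv"),
                     ("lab-notebook", "pdf"),
                     ("analysis", "xlsx")]:
    print(make_filename("myproj", "S042", content, "20150115", ext))
# -> myproj_S042_reagent-log_20150115.csv, and so on
```

With a scheme like this, anyone can sort the directory and see at a glance which files belong to which sample, without opening anything.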
In your plan, you don't need to get that specific, but you can say something like: at the outset of the project, the team will establish a naming convention to ensure accessibility and consistency of results. Then you work out the details with your team when the project gets going.
As a side note, this is useful to establish in your lab if you oversee students and you want to be able to peek in on their work and know what's going on.
Metadata standards are really all about how you record the contextual information about your data. Some fields or repositories do have defined standards, so noting that you'll adhere to those standards in your plan is sufficient.
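If your field has no formal standard, even a simple record stored alongside each data file beats nothing. Here is a minimal, hypothetical sketch (the field names are illustrative only; use your discipline's or repository's standard if one exists):

```python
import json

# A minimal, hypothetical metadata record saved alongside a data file.
# The field names are illustrative, not a formal standard.
metadata = {
    "title": "Instrument output for sample S042",
    "creator": "Jane Researcher",
    "date_collected": "2015-01-15",
    "instrument": "spectrometer (hypothetical model)",
    "units": {"wavelength": "nm", "intensity": "counts"},
    "related_files": ["myproj_S042_analysis_20150115.xlsx"],
}

# Save it next to the data it describes, using the same naming convention.
with open("myproj_S042_instrument-output_20150115.json", "w") as f:
    json.dump(metadata, f, indent=2)
```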
The third piece focuses on access and sharing. This, by the way, is the piece that the NIH is particularly concerned with.
To clarify: you are not, generally, required to share everything with everyone. The guidelines (here the NSF and NIH are just examples) are fuzzy, which is troubling if you're looking for guidance, but also gives you some latitude to determine the best strategy for your research.
The sharing piece is really about what you can share, when, and how you'll do it. If you return to that list you generated in the first part, what things might be of value to share with others? What things can't be shared?
A good rule of thumb is that, at minimum, you should consider sharing the data necessary to replicate your published results. Now that's still a little loose: for some folks, that'll mean sharing x-y coordinates to reproduce a figure; for others, that might mean putting raw data and code out there. There's really no hard-and-fast rule; the important thing is to act in good faith and in accordance with the norms of your discipline.
If there are things you can't share, because of ethical concerns, intellectual property interests, or other reasons, identify them and say why they can't be shared.
And also consider whether there are things that you can't share right now, but could after a period of time: after the paper is published, after 18 months, and so on.
The flipside of sharing is that you want to be able to control the data up until the point you're ready to share, and that means establishing secure storage and making sure the data are backed up.
Backup is critical. Have at least one additional copy of your data, ideally two or three, on different media, with at least one stored separately, either in the cloud or on a hard drive at home. Even more important than the technology, though, are the people around it. Who is responsible for making sure the backups happen, and for checking that the backups are functioning correctly? You can have a wonderful system set up, but if nobody pushes the button, you're in trouble.
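That "checking" step is easy to automate. Here is a minimal sketch (the directory paths are hypothetical) that compares checksums of the working copy against the backup copy, so a silently failing backup gets noticed:

```python
import hashlib
from pathlib import Path

# A minimal sketch of backup verification (paths are hypothetical):
# compare a checksum of every file in the working copy against the
# corresponding file in the backup copy.

def checksum(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def verify(working, backup):
    problems = []
    for src in Path(working).rglob("*"):
        if not src.is_file():
            continue
        dst = Path(backup) / src.relative_to(working)
        if not dst.exists():
            problems.append(f"missing from backup: {src}")
        elif checksum(src) != checksum(dst):
            problems.append(f"contents differ: {src}")
    return problems

for problem in verify("/data/myproj", "/backups/myproj"):
    print(problem)
```

Run on a schedule, with a named person responsible for reading the output, this closes the "nobody pushed the button" gap.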
The kind of security you need depends heavily on the kind of work you're doing: it can range from storage in locked cabinets to a password-protected computer to non-networked storage.
Now the last piece is how to share. Many people's first instinct is to keep the data on a server locally, in the lab, and then send it out via email or whatever when someone requests it. It's a nice idea, but there have been several studies in the last few years on how well it actually works, and as it turns out, that approach is pretty terrible. Success rates are very, very low. And it's not because people are refusing to send data; it's because they can't. They don't have the computer anymore. The hard drive is in a box somewhere. They just flat out don't have time to track it down and send it.
So when it comes to sharing, one of the best things you can do is take a "set it and forget it" approach, by putting data into a repository. Repositories offer a bunch of other good things as well: a long-term commitment to preservation, higher visibility (which can increase citation counts to your papers), the ability for data users to cite your data directly, and the ability for you to see how many eyes there are on your data, and how many people are interested in it.
We have a campus option for depositing data, and that's UR Research. Though UR Research primarily houses pre-prints, it can also take your data.
Here's an example.
There are also lots of discipline-specific repositories, some of which you might already be familiar with. Re3data is a directory of repositories, so you can check out what's available for your field. Biosharing is similar but focused on biology.
Another option is a general-purpose repository such as Dryad. Here's an example of a study in Dryad. You can see this is a dataset associated with a paper published in PLOS ONE.
The nice part about Dryad is that they've partnered with a number of publishers, so if you submit to one of them, you can submit your data with your manuscript, and Dryad and the journal will coordinate the review and publication process of the paper and the data.
It is not free, but we are a partner of this repository, and we have a limited number of vouchers available to cover the cost of data submission. Talk to me or anyone else in the library if you are interested.
Next up is a related topic: reuse and distribution. Who might be interested in your data? Think broadly: other scientists? Policy-makers? And so on.
Last but not least, archiving and preservation. This piece deals with what happens when your project ends. What happens to your data? Note that the University of Rochester's policy is to maintain data for 3 years after the close of a project, so that's really the minimum amount of time you need to hold on to your data.
As I just mentioned, a repository is a really good option for dealing with your data after a project is done, and if you want your data preserved for a long time, you do need to think about future-proofing it.
I said earlier I'd show a strategy for doing that, and here it is.
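One widely used strategy along these lines (a sketch under assumed file names, and not necessarily the exact example shown on the original slide) is to keep, alongside each file in a proprietary format, a copy in an open format that any software can read:

```python
# A sketch of format migration for future-proofing (the file name is
# hypothetical). Requires the pandas and openpyxl packages.
import pandas as pd

src = "myproj_S042_analysis_20150115.xlsx"  # needs Excel-capable software
df = pd.read_excel(src)
df.to_csv(src.replace(".xlsx", ".csv"), index=False)  # open, plain-text copy
```

The original file stays as the authoritative version; the open copy is insurance against losing access to the software.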
Now once you've put together your draft, you do have the option of getting it looked over. If you're doing it on your own, you can feel free to email it to me, but the DMPTool makes it especially easy. The "submit for review" button sends your plan right along to me or another subject librarian, and we'll take a look and give you feedback.
Please don't hesitate to ask for a review, even if it's last minute. I'm happy to turn these around pretty quickly.
Just a last note: there really is a lot of wiggle room in the NSF guidelines. There isn't always one right answer when it comes to this kind of planning. And if you think about it, this is an extension of the argument you make in the rest of the proposal: you argue that your question is an interesting and important one to explore, that your approach is the right way to do it, that you and your colleagues are the best ones to carry out the work, and in the data management plan, you set out your strategy for handling your data and make the case that it's the right approach for your project.
We have quite a bit of information about data management planning online at our website, but you can also contact me any time.
Thanks very much for your attention, and I hope this was helpful!