A presentation for the Food and Nutrition Science Responsible Conduct of Research class on data management best practices. It covers the material in the context of writing a data management plan.
Responsible conduct of research: Data Management
1. Responsible Conduct of Research: Managing Data
Tobin Magle, Data Management Specialist
Nicole Kaplan, Information Manager
Daniel Draper, Digital Repositories Unit Coordinator
2. Responsible Conduct of Research:
The data management firehose!
C. Tobin Magle, PhD
Please ask me for help with data management!
3. My Background: molecular microbiology
(1) Magle CT, et al. Infect Immun. 2014 Feb;82(2):618-25. doi: 10.1128/IAI.00444-13. Epub 2013 Nov 25.
(2) Sun W, Tanaka TQ, Magle CT, et al. Sci Rep. 2014 Jan 17;4:3743. doi: 10.1038/srep03743.
5. Individual help for ANY data topic
• How do I write a DMP?
• How do I organize my data?
• How do I clean and format my data?
• How do I use R?
• How do I get my data ready to share?
• How do I comply with funder mandates?
• What DM tools are there for collaboration?
7. What is data management?
The policies, practices and procedures needed to manage the storage, access and preservation of data produced from a research project
12. [Figure panels: Working email; Response (if email working); Status of data (if response); Data are extant (if status known)]
doi:10.1016/j.cub.2013.11.014
13. We are losing vast amounts of data
Who is responsible?
14. CSU data policy
General points:
• The university owns, and is therefore ultimately responsible for, research data
• Researchers are the data managers
• The university promotes openness
http://policylibrary.colostate.edu/policy.aspx?id=737
15. You’re a Data Manager
http://www.phdcomics.com/comics/archive.php?comicid=382
16. CSU data policy
Research Data Associated with Theses and Dissertations
To preserve the complete scholarly record of the author, data
sets must be incorporated. Therefore, a student depositing their
thesis or dissertation is required to make discoverable,
accessible and available their associated data sets in
accordance with this policy and provisions of the University’s
Digital Repository. Access and rights management (embargo
period, access limited to specific IP addresses) shall be the same
for the associated data sets as it is for the thesis or the
dissertation.
http://policylibrary.colostate.edu/policy.aspx?id=737
30. What is research data?
• “The recorded factual material commonly accepted
in the scientific community as necessary to validate
research findings”
- White House Office of Management and Budget
• Reality: Applies to any research product
32. What is a data
management plan?
A description of how you plan to describe, preserve
and share your research data.
Often required by funding agencies
33. Successful DMPs include
• A data inventory, including type(s) and size
• A strategy for describing the data
• A plan for preserving the data
• A method for access to the data
Always make sure to follow funder requirements
34. Tool: DMPTool
• Review requirements from
different agencies
• https://dmptool.org/guidance
• Create new DMPs based on
funding agency templates
• Search public DMPs
35. Data inventory
• What type of data are you going to collect?
• What file type will be produced?
• What size will these files be? How many files?
• How will you organize the data?
• What other research outputs will be produced?
• Code/Software?
• Templates/protocols?
36. Data inventory
• What type of data are you going to collect? miRNA sequences
• What file type will be produced? FASTQ files
• What size will these files be? How many files? 1 GB per file x 64 strains x 3 replicates = ~200 GB
• What other research outputs will be produced?
• Code/Software? R scripts for analysis and visualization
• Templates/protocols? Data use tutorials
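The size estimate on this slide is plain multiplication, so it is easy to check with a few lines of Python. This is a minimal sketch: the function name `estimate_storage_gb` is hypothetical, the inputs are the figures from the slide, and it assumes one file per strain per replicate.

```python
def estimate_storage_gb(gb_per_file: float, n_strains: int, n_replicates: int) -> float:
    """Return the total storage in GB, assuming one file per strain per replicate."""
    return gb_per_file * n_strains * n_replicates


# Figures from the slide: 1 GB per file, 64 strains, 3 replicates.
total_gb = estimate_storage_gb(1, 64, 3)
print(f"Estimated storage: {total_gb} GB")  # 1 x 64 x 3 = 192, which the slide rounds to ~200 GB
```

Rounding up, as the slide does, leaves headroom for processed files and other outputs.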
37. Data formats
• Avoid proprietary formats
• Know what software can read your data
Proprietary format | Open format
Excel (.xls, .xlsx) | Comma Separated Values (.csv)
Word (.doc, .docx) | Plain text (.txt)
PowerPoint (.ppt, .pptx) | PDF/A (.pdf)
Photoshop (.psd) | TIFF (.tif, .tiff)
Quicktime (.mov) | MPEG-4 (.mp4)
MPEG-4 Protected audio (.m4p) | MP3 (.mp3)
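One way to apply the table above is to scan a list of file names and flag those in a proprietary format. This is a sketch only: the mapping is copied from the slide, while the function name `suggest_open_format` and the example file names are hypothetical.

```python
from pathlib import Path
from typing import Optional

# Proprietary extension -> open alternative, as listed on the slide.
OPEN_ALTERNATIVES = {
    ".xls": ".csv", ".xlsx": ".csv",
    ".doc": ".txt", ".docx": ".txt",
    ".ppt": ".pdf", ".pptx": ".pdf",
    ".psd": ".tif",
    ".mov": ".mp4",
    ".m4p": ".mp3",
}


def suggest_open_format(filename: str) -> Optional[str]:
    """Return the suggested open extension, or None if the extension is not in the table."""
    return OPEN_ALTERNATIVES.get(Path(filename).suffix.lower())


# Hypothetical file names for illustration.
for name in ["counts.xlsx", "protocol.docx", "reads.fastq"]:
    suggestion = suggest_open_format(name)
    if suggestion:
        print(f"{name}: consider also saving a copy as {suggestion}")
    else:
        print(f"{name}: not in the proprietary list")
```

The sketch only reports suggestions; the actual conversion depends on the software that produced each file.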
38. Q’s: Data Inventory
What kind of data are you going to collect?
What file type will be produced?
What size will these files be? How many files?
What other research outputs will be produced?
39. Folder systems
• Identify ways to divide your
data into categories
(Attributes)
• Top level organization is the
most important attribute
• Provide documentation
41. Q’s: Data Organization
• What kinds of files are there? (See data inventory)
• How could you group them?
• Project?
• Time?
• Location?
• File type?
• What are the most important attributes?
42. Tool: Open Science Framework
• Components
• Add-ons
• Contributors
• Wiki
http://help.osf.io/m/collaborating/l/524109-using-the-wiki
http://www.slideshare.net/DuraSpace/121014-slides-roadmap-to-the-future-of-share
43. Organization rules
• Be consistent
• One directory per project
• Separate subdirectories for
• Raw data
• Processed data
• Code (processing and analysis)
• Output
• Make raw data read-only
• Make README files
http://help.osf.io/m/60347/l/611391-organizing-files
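The organization rules above can be set up with a short script. This is a minimal sketch under stated assumptions: the folder names in `SUBDIRS`, the README contents, and the function names are hypothetical, and the demo runs in a temporary directory so nothing is left behind.

```python
import stat
import tempfile
from pathlib import Path

# Hypothetical folder names, one per category on the slide.
SUBDIRS = ["raw_data", "processed_data", "code", "output"]


def create_project(root: Path) -> Path:
    """Create one project directory with the separate subdirectories and a README."""
    for sub in SUBDIRS:
        (root / sub).mkdir(parents=True, exist_ok=True)
    readme = root / "README.txt"
    if not readme.exists():
        layout = "".join(f"  {sub}/\n" for sub in SUBDIRS)
        readme.write_text("Folder layout:\n" + layout)
    return root


def make_read_only(folder: Path) -> None:
    """Remove write permission from every file under the given folder."""
    read_only = stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH
    for path in folder.rglob("*"):
        if path.is_file():
            path.chmod(read_only)


# Demo: build the layout, add one raw file, then lock the raw data.
with tempfile.TemporaryDirectory() as tmp:
    project = create_project(Path(tmp) / "my_project")
    (project / "raw_data" / "sample.csv").write_text("id,value\n1,2\n")
    make_read_only(project / "raw_data")
    print(sorted(p.name for p in project.iterdir()))
```

Locking the raw files after collection means any cleaning has to write to the processed data folder, which keeps the original record intact.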
45. A strategy for describing the data
• Metadata: Relevant information
for re-creation and re-use
• Contact info
• How data was collected
• Details about collection
• Date, location of collection
• Units
• Can be as simple as a text file
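As an illustration of metadata kept in a simple text file, the sketch below formats a record as `Field: value` lines and reports which of the fields listed on the slide are still empty. The field labels follow the slide; the function names and all values are placeholders, not real data.

```python
from typing import Dict, List

# Field labels based on the bullets on this slide.
EXPECTED_FIELDS = [
    "Contact", "Collection method", "Collection details",
    "Date", "Location", "Units",
]


def missing_fields(record: Dict[str, str]) -> List[str]:
    """Return the expected fields that are absent or empty in the record."""
    return [field for field in EXPECTED_FIELDS if not record.get(field)]


def format_metadata(record: Dict[str, str]) -> str:
    """Render the record as plain text, one 'Field: value' line per entry."""
    return "".join(f"{key}: {value}\n" for key, value in record.items())


# Placeholder values for illustration only.
record = {
    "Contact": "researcher@example.edu",
    "Collection method": "describe the instrument or protocol here",
    "Date": "YYYY-MM-DD",
    "Units": "list the unit for each measured variable",
}

print(format_metadata(record))
print("Still missing:", missing_fields(record))
```

The output of `format_metadata` can be saved next to the data as a README file.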
46. Metadata standards
• Dublin Core: http://dublincore.org/documents/dcmi-terms/
• Can be applied to anything
• Many discipline specific metadata standards
• EML: https://knb.ecoinformatics.org/#external//emlparser/docs/index.html
• MIAME: http://fged.org/projects/miame/
• Search for other standards:
• http://www.dcc.ac.uk/resources/metadata-standards
• https://biosharing.org/standards/
48. Q’s: Describe your data
What do people need to know to reuse your data?
Are there any discipline-specific metadata standards?
What format will you describe your data in (text, XML, tabular)?
What fields will you include (author, date, format, identifier?)
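If XML is the chosen format, a record using Dublin Core elements might look like the sketch below. It assumes the Dublin Core element namespace `http://purl.org/dc/elements/1.1/`, where "author" corresponds to the `creator` element; the `metadata` wrapper element, the function name, and all values are hypothetical placeholders.

```python
import xml.etree.ElementTree as ET
from typing import Dict

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def dublin_core_record(fields: Dict[str, str]) -> str:
    """Serialize a dict of Dublin Core element names and values as an XML string."""
    root = ET.Element("metadata")  # hypothetical wrapper element
    for name, value in fields.items():
        element = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        element.text = value
    return ET.tostring(root, encoding="unicode")


# Placeholder values for illustration only.
xml_text = dublin_core_record({
    "creator": "Researcher Name",
    "date": "YYYY-MM-DD",
    "format": "text/csv",
    "identifier": "placeholder-identifier",
})
print(xml_text)
```

A repository or discipline standard may require additional elements or a different wrapper, so check its submission guidelines before settling on a layout.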
49. A plan for preserving the data
• Where will it be stored?
- Backups
• Necessary metadata and
other products
• Who is responsible?
• How long?
50. Ellin, A. Rutgers Student Offers $1,000 for Data on Stolen Laptop. ABC News via Good Morning America. April 26, 2013.
http://abcnews.go.com/blogs/business/2013/04/rutgers-student-offers-1000-for-data-on-stolen-laptop/
Backup
51. Backup recommendations
• Store in geographically
distinct locations
• How often?
• Automation: Will you remember
to do it manually?
• Security: Are you working with
PHI?
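The automation point can be illustrated with a small script that copies a file to a backup location and verifies the copy with a checksum. This is a sketch, not a full backup system: the function names are hypothetical, and the demo uses a temporary directory, whereas a real `backup_dir` would point at storage in a different physical location. Scheduling the script and handling protected health information (PHI) are outside its scope.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(source: Path, backup_dir: Path) -> Path:
    """Copy source into backup_dir and raise an error if the checksums differ."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / source.name
    shutil.copy2(source, target)  # copy2 also preserves file timestamps
    if sha256_of(source) != sha256_of(target):
        raise OSError(f"Backup of {source} does not match the original")
    return target


# Demo with throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "results.csv"
    original.write_text("id,value\n1,2\n")
    copy = back_up(original, Path(tmp) / "backup")
    print("Verified copy:", copy.name)
```

Running a script like this from a scheduler removes the need to remember manual backups.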
52. Q’s: Preservation plan
What will you store?
Who will be responsible for the data (person or position)?
How long will you store it?
Where will you store it?
How will you back it up?
*Differentiate between working and archived data
53. A method to access the data
• Important to funding agencies
• Reproduce existing research
• Promote further research
• Must be easily available:
• No “by request only”
• Embargoes are “ok”
• Data security: consider privacy
and IP issues before sharing
54. Data access and sharing best practices
• Non-proprietary formats
• Include metadata
• As open as possible
• Follow CSU research data policy
55. Trusted Repositories: store and share
• Discipline specific
• Search: http://service.re3data.org/browse/by-subject/
• Generic
• Figshare - https://figshare.com/
• Dryad - http://datadryad.org/
• CSU Digital Repository
• http://lib.colostate.edu/digital-collections/
Image: http://67.media.tumblr.com/6228cbe58a9652f1a85e8ab1ed08d715/tumblr_inline_n6oukhNlZW1qf11bs.png
56. Tool: CSU digital repository
• Over 100 Datasets
• Satisfy requirements
for manuscripts and
grants
• At no cost <1 TB
• $150/TB for 5 years
• $300/TB for >5 years
57. Theses and Dissertation Data
1. Submit to ProQuest with thesis or dissertation
• Supplemental data file
• Only discoverable through thesis or dissertation
2. Submit to CSU Library separately
• Requires distinctive descriptive metadata
• Linked with thesis or dissertation
• Data discoverable globally
58. Q’s: Access methods
Where will people be able to access the data?
Does your discipline have a repository?
Are you complying with CSU’s data policy?
How will you format the data for CSU digital repository?
59. Need help?
• General: library_data@colostate.edu
• Direct: tobin.magle@colostate.edu
• DMPTool: http://dmptool.org/
• Data Management Services website:
http://lib.colostate.edu/services/data-management
Editor's Notes
As one of my colleagues so kindly put it, I should tell you all “that I’m a weirdo”
The number of PhDs is growing, hence….
Ecological Metadata Language, Geospatial Metadata — Federal Geographic Data Committee