Our regular Introduction to Data Management (DM) workshop (90-minutes). Covers very basic DM topics and concepts. Audience is graduate students from all disciplines. Most of the content is in the NOTES FIELD.
4. Research data is:
"...the recorded factual material commonly accepted in the scientific community as necessary to validate research findings."
U.S. Office of Management and Budget, Circular A-110
5. Research data is:
"Unlike other types of information, research data are collected, observed, or created, for the purposes of analysis to produce and validate original research results."
University of Edinburgh, MANTRA Research Data Management Training, "Research Data Explained"
6. Data management:
Actions that contribute to effective storage, preservation and reuse of data and documentation throughout the research lifecycle.
7. What data management is not:
Data/computational science
Database administration
A research method:
• what data to collect
• how to collect them
• how to design an experiment
8. Increase visibility & impact
Protects investment
Increases research efficiency
Preservation
Saves time
Funding agency requirements
Why data management?
Further your field
image credit: http://www.flickr.com/photos/karenchiang/5041048588/
10. 85 cancer microarray clinical trial
publications
69% increase in citations for articles
w/publicly available data
Increase visibility & impact
11. Funder mandates
"...directed Federal agencies with more than $100M in R&D expenditures to develop plans to make the published results of federally funded research freely available to the public within one year of publication and requiring researchers to better account for and manage the digital data resulting from federally funded scientific research."
15. Data types
Observational | Can't be recaptured, recreated or replaced; Examples: sensor
readings, sensory (human) observations, survey results
Experimental | Should be reproducible, but can be expensive; Examples:
gene sequences, chromatograms, spectroscopy, microscopy, cell counts
Derived or compiled | Reproducible, but can be very expensive;
Examples: text and data mining, compiled database, 3D models
Simulation | Models and metadata, where the input can be more important
than output data; Examples: climate models, economic models, biogeochemical
models
Reference/canonical | Static or organic collection [peer-reviewed] datasets,
most probably published and/or curated; Examples: gene sequence databanks,
chemical structures, census data, spatial data portals
17. Data types: another classification
Qualitative data "is a categorical measurement expressed not in terms of numbers, but rather by means of a natural language description. In statistics, it is often used interchangeably with 'categorical' data." See also: nominal, ordinal
Source: Wikibooks: Statistics
Quantitative data "is a numerical measurement expressed not by means of a natural language description, but rather in terms of numbers. However, not all numbers are continuous and measurable."
"My favorite color is blue-green." vs. "My favorite color is 510 nm."
18. More common data types
Geospatial data has a geographical or geospatial aspect. Spatial
location is critically tied to other variables.
Digital image, audio & video data
Documentation & scripts Sometimes, software code IS data;
likewise with documentation (laboratory notebooks, written observations,
etc.)
19. File Formats
"A file format is a standard way that information is encoded for storage in a computer file. It specifies how bits are used to encode information in a digital storage medium." - Wikipedia
Data type: Qualitative, tabular experimental data
Possible formats: Excel spreadsheet (.xlsx) | Comma-delimited text (.csv) | Access database (.mdb/.accdb) | Google Spreadsheet | SPSS portable file (.por) | XML file
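Of the formats above, comma-delimited text is the safest long-term choice because any tool can read it. A minimal sketch using Python's standard csv module (the file name and survey fields are hypothetical examples, not from the slides):

```python
import csv

# Hypothetical tabular survey data; field names and values are illustrative.
rows = [
    {"subject_id": "S01", "condition": "control", "score": 42},
    {"subject_id": "S02", "condition": "treatment", "score": 57},
]

# Write to a comma-delimited text file (.csv), a non-proprietary format.
with open("survey_results.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["subject_id", "condition", "score"])
    writer.writeheader()
    writer.writerows(rows)

# Read it back. Note that every value comes back as a string, which is why
# documenting variable types and units in your metadata matters.
with open("survey_results.csv", newline="") as f:
    data = list(csv.DictReader(f))
```

The same data saved as .xlsx would need Excel (or a library that understands Excel's format) to open twenty years from now; the .csv needs only a text editor.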
20. What types & formats of data will you
be generating and/or using?
Observational | Experimental | Derived | Compiled |
Simulation | Reference/Canonical
Qualitative | Quantitative | Geospatial
Image/audio/video | Scripts/codes
Reflective Writing: 60 seconds
31. Data backup
"Keeping backups is probably your most important data management task."
- Everyone
1. Some data backup is better than none.
2. Automated backups are better than manual.
3. Your data are only as safe as your last backup.
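Rule 2 can be sketched as a small script that a scheduler (cron, Windows Task Scheduler) runs daily instead of you copying files by hand. This is a minimal illustration, not a full backup tool, and the paths in the usage comment are hypothetical:

```python
import shutil
from datetime import date
from pathlib import Path

def backup(source: Path, backup_root: Path) -> Path:
    """Copy a working directory into a dated folder under backup_root.

    Dating the destination keeps older backups around, so rule 3
    ("your data are only as safe as your last backup") degrades gracefully.
    """
    dest = backup_root / f"{date.today():%Y-%m-%d}_{source.name}"
    shutil.copytree(source, dest, dirs_exist_ok=True)
    return dest

# Example call with hypothetical paths; schedule it, don't run it manually:
# backup(Path("~/thesis_data").expanduser(), Path("/Volumes/backup_drive"))
```

Running a script like this from a scheduler satisfies rules 1 and 2 at once; testing that the copies are readable (covered later under checksums) is what makes rule 3 trustworthy.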
35. Storage: external storage devices
Advantages
Convenient, cheap & portable
Disadvantages
Longevity not guaranteed (e.g. Zip disks)
Errors writing to CD/DVD are common
Easily damaged, misplaced or lost (=security risk)
May not be big enough to hold all data; multiple drives needed
Bottom Line
Do NOT use to store master copies of data
Not recommended for long-term storage
36. Storage: networked drives
Advantages
Data in single place, backed up regularly
Replicated storage not vulnerable to loss due to hardware
failure
Secure storage minimizes risk of loss, theft, unauthorized use
Available as needed (assuming network avail.)
Disadvantages
Cost may be prohibitive; export control
Bottom Line
Highly recommended for master copies of data
Recommended for long-term storage (~5 years)
37. Storage: cloud storage
Advantages
Data in single place, backed up regularly
Replicated storage not vulnerable to loss due to hardware
failure
Secure storage minimizes risk of loss, theft, unauthorized use
Disadvantages
Cost may be prohibitive
Upload/download bottleneck & fees
Longevity?
Export control
Bottom Line
Possibly recommended for master copies of data
Not recommended for in-process data, large files
38. Storage: Google Drive for OSU
Advantages
All same advantages of network & cloud storage
File sharing & collaboration w/variable access levels
Unlimited storage (GD), 30 GB non-GD
Automatic version control on GD
Disadvantages
30 GB may not be enough
Upload/download bottleneck
Bottom Line
Possibly recommended for master copies of data
Possibly not recommended for in-process data, large files
45. File naming strategies
Order by date:
1955-04-12_notes_MassObs.docx
1955-04-12_questionnaire_MassObs.pdf
1963-12-15_notes_Gorer.docx
1963-12-15_questionnaire_Gorer.pdf
Order by subject:
Gorer_notes_1963-12-15.docx
Gorer_questionnaire_1963-12-15.pdf
MassObs_notes_1955-04-12.docx
MassObs_questionnaire_1955-04-12.pdf
Order by type:
Notes_Gorer_1963-12-15.docx
Notes_MassObs_1955-04-12.docx
Questionnaire_Gorer_1963-12-15.pdf
Questionnaire_MassObs_1955-04-12.pdf
Forced order with numbering:
01_MassObs_questionnaire_1955-04-12.pdf
02_MassObs_notes_1955-04-12.docx
03_Gorer_questionnaire_1963-12-15.pdf
04_Gorer_notes_1963-12-15.docx
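The "order by date" pattern works because ISO-style dates (YYYY-MM-DD) sort chronologically even under a plain alphabetical sort, which is how file browsers list names. A quick sketch using the example file names above:

```python
# File names taken from the "order by date" example above.
files = [
    "1963-12-15_notes_Gorer.docx",
    "1955-04-12_notes_MassObs.docx",
    "1963-12-15_questionnaire_Gorer.pdf",
    "1955-04-12_questionnaire_MassObs.pdf",
]

# Alphabetical order IS chronological order when the date leads the name.
chronological = sorted(files)
# The 1955 Mass Observation files now precede the 1963 Gorer files.
```

A date written as MMDDYYYY would not sort this way, which is why the conventions below insist on retaining YYYY-first order.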
50. Disambiguate yourself
Open Researcher & Contributor ID
John L. Campbell
Forest Research Ecologist
Oregon State University, Corvallis, OR
John L. Campbell
Forest Research Ecologist
Center for Research on Ecosystem Change
US Forest Service, Durham, NC
54. What is metadata?
• Data about data
• Structured information that describes, explains, locates, or otherwise makes it easier to retrieve, use, or manage an information resource.
NISO, Understanding Metadata
55. Metadata
"The metadata accompanying your data should be written for a user 20 years into the future -- what does that person need to know to use your data properly? Prepare the metadata for a user who is unfamiliar with your project, methods, or observations."
Oak Ridge National Laboratory Distributed Active
Archive Center for Biogeochemical Dynamics
(ORNL DAAC)
57. Darwin Core | biological diversity, taxonomy
Dublin Core | general
DDI (Data Documentation Initiative) | social &
behavioral sciences
DIF (Directory Interchange Format) | environmental
sciences
EML (Ecological Metadata Language) | ecology, biology
ISO 19115 | geographic data
Major metadata standards
01101101 01100101 01110100 01100001
01100100 01100001 01110100 01100001
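For a concrete taste of what a record in one of these standards looks like, here is a sketch that builds a minimal Dublin Core-style record with Python's standard library. The element names (title, creator, date, format, description) come from the Dublin Core element vocabulary; the values are invented for illustration:

```python
import xml.etree.ElementTree as ET

# Namespace for simple Dublin Core (dc:) elements.
ET.register_namespace("dc", "http://purl.org/dc/elements/1.1/")
DC = "{http://purl.org/dc/elements/1.1/}"

record = ET.Element("metadata")
for element, value in [
    ("title", "Example sensor dataset"),                  # hypothetical values
    ("creator", "Jane Q. Researcher"),
    ("date", "2013-10-15"),
    ("format", "text/csv"),
    ("description", "Hourly temperature readings, site A."),
]:
    ET.SubElement(record, DC + element).text = value

xml_text = ET.tostring(record, encoding="unicode")
```

Real records in Darwin Core, DDI, EML, or ISO 19115 are richer and have their own schemas, but the idea is the same: structured, machine-readable fields rather than a free-text README alone.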
58. Metadata examples
Santa Barbara Coastal Long Term Ecological
Research (LTER)
web link
Bureau of Labor Statistics, Consumer Price
Index, 1913-1992
web link
60. Data sharing & reuse
"...digitally formatted scientific data resulting from unclassified research supported wholly or in part by Federal funding should be stored and publicly accessible to search, retrieve, and analyze."
Office of Science and Technology Policy
The White House
65. What is a data management plan?
...Physical specimens, consisting of carbonate biothems and filtered water samples, are stored in Spilde or Northup's labs until completion of analyses. At that time they are returned to the federal agency for museum curation or are destroyed during analysis. Field notes will be scanned into PDF files, with a copy sent to Carlsbad Caverns National Park or other federal cave manager. SEM images will be saved in the TIF format. ...
66. Sections of a data management plan
1. Types of data
2. Data & metadata standards
3. Archiving & preservation
4. Sharing (access provisions)
5. Transition from collection to reuse
See resources on your handout + use DMPTool
68. Contact information
Amanda Whitmire | Data Management Specialist
amanda.whitmire@oregonstate.edu
Steve Van Tuyl | Data and Digital Repository Librarian
steve.vantuyl@oregonstate.edu
http://bit.ly/OSUData
75. How can OSU Libraries help?
76. Copyright – who owns the data?
Copyright applies to text, images, recordings, videos etc. in
a digital form, in the same way that it would to analog
versions of those works. The code of computer programs
(both the human readable source code and the machine
readable object code) is protected by copyright as a
literary work. Data compilations such as datasets and
databases can be protected by copyright in the literary
works category, which includes "tables" or "compilations". A
table or compilation, consisting of words, figures or
symbols (or a combination of these) is protected if it is a
literary work and has the required degree of originality.
80. Data Types: Observational
• Captured in situ
• Can't be recaptured, recreated or replaced
Examples: Sensor readings, sensory (human)
observations, survey results
81. Data Types: Experimental
• Data collected under controlled conditions, in situ or laboratory-based
• Should be reproducible, but can be expensive
Examples: gene sequences,
chromatograms, spectroscopy, microscopy,
cell counts
82. Data Types: Derived or Compiled
• Compiled: integration of data from several sources (could be many types)
• Derived: new variables created from existing data
• Reproducible, but can be very expensive
Examples: text and data mining, compiled
database, 3D models
83. Data Types: Simulation
• Results from using a model to study the behavior and performance of an actual or theoretical system
• Models and metadata, where the input can be more important than output data
Examples: climate models, economic models,
biogeochemical models
84. Data Types: Reference/Canonical
Static or organic collection [peer-reviewed]
datasets, most probably published and/or
curated.
Examples: gene sequence databanks, chemical
structures, census data, spatial data portals
The reason we're here is to explain what data management is and why we think it should be important to you, to give you some things to think about in managing your data, and to point you to some resources for doing so.
This is a first, introductory workshop... Invite questions and comments throughout.
[Note: Unless otherwise noted, all images in this presentation are from Stock Exchange (http://www.sxc.hu) and copyright is retained by their owners.]
As we talk with you today, be thinking about how what we are telling you works with your own research process.
Image credit: Science Magazine
Does not include "any of the following: preliminary analyses, drafts of scientific papers, plans for future research, peer reviews, or communications with colleagues. This 'recorded' material excludes physical objects (e.g., laboratory samples)." This narrow definition mostly takes a retrospective view of your dataset, in that it does not account for raw and intermediate data that may be critical to the research process but that don't become part of the "final" dataset.
Data could be:
Observational
Experimental
Simulated
Derived
..."Data may be viewed as the lowest level of abstraction from which information and knowledge are derived."
Data management is a verb â it involves intentional effort and activity.
The main goals of DM are preservation and reuse, for you and for others.
Covers all aspects of the data lifecycle from planning digital data capture methods, whittling down, ingestion to databases, providing for access and reuse, to transformation.
Requirements: funder, journal, etc.
Further your field: speed the pace of discovery by enhancing discovery, access and reuse
research & knowledge base develops more rapidly
Increase your visibility & impact: share your data, get credit with data citation
Increase dataset value: More likely to be re-visited and re-used in ways that may be different from their original intention
Integrity: data (documentation will preserve integrity of data over time) & you (achieve ethics compliance [HIPAA] and enhance your reputation via transparency)
Allows for validation of your research
Efficiency: avoid digital archeology, avoid duplication of effort
Allows easier exchange with other scientists
Preserve your lifelong body of work!
If we agree that data that are better managed are more easily, rapidly, and effectively shared, then we can also agree that better data management can enable and speed up discoveries.
85 cancer microarray clinical trial publications with respect to the availability of their data. The 48% of trials with publicly available microarray data received 85% of the aggregate citations. Publicly available data was significantly (p = 0.006) associated with a 69% increase in citations, independently of journal impact factor, date of publication, and author country of origin using linear regression.
Piwowar HA, Day RS, Fridsma DB (2007) Sharing Detailed Research Data Is Associated with Increased Citation Rate. PLoS ONE 2(3): e308. doi:10.1371/journal.pone.0000308
A [very] short bibliography on citation advantage:
Henneken, E.A. & Accomazzi, A. Linking to Data - Effect on Citation Rates in Astronomy. (2011).
Dorch, B. On the Citation Advantage of linking to data. (2012).
Piwowar, H.A. & Vision, T.J. Data reuse and the open data citation advantage. PeerJ 1, e175 (2013)
Sears, J.R. Data Sharing Effect on Article Citation Rate in Paleoceanography. AGU Fall Meet. Abstr. -1, 1628 (2011).
22 February 2013: The Office of Science and Technology Policy in the White House released a memorandum about expanding public access to the results of federally funded research. In addition to scholarly publications, federal agencies are making serious efforts to increase the sharing of research data.
All federal agencies with more than $100M in R&D expenditures are subject to this memo.
http://www.whitehouse.gov/sites/default/files/microsites/ostp/ostp_public_access_memo_2013.pdf
All of these federal agencies sponsor work being conducted at OSU. Red $$ are federal agencies funding the most work at OSU; amounts shown are funding to OSU in FY2011-2012. [source: OSU Research Office]
So, we've agreed on what "data" means, and what data management is. Let's talk about the large variety of data that you may obtain or create. The types of data that you use and/or create have a large impact on most of your data management activities.
Most research uses multiple data types & formats. My own dissertation included the generation or use of all of the data types that we've discussed (which might help to explain why it took me 7 years to earn a Ph.D.). http://hdl.handle.net/1957/9088
Image credit: Document by Piotrek Chuchla from The Noun Project
http://en.wikibooks.org/wiki/Statistics/Different_Types_of_Data/Quantitative_and_Qualitative_Data
More on qualitative data: "Although we may have categories, the categories may have a structure to them. When there is not a natural ordering of the categories, we call these nominal categories. Examples might be gender, race, religion, or sport.
When the categories may be ordered, these are called ordinal variables. Categorical variables that judge size (small, medium, large, etc.) are ordinal variables. Attitudes (strongly disagree, disagree, neutral, agree, strongly agree) are also ordinal variables; however, we may not know which value is the best or worst of these issues. Note that the distance between these categories is not something we can measure."
More on quantitative data: "However, not all numbers are continuous and measurable. For example, the social security number is a number, but not something that one can add or subtract."
See more on geospatial analysis here: http://en.wikipedia.org/wiki/Geospatial_analysis
If you don't know yet – make an educated guess.
Why is this important? The data types that you are creating and/or using have implications for data documentation, storage and backup, data citation & ethics of reuse, and preservation and sharing (basically, all aspects of data management).
Before I move on – questions about anything so far?
Common data management areas are:
Planning
Storage & backup
Organization & naming
Documentation & metadata
Legal & ethical considerations
Sharing & reuse
Archiving & preservation
[image source: http://upload.wikimedia.org/wikipedia/commons/2/22/Earth_Western_Hemisphere_transparent_background.png]
Things to consider:
How much data?
Resources needed
supplies and services
costs associated with them
Roles & responsibilities
Metadata
project level (documentation)
data level (description)
Data formats
Data storage & backup
working data vs. archive vs. shared/discoverable
Ethics & consent
Copyright (open data)
Sharing
How? With who?
Quote from: UE RDM Kit
So, letâs review a few real-world scenarios.
Hard drive failures happen ALL THE TIME, without notice. There's a one-liner that goes something like, "There are two kinds of people in this world: those who have had a hard drive fail, and those who haven't had a hard drive fail...yet." (ha ha)
Personal example: tell the story about Maria's hard drive failure, and the expensive (several $K) partial recovery of her dissertation data. She lost a month's worth of work.
image credit: http://www.johnytech.com/wp-content/uploads/2012/09/booterrors.jpg
Something else that happens ALL THE TIME: your computer gets stolen (Wiley, post-research cruise theft example). Or, you crash your bike and your computer goes flying (Angel).
image credit: http://www.flickr.com/photos/jeffreywarren/299422173/
Lesson: offsite, redundant storage is strongly encouraged.
From my post-doctoral work: Following the Chile earthquake in 2010 (magnitude 8.8), a tsunami devastated the coastal town of Dichato, and moved the Universidad de Concepcion's research vessel 1 km inland. The marine research station was left standing, but everything in it was destroyed (including our samples and data servers). Luckily, we had the data server replicated at OSU, so we only lost the physical samples that were in Chile. Think it can't happen to you? Yeah, I thought the same thing.
[photo credit: http://www.ocean-partners.org/attachments/473_Being%20towed.JPG]
Not from my personal experience: from the DataONE Data Stories project: https://notebooks.dataone.org/data-stories/the-best-laid-schemes-of-backups-and-redundancy/
Basic synopsis of this story:
A researcher had made regular backups of her data, 5 GB worth, to an external drive. Her computer hard drive failed, so she sent it to IT, had it repaired, and went to restore her data using what was on her external backup drive. Unfortunately, only 30% of the data were accessible from the drive. "The only thing they could determine was that the software used to back up the data was not compatible with the recovery process."
Lesson: "As it turned out, creating backups was not enough to completely guard against tragedy; the only way to make sure those backups could serve their purpose in the event of an emergency was to check them regularly and ensure the data they contained were recoverable."
From NECDMC Module 4: Backup allows you to restore your data in the event that it is lost.
A backup strategy is a plan for ensuring the accessibility of research data during the life of a project. What your strategy will be depends on the amount of data you are working with, the frequency that your data changes, and the system requirements for storing and rendering it.
From NECDMC Module 4: Would you need just the data files themselves, the software that created them, or customized scripts written for data analysis? Depending on your research project, you may want to perform full, differential, or incremental backups.
Before we can talk about backing up your data, we need to address where you can store it. We will talk about the advantages, disadvantages and longevity of each option. Then, we'll move on to backup.
Some content from: UE RDM Kit
USB flash drives, Compact Discs (CDs) and Digital Video Discs (DVDs)
If you choose to use CDs, DVDs and USB flash drives (for example, for working data or extra backup copies), you should:
Choose high quality products from reputable manufacturers.
Follow the instructions provided by the manufacturer for care and handling, including environmental conditions and labeling.
Regularly check the media to make sure that they are not failing.
Periodically 'refresh' the data (that is, copy to a new disk or new USB flash drive) every 2-5 years.
Ensure that any private or confidential data is password-protected and/or encrypted.
Some content from: UE RDM Kit
For example:
Fileservers managed by your research group, department or college
Fileservers managed by Information Services (IS)
image credit: http://www.flickr.com/photos/ketmonkey/3329372324/
Some content from: UE RDM Kit
For example: DropBox, SkyDrive, Box, Mozy
Image credit: Cloud by Benni from The Noun Project
Some content from: UE RDM Kit
See more info here: http://oregonstate.edu/helpdocs/software/google-apps-osu
Fully encrypted
Local storage options:
Personal computer
External storage
USB drive
Drobo box
Network server
PI or department server located near you
Remote storage options:
Network server
Run by Information Services, likely cloud-based
Cloud storage
Google Apps Suite (Drive, Docs, Forms, etc.)
up to 30GB of data
Variable access levels
Login portal: http://oregonstate.edu/main/online-services/google-apps-for-osu
Using a thoughtful, consistent file-naming strategy is the easiest way to make a big difference in organizing your files, and making them easy to find (not just by you, but by others in your group and your adviser).
[image credit: PhDComics.com]
Tweet me @AWhitTwit with your #overlyhonestmethods!
Be consistent
Have conventions for naming: (1) Directory structure
(2) Folder names
(3) File names
Always include the same information (e.g. date and time)
Retain the order of information (e.g. YYYYMMDD, not MMDDYYYY)
Be descriptive
Try to keep file and folder names under 32 characters
Within reason, include relevant information such as:
Unique identifier (i.e. project name or grant # in folder name)
Project or research data name
Conditions (Lab instrument, Solvent, Temperature, etc.)
Run of experiment (sequential)
Date (in file properties too)
Use application-specific codes in 3-letter file extension and lowercase: mov, tif, wrl
When using sequential numbering, make sure to use leading zeros to allow for multi-digit versions. For example, a sequence of 1-10 should be numbered 01-10; a sequence of 1-100 should be numbered 001-100.
No special characters: & , * % # ; ( ) ! @ $ ^ ~ ' { } [ ] ? < > -
Use only one period, immediately before the file extension (e.g. name_paper.doc,
NOT name.paper.doc OR name_paper..doc)
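The leading-zeros advice matters because computers sort file names as text, not as numbers. A quick sketch of the difference (the "run" file names are invented for illustration):

```python
# Without padding, "10" sorts before "2" because names compare as text,
# character by character.
unpadded = sorted(f"run{n}.csv" for n in [1, 2, 10])
# unpadded -> ['run1.csv', 'run10.csv', 'run2.csv']  (out of order)

# Zero-padding (the 02d format) restores the intended numeric order.
padded = sorted(f"run{n:02d}.csv" for n in [1, 2, 10])
# padded -> ['run01.csv', 'run02.csv', 'run10.csv']
```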
Show examples of versions
Can go back when you make mistakes
when changes are made
Share work with other people
Both work on things at the same time and merge back together
Akin to a game of telephone – version control can let you see exactly when a change was made
See Git, TortoiseSVN, Mercurial, ...
Before I move on – questions about anything so far?
http://orcid.org – get an identifier for free!
Here's Amanda's ORCID page: http://orcid.org/0000-0003-2429-8879
Thanks to Jackie Wirz at OHSU for finding this perfect example!
PROJECT LEVEL:
Context of data collection
Data collection methods
Structure, organization of data files
Data sources used (see citing data)
Data validation, quality assurance
Transformations of data from the raw data through analysis
Information on confidentiality, access & use conditions
DATA LEVEL:
Variable names, and descriptions
Explanation of codes and classification schemes used
Algorithms used to transform data
File format and software (including version) used
[add content about licensing]
Intellectual Property
Office for Commercialization & Corporate Development (OCCD)
Copyright
Licensing
Charging for data?
Data attribution & citation
HUMAN SUBJECTS?
Informed consent & anonymization required prior to publishing
Resources @ OSU:
Office of Research Integrity, Institutional Review Board (IRB)
Responsible Conduct of Research (RCR) Program
"Learn it. Know it. Live it." - Brad Hamilton, from Fast Times at Ridgemont High.
How to share data:
Repository: discipline-specific, journal, etc.
Disciplinary Repositories
DataBib: http://databib.lib.purdue.edu/
Open Access Directory: http://oad.simmons.edu/oadwiki/Data_repositories
OSU's digital repository, ScholarsArchive@OSU
Personal web site
By request
Best practices:
Well documented, w/standard metadata schema
Non-proprietary file formats
How to share data:
Repository: discipline-specific, journal, etc.
OSU's digital repository, ScholarsArchive@OSU
http://ir.library.oregonstate.edu
Personal web site
By request
Best practices:
Well documented, w/standard metadata schema
Non-proprietary file formats
There are storage options for various disciplines, including discipline-related repositories. Some scientific journals also have their own data repositories. OSU has an institutional repository, Scholars Archive, that serves as a repository for university scholarship. Currently we store datasets on a case-by-case basis. If you don't know where to store your data, you may consult with the library or talk to people in your department and/or your field to see if there is a logical repository for your work.
What is a data paper? http://guides.library.oregonstate.edu/data-management-data-papers-journals
"Data papers facilitate the sharing of data in a standardized framework that provides value, impact, and recognition for authors. Data papers also provide much more thorough context and description than datasets that are simply deposited to a repository (which may have very minimal metadata requirements)."
"Data papers thoroughly describe datasets, and do not usually include any interpretation or discussion (an exception may be discussion of different methods to collect the data, e.g.)."
Your data is being managed whether you realize it or not, but in order to preserve data and make the most of it, researchers should actively plan their data management process. A data management plan is a document that provides a clear roadmap of the process of collecting, storing, sharing and preserving your data. Starting early in the data management process not only makes things easier for the researcher; many repositories also have requirements about how data must be formatted, and it may not be possible to meet those requirements retroactively.
[sample DMP text from: http://libguides.unm.edu/loader.php?type=d&id=194916]
These are the five sections for the generic NSF data management plan, but they cover what is generally discussed in any good data management plan.
Things to consider:
How much data?
Resources needed
supplies and services
costs associated with them
Roles & responsibilities
Metadata
project level (documentation)
data level (description)
Data formats
Data storage & backup
working data vs. archive vs. shared/discoverable
Ethics & consent
Sharing
When?
How?
With who?
Create ready to use plans
Meet agency requirements
Step-by-step guidance
OSU-specific guidance
Please feel free to contact Amanda Whitmire or Maura Valentino with any data management questions you might have or questions about depositing your research to ScholarsArchive@OSU digital repository, especially research data associated with published research articles and/or your thesis or dissertation.
As humans, we aren't designed to interpret columns of numbers, or to be able to extract meaning from a table with a passing glance. As researchers and practitioners, we need to transform our data into something more meaningful.
We need to be the ambassadors of our data, to represent it in a way that speaks to why you made the effort to collect it and what you expect to learn from it. Sadly, the large majority of researchers never receive any training in how to share and communicate about their data, even to their peers.
Get the whole lifecycle graphic here: "Data management throughout the research lifecycle. Amanda Whitmire. figshare. http://dx.doi.org/10.6084/m9.figshare.774628"
OSU Research Data Services can help with data management during most of the research lifecycle phases. Examples include:
Data management plan consultation
Finding & accessing secondary data
Data description & metadata
Ways to share your data within your project team
How to prepare data for archiving and where to put it
How to license your datasets
How to cite your datasets (& help others cite it properly)
Could be:
Sensor output
Audio visual recording
Lab/field notes
GPS coordinates
Survey results
Interview notes
Characteristics of raw data:
Raw units (volts, e.g.)
No [extra] metadata
No context
How to manage raw data:
Multiple copies in multiple locations
DON'T modify them
Characteristics of intermediate data
Data that are in-process
Anything between "raw" and "final"
Multiple stages, multiple formats, using various software
May be shareable w/colleagues, but not likely w/wider audience
Managing intermediate data
Multiple copies (of most recent versions) in multiple locations
Use sensible folder hierarchies & file naming conventions
Add metadata immediately & then additionally as you make changes
Characteristics of final data:
May be a subset of raw data, or a much larger dataset
Shareable with anyone
Non-proprietary formats are ideal
Metadata are thorough & complete
Managing final data:
Multiple copies in multiple locations
Refresh data every 2-5 years
If you use a proprietary format, migrate data to new version regularly
Share via an open repository
[image copyright: http://www.flickr.com/photos/usnavy/]
Image credit: Stock X.chnge
Image credit: OSU media services
For example, in climate science, HUGE amounts of data are compiled: http://www.realclimate.org/index.php/data-sources/
In the example image, hydrographic data from all over the world are compiled into a database, and new variables can be derived from them. For example, depth profiles of seawater density from measurements of pressure, temperature, and salinity (via conductivity measurements).
Image credit; http://cdiac.ornl.gov/oceans/RepeatSections/CDIAC_CLIVAR_global_map.jpg
Image credit: http://media.oregonlive.com/pacific-northwest-news/photo/gs61tsun103jpg-730687c0e8a35305.jpg
This model of tsunami driftage is based on ocean topography and surface winds. See: http://iprc.soest.hawaii.edu/research/research.php, research by N. Maximenko.
Anticipate:
Volume/File type(s)
Raw data vs. processed/analyzed data
Versioning
File Naming Conventions
Privacy Concerns
Storage practice
Backup plans; LOCKSS; checksums
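The checksums mentioned above are how you confirm that a backup is still a faithful copy, which is exactly the failure mode in the "only 30% recoverable" story earlier. A minimal sketch using Python's standard hashlib; the function names are ours, not from any particular backup tool:

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Return the SHA-256 checksum of a file, read in chunks to handle large files."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

def backup_is_intact(original: Path, backup: Path) -> bool:
    """True if the backup's checksum matches the original's.

    Record the checksum at backup time, then recompute and compare when you
    test the backup: matching values mean the copy is bit-for-bit intact.
    """
    return sha256_of(original) == sha256_of(backup)
```

Command-line equivalents (sha256sum, shasum) do the same job; the point is to test backups routinely rather than discovering corruption during an emergency restore.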
Storage needs will depend on the volume and type of data in question. A plan should anticipate the particular storage needs of your research. How much data do you expect to collect and how will storage infrastructure be funded?
How will files be arranged on disk? This may seem like a trivial issue, but it's important. If the person who has been responsible for maintaining files leaves, you need to be able to find things. You'll also want to group files meaningfully with their related metadata.
Data are, of course, subject to loss and corruption. (Bonus joke: there are two types of computer users... those who have suffered hard drive failure and those who haven't suffered hard drive failure...yet). Will there be multiple copies - on-site and off-site? Are there checks against file corruption?
Will you use specialized software or services to manage these aspects of the process?
You should know which data you intend to retain locally and for how long. Who will be responsible for this after the main project period is over?
Will you deposit in a preservation repository? Different repositories may have different preservation policies which you should be aware of.
Itâs also necessary to consider storage throughout the lifecycle. How will your data make it from the collection point, into your local storage, into backup, then into long term preservation? The systems involved may not be capable of interoperating without some planning at each stage.
Lesson: offsite, redundant storage is strongly encouraged.
1 on your local workstation
1 local/removable, such as external hard drive
1 on central server and/or
1 remote, such as on a cloud server*
*Depending on the type of data, as cloud servers are not always secure
Following the Chile earthquake in 2010 (magnitude 8.8), a tsunami devastated the coastal town of Dichato, and moved the Universidad de Concepcion's research vessel 1 km inland. The marine research station was left standing, but everything in it was destroyed (including our samples). Think it can't happen to you? Yeah, I thought the same thing.
[photo credit: http://www.ocean-partners.org/attachments/473_Being%20towed.JPG]
~back to Amanda~
NOTE: 15 October 2013: the software upgrade in this classroom to Office 2013 introduced a bug and subsequent lack of functionality for Colectica. Likewise, DataUP is currently throwing a bug for unknown reasons. I am in contact with both entities, but the bugs could not be fixed by today. Our apologies!
DataUp is "an open source tool helping researchers document, manage, and archive their tabular data. DataUp operates within the scientist's workflow and integrates with Microsoft® Excel." http://dataup.cdlib.org/ (EML metadata)
Colectica for Excel allows documenting of Variables, Code Lists, and Data Sets directly from within Microsoft Excel. Export your data documentation to an XML file in the DDI metadata format, the standard for data documentation. Open and edit it from Colectica Designer, Colectica Express, or other DDI applications. http://www.colectica.com/software/colecticaforexcel (DDI metadata)