Data Management Lab: Session 2 slides

Research Data Management
Spring 2014: Session 2
Practical strategies for better results
University Library
Center for Digital Scholarship

Review
Key points from last week?
What is still unclear?

DMP
Data map: complete the partially mapped
research question
OR
Start your own data map
Don’t forget to upload your DMP to Box.
Suggested file name: DMP_20140401

PROJECT & DATA DOCUMENTATION
MODULE 2

LEARNING OUTCOMES
• Outline planned project and data documentation in a data management plan.
• Identify documentation and metadata required to describe data.

What is Metadata
[Figure: loss of information about data over time. Vertical axis: data details; horizontal axis: time. Stages marked along the curve:]
• Time of data development
• Specific details about problems with individual items or specific dates are lost relatively rapidly
• General details about datasets are lost through time
• Accident or technology change may make data unusable
• Retirement or career change makes access to “mental storage” difficult or unlikely
• Loss of data developer leads to loss of remaining information
(From Michener et al., 1997)

Why do we document?
“Scientific publications have at least two goals:
(i) to announce a result and (ii) to convince
readers that the result is correct… papers in
experimental science should describe the results
and provide a clear enough protocol to allow
successful repetition and extension”
-Mesirov, 2010

Analysis and Workflows
• Reproducibility at core of scientific method
• Complex process = more difficult to reproduce
• Good documentation required for reproducibility
o Metadata: data about data
o Process metadata: data about process used to create, manipulate,
and analyze data
CC image by Richard Carter on Flickr
Provenance: where your data
came from and what has been
done to it
Crucial for replication/
reproducibility

Why do we document?
• Provide an accurate, reliable record of your work
– Including all the details you will not remember when
it’s time to write up the project
• Facilitate writing of high quality publications
• Necessary for reproducibility, a core principle of
scientific process
• Establish provenance
– Relevant to commercial application and patents
(legal), defending your publications (scientific),
responsible conduct of research (scientific)

Best Practices
Best Practices for Preparing Ecological Data Sets, ESA, August 2010
The 20-Year Rule
• The metadata accompanying a data set should be
written for a user 20 years into the future--what
does that investigator need to know to use the
data?
• Prepare the data and documentation for a user
who is unfamiliar with your project, methods, and
observations

Data Management Planning
Plan
Collect
Assure
Describe
Preserve
Discover
Integrate
Analyze

What?
• Everything that is crucial for others to
understand, interpret, evaluate, and build on
your work
• What do YOU think?
• Metadata should capture the who, what,
when, where, how, why of your data
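
One way to make the who/what/when/where/how/why checklist concrete is a small structured record with a completeness check. This is a minimal sketch, not a metadata standard: the field names, the `missing_fields` helper, and the example values are all hypothetical.

```python
# Hypothetical field names: the slide lists six questions, not a formal schema.
REQUIRED_FIELDS = ("who", "what", "when", "where", "how", "why")


def missing_fields(record):
    """Return the required fields that are absent or empty in a record."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


# Illustrative values only; they do not describe a real project.
example_record = {
    "who": "Doe, J. (hypothetical investigator)",
    "what": "Survey B raw responses",
    "when": "2011-01-03",
    "where": "Campus site A",
    "how": "Online questionnaire",
    "why": "",  # left blank on purpose to show the check
}
```

Calling `missing_fields(example_record)` reports the fields still to be documented, which can serve as a quick self-check before sharing a dataset.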

Think-Pair-Share
What do you think you need to document
about your project?
Share with your partner/group
Share with the class

What? How much?
Project-level
• Project history, aims, objectives and hypotheses
• Data collection methods: data collection protocol, sampling design,
instruments, hardware and software used, data scale and resolution,
temporal coverage and geographic coverage
• Dataset structure: data files, cases, relationships between files
• Data sources used (enough detail to find it again)
• Data validation, checking, proofing, cleaning and other quality assurance
procedures carried out
• Modifications made to data over time since their original creation and
identification of different versions of datasets
• Information on data confidentiality, access and use conditions

What? How much?
Data-level
• Names, labels and descriptions for variables, records and their values
• Units of measurement
• Explanation of codes and classification schemes used
• Codes of, and reasons for, missing values
• Derived data created after collection, with code, algorithm or command
file used to create them
• Weighting and grossing variables created
• Data listing with descriptions for cases, individuals or items studied
• Equipment, instruments, or other data collection tools used
• Field, lab, or interview conditions

What? How much?
• What went right
– So you can repeat/replicate it
• What went wrong
– So you can determine the cause (e.g., human error,
machine error, etc.) and prevent it from happening again

Some Effective Strategies
• Data
– Data models
– Data dictionaries
– Metadata
• Project (see Documentation Instructions for
examples)
– Procedures Manual
– Protocols
– Lab Notebooks
– Codebook
– Reference Libraries

Data Dictionary
A description of all study variables; for each variable:
• Variable name
• Role of the variable (analytical)
• Variable label
• Unit of measurement (if applicable)
• Type of variable
• Permissible values or range of values
• Definitions of redefined or derived variables
• Additional edits to be performed (logic & consistency)
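
The data dictionary fields above can be sketched as a small record type plus a range check. This is an illustration under assumptions: the `VariableEntry` class, its attribute names, and the `age_yrs` example are hypothetical and not part of any particular tool.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class VariableEntry:
    """One data dictionary row (attribute names are illustrative)."""
    name: str                                   # variable name
    label: str                                  # variable label
    role: str                                   # analytical role
    var_type: str                               # type of variable
    unit: Optional[str] = None                  # unit of measurement, if any
    permissible_range: Optional[Tuple[float, float]] = None
    derivation: Optional[str] = None            # definition if derived


def out_of_range(entry, values):
    """Return the values that fall outside the entry's permissible range."""
    if entry.permissible_range is None:
        return []
    low, high = entry.permissible_range
    return [v for v in values if not (low <= v <= high)]


# Hypothetical variable used only to demonstrate the check.
age = VariableEntry(
    name="age_yrs",
    label="Age at enrollment",
    role="covariate",
    var_type="integer",
    unit="years",
    permissible_range=(18, 99),
)
```

Here `out_of_range(age, [17, 18, 45, 120])` flags 17 and 120, the kind of logic and consistency edit the last bullet refers to.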

What is Metadata?
• A structured set of terms describing a defined world
– Standardized
– Structured
• Metadata can be created automatically or manually
• Ex: ClinicalTrials.gov

Why Use/Create Metadata?
• Metadata is critical for communicating context for data
• How is metadata used?
– To find things
– To describe things
– To merge things
• Metadata standards define a common set of terms and
structure to communicate information
– Enables consistency, shared definitions, shared language, and
shared structure for interoperability
• Different standards have been developed for different
purposes (social science data, clinical trials, ecology)

Think-Pair-Share
Transform narrative description to structured
metadata using the provided template.
Write the information corresponding to the field
on your index card.
Abstract at http://doi.org/10.1542/peds.2013-1488

Good Documentation Examples
• http://datadryad.org/resource/doi:10.5061/dryad.ph8s5
– Readme files
– File-level metadata (e.g., Keywords, Scientific Names, Spatial Coverage, Temporal Coverage)
• Selected ICPSR dataset 34792
– Codebook
– Scope of Study
– Citation & Metadata Exports

DMP
Sections to draft and/or refine:
• Metadata
–Documentation Checklist

References
1. HEB site. (2014). Reading Nutrition Labels. From http://www.heb.com/page/recipes-cooking/cooking-tips/reading-nutrition-labels
2. Mesirov, J. P. (2010). Computer science. Accessible reproducible research. Science, 327(5964), 415-416. From http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3878063/?tool=pubmed
3. Target Clear Rx bottles. From http://www.brandchannel.com/features_profile.asp?pr_id=248

ORGANIZING DATA & FILES
MODULE 2

LEARNING OUTCOMES
• Develop a consistent and coherent file organization and naming convention scheme for all project files.
• Select appropriate non-proprietary hardware and software formats for storing data.
• Create protected copies of files at crucial points in your study.
• Use versioning software or documentation for tracking changes to files over time.

File Organization & Naming
• Be Clear, Concise, Consistent, Correct, and
Conformant
• Consider what is necessary to find and access files next year and when the project is complete.
• Develop a scheme and use it.
• Track changes.

Organization: Filing v. Piling
• Filing (hierarchical)
– When organizing files, the top-level directory (folder) should include the project title, unique identifier, and date (year).
– The substructure should have a clear, documented
naming convention; for example, each run of an
experiment, each version of a dataset, and/or each
person in the group.
• Piling (tags)
– All files in one directory, rely on sorting and searching.

Naming Files
• Be Clear, Concise, Consistent, Correct, and
Conformant
• Make it meaningful
• Remember the purpose is to provide context

Elements of a File Name
• Project/grant name and/or number
• Date of creation/modification
• Name of creator/investigator: last name first
followed by (initials of) first name
• Research team/department associated with the data
• Content or subject descriptor
• Data collection method (instrument, site, etc.)
• Version number
• Project phase

Naming Strategies
• Date first
– 20110103_diss_surveyB_raw
– 20110118_diss_surveyB_raw
– 20110119_diss_inter_trans
– 20110204_diss_surveyB_quest-B
• Subject first
– diss_surveyB_raw_20110103
– diss_surveyB_raw_20110118
– diss_inter_trans_20110119
– diss_surveyB_quest_20110204
• Type first
– surveyB_raw_diss_20110103
– surveyB_raw_diss_20110118
– inter_trans_diss_20110119
– surveyB_quest_diss_20110204
• Numbered (Forced ordering)
– 01_diss_survey_raw_20110103
– 01_diss_survey_raw_20110118
– 02_diss_inter_trans_20110119
– 04_diss_survey_quest-B_20110204
Whitmire, 2014
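
The date-first, subject-first, and type-first strategies differ only in the order of the same elements, so a naming scheme can be generated rather than typed by hand. A minimal sketch, assuming a hypothetical `build_name` helper and underscore separators as in the examples above:

```python
from datetime import date


def build_name(project, content, status, created, order="date_first"):
    """Assemble a file name stem from its elements in the chosen order."""
    stamp = created.strftime("%Y%m%d")  # YYYYMMDD sorts chronologically
    orders = {
        "date_first": [stamp, project, content, status],
        "subject_first": [project, content, status, stamp],
        "type_first": [content, status, project, stamp],
    }
    if order not in orders:
        raise ValueError(f"unknown order: {order}")
    return "_".join(orders[order])
```

For example, `build_name("diss", "surveyB", "raw", date(2011, 1, 3))` reproduces the first date-first name on the slide, and passing `order="subject_first"` gives its subject-first counterpart.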

Technical Tips
• For sequential numbering, use leading zeros.
– For example, a sequence of 1-10 should be numbered 01-10; a sequence of 1-100 should be numbered 001-100 (e.g., 001, 010, 100).
• No special characters in file names
& , * % # ; ( ) ! @ $ ^ ~ ' { } [ ] ? < > -
• Use only one period, and only before the file extension (e.g., name_paper.doc, NOT name.paper.doc OR name_paper..doc)
Will your files still be unique and comprehensible if moved to another location?
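
These tips can be checked mechanically. The sketch below pads sequence numbers with leading zeros and tests a name against a strict pattern. The helper names are hypothetical, and the pattern (letters, digits, and underscores only, with one period before the extension) is an assumption that also rejects hyphens; loosen it if your convention allows them.

```python
import re


def padded(number, sequence_length):
    """Zero-pad a number to the width of the largest value in the sequence."""
    width = len(str(sequence_length))
    return str(number).zfill(width)


# Assumed rule: letters, digits, underscores, then exactly one period
# followed by an alphanumeric extension.
SAFE_NAME = re.compile(r"[A-Za-z0-9_]+\.[A-Za-z0-9]+")


def is_safe_name(filename):
    """True if the file name matches the strict pattern above."""
    return SAFE_NAME.fullmatch(filename) is not None
```

With these, `padded(1, 10)` gives `01` and `padded(10, 100)` gives `010`, while `is_safe_name` accepts `name_paper.doc` and rejects both `name.paper.doc` and `name_paper..doc`.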

Think-Pair-Share
• Develop a file naming scheme for your
project (enter it in your DMP).
• Share it with your partner.
• Share with class.

File Formats
• Choose formats that are more likely to be accessible
in the future (10-20 years)
– Non-proprietary
– Open, documented standard
– Commonly used
– Standardized (ASCII, Unicode)
• Also, if possible
– Unencrypted
– Uncompressed
• Ex: PDF/A (not .doc/x), ASCII (not .xls/x), MPEG-4,
TIFF or JPEG2000, XML or RDF (not RDBMS)
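
As an illustration of preferring plain text over spreadsheet formats, tabular data can be written as comma-separated text using only the Python standard library. This is a sketch; the helper names and the example columns are hypothetical.

```python
import csv
import io


def rows_to_csv_text(rows, fieldnames):
    """Serialize a list of dict rows to comma-separated plain text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def is_plain_ascii(text):
    """True if every character in the text is ASCII."""
    return text.isascii()


# Hypothetical two-column dataset.
example_text = rows_to_csv_text([{"id": 1, "score": 3.5}], ["id", "score"])
```

The resulting text can be saved with `open(path, "w", encoding="utf-8", newline="")`, which keeps the file readable by any text editor or statistics package.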

Master Files
• Provide snapshots of key phases in the data life cycle
– Raw
– Cleaned
– Phases of processing
• In combination with detailed documentation, these files make write-up easier and support reproducibility and reuse
• Demonstrate provenance (i.e., an audit trail)
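
A protected master copy can be sketched as: copy the file, mark the copy read-only, and record a checksum for the audit trail. The function names, the `master` folder, and the choice of SHA-256 are assumptions of this example; the `demo` function runs entirely in a temporary directory.

```python
import hashlib
import os
import shutil
import stat
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def make_master_copy(source, master_dir):
    """Copy a file into master_dir, make the copy read-only, return (path, digest)."""
    master_dir = Path(master_dir)
    master_dir.mkdir(parents=True, exist_ok=True)
    dest = master_dir / Path(source).name
    shutil.copy2(source, dest)        # copy contents and timestamps
    os.chmod(dest, stat.S_IREAD)      # owner read-only
    return dest, sha256_of(dest)


def demo():
    """Create a throwaway raw file and lock a master copy of it."""
    with tempfile.TemporaryDirectory() as tmp:
        raw = Path(tmp) / "20110103_diss_surveyB_raw.csv"
        raw.write_text("id,score\n1,3.5\n", encoding="utf-8")
        master, digest = make_master_copy(raw, Path(tmp) / "master")
        same_content = digest == sha256_of(raw)
        write_bit_cleared = not (master.stat().st_mode & stat.S_IWUSR)
        return same_content, write_bit_cleared
```

Recording the digest next to the file name in your documentation lets you confirm later that the master file has not changed.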

Version Control
• Manual – file names
– Sequential numbered system
– Dated
• Automatic – version control software
– Mercurial
– Subversion (e.g., with the TortoiseSVN client)
– Git (e.g., with repositories hosted on GitHub)
• Keep log files, supplement with documentation
(e.g., readme.txt, comments, etc.)
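
Manual versioning with sequentially numbered file names and a change log can be sketched as follows. The `_vNN` suffix convention, the helper names, and the tab-separated log format are assumptions for illustration only.

```python
import re
from datetime import date

# Assumed convention: a "_vNN" suffix directly before the file extension.
_VERSION = re.compile(r"_v(\d+)(?=\.[^.]+$)")


def next_version(filename):
    """Return the file name with its version number increased by one."""
    match = _VERSION.search(filename)
    if match is None:
        raise ValueError(f"no _vNN suffix found in {filename!r}")
    digits = match.group(1)
    bumped = str(int(digits) + 1).zfill(len(digits))  # keep leading zeros
    return filename[:match.start(1)] + bumped + filename[match.end(1):]


def log_line(filename, note, when):
    """Format one tab-separated change log entry: date, file, note."""
    return f"{when.isoformat()}\t{filename}\t{note}"
```

For example, `next_version("diss_surveyB_clean_v02.csv")` yields the `_v03` name, and each `log_line` result can be appended to a readme.txt kept alongside the data.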

DMP
Sections to work on:
• Format (revise)
–Are you choosing the best formats?
• Data organization (write)
–File & Folder structure
–File naming convention
–Master files/Data locks

References
1. DataONE Education Module: Data Management Planning. DataONE. From http://www.dataone.org/sites/all/documents/L03_DataManagementPlanning.pptx
2. DataONE Education Module: Data Citation. DataONE. From http://www.dataone.org/sites/all/documents/L09_DataCitation.pptx
3. McNeill, K. (2013). Research Data Management: File Organization. From http://libraries.mit.edu/guides/subjects/data-management/File%20Organization_JulyAP2013.pdf
4. MIT. (2014). Organizing your files. From http://libraries.mit.edu/guides/subjects/data-management/organizing.html
5. Savage, A. (n.d.). Mythbusters. From http://weknowmemes.com/2012/10/the-only-difference-between-screwing-around-and-science/
6. Whitmire, A. (2014). Research Data Management – Organizing Your Data. From http://guides.library.oregonstate.edu/grad521lectures

Wrapping up
What’s next?
Mid-point evaluation
