Research Data Management
Spring 2014: Session 2
Practical strategies for better results
University Library
Center for Digital Scholarship
Review
Key points from last week?
What is still unclear?
DMP
Data map: complete the partially mapped
research question
OR
Start your own data map
Don’t forget to upload your DMP to Box.
Suggested file name: DMP_20140401
PROJECT & DATA DOCUMENTATION
MODULE 2
LEARNING
OUTCOMES
• Outline planned
project and data
documentation in a
data management
plan.
• Identify
documentation and
metadata required
to describe data
Why do we document?
Figure (from Michener et al. 1997): data details (vertical axis) lost over time (horizontal axis)
• Time of data development
• Specific details about problems with individual items or
specific dates are lost relatively rapidly
• General details about datasets are lost through time
• Accident or technology change may make data unusable
• Retirement or career change makes access to “mental
storage” difficult or unlikely
• Loss of data developer leads to loss of remaining
information
Why do we document?
“Scientific publications have at least two goals:
(i) to announce a result and (ii) to convince
readers that the result is correct… papers in
experimental science should describe the results
and provide a clear enough protocol to allow
successful repetition and extension”
-Mesirov, 2010
Analysis and Workflows
• Reproducibility at core of scientific method
• Complex process = more difficult to reproduce
• Good documentation required for reproducibility
o Metadata: data about data
o Process metadata: data about process used to create, manipulate,
and analyze data
CC image by Richard Carter on Flickr
Provenance: where your data
came from and what has been
done to it
Crucial for replication/
reproducibility
Why do we document?
• Provide an accurate, reliable record of your work
– Including all the details you will not remember when
it’s time to write up the project
• Facilitate writing of high quality publications
• Necessary for reproducibility, a core principle of
scientific process
• Establish provenance
– Relevant to commercial application and patents
(legal), defending your publications (scientific),
responsible conduct of research (scientific)
Best Practices
Best Practices for Preparing Ecological Data Sets, ESA, August 2010
The 20-Year Rule
• The metadata accompanying a data set should be
written for a user 20 years into the future--what
does that investigator need to know to use the
data?
• Prepare the data and documentation for a user
who is unfamiliar with your project, methods, and
observations
Data Management Planning
Plan
Collect
Assure
Describe
Preserve
Discover
Integrate
Analyze
What?
• Everything that is crucial for others to
understand, interpret, evaluate, and build on
your work
• What do YOU think?
• Metadata should capture the who, what,
when, where, how, why of your data
Think-Pair-Share
What do you think you need to document
about your project?
Share with your partner/group
Share with the class
What? How much?
Project-level
• Project history, aims, objectives and hypotheses
• Data collection methods: data collection protocol, sampling design,
instruments, hardware and software used, data scale and resolution,
temporal coverage and geographic coverage
• Dataset structure: data files, cases, relationships between files
• Data sources used (enough detail to find it again)
• Data validation, checking, proofing, cleaning and other quality assurance
procedures carried out
• Modifications made to data over time since their original creation and
identification of different versions of datasets
• Information on data confidentiality, access and use conditions
What? How much?
Data-level
• Names, labels and descriptions for variables, records and their values
• Units of measurement
• Explanation of codes and classification schemes used
• Codes of, and reasons for, missing values
• Derived data created after collection, with code, algorithm or command
file used to create them
• Weighting and grossing variables created
• Data listing with descriptions for cases, individuals or items studied
• Equipment, instruments, or other data collection tools used
• Field, lab, or interview conditions
What? How much?
• What went right
– So you can repeat/replicate it
• What went wrong
– So you can determine the cause (e.g., human error,
machine error, etc.) and prevent it from happening again
Some Effective Strategies
• Data
– Data models
– Data dictionaries
– Metadata
• Project (see Documentation Instructions for
examples)
– Procedures Manual
– Protocols
– Lab Notebooks
– Codebook
– Reference Libraries
Data Models
Data Dictionary
A description of all study variables; for each variable:
• Variable name
• Role of the variable (analytical)
• Variable label
• Unit of measurement (if applicable)
• Type of variable
• Permissible values or range of values
• Definitions of redefined or derived variables
• Additional edits to be performed (logic & consistency)
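The list above can be sketched as a small data structure. This is a minimal illustration, not a standard format: the class `VariableEntry`, its field names, and the example variable `age_yrs` are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VariableEntry:
    """One row of a data dictionary (fields loosely follow the list above)."""
    name: str                          # variable name as it appears in the data file
    label: str                         # human-readable variable label
    role: str                          # analytical role, e.g. "predictor" (hypothetical value)
    var_type: str                      # e.g. "integer", "float", "string"
    unit: Optional[str] = None         # unit of measurement, if applicable
    min_value: Optional[float] = None  # lower bound of permissible values
    max_value: Optional[float] = None  # upper bound of permissible values

    def check(self, value: float) -> bool:
        """Return True if value falls inside the permissible range."""
        if self.min_value is not None and value < self.min_value:
            return False
        if self.max_value is not None and value > self.max_value:
            return False
        return True


# Hypothetical variable, for illustration only.
age = VariableEntry(name="age_yrs", label="Age at enrollment", role="predictor",
                    var_type="integer", unit="years", min_value=0, max_value=120)

print(age.check(34))   # True
print(age.check(150))  # False
```

Keeping the permissible range next to the definition means the same entry can document a variable and drive a simple consistency check.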
What is Metadata?
• A structured set of terms describing a defined world
– Standardized
– Structured
• Metadata can be created automatically or manually
• Ex: ClinicalTrials.gov
Why Use/Create Metadata?
• Metadata is critical for communicating context for data
• How is metadata used?
– To find things
– To describe things
– To merge things
• Metadata standards define a common set of terms and
structure to communicate information
– Enables consistency, shared definitions, shared language, and
shared structure for interoperability
• Different standards have been developed for different
purposes (social science data, clinical trials, ecology)
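As a rough sketch of what "structured" means here, the record below stores the who, what, when, where, how, and why of a dataset as named fields and checks that none are empty. The field names and values are hypothetical and do not follow any particular metadata standard; a real project would use the standard for its discipline.

```python
import json

# Assumed field list: the six questions metadata should answer.
REQUIRED_FIELDS = ["who", "what", "when", "where", "how", "why"]

# Hypothetical record, for illustration only.
record = {
    "who": "Lastname, F. (hypothetical investigator)",
    "what": "Survey B responses, raw export",
    "when": "2011-01-03",
    "where": "Hypothetical field site",
    "how": "Online questionnaire",
    "why": "Dissertation project",
}


def missing_fields(rec, required=REQUIRED_FIELDS):
    """Return the required fields that are absent or blank in rec."""
    return [f for f in required if not str(rec.get(f, "")).strip()]


print(missing_fields(record))                      # [] when every field is filled in
print(json.dumps(record, indent=2, sort_keys=True))  # plain-text, machine-readable form
```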
Everyday Examples
HEB site
Everyday Examples
Creating Metadata
Think-Pair-Share
Transform narrative description to structured
metadata using the provided template.
Write the information corresponding to the field
on your index card.
Abstract at http://doi.org/10.1542/peds.2013-1488
Good Documentation Examples
• http://datadryad.org/resource/doi:10.5061/dryad.ph8s5
– Readme files
– File-level metadata (e.g., Keywords, Scientific Names,
Spatial Coverage, Temporal Coverage)
• Selected ICPSR dataset 34792
– Codebook
– Scope of Study
– Citation & Metadata Exports
DMP
Sections to draft and/or refine:
• Metadata
–Documentation Checklist
References
1. HEB site. (2014). Reading Nutrition Labels. From
http://www.heb.com/page/recipes-cooking/cooking-tips/reading-nutrition-labels
2. Mesirov, J. P. (2010). Computer science. Accessible reproducible
research. Science, 327(5964), 415-416. From
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3878063/?tool=pubmed
3. Target Clear Rx bottles. From
http://www.brandchannel.com/features_profile.asp?pr_id=248
ORGANIZING DATA & FILES
MODULE 2
LEARNING
OUTCOMES
• Develop a consistent and
coherent file organization
and naming convention
scheme for all project
files.
• Select appropriate non-
proprietary hardware and
software formats for
storing data.
• Create protected copies
of files at crucial points in
your study
• Use versioning software
or documentation for
tracking changes to files
over time.
File Organization & Naming
• Be Clear, Concise, Consistent, Correct, and
Conformant
• Consider what is necessary to find and access
files next year and when the project is
complete.
• Develop a scheme and use it.
• Track changes.
Organization: Filing v. Piling
• Filing (hierarchical)
– When organizing files, the top-level folder (directory)
should include the project title, unique identifier, and
date (year).
– The substructure should have a clear, documented
naming convention; for example, each run of an
experiment, each version of a dataset, and/or each
person in the group.
• Piling (tags)
– All files in one directory, rely on sorting and searching.
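A minimal sketch of the filing approach, assuming a top-level folder named from project title, identifier, and year. The subfolder names (`data/raw`, `data/cleaned`, `docs`, `analysis`) and the identifier `P001` are hypothetical choices for illustration; the demo writes only to a temporary directory.

```python
import tempfile
from pathlib import Path

# Hypothetical substructure; adapt to your own documented convention.
SUBFOLDERS = ("data/raw", "data/cleaned", "docs", "analysis")


def make_project_tree(root: Path, title: str, project_id: str, year: int) -> Path:
    """Create <title>_<id>_<year>/ under root with a fixed substructure."""
    top = root / f"{title}_{project_id}_{year}"
    for sub in SUBFOLDERS:
        (top / sub).mkdir(parents=True, exist_ok=True)
    return top


with tempfile.TemporaryDirectory() as tmp:
    top = make_project_tree(Path(tmp), "diss", "P001", 2014)
    created = sorted(p.relative_to(top).as_posix()
                     for p in top.rglob("*") if p.is_dir())
    print(top.name)   # diss_P001_2014
    print(created)    # ['analysis', 'data', 'data/cleaned', 'data/raw', 'docs']
```

Creating the structure from a script means every new project starts with the same layout, which is easier to document in a DMP than an ad hoc one.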
File Names
Courtesy of PhD Comics
Naming Files
• Be Clear, Concise, Consistent, Correct, and
Conformant
• Make it meaningful
• Remember the purpose is to provide context
Elements of a File Name
• Project/grant name and/or number
• Date of creation/modification
• Name of creator/investigator: last name first
followed by (initials of) first name
• Research team/department associated with the data
• Content or subject descriptor
• Data collection method (instrument, site, etc.)
• Version number
• Project phase
Naming Strategies
• Date first
– 20110103_diss_surveyB_raw
– 20110118_diss_surveyB_raw
– 20110119_diss_inter_trans
– 20110204_diss_surveyB_quest-B
• Subject first
– diss_surveyB_raw_20110103
– diss_surveyB_raw_20110118
– diss_inter_trans_20110119
– diss_surveyB_quest_20110204
• Type first
– surveyB_raw_diss_20110103
– surveyB_raw_diss_20110118
– inter_trans_diss_20110119
– surveyB_quest_diss_20110204
• Numbered (Forced ordering)
– 01_diss_survey_raw_20110103
– 01_diss_survey_raw_20110118
– 02_diss_inter_trans_20110119
– 04_diss_survey_quest-B_20110204
Whitmire, 2014
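The first three strategies above differ only in the order of the same name elements, which a short helper can make explicit. This is a sketch under assumptions: the function `build_name` and its `order` argument are invented for illustration and are not part of the original slides.

```python
from datetime import date


def build_name(project: str, content: str, status: str, when: date,
               order: str = "date") -> str:
    """Join name elements with underscores in date-, subject-, or type-first order."""
    stamp = when.strftime("%Y%m%d")  # YYYYMMDD sorts chronologically as text
    if order == "date":
        parts = [stamp, project, content, status]
    elif order == "subject":
        parts = [project, content, status, stamp]
    elif order == "type":
        parts = [content, status, project, stamp]
    else:
        raise ValueError(f"unknown order: {order!r}")
    return "_".join(parts)


d = date(2011, 1, 3)
print(build_name("diss", "surveyB", "raw", d, order="date"))     # 20110103_diss_surveyB_raw
print(build_name("diss", "surveyB", "raw", d, order="subject"))  # diss_surveyB_raw_20110103
print(build_name("diss", "surveyB", "raw", d, order="type"))     # surveyB_raw_diss_20110103
```

Generating names from one function is one way to keep a scheme consistent across a whole project.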
Technical Tips
• For sequential numbering, use leading zeros.
– For example, a sequence of 1-10 should be numbered 01-10; a
sequence of 1-100 should be numbered 001-100 (e.g., 001, 010, 100).
• No special characters in file names
& , * % # ; ( ) ! @ $ ^ ~ ' { } [ ] ? < > -
• Use only one period, and only before the file extension (e.g.
name_paper.doc NOT name.paper.doc OR
name_paper..doc)
Will your files still be unique and comprehensible
if moved to another location?
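The tips above can be checked mechanically. The sketch below pads sequence numbers with leading zeros and tests a file name against a conservative pattern. The helper names `padded` and `is_safe_name` are hypothetical, and the pattern is an assumption: it allows only letters, digits, and underscores before a single extension, which is stricter than some of the example names in this session.

```python
import re

# Assumed rule: letters, digits, underscores, then one period and an extension.
SAFE_NAME = re.compile(r"[A-Za-z0-9_]+\.[A-Za-z0-9]+")


def padded(n: int, total: int) -> str:
    """Zero-pad n to the width of the largest number in the sequence."""
    return str(n).zfill(len(str(total)))


def is_safe_name(filename: str) -> bool:
    """Return True if the whole file name matches the conservative pattern."""
    return SAFE_NAME.fullmatch(filename) is not None


print(padded(1, 10))                     # 01
print(padded(10, 100))                   # 010
print(is_safe_name("name_paper.doc"))    # True
print(is_safe_name("name.paper.doc"))    # False
print(is_safe_name("name_paper..doc"))   # False
```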
Think-Pair-Share
• Develop a file naming scheme for your
project (enter it in your DMP).
• Share it with your partner.
• Share with class.
File Formats
• Choose formats that are more likely to be accessible
in the future (10-20 years)
– Non-proprietary
– Open, documented standard
– Commonly used
– Standardized (ASCII, Unicode)
• Also, if possible
– Unencrypted
– Uncompressed
• Ex: PDF/A (not .doc/x), ASCII (not .xls/x), MPEG-4,
TIFF or JPEG2000, XML or RDF (not RDBMS)
Master Files
• Provide snapshots of key phases in the data
life cycle
– Raw
– Cleaned
– Phases of processing
• In combination with detailed documentation,
these files make write-up easier and support
reproducibility and reuse
• Demonstrate provenance (i.e., an audit trail)
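One possible way to create a protected master copy is to copy the file, remove write permission, and record a checksum so later changes can be detected. This is a sketch, not a prescribed procedure: the function `lock_master`, the `_MASTER` suffix, and the demo file are assumptions, and read-only permission bits behave differently across operating systems.

```python
import hashlib
import shutil
import stat
import tempfile
from pathlib import Path
from typing import Tuple


def lock_master(src: Path, master_dir: Path, phase: str) -> Tuple[Path, str]:
    """Copy src into master_dir, mark the copy read-only, return (path, SHA-256)."""
    master_dir.mkdir(parents=True, exist_ok=True)
    dest = master_dir / f"{src.stem}_{phase}_MASTER{src.suffix}"
    shutil.copy2(src, dest)  # copy2 also preserves timestamps
    dest.chmod(stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)  # read-only for all
    digest = hashlib.sha256(dest.read_bytes()).hexdigest()
    return dest, digest


with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "survey.csv"
    raw.write_bytes(b"id,score\n1,4\n")  # hypothetical raw data
    master, digest = lock_master(raw, Path(tmp) / "master", "raw")
    master_name = master.name
    write_bits = stat.S_IMODE(master.stat().st_mode) & stat.S_IWUSR
    print(master_name)   # survey_raw_MASTER.csv
    print(digest)        # record this value in your documentation
```

Storing the checksum alongside the documentation gives a simple audit trail: if the digest of the master file ever changes, the file has been altered.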
Version Control
• Manual – file names
– Sequential numbered system
– Dated
• Automatic – version control software
– Mercurial
– TortoiseSVN
– Git (e.g., hosted on GitHub)
• Keep log files, supplement with documentation
(e.g., readme.txt, comments, etc.)
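For the manual approach, a plain-text change log can supplement versioned file names. The sketch below appends one tab-separated line per change; the function `log_change`, the column layout, and the log file name are hypothetical choices for illustration.

```python
import tempfile
from datetime import date
from pathlib import Path


def log_change(log_path: Path, filename: str, version: int, note: str,
               when: date) -> str:
    """Append 'date<TAB>file<TAB>version<TAB>note' to the log and return the line."""
    line = f"{when.isoformat()}\t{filename}\tv{version:02d}\t{note}\n"
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(line)
    return line


with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "changelog.txt"
    first = log_change(log, "survey.csv", 1, "Raw export", date(2014, 3, 25))
    second = log_change(log, "survey.csv", 2, "Recoded missing values", date(2014, 4, 1))
    entries = log.read_text(encoding="utf-8").splitlines()
    print(len(entries))  # 2
```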
DMP
Sections to work on:
• Format (revise)
–Are you choosing the best formats?
• Data organization (write)
–File & Folder structure
–File naming convention
–Master files/Data locks
References
1. DataONE Education Module: Data Management Planning. DataONE. From
http://www.dataone.org/sites/all/documents/L03_DataManagementPlanning.pptx
2. DataONE Education Module: Data Citation. DataONE. From
http://www.dataone.org/sites/all/documents/L09_DataCitation.pptx
3. McNeill, K. (2013). Research Data Management: File Organization. From
http://libraries.mit.edu/guides/subjects/data-management/File%20Organization_JulyAP2013.pdf
4. MIT. (2014). Organizing your files. From
http://libraries.mit.edu/guides/subjects/data-management/organizing.html
5. Savage, A. (nd). Mythbusters. From
http://weknowmemes.com/2012/10/the-only-difference-between-screwing-around-and-science/
6. Whitmire, A. (2014). Research Data Management – Organizing Your Data.
From http://guides.library.oregonstate.edu/grad521lectures
Why do we document?
Wrapping up
What’s next?
Mid-point evaluation

More Related Content

What's hot

Data management plans
Data management plansData management plans
Data management plans
Brad Houston
 
Information Skills For Researchers V3
Information Skills For Researchers V3Information Skills For Researchers V3
Information Skills For Researchers V3
Jacqueline Thomas
 

What's hot (19)

Introduction to research data management
Introduction to research data managementIntroduction to research data management
Introduction to research data management
 
Introduction to research data management; Lecture 01 for GRAD521
Introduction to research data management; Lecture 01 for GRAD521Introduction to research data management; Lecture 01 for GRAD521
Introduction to research data management; Lecture 01 for GRAD521
 
An introduction to Statistical Analysis Plans
An introduction to Statistical Analysis PlansAn introduction to Statistical Analysis Plans
An introduction to Statistical Analysis Plans
 
Data management plans
Data management plansData management plans
Data management plans
 
RDAP 16 Poster: Interpreting Local Data Policies in Practice
RDAP 16 Poster: Interpreting Local Data Policies in PracticeRDAP 16 Poster: Interpreting Local Data Policies in Practice
RDAP 16 Poster: Interpreting Local Data Policies in Practice
 
Introduction to data management
Introduction to data managementIntroduction to data management
Introduction to data management
 
Basics of Research Data Management
Basics of Research Data ManagementBasics of Research Data Management
Basics of Research Data Management
 
Information Skills For Researchers V3
Information Skills For Researchers V3Information Skills For Researchers V3
Information Skills For Researchers V3
 
Research Data Management and Librarians
Research Data Management and LibrariansResearch Data Management and Librarians
Research Data Management and Librarians
 
Research Data Management: How will Northwestern address new sharing requireme...
Research Data Management: How will Northwestern address new sharing requireme...Research Data Management: How will Northwestern address new sharing requireme...
Research Data Management: How will Northwestern address new sharing requireme...
 
Computational Research day 2015
Computational Research day 2015Computational Research day 2015
Computational Research day 2015
 
Using Open Science to advance science - advancing open data
Using Open Science to advance science - advancing open data Using Open Science to advance science - advancing open data
Using Open Science to advance science - advancing open data
 
Practical Research Data Management: tools and approaches, pre- and post-award
Practical Research Data Management:  tools and approaches, pre- and post-awardPractical Research Data Management:  tools and approaches, pre- and post-award
Practical Research Data Management: tools and approaches, pre- and post-award
 
The Donders Repository
The Donders RepositoryThe Donders Repository
The Donders Repository
 
A basic course on Research data management, part 1: what and why
A basic course on Research data management, part 1: what and whyA basic course on Research data management, part 1: what and why
A basic course on Research data management, part 1: what and why
 
Introduction to Research Data Management - 2015-02-09 - MPLS Division, Univer...
Introduction to Research Data Management - 2015-02-09 - MPLS Division, Univer...Introduction to Research Data Management - 2015-02-09 - MPLS Division, Univer...
Introduction to Research Data Management - 2015-02-09 - MPLS Division, Univer...
 
Literature Reviews For The Health Sciences March 2010
Literature Reviews For The Health Sciences March 2010Literature Reviews For The Health Sciences March 2010
Literature Reviews For The Health Sciences March 2010
 
Pulverer-embo-source data-nfdp13
Pulverer-embo-source data-nfdp13Pulverer-embo-source data-nfdp13
Pulverer-embo-source data-nfdp13
 
Managing data throughout the research lifecycle
Managing data throughout the research lifecycleManaging data throughout the research lifecycle
Managing data throughout the research lifecycle
 

Similar to Data Management Lab: Session 2 slides

Data management plans (dmp) for nsf
Data management plans (dmp) for nsfData management plans (dmp) for nsf
Data management plans (dmp) for nsf
Brad Houston
 
Data Sets, Ensemble Cloud Computing, and the University Library: Getting the ...
Data Sets, Ensemble Cloud Computing, and the University Library:Getting the ...Data Sets, Ensemble Cloud Computing, and the University Library:Getting the ...
Data Sets, Ensemble Cloud Computing, and the University Library: Getting the ...
SEAD
 

Similar to Data Management Lab: Session 2 slides (20)

Documentation and Metdata - VA DM Bootcamp
Documentation and Metdata - VA DM BootcampDocumentation and Metdata - VA DM Bootcamp
Documentation and Metdata - VA DM Bootcamp
 
Effective research data management
Effective research data managementEffective research data management
Effective research data management
 
Elag workshop sessie 1 en 2 v10
Elag workshop sessie 1 en 2 v10Elag workshop sessie 1 en 2 v10
Elag workshop sessie 1 en 2 v10
 
Support Your Data, Kyoto University
Support Your Data, Kyoto UniversitySupport Your Data, Kyoto University
Support Your Data, Kyoto University
 
Research data life cycle
Research data life cycleResearch data life cycle
Research data life cycle
 
Qualitative and Quantitative Research Plans By Malik Muhammad Mehran
Qualitative and Quantitative Research Plans By Malik Muhammad MehranQualitative and Quantitative Research Plans By Malik Muhammad Mehran
Qualitative and Quantitative Research Plans By Malik Muhammad Mehran
 
Data management plans (dmp) for nsf
Data management plans (dmp) for nsfData management plans (dmp) for nsf
Data management plans (dmp) for nsf
 
Data Management Planning for researchers
Data Management Planning for researchersData Management Planning for researchers
Data Management Planning for researchers
 
Research-Data-Management-and-your-PhD
Research-Data-Management-and-your-PhDResearch-Data-Management-and-your-PhD
Research-Data-Management-and-your-PhD
 
LIS 653, Session 11: Data Management & Curation
LIS 653, Session 11: Data Management & CurationLIS 653, Session 11: Data Management & Curation
LIS 653, Session 11: Data Management & Curation
 
Responsible conduct of research: Data Management
Responsible conduct of research: Data ManagementResponsible conduct of research: Data Management
Responsible conduct of research: Data Management
 
Data management
Data management Data management
Data management
 
INCLUSION OF DATA ARCHIVES IN DATA MANAGEMENT PLAN
INCLUSION OF DATA ARCHIVES IN DATA MANAGEMENT PLANINCLUSION OF DATA ARCHIVES IN DATA MANAGEMENT PLAN
INCLUSION OF DATA ARCHIVES IN DATA MANAGEMENT PLAN
 
Research data management during and after your research ; an introduction / L...
Research data management during and after your research ; an introduction / L...Research data management during and after your research ; an introduction / L...
Research data management during and after your research ; an introduction / L...
 
ROER4D Open Data Initiative
ROER4D Open Data InitiativeROER4D Open Data Initiative
ROER4D Open Data Initiative
 
Critical infrastructure to promote data synthesis
Critical infrastructure to promote data synthesis Critical infrastructure to promote data synthesis
Critical infrastructure to promote data synthesis
 
Preparing your data for sharing and publishing
Preparing your data for sharing and publishingPreparing your data for sharing and publishing
Preparing your data for sharing and publishing
 
Data Sets, Ensemble Cloud Computing, and the University Library: Getting the ...
Data Sets, Ensemble Cloud Computing, and the University Library:Getting the ...Data Sets, Ensemble Cloud Computing, and the University Library:Getting the ...
Data Sets, Ensemble Cloud Computing, and the University Library: Getting the ...
 
Data Management for librarians
Data Management for librariansData Management for librarians
Data Management for librarians
 
Research data management workshop april12 2016
Research data management workshop april12 2016 Research data management workshop april12 2016
Research data management workshop april12 2016
 

More from IUPUI

Building the Future of Research Together
Building the Future of Research TogetherBuilding the Future of Research Together
Building the Future of Research Together
IUPUI
 

More from IUPUI (20)

Altmetrics 101 - Altmetrics in Libraries
Altmetrics 101 - Altmetrics in LibrariesAltmetrics 101 - Altmetrics in Libraries
Altmetrics 101 - Altmetrics in Libraries
 
Gather evidence to demonstrate the impact of your research
Gather evidence to demonstrate the impact of your researchGather evidence to demonstrate the impact of your research
Gather evidence to demonstrate the impact of your research
 
Managing data responsibly to enable research interity
Managing data responsibly to enable research interityManaging data responsibly to enable research interity
Managing data responsibly to enable research interity
 
Case studies for open science
Case studies for open scienceCase studies for open science
Case studies for open science
 
Midwest Medical Library Association 2015 Big Data Panel
Midwest Medical Library Association 2015 Big Data PanelMidwest Medical Library Association 2015 Big Data Panel
Midwest Medical Library Association 2015 Big Data Panel
 
Gathering Evidence to Demonstrate Impact
Gathering Evidence to Demonstrate ImpactGathering Evidence to Demonstrate Impact
Gathering Evidence to Demonstrate Impact
 
Citation & altmetrics - a comparison
Citation & altmetrics - a comparisonCitation & altmetrics - a comparison
Citation & altmetrics - a comparison
 
Altmetrics for Team Science
Altmetrics for Team ScienceAltmetrics for Team Science
Altmetrics for Team Science
 
Ensuring data quality
Ensuring data qualityEnsuring data quality
Ensuring data quality
 
Preventing data loss
Preventing data lossPreventing data loss
Preventing data loss
 
Practical Data Management Plans
Practical Data Management PlansPractical Data Management Plans
Practical Data Management Plans
 
Teaching data management in a lab environment (IASSIST 2014)
Teaching data management in a lab environment (IASSIST 2014)Teaching data management in a lab environment (IASSIST 2014)
Teaching data management in a lab environment (IASSIST 2014)
 
Building the Future of Research Together
Building the Future of Research TogetherBuilding the Future of Research Together
Building the Future of Research Together
 
NIH Data Sharing Plan Workshop - Handout
NIH Data Sharing Plan Workshop - HandoutNIH Data Sharing Plan Workshop - Handout
NIH Data Sharing Plan Workshop - Handout
 
NIH Data Sharing Plan Workshop - Slides
NIH Data Sharing Plan Workshop - SlidesNIH Data Sharing Plan Workshop - Slides
NIH Data Sharing Plan Workshop - Slides
 
Data Management Lab: Session 4 Slides
Data Management Lab: Session 4 SlidesData Management Lab: Session 4 Slides
Data Management Lab: Session 4 Slides
 
Data Management Lab: Session 4 Review Outline
Data Management Lab: Session 4 Review OutlineData Management Lab: Session 4 Review Outline
Data Management Lab: Session 4 Review Outline
 
Data Management Lab: Session 3 Slides
Data Management Lab: Session 3 SlidesData Management Lab: Session 3 Slides
Data Management Lab: Session 3 Slides
 
Data Management Lab: Session 3 Data Review Checklist
Data Management Lab: Session 3 Data Review ChecklistData Management Lab: Session 3 Data Review Checklist
Data Management Lab: Session 3 Data Review Checklist
 
Data Management Lab: Session 3 Data Entry Best Practices
Data Management Lab: Session 3 Data Entry Best PracticesData Management Lab: Session 3 Data Entry Best Practices
Data Management Lab: Session 3 Data Entry Best Practices
 

Recently uploaded

1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
QucHHunhnh
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
heathfieldcps1
 

Recently uploaded (20)

SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptxSKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
 
Unit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxUnit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptx
 
Dyslexia AI Workshop for Slideshare.pptx
Dyslexia AI Workshop for Slideshare.pptxDyslexia AI Workshop for Slideshare.pptx
Dyslexia AI Workshop for Slideshare.pptx
 
Spatium Project Simulation student brief
Spatium Project Simulation student briefSpatium Project Simulation student brief
Spatium Project Simulation student brief
 
PROCESS RECORDING FORMAT.docx
PROCESS      RECORDING        FORMAT.docxPROCESS      RECORDING        FORMAT.docx
PROCESS RECORDING FORMAT.docx
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
Sociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning ExhibitSociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning Exhibit
 
Python Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxPython Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docx
 
Third Battle of Panipat detailed notes.pptx
Third Battle of Panipat detailed notes.pptxThird Battle of Panipat detailed notes.pptx
Third Battle of Panipat detailed notes.pptx
 
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptx
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
 
Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...
 
General Principles of Intellectual Property: Concepts of Intellectual Proper...
General Principles of Intellectual Property: Concepts of Intellectual  Proper...General Principles of Intellectual Property: Concepts of Intellectual  Proper...
General Principles of Intellectual Property: Concepts of Intellectual Proper...
 
How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.
 
SOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning PresentationSOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning Presentation
 
Unit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxUnit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptx
 
Food safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdfFood safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdf
 

Data Management Lab: Session 2 slides

  • 1. Research Data Management Spring 2014: Session 2 Practical strategies for better results University Library Center for Digital Scholarship
  • 2. Review Key points from last week? What is still unclear?
  • 3. DMP Data map: complete the partially mapped research question OR Start your own data map Don’t forget to upload your DMP to Box. Suggested file name: DMP_20140401
  • 4. PROJECT & DATA DOCUMENTATION MODULE 2
  • 5. LEARNING OUTCOMES • Outline planned project and data documentation in a data management plan. • Identify documentation and metadata required to describe data
  • 6. Why do we document?
  • 7. What is Metadata DATADETAILS Time of data development Specific details about problems with individual items or specific dates are lost relatively rapidly General details about datasets are lost through time Accident or technology change may make data unusable Retirement or career change makes access to “mental storage” difficult or unlikely Loss of data developer leads to loss of remaining information TIME (From Michener et al 1997)
  • 8. Why do we document? “Scientific publications have at least two goals: (i) to announce a result and (ii) to convince readers that the result is correct… papers in experimental science should describe the results and provide a clear enough protocol to allow successful repetition and extension” -Mesirov, 2010
  • 9. Analysis and Workflows • Reproducibility at core of scientific method • Complex process = more difficult to reproduce • Good documentation required for reproducibility o Metadata: data about data o Process metadata: data about process used to create, manipulate, and analyze data CCimagebyRichardCarteronFlickr Provenance: where your data came from and what has been done to it Crucial for replication/ reproducibility
  • 10. Why do we document? • Provide an accurate, reliable record of your work – Including all the details you will not remember when it’s time to write up the project • Facilitate writing of high quality publications • Necessary for reproducibility, a core principle of scientific process • Establish provenance – Relevant to commercial application and patents (legal), defending your publications (scientific), responsible conduct of research (scientific)
  • 11. Best Practices Best Practices for Preparing Ecological Data Sets, ESA, August 2010 The 20-Year Rule • The metadata accompanying a data set should be written for a user 20 years into the future--what does that investigator need to know to use the data? • Prepare the data and documentation for a user who is unfamiliar with your project, methods, and observations 11
  • 13. What? • Everything that is crucial for others to understand, interpret, evaluate, and build on your work • What do YOU think? • Metadata should capture the who, what, when, where, how, why of your data
  • 14. Think-Pair-Share What do you think you need to document about your project? Share with your partner/group Share with the class
  • 15. What? How much? Project-level • Project history, aims, objectives and hypotheses • Data collection methods: data collection protocol, sampling design, instruments, hardware and software used, data scale and resolution, temporal coverage and geographic coverage • Dataset structure of data files, cases, relationships between files • Data sources used (enough detail to find it again) • Data validation, checking, proofing, cleaning and other quality assurance procedures carried out • Modifications made to data over time since their original creation and identification of different versions of datasets • Information on data confidentiality, access and use conditions
  • 16. What? How much? Data-level • Names, labels and descriptions for variables, records and their values • Units of measurement • Explanation of codes and classification schemes used • Codes of, and reasons for, missing values • Derived data created after collection, with code, algorithm or command file used to create them • Weighting and grossing variables created • Data listing with descriptions for cases, individuals or items studied • Equipment, instruments, or other data collection tools used • Field, lab, or interview conditions
  • 17. What? How much? • What went right – So you can repeat/replicate it • What went wrong – So you can determine the cause (e.g., human error, machine error, etc.) and prevent it from happening again
  • 18. Some Effective Strategies • Data – Data models – Data dictionaries – Metadata • Project (see Documentation Instructions for examples) – Procedures Manual – Protocols – Lab Notebooks – Codebook – Reference Libraries
  • 20. Data Dictionary A description of all study variables; for each variable: • Variable name • Role of the variable (analytical) • Variable label • Unit of measurement (if applicable) • Type of variable • Permissible values or range of values • Definitions of redefined or derived variables • Additional edits to be performed (logic & consistency)
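A data dictionary entry with the fields listed on this slide could be sketched in Python roughly as follows. This is only an illustration: the class name, the field names, the `check_value` helper, and the two example variables (`age_years`, `sex`) are assumptions made for the sketch, not part of the original slides.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class VariableEntry:
    """One row of a data dictionary (field names are illustrative)."""
    name: str                        # variable name as it appears in the data file
    label: str                       # human-readable variable label
    role: str                        # analytical role, e.g. "outcome" or "covariate"
    var_type: str                    # type of variable, e.g. "integer" or "categorical"
    unit: Optional[str] = None       # unit of measurement, if applicable
    allowed_values: Optional[Sequence] = None            # permissible codes
    allowed_range: Optional[Tuple[float, float]] = None  # permissible range, inclusive

    def check_value(self, value) -> bool:
        """Return True if value is permissible under this entry."""
        if self.allowed_values is not None and value not in self.allowed_values:
            return False
        if self.allowed_range is not None:
            low, high = self.allowed_range
            if not (low <= value <= high):
                return False
        return True


# Hypothetical entries for an imaginary survey
age = VariableEntry("age_years", "Age at enrollment", "covariate", "integer",
                    unit="years", allowed_range=(18, 99))
sex = VariableEntry("sex", "Self-reported sex", "covariate", "categorical",
                    allowed_values=("F", "M", "X"))

print(age.check_value(30), age.check_value(17), sex.check_value("Q"))
```

Keeping the permissible values next to the variable definition means the same record can serve as documentation and as a simple consistency check.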
  • 22. What is Metadata? • A structured set of terms describing a defined world – Standardized – Structured • Metadata can be created automatically or manually • Ex: ClinicalTrials.gov
  • 23. Why Use/Create Metadata? • Metadata is critical for communicating context for data • How is metadata used? – To find things – To describe things – To merge things • Metadata standards define a common set of terms and structure to communicate information – Enables consistency, shared definitions, shared language, and shared structure for interoperability • Different standards have been developed for different purposes (social science data, clinical trials, ecology)
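As a rough illustration of structured metadata, the sketch below stores a record as a plain dictionary, checks it for empty fields, and serializes it to JSON. The six field names follow the who/what/when/where/how/why prompt from the earlier slide; they are a hypothetical minimal scheme, not a published metadata standard, and the record contents are invented.

```python
import json

# Hypothetical minimal field set (not a published metadata standard)
REQUIRED_FIELDS = ("who", "what", "when", "where", "how", "why")


def missing_fields(record: dict) -> list:
    """List required fields that are absent or empty in a metadata record."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]


# Invented example record
record = {
    "who": "J. Doe (hypothetical investigator)",
    "what": "Survey B raw responses",
    "when": "2011-01-03",
    "where": "Online questionnaire",
    "how": "Self-administered survey",
    "why": "",  # left blank on purpose to show the check
}

print(missing_fields(record))  # ['why']
print(json.dumps(record, indent=2, sort_keys=True))
```

A real project would normally adopt the standard used in its discipline rather than an ad hoc field list like this one.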
  • 27. Think-Pair-Share Transform narrative description to structured metadata using the provided template. Write the information corresponding to the field on your index card. Abstract at http://doi.org/10.1542/peds.2013-1488
  • 28. Good Documentation Examples • http://datadryad.org/resource/doi:10.5061/dryad.ph8s5 – Readme files – File-level metadata (e.g., Keywords, Scientific Names, Spatial Coverage, Temporal Coverage) • Selected ICPSR dataset 34792 – Codebook – Scope of Study – Citation & Metadata Exports
  • 29. DMP Sections to draft and/or refine: • Metadata –Documentation Checklist
  • 30. References 1. HEB site. (2014). Reading Nutrition Labels. From http://www.heb.com/page/recipes-cooking/cooking-tips/reading-nutrition-labels 2. Mesirov JP: Computer science. Accessible reproducible research. Science 2010, 327(5964): 415-416. From http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3878063/?tool=pubmed 3. Target Clear Rx bottles: http://www.brandchannel.com/features_profile.asp?pr_id=248
  • 31. ORGANIZING DATA & FILES MODULE 2
  • 32. LEARNING OUTCOMES • Develop a consistent and coherent file organization and naming convention scheme for all project files. • Select appropriate non-proprietary hardware and software formats for storing data. • Create protected copies of files at crucial points in your study. • Use versioning software or documentation for tracking changes to files over time.
  • 33. File Organization & Naming • Be Clear, Concise, Consistent, Correct, and Conformant • Consider what is necessary to find and access files in the next year and when the project is complete. • Develop a scheme and use it. • Track changes.
  • 34. Organization: Filing v. Piling • Filing (hierarchical) – When organizing files, the top-level folder (directory) should include the project title, a unique identifier, and the date (year). – The substructure should have a clear, documented naming convention; for example, one folder for each run of an experiment, each version of a dataset, and/or each person in the group. • Piling (tags) – All files in one directory; rely on sorting and searching.
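The filing approach could be set up with a few lines of Python, as in the sketch below. The subfolder names (`data_raw`, `data_clean`, `docs`, `analysis`), the project title `diss`, and the identifier `P001` are assumptions chosen for illustration; the slides do not prescribe a particular layout.

```python
import tempfile
from pathlib import Path


def make_project_tree(root: Path, title: str, project_id: str, year: int,
                      subfolders=("data_raw", "data_clean", "docs", "analysis")) -> Path:
    """Create <title>_<id>_<year>/ containing a fixed, documented set of subfolders."""
    top = root / f"{title}_{project_id}_{year}"
    for sub in subfolders:
        (top / sub).mkdir(parents=True, exist_ok=True)
    return top


# Build the tree in a throwaway directory so the sketch leaves nothing behind
with tempfile.TemporaryDirectory() as tmp:
    top = make_project_tree(Path(tmp), "diss", "P001", 2014)
    created = sorted(p.name for p in top.iterdir())
    print(top.name, created)
```

Creating the structure from a script, rather than by hand, is one way to keep the layout identical across projects and to document it at the same time.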
  • 36. Courtesy of PhD Comics
  • 38. Naming Files • Be Clear, Concise, Consistent, Correct, and Conformant • Make it meaningful • Remember the purpose is to provide context
  • 39. Elements of a File Name • Project/grant name and/or number • Date of creation/modification • Name of creator/investigator: last name first followed by (initials of) first name • Research team/department associated with the data • Content or subject descriptor • Data collection method (instrument, site, etc.) • Version number • Project phase
  • 40. Naming Strategies • Date first – 20110103_diss_surveyB_raw – 20110118_diss_surveyB_raw – 20110119_diss_inter_trans – 20110204_diss_surveyB_quest-B • Subject first – diss_surveyB_raw_20110103 – diss_surveyB_raw_20110118 – diss_inter_trans_20110119 – diss_surveyB_quest_20110204 • Type first – surveyB_raw_diss_20110103 – surveyB_raw_diss_20110118 – inter_trans_diss_20110119 – surveyB_quest_diss_20110204 • Numbered (forced ordering) – 01_diss_survey_raw_20110103 – 01_diss_survey_raw_20110118 – 02_diss_inter_trans_20110119 – 04_diss_survey_quest-B_20110204 (Whitmire, 2014)
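The first three strategies differ only in the order of the same name elements, which a small helper can make explicit. The function name `build_name` and its parameter names are hypothetical; the element values are taken from the examples on the slide.

```python
from datetime import date


def build_name(project: str, content: str, stage: str, created: date,
               order: str = "date_first") -> str:
    """Join the name elements with underscores in one of three fixed orders."""
    stamp = created.strftime("%Y%m%d")  # YYYYMMDD sorts chronologically as text
    orders = {
        "date_first": [stamp, project, content, stage],
        "subject_first": [project, content, stage, stamp],
        "type_first": [content, stage, project, stamp],
    }
    return "_".join(orders[order])


d = date(2011, 1, 3)
print(build_name("diss", "surveyB", "raw", d))                   # 20110103_diss_surveyB_raw
print(build_name("diss", "surveyB", "raw", d, "subject_first"))  # diss_surveyB_raw_20110103
print(build_name("diss", "surveyB", "raw", d, "type_first"))     # surveyB_raw_diss_20110103
```

Generating names from one function keeps the chosen order consistent, so a plain alphabetical sort groups files the way the scheme intends.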
  • 41. Technical Tips • For sequential numbering, use leading zeros. – For example, a sequence of 1-10 should be numbered 01-10; a sequence of 1-100 should be numbered 001-100. • No special characters in file names: & , * % # ; ( ) ! @ $ ^ ~ ' { } [ ] ? < > - • Use only one period, immediately before the file extension (e.g., name_paper.doc, NOT name.paper.doc OR name_paper..doc) • Will your files still be unique and comprehensible if moved to another location?
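These tips can be checked mechanically. The sketch below pads sequence numbers to the width of the largest number and tests a file name against a conservative pattern. The helper names `padded` and `is_safe` are hypothetical, and the pattern is one possible reading of the rules: it allows only letters, digits, and underscores (hyphens are excluded because the character list on this slide includes them), followed by a single period and an extension.

```python
import re


def padded(n: int, total: int) -> str:
    """Zero-pad n to the width needed for the largest number in the sequence."""
    return str(n).zfill(len(str(total)))


# Letters, digits and underscores, then exactly one period and an extension
SAFE_NAME = re.compile(r"[A-Za-z0-9_]+\.[A-Za-z0-9]+")


def is_safe(filename: str) -> bool:
    """Return True if the whole file name matches the conservative pattern."""
    return SAFE_NAME.fullmatch(filename) is not None


print(padded(1, 10), padded(7, 100), padded(100, 100))  # 01 007 100
print(is_safe("name_paper.doc"))    # True
print(is_safe("name.paper.doc"))    # False
print(is_safe("name_paper..doc"))   # False
```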
  • 42. Think-Pair-Share • Develop a file naming scheme for your project (enter it in your DMP). • Share it with your partner. • Share with class.
  • 43. File Formats • Choose formats that are more likely to be accessible in the future (10-20 years) – Non-proprietary – Open, documented standard – Commonly used – Standardized (ASCII, Unicode) • Also, if possible – Unencrypted – Uncompressed • Ex: PDF/A (not .doc/x), ASCII (not .xls/x), MPEG-4, TIFF or JPEG2000, XML or RDF (not RDBMS)
  • 44. Master Files • Provide snapshots of key phases in the data life cycle – Raw – Cleaned – Phases of processing • In combination with detailed documentation, these files make write-up easier and support reproducibility and reuse • Demonstrate provenance (i.e., an audit trail)
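One way to create a protected master copy is to copy the file into a dedicated folder, mark the copy read-only, and record a checksum so later changes can be detected. The sketch below assumes a hypothetical `lock_master` helper, a `masters` folder name, and an invented `surveyB.csv` file; it is not a procedure described in the slides.

```python
import hashlib
import shutil
import stat
import tempfile
from pathlib import Path


def lock_master(src: Path, master_dir: Path, phase: str) -> tuple:
    """Copy src into master_dir tagged with its phase, make the copy read-only,
    and return (path of the copy, SHA-256 checksum of the copy)."""
    master_dir.mkdir(parents=True, exist_ok=True)
    dest = master_dir / f"{src.stem}_{phase}{src.suffix}"
    shutil.copy2(src, dest)  # copy2 also preserves timestamps
    dest.chmod(stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)  # read-only for everyone
    digest = hashlib.sha256(dest.read_bytes()).hexdigest()
    return dest, digest


# Demonstrate in a throwaway directory with an invented data file
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "surveyB.csv"
    raw.write_text("id,score\n1,42\n")
    master, checksum = lock_master(raw, Path(tmp) / "masters", "raw")
    master_name = master.name
    same_content = checksum == hashlib.sha256(raw.read_bytes()).hexdigest()
    print(master_name, same_content)
    # Restore write permission so the temporary directory can be removed anywhere
    master.chmod(stat.S_IRUSR | stat.S_IWUSR)
```

Storing the checksum alongside the project documentation gives a simple audit trail: recomputing it later shows whether the master copy is still unchanged.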
  • 45. Version Control • Manual – file names – Sequential numbered system – Dated • Automatic – version control software – Mercurial – TortoiseSVN – Git (e.g., hosted on GitHub) • Keep log files, supplement with documentation (e.g., readme.txt, comments, etc.)
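For the manual approach, a sequential numbering system and a plain-text log can be handled with two small helpers, sketched below. The `_vNN` suffix convention, the tab-separated log format, and the function names are assumptions made for illustration; dedicated version control software does this bookkeeping automatically.

```python
import re


def next_version(filename: str) -> str:
    """Increment a trailing _vNN counter before the extension, keeping its zero padding."""
    match = re.fullmatch(r"(.+_v)(\d+)(\.[A-Za-z0-9]+)", filename)
    if match is None:
        raise ValueError(f"no _vNN version counter in {filename!r}")
    prefix, number, ext = match.groups()
    return f"{prefix}{int(number) + 1:0{len(number)}d}{ext}"


def log_line(filename: str, when: str, who: str, change: str) -> str:
    """Format one tab-separated entry for a plain-text change log."""
    return "\t".join([when, filename, who, change])


new_name = next_version("surveyB_clean_v01.csv")
print(new_name)  # surveyB_clean_v02.csv
print(log_line(new_name, "2011-01-18", "JD", "recoded missing values"))
```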
  • 46. DMP Sections to work on: • Format (revise) –Are you choosing the best formats? • Data organization (write) –File & Folder structure –File naming convention –Master files/Data locks
  • 47. References 1. DataONE Education Module: Data Management Planning. DataONE. From http://www.dataone.org/sites/all/documents/L03_DataManagementPlanning.pptx 2. DataONE Education Module: Data Citation. DataONE. From http://www.dataone.org/sites/all/documents/L09_DataCitation.pptx 3. McNeill, K. (2013). Research Data Management: File Organization. From http://libraries.mit.edu/guides/subjects/data-management/File%20Organization_JulyAP2013.pdf 4. MIT. (2014). Organizing your files. From http://libraries.mit.edu/guides/subjects/data-management/organizing.html 5. Savage, A. (nd). Mythbusters. From http://weknowmemes.com/2012/10/the-only-difference-between-screwing-around-and-science/ 6. Whitmire, A. (2014). Research Data Management – Organizing Your Data. From http://guides.library.oregonstate.edu/grad521lectures
  • 48. Why do we document?