Amanda L. Whitmire
Steven Van Tuyl
OSU Libraries
GOAL:
Achievable habits for implementing
data management best practices into
your workflow
What are
data?
Research data is:
“…the recorded factual material commonly accepted in the scientific community as necessary to validate research findings.”
U.S. Office of Management and Budget, Circular A-110
Research data is:
“Unlike other types of information, research data are collected, observed, or created, for the purposes of analysis to produce and validate original research results.”
University of Edinburgh MANTRA Research Data Management Training, ‘Research Data Explained’
Data management:
Actions that contribute to effective storage, preservation and reuse of data and documentation throughout the research lifecycle.
What data management is not:
Data/computational science
Database administration
A research method:
• what data to collect
• how to collect them
• how to design an experiment
Why data management?
Increase visibility & impact
Protects investment
Increases research efficiency
Preservation
Saves time
Funding agency requirements
Further your field
image credit: http://www.flickr.com/photos/karenchiang/5041048588/
Further your field
Increase visibility & impact
85 cancer microarray clinical trial publications
69% increase in citations for articles with publicly available data
Funder mandates
“…directed Federal agencies with more than $100M in R&D
expenditures to develop plans to make the published results of
federally funded research freely available to the public within one year
of publication and requiring researchers to better account for and
manage the digital data resulting from federally funded scientific
research.”
Which agencies are affected?
$54M
$35M
$23M
Aspects of data management
DMPs/Planning
Storage & backup
File organization & naming
Documentation & metadata
Legal/ethical considerations
Sharing & reuse
Archiving & preservation
Data types & formats
Data types
Observational | Can’t be recaptured, recreated or replaced; Examples: sensor
readings, sensory (human) observations, survey results
Experimental | Should be reproducible, but can be expensive; Examples:
gene sequences, chromatograms, spectroscopy, microscopy, cell counts
Derived or compiled | Reproducible, but can be very expensive;
Examples: text and data mining, compiled database, 3D models
Simulation | Models and metadata, where the input can be more important
than output data; Examples: climate models, economic models, biogeochemical
models
Reference/canonical | Static or organic collection [peer-reviewed] datasets,
most probably published and/or curated; Examples: gene sequence databanks,
chemical structures, census data, spatial data portals
Amanda’s dissertation
The spectral backscattering properties of marine particles
Observations | ship-based sampling & moored instruments
Simulation results | scattering & absorption of light
Experimental | optical properties of phytoplankton cultures
Derived variables | endless things
Compiled observations | global oceanic bio-optical observations [self + from peers]
Reference | global oceanic bio-optical observations [NASA]
Data types: another classification
Qualitative data “is a categorical measurement expressed not in
terms of numbers, but rather by means of a natural language
description. In statistics, it is often used interchangeably with
"categorical" data.” See also: nominal, ordinal
Source: Wikibooks: Statistics
Quantitative data “is a numerical measurement expressed not by
means of a natural language description, but rather in terms of numbers.
However, not all numbers are continuous and measurable.”
“My favorite color is blue-green.” vs. “My favorite color is 510 nm.”
More common data types
Geospatial data has a geographical or geospatial aspect. Spatial
location is critically tied to other variables.
Digital image, audio & video data
Documentation & scripts Sometimes, software code IS data;
likewise with documentation (laboratory notebooks, written observations,
etc.)
File Formats
“A file format is a standard way that information is encoded for
storage in a computer file. It specifies how bits are used to encode information
in a digital storage medium.” - Wikipedia
Data type:
Qualitative, tabular experimental data
Possible formats:
Excel spreadsheet (.xlsx)
Comma-delimited text (.csv)
Access database (.mdb/.accdb)
Google Spreadsheet
SPSS portable file (.por)
XML file
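As a small illustration of one format from the list above, the sketch below writes a few tabular records to comma-delimited text and reads them back, using only Python's standard `csv` module. The column names and values are invented for illustration and are not from the presentation.

```python
import csv
import io

# Hypothetical tabular records; field names and values are made up.
ROWS = [
    {"sample_id": "S01", "treatment": "control", "cell_count": 1200},
    {"sample_id": "S02", "treatment": "nutrient", "cell_count": 1850},
]


def rows_to_csv(rows):
    """Serialize a list of dicts to comma-delimited text with a header row."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def csv_to_rows(text):
    """Read comma-delimited text back into a list of dicts (values are strings)."""
    return list(csv.DictReader(io.StringIO(text)))


csv_text = rows_to_csv(ROWS)
```

Note that plain CSV carries no type information: numbers come back as strings, which is one reason the accompanying documentation matters.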
What types & formats of data will you
be generating and/or using?
Observational | Experimental | Derived | Compiled |
Simulation | Reference/Canonical
Qualitative | Quantitative | Geospatial
Image/audio/video | Scripts/codes
Reflective Writing: 60 seconds
The “big picture”
Make a plan
Where do you start?
Data storage & backup
“Your data are the life
blood of your research.
If you lose your data
recovery could be slow,
costly or even worse…
it could be impossible.”
Most common loss scenario:
drive failure
This happens a lot:
physical theft & unintentional damage
Cute, but not a valid security plan.
University of Southampton, School of Electronics and
Computer Science, Southampton, UK, 2005
It CAN happen to you
Real-world lesson:
Audit your backups…
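One way to audit a backup is to compare checksums of the working files against their backed-up counterparts. The following is a minimal sketch, assuming both the working data and the backup are reachable as ordinary directories; the function names `sha256_of` and `audit_backup` are hypothetical helpers, not part of any tool mentioned in the slides.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit_backup(source_dir, backup_dir):
    """Compare every file under source_dir with its copy in backup_dir.

    Returns a dict listing relative paths that are missing from the
    backup and relative paths whose contents differ.
    """
    source_dir, backup_dir = Path(source_dir), Path(backup_dir)
    report = {"missing": [], "mismatched": []}
    for src in sorted(p for p in source_dir.rglob("*") if p.is_file()):
        rel = src.relative_to(source_dir)
        dst = backup_dir / rel
        if not dst.is_file():
            report["missing"].append(str(rel))
        elif sha256_of(src) != sha256_of(dst):
            report["mismatched"].append(str(rel))
    return report
```

An empty report means every working file has an identical copy in the backup; it says nothing about files that exist only in the backup.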
Data backup
“Keeping backups is probably
your most important data
management task.”
-Everyone
1. Some data backup is better than none.
2. Automated backups are better than manual.
3. Your data are only as safe as your last backup.
Data backup
Best practice: 3 copies of datasets
Original (working)
External, local
External, remote
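The three-copy practice can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the "remote" location is assumed to be a mounted path (for example a network share), and `make_copies` is a hypothetical helper name. Scheduling the script so that it runs automatically (cron, Task Scheduler, and so on) is outside the sketch.

```python
import shutil
from pathlib import Path


def make_copies(working_file, local_backup_dir, remote_backup_dir):
    """Copy working_file into a local and a remote backup directory.

    Together with the working original this yields three copies.
    Returns the paths of the two backup copies.
    """
    working_file = Path(working_file)
    copies = []
    for target_dir in (Path(local_backup_dir), Path(remote_backup_dir)):
        target_dir.mkdir(parents=True, exist_ok=True)
        destination = target_dir / working_file.name
        # copy2 also tries to preserve file metadata such as timestamps
        shutil.copy2(working_file, destination)
        copies.append(destination)
    return copies
```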
Data storage options
1. Personal computers (PCs) & laptops
2. External storage devices
3. Networked Drives
4. Cloud servers
Storage: PC/laptop
Advantages
Convenient
Disadvantages
Drive failure common
Laptops: susceptible to theft & unintentional damage
Not replicated
Bottom Line
Do NOT use to store master copies of data
Not a long term storage solution
Back up important data & files regularly
Storage: external storage devices
Advantages
Convenient, cheap & portable
Disadvantages
Longevity not guaranteed (e.g. Zip disks)
Errors writing to CD/DVD are common
Easily damaged, misplaced or lost (=security risk)
May not be big enough to hold all data; multiple drives needed
Bottom Line
Do NOT use to store master copies of data
Not recommended for long-term storage
Storage: networked drives
Advantages
Data in single place, backed up regularly
Replicated storage not vulnerable to loss due to hardware
failure
Secure storage minimizes risk of loss, theft, unauthorized use
Available as needed (assuming network avail.)
Disadvantages
Cost may be prohibitive; export control
Bottom Line
Highly recommended for master copies of data
Recommended for long-term storage (~5 years)
Storage: cloud storage
Advantages
Data in single place, backed up regularly
Replicated storage not vulnerable to loss due to hardware
failure
Secure storage minimizes risk of loss, theft, unauthorized use
Disadvantages
Cost may be prohibitive
Upload/download bottleneck & fees
Longevity?
Export control
Bottom Line
Possibly recommended for master copies of data
Not recommended for in-process data, large files
Storage: Google Drive for OSU
Advantages
All same advantages of network & cloud storage
File sharing & collaboration w/variable access levels
Unlimited storage (GD), 30 GB non-GD
Automatic version control on GD
Disadvantages
30 GB may not be enough
Upload/download bottleneck
Bottom Line
Possibly recommended for master copies of data
Possibly not recommended for in-process data, large files
Data organization
AGU presentation
Class presentation
OS presentation
Presentations
Ocean
Sciences Class AGU
Data storage options
Local Remote
Computer
External storage
Network server
Network server
Cloud storage
Google Apps
Box, Dropbox,
etc.
File-naming conventions
File naming strategy?
#OverlyHonestMethods
File naming conventions
Project_instrument_location_YYYYMMDDhhmmss_extra.ext
Index/grant conditions Leading zero!
s/n, variable Retain
order
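The naming pattern above can be generated programmatically so that every file follows it. A minimal sketch, assuming the components are already known as strings; the example values (`BioOptics`, `sn042`, `station07`, `raw`) are hypothetical.

```python
from datetime import datetime


def build_filename(project, instrument, location, timestamp, extra, ext):
    """Assemble Project_instrument_location_YYYYMMDDhhmmss_extra.ext."""
    # strftime zero-pads month, day, hour, minute and second,
    # so names sort in time order when listed alphabetically
    stamp = timestamp.strftime("%Y%m%d%H%M%S")
    return f"{project}_{instrument}_{location}_{stamp}_{extra}.{ext}"


name = build_filename(
    "BioOptics", "sn042", "station07",
    datetime(2013, 4, 9, 8, 5, 3), "raw", "csv",
)
```

Keeping the components in a fixed order, as the slide notes, is what makes the names both sortable and machine-parseable.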
File naming strategies
Order by date:
1955-04-12_notes_MassObs.docx
1955-04-12_questionnaire_MassObs.pdf
1963-12-15_notes_Gorer.docx
1963-12-15_questionnaire_Gorer.pdf
Order by subject:
Gorer_notes_1963-12-15.docx
Gorer_questionnaire_1963-12-15.pdf
MassObs_notes_1955-04-12.docx
MassObs_questionnaire_1955-04-12.pdf
Order by type:
Notes_Gorer_1963-12-15.docx
Notes_MassObs_1955-04-12.docx
Questionnaire_Gorer_1963-12-15.pdf
Questionnaire_MassObs_1955-04-12.pdf
Forced order with numbering:
01_MassObs_questionnaire_1955-04-12.pdf
02_MassObs_notes_1955-04-12.docx
03_Gorer_questionnaire_1963-12-15.pdf
04_Gorer_notes_1963-12-15.docx
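A consistent scheme also means file names can be parsed and reordered by a script. The sketch below uses the date-first names from the examples above; the helper names `parse_date_first` and `subject_first` are hypothetical.

```python
from datetime import date

FILES = [
    "1963-12-15_notes_Gorer.docx",
    "1955-04-12_questionnaire_MassObs.pdf",
    "1963-12-15_questionnaire_Gorer.pdf",
    "1955-04-12_notes_MassObs.docx",
]


def parse_date_first(filename):
    """Split a 'YYYY-MM-DD_type_subject.ext' name into its parts."""
    stem, ext = filename.rsplit(".", 1)
    day, kind, subject = stem.split("_")
    return {
        "date": date.fromisoformat(day),
        "type": kind,
        "subject": subject,
        "ext": ext,
    }


def subject_first(filename):
    """Rewrite a date-first name into the 'order by subject' scheme."""
    p = parse_date_first(filename)
    return f"{p['subject']}_{p['type']}_{p['date'].isoformat()}.{p['ext']}"


# With YYYY-MM-DD at the front, a plain alphabetical sort is chronological.
by_name = sorted(FILES)
by_date = sorted(FILES, key=lambda f: (parse_date_first(f)["date"], f))
```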
Version control
Random suggestion
Disambiguate yourself
Open Researcher & Contributor ID
John L. Campbell
Forest Research Ecologist
Oregon State University, Corvallis, OR
John L. Campbell
Forest Research Ecologist
Center for Research on Ecosystem Change
US Forest Service, Durham, NC
Another Doppelgänger
Chris Langdon
Studies ocean acidification
OSU HMSC
Chris Langdon
Studies ocean acidification
University of Miami
Metadata
What is metadata?
• Data about data
• Structured information that describes,
explains, locates, or otherwise makes it
easier to retrieve, use, or manage an
information resource.
NISO, Understanding Metadata
Metadata
“The metadata accompanying your
data should be written for a user 20
years into the future -- what does
that person need to know to use your
data properly? Prepare the metadata
for a user who is unfamiliar with your
project, methods, or observations.”
Oak Ridge National Laboratory Distributed Active
Archive Center for Biogeochemical Dynamics
(ORNL DAAC)
Metadata in real life
You use it all the time…
Major metadata standards
Darwin Core | biological diversity, taxonomy
Dublin Core | general
DDI (Data Documentation Initiative) | social & behavioral sciences
DIF (Directory Interchange Format) | environmental sciences
EML (Ecological Metadata Language) | ecology, biology
ISO 19115 | geographic data
01101101 01100101 01110100 01100001
01100100 01100001 01110100 01100001
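To make "structured information" concrete, here is a minimal sketch of a Dublin Core record built with Python's standard library. The element names (`title`, `creator`, `type`, `format`) and the namespace URI come from the Dublin Core element set; the wrapper element `metadata` and the field values are assumptions chosen for illustration, not a record published by the presenters.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def dublin_core_record(fields):
    """Build a flat XML record with one dc:<element> per (name, value) pair."""
    root = ET.Element("metadata")  # wrapper name is an arbitrary choice
    for name, value in fields:
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return ET.tostring(root, encoding="unicode")


# Placeholder values for illustration only.
record = dublin_core_record([
    ("title", "The spectral backscattering properties of marine particles"),
    ("creator", "Whitmire, Amanda L."),
    ("type", "Dataset"),
    ("format", "text/csv"),
])
```

Discipline-specific standards such as EML or ISO 19115 are considerably richer, but follow the same idea of named, machine-readable fields.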
Metadata examples
Santa Barbara Coastal Long Term Ecological
Research (LTER)
web link
Bureau of Labor Statistics, Consumer Price
Index, 1913-1992
web link
Legal & ethical considerations
Data sharing & reuse
“...digitally formatted scientific data resulting
from unclassified research supported wholly
or in part by Federal funding should be
stored and publicly accessible to search,
retrieve, and analyze.”
Office of Science and Technology Policy
The White House
How to preserve & share data
Data journals
Data management
plans
What is a data management plan?
…Physical specimens consist of carbonate biothems
and filtered water samples are stored in Spilde or
Northup’s labs until completion of analyses. At this
time they are returned to the federal agency for
museum curation or destroyed during analysis. Field
notes will be scanned into pdf files, with a copy sent
to Carlsbad Caverns National Park or other federal
cave manager. SEM images will be saved in the tif
format. …
Sections of a data management plan
1. Types of data
2. Data & metadata standards
3. Archiving & preservation
4. Sharing (access provisions)
5. Transition from collection to reuse
See resources on your handout + use DMPTool
| Visit DMPTool
Your go-to
resource for data
management plans
Contact information
Amanda Whitmire | Data Management Specialist
amanda.whitmire@oregonstate.edu
Steve Van Tuyl | Data and Digital Repository Librarian
steve.vantuyl@oregonstate.edu
http://bit.ly/OSUData
end
Extra / Outdated Slides
Data does not speak for itself…
YOU speak for YOUR data
But first, you need to manage it
Data management in the lifecycle
How can OSU Libraries help?
Copyright – who owns the data?
Copyright applies to text, images, recordings, videos etc. in
a digital form, in the same way that it would to analog
versions of those works. The code of computer programs
(both the human readable source code and the machine
readable object code) is protected by copyright as a
literary work. Data compilations such as datasets and
databases can be protected by copyright in the literary
works category, which includes ‘tables’ or ‘compilations’. A
table or compilation, consisting of words, figures or
symbols (or a combination of these) is protected if it is a
literary work and has the required degree of originality.
Types, formats & stages of data
Raw data
Intermediate data
Final data
Data Types: Observational
• Captured in situ
• Can’t be recaptured, recreated or replaced
Examples: Sensor readings, sensory (human)
observations, survey results
Data Types: Experimental
• Data collected under controlled
conditions, in situ or laboratory-based
• Should be reproducible, but can be
expensive
Examples: gene sequences,
chromatograms, spectroscopy, microscopy,
cell counts
Data Types: Derived or Compiled
• Compiled: integration of data from
several sources (could be many types)
• Derived: new variables created from
existing data
• Reproducible, but can be very expensive
Examples: text and data mining, compiled
database, 3D models
Data Types: Simulation
• Results from using a model to study the
behavior and performance of an actual or
theoretical system
• Models and metadata, where the input can
be more important than output data
Examples: climate models, economic models,
biogeochemical models
Data Types: Reference/Canonical
Static or organic collection [peer-reviewed]
datasets, most probably published and/or
curated.
Examples: gene sequence databanks, chemical
structures, census data, spatial data portals
Data storage & preservation
About backups…
University of Southampton, School of Electronics and Computer Science
Southampton, UK, 2005
Do them.
Plan for unexpected events
Metadata demonstration
Colectica

Introduction to Data Management

  • 1. Amanda L. Whitmire Steven Van Tuyl OSU Libraries
  • 2. GOAL: Achievable habits for implementing data management best practices into your workflow
  • 4. Research data is: “…the recorded factual material commonly accepted in the scientific community as necessary to validate research findings.” U.S. Office of Management and Budget, Circular A-110
  • 5. Research data is: “Unlike other types of information, research data are collected, observed, or created, for the purposes of analysis to produce and validate original research results.” University of Edinburgh MANTRA Research Data Management Training, ‘Research Data Explained’
  • 6. Data management: Actions that contribute to effective storage, preservation and reuse of data and documentation throughout the research lifecycle.
  • 7. What data management is not: Data/computational science; Database administration; A research method: • what data to collect • how to collect them • how to design an experiment
  • 8. Why data management? Funding agency requirements; Further your field; Increase visibility & impact; Protects investment; Increases research efficiency; Preservation; Saves time. Image credit: http://www.flickr.com/photos/karenchiang/5041048588/
  • 10. Increase visibility & impact: 85 cancer microarray clinical trial publications; 69% increase in citations for articles w/publicly available data
  • 11. Funder mandates “…directed Federal agencies with more than $100M in R&D expenditures to develop plans to make the published results of federally funded research freely available to the public within one year of publication and requiring researchers to better account for and manage the digital data resulting from federally funded scientific research.”
  • 12. Which agencies are affected? $54M $35M $23M
  • 13. Aspects of data management: DMPs/Planning; Storage & backup; File organization & naming; Documentation & metadata; Legal/ethical considerations; Sharing & reuse; Archiving & preservation
  • 14. Data types & formats
  • 15. Data types
      Observational | Can’t be recaptured, recreated or replaced; Examples: sensor readings, sensory (human) observations, survey results
      Experimental | Should be reproducible, but can be expensive; Examples: gene sequences, chromatograms, spectroscopy, microscopy, cell counts
      Derived or compiled | Reproducible, but can be very expensive; Examples: text and data mining, compiled database, 3D models
      Simulation | Models and metadata, where the input can be more important than output data; Examples: climate models, economic models, biogeochemical models
      Reference/canonical | Static or organic collection [peer-reviewed] datasets, most probably published and/or curated; Examples: gene sequence databanks, chemical structures, census data, spatial data portals
  • 16. Amanda’s dissertation: The spectral backscattering properties of marine particles
      Observations | ship-based sampling & moored instruments
      Simulation results | scattering & absorption of light
      Experimental | optical properties of phytoplankton cultures
      Derived variables | endless things
      Compiled observations | global oceanic bio-optical observations [self + from peers]
      Reference | global oceanic bio-optical observations [NASA]
  • 17. Data types: another classification Qualitative data “is a categorical measurement expressed not in terms of numbers, but rather by means of a natural language description. In statistics, it is often used interchangeably with "categorical" data.” See also: nominal, ordinal Source: Wikibooks: Statistics Quantitative data “is a numerical measurement expressed not by means of a natural language description, but rather in terms of numbers. However, not all numbers are continuous and measurable.” “My favorite color is blue-green.” vs. “My favorite color is 510 nm.”
  • 18. More common data types Geospatial data has a geographical or geospatial aspect. Spatial location is critically tied to other variables. Digital image, audio & video data Documentation & scripts Sometimes, software code IS data; likewise with documentation (laboratory notebooks, written observations, etc.)
  • 19. File Formats “A file format is a standard way that information is encoded for storage in a computer file. It specifies how bits are used to encode information in a digital storage medium.” - Wikipedia
      Data type: Qualitative, tabular experimental data
      Possible formats: Excel spreadsheet (.xlsx), Comma-delimited text (.csv), Access database (.mdb/.accdb), Google Spreadsheet, SPSS portable file (.por), XML file
  • 20. What types & formats of data will you be generating and/or using? Observational | Experimental | Derived | Compiled | Simulation | Reference/Canonical Qualitative | Quantitative | Geospatial Image/audio/video | Scripts/codes Reflective Writing: 60 seconds
  • 23. Make a plan: Where do you start?
  • 24. Data storage & backup
  • 25. “Your data are the life blood of your research. If you lose your data recovery could be slow, costly or even worse… it could be impossible.”
  • 26. Most common loss scenario: drive failure
  • 27. This happens a lot: physical theft & unintentional damage Cute, but not a valid security plan.
  • 28. University of Southampton, School of Electronics and Computer Science, Southampton, UK, 2005
  • 31. Data backup “Keeping backups is probably your most important data management task.” -Everyone 1. Some data backup is better than none. 2. Automated backups are better than manual. 3. Your data are only as safe as your last backup.
  • 33. Data storage options 1. Personal computers (PCs) & laptops 2. External storage devices 3. Networked Drives 4. Cloud servers
  • 34. Storage: PC/laptop. Advantages: Convenient. Disadvantages: Drive failure common; Laptops: susceptible to theft & unintentional damage; Not replicated. Bottom Line: Do NOT use to store master copies of data; Not a long term storage solution; Back up important data & files regularly.
  • 35. Storage: external storage devices. Advantages: Convenient, cheap & portable. Disadvantages: Longevity not guaranteed (e.g. Zip disks); Errors writing to CD/DVD are common; Easily damaged, misplaced or lost (=security risk); May not be big enough to hold all data (multiple drives needed). Bottom Line: Do NOT use to store master copies of data; Not recommended for long-term storage.
  • 36. Storage: networked drives. Advantages: Data in single place, backed up regularly; Replicated storage not vulnerable to loss due to hardware failure; Secure storage minimizes risk of loss, theft, unauthorized use; Available as needed (assuming network avail.). Disadvantages: Cost may be prohibitive; Export control. Bottom Line: Highly recommended for master copies of data; Recommended for long-term storage (~5 years).
  • 37. Storage: cloud storage. Advantages: Data in single place, backed up regularly; Replicated storage not vulnerable to loss due to hardware failure; Secure storage minimizes risk of loss, theft, unauthorized use. Disadvantages: Cost may be prohibitive; Upload/download bottleneck & fees; Longevity?; Export control. Bottom Line: Possibly recommended for master copies of data; Not recommended for in-process data, large files.
  • 38. Storage: Google Drive for OSU. Advantages: All same advantages of network & cloud storage; File sharing & collaboration w/variable access levels; Unlimited storage (GD), 30 GB non-GD; Automatic version control on GD. Disadvantages: 30 GB may not be enough; Upload/download bottleneck. Bottom Line: Possibly recommended for master copies of data; Possibly not recommended for in-process data, large files.
  • 39. Data organization: AGU presentation, Class presentation, OS presentation; Presentations: Ocean Sciences, Class, AGU
  • 40. Data storage options. Local: Computer, External storage, Network server. Remote: Network server, Cloud storage (Google Apps; Box, Dropbox, etc.)
  • 45. File naming strategies
      Order by date: 1955-04-12_notes_MassObs.docx, 1955-04-12_questionnaire_MassObs.pdf, 1963-12-15_notes_Gorer.docx, 1963-12-15_questionnaire_Gorer.pdf
      Order by subject: Gorer_notes_1963-12-15.docx, Gorer_questionnaire_1963-12-15.pdf, MassObs_notes_1955-04-12.docx, MassObs_questionnaire_1955-04-12.pdf
      Order by type: Notes_Gorer_1963-12-15.docx, Notes_MassObs_1955-04-12.docx, Questionnaire_Gorer_1963-12-15.pdf, Questionnaire_MassObs_1955-04-12.pdf
      Forced order with numbering: 01_MassObs_questionnaire_1955-04-12.pdf, 02_MassObs_notes_1955-04-12.docx, 03_Gorer_questionnaire_1963-12-15.pdf, 04_Gorer_notes_1963-12-15.docx
  • 50. Disambiguate yourself: Open Researcher & Contributor ID. John L. Campbell, Forest Research Ecologist, Oregon State University, Corvallis, OR; John L. Campbell, Forest Research Ecologist, Center for Research on Ecosystem Change, US Forest Service, Durham, NC
  • 52. Another Doppelgänger: Chris Langdon, Studies ocean acidification, OSU HMSC; Chris Langdon, Studies ocean acidification, University of Miami
  • 54. What is metadata? • Data about data • Structured information that describes, explains, locates, or otherwise makes it easier to retrieve, use, or manage an information resource. NISO, Understanding Metadata
  • 55. Metadata: “The metadata accompanying your data should be written for a user 20 years into the future -- what does that person need to know to use your data properly? Prepare the metadata for a user who is unfamiliar with your project, methods, or observations.” Oak Ridge National Laboratory Distributed Active Archive Center for Biogeochemical Dynamics (ORNL DAAC)
  • 56. Metadata in real life You use it all the time…
  • 57. Major metadata standards: Darwin Core | biological diversity, taxonomy; Dublin Core | general; DDI (Data Documentation Initiative) | social & behavioral sciences; DIF (Directory Interchange Format) | environmental sciences; EML (Ecological Metadata Language) | ecology, biology; ISO 19115 | geographic data. 01101101 01100101 01110100 01100001 01100100 01100001 01110100 01100001
  • 58. Metadata examples: Santa Barbara Coastal Long Term Ecological Research (LTER) [web link]; Bureau of Labor Statistics, Consumer Price Index, 1913-1992 [web link]
  • 59. Legal & ethical considerations
  • 60. Data sharing & reuse: “...digitally formatted scientific data resulting from unclassified research supported wholly or in part by Federal funding should be stored and publicly accessible to search, retrieve, and analyze.” Office of Science and Technology Policy, The White House
  • 61. How to preserve & share data
  • 62. How to preserve & share data
  • 65. What is a data management plan? …Physical specimens consist of carbonate biothems and filtered water samples are stored in Spilde or Northup’s labs until completion of analyses. At this time they are returned to the federal agency for museum curation or destroyed during analysis. Field notes will be scanned into pdf files, with a copy sent to Carlsbad Caverns National Park or other federal cave manager. SEM images will be saved in the tif format. …
  • 66. Sections of a data management plan: 1. Types of data 2. Data & metadata standards 3. Archiving & preservation 4. Sharing (access provisions) 5. Transition from collection to reuse. See resources on your handout + use DMPTool
  • 67. Visit DMPTool: Your go-to resource for data management plans
  • 68. Contact information: Amanda Whitmire | Data Management Specialist, amanda.whitmire@oregonstate.edu; Steve Van Tuyl | Data and Digital Repository Librarian, steve.vantuyl@oregonstate.edu; http://bit.ly/OSUData
  • 71. Data does not speak for itself…
  • 72. YOU speak for YOUR data
  • 73. But first, you need to manage it
  • 74. Data management in the lifecycle
  • 75. How can OSU Libraries help?
  • 76. Copyright – who owns the data? Copyright applies to text, images, recordings, videos etc. in a digital form, in the same way that it would to analog versions of those works. The code of computer programs (both the human readable source code and the machine readable object code) is protected by copyright as a literary work. Data compilations such as datasets and databases can be protected by copyright in the literary works category, which includes ‘tables’ or ‘compilations’. A table or compilation, consisting of words, figures or symbols (or a combination of these) is protected if it is a literary work and has the required degree of originality.
  • 77. Types, formats & stages of data: Raw data
  • 78. Types, formats & stages of data: Intermediate data
  • 79. Types, formats & stages of data: Final data
  • 80. Data Types: Observational • Captured in situ • Can’t be recaptured, recreated or replaced Examples: Sensor readings, sensory (human) observations, survey results
  • 81. Data Types: Experimental • Data collected under controlled conditions, in situ or laboratory-based • Should be reproducible, but can be expensive Examples: gene sequences, chromatograms, spectroscopy, microscopy, cell counts
  • 82. Data Types: Derived or Compiled • Compiled: integration of data from several sources (could be many types) • Derived: new variables created from existing data • Reproducible, but can be very expensive Examples: text and data mining, compiled database, 3D models
  • 83. Data Types: Simulation • Results from using a model to study the behavior and performance of an actual or theoretical system • Models and metadata, where the input can be more important than output data Examples: climate models, economic models, biogeochemical models
  • 84. Data Types: Reference/Canonical Static or organic collection [peer-reviewed] datasets, most probably published and/or curated. Examples: gene sequence databanks, chemical structures, census data, spatial data portals
  • 85. Data storage & preservation
  • 86. About backups… Do them. University of Southampton, School of Electronics and Computer Science, Southampton, UK, 2005
  • 87. Plan for unexpected events
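
The file naming strategies on slide 45 can be generated rather than typed by hand. What follows is a minimal sketch in Python, not something from the workshop itself: the helper names (`clean`, `make_name`, `numbered`) are hypothetical, and the choice to strip everything except letters, digits and hyphens is an assumption based on the ISO-style dates shown on the slide.

```python
import re
from datetime import date

# Hypothetical helpers for illustration; none of these names come from the slides.
# Assumption: keep only letters, digits and hyphens inside each part of a name.
SAFE = re.compile(r"[^A-Za-z0-9-]+")


def clean(part: str) -> str:
    """Drop spaces and special characters from one part of a file name."""
    return SAFE.sub("", part)


def make_name(when: date, subject: str, kind: str, ext: str, order: str = "date") -> str:
    """Build a file name with an ISO date (YYYY-MM-DD) and underscores between parts."""
    parts = {"date": when.isoformat(), "subject": clean(subject), "type": clean(kind)}
    layouts = {
        "date": ("date", "type", "subject"),     # 1955-04-12_notes_MassObs
        "subject": ("subject", "type", "date"),  # MassObs_notes_1955-04-12
        "type": ("type", "subject", "date"),     # notes_MassObs_1955-04-12
    }
    stem = "_".join(parts[key] for key in layouts[order])
    return f"{stem}.{ext.lower().lstrip('.')}"


def numbered(names: list[str]) -> list[str]:
    """Prefix names with zero-padded sequence numbers to force a sort order."""
    width = max(2, len(str(len(names))))  # at least two digits: 01, 02, ...
    return [f"{i:0{width}d}_{name}" for i, name in enumerate(names, start=1)]


print(make_name(date(1955, 4, 12), "MassObs", "notes", "docx"))
print(make_name(date(1963, 12, 15), "Gorer", "questionnaire", "pdf", order="subject"))
```

Whichever ordering is chosen, building names through one function keeps the order of information consistent across a project, and the zero-padded prefix keeps numbered files sorting correctly once a sequence passes nine items.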

Editor's Notes

  1. The reason we’re here is to explain what data management is, why we think it should be important to you, provide you with some things to think about in managing your data and provide some resources for managing your data. This is a first, introductory workshop…. Invite questions and comments throughout. [Note: Unless otherwise noted, all images in this presentation are from Stock Exchange (http://www.sxc.hu) and copyright is retained by their owners.]
  2. As we talk with you today, be thinking about how what we are telling you works with your own research process.
  3. Image credit: Science Magazine
  4. Does not include, “any of the following: preliminary analyses, drafts of scientific papers, plans for future research, peer reviews, or communications with colleagues. This "recorded" material excludes physical objects (e.g., laboratory samples).” This narrow definition mostly takes a retrospective view of your dataset, in that it does not account for raw and intermediate data that may be critical to the research process but that don’t become part of the ‘final’ dataset. Data could be: observational, experimental, simulated, derived.
  5. “…Data may be viewed as the lowest level of abstraction from which information and knowledge are derived.”
  6. Data management is a verb – it involves intentional effort and activity. The main goals of DM are preservation and reuse, for you and for others. Covers all aspects of the data lifecycle from planning digital data capture methods, whittling down, ingestion to databases, providing for access and reuse, to transformation.
  7. Requirements: funder, journal, etc. Further your field: speed the pace of discovery by enhancing discovery, access and reuse; the research & knowledge base develops more rapidly. Increase your visibility & impact: share your data, get credit with data citation. Increase dataset value: more likely to be re-visited and re-used in ways that may be different from their original intention. Integrity: data (documentation will preserve integrity of data over time) & you (achieve ethics compliance [HIPAA] and enhance your reputation via transparency); allows for validation of your research. Efficiency: avoid digital archeology, avoid duplication of effort; allows easier exchange with other scientists. Preserve your lifelong body of work!
  8. If we agree that data that are better managed are more easily, rapidly, and effectively shared, then we can also agree that better data management can enable and speed up discoveries.
  9. 85 cancer microarray clinical trial publications with respect to the availability of their data. The 48% of trials with publicly available microarray data received 85% of the aggregate citations. Publicly available data was significantly (p = 0.006) associated with a 69% increase in citations, independently of journal impact factor, date of publication, and author country of origin using linear regression. Piwowar HA, Day RS, Fridsma DB (2007) Sharing Detailed Research Data Is Associated with Increased Citation Rate. PLoS ONE 2(3): e308. doi:10.1371/journal.pone.0000308 A [very] short bibliography on citation advantage: Henneken, E.A. & Accomazzi, A. Linking to Data - Effect on Citation Rates in Astronomy. (2011). Dorch, B. On the Citation Advantage of linking to data. (2012). Piwowar, H.A. & Vision, T.J. Data reuse and the open data citation advantage. PeerJ 1, e175 (2013) Sears, J.R. Data Sharing Effect on Article Citation Rate in Paleoceanography. AGU Fall Meet. Abstr. -1, 1628 (2011).
  10. 22 February 2013: The Office of Science and Technology Policy in the White House released a memorandum about expanding public access to the results of federally funded research. In addition to scholarly publications, federal agencies are making serious efforts to increase the sharing of research data. All federal agencies with more than $100M in R&D expenditures are subject to this memo. http://www.whitehouse.gov/sites/default/files/microsites/ostp/ostp_public_access_memo_2013.pdf
  11. All of these federal agencies sponsor work being conducted at OSU. Red $$ are federal agencies funding the most work at OSU; amounts shown are funding to OSU in FY2011-2012. [source: OSU Research Office]
  12. So, we’ve agreed on what “data” means, and what data management is. Let’s talk about the large variety of data that you may obtain or create. The types of data that you use and/or create have a large impact on most of your data management activities.
  13. Most research uses multiple data types & formats. My own dissertation included the generation or use of all of the data types that we’ve discussed (which might help to explain why it took me 7 years to earn a Ph.D.). http://hdl.handle.net/1957/9088 Image credit: Document by Piotrek Chuchla from The Noun Project
  14. http://en.wikibooks.org/wiki/Statistics/Different_Types_of_Data/Quantitative_and_Qualitative_Data More on qualitative data: ”Although we may have categories, the categories may have a structure to them. When there is not a natural ordering of the categories, we call these nominal categories. Examples might be gender, race, religion, or sport. When the categories may be ordered, these are called ordinal variables. Categorical variables that judge size (small, medium, large, etc.) are ordinal variables. Attitudes (strongly disagree, disagree, neutral, agree, strongly agree) are also ordinal variables, however we may not know which value is the best or worst of these issues. Note that the distance between these categories is not something we can measure.” More on quantitative data: “However, not all numbers are continuous and measurable. For example, the social security number is a number, but not something that one can add or subtract.”
  15. See more on geospatial analysis here: http://en.wikipedia.org/wiki/Geospatial_analysis
  16. If you don’t know yet, make an educated guess. Why is this important? The data types that you are creating and/or using have implications for data documentation, storage and backup, data citation & ethics or reuse, and preservation and sharing (basically, all aspects of data management).
  17. Before I move on – questions about anything so far?
  18. Common data management areas are: Planning Storage & backup Organization & naming Documentation & metadata Legal & ethical considerations Sharing & reuse Archiving & preservation [image source: http://upload.wikimedia.org/wikipedia/commons/2/22/Earth_Western_Hemisphere_transparent_background.png]
  19. Things to consider: How much data? Resources needed supplies and services costs associated with them Roles & responsibilities Metadata project level (documentation) data level (description) Data formats Data storage & backup working data vs. archive vs. shared/discoverable Ethics & consent Copyright (open data) Sharing How? With who?
  20. Quote from: UE RDM Kit. So, let’s review a few real-world scenarios.
  21. Hard drive failures happen ALL THE TIME, without notice. There’s a one-liner that goes something like, “There are two kinds of people in this world: those who have had a hard drive fail, and those who haven’t had a hard drive fail…yet.” (ha ha) Personal example: tell the story about Maria’s hard drive failure, and expensive (several $K) partial recovery of her dissertation data. She lost a month’s worth of work. Image credit: http://www.johnytech.com/wp-content/uploads/2012/09/booterrors.jpg
  22. Something else that happens ALL THE TIME: your computer gets stolen (Wiley, post-research cruise theft example). Or, you crash your bike and your computer goes flying (Angel). Image credit: http://www.flickr.com/photos/jeffreywarren/299422173/
  23. Lesson: offsite, redundant storage is strongly encouraged.
  24. From my post-doctoral work: Following the Chile earthquake in 2010 (magnitude 8.8), a tsunami devastated the coastal town of Dichato, and moved the Universidad de Concepcion’s research vessel 1 km inland. The marine research station was left standing, but everything in it was destroyed (including our samples and data servers). Luckily, we had the data server replicated at OSU, so we only lost the physical samples that were in Chile. Think it can’t happen to you? Yeah, I thought the same thing. [photo credit: http://www.ocean-partners.org/attachments/473_Being%20towed.JPG]
  25. Not from my personal experience: from the DataONE Data Stories project: https://notebooks.dataone.org/data-stories/the-best-laid-schemes-of-backups-and-redundancy/ Basic synopsis of this story: A researcher had made regular backups of her data, 5 GB worth, to an external drive. Her computer hard drive failed, so she sent it to IT, had it repaired, and went to restore her data using what was on her external backup drive. Unfortunately, only 30% of the data were accessible from the drive. “The only thing they could determine was that the software used to back up the data was not compatible with the recovery process.” Lesson: “As it turned out, creating backups was not enough to completely guard against tragedy; the only way to make sure those backups could serve their purpose in the event of an emergency was to check them regularly and ensure the data they contained were recoverable.”
  26. From NECDMC Module 4: Backup allows you to restore your data in the event that it is lost. A backup strategy is a plan for ensuring the accessibility of research data during the life of a project. What your strategy will be depends on the amount of data you are working with, the frequency that your data changes, and the system requirements for storing and rendering it.
  27. From NECDMC Module 4: Would you need just the data files themselves, the software that created them, or customized scripts written for data analysis? Depending on your research project, you may want to perform full, differential, or incremental backups.
  28. Before we can talk about backing up your data, we need to address where you can store it. We will talk about the advantages, disadvantages and longevity of each option. Then, we’ll move on to backup.
  29. Some content from: UE RDM Kit
  30. USB flash drives, Compact Discs (CDs) and Digital Video Discs (DVDs): If you choose to use CDs, DVDs and USB flash drives (for example, for working data or extra backup copies), you should: Choose high quality products from reputable manufacturers. Follow the instructions provided by the manufacturer for care and handling, including environmental conditions and labeling. Regularly check the media to make sure that they are not failing. Periodically 'refresh' the data (that is, copy to a new disk or new USB flash drive) every 2-5 years. Ensure that any private or confidential data is password-protected and/or encrypted. Some content from: UE RDM Kit
  31. For example: Fileservers managed by your research group, department or college Fileservers managed by Information Services (IS) image credit: http://www.flickr.com/photos/ketmonkey/3329372324/ Some content from: UE RDM Kit
  32. For example: DropBox, SkyDrive, Box, Mozy. Image credit: Cloud by Benni from The Noun Project. Some content from: UE RDM Kit
  33. See more info here: http://oregonstate.edu/helpdocs/software/google-apps-osu Fully encrypted
  34. Local storage options: Personal computer External storage USB drive Drobo box Network server PI or department server located near you Remote storage options: Network server Run by Information Services, likely cloud-based Cloud storage Google Apps Suite (Drive, Docs, Forms, etc.) up to 30GB of data Variable access levels Login portal: http://oregonstate.edu/main/online-services/google-apps-for-osu
  35. Using a thoughtful, consistent file-naming strategy is the easiest way to make a big difference in organizing your files, and making them easy to find (not just by you, but by others in your group and your adviser). [image credit: PhDComics.com]
  36. Tweet me @AWhitTwit with your #overlyhonestmethods!
  37. Be consistent: Have conventions for naming (1) Directory structure, (2) Folder names, (3) File names. Always include the same information (e.g. date and time). Retain the order of information (e.g. YYYYMMDD, not MMDDYYYY). Be descriptive: Try to keep file and folder names under 32 characters. Within reason, include relevant information such as: Unique identifier (i.e. Project Name or Grant # in folder name); Project or research data name; Conditions (Lab instrument, Solvent, Temperature, etc.); Run of experiment (sequential); Date (in file properties too). Use application-specific codes in 3-letter file extension and lowercase: mov, tif, wrl. When using sequential numbering, make sure to use leading zeros to allow for multi-digit versions. For example, a sequence of 1-10 should be numbered 01-10; a sequence of 1-100 should be numbered 001-010-100. No special characters: & , * % # ; * ( ) ! @$ ^ ~ ' { } [ ] ? < > - Use only one period, and only before the file extension (e.g. name_paper.doc NOT name.paper.doc OR name_paper..doc).
  38. Show examples of versions. Can go back when you make mistakes or when changes are made. Share work with other people. Both work on things at the same time and merge back together. Akin to a game of telephone: version control can let you see exactly when a change was made.
  39. See Git, TortoiseSVN, Mercurial, …
  40. Before I move on – questions about anything so far?
  41. http://orcid.org – get an identifier for free! Here’s Amanda’s ORCID page: http://orcid.org/0000-0003-2429-8879 Thanks to Jackie Wirz at OHSU for finding this perfect example!
  42. Thanks to Jackie Wirz at OHSU for finding this perfect example!
  43. PROJECT LEVEL: Context of data collection Data collection methods Structure, organization of data files Data sources used (see citing data) Data validation, quality assurance Transformations of data from the raw data through analysis Information on confidentiality, access & use conditions DATA LEVEL: Variable names, and descriptions Explanation of codes and classification schemes used Algorithms used to transform data File format and software (including version) used
  44. [add content about licensing] Intellectual Property Office for Commercialization & Corporate Development (OCCD) Copyright Licensing Charging for data? Data attribution & citation HUMAN SUBJECTS? Informed consent & anonymization required prior to publishing Resources @ OSU: Office of Research Integrity, Institutional Review Board (IRB) Responsible Conduct of Research (RCR) Program
  45. “Learn it. Know it. Live it.”, Brad Hamilton, from Fast Times at Ridgemont High.
  46. How to share data: Repository: discipline-specific, journal, etc. Disciplinary Repositories DataBib: http://databib.lib.purdue.edu/ Open Access Directory: http://oad.simmons.edu/oadwiki/Data_repositories OSU's digital repository, ScholarsArchive@OSU Personal web site By request Best practices: Well documented, w/standard metadata schema Non-proprietary file formats
  47. How to share data: Repository: discipline-specific, journal, etc. OSU's digital repository, ScholarsArchive@OSU http://ir.library.oregonstate.edu Personal web site By request Best practices: Well documented, w/standard metadata schema Non-proprietary file formats There are storage options for various disciplines including discipline related repositories. Some scientific journals also have their own data repositories. OSU has an institutional repository, Scholars Archive that serves as a repository for University Scholarship. Currently we store data sets on a case-by-case basis. If you don’t know where to store your data, you may consult with the library or talk to people in your department and/or your field to see if there is a logical repository for your work.
  48. What is a data paper? http://guides.library.oregonstate.edu/data-management-data-papers-journals “Data papers facilitate the sharing of data in a standardized framework that provides value, impact, and recognition for authors. Data papers also provide much more thorough context and description than datasets that are simply deposited to a repository (which may have very minimal metadata requirements).” “Data papers thoroughly describe datasets, and do not usually include any interpretation or discussion (an exception may be discussion of different methods to collect the data, e.g.).”
  49. Your data is being managed whether you realize it or not, but in order to preserve data and make the most of it, researchers should actively plan their data management process. A data management plan is a document that provides a clear roadmap of the process of collecting, storing, sharing and preserving your data. Starting the data management process early not only makes it easier for the researcher; many repositories also have requirements about how the data are formatted, and it may not be possible to meet those requirements retroactively. [sample DMP text from: http://libguides.unm.edu/loader.php?type=d&id=194916]
  50. These are the five sections for the generic NSF data management plan, but they cover what is generally discussed in any good data management plan. Things to consider: How much data? Resources needed (supplies and services, and the costs associated with them). Roles & responsibilities. Metadata: project level (documentation) and data level (description). Data formats. Data storage & backup: working data vs. archive vs. shared/discoverable. Ethics & consent. Sharing: When? How? With whom?
  51. Create ready to use plans Meet agency requirements Step-by-step guidance OSU-specific guidance
  52. Please feel free to contact Amanda Whitmire or Maura Valentino with any data management questions you might have or questions about depositing your research to ScholarsArchive@OSU digital repository, especially research data associated with published research articles and/or your thesis or dissertation.
  53. As humans, we aren’t designed to interpret columns of numbers, or to be able to extract meaning from a table with a passing glance. As researchers and practitioners, we need to transform our data into something more meaningful.
  54. We need to be the ambassadors of our data, to represent it in a way that speaks to why we made the effort to collect it and what we expect to learn from it. Sadly, the large majority of researchers never receive any training in how to share and communicate about their data, even with their peers.
  55. Get the whole lifecycle graphic here: “Data management throughout the research lifecycle. Amanda Whitmire. figshare. http://dx.doi.org/10.6084/m9.figshare.774628”
  56. OSU Research Data Services can help with data management during most of the research lifecycle phases. Examples include: Data management plan consultation Finding & accessing secondary data Data description & metadata Ways to share your data within your project team How to prepare data for archiving and where to put it How to license your datasets How to cite your datasets (& help others cite it properly)
  57. Could be: Sensor output Audio visual recording Lab/field notes GPS coordinates Survey results Interview notes Characteristics of raw data: Raw units (volts, e.g.) No [extra] metadata No context How to manage raw data: Multiple copies in multiple locations DON'T modify them
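The raw-data advice above (multiple copies in multiple locations, and don't modify them) can be sketched in a few lines of Python. This is a minimal illustration, not an OSU-recommended tool: the function name `protect_raw_file` and the backup directories are hypothetical, and on Windows `os.chmod` only toggles the read-only flag.

```python
import os
import shutil
import stat
from pathlib import Path


def protect_raw_file(source: Path, backup_dirs: list) -> list:
    """Copy a raw data file into each backup directory, then mark the
    original and every copy read-only to guard against accidental edits."""
    source = Path(source)
    copies = []
    for directory in backup_dirs:
        directory = Path(directory)
        directory.mkdir(parents=True, exist_ok=True)
        # copy2 also preserves timestamps, which are part of the raw record.
        copies.append(Path(shutil.copy2(source, directory / source.name)))

    write_bits = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH
    for path in [source, *copies]:
        mode = stat.S_IMODE(path.stat().st_mode)
        os.chmod(path, mode & ~write_bits)  # clear all write permissions
    return copies
```

A call such as `protect_raw_file(Path("sensor_output.dat"), ["/mnt/external", "/mnt/server"])` (paths are placeholders) would leave three read-only copies. Analysis then happens on separate working copies, never on these files.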
  58. Characteristics of intermediate data Data that are in-process Anything between ‘raw’ and ‘final’ Multiple stages, multiple formats, using various software May be shareable w/colleagues, but not likely w/wider audience Managing intermediate data Multiple copies (of most recent versions) in multiple locations Use sensible folder hierarchies & file naming conventions Add metadata immediately & then additionally as you make changes
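One way to keep file naming sensible across many intermediate versions is to generate names from a fixed pattern rather than typing them by hand. The sketch below assumes a hypothetical convention of `project_description_YYYYMMDD_vNN.ext`; the pattern and the function names are illustrative only, and any consistent convention your group agrees on serves the same purpose.

```python
import re
from datetime import date


def slug(text: str) -> str:
    """Lowercase the text and replace runs of non-alphanumerics with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def build_filename(project: str, description: str, when: date,
                   version: int, ext: str) -> str:
    """Assemble a file name as project_description_YYYYMMDD_vNN.ext
    (an assumed convention, chosen so names sort by date and version)."""
    stamp = when.strftime("%Y%m%d")
    extension = ext.lstrip(".").lower()
    return f"{slug(project)}_{slug(description)}_{stamp}_v{version:02d}.{extension}"


print(build_filename("Dichato Survey", "CTD profiles (cleaned)",
                     date(2013, 10, 15), 3, ".CSV"))
# prints: dichato-survey_ctd-profiles-cleaned_20131015_v03.csv
```

Putting the date in year-month-day order and zero-padding the version number means a plain alphabetical listing also sorts the files chronologically.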
  59. Characteristics of final data: May be a subset of raw data, or a much larger dataset Shareable with anyone Non-proprietary formats are ideal Metadata are thorough & complete Managing final data: Multiple copies in multiple locations Refresh data every 2-5 years If you use a proprietary format, migrate data to new version regularly Share via an open repository [image copyright: http://www.flickr.com/photos/usnavy/]
  60. Image credit: Stock X.chnge
  61. Image credit: OSU media services
  62. For example, in climate science, HUGE amounts of data are compiled: http://www.realclimate.org/index.php/data-sources/ In the example image, hydrographic data from all over the world are compiled into a database, and new variables can be derived from them. For example, depth profiles of seawater density can be derived from measurements of pressure, temperature, and salinity (via conductivity measurements). Image credit: http://cdiac.ornl.gov/oceans/RepeatSections/CDIAC_CLIVAR_global_map.jpg
  63. Image credit: http://media.oregonlive.com/pacific-northwest-news/photo/gs61tsun103jpg-730687c0e8a35305.jpg This model of tsunami driftage is based on ocean topography and surface winds. See: http://iprc.soest.hawaii.edu/research/research.php, research by N. Maximenko.
  64. Image credit: http://upload.wikimedia.org/wikipedia/commons/6/66/1930_census_Olsen_Schmidt.gif
  65. Anticipate: Volume/File type(s) Raw data vs. processed/analyzed data Versioning File Naming Conventions Privacy Concerns Storage practice Backup plans; LOCKSS; checksums Storage needs will depend on the volume and type of data in question. A plan should anticipate the particular storage needs of your research. How much data do you expect to collect and how will storage infrastructure be funded? How will files be arranged on disk? This may seem like a trivial issue, but it’s important. If the person who has been responsible for maintaining files leaves, you need to be able to find things. You’ll also want to group files meaningfully with their related metadata. Data are, of course, subject to loss and corruption. (Bonus joke: there are two types of computer users… those who have suffered hard drive failure and those who haven’t suffered hard drive failure…yet). Will there be multiple copies, both on-site and off-site? Are there checks against file corruption? Will you use specialized software or services to manage these aspects of the process? You should know which data you intend to retain locally and for how long. Who will be responsible for this after the main project period is over? Will you deposit in a preservation repository? Different repositories may have different preservation policies, which you should be aware of. It’s also necessary to consider storage throughout the lifecycle. How will your data make it from the collection point, into your local storage, into backup, then into long term preservation? The systems involved may not be capable of interoperating without some planning at each stage.
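A simple check against file corruption is to record a checksum for every file and recompute it later, for example after copying data to a backup. The following is a minimal sketch using only the Python standard library; the manifest file name `manifest.json` and the function names are assumptions made for illustration, not part of any tool mentioned in these slides.

```python
import hashlib
import json
from pathlib import Path

MANIFEST_NAME = "manifest.json"  # hypothetical name for the checksum record


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(root: Path) -> dict:
    """Record a checksum for every file under root (except the manifest)."""
    checksums = {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file() and p.name != MANIFEST_NAME
    }
    (root / MANIFEST_NAME).write_text(json.dumps(checksums, indent=2))
    return checksums


def verify_manifest(root: Path) -> list:
    """Return the relative paths of files that are missing or have changed."""
    recorded = json.loads((root / MANIFEST_NAME).read_text())
    problems = []
    for rel, expected in recorded.items():
        target = root / rel
        if not target.is_file() or sha256_of(target) != expected:
            problems.append(rel)
    return problems
```

Running `write_manifest` once when data are finalized, and `verify_manifest` on each copy at regular intervals, gives an early warning that a copy has silently degraded while another good copy still exists.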
  66. Lesson: offsite, redundant storage is strongly encouraged. Keep: 1 copy on your local workstation; 1 local/removable, such as an external hard drive; 1 on a central server and/or 1 remote, such as on a cloud server*. *Depending on the type of data, as cloud servers are not always secure.
  67. Following the Chile earthquake in 2010 (magnitude 8.8), a tsunami devastated the coastal town of Dichato, and moved the Universidad de Concepcion’s research vessel 1 km inland. The marine research station was left standing, but everything in it was destroyed (including our samples). Think it can’t happen to you? Yeah, I thought the same thing. [photo credit: http://www.ocean-partners.org/attachments/473_Being%20towed.JPG]
  68. ~back to Amanda~ NOTE: 15 October 2013: the software upgrade in this classroom to Office 2013 introduced a bug and subsequent lack of functionality for Colectica. Likewise, DataUp is currently throwing a bug for unknown reasons. I am in contact with both entities, but the bugs could not be fixed by today. Our apologies! DataUp is “an open source tool helping researchers document, manage, and archive their tabular data, DataUp operates within the scientist's workflow and integrates with Microsoft® Excel.” http://dataup.cdlib.org/ (EML metadata) Colectica for Excel allows documenting of Variables, Code Lists, and Data Sets directly from within Microsoft Excel. Export your data documentation to an XML file in the DDI metadata format, the standard for data documentation. Open and edit it from Colectica Designer, Colectica Express, or other DDI applications. http://www.colectica.com/software/colecticaforexcel (DDI metadata)