Lesson 1: Introduction to research data management. From a series of lectures from a 10-week, 2-credit graduate-level course in research data management (GRAD521, offered at Oregon State University).
The course description is: "Careful examination of all aspects of research data management best practices. Designed to prepare students to exceed funder mandates for performance in data planning, documentation, preservation and sharing in an increasingly complex digital research environment. Open to students of all disciplines."
Major course content includes: Overview of research data management, definitions and best practices; Types, formats and stages of research data; Metadata (data documentation); Data storage, backup and security; Legal and ethical considerations of research data; Data sharing and reuse; Archiving and preservation.
See also, "Whitmire, Amanda (2014): GRAD 521 Research Data Management Lectures. figshare. http://dx.doi.org/10.6084/m9.figshare.1003835. Retrieved 23:25, Jan 07, 2015 (GMT)"
3. B.S. in Aquatic Biology, 2000
Worked in a bioluminescence laboratory
Ph.D. in Oceanography, emphasis in biological
oceanography, 2008
Dissertation study area: bio-optics; using optical tools
to study ocean ecology (N. California Current)
Post-doc in Oceanography, emphasis in biological
oceanography, 2008-2012
Study area: bio-optics; using optical tools to study
ocean ecology in low oxygen zones (N. Chile)
Assistant Professor, Data Management
Specialist, Sept. 2012 - present
4.
5. Course Overview
Overview of research data management,
definitions & best practices
Types, formats & stages of research data
Data storage, backup & security
Metadata (data documentation)
Legal & ethical considerations of research data
Data sharing & reuse
Archiving & preservation
8. “…the recorded factual material
commonly accepted in the scientific
community as necessary to validate
research findings.”
Research data is:
U.S. Office of Management and Budget, Circular A-110
9. “Unlike other types of information, research
data are collected, observed, or created, for the
purposes of analysis to produce
and validate original research
results.”
University of Edinburgh
MANTRA Research Data Management Training,
‘Research Data Explained’
What is research data?
10. Actions that contribute to effective
storage, preservation and reuse of
data and documentation throughout
the research lifecycle.
What is data management?
11. Data management is not:
Data science
Computational science
Database administration
A research method:
• what data to collect
• how to collect them
• how to design an experiment
14. Photo courtesy of www.carboafrica.net
Data is collected from sensors, sensor
networks, remote sensing, observations,
and more - this calls for increased attention
to data management and stewardship
Data deluge
Photo courtesy of http://modis.gsfc.nasa.gov/
Photo courtesy of http://www.futurlec.com
CC image by tajai on Flickr
CC image by CIMMYT on Flickr
Image collected by Viv Hutchinson
15. The World of Data Around Us
[Figure: Information versus available storage, in petabytes worldwide, 2005-2010 (vertical axis from 0 to 1,000,000). The gap between the two is labeled “Transient information or unfilled demand for storage”.]
Source: John Gantz, IDC Corporation: The Expanding Digital Universe
16. Natural disaster
Facilities infrastructure failure
Storage failure
Server hardware/software failure
Application software failure
External dependencies (e.g. PKI
failure)
Format obsolescence
Legal encumbrance
Human error
Malicious attack by human or
automated agents
Loss of staffing competencies
Loss of institutional commitment
Loss of financial stability
Changes in user expectations and
requirements
The World of Data Around Us: Data Loss
CC image by Sharyn Morrow on Flickr
CC image by momboleum on Flickr
17. Poor Data Management Affects Everyone
“MEDICARE PAYMENT ERRORS NEAR $20B” | (CNN) December 2004
Miscoding and billing errors from doctors and hospitals totaled $20,000,000,000 in FY2003 (9.3% error
rate). The error rate measured claims that were paid despite being medically unnecessary, inadequately
documented or improperly coded. In some instances, Medicare asked health care providers for medical
records to back up their claims and got no response. The survey did not document instances of alleged
fraud. This error rate actually was an improvement over the previous fiscal year (9.8% error rate).
“AUDIT: JUSTICE STATS ON ANTI-TERROR CASES FLAWED” | (AP) February 2007
The Justice Department Inspector General found only two sets of data out of 26 concerning terrorism
attacks were accurate. The Justice Department uses these statistics to argue for their budget. The Inspector
General said the data “appear to be the result of decentralized and haphazard methods of collections … and
do not appear to be intentional.”
“OOPS! TECH ERROR WIPES OUT ALASKA INFO” | (AP) March 2007
A technician managed to delete the data and backup for the $38 billion Alaska oil revenue fund – money
received by residents of the State. Correcting the errors cost the State an additional $220,700 (which of
course was taken off the receipts to Alaska residents.)
Slide courtesy of BLM
18. A wildlife biologist for a small field office was the in-house
GIS expert and provided support for all the staff’s
GIS needs. However, the data was stored on her own
workstation. When the biologist relocated to another
office, no one understood how the data was stored or
managed.
Solution: A state office GIS specialist retrieved the
workstation and sifted through files trying to salvage
relevant data.
Cost: 1 work month ($4,000) plus the
value of data that was not recovered
Poor Science Data Management Example
CC image by DTRave on Open Clip Art Library
19. Importance of Data Management
The climate scientists at the centre of a media storm
over leaked emails were yesterday cleared of
accusations that they fudged their results and silenced
critics, but a review found they had failed to be open
enough about their work.
20.
21. Manage your data for yourself:
o Keep yourself organized
o Track your research processes for
reproducibility
o Better control versions of data
o Quality control your data more efficiently
Why Data Management:
Researcher Perspective
22. Make backups to avoid data loss
Format your data for re-use (by yourself or others)
Be prepared: Document your data for your own
recollection, accountability, and re-use (by yourself or
others)
Prepare it to share it – gain credibility
and recognition for your science efforts!
CC image by UWWResNet on Flickr
Why Data Management:
Researcher Perspective
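As one way to act on the "make backups" advice, here is a minimal Python sketch. It copies a data file into a backup folder under a dated name and then checks that the copy matches the original byte for byte. The function name `backup_file` and the dated naming scheme are illustrative assumptions, not part of the course materials. A real backup plan would also keep copies on separate devices or in separate locations.

```python
import filecmp
import shutil
from datetime import date
from pathlib import Path


def backup_file(source: Path, backup_dir: Path) -> Path:
    """Copy `source` into `backup_dir` under a dated name and verify the copy.

    Hypothetical helper for illustration; the naming scheme is an assumption.
    """
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = date.today().isoformat()  # e.g. 2014-01-07
    target = backup_dir / f"{source.stem}_{stamp}{source.suffix}"
    shutil.copy2(source, target)  # copy2 also tries to preserve timestamps
    # shallow=False forces a byte-by-byte comparison of the two files
    if not filecmp.cmp(source, target, shallow=False):
        raise IOError(f"Backup of {source} does not match the original")
    return target
```

Calling `backup_file(Path("survey.csv"), Path("backups"))` would leave the original in place and return the path of the verified copy. An unverified copy can fail silently, which is why the comparison step is included.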
23. Data is a
valuable asset
It is expensive & time
consuming to collect
Why data management:
Foundation to advance science
24. Well-managed data can result in
re-use, integration & new science
Spatio-Temporal Exploratory
Models predict the probability
of occurrence of bird species
across the United States at a 35
km x 35 km grid.
Land Cover
Potential Uses-
• Examine patterns of migration
• Infer impacts of climate change
• Measure patterns of habitat usage
• Measure population trends
Model results
eBird
Meteorology
MODIS –
Remote
sensing data
Occurrence of Indigo Bunting (2008)
Jan Apr Jun Sep Dec
Slide courtesy of DataONE
28. New discoveries
A new image processing technique reveals something not before seen in this Hubble Space Telescope
image taken 11 years ago: A faint planet (arrows), the outermost of three discovered with ground-based
telescopes last year around the young star HR 8799.
Image: D. Lafrenière et al., Astrophysical Journal Letters
“The first thing it tells you is how valuable maintaining long-term archives can be. Here is a major
discovery that’s been lurking in the data for about 10 years!” comments Matt Mountain, director
of the Space Telescope Science Institute in Baltimore, which operates Hubble.
“The second thing it tells you is having a well calibrated archive is necessary but not sufficient to
make breakthroughs — it also takes a very innovative group of people to develop very smart
extraction routines that can get rid of all the artifacts to reveal the planet hidden under all that
telescope and detector structure.”
“Planet hidden in
Hubble archives”
Science News
Feb. 27, 2009
D. Lafrenière et al., ApJ Letters
29. The data deluge has created a surge of information that
needs to be well-managed and made accessible.
The cost of not doing data management can be very high.
Be cognizant of best practices and tools associated with
the data lifecycle to manage your data well.
Many benefits are associated with the act of managing
data, including the ability to find, access, understand, integrate
and re-use data.
Summary
30. Summary, continued
If data are:
Well-organized
Documented
Preserved
Accessible
Verified as to accuracy
and validity
The result is:
High quality data
Easy to share and re-use
Citation & credibility to
the researcher
Cost-savings to science
This presentation has a CC-BY license (Creative Commons attribution license). Please cite this work as “Whitmire, Amanda L. (2014). Research Data Management Curriculum, Lecture 1: Introduction to Research Data Management. Oregon State University Libraries. Retrieved [date] from: http://guides.library.oregonstate.edu/grad521Lectures.”
Slides credited to DataONE (see slide notes) have the following citation: “DataONE Education Module: Why Data Management. DataONE. Retrieved Jan. 5, 2014. From http://www.dataone.org/sites/all/documents/L01_DataManagement.pptx.”
Image credit: Surveying by Luis Prado from The Noun Project
About me…
Let’s spend some time reviewing the syllabus and getting acquainted with what you can expect for the next 11 weeks.
Image credit: Files by Pieter J. Smits from The Noun Project (knowledge transfer icon is public domain)
Welcome to your first “active learning” exercise! [rubs hands together in a plotting manner]
Give students 3-4 minutes to discuss with a partner. Then ask for responses from a few students.
Image credit: ‘Interview’ by Sarah Abraham from The Noun Project.
Image credit: Science Magazine
Does not include, “any of the following: preliminary analyses, drafts of scientific papers, plans for future research, peer reviews, or communications with colleagues. This "recorded" material excludes physical objects (e.g., laboratory samples).” This narrow definition mostly takes a retrospective view of your dataset, in that it does not account for raw and intermediate data that may be critical to the research process but that don’t become part of the ‘final’ dataset.
Data types could be:
Observational
Experimental
Simulated
Derived
Reference or canonical
…“Data may be viewed as the lowest level of abstraction from which information and knowledge are derived.”
Data management is a verb – it involves intentional effort and activity.
The main goals of DM are preservation and reuse, for you and for others.
Covers all aspects of the data lifecycle: planning digital data capture methods, whittling the data down, ingesting it into databases, providing for access and reuse, and transformation.
If this is why you are here, you signed up for the wrong class.
Let’s look at one important area of scientific inquiry: climate change. What scale of data integration is necessary to study global trends over geologic timescales?
Data are being generated in massive quantities daily. Improvements in technology enable higher precision and coverage in data acquisition, and higher-capacity systems can store and migrate more data, which increases the importance of managing, integrating, and re-using data. In order to integrate these diverse datasets to answer questions of global significance, the data have to be well organized, well documented and described, preserved and accessible. It all depends on effective management of the data.
Slide credit: DataONE Education Module 1.
The amount of available storage is not keeping up with the amount of data flooding in daily. How do we decide what data we keep?
Slide credit: DataONE Education Module 1.
Data Costs
Consider some of the data management issues that made headlines, affecting agencies and organizations. Data quality problems are not limited to any one organization. These examples show costs (in terms of money lost) due to a lack of data quality control.
Slide credit: DataONE Education Module 1.
Consider this situation in an academic context. How common do you think it is that data are lost when graduate students leave, because their adviser either can’t find the data, doesn’t understand the file names or how the data are organized, or because the data aren’t documented well enough (e.g. experimental or observational conditions, human subject codes, samples vs. controls, etc.)?
Slide credit: DataONE Education Module 1.
After investigations by the House of Commons Science and Technology Committee (UK), Inspector General of the U.S. Department of Commerce, the National Science Foundation, the National Research Council of the National Academy of Sciences, and Pennsylvania State University, no evidence of scientific misconduct or wrongdoing was discovered. The scientists were able to show all of their data, software codes and other research materials to put the hacked emails into the context of rigorous discourse among colleagues. If they hadn’t been able to provide all of the evidence to support their conclusions on climate change, “Climategate” would likely still be going on.
Slide credit (images only): DataONE Education Module 1.
“How long should I retain data?” is not a clear-cut data management question. Last year, for example, the JCI retracted a published article because one of its data tables was duplicated.
The publisher contacted the researchers to have them update the data, but they could not locate the original data files after six years, so the journal was forced to issue a retraction.
Slide credit: NECDMC, Module 1: Overview of Research Data Management
Manage your data for yourself:
Keep yourself organized – be able to find your files (data inputs, analytic scripts, outputs at various stages of the analytic process, etc.)
Track your science processes for reproducibility – be able to match up your outputs with exact inputs and transformations that produced them
Better control versions of data – easily identify versions that can be periodically purged
Quality control your data more efficiently
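One way to "match up your outputs with exact inputs" is to record a checksum for every file that goes into and comes out of an analysis step. The Python sketch below is a minimal illustration under stated assumptions. The function names (`sha256_of`, `provenance_record`), the record layout, and the example file names are hypothetical and are not part of the course materials.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def provenance_record(inputs, script, outputs):
    """Describe one analysis step: which files went in and which came out."""
    return {
        "script": script,  # name of the (hypothetical) analysis script
        "inputs": {str(p): sha256_of(Path(p)) for p in inputs},
        "outputs": {str(p): sha256_of(Path(p)) for p in outputs},
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        raw = Path(tmp) / "raw_counts.csv"
        clean = Path(tmp) / "clean_counts.csv"
        raw.write_text("site,count\nA,3\nB,\n")
        # Stand-in transformation: drop rows with a missing count
        rows = [r for r in raw.read_text().splitlines() if not r.endswith(",")]
        clean.write_text("\n".join(rows) + "\n")
        record = provenance_record([raw], "clean_counts.py", [clean])
        print(json.dumps(record, indent=2))
```

Saving a record like this next to each output makes it possible to check later whether an input file has changed since the output was produced. It also helps identify which intermediate versions are safe to purge.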
Slide credit: DataONE Education Module 1.
Data should be managed to:
maximize the effective use and value of data and information assets
continually improve data quality, including accuracy, integrity, integration, timeliness of data capture and presentation, relevance and usefulness
ensure appropriate use of data and information
facilitate data sharing
ensure sustainability and accessibility in the long term for re-use in science
Slide credit: DataONE Education Module 1.
By re-using data collected from a variety of sources (the eBird database, land cover data, meteorology, and remotely sensed data), this project was able to compile and process the data using supercomputing to determine bird migration routes for particular species.
Slide credit: DataONE Education Module 1.
An abundance of data, and of metadata (if any is created at all), ends up in filing cabinets, on discarded hard drives, in hard-copy journals on library shelves, or on the web, where many journals are subscription-only.
Slide credit: DataONE Education Module 1.
Data should be properly managed and eventually be placed where they are accessible, understandable, and re-usable.
Slide credit: DataONE Education Module 1.
For each stage of the data lifecycle there are best practices, and tools to help!
Your well-managed and accessible data can contribute to science in ways you may not even imagine today!
Slide credit: DataONE Education Module 1.