GRAD 521, Research Data Management
Winter 2014 - Lecture 1
Amanda L. Whitmire, Asst. Professor
Lesson One Outline
Introductions
The importance of
data management
What is/are ‘data’?
B.S. in Aquatic Biology, 2000
Worked in a bioluminescence laboratory
Ph.D. in Oceanography, emphasis in biological
oceanography, 2008
Dissertation study area: bio-optics; using optical tools
to study ocean ecology (N. California Current)
Post-doc in Oceanography, emphasis in biological
oceanography, 2008-2012
Study area: bio-optics; using optical tools to study
ocean ecology in low oxygen zones (N. Chile)
Assistant Professor, Data Management
Specialist, Sept. 2012 - present
Course Overview
Overview of research data management,
definitions & best practices
Types, formats & stages of research data
Data storage, backup & security
Metadata (data documentation)
Legal & ethical considerations of research data
Data sharing & reuse
Archiving & preservation
Pair & Share
Name
College/Department/Unit/etc.
1st year, 2nd year, etc.
What is/are data?
Why actively
manage it?
What is data?
Research data is:
“…the recorded factual material
commonly accepted in the scientific
community as necessary to validate
research findings.”
U.S. Office of Management and Budget, Circular A-110
What is research data?
“Unlike other types of information, research
data are collected, observed, or created, for the
purposes of analysis to produce
and validate original research
results.”
University of Edinburgh
MANTRA Research Data Management Training,
‘Research Data Explained’
What is data management?
Actions that contribute to effective
storage, preservation and reuse of
data and documentation throughout
the research lifecycle.
Data management is not:
Data science
Computational science
Database administration
A research method:
• what data to collect
• how to collect them
• how to design an experiment
Why Data Management?
Images collected by DataOne.org
Data deluge
Data is collected from sensors, sensor
networks, remote sensing, observations,
and more - this calls for increased attention
to data management and stewardship
Photo courtesy of www.carboafrica.net
Photo courtesy of http://modis.gsfc.nasa.gov/
Photo courtesy of http://www.futurlec.com
CC image by tajai on Flickr
CC image by CIMMYT on Flickr
Image collected by Viv Hutchinson
The World of Data Around Us
[Chart: information created vs. available storage, in petabytes worldwide, 2005-2010 (axis 0 to 1,000,000). The gap between the two curves is labeled "Transient information or unfilled demand for storage".]
Source: John Gantz, IDC Corporation: The Expanding Digital Universe
The World of Data Around Us: Data Loss
Natural disaster
Facilities infrastructure failure
Storage failure
Server hardware/software failure
Application software failure
External dependencies (e.g. PKI failure)
Format obsolescence
Legal encumbrance
Human error
Malicious attack by human or
automated agents
Loss of staffing competencies
Loss of institutional commitment
Loss of financial stability
Changes in user expectations and
requirements
CC image by Sharyn Morrow on Flickr
CC image by momboleum on Flickr
Poor Data Management Affects Everyone
“MEDICARE PAYMENT ERRORS NEAR $20B” | (CNN) December 2004
Miscoding and billing errors from doctors and hospitals totaled $20,000,000,000 in FY2003 (9.3% error
rate). The error rate measured claims that were paid despite being medically unnecessary, inadequately
documented or improperly coded. In some instances, Medicare asked health care providers for medical
records to back up their claims and got no response. The survey did not document instances of alleged
fraud. This error rate actually was an improvement over the previous fiscal year (9.8% error rate).
“AUDIT: JUSTICE STATS ON ANTI-TERROR CASES FLAWED” | (AP) February 2007
The Justice Department Inspector General found only two sets of data out of 26 concerning terrorism
attacks were accurate. The Justice Department uses these statistics to argue for their budget. The Inspector
General said the data “appear to be the result of decentralized and haphazard methods of collections … and
do not appear to be intentional.”
“OOPS! TECH ERROR WIPES OUT ALASKA INFO” | (AP) March 2007
A technician managed to delete the data and backup for the $38 billion Alaska oil revenue fund – money
received by residents of the State. Correcting the errors cost the State an additional $220,700 (which of
course was taken off the receipts to Alaska residents.)
Slide courtesy of BLM
Poor Science Data Management Example
A wildlife biologist for a small field office was the in-house
GIS expert and provided support for all the staff’s
GIS needs. However, the data was stored on her own
workstation. When the biologist relocated to another
office, no one understood how the data was stored or
managed.
Solution: A state office GIS specialist retrieved the
workstation and sifted through files trying to salvage
relevant data.
Cost: 1 work month ($4,000) plus the
value of data that was not recovered
CC image by DTRave on Open Clip Art Library
Importance of Data Management
The climate scientists at the centre of a media storm
over leaked emails were yesterday cleared of
accusations that they fudged their results and silenced
critics, but a review found they had failed to be open
enough about their work.
Why Data Management:
Researcher Perspective
Manage your data for yourself:
o Keep yourself organized
o Track your research processes for
reproducibility
o Better control versions of data
o Quality control your data more efficiently
Why Data Management:
Researcher Perspective
Make backups to avoid data loss
Format your data for re-use (by yourself or others)
Be prepared: Document your data for your own
recollection, accountability, and re-use (by yourself or
others)
Prepare it to share it – gain credibility
and recognition for your science efforts!
CC image by UWW ResNet on Flickr
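The first point on this slide, making backups to avoid data loss, can be sketched in a few lines of Python. This is a minimal illustration and not part of the original lecture: the function names `sha256_of` and `backup_and_verify` are hypothetical, and the sketch assumes one file copied to a second directory, with a SHA-256 checksum comparison standing in for "did the copy survive intact?". A real backup plan would also keep copies on separate hardware and off site.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 checksum of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_and_verify(source: Path, backup_dir: Path) -> Path:
    """Copy `source` into `backup_dir` and confirm the copy is identical.

    Hypothetical helper for illustration only; raises OSError if the
    checksum of the copy differs from the checksum of the original.
    """
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / source.name
    shutil.copy2(source, target)  # copy2 also preserves timestamps
    if sha256_of(source) != sha256_of(target):
        raise OSError(f"Backup of {source} does not match the original")
    return target
```

Comparing checksums rather than file sizes or dates is a deliberate choice here: it detects silent corruption that a simple "the file exists" check would miss.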
Why data management:
Foundation to advance science
Data is a valuable asset
It is expensive & time-consuming to collect
Well-managed data can result in
re-use, integration & new science
Spatio-Temporal Exploratory
Models predict the probability
of occurrence of bird species
across the United States on a 35
km x 35 km grid.
Data sources:
• eBird
• Land Cover
• Meteorology
• MODIS – Remote sensing data
Potential uses:
• Examine patterns of migration
• Infer impacts of climate change
• Measure patterns of habitat usage
• Measure population trends
Model results: Occurrence of Indigo Bunting (2008), shown for Jan, Apr, Jun, Sep, Dec
Slide courtesy of DataONE
Data Integration Results
Images courtesy of Cornell Ornithology Lab
http://www.youtube.com/watch?v=Cik6fIuoPDk
Where a majority of data end up now…
Imagine if data were more accessible
New discoveries
A new image processing technique reveals something not before seen in this Hubble Space Telescope
image taken 11 years ago: A faint planet (arrows), the outermost of three discovered with
ground-based telescopes last year around the young star HR 8799.
(D. Lafrenière et al., Astrophysical Journal Letters)
“The first thing it tells you is how valuable maintaining long-term archives can be. Here is a major
discovery that’s been lurking in the data for about 10 years!” comments Matt Mountain, director
of the Space Telescope Science Institute in Baltimore, which operates Hubble.
“The second thing it tells you is having a well calibrated archive is necessary but not sufficient to
make breakthroughs — it also takes a very innovative group of people to develop very smart
extraction routines that can get rid of all the artifacts to reveal the planet hidden under all that
telescope and detector structure.”
“Planet hidden in Hubble archives,” Science News, Feb. 27, 2009
D. Lafrenière et al., ApJ Letters
Summary
The data deluge has created a surge of information that
needs to be well-managed and made accessible.
The cost of not doing data management can be very high.
Be cognizant of best practices and tools associated with
the data lifecycle to manage your data well.
Many benefits are associated with the act of managing
data, including the ability to find, access, understand, integrate
and re-use data.
Summary, continued
If data are:
Well-organized
Documented
Preserved
Accessible
Verified as to accuracy
and validity
The result is:
High quality data
Easy to share and re-use
Citation & credibility to
the researcher
Cost-savings to science
Thursday
Data management plans & the research lifecycle
Homework:
Take the pre-assessment survey
(link in Canvas)
Archived slides
About You

Introduction to research data management; Lecture 01 for GRAD521


Editor's Notes

  • #2 This presentation has a CC-BY license (Creative Commons attribution license). Please cite this work as “Whitmire, Amanda L. (2014). Research Data Management Curriculum, Lecture 1: Introduction to Research Data Management. Oregon State University Libraries. Retrieved [date] from: http://guides.library.oregonstate.edu/grad521Lectures.” Slides credited to DataONE (see slide notes) have the following citation: “DataONE Education Module: Why Data Management. DataONE. Retrieved Jan. 5, 2014. From http://www.dataone.org/sites/all/documents/L01_DataManagement.pptx.”
  • #3 Image credit: Surveying by Luis Prado from The Noun Project
  • #4 About me…
  • #5 Let’s spend some time reviewing the syllabus and getting acquainted with what you can expect for the next 11 weeks. Image credit: Files by Pieter J. Smits from The Noun Project (knowledge transfer icon is public domain)
  • #7 Welcome to your first “active learning” exercise! [rubs hands together in a plotting manner] Give students 3-4 minutes to discuss with a partner. Then ask for responses from a few students. Image credit: ‘Interview’ by Sarah Abraham from The Noun Project.
  • #8 Image credit: Science Magazine
  • #9 The definition does not include “any of the following: preliminary analyses, drafts of scientific papers, plans for future research, peer reviews, or communications with colleagues. This "recorded" material excludes physical objects (e.g., laboratory samples).” This narrow definition mostly takes a retrospective view of your dataset, in that it does not account for raw and intermediate data that may be critical to the research process but that don’t become part of the ‘final’ dataset. Data types could be: observational, experimental, simulated, derived, and reference or canonical.
  • #10 “…Data may be viewed as the lowest level of abstraction from which information and knowledge are derived.”
  • #11 Data management is a verb – it involves intentional effort and activity. The main goals of DM are preservation and reuse, for you and for others. It covers all aspects of the data lifecycle, from planning digital data capture methods, whittling data down, and ingesting it into databases to providing for access, reuse, and transformation.
  • #12 If this is why you are here, you signed up for the wrong class.
  • #14 Let’s look at one important area of scientific inquiry: climate change. What scale of data integration is necessary to study global trends over geologic timescales?
  • #15 Data are being generated in massive quantities daily. Improvements in technology enable higher precision and coverage in data acquisition, and higher-capacity systems store and migrate more data, increasing the importance of managing, integrating, and re-using data. In order to integrate these diverse datasets to answer questions of global significance, the data have to be well organized, well documented and described, preserved and accessible. It all depends on effective management of the data. Slide credit: DataONE Education Module 1.
  • #16 The amount of available storage is not keeping up with the amount of data flooding in daily. How do we decide what data we keep? Slide credit: DataONE Education Module 1.
  • #17 Slide credit: DataONE Education Module 1.
  • #18 Data Costs Consider some of the data management issues that made headlines, affecting agencies and organizations. Data quality is not limited to any one organization. These examples show costs (in terms of money lost) due to a lack of data quality control. Slide credit: DataONE Education Module 1.
  • #19 Consider this situation in an academic context. How common do you think it is that data are lost when graduate students leave, because their adviser either can’t find the data, doesn’t understand the file names or how the data are organized, or because the data aren’t documented well enough (e.g. experimental or observational conditions, human subject codes, samples vs. controls, etc.)? Slide credit: DataONE Education Module 1.
  • #20 After investigations by the House of Commons Science and Technology Committee (UK), Inspector General of the U.S. Department of Commerce, the National Science Foundation, the National Research Council of the National Academy of Sciences, and Pennsylvania State University, no evidence of scientific misconduct or wrongdoing was discovered. The scientists were able to show all of their data, software codes and other research materials to put the hacked emails into the context of rigorous discourse among colleagues. If they hadn’t been able to provide all of the evidence to support their conclusions on climate change, “Climategate” would likely still be going on. Slide credit (images only): DataONE Education Module 1.
  • #21 “How long should I retain data?” is not a clear-cut data management question. Last year, for example, the JCI retracted a published article because one of its data tables was duplicated. The publisher contacted the researchers to have them update the data, but they could not locate the original data files after six years, so the journal was forced to issue a retraction. Slide credit: NECDMC, Module 1: Overview of Research Data Management
  • #22 Manage your data for yourself. Keep yourself organized – be able to find your files (data inputs, analytic scripts, outputs at various stages of the analytic process, etc.). Track your science processes for reproducibility – be able to match up your outputs with the exact inputs and transformations that produced them. Better control versions of data – easily identify versions that can be periodically purged. Quality control your data more efficiently. Slide credit: DataONE Education Module 1.
  • #23 Slide credit: DataONE Education Module 1.
  • #24 Data should be managed to: maximize the effective use and value of data and information assets; continually improve quality, including data accuracy, integrity, integration, timeliness of data capture and presentation, relevance, and usefulness; ensure appropriate use of data and information; facilitate data sharing; and ensure sustainability and accessibility in the long term for re-use in science. Slide credit: DataONE Education Module 1.
  • #25 By re-using data collected from a variety of sources (the eBird database, land cover data, meteorology, and remotely sensed data), this project was able to compile and process the data using supercomputing to determine bird migration routes for particular species. Slide credit: DataONE Education Module 1.
  • #26 Slide credit: DataONE Education Module 1.
  • #27 An abundance of data, and metadata (if any was created), ends up in filing cabinets, on discarded hard drives, or in hard-copy journals on library shelves. Some ends up on the web, but often in subscription-only journals. Slide credit: DataONE Education Module 1.
  • #28 Data should be properly managed and eventually be placed where they are accessible, understandable, and re-usable. Slide credit: DataONE Education Module 1.
  • #29 Slide credit: DataONE Education Module 1.
  • #30 Slide credit: DataONE Education Module 1.
  • #31 For each stage of the data lifecycle… there are best practices… and… tools to help! Your well-managed and accessible data can contribute to science in ways you may not even imagine today! Slide credit: DataONE Education Module 1.
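
Note #22 asks researchers to be able to match outputs with the exact inputs and transformations that produced them. One way to approach that, sketched here under stated assumptions, is to record a checksum for every input file, the analysis script, and every output file. The function names `file_checksum`, `build_provenance_record`, `save_record`, and `outputs_unchanged` are hypothetical and not from the lecture; the JSON layout is likewise just one possible choice.

```python
import hashlib
import json
from pathlib import Path


def file_checksum(path) -> str:
    """SHA-256 checksum of a file (reads the whole file; fine for small data)."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def build_provenance_record(inputs, script, outputs) -> dict:
    """Map each input, the script, and each output to its checksum.

    Hypothetical structure for illustration: three sections, each a
    dictionary of {file path: checksum}.
    """
    def describe(paths):
        return {str(p): file_checksum(p) for p in paths}

    return {
        "inputs": describe(inputs),
        "script": describe([script]),
        "outputs": describe(outputs),
    }


def save_record(record: dict, path) -> Path:
    """Write the record as human-readable JSON next to the data."""
    path = Path(path)
    path.write_text(json.dumps(record, indent=2))
    return path


def outputs_unchanged(record: dict) -> bool:
    """True if every recorded output still has its recorded checksum."""
    return all(
        file_checksum(p) == digest for p, digest in record["outputs"].items()
    )
```

Saving such a record alongside each analysis run makes it possible to check later whether an output file still corresponds to the inputs and script it was produced from.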