Introduction to Data Management
CC image by University of Maryland Press Releases on Flickr
Adapted from curriculum developed by DataONE
Introduction to Data Management
• Why Manage Data?
• Data Sharing
• Data Entry & Manipulation
• Quality Control & Assurance
• Backup
• Metadata
• Data Citation
Introduction to Data Management
• Organization
• Reproducibility
• Version control
• Quality control
• Valuable asset
• Accuracy
• Integrity
• Data sharing
• Sustainability & accessibility
Introduction to Data Management
• If data are:
o Well-organized
o Documented
o Preserved
o Accessible
o Verified for accuracy and validity
• Result is:
o High quality data
o Easy to share and re-use in science
o Citation and credibility to the researcher
o Cost-savings to science
Introduction to Data Management
• Why Manage Data?
• Data Sharing
Introduction to Data Management
Data sharing requires effort, resources, and faith in others.
Why do it?
For the benefit of:
o the public
o the research sponsor
o the research community
o the researcher
CC image by Jessica Lucia on Flickr
Introduction to Data Management
A better-informed public yields better decision making with regard to:
o Environmental and economic planning
o Federal, state, and local policies
o Social choices such as use of tax dollars and education options
o Personal lifestyle and health choices such as nutrition and recreation
CC image by falonyates on Flickr
Introduction to Data Management
• Organizations that sponsor research must maximize the
value of research dollars
• Data sharing enhances the value of research investments by
enabling:
o verification of performance metrics and outcomes
o new research and increased return on investment
o advancement of the science
o reduced data duplication expenditures
Introduction to Data Management
Access to related research enables community members to:
o build upon the work of others
o perform meta-analyses
o share resources and perspectives
CC image by Lawrence Berkeley National Laboratory on Flickr
Introduction to Data Management
Access to related research enables community members to
(cont’d):
o increase transparency, reproducibility and comparability of results
o expand methodology assessment, recommendations and
improvement
o educate new researchers as to the most current and significant
findings
Introduction to Data Management
Scientists who share data gain the benefit of:
o recognition
o improved data quality
o greater opportunity for data exchange
o improved connections
CC image by SLU Madrid Campus on Flickr
Introduction to Data Management
Step One:
Create robust, discoverable metadata (a minimal example follows below)
o Geographic and temporal coverage
o Discipline-specific metadata schema
o Discipline-specific vocabulary
o Describe attributes
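A minimal sketch of such a metadata record, written as a plain Python dictionary (the field names and values are illustrative, not a specific standard; real projects would follow a discipline schema such as EML or ISO 19115):

import json

# Illustrative metadata record; field names are hypothetical
metadata = {
    "title": "Small mammal trapping data, field visit 1",
    "creator": "A. Researcher",
    "temporal_coverage": {"start": "2010-04-11", "end": "2010-09-30"},
    "geographic_coverage": {"north": 35.2, "south": 32.6,
                            "west": -109.3, "east": -107.5},
    "keywords": ["small mammals", "population survey"],  # discipline vocabulary
    "attributes": {  # describe every column in the dataset
        "Species_Code": "code for the species observed",
        "Soil_Temp_30cm": "soil temperature at 30 cm depth, degrees Celsius",
    },
}

# Store alongside the data so the record travels with the files
print(json.dumps(metadata, indent=2))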
Introduction to Data Management
Step Two:
Include archival and reference information
o Include a data citation
o Include a persistent identifier (e.g., DOI)
Data Citation Example: Sidlauskas, B. 2007. Data from: Testing for unequal rates of
morphological diversification in the absence of a detailed phylogeny: a case study from
characiform fishes. Dryad Digital Repository. doi:10.5061/dryad.20
Introduction to Data Management
Step Three:
Have data contributors review your metadata to ensure validity
and organizational ‘correctness’
o Are the processes described accurately?
o Are all contributions adequately identified?
o Has management reviewed the product and documentation?
o Is the funding organization properly recognized?
Introduction to Data Management
Step Four:
Publish your data and metadata via:
Data Repositories/Clearinghouses
• Discipline-specific
◦ Sciences
• Knowledge Network for Biocomplexity (KNB) Data Portal
• Long Term Ecological Research (LTER) Network Data Portal
◦ Social Sciences
• ICPSR
• Institutional
◦ Trace
Introduction to Data Management
• Why Manage Data?
• Data Sharing
• Data Entry & Manipulation
Introduction to Data Management
• Create data sets that are:
o Valid
o Organized to support ease of use
CC image by TravisS on Flickr
Introduction to Data Management
• Inconsistency between data collection events
– Location of Date information
– Inconsistent Date format
– Column names
– Order of columns
Introduction to Data Management
• Inconsistency between data collection events
– Different site spellings, capitalization, spaces
in site names—hard to filter
– Codes used for site names for some data, but
spelled out for others
– Mean1 value is in Weight column
– Text and numbers in same column – what is
the mean of 12, “escaped < 15”, and 91?
Introduction to Data Management
• Columns of data are consistent:
only numbers, dates, or text
• Consistent Names, Codes, Formats (date) used in each column
• Data are all in one table, which is much easier for a statistical program to work
with than multiple small tables, each of which requires human intervention
Introduction to Data Management
• Descriptive column names (a renaming sketch follows below)
◦ Soil T30 → Soil_Temp_30cm
◦ Species-Code → Species_Code (avoid using -, +, *, or ^ in column names;
some software may interpret these symbols as operators)
• Descriptive file names
◦ Mammal data-.csv → FieldVisit1_SmallMammalData_2010-04-11.csv
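A minimal sketch of applying such renames programmatically with pandas (the DataFrame here is a stand-in for your real file; the old and new names come from the slide):

import pandas as pd

df = pd.DataFrame(columns=["Soil T30", "Species-Code"])

# Replace spaces and operator-like symbols with underscores and
# make the names self-describing
df = df.rename(columns={
    "Soil T30": "Soil_Temp_30cm",
    "Species-Code": "Species_Code",
})
print(list(df.columns))  # ['Soil_Temp_30cm', 'Species_Code']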
Introduction to Data Management
• Enter complete lines of data
Sorting an Excel file with empty cells is not a good idea!
Introduction to Data Management
• Missing data
o Preferably leave the field empty (NULL = no value)
o In numeric fields, use a distinct value such as 9999 to indicate a missing value
o In text fields, use NA (“Not Applicable” or “Not Available”)
o Use data flags in a separate column to qualify missing values (see the sketch after the table)
Date      Time   NO3_N_Conc   NO3_N_Conc_Flag
20081011  1300   0.013
20081011  1330   0.016
20081011  1400                M1
20081011  1430   0.018
20081011  1500   0.001        E1

M1 = missing; no sample collected
E1 = estimated from grab sample
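A minimal sketch of reading such flagged data with pandas so that empty cells and 9999 sentinels become true missing values (the file name nitrate.csv is hypothetical; the column names follow the table above):

import pandas as pd

# Treat empty cells and the 9999 sentinel as missing (NaN)
df = pd.read_csv("nitrate.csv", na_values=["", 9999])

# Use the flag column to qualify values, e.g. exclude estimated readings (E1)
measured = df[df["NO3_N_Conc_Flag"] != "E1"]

# Statistics then skip missing values by default
print(measured["NO3_N_Conc"].mean())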
Introduction to Data Management
Spreadsheets:
• Great for charts, graphs, calculations
• Flexible about cell content type: cells in the same column can contain numbers or text
• Lack record integrity: a column can be sorted independently of all others
• Easy to use, but harder to maintain as the complexity and size of the data grow

Databases:
• Easy to query to select portions of data
• Data fields are typed: for example, only integers are allowed in integer fields
• Columns cannot be sorted independently of each other
• Steeper learning curve than a spreadsheet
Introduction to Data Management
• A set of tables
• Relationships
• A command language

Sample sites: *siteID, site_name, latitude, longitude, description
Species: *speciesID, species_name, common_name, family, order
Samples: *sampleID, siteID, sample_date, speciesID, height, flowering, flag, comments
Introduction to Data Management
Date           Site          Height                Flowering
<dates only>   <text only>   <real numbers only>   <‘y’ and ‘n’ only>

Advantages:
• quality control
• performance
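A minimal sketch of how a database enforces such typed columns, using Python's built-in sqlite3 (the table name and CHECK constraints are illustrative):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE observations (
        obs_date  TEXT NOT NULL,   -- dates stored as ISO strings
        site      TEXT NOT NULL,   -- text only
        height    REAL,            -- real numbers only
        flowering TEXT CHECK (flowering IN ('y', 'n'))
    )
""")

# A valid row is accepted...
conn.execute("INSERT INTO observations VALUES ('2010-02-13', 'A', 12.5, 'y')")

# ...but an out-of-range value is rejected at entry time
try:
    conn.execute("INSERT INTO observations VALUES ('2010-02-13', 'A', 12.5, 'maybe')")
except sqlite3.IntegrityError as err:
    print("Rejected:", err)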
Introduction to Data Management
Date       Site   Species   Flowering?
2/13/2010  A      BOGR2     y
2/13/2010  B      HODR      y
4/15/2010  B      BOER4     y
4/15/2010  C      PLJA      n

Site   Latitude   Longitude
A      34.1       -109.3
B      35.2       -108.6
C      32.6       -107.5

Date       Site   Species   Flowering?   Latitude   Longitude
2/13/2010  A      BOGR2     y            34.1       -109.3
2/13/2010  B      HODR      y            35.2       -108.6
4/15/2010  B      BOER4     y            35.2       -108.6
4/15/2010  C      PLJA      n            32.6       -107.5

Mix and match data on the fly (a join sketch follows below)
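A minimal sketch of that join using sqlite3 (table and column names are illustrative, chosen to mirror the tables above):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE obs   (obs_date TEXT, site TEXT, species TEXT, flowering TEXT);
    CREATE TABLE sites (site TEXT, latitude REAL, longitude REAL);
    INSERT INTO obs VALUES
        ('2010-02-13','A','BOGR2','y'), ('2010-02-13','B','HODR','y'),
        ('2010-04-15','B','BOER4','y'), ('2010-04-15','C','PLJA','n');
    INSERT INTO sites VALUES ('A',34.1,-109.3), ('B',35.2,-108.6), ('C',32.6,-107.5);
""")

# Combine observations with site coordinates on the fly
query = """
    SELECT o.obs_date, o.site, o.species, o.flowering, s.latitude, s.longitude
    FROM obs AS o JOIN sites AS s ON o.site = s.site
"""
for row in conn.execute(query):
    print(row)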
Introduction to Data Management
This table is called SoilTemp:

Date        Plot  Treatment  SensorDepth  Soil_Temperature
2010-02-01  C     R          30           12.8
2010-02-01  B     C          10           13.2
2010-02-02  C     R          0            6.3
2010-02-02  A     N          0            15.1

SQL examples:

SELECT Date, Plot, Treatment, SensorDepth, Soil_Temperature FROM SoilTemp
WHERE Date = '2010-02-01'

returns:

Date        Plot  Treatment  SensorDepth  Soil_Temperature
2010-02-01  C     R          30           12.8
2010-02-01  B     C          10           13.2

SELECT * FROM SoilTemp WHERE Treatment = 'N' AND SensorDepth = 0

returns:

Date        Plot  Treatment  SensorDepth  Soil_Temperature
2010-02-02  A     N          0            15.1
Introduction to Data Management
• Be aware of Best Practices when designing data file structures
• Choose a data entry method that allows some validation of
data as it is entered
• Consider investing time in learning how to use a database if
datasets are large or complex
CC image by fo.ol on Flickr
Introduction to Data Management
• Why Manage Data?
• Data Sharing
• Data Entry & Manipulation
• Quality Control & Assurance
Introduction to Data Management
• Errors of Commission
o Incorrect or inaccurate data entered
o Examples: malfunctioning instrument, mistyped data
• Errors of Omission
o Data or metadata not recorded
o Examples: inadequate documentation, human error, anomalies in the
field
CC image by NickJWebb on Flickr
Introduction to Data Management
• Define & enforce standards
◦ Formats
◦ Codes
◦ Measurement units
◦ Metadata
• Assign responsibility for data quality
◦ Be sure the assigned person is trained in QA/QC
Introduction to Data Management
• Double entry
◦ Data keyed in by two independent people
◦ Check for agreement with computer verification
• Record a reading of the data and transcribe from the
recording
• Use text-to-speech program to read data back
CC image by weskriesel on Flickr
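A minimal sketch of the computer-verification step for double entry (the file names entry1.csv and entry2.csv are hypothetical):

import csv

# Load both independently keyed versions of the data
with open("entry1.csv", newline="") as f1, open("entry2.csv", newline="") as f2:
    rows1 = list(csv.reader(f1))
    rows2 = list(csv.reader(f2))

# Report every cell where the two versions disagree
for i, (r1, r2) in enumerate(zip(rows1, rows2), start=1):
    for j, (a, b) in enumerate(zip(r1, r2), start=1):
        if a != b:
            print(f"Row {i}, column {j}: {a!r} vs {b!r}")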
Introduction to Data Management
• Design data storage well
◦ Minimize the number of times an item must be entered
◦ Use consistent terminology
◦ Atomize data: one cell per piece of information
• Document changes to data
◦ Avoids duplicate error checking
◦ Allows changes to be undone if necessary
Introduction to Data Management
• Make sure data line up in the proper columns
• Check for missing, impossible, or anomalous values
• Perform statistical summaries (see the sketch below)
CC image by chesapeakeclimate on Flickr
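A minimal sketch of such checks with pandas (the file plant_data.csv, the Height column, and the 0–300 range are hypothetical):

import pandas as pd

df = pd.read_csv("plant_data.csv")

# Missing values per column
print(df.isna().sum())

# Impossible values: heights must fall within a plausible range
out_of_range = df[(df["Height"] < 0) | (df["Height"] > 300)]
print(out_of_range)

# Statistical summary to spot anomalous values
print(df["Height"].describe())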
Introduction to Data Management
• Look for outliers
◦ Outliers are extreme values for a variable given the statistical model
being used
◦ The goal is not to eliminate outliers but to identify potential data
contamination
[Scatter plot illustrating an extreme value relative to the rest of the data]
Introduction to Data Management
• Methods to look for outliers
◦ Graphical
• Normal probability plots
• Regression
• Scatter plots
◦ Plotting on maps
◦ Deviation (see the sketch below)
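A minimal sketch of a deviation-based screen using only the standard library (the readings and the 2-standard-deviation cutoff are illustrative; flagged values are reviewed, not deleted):

import statistics

values = [12.8, 13.2, 12.9, 13.1, 12.7, 55.0, 13.0]  # hypothetical readings

mean = statistics.mean(values)
sd = statistics.stdev(values)

# Flag values far from the mean for review
for v in values:
    if abs(v - mean) > 2 * sd:
        print(f"Potential outlier: {v}")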
Introduction to Data Management
• Beware of errors of commission and omission
• Execute quality assurance and quality control strategies
◦ Data entry
◦ Data visualization
Introduction to Data Management
• Why Manage Data?
• Data Sharing
• Data Entry & Manipulation
• Quality Control & Assurance
• Backup
Introduction to Data Management
• Backups vs. Archives
o Backups: a copy (or copies) of the original file is made before the original is overwritten
o Archives: preservation of the file
• Data Preservation
o Includes archiving in addition to processes such as data rescue, data reformatting, data conversion, and metadata creation
Introduction to Data Management
• Backups
o periodic snapshots
o usually copies of files
o performed on regular schedule
• Archiving
o preserve data
o usually the final version
o performed at the end of a project or milestones
It is a good idea to have multiple copies of your backups and
archives, in case one copy fails.
Introduction to Data Management
• Limit or negate loss of data
• Save time, money, productivity
• Help prepare for disasters
o Accidental deletions
o Fires, natural disasters
o Software bugs, hardware failures
• Reproduce results
• Respond to data requests
• Limit liability
CC image courtesy of Brian J Matis on Flickr
Introduction to Data Management
• Includes backups and archiving
• Also includes
◦ data conversion
◦ data reformatting
◦ data rescue
Introduction to Data Management
• Data Conversions and Formats
o Use non-proprietary, standard formats
o Textual documents → .txt
o Spreadsheets → .csv
o Digital images → .tiff
• Versioning
• File Naming (see the sketch below)
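A minimal sketch of generating descriptive, consistently dated file names (the pattern mirrors the earlier FieldVisit example; the project details are hypothetical):

from datetime import date

def data_file_name(visit: int, dataset: str, when: date, ext: str = "csv") -> str:
    """Build a descriptive, sortable name such as
    FieldVisit1_SmallMammalData_2010-04-11.csv"""
    return f"FieldVisit{visit}_{dataset}_{when.isoformat()}.{ext}"

print(data_file_name(1, "SmallMammalData", date(2010, 4, 11)))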
Introduction to Data Management
• Create a backup plan that clearly identifies:
o roles,
o responsibilities,
o where the data are backed up,
o how often the files are backed up,
o how to access the files,
o recommended file formats to be used, and
o policies for migrating data to ensure data are not lost to media
degradation or changing formats or programs
• Review your backup plan regularly
• Update as needed (a snapshot sketch follows below)
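A minimal sketch of one piece of such a plan, a timestamped snapshot copy (the paths are hypothetical; a real plan would also cover off-site copies, access, and restore testing):

import shutil
from datetime import datetime
from pathlib import Path

def snapshot(source: str, backup_root: str) -> Path:
    """Copy the source directory to a timestamped folder under backup_root."""
    stamp = datetime.now().strftime("%Y-%m-%d_%H%M%S")
    dest = Path(backup_root) / f"{Path(source).name}_{stamp}"
    shutil.copytree(source, dest)
    return dest

# e.g. run daily from a scheduler such as cron:
# snapshot("/home/me/project/data", "/mnt/backup")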
Introduction to Data Management
• Why Manage Data?
• Data Sharing
• Data Entry & Manipulation
• Quality Control & Assurance
• Backup
• Metadata
Introduction to Data Management
• When you provide data to someone else, what types of
information would you want to include with the data?
• When you receive a dataset from an external source, what
types of details do you want to know about the data?
Introduction to Data Management
• Why were the data created?
• What limitations, if any, do the data have?
• What do the data mean?
• How should the data be cited if they are re-used in a new study?
Introduction to Data Management
• What are the data gaps?
• What processes were used for creating the
data?
• Are there any fees associated with the data?
• At what scale were the data created?
• What do the values in the tables mean?
• What software do I need in order to read the
data?
• What projection are the data in?
• Can I give these data to someone else?
Introduction to Data Management
Metadata is data ‘reporting’:
• WHO created the data?
• WHAT is the content of the data?
• WHEN were the data created?
• WHERE is it geographically?
• HOW were the data developed?
• WHY were the data developed?
Photo by Michelle Chang. All Rights Reserved.
Introduction to Data Management
Author(s)       Boullosa, Carmen.
Title(s)        They're cows, we're pigs / by Carmen Boullosa
Place           New York : Grove Press, 1997.
Physical Descr  viii, 180 p ; 22 cm.
Subject(s)      Pirates Caribbean Area Fiction.
Format          Fiction
CC image by USDAgov on Flickr
CC image by Mskadu on Flickr
Introduction to Data Management
[Figure: data details decline over time, starting from the time of data development (from Michener et al. 1997)]
• Specific details about problems with individual items or specific dates are lost relatively rapidly
• General details about datasets are lost through time
• Accident or technology change may make data unusable
• Retirement or career change makes access to “mental storage” difficult or unlikely
• Loss of the data developer leads to loss of the remaining information
Introduction to Data Management
• A Standard provides a structure to describe data with:
◦ Common terms to allow consistency between records
◦ Common definitions for easier interpretation
◦ Common language for ease of communication
◦ Common structure to quickly locate information
• In search and retrieval, standards provide:
◦ Documentation structure in a reliable and predictable format for
computer interpretation
◦ A uniform summary description of the dataset
CC image by ccarlstead on Flickr
Introduction to Data Management
CC image by Ilike on Flickr
Introduction to Data Management
• Why Manage Data?
• Data Sharing
• Data Entry & Manipulation
• Quality Control & Assurance
• Backup
• Metadata
• Data Citation
Introduction to Data Management
• Similar to citing a published
article or book
o Provide information necessary to
identify and locate the work cited
• No standards yet
• Use format recommended by
journal, repository, or
professional organization
CC image by Paxsimius on Flickr
Introduction to Data Management
Breitburg DL, Hondorp D, Audemard C, Carnegie RB, Burrell RB,
Trice M, Clark V (2015) Data from: Landscape-level variation in
disease susceptibility related to shallow-water hypoxia. PLOS
ONE. http://dx.doi.org/10.5061/dryad.9k231
Introduction to Data Management
A persistent identifier should be included in the citation:
• DOI (Digital Object Identifier)
o Globally unique, alphanumeric string assigned by a registration
agency to identify content and provide a persistent link to its
location.
o May be assigned to any item of intellectual property that is defined
by structured metadata
o Examples: 10.1234/NP5678, 10.5678/ISBN-0-7645-4889-4, 10.2224/2004-10-ISO-DOI
• The UT Libraries can assign a DOI to your data set.
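A minimal sketch of why DOIs make citations durable: any DOI resolves through the central doi.org proxy, so a working link can be built from the identifier alone (the DOI shown is the Dryad example cited earlier in the deck):

def doi_url(doi: str) -> str:
    """A DOI resolves through the doi.org proxy."""
    return f"https://doi.org/{doi}"

print(doi_url("10.5061/dryad.20"))  # https://doi.org/10.5061/dryad.20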
Introduction to Data Management
Chris Eaker
Data Curation Librarian
Hodges Library, Room 236
865-974-4404
chris@utk.edu
Available to help with…
• Data management plan support
• Data repositories
• Metadata
• Data management consulting

More Related Content

What's hot

Big Data Taxonomy 8/26/2013
Big Data Taxonomy 8/26/2013Big Data Taxonomy 8/26/2013
Big Data Taxonomy 8/26/2013DataTactics
 
Peng Privette SMM_AMS2014_P695
Peng Privette SMM_AMS2014_P695Peng Privette SMM_AMS2014_P695
Peng Privette SMM_AMS2014_P695
Ge Peng
 
Data Cleaning Techniques
Data Cleaning TechniquesData Cleaning Techniques
Data Cleaning Techniques
Amir Masoud Sefidian
 
Unit iii
Unit iiiUnit iii
Unit iii
Kgr Sushmitha
 
Major issues in data mining
Major issues in data miningMajor issues in data mining
Major issues in data miningSlideshare
 
Leveraging Oracle's Life Sciences Data Hub to Enable Dynamic Cross-Study Anal...
Leveraging Oracle's Life Sciences Data Hub to Enable Dynamic Cross-Study Anal...Leveraging Oracle's Life Sciences Data Hub to Enable Dynamic Cross-Study Anal...
Leveraging Oracle's Life Sciences Data Hub to Enable Dynamic Cross-Study Anal...Perficient
 
An Introduction to Clinical Study Migrations
An Introduction to Clinical Study MigrationsAn Introduction to Clinical Study Migrations
An Introduction to Clinical Study Migrations
Perficient, Inc.
 
G045033841
G045033841G045033841
G045033841
IJERA Editor
 
Using Oracle Health Sciences Data Management Workbench to Optimize the Manage...
Using Oracle Health Sciences Data Management Workbench to Optimize the Manage...Using Oracle Health Sciences Data Management Workbench to Optimize the Manage...
Using Oracle Health Sciences Data Management Workbench to Optimize the Manage...Perficient
 
Data Quality Asia Pacific Award_v1.1_20100520
Data Quality Asia Pacific Award_v1.1_20100520Data Quality Asia Pacific Award_v1.1_20100520
Data Quality Asia Pacific Award_v1.1_20100520Tatiana Stebakova
 
Issues, challenges, and solutions
Issues, challenges, and solutionsIssues, challenges, and solutions
Issues, challenges, and solutions
csandit
 
Metadata lecture(9 17-14)
Metadata lecture(9 17-14)Metadata lecture(9 17-14)
Metadata lecture(9 17-14)
mhb120
 
1 introductory slides (1)
1 introductory slides (1)1 introductory slides (1)
1 introductory slides (1)
tafosepsdfasg
 
Knowledge discovery process
Knowledge discovery process Knowledge discovery process
Knowledge discovery process
Shuvra Ghosh
 
A step towards a data quality theory
 A step towards a data quality theory A step towards a data quality theory
A step towards a data quality theory
Anastasija Nikiforova
 
Good Practice in Research Data Management
Good Practice in Research Data ManagementGood Practice in Research Data Management
Good Practice in Research Data Management
Historic Environment Scotland
 

What's hot (17)

Big Data Taxonomy 8/26/2013
Big Data Taxonomy 8/26/2013Big Data Taxonomy 8/26/2013
Big Data Taxonomy 8/26/2013
 
rEDCap At A Glance
rEDCap At A GlancerEDCap At A Glance
rEDCap At A Glance
 
Peng Privette SMM_AMS2014_P695
Peng Privette SMM_AMS2014_P695Peng Privette SMM_AMS2014_P695
Peng Privette SMM_AMS2014_P695
 
Data Cleaning Techniques
Data Cleaning TechniquesData Cleaning Techniques
Data Cleaning Techniques
 
Unit iii
Unit iiiUnit iii
Unit iii
 
Major issues in data mining
Major issues in data miningMajor issues in data mining
Major issues in data mining
 
Leveraging Oracle's Life Sciences Data Hub to Enable Dynamic Cross-Study Anal...
Leveraging Oracle's Life Sciences Data Hub to Enable Dynamic Cross-Study Anal...Leveraging Oracle's Life Sciences Data Hub to Enable Dynamic Cross-Study Anal...
Leveraging Oracle's Life Sciences Data Hub to Enable Dynamic Cross-Study Anal...
 
An Introduction to Clinical Study Migrations
An Introduction to Clinical Study MigrationsAn Introduction to Clinical Study Migrations
An Introduction to Clinical Study Migrations
 
G045033841
G045033841G045033841
G045033841
 
Using Oracle Health Sciences Data Management Workbench to Optimize the Manage...
Using Oracle Health Sciences Data Management Workbench to Optimize the Manage...Using Oracle Health Sciences Data Management Workbench to Optimize the Manage...
Using Oracle Health Sciences Data Management Workbench to Optimize the Manage...
 
Data Quality Asia Pacific Award_v1.1_20100520
Data Quality Asia Pacific Award_v1.1_20100520Data Quality Asia Pacific Award_v1.1_20100520
Data Quality Asia Pacific Award_v1.1_20100520
 
Issues, challenges, and solutions
Issues, challenges, and solutionsIssues, challenges, and solutions
Issues, challenges, and solutions
 
Metadata lecture(9 17-14)
Metadata lecture(9 17-14)Metadata lecture(9 17-14)
Metadata lecture(9 17-14)
 
1 introductory slides (1)
1 introductory slides (1)1 introductory slides (1)
1 introductory slides (1)
 
Knowledge discovery process
Knowledge discovery process Knowledge discovery process
Knowledge discovery process
 
A step towards a data quality theory
 A step towards a data quality theory A step towards a data quality theory
A step towards a data quality theory
 
Good Practice in Research Data Management
Good Practice in Research Data ManagementGood Practice in Research Data Management
Good Practice in Research Data Management
 

Similar to Data Management Best Practices

DataONE Education Module 07: Metadata
DataONE Education Module 07: MetadataDataONE Education Module 07: Metadata
DataONE Education Module 07: Metadata
DataONE
 
L07 metadata
L07 metadataL07 metadata
L07 metadata
thplayer127
 
Building the enterprise data architecture
Building the enterprise data architectureBuilding the enterprise data architecture
Building the enterprise data architectureCosta Pissaris
 
Enhancing educational data quality in heterogeneous learning contexts using p...
Enhancing educational data quality in heterogeneous learning contexts using p...Enhancing educational data quality in heterogeneous learning contexts using p...
Enhancing educational data quality in heterogeneous learning contexts using p...
Alex Rayón Jerez
 
Saksham Sarode - Building Effective test Data Management in Distributed Envir...
Saksham Sarode - Building Effective test Data Management in Distributed Envir...Saksham Sarode - Building Effective test Data Management in Distributed Envir...
Saksham Sarode - Building Effective test Data Management in Distributed Envir...
TEST Huddle
 
Ch~2.pdf
Ch~2.pdfCh~2.pdf
You Need a Data Catalog. Do You Know Why?
You Need a Data Catalog. Do You Know Why?You Need a Data Catalog. Do You Know Why?
You Need a Data Catalog. Do You Know Why?
Precisely
 
You Need a Data Catalog. Do You Know Why?
You Need a Data Catalog. Do You Know Why?You Need a Data Catalog. Do You Know Why?
You Need a Data Catalog. Do You Know Why?
Precisely
 
lecture 1.pdf
lecture 1.pdflecture 1.pdf
lecture 1.pdf
AhmadHussainShafiSE3
 
Data-Ed: Unlock Business Value through Data Quality Engineering
Data-Ed: Unlock Business Value through Data Quality EngineeringData-Ed: Unlock Business Value through Data Quality Engineering
Data-Ed: Unlock Business Value through Data Quality EngineeringDATAVERSITY
 
Data-Ed: Unlock Business Value through Data Quality Engineering
Data-Ed: Unlock Business Value through Data Quality Engineering Data-Ed: Unlock Business Value through Data Quality Engineering
Data-Ed: Unlock Business Value through Data Quality Engineering
Data Blueprint
 
An Agile & Adaptive Approach to Addressing Financial Services Regulations and...
An Agile & Adaptive Approach to Addressing Financial Services Regulations and...An Agile & Adaptive Approach to Addressing Financial Services Regulations and...
An Agile & Adaptive Approach to Addressing Financial Services Regulations and...
Neo4j
 
Combining Human+Machine Intelligence to Successfully Integrate Biomedical Data
Combining Human+Machine Intelligence to Successfully Integrate Biomedical DataCombining Human+Machine Intelligence to Successfully Integrate Biomedical Data
Combining Human+Machine Intelligence to Successfully Integrate Biomedical Data
russelltamr
 
BigData Testing by Shreya Pal
BigData Testing by Shreya PalBigData Testing by Shreya Pal
BigData Testing by Shreya Pal
Agile Testing Alliance
 
Automating Data Science over a Human Genomics Knowledge Base
Automating Data Science over a Human Genomics Knowledge BaseAutomating Data Science over a Human Genomics Knowledge Base
Automating Data Science over a Human Genomics Knowledge Base
Vaticle
 
Five Things to Consider About Data Mesh and Data Governance
Five Things to Consider About Data Mesh and Data GovernanceFive Things to Consider About Data Mesh and Data Governance
Five Things to Consider About Data Mesh and Data Governance
DATAVERSITY
 
Chapter 4 Organizational Aspects of Data Management.ppt
Chapter 4 Organizational Aspects of Data Management.pptChapter 4 Organizational Aspects of Data Management.ppt
Chapter 4 Organizational Aspects of Data Management.ppt
AnasSamara3
 
Ch_2.pdf
Ch_2.pdfCh_2.pdf
Ch_2.pdf
DawitBirhanu13
 
Data Quality: A Raising Data Warehousing Concern
Data Quality: A Raising Data Warehousing ConcernData Quality: A Raising Data Warehousing Concern
Data Quality: A Raising Data Warehousing Concern
Amin Chowdhury
 
Dw 07032018-dr pl pradhan
Dw 07032018-dr pl pradhanDw 07032018-dr pl pradhan
Dw 07032018-dr pl pradhan
Dr Pradhan PL Pradhan
 

Similar to Data Management Best Practices (20)

DataONE Education Module 07: Metadata
DataONE Education Module 07: MetadataDataONE Education Module 07: Metadata
DataONE Education Module 07: Metadata
 
L07 metadata
L07 metadataL07 metadata
L07 metadata
 
Building the enterprise data architecture
Building the enterprise data architectureBuilding the enterprise data architecture
Building the enterprise data architecture
 
Enhancing educational data quality in heterogeneous learning contexts using p...
Enhancing educational data quality in heterogeneous learning contexts using p...Enhancing educational data quality in heterogeneous learning contexts using p...
Enhancing educational data quality in heterogeneous learning contexts using p...
 
Saksham Sarode - Building Effective test Data Management in Distributed Envir...
Saksham Sarode - Building Effective test Data Management in Distributed Envir...Saksham Sarode - Building Effective test Data Management in Distributed Envir...
Saksham Sarode - Building Effective test Data Management in Distributed Envir...
 
Ch~2.pdf
Ch~2.pdfCh~2.pdf
Ch~2.pdf
 
You Need a Data Catalog. Do You Know Why?
You Need a Data Catalog. Do You Know Why?You Need a Data Catalog. Do You Know Why?
You Need a Data Catalog. Do You Know Why?
 
You Need a Data Catalog. Do You Know Why?
You Need a Data Catalog. Do You Know Why?You Need a Data Catalog. Do You Know Why?
You Need a Data Catalog. Do You Know Why?
 
lecture 1.pdf
lecture 1.pdflecture 1.pdf
lecture 1.pdf
 
Data-Ed: Unlock Business Value through Data Quality Engineering
Data-Ed: Unlock Business Value through Data Quality EngineeringData-Ed: Unlock Business Value through Data Quality Engineering
Data-Ed: Unlock Business Value through Data Quality Engineering
 
Data-Ed: Unlock Business Value through Data Quality Engineering
Data-Ed: Unlock Business Value through Data Quality Engineering Data-Ed: Unlock Business Value through Data Quality Engineering
Data-Ed: Unlock Business Value through Data Quality Engineering
 
An Agile & Adaptive Approach to Addressing Financial Services Regulations and...
An Agile & Adaptive Approach to Addressing Financial Services Regulations and...An Agile & Adaptive Approach to Addressing Financial Services Regulations and...
An Agile & Adaptive Approach to Addressing Financial Services Regulations and...
 
Combining Human+Machine Intelligence to Successfully Integrate Biomedical Data
Combining Human+Machine Intelligence to Successfully Integrate Biomedical DataCombining Human+Machine Intelligence to Successfully Integrate Biomedical Data
Combining Human+Machine Intelligence to Successfully Integrate Biomedical Data
 
BigData Testing by Shreya Pal
BigData Testing by Shreya PalBigData Testing by Shreya Pal
BigData Testing by Shreya Pal
 
Automating Data Science over a Human Genomics Knowledge Base
Automating Data Science over a Human Genomics Knowledge BaseAutomating Data Science over a Human Genomics Knowledge Base
Automating Data Science over a Human Genomics Knowledge Base
 
Five Things to Consider About Data Mesh and Data Governance
Five Things to Consider About Data Mesh and Data GovernanceFive Things to Consider About Data Mesh and Data Governance
Five Things to Consider About Data Mesh and Data Governance
 
Chapter 4 Organizational Aspects of Data Management.ppt
Chapter 4 Organizational Aspects of Data Management.pptChapter 4 Organizational Aspects of Data Management.ppt
Chapter 4 Organizational Aspects of Data Management.ppt
 
Ch_2.pdf
Ch_2.pdfCh_2.pdf
Ch_2.pdf
 
Data Quality: A Raising Data Warehousing Concern
Data Quality: A Raising Data Warehousing ConcernData Quality: A Raising Data Warehousing Concern
Data Quality: A Raising Data Warehousing Concern
 
Dw 07032018-dr pl pradhan
Dw 07032018-dr pl pradhanDw 07032018-dr pl pradhan
Dw 07032018-dr pl pradhan
 

Recently uploaded

The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptxThe use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
MAGOTI ERNEST
 
DMARDs Pharmacolgy Pharm D 5th Semester.pdf
DMARDs Pharmacolgy Pharm D 5th Semester.pdfDMARDs Pharmacolgy Pharm D 5th Semester.pdf
DMARDs Pharmacolgy Pharm D 5th Semester.pdf
fafyfskhan251kmf
 
THEMATIC APPERCEPTION TEST(TAT) cognitive abilities, creativity, and critic...
THEMATIC  APPERCEPTION  TEST(TAT) cognitive abilities, creativity, and critic...THEMATIC  APPERCEPTION  TEST(TAT) cognitive abilities, creativity, and critic...
THEMATIC APPERCEPTION TEST(TAT) cognitive abilities, creativity, and critic...
Abdul Wali Khan University Mardan,kP,Pakistan
 
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Ana Luísa Pinho
 
Shallowest Oil Discovery of Turkiye.pptx
Shallowest Oil Discovery of Turkiye.pptxShallowest Oil Discovery of Turkiye.pptx
Shallowest Oil Discovery of Turkiye.pptx
Gokturk Mehmet Dilci
 
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
yqqaatn0
 
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
Wasswaderrick3
 
SAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdfSAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdf
KrushnaDarade1
 
Nucleophilic Addition of carbonyl compounds.pptx
Nucleophilic Addition of carbonyl  compounds.pptxNucleophilic Addition of carbonyl  compounds.pptx
Nucleophilic Addition of carbonyl compounds.pptx
SSR02
 
BREEDING METHODS FOR DISEASE RESISTANCE.pptx
BREEDING METHODS FOR DISEASE RESISTANCE.pptxBREEDING METHODS FOR DISEASE RESISTANCE.pptx
BREEDING METHODS FOR DISEASE RESISTANCE.pptx
RASHMI M G
 
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
University of Maribor
 
Oedema_types_causes_pathophysiology.pptx
Oedema_types_causes_pathophysiology.pptxOedema_types_causes_pathophysiology.pptx
Oedema_types_causes_pathophysiology.pptx
muralinath2
 
Leaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdfLeaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdf
RenuJangid3
 
Deep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless ReproducibilityDeep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless Reproducibility
University of Rennes, INSA Rennes, Inria/IRISA, CNRS
 
Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.
Nistarini College, Purulia (W.B) India
 
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
Travis Hills MN
 
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
University of Maribor
 
Topic: SICKLE CELL DISEASE IN CHILDREN-3.pdf
Topic: SICKLE CELL DISEASE IN CHILDREN-3.pdfTopic: SICKLE CELL DISEASE IN CHILDREN-3.pdf
Topic: SICKLE CELL DISEASE IN CHILDREN-3.pdf
TinyAnderson
 
Chapter 12 - climate change and the energy crisis
Chapter 12 - climate change and the energy crisisChapter 12 - climate change and the energy crisis
Chapter 12 - climate change and the energy crisis
tonzsalvador2222
 
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
Sérgio Sacani
 

Recently uploaded (20)

The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptxThe use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
 
DMARDs Pharmacolgy Pharm D 5th Semester.pdf
DMARDs Pharmacolgy Pharm D 5th Semester.pdfDMARDs Pharmacolgy Pharm D 5th Semester.pdf
DMARDs Pharmacolgy Pharm D 5th Semester.pdf
 
THEMATIC APPERCEPTION TEST(TAT) cognitive abilities, creativity, and critic...
THEMATIC  APPERCEPTION  TEST(TAT) cognitive abilities, creativity, and critic...THEMATIC  APPERCEPTION  TEST(TAT) cognitive abilities, creativity, and critic...
THEMATIC APPERCEPTION TEST(TAT) cognitive abilities, creativity, and critic...
 
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
 
Shallowest Oil Discovery of Turkiye.pptx
Shallowest Oil Discovery of Turkiye.pptxShallowest Oil Discovery of Turkiye.pptx
Shallowest Oil Discovery of Turkiye.pptx
 
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
 
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
 
SAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdfSAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdf
 
Nucleophilic Addition of carbonyl compounds.pptx
Nucleophilic Addition of carbonyl  compounds.pptxNucleophilic Addition of carbonyl  compounds.pptx
Nucleophilic Addition of carbonyl compounds.pptx
 
BREEDING METHODS FOR DISEASE RESISTANCE.pptx
BREEDING METHODS FOR DISEASE RESISTANCE.pptxBREEDING METHODS FOR DISEASE RESISTANCE.pptx
BREEDING METHODS FOR DISEASE RESISTANCE.pptx
 
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
 
Oedema_types_causes_pathophysiology.pptx
Oedema_types_causes_pathophysiology.pptxOedema_types_causes_pathophysiology.pptx
Oedema_types_causes_pathophysiology.pptx
 
Leaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdfLeaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdf
 
Deep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless ReproducibilityDeep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless Reproducibility
 
Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.
 
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
 
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
 
Topic: SICKLE CELL DISEASE IN CHILDREN-3.pdf
Topic: SICKLE CELL DISEASE IN CHILDREN-3.pdfTopic: SICKLE CELL DISEASE IN CHILDREN-3.pdf
Topic: SICKLE CELL DISEASE IN CHILDREN-3.pdf
 
Chapter 12 - climate change and the energy crisis
Chapter 12 - climate change and the energy crisisChapter 12 - climate change and the energy crisis
Chapter 12 - climate change and the energy crisis
 
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
 

Data Management Best Practices

  • 1. Introduction to Data Management CCimagebyUniversityofMarylandPressReleasesonFlickr Adapted from curriculum developed by
  • 2. Introduction to Data Management • Why Manage Data? • Data Sharing • Data Entry & Manipulation • Quality Control & Assurance • Backup • Metadata • Data Citation
  • 3. Introduction to Data Management • Organization • Reproducibility • Version control • Quality control • Valuable asset • Accuracy • Integrity • Data sharing • Sustainability & accessibility
  • 4. Introduction to Data Management • If data are: o Well-organized o Documented o Preserved o Accessible o Verified as to Accuracy and validity • Result is: o High quality data o Easy to share and re-use in science o Citation and credibility to the researcher o Cost-savings to science
  • 5. Introduction to Data Management • Why Manage Data? • Data Sharing
  • 6. Introduction to Data Management Data sharing requires effort, resources, and faith in others. Why do it? For the benefit of: o the public o the research sponsor o the research community o the researcher CCimagebyJessicaLuciaonFlickr
  • 7. Introduction to Data Management A better informed public yields better decision making with regard to: o Environmental and economic planning o Federal, state, and local policies o social choices such as use of tax dollars and education options o personal lifestyle and health such as nutrition and recreation CCimagebyfalonyatesonFlickr
  • 8. Introduction to Data Management • Organizations that sponsor research must maximize the value of research dollars • Data sharing enhances the value of research investments by enabling: o verification of performance metrics and outcomes o new research and increased return on investment o advancement of the science o reduced data duplication expenditures
  • 9. Introduction to Data Management Access to related research enables community members to: o build upon the work of others o perform meta analyses o share resources and perspectives CCimagebyLawrenceBerkeleyNational LaboratoryonFlickr
  • 10. Introduction to Data Management Access to related research enables community members to (cont’d): o increase transparency, reproducibility and comparability of results o expand methodology assessment, recommendations and improvement o educate new researchers as to the most current and significant findings
  • 11. Introduction to Data Management Scientists that share data gain the benefit of: o Recognition o improved data quality o greater opportunity for data exchange o improved connections CCimagebySLUMadridCampus onFlickr
  • 12. Introduction to Data Management Step One: Create robust metadata that is discoverable o Geographic and temporal coverage o Discipline specific metadata schema o Discipline specific vocabulary o Describe attributes
  • 13. Introduction to Data Management Step Two: Include archival and reference information o Include a data citation o Include Persistent Identifier (e.g. DOI) Data Citation Example: Sidlauskas, B. 2007. Data from: Testing for unequal rates of morphological diversification in the absence of a detailed phylogeny: a case study from characiform fishes. Dryad Digital Repository. doi:10.5061/dryad.20
  • 14. Introduction to Data Management Step Three: Have data contributors review your metadata to ensure validity and organizational ‘correctness’ o are the processes described accurately? o are all contributions adequately identified? o has management reviewed the product and documentation? o is the funding organization properly recognized?
  • 15. Introduction to Data Management Step Four: Publish your data and metadata via: Data Repositories/Clearinghouses • Discipline-specific ◦ Sciences • Knowledge Network for Biodiversity (KNB) Data Portal • Long Term Ecological Research (LTER) Network Data Portal ◦ Social Sciences • ICPSR • Institutional ◦ Trace
  • 16. Introduction to Data Management • Why Manage Data? • Data Sharing • Data Entry & Manipulation
  • 17. Introduction to Data Management • Create data sets that are: o Valid o Organized to support ease of use CCimagebyTravisSonFlickr
  • 18. Introduction to Data Management • Inconsistency between data collection events – Location of Date information – Inconsistent Date format – Column names – Order of columns
  • 19. Introduction to Data Management • Inconsistency between data collection events – Different site spellings, capitalization, spaces in site names—hard to filter – Codes used for site names for some data, but spelled out for others – Mean1 value is in Weight column – Text and numbers in same column – what is the mean of 12, “escaped < 15”, and 91?
  • 20. Introduction to Data Management • Columns of data are consistent: only numbers, dates, or text • Consistent Names, Codes, Formats (date) used in each column • Data are all in one table, which is much easier for a statistical program to work with than multiple small tables which each require human intervention
  • 21. Introduction to Data Management • Descriptive column names ◦ Soil T30  Soil_Temp_30cm ◦ Species-Code  Species_Code (avoid using -,+,*,^ in column names. Some software may interpret these symbols as an operator) • Descriptive file names ◦ Mammal data-.csv  FieldVisit1_SmallMammalData_2010-04-11.csv
  • 22. Introduction to Data Management • Enter complete lines of data Sorting an Excel file with empty cells is not a good idea!
  • 23. Introduction to Data Management • Missing data o Preferably leave field empty (NULL = no value) o In numeric fields, use a distinct value such as 9999 to indicate a missing value o In text fields, use NA (“Not Applicable” or “Not Available”) o Use Data flags in a separate column to qualify missing value Date Time NO3_N_Conc NO3_N_Conc_Flag 20081011 1300 0.013 20081011 1330 0.016 20081011 1400 M1 20081011 1430 0.018 20081011 1500 0.001 E1 M1 = missing; no sample collected E1 = estimated from grab sample
  • 24. Introduction to Data Management
  • 25. Introduction to Data Management 20
  • 26. Introduction to Data Management • Great for charts, graphs, calculations • Flexible about cell content type—cells in same column can contain numbers or text • Lack record integrity--can sort a column independently of all others) • Easy to use – but harder to maintain as complexity and size of data grows • Easy to query to select portions of data • Data fields are typed – For example, only integers are allowed in integer fields • Columns cannot be sorted independently of each other • Steeper learning curve than a spreadsheet
  • 27. Introduction to Data Management • A set of tables • Relationships • A command language *siteID site_name latitude longitude description Sample sites *speciesID species_name common_name family order Species *sampleID siteID sample_date speciesID height flowering flag comments samples *sampleID siteID sample_date speciesID height flowering flag comments Samples
  • 28. Introduction to Data Management Date Site Height Flowering <dates only> <text only> <real numbers only> <‘y’ and ‘n’ only> Advantages • quality control • performance
  • 29. Introduction to Data Management Date Site Species Flowering? 2/13/2010 A BOGR2 y 2/13/2010 B HODR y 4/15/2010 B BOER4 y 4/15/2010 C PLJA n Site Latitude Longitude A 34.1 -109.3 B 35.2 -108.6 C 32.6 -107.5 Date Site Species Flowering? Latitude Longitude 2/13/2010 A BOGR2 y 34.1 -109.3 2/13/2010 B HODR y 35.2 -108.6 4/15/2010 B BOER4 y 35.2 -108.6 4/15/2010 C PLJA n 32.6 -107.5 Mix and Match data on the fly
  • 30. Introduction to Data Management Date Plot Treatment SensorDepth Soil_Temperature 2010-02-01 C R 30 12.8 2010-02-01 B C 10 13.2 2010-02-02 C R 0 6.3 2010-02-02 A N 0 15.1 SQL examples: Select Date, Plot, Treatment, SensorDepth, Soil_Temperature from SoilTemp where Date = ‘2010-02-01’ Date Plot Treatment SensorDepth Soil_Temperature 2010-02-01 C R 30 12.8 2010-02-01 B C 10 13.2 Date Plot Treatment SensorDepth Soil_Temperature 2010-02-02 A N 0 15.1 This table is called SoilTemp Select * from SoilTemp where Treatment=‘N’ and SensorDepth=‘0’
  • 31. Introduction to Data Management
  • 32. Introduction to Data Management • Be aware of Best Practices when designing data file structures • Choose a data entry method that allows some validation of data as it is entered • Consider investing time in learning how to use a database if datasets are large or complex CCimagebyfo.olonFlickr
  • 33. Introduction to Data Management • Why Manage Data? • Data Sharing • Data Entry & Manipulation • Quality Control & Assurance
  • 34. Introduction to Data Management • Errors of Commission o Incorrect or inaccurate data entered o Examples: malfunctioning instrument, mistyped data • Errors of Omission o Data or metadata not recorded o Examples: inadequate documentation, human error, anomalies in the field CCimagebyNickJWebbonFlickr
  • 35. Introduction to Data Management • Define & enforce standards ◦ Formats ◦ Codes ◦ Measurement units ◦ Metadata • Assign responsibility for data quality ◦ Be sure assigned person is educated in QA/QC
  • 36. Introduction to Data Management • Double entry ◦ Data keyed in by two independent people ◦ Check for agreement with computer verification • Record a reading of the data and transcribe from the recording • Use text-to-speech program to read data back CCimagebyweskrieselonFlickr
  • 37. Introduction to Data Management • Design data storage well ◦ Minimize number of times items that must be entered repeatedly ◦ Use consistent terminology ◦ Atomize data: one cell per piece of information • Document changes to data ◦ Avoids duplicate error checking ◦ Allows undo if necessary
  • 38. Introduction to Data Management • Make sure data line up in proper columns • No missing, impossible, or anomalous values • Perform statistical summaries CCimagebychesapeakeclimateonFlickr
  • 39. Introduction to Data Management • Look for outliers ◦ Outliers are extreme values for a variable given the statistical model being used ◦ The goal is not to eliminate outliers but to identify potential data contamination 0 10 20 30 40 50 60 0 5 10 15 20 25 30 35
  • 40. Introduction to Data Management • Methods to look for outliers ◦ Graphical • Normal probability plots • Regression • Scatter plots ◦ Plotting on maps ◦ Deviation
  • 41. Introduction to Data Management • Beware of errors of commission and omission • Execute quality assurance and quality control strategies ◦ Data entry ◦ Data visualization
  • 42. Introduction to Data Management • Why Manage Data? • Data Sharing • Data Entry & Manipulation • Quality Control & Assurance • Backup
  • 43. Introduction to Data Management • Backups vs. Archives ◦ Backups: a copy (or copies) of the original file is made before the original is overwritten o Archives: preservation of the file • Data Preservation o Includes archiving in addition to processes such as data rescue, data reformatting, data conversion, metadata
  • 44. Introduction to Data Management • Backups o periodic snapshots o usually copies of files o performed on regular schedule • Archiving o preserve data o usually the final version o performed at the end of a project or milestones It is a good idea to have multiple copies of your backups and archives, in case one copy fails.
  • 45. Introduction to Data Management • Limit or negate loss of data • Save time, money, productivity • Help prepare for disasters o Accidental deletions o Fires, natural disasters o Software bugs, hardware failures • Reproduce results • Respond to data requests • Limit liability CCImagecourtesyofBrianJMatisonFlickr
  • 46. Introduction to Data Management • Includes backups and archiving • Also includes ◦ data conversion ◦ data reformatting ◦ data rescue
  • 47. Introduction to Data Management • Data Conversions and Formats o Use non-proprietary, standard formats o Textual documents  .txt o Spreadsheets  .csv o Digital images  .tiff • Versioning • File Naming
  • 48. Introduction to Data Management • Create a backup plan that clearly identifies: o roles, o responsibilities, o where the data is backed up, o how often the files are backed up, o how to access the files, o recommended file formats to be used, and o policies for migrating data to assure data are not lost due to media degradation or changing formats or programs • Review your backup plan regularly • Update as needed
  • 49. Introduction to Data Management • Why Manage Data? • Data Sharing • Data Entry & Manipulation • Quality Control & Assurance • Backup • Metadata
  • 50. Introduction to Data Management • When you provide data to someone else, what types of information would you want to include with the data? • When you receive a dataset from an external source, what types of details do you want to know about the data?
  • 51. Introduction to Data Management • Why were the data created? • What limitations, if any, do the data have? • What does the data mean? • How should the data be cited if it is re-used in a new study?
  • 52. Introduction to Data Management • What are the data gaps? • What processes were used for creating the data? • Are there any fees associated with the data? • In what scale were the data created? • What do the values in the tables mean? • What software do I need in order to read the data? • What projection are the data in? • Can I give these data to someone else?
  • 53. Introduction to Data Management Metadata is: Data ‘reporting’ • WHO created the data? • WHAT is the content of the data? • WHEN were the data created? • WHERE is it geographically? • HOW were the data developed? • WHY were the data developed? PhotobyMichelleChang.AllRightsReserved
  • 54. Introduction to Data Management Author(s) Boullosa, Carmen. Title(s) They're cows, we're pigs / by Carmen Boullosa Place New York : Grove Press, 1997. Physical Descr viii, 180 p ; 22 cm. Subject(s) Pirates Caribbean Area Fiction. Format Fiction CCimagebyUSDAgovonFlickr CCimagebyMskaduonFlickr
  • 55. Introduction to Data Management DATADETAILS Time of data development Specific details about problems with individual items or specific dates are lost relatively rapidly General details about datasets are lost through time Accident or technology change may make data unusable Retirement or career change makes access to “mental storage” difficult or unlikely Loss of data developer leads to loss of remaining information TIME (From Michener et al 1997)
  • 56. Introduction to Data Management • A Standard provides a structure to describe data with: ◦ Common terms to allow consistency between records ◦ Common definitions for easier interpretation ◦ Common language for ease of communication ◦ Common structure to quickly locate information • In search and retrieval, standards provide: ◦ Documentation structure in a reliable and predictable format for computer interpretation ◦ A uniform summary description of the dataset CCimagebyccarlstead onFlickr
  • 57. Introduction to Data Management CCimagebyIlikeonFlickr
  • 58. Introduction to Data Management • Why Manage Data? • Data Sharing • Data Entry & Manipulation • Quality Control & Assurance • Backup • Metadata • Data Citation
  • 59. Introduction to Data Management • Similar to citing a published article or book o Provide information necessary to identify and locate the work cited • No standards yet • Use format recommended by journal, repository, or professional organization CCimagebyPaxsimiusonFlickr
  • 60. Introduction to Data Management Breitburg DL, Hondorp D, Audemard C, Carnegie RB, Burrell RB, Trice M, Clark V (2015) Data from: Landscape-level variation in disease susceptibility related to shallow-water hypoxia. PLOS ONE. http://dx.doi.org/10.5061/dryad.9k231
  • 66. Introduction to Data Management A persistent identifier should be included in the citation: • DOI (Digital Object Identifier) o Globally unique, alphanumeric string assigned by a registration agency to identify content and provide a persistent link to its location. o May be assigned to any item of intellectual property that is defined by structured metadata o Examples: 10.1234/NP5678, 10.5678/ISBN-0-7645-4889-4, 10.2224/2004-10-ISO-DOI • The UT Libraries can assign a DOI to your data set.
  • 67. Introduction to Data Management Chris Eaker Data Curation Librarian Hodges Library, Room 236 865-974-4404 chris@utk.edu Available to help with… • Data management plan support • Data repositories • Metadata • Data management consulting

Editor's Notes

  1. Manage your data for yourself: keep yourself organized, so you can find your files (data inputs, analytic scripts, outputs at various stages of the analytic process, etc.); track your science processes for reproducibility, so you can match up your outputs with the exact inputs and transformations that produced them; better control versions of data, easily identifying versions that can be periodically purged; and quality control your data more efficiently. Data are a valuable asset: they are expensive and time consuming to collect. Data should be managed to: maximize the effective use and value of data and information assets; continually improve quality, including data accuracy, integrity, integration, timeliness of data capture and presentation, relevance, and usefulness; ensure appropriate use of data and information; facilitate data sharing; and ensure sustainability and accessibility in the long term for re-use in science.
  2. To summarize, the goal of effective DM is to get data that are well organized (both within the files themselves and across groups of files), documented adequately with metadata, preserved for future reuse, accessible by others for that reuse, and accurate and valid. If data are all those things, then you will have high quality data that are easy to share and reuse in science. The data can then be cited by other researchers, which will add credibility to the researcher who prepared them. Overall, it saves money and advances science.
  3. One of the main goals of effective data management is to facilitate sharing data with other researchers. Grant funding agencies want to see a greater return on their investment. Let’s talk a bit about data sharing and why it’s important.
  4. Why expend the extra effort to share data? Because it benefits the public, the research sponsor, the research community and, perhaps most importantly, the researcher.
  5. How does the public benefit from shared research? The more informed the public is, the better they are able to understand and contribute toward effective public and personal decisions. The public needs data to help with environmental and economic planning. Data help inform federal, state and local policies. The public can use data to help with social choices such as who they will vote for, how they want their tax dollars to be used, and where they will send their children to school. It can even help them make personal lifestyle and health choices regarding exercise, smoking, and nutrition.
  6. Why do research sponsors encourage data sharing? Because sponsors have an obligation to maximize the investment of research dollars. Data sharing enhances the value of the research investment by enabling external reviewers to verify the project performance metrics and outcomes. This not only increases the credibility of the data but also spurs new research that can build upon the initial investment and advance the science rather than duplicate expenditures.
  7. The scientific community as a whole also benefits from sharing among researchers. Data sharing allows researchers to build upon one another’s work and to further, rather than duplicate, the science by exploring new findings or combining findings into meta analyses that cannot be performed with individual data. In sharing data, the scientific community expands both individual perspectives and the collective comprehension.
  8. Access to related research enables members of the scientific community to better reproduce, compare and assess methods and results. Scientists are able to learn from one another and educate new researchers as to the most current and significant findings.
  9. And finally, how does the independent researcher benefit from data sharing? When scientists share their data, they gain recognition as an authoritative source and respect as a wise investment for research dollars. When data are exposed, feedback from the broader community can be used to improve the quality and presentation of the data. Shared data also allows for greater opportunity for data exchange and networking opportunities with peers and potential collaborators. I’m going to go through 4 steps to make data shareable. Then we’ll go through the nuts and bolts of executing on those four steps.
  10. The more robust your metadata, the more easily your data will be discovered and the more appropriately it will be used. Specifically, when creating metadata, be specific with regard to the geographic and time period coverage of your data. For example…. Use a discipline specific metadata schema, if possible. For example… Use discipline specific themes, place names, and keywords. Describe any attributes (variable names, specimen names, etc.) thoroughly.
  11. Step 2: Be sure to include archival and reference information with properly formatted data citations for sources and content. Include persistent data identifiers with the data citation. We’re going to go into data citation in more depth later, but this is an example of a citation for a data set on the Dryad Data Repository. As you can see, there is a DOI added to the end, which allows the data to be located easily.
  12. Step 3: Be sure to have data contributors review their metadata to ensure validity and organizational correctness. Are the processes correct? Is your contribution adequately represented and reflected? Is your organization properly recognized and is the funding organization properly recognized? Be sure to get management and sponsor approval on the data publication including the content, presentation, and manner in which contributors are identified.
  13. The last step to make your data Shareable: Step 4: Publish your metadata in data portals and clearinghouses. Seek out relevant government portals and portals developed by specific communities of practice. The nice thing about these two science data repositories is that the metadata in them is harvested by the DataONE search portal and so people just have to search one interface to find data in these two, among many other, data repositories.
  14. Now, one of the most important things one can do when managing research data is to take steps to make sure data entry, processing, and analysis are done in such a way as to minimize error. We'll talk now about ways to do that during data entry and manipulation.
  15. The goals of data entry are to create data that are valid (that is, have gone through a process to assure quality) and that are organized to support both use of the data and ease of archiving.
  16. Many researchers like to manage their data in Excel, and Excel makes it easy to use poor data entry practices: it doesn't enforce any rules on your data entry unless you tell it to. These are data entered into Excel from a small mammal trapping study. Each block of data represents a different trapping period (2/13, 3/15, and 4/10/2010). Inconsistencies in how the data were entered for each sampling period make the data difficult to analyze and difficult for anyone but the data collector to understand. Note that the date is listed in different places in each block: date is a column in the first block, but listed in the header in the block on the right. Inconsistent date formats were also used; in one place the date is formatted as day-month-year, with the first three letters of the month spelled out, while elsewhere the format is mm/dd/yyyy. Note also that the order of the columns is inconsistent: Site, Date in the first block, and Site, Plot in the bottom block. Even the columns are named differently: Species is called Species in the first block, and RodentSp in the block on the right. This can be confusing to any user who must try to make sense of these data! And it would be a nightmare to try to write metadata for this spreadsheet.
  17. There are other problems with how these data were entered. Naming of sites is also inconsistent: for instance, Deep Well is used in the first block vs. DW in the block on the right. The file also contains several typos, such as rioSalado vs. rioSlado. A human can figure out what each of these site names refers to, but the names would have to be harmonized for a statistical program to use. It would be easier to filter for just Deep Well (with a space) than to have to know that you also need to filter for DeepWell (no space). Similarly, in one place a species code is capitalized PERO, and lowercase elsewhere. Further, in the first block of data, a mean was calculated for the weight of the rodents. The value for that mean, called Mean1, is in the same column as the weights of the individual animals; in later manipulations of these data, it would be easy to copy that value as though it represented the weight of a single animal. It is bad practice to mix types of information in one column: raw data are best maintained in one file, with calculations done elsewhere. In addition, there are text data mixed with numeric data in the Weight column in the block on the right – it says “escaped < 15” (presumably indicating that a rodent less than 15 grams escaped). A statistical program will not know how to deal with text data mixed with numeric data: what is the mean of 12, 91, and “escaped < 15”? To analyze all these data using statistical software, and to make them much easier for any user to understand, the data will need to be organized into a column for each variable. Therefore it is essential that only one type of information be entered into each column, and that spellings, codes, formats, etc. be consistent.
  18. This shows the same data entered in a way that would make them easy to understand and analyze. The data are not entered in separate blocks arrayed in a single worksheet; they are entered in one table with columns for the variables Date, Site, Plot, Species, Weight, Adult, and Comments that are recorded for each sampling event. The columns of data have consistent types: each column contains only numbers, dates, or text. There are consistent names, codes, and formats used in each column. For instance, all dates are in the same format (mm/dd/yyyy), and there are no typos in the site names. Species are all referred to by standard codes. Therefore, if the user wanted to subset the data for species = 'PERO', they could easily filter the file for just those data, as in the sketch below. Additionally, there are only numeric data in the Weight column, so a statistical program or Excel could readily calculate statistics on this column. Preparing metadata for this file would also be straightforward.
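As an illustration of why the one-table layout pays off, here is a minimal sketch in Python with pandas; the rows are invented for the example and are not the actual study data.

```python
import pandas as pd
from io import StringIO

# The trapping data entered as one table, one column per variable
# (values here are illustrative, not the actual study data).
csv_text = """Date,Site,Plot,Species,Weight
02/13/2010,Deep Well,1,PERO,21
02/13/2010,Deep Well,2,PERO,18
03/15/2010,rioSalado,1,DIPO,24
"""
df = pd.read_csv(StringIO(csv_text))

# Consistent codes make subsetting trivial:
pero = df[df["Species"] == "PERO"]

# And a numeric-only Weight column can be summarized directly:
print(pero["Weight"].mean())
```

With the blocked layout from the previous slide, neither the filter nor the mean would work without manual cleanup first.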
  19. Descriptive column names: A best practice in data entry is to create descriptive column names without spaces or special characters. Some statistical programs have special uses for certain characters, so you should avoid using them in your data file. Descriptive file names: For instance, a file named FieldVisit1_SmallMammalData_2010-04-11.csv indicates that this file contains data collected during Field Visit 1, contains small mammal data, and is version dated April 11, 2010. We know it's April 11, and not November 4, because the name uses the standard ISO date format: four digit year, followed by 2 digit month, and 2 digit day. This name is much more helpful than a file named Mammal data.csv.
  20. There are a lot of great things about spreadsheets, but one must be wary of problems that can arise from their use. Spreadsheets, for instance, can sort one column independently of all others. The data entry person for the upper spreadsheet elected to leave empty cells for site, treat, web, plot, and quad. It's obvious why, and it doesn't cause the human reader any problems. But if someone happens to sort on Species, it is no longer clear which species maps to which time period or to which measurements. This could make the spreadsheet unusable. It is good practice to fill in all cells when using a spreadsheet for data entry: enter complete lines of data, so that the data can be sorted on any one column without loss of information.
  21. When you collect data, you're bound to have some data points with no values. There are ways to ensure that those missing values are handled consistently and don't throw off your analysis. A preferred way to identify missing data is with an empty field. If an empty cell is not possible (for example, because the software you're using requires something to be there), use an impossible value such as 9999 in numeric fields and NA in text fields. If you need to explain something, you can use data flags in a separate column to qualify empty cells. For instance, in this example of stream chemistry data, the flag M1 indicates that the sample was not collected at that interval, and the flag E1 indicates that the value was estimated. A sketch of honoring these conventions at import time follows.
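A minimal sketch of reading such data in Python with pandas; the file name and the Flag column name are assumptions for the example.

```python
import pandas as pd

# Treat the sentinel 9999 and the text "NA" as missing on import so they
# cannot silently distort means or other statistics.
# "stream_chemistry.csv" and the "Flag" column are hypothetical names.
df = pd.read_csv("stream_chemistry.csv", na_values=["9999", "NA"])

# Flag codes stay as ordinary data and can be used to qualify values:
not_collected = df[df["Flag"] == "M1"]  # sample not collected
estimated = df[df["Flag"] == "E1"]      # value was estimated

print(df.describe())  # missing values are excluded from the statistics
```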
  22. Excel is a very popular data entry tool, and it allows you to enforce data validation rules. Here, a dropdown list has been generated that allows the user to select entries only from that list. In this way, only defined species codes get entered, and the data are consistent.
  23. Here is another example of data validation using Excel. Height has been defined to contain values between 11 and 15. When 20 is entered, the user is told that they have entered an illegal value.
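Validation rules like the two just described can also be created programmatically. Here is a minimal sketch using Python's openpyxl library; the cell ranges and the species codes other than PERO are assumptions for the example.

```python
from openpyxl import Workbook
from openpyxl.worksheet.datavalidation import DataValidation

wb = Workbook()
ws = wb.active

# Dropdown list: only defined species codes can be chosen
# (codes other than PERO are invented for this example).
species_dv = DataValidation(type="list", formula1='"PERO,DIPO,ONLE"')
ws.add_data_validation(species_dv)
species_dv.add("D2:D100")  # hypothetical Species column range

# Whole-number range: heights outside 11-15 are rejected, as in the slide.
height_dv = DataValidation(
    type="whole", operator="between", formula1="11", formula2="15",
    error="Height must be between 11 and 15.")
ws.add_data_validation(height_dv)
height_dv.add("E2:E100")  # hypothetical Height column range

wb.save("data_entry_template.xlsx")
```

Anyone entering data into the saved template then gets the same dropdown and range checks shown in the screenshots.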
  24. Some researchers are turning to database software instead of spreadsheets for their data management needs. Databases are a powerful option for storing and manipulating datasets. Here, we list some of the pros and cons of spreadsheets vs. databases (which include software such as Oracle, MySQL, SQL Server, and Microsoft Access). Spreadsheets are good at making charts and graphs, and at doing calculations. They are easy to use, but they become unwieldy as the number of records grows and a dataset becomes complex. Databases, on the other hand, work well with high volumes of data, and they are much easier to query in order to select data having particular characteristics. They also maintain data integrity – that is, one column cannot be sorted separately from all the others, as in a spreadsheet. Databases also enforce data typing, which is a best practice: only data of type 'text', for example, can be entered into a column of type 'text'. This helps prevent data entry errors. Databases do have a steeper learning curve than a spreadsheet such as Excel, but there are many benefits.
  25. A relational database matches data stored in tables by using common characteristics found within the data set. This helps preserve data integrity and also makes it possible to flexibly mix and match data to get different combinations of information. A database consists of a set of tables and each table has a defined relationship with another table or tables using a common piece of information called the Primary Key. Databases also have a powerful command language for querying and manipulating information called Structured Query Language, or SQL. Here, a dataset for plant phenology has been divided into three tables, one describing site information, one describing characteristics of each sample, and one describing the plant species found. Relational databases are currently the predominant choice in storing data like financial records, medical records, personal information and manufacturing and logistical data.
  26. Database features include explicit control over data types, which has advantages for quality control and performance. Here, in the plant phenology table, only dates are allowed in the Date column, only text is allowed in the Site column, and only real numbers are allowed in the Height column. If a user tries to enter a '?' under Flowering, the database will reject the entry. This is useful for defining how data are to be entered.
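A minimal sketch of this kind of enforcement, using SQLite through Python's standard library. The table layout is inferred from the slide, and the rule on Flowering is expressed as a CHECK constraint, one common way (not the only one) to reject bad values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Phenology (
        Date      TEXT NOT NULL,
        Site      TEXT NOT NULL,
        Height    REAL,
        Flowering TEXT CHECK (Flowering IN ('yes', 'no'))
    )
""")

# A well-formed row is accepted:
conn.execute("INSERT INTO Phenology VALUES ('2010-02-01', 'A', 12.5, 'yes')")

# A '?' in Flowering violates the CHECK constraint and is rejected:
try:
    conn.execute("INSERT INTO Phenology VALUES ('2010-02-01', 'B', 13.0, '?')")
except sqlite3.IntegrityError as err:
    print("Rejected:", err)
```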
  27. Relationships can be defined between two sets of data or in this example between two tables. Suppose that you have two tables used in the plant phenology study, one for observations and one for sites, and you want a table that contains both observations and the latitude and longitude of your sites. Because both tables contain Site info, they can be joined to create a table containing the info you want.
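A sketch of that join in SQLite, with table and column names inferred from the slide and invented sample values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Sites (Site TEXT PRIMARY KEY, Latitude REAL, Longitude REAL);
    CREATE TABLE Observations (Date TEXT, Site TEXT REFERENCES Sites(Site),
                               Species TEXT, Height REAL);
    INSERT INTO Sites VALUES ('A', 34.35, -106.88);  -- invented coordinates
    INSERT INTO Observations VALUES ('2010-02-01', 'A', 'PERO', 12.5);
""")

# Join on the shared Site key to attach coordinates to each observation.
query = """
    SELECT o.Date, o.Species, s.Latitude, s.Longitude
    FROM Observations AS o
    JOIN Sites AS s ON o.Site = s.Site
"""
for row in conn.execute(query):
    print(row)
```

Because Site is the primary key of the Sites table, each observation picks up exactly one latitude/longitude pair.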
  28. Database features also include a powerful command language called Structured Query Language (SQL). These are just a couple of examples of what you can do with SQL. The table at the top of this slide is named SoilTemp in the database. The first example SQL command returns all records collected on 2010-02-01. The second SELECT statement returns all records from table SoilTemp where Treatment is N and SensorDepth is 0. From this example you can get a sense of how easy SQL is to use to subset data based on different criteria. This is only very simple SQL; there is much, much more that can be done with it. The two statements are reconstructed in the sketch below.
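The two queries described above, reconstructed runnably in SQLite; the Temp column and all inserted readings are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE SoilTemp (Date TEXT, Treatment TEXT,
                           SensorDepth REAL, Temp REAL);
    INSERT INTO SoilTemp VALUES
        ('2010-02-01', 'N', 0, 4.2),   -- invented readings
        ('2010-02-01', 'C', 10, 6.1),
        ('2010-02-02', 'N', 0, 3.9);
""")

# All records collected on 2010-02-01:
print(conn.execute(
    "SELECT * FROM SoilTemp WHERE Date = '2010-02-01'").fetchall())

# All records where Treatment is N and SensorDepth is 0:
print(conn.execute(
    "SELECT * FROM SoilTemp WHERE Treatment = 'N' AND SensorDepth = 0"
).fetchall())
```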
  29. Forms can be created that make entering data into a relational database as easy as entering it into Excel. The screenshot below shows embedded forms that were quickly generated in MS Access for adding data to three tables in a database of plant cover measurements.
  30. Be aware of best practices when designing data file structures. Choose a data entry method that allows validation of the data entered, and be sure to invest time in learning how to use a database, especially if the dataset is large or complex.
  31. Ok, now you’ve taken steps during your data entry process to minimize the possibility of data error, but that doesn’t mean you need to stop there. There are other ways to help with quality control, as well.
  32. In general, there are two types of errors that can occur in a data set. First, errors of commission are the result of incorrect or inaccurate data being included in the data set. This may happen because of a malfunctioning instrument that produces faulty results, data that are mistyped during entry, or other problems. Errors of omission are the second type of errors. These result from data or metadata being omitted. Situations that result in omission errors are when data are inadequately documented, when there are human errors during data collection or entry, or when there are anomalies in the field that affect the data.
  33. The remainder of the module will cover best practices for quality control and quality assurance for the different stages of a research project. First, before data collection, a researcher should think about defining and enforcing standards that will be used during the project. Consider formats that will be used for the data tables or data entry forms. Also, if abbreviations or codes are used, they should be defined up front. Measurement units should also be specified and relevant metadata should be identified before collection. Second, you should assign responsibility for data quality before collection begins. Ideally, the person responsible for data quality assurance and data quality control is the person collecting the data, and is educated in quality control and assurance methods.
  34. Consider using techniques that help eliminate mistakes during data entry. One example is double data entry, where non-digital data are keyed in by two people independently; differences in the entries can then be detected by a computer program and examined further for mistakes. Another way to reduce data entry error is to record yourself reading off the data, and then transcribe it from the recording. You may also use a text-to-speech program that reads the data to you while you type it into the computer. A sketch of the double-entry comparison follows.
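A minimal sketch of the double-entry comparison in Python; both file names are hypothetical.

```python
import csv

# Two people key in the same datasheet independently; rows that differ
# are flagged for a human to check against the original sheet.
# Both file names are hypothetical.
with open("entry_person_A.csv", newline="") as f_a, \
     open("entry_person_B.csv", newline="") as f_b:
    rows_a = list(csv.reader(f_a))
    rows_b = list(csv.reader(f_b))

for i, (row_a, row_b) in enumerate(zip(rows_a, rows_b), start=1):
    if row_a != row_b:
        print(f"Row {i} differs: {row_a} vs {row_b}")

if len(rows_a) != len(rows_b):
    print("The files have different numbers of rows; check for skipped lines.")
```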
  35. If you are using spreadsheets or databases, you should carefully consider their design before and during data entry. Use consistent terminology within the database, and atomize data. This means only one piece of information is in each cell of the spreadsheet -- multiple pieces of information embedded in a single data cell will be problematic during data analysis. If you are using a database, restrict what can be entered into the database; for example, set up a field to accept only text or only numerical values, choose a maximum number of characters or a range of values a field will accept, or set a field to accept only unique values.   Finally, document any changes made to data. It saves time if good records of data editing are kept since multiple users are less likely to spend time on error-checking with old versions of data. If mistakes in data editing or cleaning are made, good data records will allow these mistakes to be undone. Documenting data changes may be as simple as creating a text file to accompany the data set, or it may involve using a scripted program for correcting errors so that each step taken is clearly documented.
  36. Once data are entered, basic quality assurance measures can be taken. First, if data are in spreadsheets or databases, be sure they line up in their proper columns. Also check for any missing, impossible, or anomalous values. One way to check for these problems is to sort data fields and check for discrepancies. It is often also useful to perform basic statistical summaries, such as means, and standard errors. If data transformation was performed for analysis, compare the statistical summaries before and after transformation to ensure no mistakes were made during transformation.
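A sketch of these basic post-entry checks with pandas; the file name reuses the example from the earlier file-naming note, and the column names are the same assumed ones.

```python
import pandas as pd

# File and column names reuse the small mammal example from earlier notes.
df = pd.read_csv("FieldVisit1_SmallMammalData_2010-04-11.csv")

# Summary statistics make impossible values (e.g., a 9999 g rodent) obvious.
print(df["Weight"].describe())

# Sorting is a quick way to spot anomalies at either extreme.
print(df.sort_values("Weight").head())
print(df.sort_values("Weight").tail())

# Value counts on a categorical field expose typos and inconsistent codes.
print(df["Species"].value_counts())
```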
  37. Another strategy for quality control after data are entered is to look for outliers. Outliers are extreme values for a variable: values that lie outside of the statistical model being used to describe the data. Keep in mind that the goal is not to eliminate outliers but to identify potential data contamination. An easy way to do this is to plot your data on a graph. In this graph, most of the data fall along the black line; the outlier identified with the red arrow should be flagged for further investigation. You may find that the data point is legitimately correct, in which case you would keep it. But you may find it's the result of a typo made when you entered the data, in which case you have the opportunity to correct it before it skews your analysis.
  38. One common strategy for identifying outliers is using graphical methods, for instance normal probability plots, regression (as in the previous slide), or scatter plots. If data are geographical, mapping the points can be used to ensure latitude and longitude were correctly entered; this map shows an example of an error that is the result of mis-entering latitude data. Another method for identifying outliers is statistical. You can look at the deviation of a value, which is the difference between the observed value and the mean of that variable. By subtracting values from the mean of the data set, the presence of outliers or faulty data points can become apparent, as in the sketch below.
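A minimal sketch of the deviation-from-the-mean check in Python; the file name, column name, and the three-standard-deviation cutoff are all assumptions for the example.

```python
import pandas as pd

# File name, column name, and the 3-standard-deviation cutoff are
# assumptions for this example.
df = pd.read_csv("stream_chemistry.csv")
values = df["Nitrate"]

deviation = values - values.mean()

# Flag values more than 3 standard deviations from the mean for review.
# Flagged points are candidates for checking, not automatic deletion.
suspects = df[deviation.abs() > 3 * values.std()]
print(suspects)
```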
  39. During this tutorial we first defined several concepts important for understanding quality assurance and quality control, including data contamination and the types of data errors that can result in poor quality data. We then covered best practices for quality assurance and quality control. These strategies either prevent errors from entering a dataset or identify them if they are already present in the data. It is important to define and enforce quality assurance and quality control standards before, during, and after the collection and entry of data.
  40. Now you’ve made sure your data were entered properly and you’ve cleaned up the data from errors of commission and omission. All that hard work could be for naught if you don’t have a good backup plan.
  41. The terms data backup and data archiving are often used interchangeably, as both relate to saving a specific version of a file, but they describe different processes. The term “backup” is used specifically when making copies of various files with the knowledge that the files may change. Backups are kept for a certain amount of time, but can be discarded after a specified time has passed. Archiving is used when a file is to be preserved as-is, often at the end of a project, and acts as a static (and usually final) record. Data preservation encompasses many of these same methodologies, but can also include things like data rescue, reformatting of files, converting data, and the creation of metadata. We'll talk about backups, archives, and preservation in this section; metadata has its own section next.
  42. The main difference between data backups and archiving is that backups deal with data that is copied elsewhere and potentially can be overwritten again as the data change. Archiving makes a record of data that is usually in its final state. When a user performs a backup, they are in essence taking a snapshot of the data at that moment in time. This allows the user to restore the file as needed, such as when the current version of the file is corrupted, lost, or somehow destroyed or altered. Backups are often used for short-term storage or near long-term storage, depending upon the user’s backup needs and procedures. Backups are usually scheduled on a frequent basis. Archiving deals more with records that are no longer in use and is used to create a historical snapshot of the data. This provides for preservation of the data for future needs. Usually, archives are made when a project ends, or when appropriate. Regardless of whether you are dealing with backups or archives, you should have multiple copies in case one (or many) versions fail.
  43. There are many reasons to perform backups, including: - mitigating or preventing the loss of data, which may or may not be reproducible - saving time, money, and productivity, since little to none of the data will have to be reproduced - being prepared for when the unexpected happens, such as human error, disasters, or computer failures - being able to go back to earlier versions and see what your results were. For example, if you are creating models and used data from an earlier model run, the most recent file you have on your computer may not have the same data as when you first created the model output. - being able to send older files to others, regardless of the current version or state (for example, if the current version has been corrupted) - being able to respond when questions arise about results based on older versions of files. For example, you may find that you have to justify your results in court or to other scientists; by having access to older files, you may be able to respond to their requests for information. Or, you may not be able to reproduce the data, and the original copy may be the only evidence of the data collection.
  44. Our last topic in this section covers data preservation. Data preservation is a comprehensive topic, which includes things such as backups, archives, data conversion, reformatting, and rescue. “Data rescue” deals directly with older files that may no longer be in a format that is easily accessible and that will require some “rescuing” before they can be used again. Data rescue becomes more and more important as projects end: even if data were being preserved through the lifetime of the project, files often go untouched or orphaned. Frequently, data have not been managed properly, and some form of data rescue must be performed so that the data aren't a total loss.
  45. When preserving your data, you need to consider many things. Data formats: it is best to use non-proprietary and standardized formats, which will better ensure readability in the future; be sure to check files after converting them, as data, metadata, and formatting loss can occur. Versioning: make sure to use some sort of naming convention (such as incremental letters or numbers) to help keep track of file edits and revisions; this will also help you more quickly locate the correct version of a specific file. Naming: use file names that are consistent, descriptive, and concise; many software programs use a generic file name as their default output, and usually these names are too general to be useful. If data are well-preserved, then data rescue may not be necessary. With proper file naming (which helps keep files from getting lost in the system), proper file formats (which let you open files without converting them), backups (which limit loss of files), and appropriate media types (which limit degradation of files), you may limit or prevent the need for data rescue. A good data management plan, which is discussed in another lesson, is another important tool in limiting the need for data rescue.
  46. The first best practice is to create a backup plan. This may be a physical document, a web page, or listed as someone’s job description. Regardless of how you create it, it’s important to address any of the previously mentioned issues and concerns. For example, who do I contact when I need to get a file off of backup? I’ve included a data backup and security checklist in your handouts for you to use to create that policy or plan. Once you have a backup plan in place, it is good practice to review it periodically to ensure the information still has value and is applicable. Hardware, software, projects, and staff can change over time.
  47. We have two more sections: metadata and data citation.
  48. The two questions you need to consider when you think about metadata are: one, what information would you need to provide someone else so they would be able to work with your data? And two, what kind of information would you want to receive from someone else to be able to work with their data? The answers to these two questions will indicate what metadata you need to provide with your data. Just to define the term, metadata is descriptive information about your data: the project, the people involved, etc. It's important for someone else to make sense of your data, and even for you to make sense of it if you come back to it after not having worked on it for a while.
  49. When sharing data, some considerations include: - why the data were created; - what limitations, if any, the data have; - what the data mean; and - who should be cited if someone publishes something that utilized the data.
  50. When receiving data from an alternative source, consider: What are the data gaps? What processes were used for creating the current data? Are there any fees associated with the data? In what scale were the data created? What do the values in the tables mean? What software do I need in order to read the data? What projection is the data in? Can I give this data to someone else?
  51. Metadata is data about data. It describes the content, quality, condition, and other characteristics of a dataset. Metadata records answer questions such as: Why was the data set created? What processes were used to create the data set? What projection is the data in? When was the data last updated? Who created the data? What scale was used? What fields are in the table? What do the values in those fields mean? Who do I contact about getting more information about the data? How do I obtain a copy of the data? Do the data cost anything? Are there any limitations to the data? Metadata is a valuable tool. Metadata records preserve the usefulness of data over time by detailing methods for data collection and data set creation. Metadata greatly minimizes duplication of effort in the collection of expensive digital data and fosters the sharing of digital data resources.
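As a concrete illustration of the questions a metadata record answers, here is a minimal, schema-agnostic record sketched as a Python dictionary. Every value is invented, and a real project should follow a discipline standard (such as the FGDC standard shown shortly, or EML) rather than an ad hoc structure like this.

```python
import json

# A bare-bones metadata record; all values are hypothetical.
# Real projects should use a discipline standard (e.g., FGDC or EML).
metadata = {
    "title": "Small mammal trapping data, Field Visit 1",
    "creator": "J. Researcher",
    "created": "2010-04-11",
    "geographic_coverage": "Deep Well and rioSalado study sites",
    "temporal_coverage": "2010-02-13 to 2010-04-10",
    "methods": "Live trapping on fixed plots; see protocol document",
    "attributes": {
        "Weight": "body mass in grams",
        "Species": "four-letter species code, e.g., PERO",
    },
    "contact": "data.manager@example.edu",
    "usage_limitations": "None; cite the dataset DOI when reusing",
}

print(json.dumps(metadata, indent=2))
```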
  52. Metadata is all around us, from MP3 players, to nutrition labels, to library card catalogues. For example, a card catalogue tells us more than just the title of a book; it also tells the user: Who is the author? Who published the book? What subject area does the book fall in? And finally, where is it located in the library? Another example of metadata that we see in our daily lives is the nutrition and ingredient information on food labels. Nutrition labels answer questions such as: What ingredients were used? Who made the food? How many calories per serving? How many servings in the can? What percentage of daily vitamins is in each serving?
  53. This graph illustrates the phenomenon of “information entropy” associated with research. At the time of the research project, a scientist's memory is fresh. Details about the development of the dataset are easily recalled, and it is a good time to document information about the process. Over time, memory of the details begins to fade. A variety of circumstances can intervene, and eventually detailed knowledge about the dataset fades. Without a metadata record, the data might be unusable. A dataset is not considered complete without a metadata record to accompany it.
  54. An established standard provides common terms, definitions, and structure that allow for consistent communication. The use of standards also supports search and retrieval in automated systems.
  55. This is an example of a metadata record using the Federal Geographic Data Committee (FGDC) standard. The benefit is that if every geographic data set used this standard metadata schema, then everyone would be able to understand the metadata and computers would be able to process it.