Introduction to Data Management
CC image by University of Maryland Press Releases on Flickr
Adapted from curriculum developed by DataONE
Introduction to Data Management
• Why Manage Data?
• Data Sharing
• Data Entry & Manipulation
• Quality Control & Assurance
• Backup
• Metadata
• Data Citation
Introduction to Data Management
• Organization
• Reproducibility
• Version control
• Quality control
• Valuable asset
• Accuracy
• Integrity
• Data sharing
• Sustainability & accessibility
Introduction to Data Management
• If data are:
o Well-organized
o Documented
o Preserved
o Accessible
o Verified for accuracy and validity
• Result is:
o High quality data
o Easy to share and re-use in science
o Citation and credibility to the researcher
o Cost-savings to science
Introduction to Data Management
• Why Manage Data?
• Data Sharing
Introduction to Data Management
Data sharing requires effort, resources, and faith in others.
Why do it?
For the benefit of:
o the public
o the research sponsor
o the research community
o the researcher
CC image by Jessica Lucia on Flickr
Introduction to Data Management
A better-informed public yields better decision making with regard to:
o Environmental and economic planning
o Federal, state, and local policies
o Social choices such as use of tax dollars and education options
o Personal lifestyle and health choices such as nutrition and recreation
CC image by falonyates on Flickr
Introduction to Data Management
• Organizations that sponsor research must maximize the
value of research dollars
• Data sharing enhances the value of research investments by
enabling:
o verification of performance metrics and outcomes
o new research and increased return on investment
o advancement of the science
o reduced data duplication expenditures
Introduction to Data Management
Access to related research enables community members to:
o build upon the work of others
o perform meta-analyses
o share resources and perspectives
CC image by Lawrence Berkeley National Laboratory on Flickr
Introduction to Data Management
Access to related research enables community members to
(cont’d):
o increase transparency, reproducibility and comparability of results
o expand methodology assessment, recommendations and
improvement
o educate new researchers as to the most current and significant
findings
Introduction to Data Management
Scientists who share data gain the benefit of:
o recognition
o improved data quality
o greater opportunity for data exchange
o improved connections
CC image by SLU Madrid Campus on Flickr
Introduction to Data Management
Step One:
Create robust, discoverable metadata (a minimal example follows below)
o Geographic and temporal coverage
o Discipline-specific metadata schema
o Discipline-specific vocabulary
o Describe attributes
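A minimal sketch of such a metadata record, written as a plain Python dictionary (the field names and values are illustrative, not a specific standard; real projects would follow a discipline schema such as EML or ISO 19115):

import json

# Illustrative metadata record; field names are hypothetical
metadata = {
    "title": "Small mammal trapping data, field visit 1",
    "creator": "A. Researcher",
    "temporal_coverage": {"start": "2010-04-11", "end": "2010-09-30"},
    "geographic_coverage": {"north": 35.2, "south": 32.6,
                            "west": -109.3, "east": -107.5},
    "keywords": ["small mammals", "population survey"],  # discipline vocabulary
    "attributes": {  # describe every column in the dataset
        "Species_Code": "code for the species observed",
        "Soil_Temp_30cm": "soil temperature at 30 cm depth, degrees Celsius",
    },
}

# Store alongside the data so the record travels with the files
print(json.dumps(metadata, indent=2))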
Introduction to Data Management
Step Two:
Include archival and reference information
o Include a data citation
o Include a persistent identifier (e.g., DOI)
Data Citation Example: Sidlauskas, B. 2007. Data from: Testing for unequal rates of
morphological diversification in the absence of a detailed phylogeny: a case study from
characiform fishes. Dryad Digital Repository. doi:10.5061/dryad.20
Introduction to Data Management
Step Three:
Have data contributors review your metadata to ensure validity
and organizational ‘correctness’
o Are the processes described accurately?
o Are all contributions adequately identified?
o Has management reviewed the product and documentation?
o Is the funding organization properly recognized?
Introduction to Data Management
Step Four:
Publish your data and metadata via:
Data Repositories/Clearinghouses
• Discipline-specific
◦ Sciences
• Knowledge Network for Biocomplexity (KNB) Data Portal
• Long Term Ecological Research (LTER) Network Data Portal
◦ Social Sciences
• ICPSR
• Institutional
◦ Trace
Introduction to Data Management
• Why Manage Data?
• Data Sharing
• Data Entry & Manipulation
Introduction to Data Management
• Create data sets that are:
o Valid
o Organized to support ease of use
CC image by TravisS on Flickr
Introduction to Data Management
• Inconsistency between data collection events
– Location of Date information
– Inconsistent Date format
– Column names
– Order of columns
Introduction to Data Management
• Inconsistency between data collection events
– Different site spellings, capitalization, spaces
in site names—hard to filter
– Codes used for site names for some data, but
spelled out for others
– Mean1 value is in Weight column
– Text and numbers in same column – what is
the mean of 12, “escaped < 15”, and 91?
Introduction to Data Management
• Columns of data are consistent:
only numbers, dates, or text
• Consistent Names, Codes, Formats (date) used in each column
• Data are all in one table, which is much easier for a statistical program to work
with than multiple small tables, each of which requires human intervention
Introduction to Data Management
• Descriptive column names (a renaming sketch follows below)
◦ Soil T30 → Soil_Temp_30cm
◦ Species-Code → Species_Code (avoid using -, +, *, or ^ in column names;
some software may interpret these symbols as operators)
• Descriptive file names
◦ Mammal data-.csv → FieldVisit1_SmallMammalData_2010-04-11.csv
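A minimal sketch of applying such renames programmatically with pandas (the DataFrame here is a stand-in for your real file; the old and new names come from the slide):

import pandas as pd

df = pd.DataFrame(columns=["Soil T30", "Species-Code"])

# Replace spaces and operator-like symbols with underscores and
# make the names self-describing
df = df.rename(columns={
    "Soil T30": "Soil_Temp_30cm",
    "Species-Code": "Species_Code",
})
print(list(df.columns))  # ['Soil_Temp_30cm', 'Species_Code']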
Introduction to Data Management
• Enter complete lines of data
Sorting an Excel file with empty cells is not a good idea!
Introduction to Data Management
• Missing data
o Preferably leave the field empty (NULL = no value)
o In numeric fields, use a distinct value such as 9999 to indicate a missing value
o In text fields, use NA (“Not Applicable” or “Not Available”)
o Use data flags in a separate column to qualify missing values (see the sketch after the table)
Date      Time   NO3_N_Conc   NO3_N_Conc_Flag
20081011  1300   0.013
20081011  1330   0.016
20081011  1400                M1
20081011  1430   0.018
20081011  1500   0.001        E1

M1 = missing; no sample collected
E1 = estimated from grab sample
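A minimal sketch of reading such flagged data with pandas so that empty cells and 9999 sentinels become true missing values (the file name nitrate.csv is hypothetical; the column names follow the table above):

import pandas as pd

# Treat empty cells and the 9999 sentinel as missing (NaN)
df = pd.read_csv("nitrate.csv", na_values=["", 9999])

# Use the flag column to qualify values, e.g. exclude estimated readings (E1)
measured = df[df["NO3_N_Conc_Flag"] != "E1"]

# Statistics then skip missing values by default
print(measured["NO3_N_Conc"].mean())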
Introduction to Data Management
Spreadsheets:
• Great for charts, graphs, calculations
• Flexible about cell content type: cells in the same column can contain numbers or text
• Lack record integrity: a column can be sorted independently of all others
• Easy to use, but harder to maintain as the complexity and size of the data grow

Databases:
• Easy to query to select portions of data
• Data fields are typed: for example, only integers are allowed in integer fields
• Columns cannot be sorted independently of each other
• Steeper learning curve than a spreadsheet
Introduction to Data Management
• A set of tables
• Relationships
• A command language

Sample sites: *siteID, site_name, latitude, longitude, description
Species: *speciesID, species_name, common_name, family, order
Samples: *sampleID, siteID, sample_date, speciesID, height, flowering, flag, comments
Introduction to Data Management
Date           Site          Height                Flowering
<dates only>   <text only>   <real numbers only>   <‘y’ and ‘n’ only>

Advantages:
• quality control
• performance
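A minimal sketch of how a database enforces such typed columns, using Python's built-in sqlite3 (the table name and CHECK constraints are illustrative):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE observations (
        obs_date  TEXT NOT NULL,   -- dates stored as ISO strings
        site      TEXT NOT NULL,   -- text only
        height    REAL,            -- real numbers only
        flowering TEXT CHECK (flowering IN ('y', 'n'))
    )
""")

# A valid row is accepted...
conn.execute("INSERT INTO observations VALUES ('2010-02-13', 'A', 12.5, 'y')")

# ...but an out-of-range value is rejected at entry time
try:
    conn.execute("INSERT INTO observations VALUES ('2010-02-13', 'A', 12.5, 'maybe')")
except sqlite3.IntegrityError as err:
    print("Rejected:", err)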
Introduction to Data Management
Date       Site   Species   Flowering?
2/13/2010  A      BOGR2     y
2/13/2010  B      HODR      y
4/15/2010  B      BOER4     y
4/15/2010  C      PLJA      n

Site   Latitude   Longitude
A      34.1       -109.3
B      35.2       -108.6
C      32.6       -107.5

Date       Site   Species   Flowering?   Latitude   Longitude
2/13/2010  A      BOGR2     y            34.1       -109.3
2/13/2010  B      HODR      y            35.2       -108.6
4/15/2010  B      BOER4     y            35.2       -108.6
4/15/2010  C      PLJA      n            32.6       -107.5

Mix and match data on the fly (a join sketch follows below)
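A minimal sketch of that join using sqlite3 (table and column names are illustrative, chosen to mirror the tables above):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE obs   (obs_date TEXT, site TEXT, species TEXT, flowering TEXT);
    CREATE TABLE sites (site TEXT, latitude REAL, longitude REAL);
    INSERT INTO obs VALUES
        ('2010-02-13','A','BOGR2','y'), ('2010-02-13','B','HODR','y'),
        ('2010-04-15','B','BOER4','y'), ('2010-04-15','C','PLJA','n');
    INSERT INTO sites VALUES ('A',34.1,-109.3), ('B',35.2,-108.6), ('C',32.6,-107.5);
""")

# Combine observations with site coordinates on the fly
query = """
    SELECT o.obs_date, o.site, o.species, o.flowering, s.latitude, s.longitude
    FROM obs AS o JOIN sites AS s ON o.site = s.site
"""
for row in conn.execute(query):
    print(row)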
Introduction to Data Management
This table is called SoilTemp:

Date        Plot  Treatment  SensorDepth  Soil_Temperature
2010-02-01  C     R          30           12.8
2010-02-01  B     C          10           13.2
2010-02-02  C     R          0            6.3
2010-02-02  A     N          0            15.1

SQL examples:

SELECT Date, Plot, Treatment, SensorDepth, Soil_Temperature FROM SoilTemp
WHERE Date = '2010-02-01'

returns:

Date        Plot  Treatment  SensorDepth  Soil_Temperature
2010-02-01  C     R          30           12.8
2010-02-01  B     C          10           13.2

SELECT * FROM SoilTemp WHERE Treatment = 'N' AND SensorDepth = 0

returns:

Date        Plot  Treatment  SensorDepth  Soil_Temperature
2010-02-02  A     N          0            15.1
Introduction to Data Management
• Be aware of Best Practices when designing data file structures
• Choose a data entry method that allows some validation of
data as it is entered
• Consider investing time in learning how to use a database if
datasets are large or complex
CC image by fo.ol on Flickr
Introduction to Data Management
• Why Manage Data?
• Data Sharing
• Data Entry & Manipulation
• Quality Control & Assurance
Introduction to Data Management
• Errors of Commission
o Incorrect or inaccurate data entered
o Examples: malfunctioning instrument, mistyped data
• Errors of Omission
o Data or metadata not recorded
o Examples: inadequate documentation, human error, anomalies in the
field
CC image by NickJWebb on Flickr
Introduction to Data Management
• Define & enforce standards
◦ Formats
◦ Codes
◦ Measurement units
◦ Metadata
• Assign responsibility for data quality
◦ Be sure the assigned person is trained in QA/QC
Introduction to Data Management
• Double entry
◦ Data keyed in by two independent people
◦ Check for agreement with computer verification
• Record a reading of the data and transcribe from the
recording
• Use text-to-speech program to read data back
CC image by weskriesel on Flickr
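A minimal sketch of the computer-verification step for double entry (the file names entry1.csv and entry2.csv are hypothetical):

import csv

# Load both independently keyed versions of the data
with open("entry1.csv", newline="") as f1, open("entry2.csv", newline="") as f2:
    rows1 = list(csv.reader(f1))
    rows2 = list(csv.reader(f2))

# Report every cell where the two versions disagree
for i, (r1, r2) in enumerate(zip(rows1, rows2), start=1):
    for j, (a, b) in enumerate(zip(r1, r2), start=1):
        if a != b:
            print(f"Row {i}, column {j}: {a!r} vs {b!r}")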
Introduction to Data Management
• Design data storage well
◦ Minimize the number of times an item must be entered
◦ Use consistent terminology
◦ Atomize data: one cell per piece of information
• Document changes to data
◦ Avoids duplicate error checking
◦ Allows changes to be undone if necessary
Introduction to Data Management
• Make sure data line up in the proper columns
• Check for missing, impossible, or anomalous values
• Perform statistical summaries (see the sketch below)
CC image by chesapeakeclimate on Flickr
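A minimal sketch of such checks with pandas (the file plant_data.csv, the Height column, and the 0–300 range are hypothetical):

import pandas as pd

df = pd.read_csv("plant_data.csv")

# Missing values per column
print(df.isna().sum())

# Impossible values: heights must fall within a plausible range
out_of_range = df[(df["Height"] < 0) | (df["Height"] > 300)]
print(out_of_range)

# Statistical summary to spot anomalous values
print(df["Height"].describe())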
Introduction to Data Management
• Look for outliers
◦ Outliers are extreme values for a variable given the statistical model
being used
◦ The goal is not to eliminate outliers but to identify potential data
contamination
[Scatter plot illustrating an extreme value relative to the rest of the data]
Introduction to Data Management
• Methods to look for outliers
◦ Graphical
• Normal probability plots
• Regression
• Scatter plots
◦ Plotting on maps
◦ Deviation (see the sketch below)
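A minimal sketch of a deviation-based screen using only the standard library (the readings and the 2-standard-deviation cutoff are illustrative; flagged values are reviewed, not deleted):

import statistics

values = [12.8, 13.2, 12.9, 13.1, 12.7, 55.0, 13.0]  # hypothetical readings

mean = statistics.mean(values)
sd = statistics.stdev(values)

# Flag values far from the mean for review
for v in values:
    if abs(v - mean) > 2 * sd:
        print(f"Potential outlier: {v}")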
Introduction to Data Management
• Beware of errors of commission and omission
• Execute quality assurance and quality control strategies
◦ Data entry
◦ Data visualization
Introduction to Data Management
• Why Manage Data?
• Data Sharing
• Data Entry & Manipulation
• Quality Control & Assurance
• Backup
Introduction to Data Management
• Backups vs. Archives
o Backups: a copy (or copies) of the original file is made before the original is overwritten
o Archives: preservation of the file
• Data Preservation
o Includes archiving in addition to processes such as data rescue, data reformatting, data conversion, and metadata creation
Introduction to Data Management
• Backups
o periodic snapshots
o usually copies of files
o performed on regular schedule
• Archiving
o preserve data
o usually the final version
o performed at the end of a project or milestones
It is a good idea to have multiple copies of your backups and
archives, in case one copy fails.
Introduction to Data Management
• Limit or negate loss of data
• Save time, money, productivity
• Help prepare for disasters
o Accidental deletions
o Fires, natural disasters
o Software bugs, hardware failures
• Reproduce results
• Respond to data requests
• Limit liability
CC image courtesy of Brian J Matis on Flickr
Introduction to Data Management
• Includes backups and archiving
• Also includes
◦ data conversion
◦ data reformatting
◦ data rescue
Introduction to Data Management
• Data Conversions and Formats
o Use non-proprietary, standard formats
o Textual documents → .txt
o Spreadsheets → .csv
o Digital images → .tiff
• Versioning
• File Naming (see the sketch below)
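A minimal sketch of generating descriptive, consistently dated file names (the pattern mirrors the earlier FieldVisit example; the project details are hypothetical):

from datetime import date

def data_file_name(visit: int, dataset: str, when: date, ext: str = "csv") -> str:
    """Build a descriptive, sortable name such as
    FieldVisit1_SmallMammalData_2010-04-11.csv"""
    return f"FieldVisit{visit}_{dataset}_{when.isoformat()}.{ext}"

print(data_file_name(1, "SmallMammalData", date(2010, 4, 11)))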
Introduction to Data Management
• Create a backup plan that clearly identifies:
o roles,
o responsibilities,
o where the data are backed up,
o how often the files are backed up,
o how to access the files,
o recommended file formats to be used, and
o policies for migrating data to ensure data are not lost to media
degradation or changing formats or programs
• Review your backup plan regularly
• Update as needed (a snapshot sketch follows below)
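A minimal sketch of one piece of such a plan, a timestamped snapshot copy (the paths are hypothetical; a real plan would also cover off-site copies, access, and restore testing):

import shutil
from datetime import datetime
from pathlib import Path

def snapshot(source: str, backup_root: str) -> Path:
    """Copy the source directory to a timestamped folder under backup_root."""
    stamp = datetime.now().strftime("%Y-%m-%d_%H%M%S")
    dest = Path(backup_root) / f"{Path(source).name}_{stamp}"
    shutil.copytree(source, dest)
    return dest

# e.g. run daily from a scheduler such as cron:
# snapshot("/home/me/project/data", "/mnt/backup")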
Introduction to Data Management
• Why Manage Data?
• Data Sharing
• Data Entry & Manipulation
• Quality Control & Assurance
• Backup
• Metadata
Introduction to Data Management
• When you provide data to someone else, what types of
information would you want to include with the data?
• When you receive a dataset from an external source, what
types of details do you want to know about the data?
Introduction to Data Management
• Why were the data created?
• What limitations, if any, do the data have?
• What do the data mean?
• How should the data be cited if they are re-used in a new study?
Introduction to Data Management
• What are the data gaps?
• What processes were used for creating the
data?
• Are there any fees associated with the data?
• At what scale were the data created?
• What do the values in the tables mean?
• What software do I need in order to read the
data?
• What projection are the data in?
• Can I give these data to someone else?
Introduction to Data Management
Metadata is data ‘reporting’:
• WHO created the data?
• WHAT is the content of the data?
• WHEN were the data created?
• WHERE is it geographically?
• HOW were the data developed?
• WHY were the data developed?
Photo by Michelle Chang. All Rights Reserved.
Introduction to Data Management
Author(s)       Boullosa, Carmen.
Title(s)        They're cows, we're pigs / by Carmen Boullosa
Place           New York : Grove Press, 1997.
Physical Descr  viii, 180 p ; 22 cm.
Subject(s)      Pirates Caribbean Area Fiction.
Format          Fiction
CC image by USDAgov on Flickr
CC image by Mskadu on Flickr
Introduction to Data Management
[Figure: data details decline over time, starting from the time of data development (from Michener et al. 1997)]
• Specific details about problems with individual items or specific dates are lost relatively rapidly
• General details about datasets are lost through time
• Accident or technology change may make data unusable
• Retirement or career change makes access to “mental storage” difficult or unlikely
• Loss of the data developer leads to loss of the remaining information
Introduction to Data Management
• A Standard provides a structure to describe data with:
◦ Common terms to allow consistency between records
◦ Common definitions for easier interpretation
◦ Common language for ease of communication
◦ Common structure to quickly locate information
• In search and retrieval, standards provide:
◦ Documentation structure in a reliable and predictable format for
computer interpretation
◦ A uniform summary description of the dataset
CC image by ccarlstead on Flickr
Introduction to Data Management
CC image by Ilike on Flickr
Introduction to Data Management
• Why Manage Data?
• Data Sharing
• Data Entry & Manipulation
• Quality Control & Assurance
• Backup
• Metadata
• Data Citation
Introduction to Data Management
• Similar to citing a published
article or book
o Provide information necessary to
identify and locate the work cited
• No standards yet
• Use format recommended by
journal, repository, or
professional organization
CC image by Paxsimius on Flickr
Introduction to Data Management
Breitburg DL, Hondorp D, Audemard C, Carnegie RB, Burrell RB,
Trice M, Clark V (2015) Data from: Landscape-level variation in
disease susceptibility related to shallow-water hypoxia. PLOS
ONE. http://dx.doi.org/10.5061/dryad.9k231
Introduction to Data Management
A persistent identifier should be included in the citation:
• DOI (Digital Object Identifier)
o Globally unique, alphanumeric string assigned by a registration
agency to identify content and provide a persistent link to its
location.
o May be assigned to any item of intellectual property that is defined
by structured metadata
o Examples: 10.1234/NP5678, 10.5678/ISBN-0-7645-4889-4, 10.2224/2004-10-ISO-DOI
• The UT Libraries can assign a DOI to your data set.
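A minimal sketch of why DOIs make citations durable: any DOI resolves through the central doi.org proxy, so a working link can be built from the identifier alone (the DOI shown is the Dryad example cited earlier in the deck):

def doi_url(doi: str) -> str:
    """A DOI resolves through the doi.org proxy."""
    return f"https://doi.org/{doi}"

print(doi_url("10.5061/dryad.20"))  # https://doi.org/10.5061/dryad.20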
Introduction to Data Management
Chris Eaker
Data Curation Librarian
Hodges Library, Room 236
865-974-4404
chris@utk.edu
Available to help with…
• Data management plan support
• Data repositories
• Metadata
• Data management consulting

More Related Content

What's hot

Big Data Taxonomy 8/26/2013
Big Data Taxonomy 8/26/2013Big Data Taxonomy 8/26/2013
Big Data Taxonomy 8/26/2013DataTactics
 
Peng Privette SMM_AMS2014_P695
Peng Privette SMM_AMS2014_P695Peng Privette SMM_AMS2014_P695
Peng Privette SMM_AMS2014_P695
Ge Peng
 
Data Cleaning Techniques
Data Cleaning TechniquesData Cleaning Techniques
Data Cleaning Techniques
Amir Masoud Sefidian
 
Unit iii
Unit iiiUnit iii
Unit iii
Kgr Sushmitha
 
Major issues in data mining
Major issues in data miningMajor issues in data mining
Major issues in data miningSlideshare
 
Leveraging Oracle's Life Sciences Data Hub to Enable Dynamic Cross-Study Anal...
Leveraging Oracle's Life Sciences Data Hub to Enable Dynamic Cross-Study Anal...Leveraging Oracle's Life Sciences Data Hub to Enable Dynamic Cross-Study Anal...
Leveraging Oracle's Life Sciences Data Hub to Enable Dynamic Cross-Study Anal...Perficient
 
An Introduction to Clinical Study Migrations
An Introduction to Clinical Study MigrationsAn Introduction to Clinical Study Migrations
An Introduction to Clinical Study Migrations
Perficient, Inc.
 
G045033841
G045033841G045033841
G045033841
IJERA Editor
 
Using Oracle Health Sciences Data Management Workbench to Optimize the Manage...
Using Oracle Health Sciences Data Management Workbench to Optimize the Manage...Using Oracle Health Sciences Data Management Workbench to Optimize the Manage...
Using Oracle Health Sciences Data Management Workbench to Optimize the Manage...Perficient
 
Data Quality Asia Pacific Award_v1.1_20100520
Data Quality Asia Pacific Award_v1.1_20100520Data Quality Asia Pacific Award_v1.1_20100520
Data Quality Asia Pacific Award_v1.1_20100520Tatiana Stebakova
 
Issues, challenges, and solutions
Issues, challenges, and solutionsIssues, challenges, and solutions
Issues, challenges, and solutions
csandit
 
Metadata lecture(9 17-14)
Metadata lecture(9 17-14)Metadata lecture(9 17-14)
Metadata lecture(9 17-14)
mhb120
 
1 introductory slides (1)
1 introductory slides (1)1 introductory slides (1)
1 introductory slides (1)
tafosepsdfasg
 
Knowledge discovery process
Knowledge discovery process Knowledge discovery process
Knowledge discovery process
Shuvra Ghosh
 
A step towards a data quality theory
 A step towards a data quality theory A step towards a data quality theory
A step towards a data quality theory
Anastasija Nikiforova
 
Good Practice in Research Data Management
Good Practice in Research Data ManagementGood Practice in Research Data Management
Good Practice in Research Data Management
Historic Environment Scotland
 

What's hot (17)

Big Data Taxonomy 8/26/2013
Big Data Taxonomy 8/26/2013Big Data Taxonomy 8/26/2013
Big Data Taxonomy 8/26/2013
 
rEDCap At A Glance
rEDCap At A GlancerEDCap At A Glance
rEDCap At A Glance
 
Peng Privette SMM_AMS2014_P695
Peng Privette SMM_AMS2014_P695Peng Privette SMM_AMS2014_P695
Peng Privette SMM_AMS2014_P695
 
Data Cleaning Techniques
Data Cleaning TechniquesData Cleaning Techniques
Data Cleaning Techniques
 
Unit iii
Unit iiiUnit iii
Unit iii
 
Major issues in data mining
Major issues in data miningMajor issues in data mining
Major issues in data mining
 
Leveraging Oracle's Life Sciences Data Hub to Enable Dynamic Cross-Study Anal...
Leveraging Oracle's Life Sciences Data Hub to Enable Dynamic Cross-Study Anal...Leveraging Oracle's Life Sciences Data Hub to Enable Dynamic Cross-Study Anal...
Leveraging Oracle's Life Sciences Data Hub to Enable Dynamic Cross-Study Anal...
 
An Introduction to Clinical Study Migrations
An Introduction to Clinical Study MigrationsAn Introduction to Clinical Study Migrations
An Introduction to Clinical Study Migrations
 
G045033841
G045033841G045033841
G045033841
 
Using Oracle Health Sciences Data Management Workbench to Optimize the Manage...
Using Oracle Health Sciences Data Management Workbench to Optimize the Manage...Using Oracle Health Sciences Data Management Workbench to Optimize the Manage...
Using Oracle Health Sciences Data Management Workbench to Optimize the Manage...
 
Data Quality Asia Pacific Award_v1.1_20100520
Data Quality Asia Pacific Award_v1.1_20100520Data Quality Asia Pacific Award_v1.1_20100520
Data Quality Asia Pacific Award_v1.1_20100520
 
Issues, challenges, and solutions
Issues, challenges, and solutionsIssues, challenges, and solutions
Issues, challenges, and solutions
 
Metadata lecture(9 17-14)
Metadata lecture(9 17-14)Metadata lecture(9 17-14)
Metadata lecture(9 17-14)
 
1 introductory slides (1)
1 introductory slides (1)1 introductory slides (1)
1 introductory slides (1)
 
Knowledge discovery process
Knowledge discovery process Knowledge discovery process
Knowledge discovery process
 
A step towards a data quality theory
 A step towards a data quality theory A step towards a data quality theory
A step towards a data quality theory
 
Good Practice in Research Data Management
Good Practice in Research Data ManagementGood Practice in Research Data Management
Good Practice in Research Data Management
 

Similar to Data Management Best Practices

DataONE Education Module 07: Metadata
DataONE Education Module 07: MetadataDataONE Education Module 07: Metadata
DataONE Education Module 07: Metadata
DataONE
 
L07 metadata
L07 metadataL07 metadata
L07 metadata
thplayer127
 
Building the enterprise data architecture
Building the enterprise data architectureBuilding the enterprise data architecture
Building the enterprise data architectureCosta Pissaris
 
Enhancing educational data quality in heterogeneous learning contexts using p...
Enhancing educational data quality in heterogeneous learning contexts using p...Enhancing educational data quality in heterogeneous learning contexts using p...
Enhancing educational data quality in heterogeneous learning contexts using p...
Alex Rayón Jerez
 
Saksham Sarode - Building Effective test Data Management in Distributed Envir...
Saksham Sarode - Building Effective test Data Management in Distributed Envir...Saksham Sarode - Building Effective test Data Management in Distributed Envir...
Saksham Sarode - Building Effective test Data Management in Distributed Envir...
TEST Huddle
 
Ch~2.pdf
Ch~2.pdfCh~2.pdf
You Need a Data Catalog. Do You Know Why?
You Need a Data Catalog. Do You Know Why?You Need a Data Catalog. Do You Know Why?
You Need a Data Catalog. Do You Know Why?
Precisely
 
You Need a Data Catalog. Do You Know Why?
You Need a Data Catalog. Do You Know Why?You Need a Data Catalog. Do You Know Why?
You Need a Data Catalog. Do You Know Why?
Precisely
 
lecture 1.pdf
lecture 1.pdflecture 1.pdf
lecture 1.pdf
AhmadHussainShafiSE3
 
Data-Ed: Unlock Business Value through Data Quality Engineering
Data-Ed: Unlock Business Value through Data Quality EngineeringData-Ed: Unlock Business Value through Data Quality Engineering
Data-Ed: Unlock Business Value through Data Quality EngineeringDATAVERSITY
 
Data-Ed: Unlock Business Value through Data Quality Engineering
Data-Ed: Unlock Business Value through Data Quality Engineering Data-Ed: Unlock Business Value through Data Quality Engineering
Data-Ed: Unlock Business Value through Data Quality Engineering
Data Blueprint
 
An Agile & Adaptive Approach to Addressing Financial Services Regulations and...
An Agile & Adaptive Approach to Addressing Financial Services Regulations and...An Agile & Adaptive Approach to Addressing Financial Services Regulations and...
An Agile & Adaptive Approach to Addressing Financial Services Regulations and...
Neo4j
 
Combining Human+Machine Intelligence to Successfully Integrate Biomedical Data
Combining Human+Machine Intelligence to Successfully Integrate Biomedical DataCombining Human+Machine Intelligence to Successfully Integrate Biomedical Data
Combining Human+Machine Intelligence to Successfully Integrate Biomedical Data
russelltamr
 
BigData Testing by Shreya Pal
BigData Testing by Shreya PalBigData Testing by Shreya Pal
BigData Testing by Shreya Pal
Agile Testing Alliance
 
Automating Data Science over a Human Genomics Knowledge Base
Automating Data Science over a Human Genomics Knowledge BaseAutomating Data Science over a Human Genomics Knowledge Base
Automating Data Science over a Human Genomics Knowledge Base
Vaticle
 
Five Things to Consider About Data Mesh and Data Governance
Five Things to Consider About Data Mesh and Data GovernanceFive Things to Consider About Data Mesh and Data Governance
Five Things to Consider About Data Mesh and Data Governance
DATAVERSITY
 
Chapter 4 Organizational Aspects of Data Management.ppt
Chapter 4 Organizational Aspects of Data Management.pptChapter 4 Organizational Aspects of Data Management.ppt
Chapter 4 Organizational Aspects of Data Management.ppt
AnasSamara3
 
Ch_2.pdf
Ch_2.pdfCh_2.pdf
Ch_2.pdf
DawitBirhanu13
 
Data Quality: A Raising Data Warehousing Concern
Data Quality: A Raising Data Warehousing ConcernData Quality: A Raising Data Warehousing Concern
Data Quality: A Raising Data Warehousing Concern
Amin Chowdhury
 
Dw 07032018-dr pl pradhan
Dw 07032018-dr pl pradhanDw 07032018-dr pl pradhan
Dw 07032018-dr pl pradhan
Dr Pradhan PL Pradhan
 

Similar to Data Management Best Practices (20)

DataONE Education Module 07: Metadata
DataONE Education Module 07: MetadataDataONE Education Module 07: Metadata
DataONE Education Module 07: Metadata
 
L07 metadata
L07 metadataL07 metadata
L07 metadata
 
Building the enterprise data architecture
Building the enterprise data architectureBuilding the enterprise data architecture
Building the enterprise data architecture
 
Enhancing educational data quality in heterogeneous learning contexts using p...
Enhancing educational data quality in heterogeneous learning contexts using p...Enhancing educational data quality in heterogeneous learning contexts using p...
Enhancing educational data quality in heterogeneous learning contexts using p...
 
Saksham Sarode - Building Effective test Data Management in Distributed Envir...
Saksham Sarode - Building Effective test Data Management in Distributed Envir...Saksham Sarode - Building Effective test Data Management in Distributed Envir...
Saksham Sarode - Building Effective test Data Management in Distributed Envir...
 
Ch~2.pdf
Ch~2.pdfCh~2.pdf
Ch~2.pdf
 
You Need a Data Catalog. Do You Know Why?
You Need a Data Catalog. Do You Know Why?You Need a Data Catalog. Do You Know Why?
You Need a Data Catalog. Do You Know Why?
 
You Need a Data Catalog. Do You Know Why?
You Need a Data Catalog. Do You Know Why?You Need a Data Catalog. Do You Know Why?
You Need a Data Catalog. Do You Know Why?
 
lecture 1.pdf
lecture 1.pdflecture 1.pdf
lecture 1.pdf
 
Data-Ed: Unlock Business Value through Data Quality Engineering
Data-Ed: Unlock Business Value through Data Quality EngineeringData-Ed: Unlock Business Value through Data Quality Engineering
Data-Ed: Unlock Business Value through Data Quality Engineering
 
Data-Ed: Unlock Business Value through Data Quality Engineering
Data-Ed: Unlock Business Value through Data Quality Engineering Data-Ed: Unlock Business Value through Data Quality Engineering
Data-Ed: Unlock Business Value through Data Quality Engineering
 
An Agile & Adaptive Approach to Addressing Financial Services Regulations and...
An Agile & Adaptive Approach to Addressing Financial Services Regulations and...An Agile & Adaptive Approach to Addressing Financial Services Regulations and...
An Agile & Adaptive Approach to Addressing Financial Services Regulations and...
 
Combining Human+Machine Intelligence to Successfully Integrate Biomedical Data
Combining Human+Machine Intelligence to Successfully Integrate Biomedical DataCombining Human+Machine Intelligence to Successfully Integrate Biomedical Data
Combining Human+Machine Intelligence to Successfully Integrate Biomedical Data
 
BigData Testing by Shreya Pal
BigData Testing by Shreya PalBigData Testing by Shreya Pal
BigData Testing by Shreya Pal
 
Automating Data Science over a Human Genomics Knowledge Base
Automating Data Science over a Human Genomics Knowledge BaseAutomating Data Science over a Human Genomics Knowledge Base
Automating Data Science over a Human Genomics Knowledge Base
 
Five Things to Consider About Data Mesh and Data Governance
Five Things to Consider About Data Mesh and Data GovernanceFive Things to Consider About Data Mesh and Data Governance
Five Things to Consider About Data Mesh and Data Governance
 
Chapter 4 Organizational Aspects of Data Management.ppt
Chapter 4 Organizational Aspects of Data Management.pptChapter 4 Organizational Aspects of Data Management.ppt
Chapter 4 Organizational Aspects of Data Management.ppt
 
Ch_2.pdf
Ch_2.pdfCh_2.pdf
Ch_2.pdf
 
Data Quality: A Raising Data Warehousing Concern
Data Quality: A Raising Data Warehousing ConcernData Quality: A Raising Data Warehousing Concern
Data Quality: A Raising Data Warehousing Concern
 
Dw 07032018-dr pl pradhan
Dw 07032018-dr pl pradhanDw 07032018-dr pl pradhan
Dw 07032018-dr pl pradhan
 

Recently uploaded

The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptxThe use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
MAGOTI ERNEST
 
DMARDs Pharmacolgy Pharm D 5th Semester.pdf
DMARDs Pharmacolgy Pharm D 5th Semester.pdfDMARDs Pharmacolgy Pharm D 5th Semester.pdf
DMARDs Pharmacolgy Pharm D 5th Semester.pdf
fafyfskhan251kmf
 
THEMATIC APPERCEPTION TEST(TAT) cognitive abilities, creativity, and critic...
THEMATIC  APPERCEPTION  TEST(TAT) cognitive abilities, creativity, and critic...THEMATIC  APPERCEPTION  TEST(TAT) cognitive abilities, creativity, and critic...
THEMATIC APPERCEPTION TEST(TAT) cognitive abilities, creativity, and critic...
Abdul Wali Khan University Mardan,kP,Pakistan
 
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Ana Luísa Pinho
 
Shallowest Oil Discovery of Turkiye.pptx
Shallowest Oil Discovery of Turkiye.pptxShallowest Oil Discovery of Turkiye.pptx
Shallowest Oil Discovery of Turkiye.pptx
Gokturk Mehmet Dilci
 
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
yqqaatn0
 
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
Wasswaderrick3
 
SAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdfSAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdf
KrushnaDarade1
 
Nucleophilic Addition of carbonyl compounds.pptx
Nucleophilic Addition of carbonyl  compounds.pptxNucleophilic Addition of carbonyl  compounds.pptx
Nucleophilic Addition of carbonyl compounds.pptx
SSR02
 
BREEDING METHODS FOR DISEASE RESISTANCE.pptx
BREEDING METHODS FOR DISEASE RESISTANCE.pptxBREEDING METHODS FOR DISEASE RESISTANCE.pptx
BREEDING METHODS FOR DISEASE RESISTANCE.pptx
RASHMI M G
 
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
University of Maribor
 
Oedema_types_causes_pathophysiology.pptx
Oedema_types_causes_pathophysiology.pptxOedema_types_causes_pathophysiology.pptx
Oedema_types_causes_pathophysiology.pptx
muralinath2
 
Leaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdfLeaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdf
RenuJangid3
 
Deep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless ReproducibilityDeep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless Reproducibility
University of Rennes, INSA Rennes, Inria/IRISA, CNRS
 
Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.
Nistarini College, Purulia (W.B) India
 
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
Travis Hills MN
 
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
University of Maribor
 
Topic: SICKLE CELL DISEASE IN CHILDREN-3.pdf
Topic: SICKLE CELL DISEASE IN CHILDREN-3.pdfTopic: SICKLE CELL DISEASE IN CHILDREN-3.pdf
Topic: SICKLE CELL DISEASE IN CHILDREN-3.pdf
TinyAnderson
 
Chapter 12 - climate change and the energy crisis
Chapter 12 - climate change and the energy crisisChapter 12 - climate change and the energy crisis
Chapter 12 - climate change and the energy crisis
tonzsalvador2222
 
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
Sérgio Sacani
 

Recently uploaded (20)

The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptxThe use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
 
DMARDs Pharmacolgy Pharm D 5th Semester.pdf
DMARDs Pharmacolgy Pharm D 5th Semester.pdfDMARDs Pharmacolgy Pharm D 5th Semester.pdf
DMARDs Pharmacolgy Pharm D 5th Semester.pdf
 
THEMATIC APPERCEPTION TEST(TAT) cognitive abilities, creativity, and critic...
THEMATIC  APPERCEPTION  TEST(TAT) cognitive abilities, creativity, and critic...THEMATIC  APPERCEPTION  TEST(TAT) cognitive abilities, creativity, and critic...
THEMATIC APPERCEPTION TEST(TAT) cognitive abilities, creativity, and critic...
 
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
 
Shallowest Oil Discovery of Turkiye.pptx
Shallowest Oil Discovery of Turkiye.pptxShallowest Oil Discovery of Turkiye.pptx
Shallowest Oil Discovery of Turkiye.pptx
 
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
 
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
 
SAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdfSAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdf
 
Nucleophilic Addition of carbonyl compounds.pptx
Nucleophilic Addition of carbonyl  compounds.pptxNucleophilic Addition of carbonyl  compounds.pptx
Nucleophilic Addition of carbonyl compounds.pptx
 
BREEDING METHODS FOR DISEASE RESISTANCE.pptx
BREEDING METHODS FOR DISEASE RESISTANCE.pptxBREEDING METHODS FOR DISEASE RESISTANCE.pptx
BREEDING METHODS FOR DISEASE RESISTANCE.pptx
 
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
 
Oedema_types_causes_pathophysiology.pptx
Oedema_types_causes_pathophysiology.pptxOedema_types_causes_pathophysiology.pptx
Oedema_types_causes_pathophysiology.pptx
 
Leaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdfLeaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdf
 
Deep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless ReproducibilityDeep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless Reproducibility
 
Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.
 
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
 
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
 
Topic: SICKLE CELL DISEASE IN CHILDREN-3.pdf
Topic: SICKLE CELL DISEASE IN CHILDREN-3.pdfTopic: SICKLE CELL DISEASE IN CHILDREN-3.pdf
Topic: SICKLE CELL DISEASE IN CHILDREN-3.pdf
 
Chapter 12 - climate change and the energy crisis
Chapter 12 - climate change and the energy crisisChapter 12 - climate change and the energy crisis
Chapter 12 - climate change and the energy crisis
 
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
 

Data Management Best Practices

  • 1. Introduction to Data Management CCimagebyUniversityofMarylandPressReleasesonFlickr Adapted from curriculum developed by
  • 2. Introduction to Data Management • Why Manage Data? • Data Sharing • Data Entry & Manipulation • Quality Control & Assurance • Backup • Metadata • Data Citation
  • 3. Introduction to Data Management • Organization • Reproducibility • Version control • Quality control • Valuable asset • Accuracy • Integrity • Data sharing • Sustainability & accessibility
  • 4. Introduction to Data Management • If data are: o Well-organized o Documented o Preserved o Accessible o Verified as to Accuracy and validity • Result is: o High quality data o Easy to share and re-use in science o Citation and credibility to the researcher o Cost-savings to science
  • 5. Introduction to Data Management • Why Manage Data? • Data Sharing
  • 6. Introduction to Data Management Data sharing requires effort, resources, and faith in others. Why do it? For the benefit of: o the public o the research sponsor o the research community o the researcher CCimagebyJessicaLuciaonFlickr
  • 7. Introduction to Data Management A better informed public yields better decision making with regard to: o Environmental and economic planning o Federal, state, and local policies o social choices such as use of tax dollars and education options o personal lifestyle and health such as nutrition and recreation CCimagebyfalonyatesonFlickr
  • 8. Introduction to Data Management • Organizations that sponsor research must maximize the value of research dollars • Data sharing enhances the value of research investments by enabling: o verification of performance metrics and outcomes o new research and increased return on investment o advancement of the science o reduced data duplication expenditures
  • 9. Introduction to Data Management Access to related research enables community members to: o build upon the work of others o perform meta analyses o share resources and perspectives CCimagebyLawrenceBerkeleyNational LaboratoryonFlickr
  • 10. Introduction to Data Management Access to related research enables community members to (cont’d): o increase transparency, reproducibility and comparability of results o expand methodology assessment, recommendations and improvement o educate new researchers as to the most current and significant findings
  • 11. Introduction to Data Management Scientists that share data gain the benefit of: o Recognition o improved data quality o greater opportunity for data exchange o improved connections CCimagebySLUMadridCampus onFlickr
  • 12. Introduction to Data Management Step One: Create robust metadata that is discoverable o Geographic and temporal coverage o Discipline specific metadata schema o Discipline specific vocabulary o Describe attributes
  • 13. Introduction to Data Management Step Two: Include archival and reference information o Include a data citation o Include Persistent Identifier (e.g. DOI) Data Citation Example: Sidlauskas, B. 2007. Data from: Testing for unequal rates of morphological diversification in the absence of a detailed phylogeny: a case study from characiform fishes. Dryad Digital Repository. doi:10.5061/dryad.20
  • 14. Introduction to Data Management Step Three: Have data contributors review your metadata to ensure validity and organizational ‘correctness’ o are the processes described accurately? o are all contributions adequately identified? o has management reviewed the product and documentation? o is the funding organization properly recognized?
  • 15. Introduction to Data Management Step Four: Publish your data and metadata via: Data Repositories/Clearinghouses • Discipline-specific ◦ Sciences • Knowledge Network for Biodiversity (KNB) Data Portal • Long Term Ecological Research (LTER) Network Data Portal ◦ Social Sciences • ICPSR • Institutional ◦ Trace
  • 16. Introduction to Data Management • Why Manage Data? • Data Sharing • Data Entry & Manipulation
  • 17. Introduction to Data Management • Create data sets that are: o Valid o Organized to support ease of use CCimagebyTravisSonFlickr
  • 18. Introduction to Data Management • Inconsistency between data collection events – Location of Date information – Inconsistent Date format – Column names – Order of columns
  • 19. Introduction to Data Management • Inconsistency between data collection events – Different site spellings, capitalization, spaces in site names—hard to filter – Codes used for site names for some data, but spelled out for others – Mean1 value is in Weight column – Text and numbers in same column – what is the mean of 12, “escaped < 15”, and 91?
  • 20. Introduction to Data Management • Columns of data are consistent: only numbers, dates, or text • Consistent Names, Codes, Formats (date) used in each column • Data are all in one table, which is much easier for a statistical program to work with than multiple small tables which each require human intervention
  • 21. Introduction to Data Management • Descriptive column names ◦ Soil T30  Soil_Temp_30cm ◦ Species-Code  Species_Code (avoid using -,+,*,^ in column names. Some software may interpret these symbols as an operator) • Descriptive file names ◦ Mammal data-.csv  FieldVisit1_SmallMammalData_2010-04-11.csv
  • 22. Introduction to Data Management • Enter complete lines of data Sorting an Excel file with empty cells is not a good idea!
  • 23. Introduction to Data Management • Missing data o Preferably leave field empty (NULL = no value) o In numeric fields, use a distinct value such as 9999 to indicate a missing value o In text fields, use NA (“Not Applicable” or “Not Available”) o Use Data flags in a separate column to qualify missing value Date Time NO3_N_Conc NO3_N_Conc_Flag 20081011 1300 0.013 20081011 1330 0.016 20081011 1400 M1 20081011 1430 0.018 20081011 1500 0.001 E1 M1 = missing; no sample collected E1 = estimated from grab sample
  • 24. Introduction to Data Management
  • 25. Introduction to Data Management 20
  • 26. Introduction to Data Management • Great for charts, graphs, calculations • Flexible about cell content type—cells in same column can contain numbers or text • Lack record integrity--can sort a column independently of all others) • Easy to use – but harder to maintain as complexity and size of data grows • Easy to query to select portions of data • Data fields are typed – For example, only integers are allowed in integer fields • Columns cannot be sorted independently of each other • Steeper learning curve than a spreadsheet
  • 27. Introduction to Data Management • A set of tables • Relationships • A command language *siteID site_name latitude longitude description Sample sites *speciesID species_name common_name family order Species *sampleID siteID sample_date speciesID height flowering flag comments samples *sampleID siteID sample_date speciesID height flowering flag comments Samples
  • 28. Introduction to Data Management Date Site Height Flowering <dates only> <text only> <real numbers only> <‘y’ and ‘n’ only> Advantages • quality control • performance
  • 29. Introduction to Data Management Date Site Species Flowering? 2/13/2010 A BOGR2 y 2/13/2010 B HODR y 4/15/2010 B BOER4 y 4/15/2010 C PLJA n Site Latitude Longitude A 34.1 -109.3 B 35.2 -108.6 C 32.6 -107.5 Date Site Species Flowering? Latitude Longitude 2/13/2010 A BOGR2 y 34.1 -109.3 2/13/2010 B HODR y 35.2 -108.6 4/15/2010 B BOER4 y 35.2 -108.6 4/15/2010 C PLJA n 32.6 -107.5 Mix and Match data on the fly
  • 30. Introduction to Data Management Date Plot Treatment SensorDepth Soil_Temperature 2010-02-01 C R 30 12.8 2010-02-01 B C 10 13.2 2010-02-02 C R 0 6.3 2010-02-02 A N 0 15.1 SQL examples: Select Date, Plot, Treatment, SensorDepth, Soil_Temperature from SoilTemp where Date = ‘2010-02-01’ Date Plot Treatment SensorDepth Soil_Temperature 2010-02-01 C R 30 12.8 2010-02-01 B C 10 13.2 Date Plot Treatment SensorDepth Soil_Temperature 2010-02-02 A N 0 15.1 This table is called SoilTemp Select * from SoilTemp where Treatment=‘N’ and SensorDepth=‘0’
  • 31. Introduction to Data Management
  • 32. Introduction to Data Management • Be aware of Best Practices when designing data file structures • Choose a data entry method that allows some validation of data as it is entered • Consider investing time in learning how to use a database if datasets are large or complex CCimagebyfo.olonFlickr
  • 33. Introduction to Data Management • Why Manage Data? • Data Sharing • Data Entry & Manipulation • Quality Control & Assurance
  • 34. Introduction to Data Management • Errors of Commission o Incorrect or inaccurate data entered o Examples: malfunctioning instrument, mistyped data • Errors of Omission o Data or metadata not recorded o Examples: inadequate documentation, human error, anomalies in the field CCimagebyNickJWebbonFlickr
  • 35. Introduction to Data Management • Define & enforce standards ◦ Formats ◦ Codes ◦ Measurement units ◦ Metadata • Assign responsibility for data quality ◦ Be sure assigned person is educated in QA/QC
  • 36. Introduction to Data Management • Double entry ◦ Data keyed in by two independent people ◦ Check for agreement with computer verification • Record a reading of the data and transcribe from the recording • Use text-to-speech program to read data back CCimagebyweskrieselonFlickr
  • 37. Introduction to Data Management • Design data storage well ◦ Minimize number of times items that must be entered repeatedly ◦ Use consistent terminology ◦ Atomize data: one cell per piece of information • Document changes to data ◦ Avoids duplicate error checking ◦ Allows undo if necessary
  • 38. Introduction to Data Management • Make sure data line up in proper columns • No missing, impossible, or anomalous values • Perform statistical summaries CCimagebychesapeakeclimateonFlickr
  • 39. Introduction to Data Management • Look for outliers ◦ Outliers are extreme values for a variable given the statistical model being used ◦ The goal is not to eliminate outliers but to identify potential data contamination 0 10 20 30 40 50 60 0 5 10 15 20 25 30 35
  • 40. Introduction to Data Management • Methods to look for outliers ◦ Graphical • Normal probability plots • Regression • Scatter plots ◦ Plotting on maps ◦ Deviation
  • 41. Introduction to Data Management • Beware of errors of commission and omission • Execute quality assurance and quality control strategies ◦ Data entry ◦ Data visualization
  • 42. Introduction to Data Management • Why Manage Data? • Data Sharing • Data Entry & Manipulation • Quality Control & Assurance • Backup
  • 43. Introduction to Data Management • Backups vs. Archives ◦ Backups: a copy (or copies) of the original file is made before the original is overwritten o Archives: preservation of the file • Data Preservation o Includes archiving in addition to processes such as data rescue, data reformatting, data conversion, metadata
  • 44. Introduction to Data Management • Backups o periodic snapshots o usually copies of files o performed on regular schedule • Archiving o preserve data o usually the final version o performed at the end of a project or milestones It is a good idea to have multiple copies of your backups and archives, in case one copy fails.
  • 45. Introduction to Data Management • Limit or negate loss of data • Save time, money, productivity • Help prepare for disasters o Accidental deletions o Fires, natural disasters o Software bugs, hardware failures • Reproduce results • Respond to data requests • Limit liability CCImagecourtesyofBrianJMatisonFlickr
  • 46. Introduction to Data Management • Includes backups and archiving • Also includes ◦ data conversion ◦ data reformatting ◦ data rescue
  • 47. Introduction to Data Management • Data Conversions and Formats o Use non-proprietary, standard formats o Textual documents  .txt o Spreadsheets  .csv o Digital images  .tiff • Versioning • File Naming
  • 48. Introduction to Data Management • Create a backup plan that clearly identifies: o roles, o responsibilities, o where the data is backed up, o how often the files are backed up, o how to access the files, o recommended file formats to be used, and o policies for migrating data to assure data are not lost due to media degradation or changing formats or programs • Review your backup plan regularly • Update as needed
  • 49. Introduction to Data Management • Why Manage Data? • Data Sharing • Data Entry & Manipulation • Quality Control & Assurance • Backup • Metadata
  • 50. Introduction to Data Management • When you provide data to someone else, what types of information would you want to include with the data? • When you receive a dataset from an external source, what types of details do you want to know about the data?
  • 51. Introduction to Data Management • Why were the data created? • What limitations, if any, do the data have? • What does the data mean? • How should the data be cited if it is re-used in a new study?
  • 52. Introduction to Data Management • What are the data gaps? • What processes were used for creating the data? • Are there any fees associated with the data? • In what scale were the data created? • What do the values in the tables mean? • What software do I need in order to read the data? • What projection are the data in? • Can I give these data to someone else?
  • 53. Introduction to Data Management Metadata is: Data ‘reporting’ • WHO created the data? • WHAT is the content of the data? • WHEN were the data created? • WHERE is it geographically? • HOW were the data developed? • WHY were the data developed? PhotobyMichelleChang.AllRightsReserved
  • 54. Introduction to Data Management Author(s) Boullosa, Carmen. Title(s) They're cows, we're pigs / by Carmen Boullosa Place New York : Grove Press, 1997. Physical Descr viii, 180 p ; 22 cm. Subject(s) Pirates Caribbean Area Fiction. Format Fiction CCimagebyUSDAgovonFlickr CCimagebyMskaduonFlickr
  • 55. Introduction to Data Management DATADETAILS Time of data development Specific details about problems with individual items or specific dates are lost relatively rapidly General details about datasets are lost through time Accident or technology change may make data unusable Retirement or career change makes access to “mental storage” difficult or unlikely Loss of data developer leads to loss of remaining information TIME (From Michener et al 1997)
  • 56. Introduction to Data Management • A Standard provides a structure to describe data with: ◦ Common terms to allow consistency between records ◦ Common definitions for easier interpretation ◦ Common language for ease of communication ◦ Common structure to quickly locate information • In search and retrieval, standards provide: ◦ Documentation structure in a reliable and predictable format for computer interpretation ◦ A uniform summary description of the dataset CCimagebyccarlstead onFlickr
  • 57. Introduction to Data Management CCimagebyIlikeonFlickr
  • 58. Introduction to Data Management • Why Manage Data? • Data Sharing • Data Entry & Manipulation • Quality Control & Assurance • Backup • Metadata • Data Citation
  • 59. Introduction to Data Management • Similar to citing a published article or book o Provide information necessary to identify and locate the work cited • No standards yet • Use format recommended by journal, repository, or professional organization CCimagebyPaxsimiusonFlickr
  • 60. Introduction to Data Management Breitburg DL, Hondorp D, Audemard C, Carnegie RB, Burrell RB, Trice M, Clark V (2015) Data from: Landscape-level variation in disease susceptibility related to shallow-water hypoxia. PLOS ONE. http://dx.doi.org/10.5061/dryad.9k231
  • 66. Introduction to Data Management A persistent identifier should be included in the citation: • DOI (Digital Object Identifier) o Globally unique, alphanumeric string assigned by a registration agency to identify content and provide a persistent link to its location. o May be assigned to any item of intellectual property that is defined by structured metadata o Examples: 10.1234/NP5678, 10.5678/ISBN-0-7645-4889-4, 10.2224/2004-10-ISO-DOI • The UT Libraries can assign a DOI to your data set.
  • 67. Introduction to Data Management Chris Eaker Data Curation Librarian Hodges Library, Room 236 865-974-4404 chris@utk.edu Available to help with… • Data management plan support • Data repositories • Metadata • Data management consulting

Editor's Notes

  1. Manage your data for yourself: keep yourself organized, so you can find your files (data inputs, analytic scripts, outputs at various stages of the analytic process, etc.); track your science processes for reproducibility, so you can match up your outputs with the exact inputs and transformations that produced them; better control versions of data, easily identifying versions that can be periodically purged; and quality control your data more efficiently. Data are a valuable asset: they are expensive and time consuming to collect. Data should be managed to: maximize the effective use and value of data and information assets; continually improve quality, including data accuracy, integrity, integration, timeliness of data capture and presentation, relevance, and usefulness; ensure appropriate use of data and information; facilitate data sharing; and ensure sustainability and accessibility in the long term for re-use in science.
  2. To summarize, the goal of effective DM is to get data that are well organized (both within the files themselves and across groups of files), documented adequately with metadata, preserved for future reuse, accessible by others for that reuse, and accurate and valid. If data are all those things, then you will have high quality data that are easy to share and reuse in science. The data can then be cited by other researchers, which will add credibility to the researcher who prepared them. Overall, it saves money and advances science.
  3. One of the main goals of effective data management is to facilitate sharing data with other researchers. Grant funding agencies want to see a greater return on their investment. Let’s talk a bit about data sharing and why it’s important.
  4. Why expend the extra effort to share data? Because it benefits the public, the research sponsor, the research community and, perhaps most importantly, the researcher.
  5. How does the public benefit from shared research? The more informed the public is, the better they are able to understand and contribute toward effective public and personal decisions. The public needs data to help with environmental and economic planning. Data help inform federal, state and local policies. The public can use data to help with social choices such as who they will vote for, how they want their tax dollars to be used, and where they will send their children to school. It can even help them make personal lifestyle and health choices regarding exercise, smoking, and nutrition.
  6. Why do research sponsors encourage data sharing? Because sponsors have an obligation to maximize the investment of research dollars. Data sharing enhances the value of the research investment by enabling external reviewers to verify the project performance metrics and outcomes. This not only increases the credibility of the data but also spurs new research that can build upon the initial investment and advance the science rather than duplicate expenditures.
  7. The scientific community as a whole also benefits from sharing among researchers. Data sharing allows researchers to build upon one another’s work and to further, rather than duplicate, the science by exploring new findings or combining findings into meta analyses that cannot be performed with individual data. In sharing data, the scientific community expands both individual perspectives and the collective comprehension.
  8. Access to related research enables members of the scientific community to better reproduce, compare and assess methods and results. Scientists are able to learn from one another and educate new researchers as to the most current and significant findings.
  9. And finally, how does the independent researcher benefit from data sharing? When scientists share their data, they gain recognition as an authoritative source and respect as a wise investment for research dollars. When data are exposed, feedback from the broader community can be used to improve the quality and presentation of the data. Shared data also allows for greater opportunity for data exchange and networking opportunities with peers and potential collaborators. I’m going to go through 4 steps to make data shareable. Then we’ll go through the nuts and bolts of executing on those four steps.
  10. The more robust your metadata, the more easily your data will be discovered and the more appropriately it will be used. Specifically, when creating metadata, be specific with regard to the geographic and time period coverage of your data. For example…. Use a discipline specific metadata schema, if possible. For example… Use discipline specific themes, place names, and keywords. Describe any attributes (variable names, specimen names, etc.) thoroughly.
  11. Step 2: Be sure to include archival and reference information with properly formatted data citations for sources and content. Include persistent data identifiers with the data citation. We’re going to go into data citation in more depth later, but this is an example of a citation for a data set on the Dryad Data Repository. As you can see, there is a DOI added to the end, which allows the data to be located easily.
  12. Step 3: Be sure to have data contributors review their metadata to ensure validity and organizational correctness. Are the processes correct? Is your contribution adequately represented and reflected? Is your organization properly recognized and is the funding organization properly recognized? Be sure to get management and sponsor approval on the data publication including the content, presentation, and manner in which contributors are identified.
  13. The last step to make your data Shareable: Step 4: Publish your metadata in data portals and clearinghouses. Seek out relevant government portals and portals developed by specific communities of practice. The nice thing about these two science data repositories is that the metadata in them is harvested by the DataONE search portal and so people just have to search one interface to find data in these two, among many other, data repositories.
  14. Now, one of the most important things one can do when managing research data is to take steps to make sure data entry, processing, and analysis are done in such a way as to minimize error. We'll talk now about ways to do that during data entry and manipulation.
  15. The goals of data entry are to create data that are valid (that is, have gone through a process to assure quality) and that are organized to support both use of the data and ease of archiving.
  16. Many researchers like to manage their data in Excel, and Excel makes it easy to use poor data entry practices: it doesn't enforce any rules on your data entry unless you tell it to. These are data entered into Excel from a small mammal trapping study. Each block of data represents a different trapping period (2/13, 3/15, and 4/10/2010). Inconsistencies in how the data were entered for each sampling period make the data difficult to analyze and difficult for anyone but the data collector to understand. Note that the date is listed in different places in each block: date is a column in the first block, but listed in the header in the block on the right. Inconsistent date formats were also used; in one place the date is formatted as day-month-year, with the first three letters of the month spelled out, while elsewhere the format is mm/dd/yyyy. Note also that the order of the columns is inconsistent: Site, Date in the first block, and Site, Plot in the bottom block. Even the columns are named differently: Species is called Species in the first block, and RodentSp in the block on the right. This can be confusing to any user who must try to make sense of these data! And it would be a nightmare to try to write metadata for this spreadsheet.
  17. There are other problems with how these data were entered. Naming of sites is also inconsistent: for instance, Deep Well is used in the first block vs. DW in the block on the right. The file also contains several typos, such as rioSalado vs. rioSlado. A human can figure out what each of these site names refers to, but the names would have to be harmonized for a statistical program to use. It would be easier to filter for just Deep Well (with a space) than to have to know that you also need to filter for DeepWell (no space). Similarly, in one place a species code is capitalized PERO, and lowercase elsewhere. Further, in the first block of data, a mean was calculated for the weight of the rodents. The value for that mean, called Mean1, is in the same column as the weights of the individual animals; in later manipulations of these data, it would be easy to copy that value as though it represented the weight of a single animal. It is bad practice to mix types of information in one column: raw data are best maintained in one file, with calculations done elsewhere. In addition, there are text data mixed with numeric data in the Weight column in the block on the right – it says “escaped < 15” (presumably indicating that a rodent less than 15 grams escaped). A statistical program will not know how to deal with text data mixed with numeric data: what is the mean of 12, 91, and “escaped < 15”? To analyze all these data using statistical software, and to make them much easier for any user to understand, the data will need to be organized into a column for each variable. Therefore it is essential that only one type of information be entered into each column, and that spellings, codes, formats, etc. be consistent.
  18. This shows the same data entered in a way that would make them easy to understand and analyze. The data are not entered in separate blocks arrayed in a single worksheet; they are entered in one table with columns for the variables Date, Site, Plot, Species, Weight, Adult, and Comments that are recorded for each sampling event. The columns of data have consistent types: each column contains only numbers, dates, or text. There are consistent names, codes, and formats used in each column. For instance, all dates are in the same format (mm/dd/yyyy), and there are no typos in the site names. Species are all referred to by standard codes. Therefore, if the user wanted to subset the data for species = 'PERO', they could easily filter the file for just those data, as in the sketch below. Additionally, there are only numeric data in the Weight column, so a statistical program or Excel could readily calculate statistics on this column. Preparing metadata for this file would also be straightforward.
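As an illustration of why the one-table layout pays off, here is a minimal sketch in Python with pandas; the rows are invented for the example and are not the actual study data.

```python
import pandas as pd
from io import StringIO

# The trapping data entered as one table, one column per variable
# (values here are illustrative, not the actual study data).
csv_text = """Date,Site,Plot,Species,Weight
02/13/2010,Deep Well,1,PERO,21
02/13/2010,Deep Well,2,PERO,18
03/15/2010,rioSalado,1,DIPO,24
"""
df = pd.read_csv(StringIO(csv_text))

# Consistent codes make subsetting trivial:
pero = df[df["Species"] == "PERO"]

# And a numeric-only Weight column can be summarized directly:
print(pero["Weight"].mean())
```

With the blocked layout from the previous slide, neither the filter nor the mean would work without manual cleanup first.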
  19. Descriptive column names: A best practice in data entry is to create descriptive column names without spaces or special characters. Some statistical programs have special uses for certain characters, so you should avoid using them in your data file. Descriptive file names: For instance, a file named FieldVisit1_SmallMammalData_2010-04-11.csv indicates that this file contains data collected during Field Visit 1, contains small mammal data, and is version dated April 11, 2010. We know it's April 11, and not November 4, because the name uses the standard ISO date format: four digit year, followed by 2 digit month, and 2 digit day. This name is much more helpful than a file named Mammal data.csv.
  20. There are a lot of great things about spreadsheets, but one must be wary of problems that can arise from their use. Spreadsheets, for instance, can sort one column independently of all others. The data entry person for the upper spreadsheet elected to leave empty cells for site, treat, web, plot, and quad. It's obvious why, and it doesn't cause the human reader any problems. But if someone happens to sort on Species, it is no longer clear which species maps to which time period or to which measurements. This could make the spreadsheet unusable. It is good practice to fill in all cells when using a spreadsheet for data entry: enter complete lines of data, so that the data can be sorted on any one column without loss of information.
  21. When you collect data, you're bound to have some data points with no values. There are ways to ensure that those missing values are handled consistently and don't throw off your analysis. A preferred way to identify missing data is with an empty field. If an empty cell is not possible (for example, because the software you're using requires something to be there), use an impossible value such as 9999 in numeric fields and NA in text fields. If you need to explain something, you can use data flags in a separate column to qualify empty cells. For instance, in this example of stream chemistry data, the flag M1 indicates that the sample was not collected at that interval, and the flag E1 indicates that the value was estimated. A sketch of honoring these conventions at import time follows.
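A minimal sketch of reading such data in Python with pandas; the file name and the Flag column name are assumptions for the example.

```python
import pandas as pd

# Treat the sentinel 9999 and the text "NA" as missing on import so they
# cannot silently distort means or other statistics.
# "stream_chemistry.csv" and the "Flag" column are hypothetical names.
df = pd.read_csv("stream_chemistry.csv", na_values=["9999", "NA"])

# Flag codes stay as ordinary data and can be used to qualify values:
not_collected = df[df["Flag"] == "M1"]  # sample not collected
estimated = df[df["Flag"] == "E1"]      # value was estimated

print(df.describe())  # missing values are excluded from the statistics
```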
  22. Excel is a very popular data entry tool, and it allows you to enforce data validation rules. Here, a dropdown list has been generated that allows the user to select entries only from that list. In this way, only defined species codes get entered, and the data are consistent.
  23. Here is another example of data validation using Excel. Height has been defined to contain values between 11 and 15. When 20 is entered, the user is told that they have entered an illegal value.
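Validation rules like the two just described can also be created programmatically. Here is a minimal sketch using Python's openpyxl library; the cell ranges and the species codes other than PERO are assumptions for the example.

```python
from openpyxl import Workbook
from openpyxl.worksheet.datavalidation import DataValidation

wb = Workbook()
ws = wb.active

# Dropdown list: only defined species codes can be chosen
# (codes other than PERO are invented for this example).
species_dv = DataValidation(type="list", formula1='"PERO,DIPO,ONLE"')
ws.add_data_validation(species_dv)
species_dv.add("D2:D100")  # hypothetical Species column range

# Whole-number range: heights outside 11-15 are rejected, as in the slide.
height_dv = DataValidation(
    type="whole", operator="between", formula1="11", formula2="15",
    error="Height must be between 11 and 15.")
ws.add_data_validation(height_dv)
height_dv.add("E2:E100")  # hypothetical Height column range

wb.save("data_entry_template.xlsx")
```

Anyone entering data into the saved template then gets the same dropdown and range checks shown in the screenshots.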
  24. Some researchers are turning to database software instead of spreadsheets for their data management needs. Databases are a powerful option for storing and manipulating datasets. Here, we list some of the pros and cons of spreadsheets vs. databases (which include software such as Oracle, MySQL, SQL Server, and Microsoft Access). Spreadsheets are good at making charts and graphs, and at doing calculations. They are easy to use, but they become unwieldy as the number of records grows and a dataset becomes complex. Databases, on the other hand, work well with high volumes of data, and they are much easier to query in order to select data having particular characteristics. They also maintain data integrity – that is, one column cannot be sorted separately from all the others, as in a spreadsheet. Databases also enforce data typing, which is a best practice: only data of type 'text', for example, can be entered into a column of type 'text'. This helps prevent data entry errors. Databases do have a steeper learning curve than a spreadsheet such as Excel, but there are many benefits.
  25. A relational database matches data stored in tables by using common characteristics found within the data set. This helps preserve data integrity and also makes it possible to flexibly mix and match data to get different combinations of information. A database consists of a set of tables and each table has a defined relationship with another table or tables using a common piece of information called the Primary Key. Databases also have a powerful command language for querying and manipulating information called Structured Query Language, or SQL. Here, a dataset for plant phenology has been divided into three tables, one describing site information, one describing characteristics of each sample, and one describing the plant species found. Relational databases are currently the predominant choice in storing data like financial records, medical records, personal information and manufacturing and logistical data.
  26. Database features include explicit control over data types, which has advantages for quality control and performance. Here, in the plant phenology table, only dates are allowed in the Date column, only text is allowed in the Site column, and only real numbers are allowed in the Height column. If a user tries to enter a '?' under Flowering, the database will reject the entry. This is useful for defining how data are to be entered.
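A minimal sketch of this kind of enforcement, using SQLite through Python's standard library. The table layout is inferred from the slide, and the rule on Flowering is expressed as a CHECK constraint, one common way (not the only one) to reject bad values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Phenology (
        Date      TEXT NOT NULL,
        Site      TEXT NOT NULL,
        Height    REAL,
        Flowering TEXT CHECK (Flowering IN ('yes', 'no'))
    )
""")

# A well-formed row is accepted:
conn.execute("INSERT INTO Phenology VALUES ('2010-02-01', 'A', 12.5, 'yes')")

# A '?' in Flowering violates the CHECK constraint and is rejected:
try:
    conn.execute("INSERT INTO Phenology VALUES ('2010-02-01', 'B', 13.0, '?')")
except sqlite3.IntegrityError as err:
    print("Rejected:", err)
```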
  27. Relationships can be defined between two sets of data or in this example between two tables. Suppose that you have two tables used in the plant phenology study, one for observations and one for sites, and you want a table that contains both observations and the latitude and longitude of your sites. Because both tables contain Site info, they can be joined to create a table containing the info you want.
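A sketch of that join in SQLite, with table and column names inferred from the slide and invented sample values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Sites (Site TEXT PRIMARY KEY, Latitude REAL, Longitude REAL);
    CREATE TABLE Observations (Date TEXT, Site TEXT REFERENCES Sites(Site),
                               Species TEXT, Height REAL);
    INSERT INTO Sites VALUES ('A', 34.35, -106.88);  -- invented coordinates
    INSERT INTO Observations VALUES ('2010-02-01', 'A', 'PERO', 12.5);
""")

# Join on the shared Site key to attach coordinates to each observation.
query = """
    SELECT o.Date, o.Species, s.Latitude, s.Longitude
    FROM Observations AS o
    JOIN Sites AS s ON o.Site = s.Site
"""
for row in conn.execute(query):
    print(row)
```

Because Site is the primary key of the Sites table, each observation picks up exactly one latitude/longitude pair.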
  28. Database features also include a powerful command language called Structured Query Language (SQL). These are just a couple of examples of what you can do with SQL. The table at the top of this slide is named SoilTemp in the database. The first example SQL command returns all records collected on 2010-02-01. The second SELECT statement returns all records from table SoilTemp where Treatment is N and SensorDepth is 0. From this example you can get a sense of how easy SQL is to use to subset data based on different criteria. This is only very simple SQL; there is much, much more that can be done with it. The two statements are reconstructed in the sketch below.
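The two queries described above, reconstructed runnably in SQLite; the Temp column and all inserted readings are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE SoilTemp (Date TEXT, Treatment TEXT,
                           SensorDepth REAL, Temp REAL);
    INSERT INTO SoilTemp VALUES
        ('2010-02-01', 'N', 0, 4.2),   -- invented readings
        ('2010-02-01', 'C', 10, 6.1),
        ('2010-02-02', 'N', 0, 3.9);
""")

# All records collected on 2010-02-01:
print(conn.execute(
    "SELECT * FROM SoilTemp WHERE Date = '2010-02-01'").fetchall())

# All records where Treatment is N and SensorDepth is 0:
print(conn.execute(
    "SELECT * FROM SoilTemp WHERE Treatment = 'N' AND SensorDepth = 0"
).fetchall())
```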
  29. Forms can be created that make entering data into a relational database as easy as entering it into Excel. The screenshot below shows embedded forms that were quickly generated in MS Access for adding data to three tables in a database of plant cover measurements.
  30. Be aware of best practices when designing data file structures. Choose a data entry method that allows validation of the data entered, and be sure to invest time in learning how to use a database, especially if the dataset is large or complex.
  31. Ok, now you’ve taken steps during your data entry process to minimize the possibility of data error, but that doesn’t mean you need to stop there. There are other ways to help with quality control, as well.
  32. In general, there are two types of errors that can occur in a data set. First, errors of commission are the result of incorrect or inaccurate data being included in the data set. This may happen because of a malfunctioning instrument that produces faulty results, data that are mistyped during entry, or other problems. Errors of omission are the second type of errors. These result from data or metadata being omitted. Situations that result in omission errors are when data are inadequately documented, when there are human errors during data collection or entry, or when there are anomalies in the field that affect the data.
  33. The remainder of the module will cover best practices for quality control and quality assurance for the different stages of a research project. First, before data collection, a researcher should think about defining and enforcing standards that will be used during the project. Consider formats that will be used for the data tables or data entry forms. Also, if abbreviations or codes are used, they should be defined up front. Measurement units should also be specified and relevant metadata should be identified before collection. Second, you should assign responsibility for data quality before collection begins. Ideally, the person responsible for data quality assurance and data quality control is the person collecting the data, and is educated in quality control and assurance methods.
  34. Consider using techniques that help eliminate mistakes during data entry. One example is double data entry, where non-digital data are keyed in by two people independently; differences in the entries can then be detected by a computer program and examined further for mistakes. Another way to reduce data entry error is to record yourself reading off the data, and then transcribe it from the recording. You may also use a text-to-speech program that reads the data to you while you type it into the computer. A sketch of the double-entry comparison follows.
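A minimal sketch of the double-entry comparison in Python; both file names are hypothetical.

```python
import csv

# Two people key in the same datasheet independently; rows that differ
# are flagged for a human to check against the original sheet.
# Both file names are hypothetical.
with open("entry_person_A.csv", newline="") as f_a, \
     open("entry_person_B.csv", newline="") as f_b:
    rows_a = list(csv.reader(f_a))
    rows_b = list(csv.reader(f_b))

for i, (row_a, row_b) in enumerate(zip(rows_a, rows_b), start=1):
    if row_a != row_b:
        print(f"Row {i} differs: {row_a} vs {row_b}")

if len(rows_a) != len(rows_b):
    print("The files have different numbers of rows; check for skipped lines.")
```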
  35. If you are using spreadsheets or databases, you should carefully consider their design before and during data entry. Use consistent terminology within the database, and atomize data. This means only one piece of information is in each cell of the spreadsheet -- multiple pieces of information embedded in a single data cell will be problematic during data analysis. If you are using a database, restrict what can be entered into the database; for example, set up a field to accept only text or only numerical values, choose a maximum number of characters or a range of values a field will accept, or set a field to accept only unique values.   Finally, document any changes made to data. It saves time if good records of data editing are kept since multiple users are less likely to spend time on error-checking with old versions of data. If mistakes in data editing or cleaning are made, good data records will allow these mistakes to be undone. Documenting data changes may be as simple as creating a text file to accompany the data set, or it may involve using a scripted program for correcting errors so that each step taken is clearly documented.
  36. Once data are entered, basic quality assurance measures can be taken. First, if data are in spreadsheets or databases, be sure they line up in their proper columns. Also check for any missing, impossible, or anomalous values. One way to check for these problems is to sort data fields and check for discrepancies. It is often also useful to perform basic statistical summaries, such as means, and standard errors. If data transformation was performed for analysis, compare the statistical summaries before and after transformation to ensure no mistakes were made during transformation.
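A sketch of these basic post-entry checks with pandas; the file name reuses the example from the earlier file-naming note, and the column names are the same assumed ones.

```python
import pandas as pd

# File and column names reuse the small mammal example from earlier notes.
df = pd.read_csv("FieldVisit1_SmallMammalData_2010-04-11.csv")

# Summary statistics make impossible values (e.g., a 9999 g rodent) obvious.
print(df["Weight"].describe())

# Sorting is a quick way to spot anomalies at either extreme.
print(df.sort_values("Weight").head())
print(df.sort_values("Weight").tail())

# Value counts on a categorical field expose typos and inconsistent codes.
print(df["Species"].value_counts())
```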
  37. Another strategy for quality control after data are entered is to look for outliers. Outliers are extreme values for a variable: values that lie outside of the statistical model being used to describe the data. Keep in mind that the goal is not to eliminate outliers but to identify potential data contamination. An easy way to do this is to plot your data on a graph. In this graph, most of the data fall along the black line; the outlier identified with the red arrow should be flagged for further investigation. You may find that the data point is legitimately correct, in which case you would keep it. But you may find it's the result of a typo made when you entered the data, in which case you have the opportunity to correct it before it skews your analysis.
  38. One common strategy for identifying outliers is using graphical methods, for instance normal probability plots, regression (as in the previous slide), or scatter plots. If data are geographical, mapping the points can be used to ensure latitude and longitude were correctly entered; this map shows an example of an error that is the result of mis-entering latitude data. Another method for identifying outliers is statistical. You can look at the deviation of a value, which is the difference between the observed value and the mean of that variable. By subtracting values from the mean of the data set, the presence of outliers or faulty data points can become apparent, as in the sketch below.
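A minimal sketch of the deviation-from-the-mean check in Python; the file name, column name, and the three-standard-deviation cutoff are all assumptions for the example.

```python
import pandas as pd

# File name, column name, and the 3-standard-deviation cutoff are
# assumptions for this example.
df = pd.read_csv("stream_chemistry.csv")
values = df["Nitrate"]

deviation = values - values.mean()

# Flag values more than 3 standard deviations from the mean for review.
# Flagged points are candidates for checking, not automatic deletion.
suspects = df[deviation.abs() > 3 * values.std()]
print(suspects)
```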
  39. During this tutorial we first defined several concepts important for understanding quality assurance and quality control, including data contamination and the types of data errors that can result in poor quality data. We then covered best practices for quality assurance and quality control. These strategies either prevent errors from entering a dataset or identify them if they are already present in the data. It is important to define and enforce quality assurance and quality control standards before, during, and after the collection and entry of data.
  40. Now you’ve made sure your data were entered properly and you’ve cleaned up the data from errors of commission and omission. All that hard work could be for naught if you don’t have a good backup plan.
  41. The terms data backup and data archiving are often used interchangeably, as both relate to saving a specific version of a file, but they describe different processes. The term “backup” is used specifically when making copies of various files with the knowledge that the files may change. Backups are kept for a certain amount of time, but can be discarded after a specified time has passed. Archiving is used when a file is to be preserved as-is, often at the end of a project, and acts as a static (and usually final) record. Data preservation encompasses many of these same methodologies, but can also include things like data rescue, reformatting of files, converting data, and the creation of metadata. We'll talk about backups, archives, and preservation in this section; metadata has its own section next.
  42. The main difference between data backups and archiving is that backups deal with data that is copied elsewhere and potentially can be overwritten again as the data change. Archiving makes a record of data that is usually in its final state. When a user performs a backup, they are in essence taking a snapshot of the data at that moment in time. This allows the user to restore the file as needed, such as when the current version of the file is corrupted, lost, or somehow destroyed or altered. Backups are often used for short-term storage or near long-term storage, depending upon the user’s backup needs and procedures. Backups are usually scheduled on a frequent basis. Archiving deals more with records that are no longer in use and is used to create a historical snapshot of the data. This provides for preservation of the data for future needs. Usually, archives are made when a project ends, or when appropriate. Regardless of whether you are dealing with backups or archives, you should have multiple copies in case one (or many) versions fail.
  43. There are many reasons to perform backups, including: - mitigating or preventing the loss of data, which may or may not be reproducible - saving time, money, and productivity, since little to none of the data will have to be reproduced - being prepared for when the unexpected happens, such as human error, disasters, or computer failures - being able to go back to earlier versions and see what your results were. For example, if you are creating models and used data from an earlier model run, the most recent file you have on your computer may not have the same data as when you first created the model output. - being able to send older files to others, regardless of the current version or state (for example, if the current version has been corrupted) - being able to respond when questions arise about results based on older versions of files. For example, you may find that you have to justify your results in court or to other scientists; by having access to older files, you may be able to respond to their requests for information. Or, you may not be able to reproduce the data, and the original copy may be the only evidence of the data collection.
  44. Our last topic in this section covers data preservation. Data preservation is a comprehensive topic, which includes things such as backups, archives, data conversion, reformatting, and rescue. “Data rescue” deals directly with older files that may no longer be in a format that is easily accessible and that will require some “rescuing” before they can be used again. Data rescue becomes more and more important as projects end: even if data were being preserved through the lifetime of the project, files often go untouched or orphaned. Frequently, data have not been managed properly, and some form of data rescue must be performed so that the data aren't a total loss.
  45. When preserving your data, you need to consider many things. Data formats: it is best to use non-proprietary and standardized formats, which will better ensure readability in the future; be sure to check files after converting them, as data, metadata, and formatting loss can occur. Versioning: make sure to use some sort of naming convention (such as incremental letters or numbers) to help keep track of file edits and revisions; this will also help you more quickly locate the correct version of a specific file. Naming: use file names that are consistent, descriptive, and concise; many software programs use a generic file name as their default output, and usually these names are too general to be useful. If data are well-preserved, then data rescue may not be necessary. With proper file naming (which helps keep files from getting lost in the system), proper file formats (which let you open files without converting them), backups (which limit loss of files), and appropriate media types (which limit degradation of files), you may limit or prevent the need for data rescue. A good data management plan, which is discussed in another lesson, is another important tool in limiting the need for data rescue.
  46. The first best practice is to create a backup plan. This may be a physical document, a web page, or listed as someone’s job description. Regardless of how you create it, it’s important to address any of the previously mentioned issues and concerns. For example, who do I contact when I need to get a file off of backup? I’ve included a data backup and security checklist in your handouts for you to use to create that policy or plan. Once you have a backup plan in place, it is good practice to review it periodically to ensure the information still has value and is applicable. Hardware, software, projects, and staff can change over time.
  47. We have two more sections: metadata and data citation.
  48. The two questions you need to consider when you think about metadata are: one, what information would you need to provide someone else so they would be able to work with your data? And two, what kind of information would you want to receive from someone else to be able to work with their data? The answers to these two questions will indicate what metadata you need to provide with your data. Just to define the term, metadata is descriptive information about your data: the project, the people involved, etc. It's important for someone else to make sense of your data, and even for you to make sense of it if you come back to it after not having worked on it for a while.
  49. When sharing data, some considerations include: - why the data were created; - what limitations, if any, the data have; - what the data mean; and - who should be cited if someone publishes something that utilized the data.
  50. When receiving data from an alternative source, consider: What are the data gaps? What processes were used for creating the current data? Are there any fees associated with the data? In what scale were the data created? What do the values in the tables mean? What software do I need in order to read the data? What projection is the data in? Can I give this data to someone else?
  51. Metadata is data about data. It describes the content, quality, condition, and other characteristics of a dataset. Metadata records answer questions such as: Why was the data set created? What processes were used to create the data set? What projection is the data in? When was the data last updated? Who created the data? What scale was used? What fields are in the table? What do the values in those fields mean? Who do I contact about getting more information about the data? How do I obtain a copy of the data? Do the data cost anything? Are there any limitations to the data? Metadata is a valuable tool. Metadata records preserve the usefulness of data over time by detailing methods for data collection and data set creation. Metadata greatly minimizes duplication of effort in the collection of expensive digital data and fosters the sharing of digital data resources.
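As a concrete illustration of the questions a metadata record answers, here is a minimal, schema-agnostic record sketched as a Python dictionary. Every value is invented, and a real project should follow a discipline standard (such as the FGDC standard shown shortly, or EML) rather than an ad hoc structure like this.

```python
import json

# A bare-bones metadata record; all values are hypothetical.
# Real projects should use a discipline standard (e.g., FGDC or EML).
metadata = {
    "title": "Small mammal trapping data, Field Visit 1",
    "creator": "J. Researcher",
    "created": "2010-04-11",
    "geographic_coverage": "Deep Well and rioSalado study sites",
    "temporal_coverage": "2010-02-13 to 2010-04-10",
    "methods": "Live trapping on fixed plots; see protocol document",
    "attributes": {
        "Weight": "body mass in grams",
        "Species": "four-letter species code, e.g., PERO",
    },
    "contact": "data.manager@example.edu",
    "usage_limitations": "None; cite the dataset DOI when reusing",
}

print(json.dumps(metadata, indent=2))
```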
  52. Metadata is all around us, from MP3 players, to nutrition labels, to library card catalogues. For example, a card catalogue tells us more than just the title of a book; it also tells the user: Who is the author? Who published the book? What subject area does the book fall in? And finally, where is it located in the library? Another example of metadata that we see in our daily lives is the nutrition and ingredient information on food labels. Nutrition labels answer questions such as: What ingredients were used? Who made the food? How many calories per serving? How many servings in the can? What percentage of daily vitamins is in each serving?
  53. This graph illustrates the phenomenon of “information entropy” associated with research. At the time of the research project, a scientist's memory is fresh. Details about the development of the dataset are easily recalled, and it is a good time to document information about the process. Over time, memory of the details begins to fade. A variety of circumstances can intervene, and eventually detailed knowledge about the dataset fades. Without a metadata record, the data might be unusable. A dataset is not considered complete without a metadata record to accompany it.
  54. An established standard provides common terms, definitions, and structure that allow for consistent communication. The use of standards also supports search and retrieval in automated systems.
  55. This is an example of a metadata record using the Federal Geographic Data Committee (FGDC) standard. The benefit is that if every geographic data set used this standard metadata schema, then everyone would be able to understand the metadata and computers would be able to process it.