Data Management: Tips & Tools
Stephanie Wright
University of Washington
swright@uw.edu
SPATIAL / IsoCamp
June 2015
Who Am I?
• Computing Trainer
• Cruise Ship Lecturer (Love Boat)
• Library Merger Manager
• Atmospheric Sciences Librarian
• Assessment Librarian
• Data Services Coordinator
HTTP://GUIDES.LIB.WASHINGTON.EDU/SWRIGHT
Disclaimer
I am not a scientist. I am a librarian …
More like this…
What Do I Do?
• Data Management Plans (DMPs)
• Courses
• Consultations
• Research Projects
• DataONE, RDA, eScience Institute
• Institutional Data Repository (DRUW)
Why?
[Figures: THEN vs. NOW]
A Real Life Example
[Figures: "my spreadsheet" screenshots, with callouts]
• Many tables
• No headings
• Embedded figures
One More Example
Data Sharing and Management Snafu in 3 Short Acts
https://www.youtube.com/watch?v=66oNv_DJuPc
Why Does It Matter?
From Flickr by tomhilton
HTTP://WWW.SPARC.ARL.ORG/ISSUES/OPEN-DATA/DATA-SHARING-INITIATIVE/POLICIES
… “Federal agencies investing in research
and development (more than $100 million in
annual expenditures) must have clear and
coordinated policies for increasing public
access to research products.”
“The best thing to do with your data will be
thought of by someone else.”
“We need open data because we don’t just want
to use a car we want to poke around in the
engine, see how it works and then rebuild it.”
~ Rufus Pollock
Founder and President of Open Knowledge Foundation (www.okfn.org)
From Flickr by cogdog
Wicherts JM, Bakker M, Molenaar D (2011) Willingness to Share Research Data Is Related to the Strength of the Evidence and the Quality of Reporting of Statistical Results. PLoS ONE 6(11): e26828. doi:10.1371/journal.pone.0026828
How To Do It?
Data planning is more efficient than data forensics.
DATA MANAGEMENT PLANNING
•What will be collected
•Methods
•Standards
•Sharing/access
•Long-term storage
COLLECTING
• Keep raw data raw
• Use scripts to process data
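One way to picture "keep raw data raw" is a script that only ever reads the raw file and writes its results somewhere else. The sketch below is illustrative only: the folder names, the `site` and `value` columns, and the rule of dropping rows with an empty value are all assumptions, not part of the talk.

```python
import csv
import tempfile
from pathlib import Path


def process_raw(raw_path: Path, out_path: Path) -> int:
    """Read the raw CSV, drop rows with an empty 'value' field, write a new file.

    The raw file is only opened for reading, so it is never modified.
    Returns the number of data rows written.
    """
    with raw_path.open(newline="") as src:
        rows = [row for row in csv.DictReader(src) if row["value"].strip()]
    with out_path.open("w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=["site", "value"])
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)


# Demo in a throwaway directory so the example is self-contained.
workdir = Path(tempfile.mkdtemp())
raw_file = workdir / "raw" / "counts_raw.csv"
raw_file.parent.mkdir()
raw_file.write_text("site,value\nA,3\nB,\nC,7\n")
raw_before = raw_file.read_text()

processed_file = workdir / "processed" / "counts_clean.csv"
processed_file.parent.mkdir()
kept = process_raw(raw_file, processed_file)
```

Because every change lives in the script, the cleaning step can be rerun or corrected later without touching the original measurements.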
ORGANIZING
• Machine readable
• Human readable
• Works well with default ordering
AVOID
• spaces
• punctuation
• special characters
• case sensitivity
20130503_DOEProject_DesignDocument_Smith_v2-01.docx
20130709_DOEProject_MasterData_Jones_v1-00.xlsx
20130825_DOEProject_Ex1Test1_Data_Gonzalez_v3-03.xlsx
20130825_DOEProject_Ex1Test1_Documentation_Gonzalez_v3-03.xlsx
20131002_DOEProject_Ex1Test2_Data_Gonzalez_v1-01.xlsx
20141023_DOEProject_ProjectMeetingNotes_Kramer_v1-00.docx
Eaffinis_nanaimo_2010_counts.xls
• Eaffinis: study organism
• nanaimo: site name
• 2010: year
• counts: what was measured
Date prefix in the DOEProject file names: YYYYMMDD
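The naming pattern in the DOEProject examples can be sketched as a small helper. This is a hypothetical function, not a tool from the talk; it assumes the pattern YYYYMMDD_Project_Description_Author_vMajor-Minor.ext and strips anything that is not a letter, digit or hyphen from each part.

```python
import re
from datetime import date

# Anything other than letters, digits and hyphens is removed from each part,
# which takes care of spaces, punctuation and special characters.
_UNSAFE = re.compile(r"[^A-Za-z0-9-]")


def build_name(day: date, project: str, description: str,
               author: str, major: int, minor: int, ext: str) -> str:
    """Assemble YYYYMMDD_Project_Description_Author_vM-mm.ext (assumed pattern)."""
    parts = [day.strftime("%Y%m%d"), project, description, author]
    clean = [_UNSAFE.sub("", part) for part in parts]
    return "_".join(clean) + f"_v{major}-{minor:02d}.{ext}"


name = build_name(date(2013, 5, 3), "DOE Project", "Design Document",
                  "Smith", 2, 1, "docx")

# Date-first names fall into chronological order under a plain
# alphabetical sort, which is what "default ordering" relies on.
names = [
    "20131002_DOEProject_Ex1Test2_Data_Gonzalez_v1-01.xlsx",
    "20130503_DOEProject_DesignDocument_Smith_v2-01.docx",
    "20130709_DOEProject_MasterData_Jones_v1-00.xlsx",
]
ordered = sorted(names)
```

Generating names from one function is a simple way to stick to whichever convention you pick.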
Noble, William S. (2009) "A Quick Guide to Organizing Computational Biology Projects." PLoS Computational Biology 5(7): e1000424. doi:10.1371/journal.pcbi.1000424
• Pick a method that works for you and stick to it
• DOCUMENT IT!
METADATA
•Who?
•What?
•Where?
•When?
•How?
•Why?
Digital context
• Name of the data set
• The name(s) of the data file(s) in the data set
• Date the data set was last modified
• Example data file records for each data type file
• Pertinent companion files
• List of related or ancillary data sets
• Software (including version number) used to prepare/read the data set
• Data processing that was performed
Personnel & stakeholders
• Who collected
• Who to contact with questions
• Funders
Scientific context
• Scientific reason why the data were collected
• What data were collected
• What instruments (including model & serial number) were used
• Environmental conditions during collection
• Temporal & spatial resolution
• Standards or calibrations used
Information about parameters
• How each was measured or produced
• Units of measure
• Format used in the data set
• Precision & accuracy if known
Information about data
• Definitions of codes used
• Quality assurance & control measures
• Known problems that limit data use (e.g. uncertainty, sampling problems)
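As a rough illustration, the checklist above could be captured in a small machine-readable record stored next to the data. The section and field names below loosely follow the checklist; every value is a made-up placeholder, and the JSON layout is an assumption rather than a metadata standard.

```python
import json

# Every value here is an invented placeholder; only the section and field
# names are loosely based on the metadata checklist.
metadata = {
    "digital_context": {
        "dataset_name": "Example temperature and salinity data set",
        "files": ["example_data.csv"],
        "last_modified": "2015-06-01",
        "software": "Example Software 1.0",
        "processing": "Removed records that failed quality control",
    },
    "personnel": {
        "collected_by": "A. Researcher",
        "contact": "researcher@example.org",
        "funders": ["Example Funder"],
    },
    "scientific_context": {
        "purpose": "Placeholder reason the data were collected",
        "instruments": "Example sensor, model X, serial 0000",
        "resolution": "Hourly, single site",
    },
    "parameters": {
        "temperature": {"units": "degrees Celsius", "format": "decimal"},
        "salinity": {"units": "PSU", "format": "decimal"},
    },
    "data_notes": {
        "codes": {"-999": "missing value"},
        "known_problems": "None recorded",
    },
}

REQUIRED_SECTIONS = {
    "digital_context", "personnel", "scientific_context",
    "parameters", "data_notes",
}


def missing_sections(record: dict) -> list:
    """Return the checklist sections that the record does not contain."""
    return sorted(REQUIRED_SECTIONS - set(record))


# Serialized form that could be saved alongside the data files.
readme_text = json.dumps(metadata, indent=2)
```

A check like `missing_sections` is a cheap reminder to fill in each part before sharing; a disciplinary metadata standard would replace this ad hoc layout where one exists.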
WORKFLOW
Simple: Flow chart (steps in order)
• Temperature data, Salinity data
• Data import into Excel
• Data in spreadsheet
• Quality control & data cleaning
• "Clean" T & S data
• Analysis: mean, SD
• Summary statistics
• Graph production
Simple: Commented script
Resulting output
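A commented script for the temperature and salinity workflow might look roughly like the sketch below. The readings are invented, the -999.0 missing-value code is an assumption, and graph production is left out to keep the example short.

```python
import statistics

# Step 1: import the data (hard-coded here; all readings are invented).
temperature = [12.1, 12.4, -999.0, 12.9, 13.2]
salinity = [30.2, 30.5, 30.1, -999.0, 30.8]

# Assumed code for a missing reading; real data sets define their own.
MISSING = -999.0


# Step 2: quality control and data cleaning.
def clean(values):
    """Drop readings flagged with the missing-value code."""
    return [v for v in values if v != MISSING]


# Step 3: analysis (mean and sample standard deviation).
def summarize(values):
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
    }


clean_t = clean(temperature)
clean_s = clean(salinity)
summary = {
    "temperature": summarize(clean_t),
    "salinity": summarize(clean_s),
}

# Step 4: report the summary statistics.
for name, stats in summary.items():
    print(f"{name}: n={stats['n']} mean={stats['mean']:.2f} sd={stats['sd']:.2f}")
```

The comments double as workflow documentation: each step in the flow chart maps to one labelled block of the script.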
More Fancy: Kepler, Taverna
From Flickr by cogdog
BACKING UP: 3 places, 3 ways
From Flickr by lippo
From Flickr by see phar
Original
Near
Far
What software?
What hardware?
What personnel?
How often?
Set up reminders!
Test system
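"Test system" can be as simple as confirming that each backup copy matches the original. The sketch below copies a file to two other locations and compares SHA-256 checksums. The "near" and "far" folders are stand-ins inside a temporary directory; real backups would go to separate hardware or services.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256(path: Path) -> str:
    """Checksum of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def back_up(original: Path, destinations: list[Path]) -> list[Path]:
    """Copy the original into each destination folder and verify each copy."""
    expected = sha256(original)
    copies = []
    for folder in destinations:
        folder.mkdir(parents=True, exist_ok=True)
        copy = Path(shutil.copy2(original, folder))
        if sha256(copy) != expected:
            raise IOError(f"checksum mismatch for {copy}")
        copies.append(copy)
    return copies


# Demo: original plus a "near" and a "far" copy, all in a temp directory.
root = Path(tempfile.mkdtemp())
original = root / "original" / "data.csv"
original.parent.mkdir()
original.write_text("site,value\nA,3\n")

copies = back_up(original, [root / "near", root / "far"])
```

Running a check like this on a schedule covers both "How often?" and "Test system" with one small script.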
SHARING
Repositories
Institutional
Disciplinary
Journal
re3data.org
Sustainable formats
Open, non-proprietary
Commonly used in your discipline
Not encrypted or compressed
Review your DMP
Did you do what you said you would?
Photo credit Michael Ham
How Do I Learn More?
• Funding Mandates
http://chronicle.com/article/Where-Should-You-Keep-Your/231065/
http://datapub.cdlib.org/2013/02/28/the-new-ostp-policy-what-it-means/
• File Naming Conventions
http://www.exadox.com/en/articles/file-naming-convention-ten-rules-best-practice
• Folder Structures
http://www.damlearningcenter.com/resources/articles/best-practices-for-folder-organization/
• Metadata
http://www.dcc.ac.uk/resources/metadata-standards
• DataONE Primer
https://www.dataone.org/best-practices
• Software Carpentry
http://software-carpentry.org/
• Research Data Alliance
https://rd-alliance.org/
• Your Library
http://guides.lib.washington.edu/dmg
Tools
•Data Mgmt Planning
DMPTool https://dmptool.org/
•Metadata
Morpho https://www.dataone.org/software-tools/morpho
NOAA MERMaid http://www.ncddc.noaa.gov/metadata-standards/mermaid/
•Workflows
Kepler https://kepler-project.org/
Taverna http://www.taverna.org.uk/
•Sharing
re3data http://www.re3data.org/
GitHub https://github.com/
•Miscellaneous
EZID http://ezid.cdlib.org/
ImpactStory https://impactstory.org/
ORCID http://orcid.org/
Any Other Questions?
Stephanie Wright
Web data.blogspot.com
Twitter @UWLibsData
Email swright@uw.edu
