Data Matters for AGU Early Career Conference

Data Matters
Tips & Tools for Better Research
Carly Strasser, California Digital Library
carlystrasser@gmail.com
AGU Student & Early Career Scientist Conference
14 Dec 2014
From Flickr by Lachlan Donald

Why are you here?
Science: you’re (probably) doing it wrong

From Wikimedia Commons
Back in the day…
From ahswhg.wikispaces.com

Back in the day…
Da Vinci
Curie
Newton
classicalschool.blogspot.com
Darwin

From wikimedia
Such Internet!
So many tools!
So much data!
From Flickr by John Jobby

Digital data
From Flickr by Flickmor
From Flickr by DW0825
From Flickr by US Army Environmental Command
C. Strasser
Courtesy of WHOI
From Flickr by deltaMike

Digital data
+
Complex workflows

Scientists are bad at data management.

An embarrassing example…
From Flickr by lincolnblues

From Flickr by ransomtech
Didn’t share the data
Didn’t document the data (metadata)
Didn’t document provenance/workflow

Why should I care?
From Flickr by johntrainor

Because reproducibility is one of the fundamental tenets of science.
Because we need to be credible.

Because Fox News, creationism, and the war on science.

“Help us identify grants that are wasteful or that you don’t think are a good use of taxpayer dollars.”
Rep. Adrian Smith (R-Nebraska), a member of the House Committee on Science and Technology

Because it means faster progress.

Because you are a good person.

From Flickr by Redden-McAllister
From Flickr by Ken Cowell
From Flickr by Brandi Jordan

Map of Scientific Collaborations
flowingdata.com

Journals
Institutions
Funders
From Flickr by Eva Rinaldi Celebrity and Live Music Photographer

Feb 2013
… “Federal agencies investing in research and development (more than $100 million in annual expenditures) must have clear and coordinated policies for increasing public access to research products.”

From Flickr by Michael Tinkler

From Flickr by Big Swede Guy
Data management best practices

From Flickr by Mark Sardella
Plan before data collection

Planning
Design sample naming scheme
• Create a key (data dictionary)
• Make sure names are unique
• Define codes
From Flickr by zebbie

Planning
Design file naming scheme
Use descriptive file names
• Unique
• Reflect contents
Bad:
Mydata.xls
2001_data.csv
best version.txt
Better:
Eaffinis_nanaimo_2010_counts.xls
• Eaffinis: study organism
• nanaimo: site name
• 2010: year
• counts: what was measured
*Not for everyone
From R Cook, ESA Best Practices Workshop 2010
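A naming scheme is easier to keep when a small script builds and checks the names. The sketch below assumes a hypothetical organism_site_year_measurement pattern modeled on the "Better" example. The function names and the regular expression are illustrative, not part of the original slides.

```python
import re

# Hypothetical scheme: organism_site_year_measurement.extension
NAME_PATTERN = re.compile(
    r"^(?P<organism>[A-Za-z]+)_(?P<site>[A-Za-z]+)_"
    r"(?P<year>\d{4})_(?P<measured>[A-Za-z]+)\.(?P<ext>[a-z]+)$"
)


def build_name(organism, site, year, measured, ext="csv"):
    """Assemble a descriptive file name from its parts."""
    return f"{organism}_{site}_{year}_{measured}.{ext}"


def parse_name(filename):
    """Return the name parts as a dict, or None if the name breaks the scheme."""
    match = NAME_PATTERN.match(filename)
    return match.groupdict() if match else None
```

Names such as Mydata.xls or best version.txt do not match the pattern, so parse_name returns None for them and they can be flagged before they pile up.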

Planning
Design file organization
Biodiversity
  Lake
    Experiments
      Biodiv_H20_heatExp_2005to2008.csv
      Biodiv_H20_predatorExp_2001to2003.csv
      …
    Field work
      Biodiv_H20_PlanktonCount_2001toActive.csv
      Biodiv_H20_ChlAprofiles_2003.csv
      …
  Grassland
Consider…
• Dependencies?
• File formats?
• Time of collection?
• Order of analysis?
From S. Hampton

Planning
Design your spreadsheet
Constrain entries
Atomize
Break down spreadsheets
From Flickr by Ulleskelf

Planning
Consider a database
A relational database is
• A set of tables
• Relationships among the tables
• A language to specify & query the tables
An RDB provides
• Scalability: millions+ records
• Features for sub-setting, querying, sorting
• Reduced redundancy & entry errors
From Mark Schildhauer
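As a minimal sketch of those three pieces (tables, a relationship, a query language), the example below uses Python's built-in sqlite3 module with an in-memory database. The table names, column names, and values are hypothetical and only illustrate the idea.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, nothing is written to disk

# Two tables: site details are stored once, samples refer to them by key.
conn.execute("CREATE TABLE site (site_id TEXT PRIMARY KEY, site_name TEXT NOT NULL)")
conn.execute(
    """CREATE TABLE sample (
           sample_id INTEGER PRIMARY KEY,
           site_id TEXT NOT NULL REFERENCES site(site_id),
           year INTEGER NOT NULL,
           n_counted INTEGER NOT NULL
       )"""
)

# Hypothetical rows, for illustration only.
conn.executemany("INSERT INTO site VALUES (?, ?)", [("S1", "Site one"), ("S2", "Site two")])
conn.executemany(
    "INSERT INTO sample (site_id, year, n_counted) VALUES (?, ?, ?)",
    [("S1", 2009, 12), ("S1", 2010, 30), ("S2", 2010, 7)],
)

# Sub-setting and sorting through the query language instead of by hand.
rows = conn.execute(
    """SELECT site.site_name, sample.year, sample.n_counted
       FROM sample JOIN site ON sample.site_id = site.site_id
       WHERE sample.year = 2010
       ORDER BY site.site_name"""
).fetchall()
```

Because each site name is typed once in the site table, a spelling fix happens in one place rather than in every sample row.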

Pick a data repository
Store your data in a repository
Institutional archive
Discipline/specialty archive
From Flickr by torkildr
Planning

Planning
Ask a librarian
Repos of repos:
databib.org
re3data.org

Decide on preservation/backup
From Flickr by sepa synod
From Flickr by taberandrew
From Flickr by withassociates
What software?
What hardware?
What personnel?
How often?
Set up reminders!
Test system
Planning

Planning
Write a data management plan!
…document that describes what you will do with your data throughout the research project
From Flickr by Barbies Land

Planning
DMP components
• What will be collected
• Methods
• Standards
• Metadata
• Sharing/access
• Long-term storage
But they all have different requirements and express them in different ways

dmptool.org
Step-by-step wizard for generating DMP
create | edit | re-use | share
Free & open to community
Planning

During Data Collection & Entry
From Flickr by Julia Manzerova

During collection
Keep raw data raw
Realistically:
• Archive .csv version of raw data
• Make a “raw” tab in working data file
• Do all work on other tabs

During collection
Keep raw data raw
Raw data as .csv
R script for processing & analysis
Ideally:
• Use scripts to process data
• Save them with data
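The slides mention an R script. The sketch below shows the same idea in Python, assuming a hypothetical CSV layout. The raw file is only ever read, the cleaned rows go to a separate file, and a checksum confirms the raw file was not altered. The function names and the rule of dropping rows with an empty value are assumptions for illustration.

```python
import csv
import hashlib
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def process_raw(raw_path, out_path, column):
    """Write rows that have a value in `column` to out_path; never write to raw_path."""
    before = sha256_of(raw_path)
    kept = 0
    with open(raw_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if (row[column] or "").strip():  # skip rows where the value is empty
                writer.writerow(row)
                kept += 1
    # Guard against accidental edits to the raw file.
    if sha256_of(raw_path) != before:
        raise RuntimeError("raw file changed during processing")
    return kept
```

Saving a script like this next to the data records exactly how the working file was derived from the raw one.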

During collection
Document your workflow
Workflow: how you get from the raw data to the final products of your research
Simple workflow: flow chart
Temperature data + Salinity data → Data import into Excel → Data in spreadsheet → Quality control & data cleaning → “Clean” T & S data → Analysis: mean, SD → Summary statistics → Graph production

During collection
Document your workflow
Workflow: how you get from the raw data to the final products of your research
Commented script
• R, SAS, MATLAB…
• Well-documented code is
Easier to review
Easier to share
Easier to use for repeat analysis
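A commented script can double as workflow documentation. The sketch below is in Python rather than the R, SAS, or MATLAB named on the slide. It walks through a quality control step and a mean/SD analysis step, using made-up temperature readings and hypothetical plausibility limits.

```python
import statistics


def quality_control(values, low, high):
    """Step 1 (QC): drop missing readings and readings outside [low, high]."""
    return [v for v in values if v is not None and low <= v <= high]


def summarize(values):
    """Step 2 (analysis): return the mean and the sample standard deviation."""
    return statistics.mean(values), statistics.stdev(values)


# Hypothetical temperature readings in degrees C. None marks a missing value
# and 99.9 stands in for an instrument error code.
temperature_raw = [10.0, 12.0, None, 99.9, 14.0]

# Limits of -2 to 40 degrees C are an assumed plausibility range.
temperature_clean = quality_control(temperature_raw, low=-2.0, high=40.0)
mean_t, sd_t = summarize(temperature_clean)
```

Each step is a named function with a one-line description, so a reviewer can follow the path from raw readings to summary statistics without opening a spreadsheet.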

During collection
Constrain data entries
• Excel lists
• Data validation
• Google docs forms
Modified from K. Vanderbilt
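The same kind of constraint can be applied in a script when data arrive as plain files. This is a minimal sketch, assuming a hypothetical controlled vocabulary for a site column and a whole-number count column. Neither column comes from the slides.

```python
ALLOWED_SITES = {"lake", "grassland"}  # hypothetical controlled vocabulary


def validate_row(row):
    """Return a list of problems found in one data-entry row (empty if none)."""
    problems = []
    if row.get("site") not in ALLOWED_SITES:
        problems.append(f"site must be one of {sorted(ALLOWED_SITES)}")
    try:
        count = int(row.get("count", ""))
        if count < 0:
            problems.append("count must be non-negative")
    except (TypeError, ValueError):
        problems.append("count must be a whole number")
    return problems
```

Running every new row through a check like this catches entries such as "Lake " with a trailing space, or "many" in a numeric column, at entry time rather than during analysis.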

During collection
Atomize
One piece of information per cell
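As an illustration, a cell that mixes a number with its unit can be split into two columns. The sketch below assumes cells of the hypothetical form "12.5 C". The pattern and function name are illustrative only.

```python
import re

# A number, optional whitespace, then a unit made of letters or a percent sign.
CELL_PATTERN = re.compile(r"^\s*(-?\d+(?:\.\d+)?)\s*([A-Za-z%]+)\s*$")


def atomize(cell):
    """Split a combined 'value unit' cell into a (value, unit) pair."""
    match = CELL_PATTERN.match(cell)
    if match is None:
        raise ValueError(f"cannot split cell: {cell!r}")
    return float(match.group(1)), match.group(2)
```

Cells that do not fit the pattern raise an error instead of being guessed at, which keeps ambiguous entries visible.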

During collection
Break down spreadsheets
Fake a relational database
• Create parameter table
• Create a site table
From doi:10.3334/ORNLDAAC/777
From R Cook, ESA Best Practices Workshop 2010

During collection
Create metadata
Metadata: data reporting
• WHO created the data?
• WHAT is the content of the data set?
• WHEN was it created?
• WHERE was it collected?
• HOW was it developed?
• WHY was it developed?
From Flickr by //ichael Patric|{

During collection
Create metadata
Digital context
• Name of the data set
• The name(s) of the data file(s) in the data set
• Date the data set was last modified
• Example data file records for each data type file
• Pertinent companion files
• List of related or ancillary data sets
• Software (including version number) used to prepare/read the data set
• Data processing that was performed
Personnel & stakeholders
• Who collected
• Who to contact with questions
• Funders
Scientific context
• Scientific reason why the data were collected
• What data were collected
• What instruments (including model & serial number) were used
• Environmental conditions during collection
• Temporal & spatial resolution
• Standards or calibrations used
Information about parameters
• How each was measured or produced
• Units of measure
• Format used in the data set
• Precision & accuracy if known
Information about data
• Definitions of codes used
• Quality assurance & control measures
• Known problems that limit data use (e.g. uncertainty, sampling problems)
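One lightweight way to start is a small metadata record stored next to each data file. The sketch below checks that the who/what/when/where/how/why questions are answered before serializing the record as JSON. The field names are hypothetical shorthand and do not follow a formal metadata standard.

```python
import json

# Hypothetical field names taken from the who/what/when/where/how/why questions.
REQUIRED_FIELDS = ("who", "what", "when", "where", "how", "why")


def missing_fields(record):
    """Return the required fields that are absent or blank in a metadata record."""
    return [f for f in REQUIRED_FIELDS if not str(record.get(f) or "").strip()]


def to_sidecar_text(record):
    """Serialize the record as JSON text, refusing incomplete records."""
    missing = missing_fields(record)
    if missing:
        raise ValueError(f"metadata incomplete, missing: {missing}")
    return json.dumps(record, indent=2, sort_keys=True)
```

The returned text can be written to a file that shares the data file's name, so the description travels with the data.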

During collection
Create metadata
What is metadata?
Metadata standards…
• Provide structure to describe data
Common terms | definitions | language | structure
• Come in many flavors
EML, FGDC, ISO19115, DarwinCore,…
• Can be met using software tools
Morpho (EML), Metavist (FGDC), NOAA MERMaid (CSGDM)

During collection
Back up daily
Original | Near | Far
From Flickr by lippo
From Flickr by see phar
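A backup is only useful if the copy matches the original. The sketch below copies one file into several destination directories (for example one "near" and one "far" location) and verifies each copy by checksum. The function name and directory layout are assumptions. Scheduling the daily run is left to the operating system.

```python
import hashlib
import shutil
from pathlib import Path


def backup(original, destinations):
    """Copy `original` into each destination directory and verify each copy."""
    original = Path(original)
    digest = hashlib.sha256(original.read_bytes()).hexdigest()
    copies = []
    for directory in destinations:
        target = Path(directory) / original.name
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(original, target)  # copy2 also preserves file timestamps
        if hashlib.sha256(target.read_bytes()).hexdigest() != digest:
            raise IOError(f"backup verification failed for {target}")
        copies.append(target)
    return copies
```

Reading the whole file into memory keeps the sketch short. Very large files would need chunked hashing instead.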

During collection
Remember that data
management plan?
Revisit
Review
Revise

During collection
Schedule a time each
week or month
Revisit
Review
Revise
From Flickr by purplemattfish

Where to start?
From Flickr by celikins

Make a resolution
• Triage on current projects
• Get advisor, lab mates, collaborators on board
• Do better next time
From Flickr by Andy Graulund

Start working online
From Flickr by karindalziel

Open notebooks
http://datapub.cdlib.org

Write a DMP
dmptool.org
Step-by-step wizard for generating DMP
create | edit | re-use | share
Free & open to community

Find a repository
Where should I put my data?
databib.org

Learn new skills
Software Carpentry
www.software-carpentry.org

Other Fun Stuff
From Flickr by Micah Taylor

Credit in academia…
Impact Factors + Citation Counts
Altmetrics?

Altmetrics
Article-level metrics
Altmetrics for alt-products
Data
Code
Slides
Blogs
Downloads
Tweets
Mentions
Views
From Flickr by Skakerman

NSF funded DataNet Project
Office of Cyberinfrastructure
www.dataone.org

From Flickr by dotpolka
Manage & share
your data!

Website: carlystrasser.net
Email: carlystrasser@gmail.com
Twitter: @carlystrasser
Slides: slideshare.net/carlystrasser
