3. Main topics
• Hierarchical organization
• Folders in folders
• Open Science Framework
• File naming
• Human readability
• Machine readability
• “Tidy” data in spreadsheets
4. Folder systems
• Organize your data hierarchically
• Identify ways to divide your data into categories (attributes)
• The top level of the hierarchy should reflect the most important attribute
6. Questions to ask
• What kinds of files are there? (See data inventory)
• How could you group them?
• Project?
• Time?
• Location?
• File type?
• What are the most important attributes?
7. Example: Lou the first year
Lou is a first-year graduate student working on a project in a
biomedical research laboratory. He’s trying to decipher data
left by a former postdoc as a starting point for his thesis project. For
one year, the postdoc recorded weight daily and cytokine
levels monthly from 16 mice. Half were infected with a
parasite; half were treated with saline.
• List the attributes of his project.
• How would you rank these attributes?
10. Example: Lou the first year
Attributes
• Time
• Infection Status
• Data Type
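Once the attributes are ranked, the folder hierarchy follows from the ranking. The sketch below is one possible arrangement, not the answer to the exercise: it assumes data type is ranked first, infection status second, and time third. The root folder name, the value labels, and the `folder_paths` helper are all hypothetical.

```python
from itertools import product
from pathlib import PurePosixPath

# Attribute values based on Lou's scenario. The ranking (data type, then
# infection status, then time) and all labels are assumptions for illustration.
data_types = ["weight", "cytokine"]
infection_status = ["infected", "saline"]
months = [f"month-{m:02d}" for m in range(1, 13)]


def folder_paths(root, *levels):
    """Return one path per combination of values, most important level first."""
    return [PurePosixPath(root, *combo) for combo in product(*levels)]


paths = folder_paths("lou_mouse_project", data_types, infection_status, months)
for path in paths[:3]:
    print(path)
```

Reordering the arguments to `folder_paths` changes which attribute sits at the top level, which makes it easy to compare rankings before moving any files.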
11. Exercise: Organize files
• Download Lou’s files (look in the README file for insight)
• http://tinyurl.com/hvna4mg
• Create a hierarchical folder structure for Lou
• Drag his files into the correct folders
• Fix Lou’s README
• Bonus: think about how you’d organize your data.
12. Tool: Open Science Framework
• Components
• Add-ons
• Contributors
• Wiki
http://help.osf.io/m/collaborating/l/524109-using-the-wiki
http://www.slideshare.net/DuraSpace/121014-slides-roadmap-to-the-future-of-share
13. Organization tips
• Be consistent
• One directory per project
• Separate components for
• Raw data
• Processed data
• Code
• Output
• Make raw data read-only
• Make README files
http://help.osf.io/m/60347/l/611391-organizing-files
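These tips can also be applied to a local project folder with a short script. The following is a minimal sketch, assuming hypothetical folder names (`raw_data`, `processed_data`, `code`, `output`) and a placeholder README; it runs in a temporary directory so nothing on disk is overwritten.

```python
import stat
import tempfile
from pathlib import Path

# Folder names are assumptions; use whatever convention your project agrees on.
COMPONENTS = ["raw_data", "processed_data", "code", "output"]


def scaffold_project(project_dir):
    """Create one directory per component plus a README in the project root."""
    project_dir = Path(project_dir)
    for name in COMPONENTS:
        (project_dir / name).mkdir(parents=True, exist_ok=True)
    readme = project_dir / "README.txt"
    if not readme.exists():
        readme.write_text("Describe the data, folders, and naming rules here.\n")
    return project_dir


def make_read_only(folder):
    """Remove write permission from every file under folder."""
    write_bits = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH
    for path in Path(folder).rglob("*"):
        if path.is_file():
            path.chmod(path.stat().st_mode & ~write_bits)


# Demo in a temporary directory with a placeholder raw file.
demo_root = Path(tempfile.mkdtemp())
project = scaffold_project(demo_root / "mouse_project")
raw_file = project / "raw_data" / "2016-05-07-mouse-weight.tsv"
raw_file.write_text("mouse_id\tweight_g\n")  # placeholder header, not real data
make_read_only(project / "raw_data")
```

On Windows, `chmod` only toggles the read-only flag, so the effect is similar but coarser than on Unix-like systems.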
14. Components
• “Subprojects”
• Separate privacy settings, contributors, wiki, add-ons, and files.
• Examples:
• Different projects: https://osf.io/82fba/
• Clinical: https://osf.io/gq4mz/
• Manuscript: https://osf.io/if7ug/
• Collaboration: https://osf.io/ezcuj/
16. Don’t panic!
• Just try something
• There’s no right answer
• Be consistent
• Write a README.txt file
http://4vector.com/i/free-vector-don-t-panic-clip-art_103946_Dont_Panic_clip_art_hight.png
18. Use descriptive names
• Bad name: file.txt
• OK name: 05-07-2016-mouse-data.txt
• Good name: 2016-05-07-mouse-weight.tsv
• Human readability: name contains information about content
19. Go from general to specific
• Bad name: rep1-5-7-2016-gene-expression.csv
• Good name: 2016-05-07-gene-expression-rep1.csv
• Machine readability: can be sorted meaningfully
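A quick way to see the difference is to sort both naming styles. In this sketch, only the 2016-05-07 names come from the slide; the other dates and replicates are made up for illustration.

```python
# General-to-specific names: year-month-day first, zero-padded.
good_names = [
    "2016-05-07-gene-expression-rep1.csv",
    "2015-12-01-gene-expression-rep1.csv",
    "2016-05-07-gene-expression-rep2.csv",
    "2016-01-15-gene-expression-rep1.csv",
]

# Specific-to-general names: replicate first, unpadded month-day-year.
bad_names = [
    "rep1-5-7-2016-gene-expression.csv",
    "rep1-12-1-2015-gene-expression.csv",
    "rep2-5-7-2016-gene-expression.csv",
    "rep1-1-15-2016-gene-expression.csv",
]

sorted_good = sorted(good_names)  # plain alphabetical sort, as a file browser would do
sorted_bad = sorted(bad_names)

print("\n".join(sorted_good))
print("\n".join(sorted_bad))
```

The first list comes out in chronological order. In the second, the December 2015 file lands after the January 2016 file, because the sort compares characters and knows nothing about dates.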
20. Avoid abbreviations
• Bad name: “sprlbgp1”
• Good name: “spencer_lab_group_1”
• Human readability: no one understands your acronyms
21. Avoid spaces
• Alternatives
• Dashes-are-cool.txt
• I_also_like_underscores.txt
• CamelCaseIsNeatToo.txt
• Machine readability: shells and many programs treat spaces as delimiters
• Human readability: dashes, underscores, and capital letters still separate words
22. Avoid special characters
• Bad characters: ~ ! @ # $ % ^ & * ( ) ` ; < > ? , [ ] { } ' "
• Machine readability: many have special meanings in shells and scripting languages
• Example: in Unix shells, ~ expands to your home directory
• Alternatives: underscore (_), dash (-), dot (.)
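The advice on spaces and special characters can be applied automatically when cleaning up existing files. The helper below is a minimal sketch: `sanitize_filename` is a hypothetical name, and it assumes that replacing every disallowed run of characters with a single underscore is acceptable for your project.

```python
import re


def sanitize_filename(name):
    """Keep ASCII letters, digits, underscore, dash, and dot; replace the rest."""
    # Collapse each run of disallowed characters (spaces included) into "_".
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", "_", name.strip())
    # Drop an underscore left directly before a dot, e.g. "final_.txt".
    cleaned = re.sub(r"_\.", ".", cleaned)
    return cleaned.strip("_")


print(sanitize_filename("mouse data (final)!.txt"))
print(sanitize_filename("2016-05-07-mouse-weight.tsv"))
```

Names that already follow the rules pass through unchanged. Note that this sketch also replaces non-ASCII letters, which may be too strict for some projects.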
23. Be consistent
• Establishing standards makes data more findable
• Extending standards to everyone who works on a project is even better
27. Spreadsheets as lab notebook
• Color coding
• Formatting
• Notes
• Calculations
• Graphs/Tables
28. Downsides
• Computers don’t understand notes/formatting/color coding
• Calculations/Graphs/Tables in spreadsheets are inefficient
• “Tidy data” + automation = saved time
29. Using spreadsheets wisely
• Don’t put multiple tables in one sheet
• Don’t use multiple sheets
• Use descriptive field names
• Don’t mix notes and data
30. Tidy Data
1. Columns as variables
• Don’t combine multiple pieces of info in one column
2. Rows as observations
• One measured value
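As an illustration of these two rules, the sketch below reshapes a "wide" table (one column per date) into a tidy one (one row per measured value). The table contents, column names, and the `wide_to_tidy` helper are all hypothetical; the values are not Lou's data.

```python
import csv
import io

# Hypothetical wide sheet: one row per mouse, one column per weighing date.
wide_csv = """mouse_id,treatment,2016-05-07,2016-05-08
m01,infected,21.3,21.1
m02,saline,22.0,22.4
"""


def wide_to_tidy(text, id_columns, variable_name, value_name):
    """Turn every non-id column into its own row: one observation per row."""
    tidy_rows = []
    for row in csv.DictReader(io.StringIO(text)):
        for column, value in row.items():
            if column in id_columns:
                continue
            tidy_row = {key: row[key] for key in id_columns}
            tidy_row[variable_name] = column  # the old header becomes a variable
            tidy_row[value_name] = value      # the cell becomes the measured value
            tidy_rows.append(tidy_row)
    return tidy_rows


tidy = wide_to_tidy(wide_csv, ["mouse_id", "treatment"], "date", "weight_g")
for row in tidy:
    print(row)
```

In the tidy version the date is a variable in its own column rather than information hidden in the header, so filtering or sorting by date no longer requires editing the table layout.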
31. Exercise: Tidy Lou’s data
• Open MouseInventory.xls
• Is he using spreadsheets wisely?
• Is each column a variable?
• Is each row an observation?
• Open the January files for both weight and cytokines
• What variables are being measured? (i.e., what columns should we have?)
• Can we combine some of these tables?
32. Exercise: Data Carpentry ecology
• Lesson: http://www.datacarpentry.org/spreadsheet-ecology-lesson/
• File: https://ndownloader.figshare.com/files/2252083
• Goal: combine data from first 2 tabs into one table
• Make a new tab, don’t edit the raw data!