1. Michael M. Hoffman
Princess Margaret Cancer Centre
Department of Medical Biophysics
Department of Computer Science
University of Toronto
http://hoffmanlab.org/
Twitter: @michaelhoffman
Data challenges for researchers
2. Who I am
• Scientist at Princess Margaret Cancer Centre and Assistant Professor at University of Toronto
• Previously part of the Encyclopedia of DNA Elements (ENCODE) Project
• Develop computational methods for big genomic data
3. View of an analysis pipeline
Source data
Intermediate files
Data products
Publications
4.
5.
6. Challenges in data acquisition
Showstoppers
• Data available “on request”
• Data available on application or agreement
Timewasters
• Data in inappropriate format
• Data in different format than I need
• Data doesn’t comply with format specification
7. More challenges in data acquisition
Annoyances
• Transferring
• Storing
• Staleness
• Deletion
• Organization
• Discovery
8. Challenges in data distribution
• Permanence
• Job changes
• Embargo pre-publication
• Space
• Waiting for approval
• Enabling acquisition by external services
• Graphical-only interfaces
• Ongoing costs
9. Challenges in intermediate files
• Poor organization
• Big
• Don’t always need them, sometimes do
• Sometimes need someone else’s intermediate files
• Should be reproducible given source data and pipeline but often aren’t
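One way to test the reproducibility point is to rerun the pipeline and compare the regenerated intermediate file against the stored one by checksum. The following is a minimal sketch, not a method from the talk: the function names `sha256sum` and `is_reproduced` and the file names in the demo are hypothetical, and a byte-for-byte comparison is an assumption about what "reproducible" should mean.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256sum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks so big files fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_reproduced(stored: Path, regenerated: Path) -> bool:
    """True when the regenerated intermediate file is byte-identical to the stored one."""
    return sha256sum(stored) == sha256sum(regenerated)


if __name__ == "__main__":
    # Hypothetical file names, created in a throwaway directory for illustration.
    with tempfile.TemporaryDirectory() as tmp:
        stored = Path(tmp) / "stored_intermediate.txt"
        rerun = Path(tmp) / "rerun_intermediate.txt"
        stored.write_text("chr1\t100\t200\n")
        rerun.write_text("chr1\t100\t200\n")
        print(is_reproduced(stored, rerun))
```

Byte identity is a strict test. Pipelines that embed timestamps in headers or write records in nondeterministic order will fail it even when the results are scientifically equivalent, so a looser, format-aware comparison may be needed in practice.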
10. My dream solution
Policy: Data must be deposited in an archive and available at publication time
Technical: Trivially simple multi-level data caching
Economic: Central archival space should cost researchers less than keeping their own copy
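The technical wish for multi-level data caching can be illustrated with plain directories standing in for each level. This is a sketch under stated assumptions rather than a description of any existing system: the function `fetch` and the level names (laptop, lab server, archive) are hypothetical, and a real archive would be reached over a network rather than a local path.

```python
import shutil
import tempfile
from pathlib import Path


def fetch(name: str, levels: list[Path]) -> Path:
    """Return a path to `name` in the nearest cache level.

    `levels` is ordered nearest first; the last entry plays the role of the
    central archive. On a miss at nearer levels, the file is copied into each
    of them so that later lookups are served locally.
    """
    for depth, level in enumerate(levels):
        source = level / name
        if source.exists():
            # Populate every level nearer than the one where the file was found.
            for nearer in levels[:depth]:
                target = nearer / name
                target.parent.mkdir(parents=True, exist_ok=True)
                shutil.copy2(source, target)
            return levels[0] / name
    raise FileNotFoundError(f"{name} not found in any cache level")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        # Hypothetical three-level hierarchy: laptop, lab server, central archive.
        levels = [root / "laptop", root / "lab_server", root / "archive"]
        levels[-1].mkdir()
        (levels[-1] / "assay.txt").write_text("signal values\n")
        print(fetch("assay.txt", levels))  # first call fills the nearer levels
        print(fetch("assay.txt", levels))  # second call is a local hit
```

The sketch leaves out the parts that cause the annoyances listed earlier, namely staleness checks, eviction when space runs out, and deletion policy. A usable tool would need all three.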
Editor's Notes
ENCODE: 12,000 assays, and many multiples of that in number of datasets.
Guessing about 2-20 GB of accessioned data per assay, so the total is in the range of hundreds of terabytes to single-digit petabytes.
Most evaluation of researchers relies primarily on the publications, and that is primarily what a lot of researchers are interested in.
Wastes of time and money; some of this should be fixed by gating at publication.
“advanced file copying”
Most have to do with local copies
Want to avoid “solutions” that are like Canadian Common CV but for data science