Michael M. Hoffman
Princess Margaret Cancer Centre
Department of Medical Biophysics
Department of Computer Science
University of Toronto
http://hoffmanlab.org/
Twitter: @michaelhoffman
Data challenges for researchers
Who I am
• Scientist at Princess Margaret Cancer
Centre/Asst Professor at University of Toronto
• Previously part of Encyclopedia of DNA
Elements (ENCODE) Project
• Develop computational methods for big
genomic data
View of an analysis pipeline
Source data
Intermediate files
Data products
Publications
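A minimal Python sketch of this pipeline view, with hypothetical file names and trivial stand-in transformations; the real analyses and formats vary by project.

```python
# Minimal sketch of the pipeline view above: source data -> intermediate files -> data products.
# File names and transformations are hypothetical stand-ins.
from pathlib import Path

def acquire_source_data(workdir: Path) -> Path:
    """Stand-in for downloading or copying raw source data."""
    raw = workdir / "source.tsv"
    raw.write_text("sample\tvalue\nA\t1\nB\t2\n")
    return raw

def make_intermediate(raw: Path) -> Path:
    """Transform source data into an intermediate file (often large, often discarded)."""
    intermediate = raw.with_name("intermediate.tsv")
    body = [line for line in raw.read_text().splitlines()[1:] if line]
    intermediate.write_text("\n".join(body) + "\n")
    return intermediate

def make_data_product(intermediate: Path) -> Path:
    """Summarize intermediates into the data product that backs a publication."""
    product = intermediate.with_name("product.txt")
    product.write_text(f"records: {len(intermediate.read_text().splitlines())}\n")
    return product

if __name__ == "__main__":
    workdir = Path("work")
    workdir.mkdir(exist_ok=True)
    print("data product:", make_data_product(make_intermediate(acquire_source_data(workdir))))
```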
Challenges in data acquisition
Showstoppers
• Data available “on request”
• Data available on application or agreement
Timewasters
• Data in inappropriate format
• Data in different format than I need
• Data doesn’t comply with format specification
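One concrete form of the last timewaster is a file that claims a standard format but violates its specification. Below is a hedged sketch of a minimal compliance check, using BED purely as an illustrative format (the slide does not name one); a real validator enforces many more rules.

```python
# Sketch of a minimal format-compliance check; BED is only an illustrative choice.
import sys

def check_bed_line(line: str, lineno: int) -> list[str]:
    """Return specification problems found on one BED data line."""
    problems = []
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        return [f"line {lineno}: fewer than 3 tab-separated fields"]
    chrom, start, end = fields[0], fields[1], fields[2]
    if not chrom:
        problems.append(f"line {lineno}: empty chrom field")
    if not (start.isdigit() and end.isdigit()):
        problems.append(f"line {lineno}: start/end are not non-negative integers")
    elif int(end) < int(start):
        problems.append(f"line {lineno}: end is less than start")
    return problems

def check_bed_file(path: str) -> list[str]:
    problems = []
    with open(path) as bed:
        for lineno, line in enumerate(bed, start=1):
            # header and blank lines are allowed by the format
            if not line.strip() or line.startswith(("#", "track", "browser")):
                continue
            problems.extend(check_bed_line(line, lineno))
    return problems

if __name__ == "__main__":
    issues = check_bed_file(sys.argv[1])
    print("\n".join(issues) if issues else "no problems found")
```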
More challenges in data acquisition
Annoyances
• Transferring
• Storing
• Staleness
• Deletion
• Organization
• Discovery
Challenges in data distribution
• Permanence
• Job changes
• Embargo pre-publication
• Space
• Waiting for approval
• Enabling acquisition by external services
• Graphical-only interfaces
• Ongoing costs
Challenges in intermediate files
• Poor organization
• Big
• Don’t always need them, sometimes do
• Sometimes need someone else’s intermediate
files
• Should be reproducible given the source data and pipeline, but often aren't
My dream solution
Policy: Data must be deposited in an archive and available at publication time
Technical: Trivially simple multi-level data caching (sketch below)
Economic: Central archival space should cost the researcher less than keeping their own copy
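The slides don't specify what "trivially simple multi-level data caching" would look like; here is one hedged sketch, assuming a local cache directory, a shared group cache, and a central archive reachable over HTTP (all of the paths and the URL are hypothetical).

```python
# Sketch of multi-level data caching: local cache -> shared cache -> central archive.
# All paths and the archive URL are hypothetical.
import shutil
import urllib.request
from pathlib import Path

LOCAL_CACHE = Path.home() / ".cache" / "datasets"
SHARED_CACHE = Path("/shared/lab/datasets")            # e.g. a group or institutional filesystem
ARCHIVE_URL = "https://archive.example.org/datasets"   # stand-in for a central archive

def fetch(accession: str) -> Path:
    """Return a local path for a dataset, checking each cache level before the archive."""
    local = LOCAL_CACHE / accession
    if local.exists():
        return local
    local.parent.mkdir(parents=True, exist_ok=True)

    shared = SHARED_CACHE / accession
    if shared.exists():
        shutil.copy(shared, local)                     # promote into the local cache
        return local

    with urllib.request.urlopen(f"{ARCHIVE_URL}/{accession}") as response, open(local, "wb") as out:
        shutil.copyfileobj(response, out)              # last resort: the central archive
    return local
```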


Editor's Notes

  • #3 ENCODE: 12,000 assays, and many multiples of that in terms of number of datasets. Guessing about 2-20 GB of accessioned data per assay, so in the hundreds of terabytes to single-digit petabytes (back-of-envelope sketch after these notes).
  • #4 Most evaluation of researchers relies primarily on the Publications, and that's primarily what a lot of researchers are interested in.
  • #7 Wastes of time and money; some of this should be fixed at publication gating. "Advanced file copying."
  • #8 Most have to do with local copies.
  • #11 Want to avoid "solutions" that are like the Canadian Common CV, but for data science.
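A back-of-envelope check of the volume guess in note #3, treating the 2-20 GB per assay figure as given and picking an illustrative 5x dataset-to-assay multiplier (the note only says "many multiples").

```python
# Back-of-envelope arithmetic for the ENCODE data-volume guess in note #3.
assays = 12_000
gb_per_assay_low, gb_per_assay_high = 2, 20      # the note's guess for accessioned data per assay

low_tb = assays * gb_per_assay_low / 1_000       # ~24 TB
high_tb = assays * gb_per_assay_high / 1_000     # ~240 TB
print(f"assay-level total: {low_tb:.0f}-{high_tb:.0f} TB")

# "Many multiples of that in terms of number of datasets": with an illustrative 5x,
# the high end reaches roughly a petabyte.
print(f"with 5x datasets:  {5 * low_tb / 1_000:.2f}-{5 * high_tb / 1_000:.1f} PB")
```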