Data challenges for researchers

Michael Hoffman

Draft slides for the CASRAI Reconnect16 panel “Canada’s Research Data Management (RDM) Ecosystem”

Who I am
• Scientist at Princess Margaret Cancer Centre/Assistant Professor at University of Toronto
• Previously part of Encyclopedia of DNA Elements (ENCODE) Project
• Develop computational methods for big genomic data

View of an analysis pipeline
• Source data
• Intermediate files
• Data products
• Publications

Challenges in data acquisition
Showstoppers
• Data available “on request”
• Data available on application or agreement
Timewasters
• Data in inappropriate format
• Data in different format than I need
• Data doesn’t comply with format specification

More challenges in data acquisition
Annoyances
• Transferring
• Storing
• Staleness
• Deletion
• Organization
• Discovery

Challenges in data distribution
• Permanence
• Job changes
• Embargo pre-publication
• Space
• Waiting for approval
• Enabling acquisition by external services
• Graphical-only interfaces
• Ongoing costs

Challenges in intermediate files
• Poor organization
• Big
• Don’t always need them, sometimes do
• Sometimes need someone else’s intermediate files
• Should be reproducible given source data and pipeline but often aren’t
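
The last point can be made concrete with a small sketch, which is an illustration and not something from the talk. If an intermediate file is fully determined by its source data and pipeline, it can be named by a hash of exactly those inputs; a file with a matching name can then be reused or safely deleted and rebuilt. The function name `intermediate_key` and the parameters `pipeline_version` and `params` are hypothetical.

```python
import hashlib
import json


def intermediate_key(source_bytes: bytes, pipeline_version: str, params: dict) -> str:
    """Derive a stable identifier for an intermediate file from its inputs.

    Hypothetical helper: the same source data, pipeline version, and
    parameters always give the same key; changing any of them gives a new one.
    """
    digest = hashlib.sha256()
    # Hash the source data first so large inputs contribute a fixed-size value.
    digest.update(hashlib.sha256(source_bytes).digest())
    digest.update(b"\0")
    digest.update(pipeline_version.encode("utf-8"))
    digest.update(b"\0")
    # sort_keys makes the serialization independent of dict insertion order.
    digest.update(json.dumps(params, sort_keys=True).encode("utf-8"))
    return digest.hexdigest()


if __name__ == "__main__":
    key = intermediate_key(b"raw reads", "v1.2", {"threshold": 0.05})
    print(key)
```

This only helps when the pipeline itself is deterministic; steps that depend on random seeds or unpinned software versions would need those recorded in `params` as well.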

My dream solution
Policy: Data must be deposited in archive and available at publication time
Technical: Trivially simple multi-level data caching
Economic: Central archival space should cost researcher less than keeping their own copy
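
As a rough illustration of what "multi-level data caching" could look like, the sketch below looks for a file in a list of storage levels ordered fastest first and copies it into the faster levels on a hit. It is a minimal sketch under assumed conditions, not the design from the talk: the function name `fetch` and the three example levels (local scratch, lab server, central archive) are hypothetical, and every level is assumed to be reachable as an ordinary directory.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Sequence


def fetch(name: str, levels: Sequence[Path]) -> Path:
    """Return the path to `name` in the fastest level.

    `levels` is ordered fastest first. On a hit in a slower level, the file
    is copied into every faster level so the next lookup stops earlier.
    """
    for depth, level in enumerate(levels):
        candidate = level / name
        if candidate.exists():
            # Fill the faster levels, nearest to the hit first.
            for faster in reversed(levels[:depth]):
                faster.mkdir(parents=True, exist_ok=True)
                shutil.copy2(candidate, faster / name)
            return levels[0] / name
    raise FileNotFoundError(name)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        # Hypothetical levels: local scratch, lab server, central archive.
        levels = [root / "local", root / "lab", root / "archive"]
        levels[2].mkdir()
        (levels[2] / "assay.dat").write_text("example")
        print(fetch("assay.dat", levels))
```

A real system would also need eviction at the faster levels and a check that cached copies are not stale, which this sketch leaves out.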

Data challenges for researchers

Michael M. Hoffman
Princess Margaret Cancer Centre
Department of Medical Biophysics
Department of Computer Science
University of Toronto
http://hoffmanlab.org/
Twitter: @michaelhoffman

Editor's Notes

ENCODE: 12,000 assays, and many multiples of that in number of datasets. Guessing about 2-20 GB of accessioned data per assay, so the total is in the hundreds of terabytes to single-digit petabytes.
Most evaluation of researchers relies primarily on the publications, and publications are primarily what a lot of researchers are interested in.
Wastes of time and money. Some of this should be fixed by gating at publication time.
“Advanced file copying”
Most of these have to do with local copies.
Want to avoid “solutions” that are like the Canadian Common CV but for data science.
