Brett Whitty
ICGC Data Coordination Center Curation Manager
Ontario Institute for Cancer Research
Open Cloud Consortium
“Towards a Biomedical Commons Cloud” Working Group
April, 2013
Some Considerations for Enabling Users of
International Cancer Genome Consortium (ICGC)
Data in a Biomedical Compute Cloud
53 projects, 16 countries/regions, >25,000 tumors committed
ICGC Data
Current data (represents ~1/3 of goal):
• ~100GB of gzipped analysis results (open access)
◦ hosted via HTTP(S)/FTP at ICGC DCC data portal
• ~700TB raw sequencing and array datasets* (controlled access)
◦ hosted at EBI EGA repository (and other public repos)
*excluding data from TCGA projects (~50% of ICGC member projects are TCGA projects)
ICGC Data Access
• Blanket access to ICGC data granted by ICGC Data Access & Compliance Office (DACO)
◦ Excludes TCGA data, for which access is granted by the TCGA project
• DACO, ICGC.org & DCC support OpenID for authentication
◦ Access to ICGC & TCGA data at NCBI, CGHub, and EBI EGA uses different authentication mechanisms
• ICGC datasets are presently distributed across several public repositories
◦ Presents a challenge to end users
◦ Need to aggregate the data through a single access point, virtually if not physically
• Ideally a single user sign-on method would be recognized by all resources
◦ May be impossible due to technical/organizational challenges
ICGC Computes (1)
• No common ICGC data analysis centers (yet)
• No common ICGC workflow systems (yet)
• No common ICGC pipelines (yet)
ICGC Computes (2)
• Who are the cloud-based data consumers?
◦ What do they need/want?
• Is it sufficient for ICGC to simply provide datasets?
• Does ICGC need to also provide canned analysis pipelines?
◦ Reproduce methods used in ICGC publications?
◦ Who creates/maintains these?
◦ Using which workflow system?
Other Issues
• Can ICGC DACO assure authorization and compliance of cloud-based data consumers?
◦ Auditing, revoking access, etc.
◦ How is this achieved?
• What are the support needs of “ICGC Cloud” users?
◦ How much effort will they require?
◦ From whom?
• What is the minimal metadata we need to collect to make the data useful?
◦ Who ensures this?
