2013-B_Whitty-biomedical_cloud



This document discusses considerations for enabling access to and use of data from the International Cancer Genome Consortium (ICGC) in a biomedical compute cloud. It notes that ICGC spans 53 projects in 16 countries/regions with more than 25,000 tumors committed, and that current data comprise about 100GB of open-access analysis results and about 700TB of controlled-access raw sequencing and array data (excluding TCGA projects) distributed across several public repositories. It raises questions about how to aggregate this distributed data through a single access point, what compute and analysis resources users may need, who would create and maintain common pipelines, how to ensure authorization and compliance of cloud-based data users, and what minimal metadata is required to make the data useful.

Brett Whitty
ICGC Data Coordination Center Curation Manager
Ontario Institute for Cancer Research
Open Cloud Consortium
“Towards a Biomedical Commons Cloud” Working Group
April, 2013
Some Considerations for Enabling Users of
International Cancer Genome Consortium (ICGC)
Data in a Biomedical Compute Cloud

• 53 projects
• 16 countries/regions
• > 25,000 tumors committed

ICGC Data
Current data (represents ~1/3 of goal):
• ~100GB of gzipped analysis results (open access)
◦ hosted via HTTP(S)/FTP at ICGC DCC data portal
• ~700TB raw sequencing and array datasets* (controlled access)
◦ hosted at EBI EGA repository (and other public repos)
*excluding data from TCGA projects (~50% of ICGC member projects are TCGA projects)
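As a rough back-of-the-envelope illustration of the figures on this slide, the sketch below extrapolates the eventual data volume if the current holdings really are about one third of the goal. It assumes, purely for illustration, that volume scales linearly with progress toward the goal and that 1 PB = 1000 TB. The function name is hypothetical, and the result is not an ICGC estimate.

```python
def projected_total(current_size: float, fraction_of_goal: float) -> float:
    """Linearly extrapolate a final size from a partial one (illustrative assumption)."""
    if not 0 < fraction_of_goal <= 1:
        raise ValueError("fraction_of_goal must be in (0, 1]")
    return current_size / fraction_of_goal


# Figures taken from the slide: ~100 GB open access, ~700 TB controlled access,
# together representing ~1/3 of the goal (TCGA projects excluded).
open_access_gb = projected_total(100, 1 / 3)
controlled_tb = projected_total(700, 1 / 3)

print(f"open access:       ~{open_access_gb:.0f} GB")
print(f"controlled access: ~{controlled_tb:.0f} TB (~{controlled_tb / 1000:.1f} PB)")
```

Under these assumptions the controlled-access data would grow to roughly 2.1 PB, which suggests why moving compute to the data may matter more than moving the data.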

ICGC Data Access
• Blanket access to ICGC data granted by ICGC Data Access & Compliance Office (DACO)
◦ Excludes TCGA data for which access is granted by the TCGA project
• DACO, ICGC.org & DCC support OpenID for authentication
◦ Access to ICGC & TCGA data at NCBI, CGHub, and EBI EGA uses different authentication mechanisms
• ICGC datasets are presently distributed across several public repositories
◦ Presents a challenge to end users
◦ Need to aggregate the data through a single access point, virtually if not physically
• Ideally a single user sign-on method would be recognized by all resources
◦ May be impossible due to technical/organizational challenges
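One way to picture aggregating the data "virtually if not physically" is a catalog that records where each dataset lives and which login it needs, while the data itself stays in place. The following is a minimal sketch under stated assumptions: the class names, dataset identifiers, and authentication labels are all hypothetical placeholders, not real ICGC, EGA, or CGHub identifiers or configuration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetLocation:
    dataset_id: str      # hypothetical identifier
    repository: str      # e.g. "ICGC DCC", "EBI EGA", "CGHub"
    access: str          # "open" or "controlled"
    auth_mechanism: str  # placeholder label, not a real configuration value


class VirtualCatalog:
    """Single lookup point over datasets that physically stay where they are."""

    def __init__(self) -> None:
        self._index: dict[str, list[DatasetLocation]] = {}

    def register(self, location: DatasetLocation) -> None:
        self._index.setdefault(location.dataset_id, []).append(location)

    def locate(self, dataset_id: str) -> list[DatasetLocation]:
        # Return a copy so callers cannot mutate the index.
        return list(self._index.get(dataset_id, []))

    def required_logins(self, dataset_ids: list[str]) -> set[str]:
        """Distinct auth mechanisms needed to reach the controlled datasets."""
        return {
            loc.auth_mechanism
            for ds in dataset_ids
            for loc in self.locate(ds)
            if loc.access == "controlled"
        }


catalog = VirtualCatalog()
catalog.register(DatasetLocation("example-analysis", "ICGC DCC", "open", "none"))
catalog.register(DatasetLocation("example-raw-1", "EBI EGA", "controlled", "repo-A-login"))
catalog.register(DatasetLocation("example-raw-2", "CGHub", "controlled", "repo-B-login"))

logins = catalog.required_logins(["example-analysis", "example-raw-1", "example-raw-2"])
print(sorted(logins))
```

The size of the set returned by `required_logins` is a simple measure of the sign-on burden on an end user; a single recognized sign-on method would reduce it to one.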

ICGC Computes (1)
• No common ICGC data analysis centers (yet)
• No common ICGC workflow systems (yet)
• No common ICGC pipelines (yet)

ICGC Computes (2)
• Who are the cloud-based data consumers?
◦ What do they need/want?
• Sufficient to have ICGC simply provide datasets?
• Does ICGC need to also provide canned analysis pipelines?
◦ Reproduce methods used in ICGC publications?
◦ Who creates/maintains these?
◦ Using which workflow system?

Other Issues
• Can ICGC DACO assure authorization and compliance of cloud-based data consumers?
◦ Auditing, revoking access, etc.
◦ How is this achieved?
• What are the support needs of “ICGC Cloud” users?
◦ How much effort will they require?
◦ From whom?
• What is the minimal metadata we need to collect to make the data useful?
◦ Who ensures this?
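The auditing and revocation question above can be made concrete with a small sketch. It is not a description of how DACO works; it only illustrates, under the assumption of a single in-memory registry, what "grant, check, revoke, audit" might look like in a cloud environment. All class, user, and dataset names are hypothetical.

```python
from __future__ import annotations

from datetime import datetime, timezone


class AccessRegistry:
    """Toy registry: approved users, with every decision written to an audit log."""

    def __init__(self) -> None:
        self._approved: set[str] = set()
        # Each entry: (UTC timestamp, user, event)
        self.audit_log: list[tuple[str, str, str]] = []

    def _record(self, user: str, event: str) -> None:
        timestamp = datetime.now(timezone.utc).isoformat()
        self.audit_log.append((timestamp, user, event))

    def grant(self, user: str) -> None:
        self._approved.add(user)
        self._record(user, "grant")

    def revoke(self, user: str) -> None:
        self._approved.discard(user)
        self._record(user, "revoke")

    def check(self, user: str, dataset: str) -> bool:
        """Return whether access is allowed and log the decision either way."""
        allowed = user in self._approved
        self._record(user, f"{'allow' if allowed else 'deny'}:{dataset}")
        return allowed


registry = AccessRegistry()
registry.grant("researcher-1")
before = registry.check("researcher-1", "example-dataset")
registry.revoke("researcher-1")
after = registry.check("researcher-1", "example-dataset")
print(before, after, len(registry.audit_log))
```

Here access is allowed before revocation and denied after it, and all four events land in the audit log. A real deployment would also need the cloud provider to enforce the check at the point of data access, which is the open question the slide raises.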
