Multi-omics infrastructure and data
for R/Bioconductor
Levi Waldron
Sept 29, 2017
Why Bioconductor?
1,400 packages on a backbone of data structures
The Genomic Ranges algebra
Huber, W. et al. Orchestrating high-throughput genomic analysis with
Bioconductor. Nat. Methods 12, 115–121 (2015).
The integrative data container SummarizedExperiment
Bioconductor core data classes
• Rectangular feature x sample data
– SummarizedExperiment::SummarizedExperiment()
– (RNAseq count matrix, microarray, …)
• Genomic coordinates
– GenomicRanges::GRanges() (1-based, closed interval)
• DNA / RNA / AA sequences
– Biostrings::*StringSet()
• Gene sets
– GSEABase::GeneSet() GSEABase::GeneSetCollection()
• Single cell data
– SingleCellExperiment::SingleCellExperiment()
• Mass spec data
– MSnbase::MSnExp()
Credit: Marcel Ramos
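The "1-based, closed interval" convention noted for GRanges can be made concrete with a small base-R sketch. It does not use GenomicRanges itself; the data frame and the `overlaps` helper are illustrative names only, chosen so the snippet runs without Bioconductor installed.

```r
# Base-R sketch of 1-based, closed intervals (the convention noted for GRanges).
# The data frame and helper below are illustrative, not part of any package.
ranges <- data.frame(
  seqname = c("chr1", "chr1", "chr2"),
  start   = c(5L, 10L, 1L),
  end     = c(10L, 12L, 3L),
  stringsAsFactors = FALSE
)

# Closed interval: both endpoints are included, so width = end - start + 1.
ranges$width <- ranges$end - ranges$start + 1L

# Two closed intervals on the same sequence overlap if each starts
# no later than the other ends; sharing a single position counts.
overlaps <- function(a, b) {
  a$seqname == b$seqname && a$start <= b$end && b$start <= a$end
}

print(ranges)
print(overlaps(ranges[1, ], ranges[2, ]))  # rows 1 and 2 share position 10
```

Under a 0-based, half-open convention (as in BED files) the same first range would be written as start 4, end 10, which is why the convention matters when importing coordinates.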
Diseases, platforms, and data types of TCGA
33 diseases
50 platforms
19 data types
Multi-assay experiments can be complex
The need for MultiAssayExperiment
Need a core data structure to:
– harmonize single-assay data structures
– relate multiple assays & clinical data
– handle missing and replicate observations
– accommodate ID-based and range-based data
– support on-disk representations of big data
MultiAssayExperiment design
Credit: Marcel Ramos
Ramos et al. Software for the Integration of Multi-omics Experiments in Bioconductor. Cancer Research (In Press).
TCGA as MultiAssayExperiments
Access from www.github.com/waldronlab/MultiAssayExperiment
…... 33 cancer types
TCGA as MultiAssayExperiments
> acc
A MultiAssayExperiment object of 9 listed
experiments with user-defined names and respective classes.
Containing an ExperimentList class object of length 9:
[1] RNASeq2GeneNorm: ExpressionSet with 20501 rows and 79 columns
[2] miRNASeqGene: ExpressionSet with 1046 rows and 80 columns
[3] CNASNP: RaggedExperiment with 79861 rows and 180 columns
[4] CNVSNP: RaggedExperiment with 21052 rows and 180 columns
[5] Methylation: SummarizedExperiment with 485577 rows and 80 columns
[6] RPPAArray: ExpressionSet with 192 rows and 46 columns
[7] Mutations: RaggedExperiment with 20166 rows and 90 columns
[8] gistica: SummarizedExperiment with 24776 rows and 90 columns
[9] gistict: SummarizedExperiment with 24776 rows and 90 columns
Features:
experiments() - obtain the ExperimentList instance
colData() - the primary/phenotype DataFrame
sampleMap() - the sample availability DataFrame
`$`, `[`, `[[` - extract colData columns, subset, or experiment
*Format() - convert into a long or wide DataFrame
assays() - convert ExperimentList to a SimpleList of matrices
>
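The role of the sample map can be sketched in base R as a plain table with one row per assay column, linking that column to a patient. This is only a conceptual sketch, not the MultiAssayExperiment implementation: the column names mirror those of the sampleMap() DataFrame (assay, primary, colname), and every ID below is made up.

```r
# Base-R sketch of the idea behind a sample map: one row per assay column,
# linking it to a patient ("primary") ID. All IDs below are invented.
sample_map <- data.frame(
  assay   = c("RNASeq", "RNASeq", "RNASeq", "Mutations", "Mutations", "Mutations"),
  primary = c("P1", "P2", "P3", "P1", "P1", "P3"),
  colname = c("P1-rna", "P2-rna", "P3-rna", "P1-mutA", "P1-mutB", "P3-mut"),
  stringsAsFactors = FALSE
)

# Patient-by-assay counts: 0 marks a missing observation, >1 a replicate.
availability <- table(sample_map$primary, sample_map$assay)
print(availability)

# Patients with at least one column in every assay (complete cases).
complete <- rownames(availability)[rowSums(availability > 0) == ncol(availability)]
print(complete)
```

Here P1 has replicate mutation columns and P2 has no mutation data, the two situations (replicate and missing observations) that the container is designed to handle.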
The MultiAssayExperiment API
Credit:
Marcel Ramos
For building visualizations
Upset Venn diagram for adrenocortical carcinoma TCGA
> data(miniACC)
> upsetSamples(miniACC)
For multi-omics analysis
> mae <- mae[, , c("Mutations", "gistict")]
> mae <- intersectColumns(mae)
> mae$cnload <- colMeans(abs(assay(mae[["gistict"]])))
Davoli et al. Tumor aneuploidy correlates
with markers of immune evasion and
with reduced response to immunotherapy.
Science 355, (2017).
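The copy-number load computed above can be traced on a toy example. The sketch below uses only base R; the matrix values, gene names, and sample IDs are invented, and the `intersect()` step stands in for what intersectColumns() does on the real object.

```r
# Toy GISTIC-like matrix: rows are genes, columns are samples.
# Values and names are invented for illustration.
gistic <- matrix(
  c( 0,  1, -2,
     1,  0,  2,
    -1,  0,  0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"), c("P1", "P2", "P3"))
)
mutated_samples <- c("P1", "P3", "P4")  # samples present in a second assay

# Keep only samples present in both assays.
shared <- intersect(colnames(gistic), mutated_samples)
gistic <- gistic[, shared, drop = FALSE]

# Per-sample copy-number load: mean absolute copy-number call across genes.
cnload <- colMeans(abs(gistic))
print(cnload)
```

Taking the absolute value first means gains and losses both add to the load rather than cancelling out.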
For integrating remotely stored data
> st <- ldblock::stack1kg() #Create a URL referencing 1000 genomes content in AWS S3
> multiban <- MultiAssayExperiment(
list(meth = banovichSE, snp = st),
colData = colData(banovichSE))
> multibanfocus <- multiban[rowRanges(banovichSE)["cg04793911"], , ]
> assoc <- cisAssoc(multibanfocus[["meth"]],
TabixFile(files(multibanfocus[["snp"]])))
Using tabix-indexed SNP VCFs
from 1000 genomes
on Amazon S3
credit: Vince Carey
A big software engineering effort
Past curated*Data
Bioconductor packages
• curatedOvarianData
– 30 datasets, > 3K unique samples
– survival, surgical debulking, histology...
• curatedCRCData
– 34 datasets, ~4K unique samples
– many annotated for MSS, gender, stage, age, N, M
• curatedBladderData
– 12 datasets, ~1,200 unique samples
– many annotated for stage, grade, OS
curatedMetagenomicData: motivation
• Increasing amount of public data
• Can be fast and free, but hard to use:
– fastq files from NCBI, EBI, ...
– bioinformatic expertise
– computational resources
– manual curation / standardization
• Wanted to make acquisition of curated, ready-to-use
public data easy and reproducible
curatedMetagenomicData: pipeline
• Raw fastq files
– 13 datasets
– 2,875 samples
– Download (~57TB)
• Study metadata (age, body site, disease, etc.)
– Manual curation
– Standardized metadata
• Uniform processing
– MetaPhlAn2: species abundance, marker presence, marker abundance
– HUMAnN2: gene family abundance, metabolic pathway abundance, metabolic pathway presence
• Offline high computational load pipeline (> 120 kH CPU)
• Integration
– Integrated Bioconductor ExpressionSet objects
– Per-patient microbiome data
– Per-patient metadata
– Experiment-wide metadata
– Automatic documentation
• ExperimentHub product
– Amazon S3 cloud distribution
– Tag-based searching
– Dataset snapshot dates
– Automatic local caching
• User experience
– Convenience download functions
– Megabytes-sized datasets
– Differential abundance, diversity metrics, clustering, machine learning
https://waldronlab.github.io/curatedMetagenomicData/
One dataset from R:
> curatedMetagenomicData("HMP_2012.metaphlan_bugs_list.stool", relab=FALSE)
Many datasets from R:
> curatedMetagenomicData("HMP_2012.metaphlan_bugs_list.*")
Command-line:
$ curatedMetagenomicData -p "HMP_2012.metaphlan_bugs_list.*"
curatedMetagenomicData: use
Pasolli/Schiffer/Manghi et al. Accessible, Curated Metagenomic Data Through ExperimentHub. Nature Methods (In Press).
Supervised disease classification
Credit:
Edoardo Pasolli
Unsupervised clustering
Credit:
Audrey Renson
Credit: Audrey Renson
Unsupervised clustering
Meta-analysis
(partial) validation of
reported associations
between genera and BMI
Credit: Lucas Schiffer
Beaumont M et al. Heritable
components of the human fecal
microbiome are associated with
visceral fat. Genome Biol.
2016;17:189.
Meta-analysis
“protective” bacteria for CRC
• Lower in stool samples of CRC
cases compared to healthy controls
curatedMetagenomicData summary
• 25 datasets (5,716 samples) available
• Six data products per dataset
• Three taxonomy-based from MetaPhlAn2
• Three functional from HUMAnN2
• Reproduce all analyses in manuscript at:
– https://waldronlab.github.io/curatedMetagenomicData/analyses/
• Lowest barrier to entry, highest level of
curation of any microbiome data resource
Pasolli/Schiffer/Manghi et al., bioRxiv 103085
Future work
• Integrated databases as HDF5, indexed remote files
– fast remote slicing of ranges, genes, gene families...
• Distribute TCGA, cBioPortal through ExperimentHub
– omics and clinical data as MultiAssayExperiments
• Curated microbial signatures / BugSigDB
Thank you
• Lab (www.waldronlab.org / www.waldronlab.github.io)
– Lucas Schiffer (curatedMetagenomicData), Marcel Ramos
(MultiAssayExperiment)
– Audrey Renson, Andy Samedy, Rimsha Azar, Carmen Rodriguez,
Tiffany Chan, Abzal Bacchus, Jaya Amatya, Ludwig Geistlinger
• Collaborators
– Nicola Segata lab
• Francesco Beghini, Edoardo Pasolli, Paolo Manghi
– Heidi Jones, Jennifer Dowd, Sharon Perlman, Lorna Thorpe,
Robert Burk Lab (NYC-HANES)
– Valerie Obenchain, Martin Morgan (Bioconductor core team)
• CUNY High-performance Computing Center
Der Spagat zwischen BIAS und FAIRNESS (2024)
 
Engage Usergroup 2024 - The Good The Bad_The Ugly
Engage Usergroup 2024 - The Good The Bad_The UglyEngage Usergroup 2024 - The Good The Bad_The Ugly
Engage Usergroup 2024 - The Good The Bad_The Ugly
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about us
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
Introduction to Decentralized Applications (dApps)
Introduction to Decentralized Applications (dApps)Introduction to Decentralized Applications (dApps)
Introduction to Decentralized Applications (dApps)
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - Infographic
 
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfThe Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered Sustainability
 

Multi-omics infrastructure and data for R/Bioconductor

  • 1. Multi-omics infrastructure and data for R/Bioconductor Levi Waldron Sept 29, 2017
  • 2. Why Bioconductor? 1,400 packages on a backbone of data structures: the Genomic Ranges algebra and the integrative data container SummarizedExperiment. Huber, W. et al. Orchestrating high-throughput genomic analysis with Bioconductor. Nat. Methods 12, 115–121 (2015).
  • 3. Bioconductor core data classes • Rectangular feature x sample data – SummarizedExperiment::SummarizedExperiment() (RNAseq count matrix, microarray, …) • Genomic coordinates – GenomicRanges::GRanges() (1-based, closed interval) • DNA / RNA / AA sequences – Biostrings::*StringSet() • Gene sets – GSEABase::GeneSet(), GSEABase::GeneSetCollection() • Single cell data – SingleCellExperiment::SingleCellExperiment() • Mass spec data – MSnbase::MSnExp()
  • 4. Multi-assay experiments can be complex. Diseases, platforms, and data types of TCGA: 33 diseases, 50 platforms, 19 data types. Credit: Marcel Ramos
  • 5. The need for MultiAssayExperiment Need a core data structure to: – harmonize single-assay data structures – relate multiple assays & clinical data – handle missing and replicate observations – accommodate ID-based and range-based data – support on-disk representations of big data
  • 6. MultiAssayExperiment design Credit: Marcel Ramos Ramos et al. Software for the Integration of Multi-omics Experiments in Bioconductor. Cancer Research (In Press).
  • 7. TCGA as MultiAssayExperiments Access from www.github.com/waldronlab/MultiAssayExperiment …... 33 cancer types Ramos et al. Software for the Integration of Multi-omics Experiments in Bioconductor. Cancer Research (In Press).
  • 8. TCGA as MultiAssayExperiments
      > acc
      A MultiAssayExperiment object of 9 listed
       experiments with user-defined names and respective classes.
       Containing an ExperimentList class object of length 9:
       [1] RNASeq2GeneNorm: ExpressionSet with 20501 rows and 79 columns
       [2] miRNASeqGene: ExpressionSet with 1046 rows and 80 columns
       [3] CNASNP: RaggedExperiment with 79861 rows and 180 columns
       [4] CNVSNP: RaggedExperiment with 21052 rows and 180 columns
       [5] Methylation: SummarizedExperiment with 485577 rows and 80 columns
       [6] RPPAArray: ExpressionSet with 192 rows and 46 columns
       [7] Mutations: RaggedExperiment with 20166 rows and 90 columns
       [8] gistica: SummarizedExperiment with 24776 rows and 90 columns
       [9] gistict: SummarizedExperiment with 24776 rows and 90 columns
      Features:
       experiments() - obtain the ExperimentList instance
       colData() - the primary/phenotype DataFrame
       sampleMap() - the sample availability DataFrame
       `$`, `[`, `[[` - extract colData columns, subset, or experiment
       *Format() - convert into a long or wide DataFrame
       assays() - convert ExperimentList to a SimpleList of matrices
      Ramos et al. Software for the Integration of Multi-omics Experiments in Bioconductor. Cancer Research (In Press).
  • 9. The MultiAssayExperiment API Credit: Marcel Ramos Ramos et al. Software for the Integration of Multi-omics Experiments in Bioconductor. Cancer Research (In Press).
  • 10. For building visualizations: Upset Venn diagram for adrenocortical carcinoma TCGA
      > data(miniACC)
      > upsetSamples(miniACC)
      Ramos et al. Software for the Integration of Multi-omics Experiments in Bioconductor. Cancer Research (In Press).
  • 11. For multi-omics analysis
      > mae <- mae[, , c("Mutations", "gistict")]
      > mae <- intersectColumns(mae)
      > mae$cnload <- colMeans(abs(assay(mae[["gistict"]])))
      Davoli et al. Tumor aneuploidy correlates with markers of immune evasion and with reduced response to immunotherapy. Science 355, (2017).
      Ramos et al. Software for the Integration of Multi-omics Experiments in Bioconductor. Cancer Research (In Press).
  • 12. For integrating remotely stored data
      > st <- ldblock::stack1kg()  # Create a URL referencing 1000 genomes content in AWS S3
      > multiban <- MultiAssayExperiment(list(meth = banovichSE, snp = st), colData = colData(banovichSE))
      > multibanfocus <- multiban[rowRanges(banovichSE)["cg04793911"], , ]
      > assoc <- cisAssoc(multibanfocus[["meth"]], TabixFile(files(multibanfocus[["snp"]])))
      Using tabix-indexed SNP VCFs from 1000 genomes on Amazon S3. Credit: Vince Carey
      Ramos et al. Software for the Integration of Multi-omics Experiments in Bioconductor. Cancer Research (In Press).
  • 13. A big software engineering effort
  • 14. Past curated*Data Bioconductor packages • curatedOvarianData – 30 datasets, > 3K unique samples – survival, surgical debulking, histology... • curatedCRCData – 34 datasets, ~4K unique samples – many annotated for MSS, gender, stage, age, N, M • curatedBladderData – 12 datasets, ~1,200 unique samples – many annotated for stage, grade, OS
  • 15. curatedMetagenomicData: motivation • Increasing amount of public data • Can be fast and free, but hard to use: – fastq files from NCBI, EBI, ... – bioinformatic expertise – computational resources – manual curation / standardization • Wanted to make acquisition of curated, ready-to-use public data easy and reproducible
  • 16. curatedMetagenomicData: pipeline. Inputs: raw fastq files (13 datasets, 2,875 samples; download ~57TB) and study metadata (age, body site, disease, etc.). Offline high computational load pipeline (> 120 kH CPU): uniform processing with MetaPhlAn2 (species abundance, marker presence, marker abundance) and HUMAnN2 (gene family abundance, metabolic pathway abundance, metabolic pathway presence); manual curation into standardized metadata. Integration: integrated Bioconductor ExpressionSet objects (per-patient microbiome data, per-patient metadata, experiment-wide metadata) with automatic documentation. ExperimentHub product: Amazon S3 cloud distribution, tag-based searching, dataset snapshot dates, automatic local caching. User experience: convenience download functions and megabytes-sized datasets for differential abundance, diversity metrics, clustering, and machine learning. https://waldronlab.github.io/curatedMetagenomicData/
  • 17. curatedMetagenomicData: use
      One dataset from R:
      > curatedMetagenomicData("HMP_2012.metaphlan_bugs_list.stool", relab=FALSE)
      Many datasets from R:
      > curatedMetagenomicData("HMP_2012.metaphlan_bugs_list.*")
      Command-line:
      $ curatedMetagenomicData -p "HMP_2012.metaphlan_bugs_list.*"
      Pasolli/Schiffer/Manghi et al. Accessible, Curated Metagenomic Data Through ExperimentHub. Nature Methods (In Press).
  • 18. Supervised disease classification. Credit: Edoardo Pasolli. Pasolli/Schiffer/Manghi et al. Accessible, Curated Metagenomic Data Through ExperimentHub. Nature Methods (In Press).
  • 19. Unsupervised clustering. Credit: Audrey Renson. Pasolli/Schiffer/Manghi et al. Accessible, Curated Metagenomic Data Through ExperimentHub. Nature Methods (In Press).
  • 21. Meta-analysis: (partial) validation of reported associations between genera and BMI. Credit: Lucas Schiffer. Beaumont M et al. Heritable components of the human fecal microbiome are associated with visceral fat. Genome Biol. 2016;17:189.
  • 22. Meta-analysis: “protective” bacteria for CRC • Lower in stool samples of CRC cases compared to healthy controls
  • 23. curatedMetagenomicData summary • 25 datasets (5,716 samples) available • Six data products per dataset • Three taxonomy-based from MetaPhlAn2 • Three functional from HUMAnN2 • Reproduce all analyses in manuscript at: – https://waldronlab.github.io/curatedMetagenomicData/analyses/ • Lowest barrier to entry, highest level of curation of any microbiome data resource. Pasolli/Schiffer/Manghi et al., bioRxiv 103085
  • 24. Future work • Integrated databases as HDF5, indexed remote files – fast remote slicing of ranges, genes, gene families... • Distribute TCGA, cBioPortal through ExperimentHub – omics and clinical data as MultiAssayExperiments • Curated microbial signatures / BugSigDB
  • 25. Thank you • Lab (www.waldronlab.org / www.waldronlab.github.io) – Lucas Schiffer (curatedMetagenomicData), Marcel Ramos (MultiAssayExperiment) – Audrey Renson, Andy Samedy, Rimsha Azar, Carmen Rodriguez, Tiffany Chan, Abzal Bacchus, Jaya Amatya, Ludwig Geistlinger • Collaborators – Nicola Segata lab • Francesco Beghini, Edoardo Pasolli, Paolo Manghi – Heidi Jones, Jennifer Dowd, Sharon Perlman, Lorna Thorpe, Robert Burk Lab (NYC-HANES) – Valerie Obenchain, Martin Morgan (Bioconductor core team) • CUNY High-performance Computing Center
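
The calls shown on slides 8, 10, and 11 can be strung together into a short session. The following is a minimal sketch, assuming the Bioconductor package MultiAssayExperiment is installed and that its bundled `miniACC` example object (the one loaded on slide 10) contains assays named "Mutations" and "gistict". The variable names `sub`, `complete`, and `cnload` are illustrative only.

```r
# Assumption: MultiAssayExperiment is installed from Bioconductor.
library(MultiAssayExperiment)

# Small adrenocortical carcinoma example object shipped with the package.
data(miniACC)

# Inspect the pieces listed on slide 8.
experiments(miniACC)      # the ExperimentList of assays
head(sampleMap(miniACC))  # which assay column belongs to which patient

# Keep two assays, then keep only patients that have data in both.
sub <- miniACC[, , c("Mutations", "gistict")]
complete <- intersectColumns(sub)

# Mean absolute GISTIC copy-number score for each column of the gistict assay.
# The result is named by assay column, not by patient ID.
cnload <- colMeans(abs(assay(complete[["gistict"]])))
head(cnload)
```

The score is kept as a standalone vector here rather than assigned into `colData`, so the sketch does not rely on any particular ordering of assay columns relative to patients.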

Editor's Notes

  1. Hi. I’d like to introduce you to MultiAssayExperiment, a framework for the representation and analysis of multi-omics experiments in Bioconductor.
  2. For anyone unfamiliar with Bioconductor, it is a suite of over a thousand packages for statistical analysis and visualization of high-throughput biological data, accessible via the R programming language and unified by a backbone of core data structures designed for the requirements of specific genomic data types. Core developers provide this key set of efficient, well-tested data structures, and contributed packages are expected to use them where applicable. For example, the Genomic Ranges system provides a representation and algebra for any data associated with genomic coordinates, with efficient in-memory and on-disk representations. Integrative data containers such as SummarizedExperiment integrate high-throughput data with, for example, gene annotations, sample data such as clinical information, and experimental metadata, and can even represent multiple assays. In that case, however, the assays must be matrix-like and of identical dimensions. Until now, Bioconductor lacked a core data structure to provide a framework for analysis and tool development for multi-omics experiments.
  3. This work was motivated by the need to simplify general statistical analysis and development of bioinformatic tools for a study as complex as the Cancer Genome Atlas, where 33 cancers were assayed on many platforms to generate different types of data, but also to provide a simplified framework for more easily reproducible and less error-prone analysis of simpler experiments involving just a couple of complementary assays and clinical data.
  4. A core data structure was needed to: harmonize existing structures for different types of data; relate assays with each other and with clinical data; handle the reality that such experiments are often incomplete, with missing observations on some assays, and may also contain replicates, time series, or matched normals; accommodate data indexed by IDs such as genes as well as data indexed by genomic ranges; and support on-disk representations for big data.
  5. MultiAssayExperiment addresses these challenges by relating a table of information about subjects, say clinical and pathological data, to a series of genomic data sets of arbitrary shape and even non-tabular data, via a map or a network relating these. This sounds complex and it can be, but from the analyst’s perspective, there is an API that will be familiar to users of R, and that abstracts this complexity from the user. Constructing, accessing, subsetting, data management or manipulation, and combining and reshaping into forms usable by standard tools become straightforward.
  6. To help those wanting to analyze TCGA data, we’ve constructed MultiAssayExperiments for 33 cancer types. Each cancer type is represented by a single object containing all the most commonly used, unrestricted data. These objects are immediately usable, even on most laptops, with the API shown on the previous page.
  7. To give you an idea of what this looks like, here is a sort of complex Venn Diagram of just four of the assays for GBM. Although GISTIC copy number and microRNA are assayed on about 600 samples each, only a fraction of these cases have data available for both, and an even smaller fraction have data for all four of GISTIC, microRNA, methylation, and RNA-seq.
  8. The API supports analyses of a single assay and analyses that combine assays, such as this reproduction of the result from Davoli et al. that cancer types with high levels of aneuploidy often show a positive correlation of mutation load and chromosomal instability, perhaps due to a higher tolerance of deleterious mutations, as shown here in orange for breast cancer. In contrast, tumors with a hypermutator phenotype rarely display extensive chromosomal instability, resulting in a negative correlation of mutation load and chromosomal instability in cancer types where hypermutation is common (shown in grey for colon adenocarcinoma).
  9. Larger files, such as SNPs in VCF format, can be analyzed in chunks from an on-disk representation, as in this SNP/methylation association study. The example uses the 1000 genomes project because its data are unrestricted. This data format, by the way, was supported by default without any modification of MultiAssayExperiment, as is any data class meeting a few minimum requirements.
  10. https://upload.wikimedia.org/wikipedia/commons/thumb/7/7e/Funnel_Mech.svg/667px-Funnel_Mech.svg.png https://pixabay.com/en/cheering-happy-jumping-people-297419/
  11. met Jin Xu from East China Normal University, Shanghai
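
The design described in the notes above, a subject table related to assays of arbitrary shape through a map, can be illustrated without any packages. The following is a base-R sketch of the idea only, not the MultiAssayExperiment implementation. All patient IDs, assay names, and values are made up for illustration.

```r
# Subject-level table: one row per patient (hypothetical IDs and values).
col_data <- data.frame(age = c(51, 64, 47), row.names = c("P1", "P2", "P3"))

# Two toy assays with different features and different sample names.
rna <- matrix(1:6, nrow = 2,
              dimnames = list(c("geneA", "geneB"), c("rna_1", "rna_2", "rna_3")))
meth <- matrix(c(0.1, 0.2, 0.3, 0.4), nrow = 2,
               dimnames = list(c("cg01", "cg02"), c("meth_1", "meth_2")))
assays <- list(rna = rna, meth = meth)

# Sample map: links each assay column to a patient. P2 has no methylation data.
sample_map <- data.frame(
  assay   = c("rna", "rna", "rna", "meth", "meth"),
  primary = c("P1", "P2", "P3", "P1", "P3"),
  colname = c("rna_1", "rna_2", "rna_3", "meth_1", "meth_2"),
  stringsAsFactors = FALSE
)

# Patients observed in every assay (the idea behind intersecting columns).
complete_patients <- Reduce(intersect,
                            split(sample_map$primary, sample_map$assay))

# Subset each assay to the columns that belong to those patients.
keep <- sample_map[sample_map$primary %in% complete_patients, ]
complete_assays <- lapply(names(assays), function(a) {
  assays[[a]][, keep$colname[keep$assay == a], drop = FALSE]
})
names(complete_assays) <- names(assays)

# The matching rows of the subject table.
complete_col_data <- col_data[complete_patients, , drop = FALSE]
```

Keeping the map as its own table is what lets assays have unrelated column names, missing patients, or several columns per patient without changing the subject table.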