1.
Why re-use core classes?
A plea to developers of Bioconductor packages
Levi Waldron
Oct 16, 2017
2.
What is Bioconductor?
1,400 packages on a backbone of data structures
The Genomic Ranges algebra
Huber, W. et al. Orchestrating high-throughput genomic analysis with
Bioconductor. Nat. Methods 12, 115–121 (2015).
The integrative data container SummarizedExperiment
3.
Why do core classes matter to developers?
Let’s say you have a great idea for an improved bicycle
New rocket-powered bike
raw steel, forge, frame
What could possibly go wrong?
4.
Why do core classes matter to developers?
• What could possibly go wrong?
– Your frame has limited testing
– Your frame lacks features you never even thought of
Ouch!
Little eyelets allow users to install a rack and fenders
5.
Why do core classes matter to developers?
It’s easy to define a new S4 class in R
> setClass("BicycleFrame",
representation(height = "numeric", color = "character"))
> my.new.frame <- new("BicycleFrame", height = 31, color = "red")
> my.new.frame
An object of class "BicycleFrame"
Slot "height":
[1] 31
Slot "color":
[1] "red"
However, it’s very difficult to define a
robust and flexible data class for genomic data analysis
6.
Why do core classes matter to developers?
setClass(Class="phyloseq",
representation=representation(
otu_table="otu_tableOrNULL",
tax_table="taxonomyTableOrNULL",
sam_data="sample_dataOrNULL",
phy_tree="phyloOrNULL",
refseq = "XStringSetOrNULL")
)
From phyloseq Bioconductor package
Does not contain any base class
It is effectively a list of slots, each with a defined class
7.
Why do core classes matter to developers?
setClass("MRexperiment", contains=c("eSet"),
representation=representation(
expSummary = "environment"))
From the metagenomeSeq Bioconductor package
Contains the eSet base virtual class
(since superseded by SummarizedExperiment)
8.
Load a metagenomeSeq class object
This loads an example object and demonstrates that it uses the
default show method defined for eSet. A custom show method
could be defined if desired.
suppressPackageStartupMessages(library(metagenomeSeq))
data(lungData)
lungData
## MRexperiment (storageMode: environment)
## assayData: 51891 features, 78 samples
## element names: counts
## protocolData: none
## phenoData
## sampleNames: CHK_6467_E3B11_BRONCH2_PREWASH_V1V2
## CHK_6467_E3B11_OW_V1V2 ... CHK_6467_E3B09_BAL_A_V1V2 (78
## total)
## varLabels: SampleType SiteSampled SmokingStatus
## varMetadata: labelDescription
## featureData
## featureNames: 1 2 ... 51891 (51891 total)
## fvarLabels: taxa
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## pubMedIds: 21680950
## Annotation:
9.
Load a phyloseq class object
Do the same for a phyloseq class example data object, and
demonstrate its custom show method:
suppressPackageStartupMessages(library(phyloseq))
data(GlobalPatterns)
GlobalPatterns
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 19216 taxa and 26 samples ]
## sample_data() Sample Data: [ 26 samples by 7 sample variables ]
## tax_table() Taxonomy Table: [ 19216 taxa by 7 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 19216 tips and 19215 internal nodes ]
10.
Inheritance of core methods: example 1
Since the metagenomeSeq class (MRexperiment) contains eSet, it
automatically inherits core methods like dim(). These would have to
be defined separately for the phyloseq class since it does not extend
a core class.
dim(lungData)
## Features Samples
## 51891 78
dim(GlobalPatterns)
## NULL
Note that neither the phyloseq nor the metagenomeSeq package
defines a dim() method, but metagenomeSeq got it for free by
extending eSet.
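The inheritance mechanism described above can be sketched with base R's S4 system alone. This is a toy illustration, not code from either package: ToyCore, ToyDerived, and ToyStandalone are hypothetical class names, with ToyCore standing in for a core class such as eSet.

```r
library(methods)

# Hypothetical stand-in for a core class: it stores a feature-by-sample
# matrix and defines dim() once.
setClass("ToyCore", representation(assay = "matrix"))
setMethod("dim", "ToyCore", function(x) dim(x@assay))

# Extends the core class, so dim() is inherited with no extra code.
setClass("ToyDerived", contains = "ToyCore",
         representation(note = "character"))

# Stands alone: same data, but no parent class to inherit dim() from.
setClass("ToyStandalone", representation(assay = "matrix"))

m <- matrix(0, nrow = 4, ncol = 3)
derived <- new("ToyDerived", assay = m, note = "inherits dim()")
standalone <- new("ToyStandalone", assay = m)

dim(derived)     # 4 3, via the method defined for ToyCore
dim(standalone)  # NULL, because no dim() method applies
```

The derived class never mentions dim(), yet it answers correctly; the standalone class would need its own method.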
11.
Inheritance of core methods: example 2
For core Bioconductor objects, $ generally accesses the sample
data, but for phyloseq objects the sample data must be explicitly
extracted first:
head(lungData$SampleType)
## CHK_6467_E3B11_BRONCH2_PREWASH_V1V2 CHK_6467_E3B11_OW_V1V2
## Bronch2.PreWash OW
## CHK_6467_E3B08_OW_V1V2 CHK_6467_E3B07_BAL_A_V1V2
## OW BAL.A
## CHK_6467_E3B11_BAL_A_V1V2 CHK_6467_E3B09_OP_V1V2
## BAL.A OP.Swab
## 12 Levels: BAL.1stReturn BAL.A BAL.B Bronch1.PostWash ... PSB
head(sample_data(GlobalPatterns)$SampleType)
## [1] Soil Soil Soil Feces Feces Skin
## 9 Levels: Feces Freshwater Freshwater (creek) Mock ... Tongue
12.
Inheritance of core methods: example 2
subset(), [, and head() are core methods:
they are defined for eSet and other core classes, so these
familiar operations work “out of the box”:
subset(lungData, lungData$SampleType=="OW")
lungData[, lungData$SampleType=="OW"]
lungData[, 1:5]
head(lungData)
13.
Inheritance of core methods: example 2
phyloseq cannot use these, so a custom subset_samples()
method is defined instead:
subset_samples(GlobalPatterns, SampleType=="Ocean")
But square bracket subsetting, subset(), and head() are not
defined for phyloseq objects, and have no parent class to inherit
them from.
GlobalPatterns[, 1:5]
## Error in GlobalPatterns[, 1:5]: object of type 'S4' is not subsettable
subset(GlobalPatterns, 1:5)
## Error in subset.default(GlobalPatterns, 1:5): 'subset' must be logical
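To give a sense of what a standalone class has to write for itself, here is a minimal sketch of a hand-written `[` method in base R. ToyStandalone and its slots are hypothetical and are not taken from phyloseq; the point is only that the method must keep the count matrix and the sample table in sync, which core classes already do.

```r
library(methods)

# Hypothetical standalone container with no core parent class.
setClass("ToyStandalone",
         representation(counts = "matrix", samples = "data.frame"))

# Subsetting must be applied consistently to both slots.
setMethod("[", "ToyStandalone", function(x, i, j, ..., drop = TRUE) {
  if (missing(i)) i <- seq_len(nrow(x@counts))
  if (missing(j)) j <- seq_len(ncol(x@counts))
  new("ToyStandalone",
      counts = x@counts[i, j, drop = FALSE],     # features x samples
      samples = x@samples[j, , drop = FALSE])    # one row per sample
})

counts <- matrix(1:12, nrow = 3,
                 dimnames = list(paste0("taxon", 1:3), paste0("s", 1:4)))
samples <- data.frame(SampleType = c("Soil", "Feces", "Soil", "Skin"),
                      row.names = colnames(counts))
obj <- new("ToyStandalone", counts = counts, samples = samples)

# Keep only the soil samples; both slots shrink together.
sub <- obj[, obj@samples$SampleType == "Soil"]
colnames(sub@counts)
rownames(sub@samples)
```

Even this small method ignores edge cases (character indices that do not match, negative indices mixed with names, and so on) that a mature core class has already handled.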
14.
Relevance to multi-omics data analysis
The MultiAssayExperiment core class allows coordinated
representation and management of an open-ended set of assays,
as long as their data class provides basic methods:
dimnames()
[ subsetting
dim()
and preferably assay()
MultiAssayExperiment data management is modeled on
SummarizedExperiment but allows for multiple assays of
different row and column dimensions.
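One way to picture the requirement above is a small probe that asks whether an object answers dim(), dimnames(), and two-dimensional `[` subsetting. The helper supports_basic_methods() and the class ToyBare are hypothetical and written for illustration; this is not how MultiAssayExperiment itself validates its inputs.

```r
library(methods)

# Hypothetical helper: reports which of the basic methods an object supports.
supports_basic_methods <- function(x) {
  d <- tryCatch(dim(x), error = function(e) NULL)
  dn <- tryCatch(dimnames(x), error = function(e) NULL)
  has_dims <- length(d) == 2L
  has_names <- is.list(dn) && length(dn) == 2L
  # Only attempt subsetting when the object reports two dimensions.
  can_subset <- has_dims && tryCatch({
    x[seq_len(min(1L, d[1])), seq_len(min(1L, d[2]))]
    TRUE
  }, error = function(e) FALSE)
  c(dim = has_dims, dimnames = has_names, subset = can_subset)
}

# A plain matrix with dimnames supports all three.
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("f1", "f2"), c("s1", "s2", "s3")))

# A bare S4 class with no parent and no methods supports none of them.
setClass("ToyBare", representation(assay = "matrix"))
bare <- new("ToyBare", assay = m)

supports_basic_methods(m)
supports_basic_methods(bare)
```

A class that extends a core class would pass such a probe without its author writing any of these methods.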
15.
Relevance to multi-omics data analysis (cont’d)
With no special accommodations, the lungData object “just works”
in a MultiAssayExperiment:
suppressPackageStartupMessages(library(MultiAssayExperiment))
MultiAssayExperiment(list(lung=lungData))
## A MultiAssayExperiment object of 1 listed
## experiment with a user-defined name and respective class.
## Containing an ExperimentList class object of length 1:
## [1] lung: MRexperiment with 51891 rows and 78 columns
## Features:
## experiments() - obtain the ExperimentList instance
## colData() - the primary/phenotype DataFrame
## sampleMap() - the sample availability DataFrame
## `$`, `[`, `[[` - extract colData columns, subset, or experiment
## *Format() - convert into a long or wide DataFrame
## assays() - convert ExperimentList to a SimpleList of matrices
But GlobalPatterns does not, because it is not derived from a core class:
MultiAssayExperiment(list(global=GlobalPatterns))
## Error in if (dim(object)[2] > 0 && is.null(colnames(object))) {: missing value where TRUE/FALSE needed
16.
Inheritance of core methods
These are not isolated examples. Full-time, professional software
developers have developed many methods for core classes.
Classes containing core classes get all of these for free,
providing future advantages that you cannot possibly imagine
in advance.
For example, SummarizedExperiment has more than 100
methods defined!
17.
Inheritance of core methods (cont’d)
suppressPackageStartupMessages(library(SummarizedExperiment))
methods(class="SummarizedExperiment")
## [1] != [ [[ [[<- [<-
## [6] %in% < <= == >
## [11] >= $ $<- aggregate anyNA
## [16] append as.character as.complex as.data.frame as.env
## [21] as.integer as.list as.logical as.matrix as.numeric
## [26] as.raw assay assay<- assayNames assayNames<-
## [31] assays assays<- by cbind coerce
## [36] coerce<- colData colData<- countOverlaps dim
## [41] dimnames dimnames<- duplicated elementMetadata elementMetadata<-
## [46] eval expand expand.grid extractROWS findOverlaps
## [51] head intersect is.na length lengths
## [56] match mcols mcols<- merge metadata
## [61] metadata<- mstack names names<- NROW
## [66] overlapsAny parallelSlotNames pcompare rank rbind
## [71] realize relist rename rep rep.int
## [76] replaceROWS rev rowData rowData<- ROWNAMES
## [81] rowRanges<- seqlevelsInUse setdiff setequal shiftApply
## [86] show sort split split<- subset
## [91] subsetByOverlaps table tail tapply transform
## [96] union unique updateObject values values<-
## [101] window window<- with xtabs
## see '?methods' for accessing help and source code
SummarizedExperiment also provides great functionality like
out-of-the-box compatibility with on-disk data representation.
USE AND DERIVE FROM THESE CLASSES!
18.
What are the “core” classes?
• Rectangular feature x sample data (RNAseq count matrix, microarray, …)
– SummarizedExperiment::SummarizedExperiment()
• Genomic coordinates (1-based, closed interval)
– GenomicRanges::GRanges()
• DNA / RNA / AA sequences
– Biostrings::*StringSet()
• Gene sets
– GSEABase::GeneSet()
– GSEABase::GeneSetCollection()
• Multi-omics data
– MultiAssayExperiment::MultiAssayExperiment()
• Single cell data
– SingleCellExperiment::SingleCellExperiment()
• Mass spectrometry data
– MSnbase::MSnExp()
https://www.bioconductor.org/developers/how-to/commonMethodsAndClasses/
19.
Core classes represent years of work and
maintenance by experienced developers
Bioconductor core team members
– Martin Morgan (Project Lead)
– Hervé Pagès
– James MacDonald
– Valerie Obenchain
– Andrzej Oleś
– Marcel Ramos
– Lori Shepherd
– Nitesh Turaga
– Daniel van Twisk
So you can spend less time frame-building
and more time building rocket boosters