Why re-use core classes?

  1. Why re-use core classes? A plea to developers of Bioconductor packages. Levi Waldron, Oct 16, 2017
  2. What is Bioconductor? 1,400 packages on a backbone of data structures: the Genomic Ranges algebra and the integrative data container SummarizedExperiment. Huber, W. et al. Orchestrating high-throughput genomic analysis with Bioconductor. Nat. Methods 12, 115–121 (2015).
  3. Why do core classes matter to developers? Let’s say you have a great idea for an improved bicycle. (Slide labels: new rocket-powered bike; raw steel, forge, frame.) What could possibly go wrong?
  4. Why do core classes matter to developers?
     • What could possibly go wrong?
       – Your frame has limited testing
       – Your frame lacks features you never even thought of
     Ouch! Little eyelets allow users to install a rack and fenders.
  5. Why do core classes matter to developers? It’s easy to define a new S4 class in R:

         > setClass("BicycleFrame", representation(height = "numeric", color = "character"))
         > my.new.frame <- new("BicycleFrame", height = 31, color = "red")
         > my.new.frame
         An object of class "BicycleFrame"
         Slot "height":
         [1] 31
         Slot "color":
         [1] "red"

     However, it’s very difficult to define a robust and flexible data class for genomic data analysis.
  6. Why do core classes matter to developers? From the phyloseq Bioconductor package:

         setClass(Class="phyloseq",
             representation=representation(
                 otu_table="otu_tableOrNULL",
                 tax_table="taxonomyTableOrNULL",
                 sam_data="sample_dataOrNULL",
                 phy_tree="phyloOrNULL",
                 refseq = "XStringSetOrNULL")
         )

     This class does not contain any base class. In effect, it is a list whose elements (slots) each have a defined class.
  7. Why do core classes matter to developers? From the metagenomeSeq Bioconductor package:

         setClass("MRexperiment", contains=c("eSet"),
             representation=representation(expSummary = "environment"))

     This class contains the eSet base virtual class (since superseded by SummarizedExperiment).
  8. Load a metagenomeSeq class object. This loads an example object and demonstrates that it uses the default show method defined for eSet. A custom show method could be defined if desired.

         suppressPackageStartupMessages(library(metagenomeSeq))
         data(lungData)
         lungData
         ## MRexperiment (storageMode: environment)
         ## assayData: 51891 features, 78 samples
         ##   element names: counts
         ## protocolData: none
         ## phenoData
         ##   sampleNames: CHK_6467_E3B11_BRONCH2_PREWASH_V1V2
         ##     CHK_6467_E3B11_OW_V1V2 ... CHK_6467_E3B09_BAL_A_V1V2 (78
         ##     total)
         ##   varLabels: SampleType SiteSampled SmokingStatus
         ##   varMetadata: labelDescription
         ## featureData
         ##   featureNames: 1 2 ... 51891 (51891 total)
         ##   fvarLabels: taxa
         ##   fvarMetadata: labelDescription
         ## experimentData: use 'experimentData(object)'
         ##   pubMedIds: 21680950
         ## Annotation:
  9. Load a phyloseq class object. Do the same for a phyloseq class example data object, and demonstrate its custom show method:

         suppressPackageStartupMessages(library(phyloseq))
         data(GlobalPatterns)
         GlobalPatterns
         ## phyloseq-class experiment-level object
         ## otu_table()   OTU Table:         [ 19216 taxa and 26 samples ]
         ## sample_data() Sample Data:       [ 26 samples by 7 sample variables ]
         ## tax_table()   Taxonomy Table:    [ 19216 taxa by 7 taxonomic ranks ]
         ## phy_tree()    Phylogenetic Tree: [ 19216 tips and 19215 internal nodes ]
  10. Inheritance of core methods: example 1. Since the MRexperiment class of metagenomeSeq contains eSet, it automatically inherits core methods like dim(). These would have to be defined separately for the phyloseq class since it does not extend a core class.

          dim(lungData)
          ## Features  Samples
          ##    51891       78
          dim(GlobalPatterns)
          ## NULL

      Note that neither the phyloseq nor the metagenomeSeq package defines a dim() method, but metagenomeSeq got it for free by extending eSet.
  11. Inheritance of core methods: example 2. For core Bioconductor objects, $ generally accesses the sample data, but for phyloseq objects the sample data must be explicitly extracted first:

          head(lungData$SampleType)
          ## CHK_6467_E3B11_BRONCH2_PREWASH_V1V2              CHK_6467_E3B11_OW_V1V2
          ##                     Bronch2.PreWash                                  OW
          ##              CHK_6467_E3B08_OW_V1V2           CHK_6467_E3B07_BAL_A_V1V2
          ##                                  OW                               BAL.A
          ##           CHK_6467_E3B11_BAL_A_V1V2              CHK_6467_E3B09_OP_V1V2
          ##                               BAL.A                             OP.Swab
          ## 12 Levels: BAL.1stReturn BAL.A BAL.B Bronch1.PostWash ... PSB
          head(sample_data(GlobalPatterns)$SampleType)
          ## [1] Soil  Soil  Soil  Feces Feces Skin
          ## 9 Levels: Feces Freshwater Freshwater (creek) Mock ... Tongue
  12. Inheritance of core methods: example 2. subset(), [, and head() are core methods: they are defined for eSet and other core classes, so these familiar operations work “out of the box”:

          subset(lungData, lungData$SampleType=="OW")
          lungData[, lungData$SampleType=="OW"]
          lungData[, 1:5]
          head(lungData)
  13. Inheritance of core methods: example 2. phyloseq cannot use these, so a custom subset_samples() function is defined instead:

          subset_samples(GlobalPatterns, SampleType=="Ocean")

      Square bracket subsetting, subset(), and head() are not defined for phyloseq objects, and the phyloseq class has no parent class to inherit them from:

          GlobalPatterns[, 1:5]
          ## Error in GlobalPatterns[, 1:5]: object of type 'S4' is not subsettable
          subset(GlobalPatterns, 1:5)
          ## Error in subset.default(GlobalPatterns, 1:5): 'subset' must be logical
  14. Relevance to multi-omics data analysis. The MultiAssayExperiment core class allows coordinated representation and management of an open-ended set of assays, as long as their data class provides basic methods:
      • dimnames()
      • [ subsetting
      • dim()
      • and preferably assay()
      MultiAssayExperiment data management is modeled on SummarizedExperiment but allows for multiple assays of different row and column dimensions.
  15. Relevance to multi-omics data analysis (cont’d). With no special accommodations, the lungData object “just works” in a MultiAssayExperiment:

          suppressPackageStartupMessages(library(MultiAssayExperiment))
          MultiAssayExperiment(list(lung=lungData))
          ## A MultiAssayExperiment object of 1 listed
          ##  experiment with a user-defined name and respective class.
          ##  Containing an ExperimentList class object of length 1:
          ##  [1] lung: MRexperiment with 51891 rows and 78 columns
          ## Features:
          ##  experiments() - obtain the ExperimentList instance
          ##  colData() - the primary/phenotype DataFrame
          ##  sampleMap() - the sample availability DataFrame
          ##  `$`, `[`, `[[` - extract colData columns, subset, or experiment
          ##  *Format() - convert into a long or wide DataFrame
          ##  assays() - convert ExperimentList to a SimpleList of matrices

      But GlobalPatterns does not, because it is not derived from a core class:

          MultiAssayExperiment(list(global=GlobalPatterns))
          ## Error in if (dim(object)[2] > 0 && is.null(colnames(object))) {: missing value where TRUE/FALSE needed
  16. Inheritance of core methods. These are not isolated examples. Full-time, professional software developers have developed many methods for core classes. Classes containing core classes get all of these for free, providing future advantages that you cannot possibly imagine in advance. For example, SummarizedExperiment has more than 100 methods defined!
  17. Inheritance of core methods (cont’d)

          suppressPackageStartupMessages(library(SummarizedExperiment))
          methods(class="SummarizedExperiment")
          ##   [1] !=                [                 [[                [[<-              [<-
          ##   [6] %in%              <                 <=                ==                >
          ##  [11] >=                $                 $<-               aggregate         anyNA
          ##  [16] append            as.character      as.complex        as.data.frame     as.env
          ##  [21] as.integer        as.list           as.logical        as.matrix         as.numeric
          ##  [26] as.raw            assay             assay<-           assayNames        assayNames<-
          ##  [31] assays            assays<-          by                cbind             coerce
          ##  [36] coerce<-          colData           colData<-         countOverlaps     dim
          ##  [41] dimnames          dimnames<-        duplicated        elementMetadata   elementMetadata<-
          ##  [46] eval              expand            expand.grid       extractROWS       findOverlaps
          ##  [51] head              intersect         is.na             length            lengths
          ##  [56] match             mcols             mcols<-           merge             metadata
          ##  [61] metadata<-        mstack            names             names<-           NROW
          ##  [66] overlapsAny       parallelSlotNames pcompare          rank              rbind
          ##  [71] realize           relist            rename            rep               rep.int
          ##  [76] replaceROWS       rev               rowData           rowData<-         ROWNAMES
          ##  [81] rowRanges<-       seqlevelsInUse    setdiff           setequal          shiftApply
          ##  [86] show              sort              split             split<-           subset
          ##  [91] subsetByOverlaps  table             tail              tapply            transform
          ##  [96] union             unique            updateObject      values            values<-
          ## [101] window            window<-          with              xtabs
          ## see '?methods' for accessing help and source code

      SummarizedExperiment also provides great functionality like out-of-the-box compatibility with on-disk data representation. USE AND DERIVE FROM THESE CLASSES!
  18. What are the “core” classes?
      • Rectangular feature x sample data (RNA-seq count matrix, microarray, …)
        – SummarizedExperiment::SummarizedExperiment()
      • Genomic coordinates (1-based, closed interval)
        – GenomicRanges::GRanges()
      • DNA / RNA / AA sequences
        – Biostrings::*StringSet()
      • Gene sets
        – GSEABase::GeneSet()
        – GSEABase::GeneSetCollection()
      • Multi-omics data
        – MultiAssayExperiment::MultiAssayExperiment()
      • Single cell data
        – SingleCellExperiment::SingleCellExperiment()
      • Mass spectrometry data
        – MSnbase::MSnExp()
      https://www.bioconductor.org/developers/how-to/commonMethodsAndClasses/
  19. Core classes represent years of work and maintenance by experienced developers. Bioconductor core team members:
      – Martin Morgan (Project Lead)
      – Hervé Pagès
      – James MacDonald
      – Valerie Obenchain
      – Andrzej Oleś
      – Marcel Ramos
      – Lori Shepherd
      – Nitesh Turaga
      – Daniel van Twisk
      So you can spend less time frame-building and more time building rocket boosters.
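
The inheritance argument of slides 10 to 13 can be sketched in base R alone, without any Bioconductor package installed. The class names ToyCore, ToyDerived and ToyStandalone are hypothetical stand-ins for eSet, MRexperiment and phyloseq. They are assumptions made for illustration and are much simpler than the real class definitions.

```r
library(methods)

# Hypothetical "core" class: a features x samples matrix.
setClass("ToyCore", representation(counts = "matrix"))

# Methods written once, for the core class only.
setMethod("dim", "ToyCore", function(x) dim(x@counts))
setMethod("[", "ToyCore", function(x, i, j, ..., drop = TRUE) {
    # initialize() keeps the object's own class and any extra slots;
    # the drop argument is ignored so the result is always a matrix.
    initialize(x, counts = x@counts[i, j, drop = FALSE])
})

# Hypothetical derived class: extends the core class with one extra slot.
setClass("ToyDerived", contains = "ToyCore",
         representation(note = "character"))

# Hypothetical standalone class: same data, but no parent class.
setClass("ToyStandalone", representation(counts = "matrix"))

m <- matrix(1:12, nrow = 4,
            dimnames = list(paste0("f", 1:4), paste0("s", 1:3)))
derived <- new("ToyDerived", counts = m, note = "example")
standalone <- new("ToyStandalone", counts = m)

dim(derived)          # method inherited from ToyCore
dim(derived[, 1:2])   # subsetting is inherited too, and keeps the class
dim(standalone)       # NULL: no dim() method was ever defined
```

ToyDerived defines no methods of its own, yet dim() and [ work on it, while ToyStandalone would need every method written by hand.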

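Slide 14 lists the basic methods that MultiAssayExperiment expects from each assay class. As a rough sketch of the work a class takes on when it extends nothing, the hypothetical ToyTable class below defines dim(), dimnames() and [ by hand, again in base R only. Whether these three methods alone would satisfy MultiAssayExperiment is not tested here; the example only illustrates the extra code a package author has to write and maintain.

```r
library(methods)

# Hypothetical standalone class, not derived from any core class.
setClass("ToyTable", representation(counts = "matrix"))

# Each basic method has to be supplied by the package author.
setMethod("dim", "ToyTable", function(x) dim(x@counts))
setMethod("dimnames", "ToyTable", function(x) dimnames(x@counts))
setMethod("[", "ToyTable", function(x, i, j, ..., drop = TRUE) {
    # drop is ignored so that the slot always stays a matrix
    new("ToyTable", counts = x@counts[i, j, drop = FALSE])
})

tab <- new("ToyTable",
           counts = matrix(0, nrow = 3, ncol = 2,
                           dimnames = list(c("a", "b", "c"), c("s1", "s2"))))

dim(tab)
colnames(tab)             # colnames() works through the dimnames() method
dim(tab[c("a", "c"), ])
```

Every further generic that users expect (head(), subset(), $ and so on) would need the same treatment, which is the maintenance cost the slides argue against.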