Lightweight data engineering,
tools, and approaches to facilitate
data reuse and data science
Sean Davis, MD, PhD
National Cancer Institute, National Institutes of Health
AIDR 2019, Carnegie Mellon University
https://seandavi.github.io
@seandavis12 http://bit.ly/SD-AIDR2019
Data Engineering: a Definition
https://en.wikipedia.org/wiki/data_engineering
301 Redirect
?????
Background and motivation
https://datascience.nih.gov/sites/default/files/NIH_Strategic_Plan_for_Data_Science_Final_508.pdf
FAIR data standards
https://www.force11.org/group/fairgroup/fairprinciples
Is FAIR data enough?
● FAIR data standards and guidelines are general; implementation can be
challenging.
● FAIR data are not always fit-for-use data.
● FAIR data standards are somewhat context-dependent and are most easily
applied within a homogeneous community of data users and providers.
● FAIR data are not always usable data.
● FAIR data are not always useful data.
● [OFF TOPIC]: expense, maintenance, metrics, deprecation, augmentation
and correction, provenance over time, ownership and IP
Health
In the context of healthcare, there is a virtual
imperative that we learn to value and use our data
ethically and effectively. The promise of a learning
healthcare system depends on this charge….
Big Biological data
● Acquisition
● Storage
● Distribution
● Analysis
● Integration
https://doi.org/10.1371/journal.pbio.1002195
Comparison of four domains of big data in 2025
https://doi.org/10.1371/journal.pbio.1002195.t001
Disconnected and
Distributed
Baby steps: Data reuse reframed
as reproducible research
Genomics and Bioconductor as a use case….
Leek, Jeffrey T., and Roger D. Peng. 2015. https://doi.org/10.1073/pnas.1421412111
Reproducibility, the ability to recompute results, and
replicability, the chances other experimenters will achieve a
consistent result, are two foundational characteristics of
successful scientific research...of late there has been a crisis
of confidence among researchers worried about the rate at
which studies are either reproducible or replicable. In order to
maintain the integrity of science research and maintain the
public's trust in science, the scientific community must ensure
reproducibility and replicability by engaging in a more
preventative approach that greatly expands data analysis
education and routinely employs software tools.
software
encodes
knowledge
Bioconductor is a large, NIH-funded
open source software community
dedicated to the analysis and
comprehension of high throughput
biological data.
[Slide: alphabetical wall of Bioconductor software package names, for example affy, Biobase, edgeR, GenomicRanges, GEOquery, limma]
Bioconductor: Education, Training, and Community
Core infrastructure Community contribution
Building Bridges
~300,000 users, 1000 active
developers, >1700 interop software
tools and analysis packages
112,837 datasets, 3,031,406 samples
publicly accessible
Building bridges
From this….
To this…
http://bit.ly/2W3l8dK
Understand, expose, and adapt to
your data ecosystem and community
https://gdc.cancer.gov/files/public/image/Data_Model_Oobleck.png
Baby steps: Reusable components
Where necessary, build
new software or tools
● Recognize complexity in
high-throughput biological data
● Version control everything
● Continuous testing and
integration
● Text-based workflow (no GUI)
● Literate programming
approaches and documentation
● Education on tooling
● Numerous mechanisms for
FAIR data sharing
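
As one illustration of the text-based, version-controlled, continuously tested approach listed above, here is a minimal sketch. It is written in Python purely for illustration; the talk's own ecosystem is R and Bioconductor. The function name `clean_sample_record`, the field names, and the example record are hypothetical and are not taken from any Bioconductor package.

```python
"""A small, scriptable data-cleaning step plus a test that CI could run."""


def clean_sample_record(record):
    """Return a copy of a sample record with normalized keys and values.

    Keys are lower-cased with spaces replaced by underscores; string values
    are stripped of surrounding whitespace; empty strings become None.
    """
    cleaned = {}
    for key, value in record.items():
        new_key = key.strip().lower().replace(" ", "_")
        if isinstance(value, str):
            value = value.strip() or None
        cleaned[new_key] = value
    return cleaned


def test_clean_sample_record():
    """Plain assertion-based test; a CI service could run it on every commit."""
    raw = {"Sample ID": " S1 ", "Tissue": "lung", "Notes": "  "}
    assert clean_sample_record(raw) == {
        "sample_id": "S1",
        "tissue": "lung",
        "notes": None,
    }


if __name__ == "__main__":
    test_clean_sample_record()
    print("all checks passed")
```

Because both the step and its test are plain text, they can live in version control next to the data description and be rerun by anyone, with no GUI involved.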
Core value: reuse and interoperability
Where possible reuse existing tools, leverage existing
communities
The R package: the unit of sharing in the R community
Published, versioned, documented, tested, measured
Maximize the impact (and reward) of curation efforts
Publish the how….
Additional thoughts
Source: Monica Rogati https://hackernoon.com/the-ai-hierarchy-of-needs-18f111fcc007
Metadata are data
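
One way to read this slide is that sample annotations can be stored, queried, and tested like any other table. The following is a minimal sketch under that assumption, in Python for illustration only. The records and the helper `select_samples` are hypothetical and do not come from the talk.

```python
def select_samples(metadata, **criteria):
    """Return the metadata rows whose fields match every given criterion."""
    return [
        row for row in metadata
        if all(row.get(field) == wanted for field, wanted in criteria.items())
    ]


# Hypothetical sample annotations, one dict per sample.
metadata = [
    {"sample": "S1", "tissue": "lung", "assay": "RNA-seq"},
    {"sample": "S2", "tissue": "liver", "assay": "RNA-seq"},
    {"sample": "S3", "tissue": "lung", "assay": "microarray"},
]

# Select samples by annotation rather than by file name or position.
lung_rnaseq = select_samples(metadata, tissue="lung", assay="RNA-seq")
print([row["sample"] for row in lung_rnaseq])
```

Treating annotations as structured records makes sample selection explicit and repeatable, which is the same property the talk asks of the primary data.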
Stakeholders or Data Communities
Consumers, Engineers, Producers, Researchers, Public, Funders, Donors, Analysts
Commodity big data
technologies
Education: we are all learning and teaching….
We, not they
Technical Advisory Group
Vince Carey, Brigham & Women’s, Harvard Medical School, USA.
Aedin Culhane, Dana-Farber Cancer Institute, Harvard School of Public Health, USA.
Sean Davis, National Cancer Institute, USA.
Robert Gentleman, Computational Biology, 23andMe, USA.
Kasper Daniel Hansen, Bloomberg School of Public Health, Johns Hopkins University.
Wolfgang Huber, European Molecular Biology Laboratory, Heidelberg, Germany.
Rafael Irizarry, Dana-Farber Cancer Institute, USA.
Michael Lawrence, Genentech Research and Early Development, USA.
Levi Waldron, CUNY School of Public Health at Hunter College, New York, NY.
Questions? https://bioconductor.org
https://seandavi.github.io
Vip Call Girls Anna Salai Chennai 👉 8250192130 ❣️💯 Top Class Girls AvailableVip Call Girls Anna Salai Chennai 👉 8250192130 ❣️💯 Top Class Girls Available
Vip Call Girls Anna Salai Chennai 👉 8250192130 ❣️💯 Top Class Girls AvailableNehru place Escorts
 

Recently uploaded (20)

Kesar Bagh Call Girl Price 9548273370 , Lucknow Call Girls Service
Kesar Bagh Call Girl Price 9548273370 , Lucknow Call Girls ServiceKesar Bagh Call Girl Price 9548273370 , Lucknow Call Girls Service
Kesar Bagh Call Girl Price 9548273370 , Lucknow Call Girls Service
 
Call Girls Service Pune Vaishnavi 9907093804 Short 1500 Night 6000 Best call ...
Call Girls Service Pune Vaishnavi 9907093804 Short 1500 Night 6000 Best call ...Call Girls Service Pune Vaishnavi 9907093804 Short 1500 Night 6000 Best call ...
Call Girls Service Pune Vaishnavi 9907093804 Short 1500 Night 6000 Best call ...
 
Call Girls Service Surat Samaira ❤️🍑 8250192130 👄 Independent Escort Service ...
Call Girls Service Surat Samaira ❤️🍑 8250192130 👄 Independent Escort Service ...Call Girls Service Surat Samaira ❤️🍑 8250192130 👄 Independent Escort Service ...
Call Girls Service Surat Samaira ❤️🍑 8250192130 👄 Independent Escort Service ...
 
Call Girls In Andheri East Call 9920874524 Book Hot And Sexy Girls
Call Girls In Andheri East Call 9920874524 Book Hot And Sexy GirlsCall Girls In Andheri East Call 9920874524 Book Hot And Sexy Girls
Call Girls In Andheri East Call 9920874524 Book Hot And Sexy Girls
 
VIP Call Girls Pune Sanjana 9907093804 Short 1500 Night 6000 Best call girls ...
VIP Call Girls Pune Sanjana 9907093804 Short 1500 Night 6000 Best call girls ...VIP Call Girls Pune Sanjana 9907093804 Short 1500 Night 6000 Best call girls ...
VIP Call Girls Pune Sanjana 9907093804 Short 1500 Night 6000 Best call girls ...
 
💎VVIP Kolkata Call Girls Parganas🩱7001035870🩱Independent Girl ( Ac Rooms Avai...
💎VVIP Kolkata Call Girls Parganas🩱7001035870🩱Independent Girl ( Ac Rooms Avai...💎VVIP Kolkata Call Girls Parganas🩱7001035870🩱Independent Girl ( Ac Rooms Avai...
💎VVIP Kolkata Call Girls Parganas🩱7001035870🩱Independent Girl ( Ac Rooms Avai...
 
Russian Call Girls in Delhi Tanvi ➡️ 9711199012 💋📞 Independent Escort Service...
Russian Call Girls in Delhi Tanvi ➡️ 9711199012 💋📞 Independent Escort Service...Russian Call Girls in Delhi Tanvi ➡️ 9711199012 💋📞 Independent Escort Service...
Russian Call Girls in Delhi Tanvi ➡️ 9711199012 💋📞 Independent Escort Service...
 
VIP Call Girls Pune Vrinda 9907093804 Short 1500 Night 6000 Best call girls S...
VIP Call Girls Pune Vrinda 9907093804 Short 1500 Night 6000 Best call girls S...VIP Call Girls Pune Vrinda 9907093804 Short 1500 Night 6000 Best call girls S...
VIP Call Girls Pune Vrinda 9907093804 Short 1500 Night 6000 Best call girls S...
 
CALL ON ➥9907093804 🔝 Call Girls Hadapsar ( Pune) Girls Service
CALL ON ➥9907093804 🔝 Call Girls Hadapsar ( Pune)  Girls ServiceCALL ON ➥9907093804 🔝 Call Girls Hadapsar ( Pune)  Girls Service
CALL ON ➥9907093804 🔝 Call Girls Hadapsar ( Pune) Girls Service
 
(👑VVIP ISHAAN ) Russian Call Girls Service Navi Mumbai🖕9920874524🖕Independent...
(👑VVIP ISHAAN ) Russian Call Girls Service Navi Mumbai🖕9920874524🖕Independent...(👑VVIP ISHAAN ) Russian Call Girls Service Navi Mumbai🖕9920874524🖕Independent...
(👑VVIP ISHAAN ) Russian Call Girls Service Navi Mumbai🖕9920874524🖕Independent...
 
Bangalore Call Girls Nelamangala Number 7001035870 Meetin With Bangalore Esc...
Bangalore Call Girls Nelamangala Number 7001035870  Meetin With Bangalore Esc...Bangalore Call Girls Nelamangala Number 7001035870  Meetin With Bangalore Esc...
Bangalore Call Girls Nelamangala Number 7001035870 Meetin With Bangalore Esc...
 
sauth delhi call girls in Bhajanpura 🔝 9953056974 🔝 escort Service
sauth delhi call girls in Bhajanpura 🔝 9953056974 🔝 escort Servicesauth delhi call girls in Bhajanpura 🔝 9953056974 🔝 escort Service
sauth delhi call girls in Bhajanpura 🔝 9953056974 🔝 escort Service
 
Best Rate (Hyderabad) Call Girls Jahanuma ⟟ 8250192130 ⟟ High Class Call Girl...
Best Rate (Hyderabad) Call Girls Jahanuma ⟟ 8250192130 ⟟ High Class Call Girl...Best Rate (Hyderabad) Call Girls Jahanuma ⟟ 8250192130 ⟟ High Class Call Girl...
Best Rate (Hyderabad) Call Girls Jahanuma ⟟ 8250192130 ⟟ High Class Call Girl...
 
Call Girls Service Jaipur Grishma WhatsApp ❤8445551418 VIP Call Girls Jaipur
Call Girls Service Jaipur Grishma WhatsApp ❤8445551418 VIP Call Girls JaipurCall Girls Service Jaipur Grishma WhatsApp ❤8445551418 VIP Call Girls Jaipur
Call Girls Service Jaipur Grishma WhatsApp ❤8445551418 VIP Call Girls Jaipur
 
Bangalore Call Girls Majestic 📞 9907093804 High Profile Service 100% Safe
Bangalore Call Girls Majestic 📞 9907093804 High Profile Service 100% SafeBangalore Call Girls Majestic 📞 9907093804 High Profile Service 100% Safe
Bangalore Call Girls Majestic 📞 9907093804 High Profile Service 100% Safe
 
Call Girls Service Bellary Road Just Call 7001305949 Enjoy College Girls Service
Call Girls Service Bellary Road Just Call 7001305949 Enjoy College Girls ServiceCall Girls Service Bellary Road Just Call 7001305949 Enjoy College Girls Service
Call Girls Service Bellary Road Just Call 7001305949 Enjoy College Girls Service
 
Artifacts in Nuclear Medicine with Identifying and resolving artifacts.
Artifacts in Nuclear Medicine with Identifying and resolving artifacts.Artifacts in Nuclear Medicine with Identifying and resolving artifacts.
Artifacts in Nuclear Medicine with Identifying and resolving artifacts.
 
High Profile Call Girls Coimbatore Saanvi☎️ 8250192130 Independent Escort Se...
High Profile Call Girls Coimbatore Saanvi☎️  8250192130 Independent Escort Se...High Profile Call Girls Coimbatore Saanvi☎️  8250192130 Independent Escort Se...
High Profile Call Girls Coimbatore Saanvi☎️ 8250192130 Independent Escort Se...
 
Call Girls Yelahanka Bangalore 📲 9907093804 💞 Full Night Enjoy
Call Girls Yelahanka Bangalore 📲 9907093804 💞 Full Night EnjoyCall Girls Yelahanka Bangalore 📲 9907093804 💞 Full Night Enjoy
Call Girls Yelahanka Bangalore 📲 9907093804 💞 Full Night Enjoy
 
Vip Call Girls Anna Salai Chennai 👉 8250192130 ❣️💯 Top Class Girls Available
Vip Call Girls Anna Salai Chennai 👉 8250192130 ❣️💯 Top Class Girls AvailableVip Call Girls Anna Salai Chennai 👉 8250192130 ❣️💯 Top Class Girls Available
Vip Call Girls Anna Salai Chennai 👉 8250192130 ❣️💯 Top Class Girls Available
 

Lightweight data engineering, tools, and software to facilitate data reuse and data science

  • 1. Lightweight data engineering, tools, and approaches to facilitate data reuse and data science. Sean Davis, MD, PhD, National Cancer Institute, National Institutes of Health. AIDR 2019, Carnegie Mellon University. https://seandavi.github.io @seandavis12 http://bit.ly/SD-AIDR2019
  • 2. Data Engineering: a Definition https://en.wikipedia.org/wiki/data_engineering 301 Redirect ?????
  • 7. Is FAIR data enough? ● FAIR data standards and guidelines are general; implementation can be challenging. ● FAIR data are not always fit-for-use data. ● FAIR data standards are somewhat context-dependent and are most easily applied within a homogeneous community of data users and providers. ● FAIR data are not always usable data. ● FAIR data are not always useful data. ● [OFF TOPIC]: expense, maintenance, metrics, deprecation, augmentation and correction, provenance over time, ownership and IP
  • 8. Health In the context of healthcare, there is a virtual imperative that we learn to value and use our data ethically and effectively. The promise of a learning healthcare system depends on this charge….
  • 9. Big Biological data ● Acquisition ● Storage ● Distribution ● Analysis ● Integration https://doi.org/10.1371/journal.pbio.1002195
  • 10. Comparison of four domains of big data in 2025 https://doi.org/10.1371/journal.pbio.1002195.t001
  • 12. Baby steps: Data reuse reframed as reproducible research Genomics and Bioconductor as a use case….
  • 13. Leek, Jeffrey T., and Roger D. Peng. 2015. https://doi.org/10.1073/pnas.1421412111
  • 14. Reproducibility, the ability to recompute results, and replicability, the chances other experimenters will achieve a consistent result, are two foundational characteristics of successful scientific research...of late there has been a crisis of confidence among researchers worried about the rate at which studies are either reproducible or replicable. In order to maintain the integrity of science research and maintain the public's trust in science, the scientific community must ensure reproducibility and replicability by engaging in a more preventative approach that greatly expands data analysis education and routinely employs software tools.
  • 16. Bioconductor is a large, NIH-funded open source software community dedicated to the analysis and comprehension of high throughput biological data.
  • 17. [Slide: alphabetical wall of Bioconductor package names (a4, a4Base, a4Classif, … ShortRead, siggenes, …); the list is cut off in the source.]
  • 18. Bioconductor: Education, Training, and Community Core infrastructure Community contribution
  • 19. Building Bridges: ~300,000 users; 1,000 active developers; >1,700 interoperable software tools and analysis packages; 112,837 datasets and 3,031,406 samples publicly accessible
  • 25. Understand, expose, and adapt to your data ecosystem and community https://gdc.cancer.gov/files/public/image/Data_Model_Oobleck.png
  • 28. Baby steps: Reusable components
  • 29. Where necessary, build new software or tools ● Recognize complexity in high-throughput biological data ● Version control everything ● Continuous testing and integration ● Text-based workflow (no GUI) ● Literate programming approaches and documentation ● Education on tooling ● Numerous mechanisms for FAIR data sharing
  • 30. Core value: reuse and interoperability
  • 31. Where possible, reuse existing tools and leverage existing communities
  • 32. The R package: the unit of sharing in the R community
  • 34. Maximize the impact (and reward) of curation efforts
  • 40. Source: Monica Rogati https://hackernoon.com/the-ai-hierarchy-of-needs-18f111fcc007
  • 45. Education: we are all learning and teaching….
  • 47. Technical Advisory Group: Vince Carey, Brigham & Women’s, Harvard Medical School, USA. Aedin Culhane, Dana-Farber Cancer Institute, Harvard School of Public Health, USA. Sean Davis, National Cancer Institute, USA. Robert Gentleman, Computational Biology, 23andMe, USA. Kasper Daniel Hansen, Bloomberg School of Public Health, Johns Hopkins University. Wolfgang Huber, European Molecular Biology Laboratory, Heidelberg, Germany. Rafael Irizarry, Dana-Farber Cancer Institute, USA. Michael Lawrence, Genentech Research and Early Development, USA. Levi Waldron, CUNY School of Public Health at Hunter College, New York, NY.