NextGen BigData Workloads in NextGen Sequencing - 20140402 - Phoenix - TGEN
 

Thoughts on upcoming big data workloads in the human genomics space.


  • The genomic position (x-axis) of probesets within a 6-megabase region centered on TTN, a gene known to be associated with LGMD2, is plotted against the Pearson correlation coefficient (y-axis) to a list of probesets targeting other genes known to be associated with LGMD2 (excluding TTN) across 11636 HG-U133_Plus_2 microarrays. Solid circles: probesets targeting TTN; [marker symbol not extracted]: probesets for genes of unknown function; open circles: probesets for known genes in the interval.

Presentation Transcript

  • Biomedical & Advertising Tech Overarching Themes* Eugenics & Determinism Free will vs. Determinism Media Tech & Privacy *Obligatory movie references… shout-out to my hometown LA
  • Biomedical Research Goal: Therapeutics => Diagnostics => Prognostics • Reverse engineer how genetic variation leads to (un)desired traits • Therapeutics => traditional medicine • Diagnostics => personalized medicine – NextGen public health – Requires hi-res mechanical knowledge • Prognostics => GATTACA (dys/eu)topia – Managed populations / NextGen eugenics
  • NextGen BigData Workloads in NextGen Sequencing Allen Day, PhD @MapR @allenday April 2014
  • Typical Plan, Phases 1-4 1. Design Experiment / Collect Biosamples 2. Sequencing / Molecular Assays 3. Data Management 4. ? ? ? 5. PROFIT ! ! ! ! http://knowyourmeme.com/memes/profit
  • The Changing Workload in Underpants Collection Sboner, et al, 2011. The real cost of sequencing: higher than you think!
  • Typical Plan, Phases 1-4 1. Design Experiment / Collect Biosamples 2. Sequencing / Molecular Assays 3. Data Management 4. ? ? ? 5. PROFIT ! ! ! !
  • Phase 4: Example GWAS/SNP Analysis • Find me related SNPs… – From other experiments • Given a phenotype… – And an associated SNP from my experiment • That elucidate genetic basis of phenotype… • And rank order them by impact/likelihood/etc
  • Phase 4: Example GWAS/SNP Analysis SELECT snp, expEvidence FROM myExp, exp1, … OUTER JOIN expN … WHERE myExp.snp = “mySnp” ORDER BY p, freq, conservation, etc
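A minimal sketch of how the query on this slide could be made concrete, using Python's built-in `sqlite3`. It assumes one hypothetical `evidence` table holding all experiments (the slide implies one table per experiment) and uses an inner self-join rather than the slide's OUTER JOIN; the table layout, column names, SNP labels and values are all made up for illustration.

```python
import sqlite3

def related_snps(rows, my_exp, my_snp):
    """Find SNPs from other experiments that share a phenotype with
    my_snp in my_exp, ranked by p-value, then by allele frequency."""
    con = sqlite3.connect(":memory:")
    # Hypothetical schema: one row per (experiment, SNP, phenotype) finding.
    con.execute(
        "CREATE TABLE evidence (exp TEXT, snp TEXT, phenotype TEXT, p REAL, freq REAL)"
    )
    con.executemany("INSERT INTO evidence VALUES (?, ?, ?, ?, ?)", rows)
    query = """
        SELECT other.snp, other.exp, other.p
        FROM evidence AS mine
        JOIN evidence AS other
          ON other.phenotype = mine.phenotype AND other.exp <> mine.exp
        WHERE mine.exp = ? AND mine.snp = ?
        ORDER BY other.p ASC, other.freq DESC
    """
    result = con.execute(query, (my_exp, my_snp)).fetchall()
    con.close()
    return result

# Illustrative data only.
ROWS = [
    ("myExp", "snpA", "height", 1e-8, 0.30),
    ("exp1", "snpB", "height", 1e-5, 0.10),
    ("exp2", "snpC", "height", 1e-9, 0.25),
    ("exp2", "snpD", "weight", 1e-7, 0.40),
]

if __name__ == "__main__":
    for snp, exp, p in related_snps(ROWS, "myExp", "snpA"):
        print(snp, exp, p)
```

Ranking by conservation or other annotations, as the slide suggests, would only need extra columns in the `ORDER BY` clause.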
  • Phase 4: Example GWAS/SNP Analysis • In context of – Racial background – Experimental design-specific concerns (e.g. familial IBD/IBS) – Environmental factors and penetrance – Assay-specific biases and noise SELECT snp, expEvidence FROM myExp, exp1, … OUTER JOIN expN … WHERE myExp.snp = “mySnp” ORDER BY p, freq, conservation, etc
  • SELECT snp, expEvidence FROM myExp, exp1, … OUTER JOIN expN … WHERE myExp.snp = “mySnp” ORDER BY p, freq, conservation, etc Phase 4: Example GWAS/SNP Analysis • In context of, e.g. – ε1: Racial, etc. background – ε2: Experimental design-specific concerns (e.g. familial IBD/IBS) – ε3: Environmental factors and penetrance – ε4: Assay-specific biases and noise phenotype = α·genotype + β + ε1 + ε2 + ε3 + ε4 At risk of over-simplification for business-level concept…
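The simplified model phenotype = α·genotype + β + ε can be sketched as an ordinary least-squares fit. This is only an illustration under stated assumptions: genotype is coded as an allele count (0, 1, 2), the four ε terms are lumped into a single residual per sample, and the function name `fit_linear` and the data are hypothetical.

```python
def fit_linear(genotypes, phenotypes):
    """Least-squares estimate of alpha and beta in
    phenotype = alpha * genotype + beta + residual."""
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_p = sum(phenotypes) / n
    # Sum of squared deviations of genotype, and genotype/phenotype cross-products.
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    sxy = sum((g - mean_g) * (p - mean_p) for g, p in zip(genotypes, phenotypes))
    alpha = sxy / sxx
    beta = mean_p - alpha * mean_g
    # Residuals stand in for the combined epsilon terms on the slide.
    residuals = [p - (alpha * g + beta) for g, p in zip(genotypes, phenotypes)]
    return alpha, beta, residuals

if __name__ == "__main__":
    genotypes = [0, 1, 2, 0, 1, 2]      # made-up allele counts
    phenotypes = [1.0, 3.0, 5.0, 1.0, 3.0, 5.0]  # made-up trait values
    print(fit_linear(genotypes, phenotypes))
```

Separating the residual into background, design, environment and assay components would require covariates for each, which this sketch does not model.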
  • Phase 4: Automated Insights Engine SELECT snp, expEvidence FROM myExp, exp1, … expN exps=powerset(all exps) OUTER JOIN complement(exps) WHERE myExp.snp = “mySnp” powerset(all SNPs, phenotypes) ORDER BY p, freq, conservation, etc arbitrary models SNPs, experimental groupings, assay technologies, assayed phenotypes, annotations/ontologies C. briggsae inbred strain compatibility [supplementary slide]
  • Phase 4: Automated Insights Engine SELECT snp, expEvidence FROM myExp, exp1, … expN exps=powerset(all exps) OUTER JOIN complement(exps) WHERE myExp.snp = “mySnp” powerset(all SNPs, phenotypes) ORDER BY p, freq, conservation, etc arbitrary models SNPs, experimental groupings, assay technologies, assayed phenotypes, annotations/ontologies
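The `exps=powerset(all exps)` and `complement(exps)` annotations can be read as enumerating every way of splitting the experiments into a chosen group and the rest. A small sketch of that enumeration, assuming this reading is what the slide intends; `groupings` is a hypothetical helper name.

```python
from itertools import chain, combinations

def powerset(items):
    """All subsets of items, from the empty set up to the full set."""
    items = list(items)
    return chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)
    )

def groupings(experiments):
    """Yield (chosen, complement) pairs for every subset of experiments."""
    all_exps = list(experiments)
    for chosen in powerset(all_exps):
        rest = tuple(e for e in all_exps if e not in chosen)
        yield chosen, rest

if __name__ == "__main__":
    for chosen, rest in groupings(["exp1", "exp2", "exp3"]):
        print(chosen, "vs", rest)
```

With n experiments this yields 2^n groupings, so the number of candidate comparisons doubles with every experiment added.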
  • Phase 4: Automated Insights Engine Right… SNPs, experimental groupings, assay technologies, assayed phenotypes, annotations/ontologies
  • Phase 4: Naïve Implementation SNPs, experimental groupings, assay technologies, assayed phenotypes, annotations/ontologies Big compute It’s monolithic It scales polynomially with data size This is bad, it takes too long to get a result
  • Co-expression (10K samples) and Linkage Gene Annotation / Set Completion [Figure: co-expression matrix with both axes labeled by gene: BMP6, BMP2, MMP3, LIF, NOS2A, MMP13, CSPG4, ACAN, COL11A2, COL9A1, MATN1, LECT1, MATN4, HAPLN1, ITGA10, EDIL3, NGF, MAST4, MATN3, EPYC, COL11A1, COL10A1, THBS3, C1QTNF3, WISP1, PDPN, PDLIM4, CHST3, MIA, SOX5, CYTL1, TNMD, AKR1C1, MMP12, ETNK1, RELA, FOSL1, EIF2C2, NUPL1, RLF, RELB, SOD2, RNF24, XYLT1, HAS2, BDKRB1, HSPC159, SLC28A3, FZD10] Day. 2009. Disease gene characterization through large-scale co-expression analysis. http://www.ncbi.nlm.nih.gov/pubmed/20046828
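A co-expression matrix like the one on this slide is built from pairwise Pearson correlations between gene expression profiles across samples. The following is a minimal sketch, not the method of the cited paper; the gene labels and expression values are invented for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def coexpression(profiles):
    """Correlate every unordered pair of genes.
    profiles maps gene name -> expression values, one per sample."""
    genes = sorted(profiles)
    return {
        (g, h): pearson(profiles[g], profiles[h])
        for i, g in enumerate(genes)
        for h in genes[i + 1:]
    }

# Made-up profiles over four samples.
PROFILES = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}

if __name__ == "__main__":
    for pair, r in coexpression(PROFILES).items():
        print(pair, round(r, 3))
```

For n genes this computes n(n-1)/2 correlations, each over all samples, which is why the all-against-all version grows quickly with data size.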
  • Better: Push Logic to Phase 3 SNPs, experimental groupings, assay technologies, assayed phenotypes, annotations/ontologies Denormalize and Percolate (re)prioritize & (re)process service queries drive dashboards create reports denormalize for display buffer New models
  • What’s a Percolator? • Google Percolator – “Caffeine” update 2010 • Iterative, incremental updates • No batch processing • Decouple computation from data size Peng & Dabek, 2010. Large-scale Incremental Processing Using Distributed Transactions and Notifications
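The incremental, notification-driven idea can be sketched in a few lines. This is a toy in the spirit of the bullets above, not Percolator's actual API and without its distributed transactions; `TinyPercolator`, `make_running_total` and the row and column names are all hypothetical.

```python
from collections import defaultdict

class TinyPercolator:
    """In-memory table whose writes notify observers of a column."""

    def __init__(self):
        self.cells = {}                      # (row, column) -> value
        self.observers = defaultdict(list)   # column -> callbacks

    def observe(self, column, callback):
        """Register a callback to run whenever `column` is written."""
        self.observers[column].append(callback)

    def write(self, row, column, value):
        """Store a value, then notify observers with old and new values."""
        old = self.cells.get((row, column))
        self.cells[(row, column)] = value
        for callback in self.observers[column]:
            callback(self, row, old, value)

def make_running_total(target_row, target_column):
    """Observer that keeps a derived total up to date incrementally,
    touching only the changed cell instead of rescanning the table."""
    def on_change(store, row, old, new):
        current = store.cells.get((target_row, target_column), 0)
        store.cells[(target_row, target_column)] = current - (old or 0) + new
    return on_change

if __name__ == "__main__":
    store = TinyPercolator()
    store.observe("allele_count", make_running_total("summary", "total"))
    store.write("sample1", "allele_count", 2)
    store.write("sample2", "allele_count", 1)
    store.write("sample1", "allele_count", 0)   # an update, not a new row
    print(store.cells[("summary", "total")])
```

Each write costs the same regardless of how many rows already exist, which is the sense in which computation is decoupled from data size.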
  • What’s in the Percolator? • Optimize for access patterns, maybe many tables – Dependency graph of intermediate matrices – AT = correlate(transpose(A)) • Parallelize table computation – Twitter Algebird • Analysis of intermediates triggers downstream action – Codify business logic (research methods) into data management layer – Prioritize and minimize unproductive computation Denormalize and Percolate (re)prioritize & (re)process https://github.com/twitter/algebird http://www.michael-noll.com/blog/2013/12/02/twitter-algebird-monoid-monad-for-large-scala-d
  • What’s in the Percolator? • Optimize for access patterns, maybe many tables – Dependency graph of intermediate matrices – AT = correlate(transpose(A)) • Parallelize table computation – Twitter Algebird • Analysis of intermediates triggers downstream action – Codify business logic (research methods) into data management layer – Prioritize and minimize unproductive computation Denormalize and Percolate (re)prioritize & (re)process MapR M7 especially suitable – services complex multi-tenant workloads at very large scale, see http://www.slideshare.net/allenday/20131212-sydney-big-data-analytics
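Parallelizing the table computation relies on summaries that can be merged associatively, which is the kind of structure Algebird provides for Scala. The sketch below is a Python analogue of that idea, not Algebird's API: it keeps sufficient statistics for a correlation as a tuple that can be computed per partition and added together. All names and data are illustrative.

```python
import math
from functools import reduce

# Identity element: (n, sum_x, sum_y, sum_xx, sum_yy, sum_xy).
ZERO = (0, 0.0, 0.0, 0.0, 0.0, 0.0)

def lift(x, y):
    """Turn one (x, y) observation into a summary tuple."""
    return (1, x, y, x * x, y * y, x * y)

def plus(a, b):
    """Associative, commutative merge of two summaries."""
    return tuple(p + q for p, q in zip(a, b))

def summarize(pairs):
    """Summarize one partition of (x, y) observations."""
    return reduce(plus, (lift(x, y) for x, y in pairs), ZERO)

def correlation(summary):
    """Pearson correlation recovered from a merged summary."""
    n, sx, sy, sxx, syy, sxy = summary
    cov = n * sxy - sx * sy
    return cov / math.sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))

if __name__ == "__main__":
    part1 = [(1.0, 2.0), (2.0, 4.0)]   # made-up partition 1
    part2 = [(3.0, 6.0), (4.0, 8.0)]   # made-up partition 2
    merged = plus(summarize(part1), summarize(part2))
    print(correlation(merged))
```

Because `plus` is associative, partitions can be summarized on different machines, and a newly arrived batch can be folded into an existing summary without reprocessing old data.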
  • Double Percolator • Apologies, Google Images yields no SFW images for – “Double Percolator”
  • ENCODE http://www.nature.com/news/encode-the-human-encyclopaedia-1.11312
  • Data Generation e.g. basic research Data Analysis e.g. pharma Control channel Clinical / Patient consumption
  • Data Generation e.g. basic research Data Analysis e.g. pharma Clinical / Patient consumption Control channel
  • Robot Scientist Sparkes, et al. 2010. Towards Robot Scientists for autonomous scientific discovery
  • Robot (Data?) Scientist Sparkes, et al. 2010. Towards Robot Scientists for autonomous scientific discovery
  • If they were unlabeled, would you know which is which? Friend. 2010. The Need for Precompetitive Integrative Bionetwork Disease Model Building NPR. 2011. The Search For Analysts To Make Sense Of 'Big Data' http://www.npr.org/2011/11/30/142893065
  • If they were unlabeled, would you know which is which? Friend. 2010. The Need for Precompetitive Integrative Bionetwork Disease Model Building • Identify network structures • Label them • Observe stimulus=>response space mapping • Purposefully target • PROFIT ! ! ! !
  • If they were unlabeled, would you know which is which? Friend. 2010. The Need for Precompetitive Integrative Bionetwork Disease Model Building • Identify network structures • Label them • Observe stimulus=>response space mapping • Purposefully target • PROFIT ! ! ! ! Parallels to Twitter revenue model social network node labeling => gene annotation Google Knowledge Graph => bio-ontologies Ad impressions => small molecule perturbation Profit => Save lives http://www.google.com/insidesearch/features/search/knowledge.html http://www.bioontology.org/
  • Dendrite on M7 HBase Denormalize and Percolate (re)prioritize & (re)process MapR M7 HBase Titan API Percolation “business logic” Dendrite Visualization & ad-hoc queries detailed view…
  • Further Reading MapR M7 especially suitable – services complex multi-tenant workloads at very large scale, see @allenday deck: http://www.slideshare.net/allenday/20131212-sydney-big-data-analytics Implementing matrix transforms + business logic workflows, see @ceteri “Enterprise Data Workflows with Cascading”: http://shop.oreilly.com/product/0636920028536.do Math and data structure underpinnings, see @ceteri and @allenday “Just Enough Math”: http://liber118.com/pxn/course/itml/just_enough_math.html Denormalize and Percolate (re)prioritize & (re)process
  • Further Reading Day, et al. 2007. Celsius: a community resource for Affymetrix microarray data. http://www.ncbi.nlm.nih.gov/pubmed/17570842 Human Genetics & Big Data http://www.slideshare.net/allenday/20131212-sydney-garvan-institute-human-genetics-and-big-data Denormalize and Percolate (re)prioritize & (re)process
  • Next Topic: Optimizing 1º Analysis Sboner, et al, 2011. The real cost of sequencing: higher than you think! <= We were just here “future high ROI use cases” <= We now go here “current high ROI use cases”
  • 1º seq analysis, in a nutshell
  • Crossbow Langmead, et al. 2009. Searching for SNPs with cloud computing
  • 1º seq analysis, format details .fastq .bam .vcf short read alignment genotype calling analysis
  • 1º seq analysis, map-reduce style .fastq .bam .vcf short read alignment genotype calling MAP MAP REDUCE, rotate matrix 90º ref seq
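The map and reduce steps on this slide can be sketched with a toy example: MAP places each read on the reference and emits (position, base) pairs, and REDUCE regroups them by position (the "rotate 90º" step, from per-read rows to per-position columns) before calling genotypes. This is an illustrative sketch only, using brute-force Hamming-distance alignment and majority-vote calling; it does not read or write real .fastq, .bam or .vcf files, and all names and sequences are made up.

```python
from collections import Counter, defaultdict

def align(read, reference, max_mismatches=1):
    """Brute-force alignment: best start position by Hamming distance."""
    best = None
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        mismatches = sum(1 for a, b in zip(read, window) if a != b)
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (start, mismatches)
    return None if best is None else best[0]

def map_read(read, reference):
    """MAP: emit (reference position, observed base) for an aligned read."""
    start = align(read, reference)
    if start is None:
        return []
    return [(start + i, base) for i, base in enumerate(read)]

def reduce_pileup(pairs):
    """REDUCE: group observed bases by position (reads -> columns)."""
    pileup = defaultdict(Counter)
    for position, base in pairs:
        pileup[position][base] += 1
    return pileup

def call_variants(reads, reference, min_depth=2):
    """Report positions where the majority base differs from the reference."""
    pairs = [pair for read in reads for pair in map_read(read, reference)]
    calls = {}
    for position, counts in sorted(reduce_pileup(pairs).items()):
        base, depth = counts.most_common(1)[0]
        if base != reference[position] and depth >= min_depth:
            calls[position] = (reference[position], base)
    return calls

# Made-up reference and reads; three reads carry a C->G change at position 5.
REFERENCE = "GATTACACCGTA"
READS = ["GATTAC", "TTAGAC", "TAGACC", "AGACCG"]

if __name__ == "__main__":
    print(call_variants(READS, REFERENCE))
```

Each read is mapped independently and each position is reduced independently, so both stages parallelize across machines.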
  • Ion Flux • Sequencing workflow in MapReduce (Hadoop, Cascading, Amazon Elastic M/R) • Integrated with Ion Torrent as a plugin to stream sequence to the cloud • Emphasis on scalability and latency – assay->clinical report turnaround in < 24h • Compare to fast-follower stack ILMN MiSeq+BaseSpace http://aws.amazon.com/solutions/case-studies/ion-flux/ http://d36cz9buwru1tt.cloudfront.net/Ion-Flux-2011-02-Architecture.pdf
  • SeqWare / Nimbus Informatics O’Connor, et al. 2010. SeqWare Query Engine: storing and searching sequence data in the cloud http://seqware.github.io/
  • MapR Advantage for R&D • Home directories on DFS – NFS. Transparent to user • Low prototype-cost / research support – Scale prototype as needed • Low transition cost / operationalizing research – Prototype incrementally becomes a product • Low operational cost / high machine utilization – Leverage MapR performance
  • THANKS!
  • C. briggsae inbred line incompatibility Ross, et al. 2011. Caenorhabditis briggsae Recombinant Inbred Line Genotypes Reveal Inter-Strain Incompatibility and the Evolution of Recombination
  • [Figure: self joins of an eQTLs × Samples matrix, giving Samples × Samples and eQTLs × eQTLs matrices]
  • Incidence matrices A (U*Q) and B (U*V) [Figure axes: Users, Query Terms; Users, Clicked Videos] Query Term = Clicked Term
  • Join on dimension U… [Figure axes: Query Terms, Users]
  • Relate Q to V [Figure axes: Query Terms, Users]
  • Cross-recommendation [Figure axes: Query Terms, Clicked Videos]
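Joining the two incidence matrices on the shared user dimension amounts to the product of A transposed with B, which counts, for each query term and video, the users who did both. A minimal pure-Python sketch, assuming A is users × query terms and B is users × clicked videos as on the slides; the matrices are made up.

```python
def transpose(matrix):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(column) for column in zip(*matrix)]

def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    columns = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in columns] for row in a]

def cross_recommend(a, b):
    """Join A (users x query terms) with B (users x videos) on users.
    Entry [q][v] counts users who issued query q and clicked video v."""
    return matmul(transpose(a), b)

# Three users, two query terms, two videos (illustrative 0/1 incidence).
A = [[1, 0],
     [1, 1],
     [0, 1]]
B = [[1, 0],
     [1, 0],
     [0, 1]]

if __name__ == "__main__":
    print(cross_recommend(A, B))
```

The self joins on the earlier eQTL slide have the same shape: multiplying a matrix by its own transpose relates samples to samples or eQTLs to eQTLs.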