
Managing Genomes At Scale: What We Learned - StampedeCon 2014

At StampedeCon 2014, Rob Long (Monsanto) presented "Managing Genomes At Scale: What We Learned."

Monsanto generates large amounts of genomic sequence data every year. Agronomists and other scientists use this data as input for predictive analytics to aid breeding and the discovery of new traits such as disease or drought resistance. To enable the broadest possible use of this valuable data, scientists would like to query genomic data by species, chromosome, position, and myriad other categories. We present our solutions to these problems, as realized on top of HBase at Monsanto. We will discuss the lessons we learned about flat/wide vs. tall/narrow HBase schema design, preprocessing and caching windows of data for use in web-based visualizations, approaches to complex multi-join queries across deep data sets, and distributed indexing via SolrCloud.



  1. Managing Genomes at Scale: What We Learned. Rob Long, Big Data Engineer, Monsanto. Twitter: @plantimals. 5-29-14
  2. Big Data At Monsanto: What do we mean when we say “big data”?
  3. Monsanto is a (Big) Data Company
  4. A-Maizing
  5. Molecular Machinery
  6. A Reference
  7. Genomic Data Warehouse
  8. Architecture [diagram: Application Layer with Data Services and Edge Nodes; Hadoop Cluster holding Genomic Data in HBase with the Hadoop MapReduce engine and a Compute Farm; genomic data queries; Unstructured Query via Solr Indexes; Lineage Graph DB]
  9. VCF (Variant Call Format)
     ##fileformat=VCFv4.1
     ##fileDate=20090805
     ##source=myImputationProgramV3.1
     ##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
     ##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
     ##phasing=partial
     ##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
     ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
     ##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
     ##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
     ##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
     ##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
     ##FILTER=<ID=q10,Description="Quality below 10">
     ##FILTER=<ID=s50,Description="Less than 50% of samples have data">
     ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
     ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
     ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
     ##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
     #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001
     20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51
     20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50
     20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27
     20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60
     20 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4
     20 14370 rs6054257 G A 29 PASS …
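The record layout above (fixed columns, then a FORMAT column keying the per-sample fields) can be read with a few string splits. This is an illustrative sketch only, not the parser used in the talk; production pipelines would use a dedicated library.

```python
# Minimal parser for one tab-separated VCF data line (illustrative sketch).
def parse_vcf_line(line):
    """Split a VCF record into its fixed fields plus per-sample data."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, filt, info = fields[:8]
    record = {
        "chrom": chrom,
        "pos": int(pos),
        "id": vid,
        "ref": ref,
        "alt": alt.split(","),               # ALT may list several alleles
        "qual": None if qual == "." else float(qual),
        "filter": filt,
        # INFO is a ;-separated list of key=value pairs; bare keys are flags
        "info": dict(kv.split("=", 1) if "=" in kv else (kv, True)
                     for kv in info.split(";")),
    }
    if len(fields) > 9:                      # FORMAT column + one column per sample
        keys = fields[8].split(":")
        record["samples"] = [dict(zip(keys, s.split(":"))) for s in fields[9:]]
    return record

rec = parse_vcf_line("20\t14370\trs6054257\tG\tA\t29\tPASS\t"
                     "NS=3;DP=14;AF=0.5;DB;H2\tGT:GQ:DP:HQ\t0|0:48:1:51,51")
```

Applied to the first data line of the slide, `rec["pos"]` is `14370` and the single sample's genotype is `rec["samples"][0]["GT"]`, i.e. `0|0`.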
  10. Let’s Try HBase
  11. HBase Layout [diagram: rows Rowkey1 and Rowkey2 in ColumnFamily1, each holding columns C1, C2, …, Cn with values V1, V2, …, Vn]
  12. Tall Narrow (rowkey = ind:chr:pos; ColumnFamily V)
      rowkey                  id      ref  snp
      1a2b3c:chr1:1001 ->     rs1243  A    true
      1a2b3c:chr1:1000 ->     rs321   C    true
      456def:chr1:1000 ->     rs1243  A    true
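In the tall/narrow design each variant call is its own row under an `individual:chromosome:position` key. Because HBase stores rows in lexicographic byte order, the position component needs fixed-width zero padding or `chr1:1000` would sort before `chr1:999`. A minimal sketch, with a plain sorted dict standing in for an HBase table (the rowkey format and sample values are taken from the slide; the helper names are hypothetical):

```python
# Tall/narrow rowkey scheme: one row per (individual, chromosome, position).
def rowkey(individual, chrom, pos, width=10):
    # Zero-pad the position so lexicographic key order matches numeric order.
    return f"{individual}:{chrom}:{pos:0{width}d}"

table = {}  # a sorted dict stands in for an HBase table here
for ind, chrom, pos, ref, snp_id in [
    ("1a2b3c", "chr1", 1000, "C", "rs321"),
    ("1a2b3c", "chr1", 1001, "A", "rs1243"),
    ("456def", "chr1", 1000, "A", "rs1243"),
]:
    table[rowkey(ind, chrom, pos)] = {"V:id": snp_id, "V:ref": ref, "V:snp": "true"}

# Reading one individual's chromosome is a prefix (range) scan over sorted keys.
prefix = "1a2b3c:chr1:"
scan = [k for k in sorted(table) if k.startswith(prefix)]
```

The prefix scan returns exactly the two `1a2b3c` rows, in position order, which is the access pattern this layout optimizes for.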
  13. Matrix Report Use Case
      Pos       VCF1  VCF2  VCF3  VCF4
      chr1:100  A     A     .     .
      chr1:101  .     C     C     C
      chr1:102  .     .     .     A
      chr1:103  T     T     .     T
      chr1:104  .     .     .     .
  14. First Try [diagram: one mapper per (individual, region) emits intermediate results to reducers keyed by chrom:pos (chr1:1000, chr1:1001, …, chrN:M)]
  15. Mind the Gaps [diagram: coverage runs with boundaries at 8, 12, and 24]
  16. Check For Gaps [diagram: one mapper per (individual, chromosome) emits intermediate results to reducers keyed by individual:chromosome (abc123:chr1, abc124:chr1, …, n:m); reducers check gaps against a coverage table]
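The point of the gap check: a position with no variant call only means "same as the reference" if sequencing actually covered that position; otherwise the genotype is simply unknown. A minimal sketch of that lookup, assuming a hypothetical coverage table of `(start, stop)` runs per individual (the interval boundaries echo the numbers on the previous slide):

```python
import bisect

# Hypothetical coverage table: individual -> sorted (start, stop) covered runs.
coverage = {
    "1a2b3c": [(1, 8), (12, 24)],
}

def is_covered(individual, pos):
    """Binary-search the sorted runs for one containing pos."""
    runs = coverage.get(individual, [])
    i = bisect.bisect_right(runs, (pos, float("inf"))) - 1
    return i >= 0 and runs[i][0] <= pos <= runs[i][1]

def call_at(individual, pos, variants):
    """Resolve a position to an allele, '.' (matches reference), or '?' (no coverage)."""
    if pos in variants.get(individual, {}):
        return variants[individual][pos]
    return "." if is_covered(individual, pos) else "?"

calls = {"1a2b3c": {5: "T"}}  # hypothetical variant calls
```

With this data, position 7 resolves to `.` (covered, no variant) while position 10 resolves to `?` (falls in the gap between the runs), which is exactly the distinction the reducers enforce against the coverage table.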
  17. This Takes Some Time [chart: Matrix Report running time, in minutes (0 to 35), vs. number of VCFs (1 to 11)]
  18. Flat Wide (rowkey = genome:chr:pos)
      rowkey             123:alt  123:qual  456:alt  456:qual
      b73:chr1:1000 ->   A        30        C        50
      b73:chr1:1001 ->   T        35        T        45
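In the flat/wide design there is one row per reference position, with a pair of column qualifiers per sample (`123:alt`, `123:qual`, and so on). A row of the matrix report then comes from a single HBase row read instead of joining many tall scans. A sketch under the same stand-in-dict assumption as before (rowkeys and cell values are from the slide; `matrix_row` is a hypothetical helper):

```python
# Flat/wide layout: one row per position, one column qualifier per sample field.
def rowkey(genome, chrom, pos, width=10):
    return f"{genome}:{chrom}:{pos:0{width}d}"

table = {
    rowkey("b73", "chr1", 1000): {"123:alt": "A", "123:qual": "30",
                                  "456:alt": "C", "456:qual": "50"},
    rowkey("b73", "chr1", 1001): {"123:alt": "T", "123:qual": "35",
                                  "456:alt": "T", "456:qual": "45"},
}

def matrix_row(genome, chrom, pos, samples):
    """One matrix-report row: the alt allele per sample, '.' where absent."""
    row = table.get(rowkey(genome, chrom, pos), {})
    return [row.get(f"{s}:alt", ".") for s in samples]
```

For example, `matrix_row("b73", "chr1", 1000, ["123", "456"])` yields one report row from one table row; the per-position join work moves from query time into the schema.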
  19. Improved Workflow [diagram: mappers per position range (chr1:100, …, chrN:M) join on positions and emit intermediate results to reducers (123abc:chr1, 456def, …)]
  20. Feature Search: Find genomic features like promoter regions, UTRs, exons, genes, etc.
  21. Original Indexing Workflow [diagram: table mappers read the HBase feature table; Solr index reducers write out index zips]
  22. Solr Pulls In Updates [diagram: a cron job pulls index updates from HDFS into the Solr server]
  23. SolrCloud
  24. Cloudera Search [diagram: the HBase MapReduce Indexer Tool reads the HBase feature table and a Morphline file, writing indexes to HDFS]
  25. Morphline File
      {
        extractAvroPaths {
          flatten : true
          paths : {
            feature_name : /featureName
            feature_type : /type
            start : /start
            stop : /stop
          }
        }
      }
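Conceptually, `extractAvroPaths` walks each configured path into an Avro record and emits a flat field for the Solr document. A rough Python analogue of that mapping, using the four paths from the slide (the sample record and its values are hypothetical, not from the talk):

```python
# Rough analogue of the morphline's extractAvroPaths command: map each
# configured path into a nested record onto a flat Solr-style field.
paths = {
    "feature_name": "/featureName",
    "feature_type": "/type",
    "start": "/start",
    "stop": "/stop",
}

def extract(record, paths):
    out = {}
    for field, path in paths.items():
        value = record
        for part in path.strip("/").split("/"):
            value = value[part]   # descend one path component at a time
        out[field] = value
    return out

doc = extract({"featureName": "zmm16", "type": "gene", "start": 100, "stop": 900},
              paths)
```

The resulting flat dict is what gets handed to Solr as an indexable document, one field per configured path.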
  26. Lessons Learned
      • Use HBase like a HashMap, not like a relational database
      • Denormalize HBase schemas; no foreign keys
      • To scale Solr indexes past one server, use SolrCloud/Cloudera Search; don’t rebuild it on your own
  27. Questions
