Managing Genomes At Scale: What We Learned - StampedeCon 2014


At StampedeCon 2014, Rob Long (Monsanto) presented "Managing Genomes At Scale: What We Learned."

Monsanto generates large amounts of genomic sequence data every year. Agronomists and other scientists use this data as input to predictive analytics that aid breeding and the discovery of new traits such as disease or drought resistance. To enable the broadest possible use of this valuable data, scientists would like to query genomic data by species, chromosome, position, and myriad other categories. We present our solutions to these problems, as realized on top of HBase at Monsanto. We will discuss our particular learnings around: flat/wide vs. tall/narrow HBase schema design, preprocessing and caching windows of data for use in web-based visualizations, approaches to complex multi-join queries across deep data sets, and distributed indexing via SolrCloud.


  • Good Afternoon
    Rob Long
    Today, talk about big data at monsanto
    And some of the lessons we’ve learned.

    Started > year ago
    Little knowledge
    Something with plants
    But then, so cool
  • Been doing big data since 2009
    It wasn’t so big then, but…

    Big data not size, but handling
    Multiple machines
    Denormalized NoSQL

    Also variety,
    unstructured and semistructured

  • Monsanto not prev. assoc. w/ big data
    Sell seeds
    Recipes for soil air water into food
    We need to understand:
    Genomics, agronomy, breeding, chemistry,
    Data feedback for breeding
    Year over year improvement
  • US average corn yields 1863 - 2002
    Axes: x- year, y-bushels per acre ( 0 – 160 )
    Up to 1930’s, yield was 30 bushels
    Increasing yields, due to breeding, treatments, automation, weather prediction, etc
    Now > 140 bushels/acre
    5x increase
    Goal of 300 bushels/acre by 2030
  • Knowledge of the molecular machinery, answers “how?”
    High school biology, applied
    Squiggles are important – called organelles
    Smaller scale
    Genes as recipes
  • How many people familiar w/ the Human Genome Project?
    Some done on the east campus

    Actual books on shelves
    Map for a genome
    Shred a book
    Need a guide
    A reference, just like the library.
    Many references, one per species
  • Store individual genomes
    Warehouse / data mart
    Archipelago of sources
    Bureaucratic friction
    Solution: GenRe
  • <point out the parts>
    App servers are WebLogic,
    Using Cloudera, CDH4.6
    hadoop cluster has 30 data nodes, 3 edge nodes
    Edge nodes host services
    Compute farm
    - blast to find sequences
    Graph db for lineage
  • Header – standard/custom data
    1 variant per line
    Variant ind. != ref
    Address, chrom/pos
    Chrom = file, pos = offset
    Count from 1

    1 line per pos
    Maize 2.3B bases
    1/1000 variants
    Keep variants only
    1/1000th storage requirement
    Pay a price, explain later

    Three main access patterns.

    Dump ind.

    Matrix, sites passing filters
    Regions or whole genome
    Aligned to same ref

    Flanking regions

    Not real-time
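The VCF notes above (one variant per line, addressed by chromosome and 1-based position, with only variant sites stored) can be sketched with a minimal parser. This is a toy illustration, not the project's actual code; the `parse_vcf_line` helper is hypothetical.

```python
def parse_vcf_line(line):
    """Parse one VCF data line into (chrom, pos, ref, alt).

    Positions in VCF are 1-based; the chromosome acts like a file
    name and the position like an offset within it, as in the notes.
    """
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _vid, ref, alt = fields[0], int(fields[1]), fields[2], fields[3], fields[4]
    return chrom, pos, ref, alt

# Only variant sites are stored: roughly 1 position in 1000 differs
# from the reference, which cuts storage to ~1/1000th of a full dump.
line = "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14"
chrom, pos, ref, alt = parse_vcf_line(line)
```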
  • How many familiar w/ HBase?
    Watch “Introduction to NoSQL” by Martin Fowler on YouTube
    HBase, big table
    Distributed persistent hashmap
    Keyspace partitioned into regions
    Region servers host the regions
    CAP, CP
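The "distributed persistent hashmap" picture above can be sketched in a few lines: HBase keeps row keys sorted and partitions the keyspace into regions, each hosted by a region server. This toy routing function (the region boundaries are hypothetical, and this is not HBase's API) shows how a sorted keyspace sends each key to exactly one region:

```python
from bisect import bisect_right

# Hypothetical region start keys: region i covers [start_i, start_{i+1}).
# HBase keeps rows sorted by key, so a range of keys lives in one region.
region_starts = ["", "chr1:5000", "chr2:0000"]

def region_for(rowkey):
    """Route a row key to the region whose key range contains it."""
    return bisect_right(region_starts, rowkey) - 1

region_for("chr1:1000")  # falls in the first region
```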
  • Rows, cf
    Column families, similar datas
    Sparse columns, 1 c1 vs 2 c1
  • Our approach
    Tall/narrow schema
    Fixed # cols
    Many rows
    1 row / variant / VCF
    Id:chr:pos, groups indiv. data
    Good for indiv. and flanking (explain)
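The tall/narrow key above can be sketched as follows. Zero-padding the position (an assumption about the key encoding, not stated on the slide) makes lexicographic key order match genomic order, so a scan over one individual's chromosome returns variants in position order, which suits the individual-dump and flanking-region access patterns:

```python
def tall_narrow_rowkey(individual, chrom, pos, width=9):
    """Build an individual:chr:pos row key (tall/narrow schema).

    Zero-padding pos keeps lexicographic key order equal to numeric
    position order, so scans walk the chromosome in genomic order.
    """
    return f"{individual}:{chrom}:{pos:0{width}d}"

keys = [tall_narrow_rowkey("1a2b3c", "chr1", p) for p in (1001, 1000, 999)]
sorted(keys)  # scan order follows genomic position
```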
  • Review matrix use case
    Filter: Pos w/ => 1 variant
    More joins in tall/narrow
    M * N scanners, worst case
  • TableMapper, region/individual
    Join this on chrom:pos, get individuals at a pos
    Now we can filter
    Emit if needed
    Intermediate result, map files
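The first-try join above can be sketched in miniature: mappers over each individual/region emit calls keyed by chrom:pos, the shuffle groups all individuals at a position, and the reduce applies the filter. This stands in for the TableMapper/reduce pipeline with plain dicts; the filter here (at least one stored variant at the position) is a simplification.

```python
from collections import defaultdict

def matrix_join(per_individual_calls):
    """Toy version of the first-try MapReduce join: group calls by
    chrom:pos across individuals (the shuffle), then emit positions
    passing the filter (>= 1 variant present at the position)."""
    by_pos = defaultdict(dict)                       # shuffle on chrom:pos
    for individual, calls in per_individual_calls.items():
        for (chrom, pos), allele in calls.items():
            by_pos[(chrom, pos)][individual] = allele
    # reduce: since only variants are stored, any non-empty row passes
    return {pos: row for pos, row in by_pos.items() if row}

calls = {
    "vcf1": {("chr1", 100): "A", ("chr1", 103): "T"},
    "vcf2": {("chr1", 100): "A", ("chr1", 101): "C"},
}
matrix = matrix_join(calls)
```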
  • Blue: ref
    Green: data
    vert.boxes, 8 variant, 12 ref, 24 no data
    24 complicates, no row
    Either ref or no data?
    Solved: store gaps
  • Group by individual/chrom
    Gaps addressed this way
    Intersect gaps
    w/gaps, back to chr:pos orientation
    Dump results to n files
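The gap logic above can be sketched directly: because only variants are stored, an absent row is ambiguous between "matches the reference" and "no data", and the per-individual gap (coverage) table resolves it. A minimal sketch, with hypothetical helper names:

```python
def in_gap(gaps, pos):
    """True if pos falls inside any no-coverage interval (inclusive)."""
    return any(start <= pos <= end for start, end in gaps)

def resolve(variants, gaps, pos, ref_allele):
    """Disambiguate a position: stored variant, reference, or no data.

    Only variants are stored, so a missing row means either 'same as
    the reference' or 'no data'; the gap table tells which.
    """
    if pos in variants:
        return variants[pos]      # stored variant call
    if in_gap(gaps, pos):
        return "."                # inside a gap: no data here
    return ref_allele             # covered but absent: matches the reference

variants = {100: "A"}
gaps = [(200, 300)]
resolve(variants, gaps, 100, "G"), resolve(variants, gaps, 250, "G"), resolve(variants, gaps, 150, "G")
```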
  • X: num individuals
    Y: minutes
    Exponential growth
    Too many joins
  • Swap individual with genome build
    One row per pos per genome
    One get
    HBase used correctly
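The flat/wide swap above can be sketched with a single lookup: keying rows by genome:chr:pos and giving each individual its own column qualifiers (e.g. `123:alt`, `123:qual`, as on the Flat Wide slide) turns the matrix row at a position into one Get instead of M scans. Here a plain dict stands in for the HBase table; `flat_wide_row` is a hypothetical helper, not HBase's API:

```python
def flat_wide_row(store, genome, chrom, pos):
    """Fetch every individual's call at one position with one lookup.

    Flat/wide schema: row key genome:chr:pos, one set of column
    qualifiers (e.g. '123:alt') per individual -- the matrix row at a
    position is a single Get, which is using HBase like a hashmap.
    """
    return store.get(f"{genome}:{chrom}:{pos:09d}", {})

# In-memory stand-in for the HBase table
store = {
    "b73:chr1:000001000": {"123:alt": "A", "123:qual": "30",
                           "456:alt": "C", "456:qual": "50"},
}
row = flat_wide_row(store, "b73", "chr1", 1000)
```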
  • Single pass for M individ.
  • Variants have a context
    Location, in a gene, known effects
    Search on any field
    Jbrowse, open source visualization
  • An established pattern
    Features dumped from hbase
    Embedded solr in reducers
    Zips in hdfs
  • Solr server runs cron
    Pulls zipped indexes
    Incremental updates skip MR phase

    Brittle, lots of code to maintain
    Not distributed, a single solr server
    Must move data off HDFS
  • Ramping up to billions of features
    The old solr server was choking.
    Enter, solrcloud.
    Indexes in HDFS,
    Coordinates through zookeeper
    Collection concept
    Same REST interface
  • Cloudera search
    Lily HBase Indexer service (real-time)
    MR indexer
    Morphline file defines mappings between inputs and solr docs
    Foreign key problem
    Add a step
  • Mapping from avro to solr
    Can have multi-level records
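The avro-to-Solr mapping above can be sketched in plain Python. The slide's morphline file uses an `extractAvroPaths` step with paths like `/featureName`; this toy function (not the real Kite Morphlines library) walks such paths through a nested record and emits a flat Solr-style document:

```python
def extract_paths(record, paths):
    """Mimic a morphline extractAvroPaths step: follow '/a/b' style
    paths through a (possibly multi-level) record and build a flat doc."""
    doc = {}
    for field, path in paths.items():
        node = record
        for part in path.strip("/").split("/"):
            node = node[part]    # descend one level per path segment
        doc[field] = node
    return doc

# Mirrors the mapping shown in the Morphline File slide
record = {"featureName": "promoter1", "type": "promoter", "start": 100, "stop": 250}
paths = {"feature_name": "/featureName", "feature_type": "/type",
         "start": "/start", "stop": "/stop"}
doc = extract_paths(record, paths)
```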
  • Hbase like hashmap
    Scale with solrcloud, don’t rebuild
  • Credit to:
    GenRe team
    Amandeep Khurana
  • Managing Genomes At Scale: What We Learned - StampedeCon 2014

    1. Managing Genomes at Scale: What We Learned Rob Long Monsanto: Big Data Engineer Twitter: @plantimals 5-29-14
    2. Big Data At Monsanto What do we mean when we say “big data”?
    3. Monsanto is a (Big) Data Company
    4. A-Maizing
    5. Molecular Machinery
    6. A Reference
    7. Genomic Data Warehouse
    8. Architecture Application Layer Compute Farm Query Genomic Data Hadoop Cluster Genomic Data HBase Hadoop Map Reduce Engine Data Services Edge Node Edge Node … Unstructured Query Solr Indexes Lineage Graph DB
    9. VCF – Variant Call Format
       ##fileformat=VCFv4.1
       ##fileDate=20090805
       ##source=myImputationProgramV3.1
       ##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
       ##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
       ##phasing=partial
       ##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
       ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
       ##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
       ##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
       ##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
       ##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
       ##FILTER=<ID=q10,Description="Quality below 10">
       ##FILTER=<ID=s50,Description="Less than 50% of samples have data">
       ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
       ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
       ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
       ##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
       #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001
       20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51
       20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50
       20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27
       20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60
       20 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4
       20 14370 rs6054257 G A 29 PASS …
    10. Let’s Try HBase
    11. HBase Layout C1 C2 Cn V1 Vn V2 Vn Rowkey1 -> Rowkey2 -> ColumnFamily1
    12. Tall Narrow
        rowkey (Ind:chr:pos)    ColumnFamily:V
                                Id      Ref  SNP
        1a2b3c:chr1:1001 ->     rs1243  A    true
        1a2b3c:chr1:1000 ->     rs321   C    true
        456def:chr1:1000 ->     rs1243  A    true
    13. Matrix Report Use Case
        Pos       VCF1  VCF2  VCF3  VCF4
        chr1:100  A     A     .     .
        chr1:101  .     C     C     C
        chr1:102  .     .     .     A
        chr1:103  T     T     .     T
        chr1:104  .     .     .     .
    14. First Try Mapper Individual 1 Region 1 Mapper Individual n Region m … Reduce chr1:1000 Chrom:Pos Reduce chr1:1001 Reduce chrN:M … intermediate result intermediate result intermediate result
    15. Mind the gaps 8 12 24
    16. Check For Gaps Mapper Individual 1 chr 1 Mapper Individual 2 chr 1 Mapper Individual n chr m … Reduce Abc123:chr1 Reduce abc124:chr1 Reduce n:m … intermediate results Check Gaps against coverage table
    17. This Takes Some Time [chart: Matrix Report Running Time; x: # of VCFs (1–11), y: time in minutes (0–35)]
    18. Flat Wide
        rowkey (genome:chr:pos)  123:alt  123:qual  456:alt  456:qual
        b73:chr1:1000 ->         A        30        C        50
        b73:chr1:1001 ->         T        35        T        45
    19. Improved workflow Mapper chr1:100 Mapper chr1:100 Mapper chrN:M … Reduce 123abc:chr1 Reduce 456def … intermediate result intermediate result join on positions
    20. Feature Search Find genomic features like promoter regions, UTR, exons, genes, etc.
    21. Original Indexing Workflow HBase Feature Table Table Mapper Solr Index Reducer Solr Index Reducer Index zip Index zip Table Mapper Table Mapper
    22. Solr Pulls In Updates HDFS Solr Server cron
    23. SolrCloud
    24. Cloudera Search HBase Feature Table HDFS HBase MapReduce Indexer Tool Morphline file
    25. Morphline File
        {
          extractAvroPaths {
            flatten : true
            paths : {
              feature_name : /featureName
              feature_type : /type
              start : /start
              stop : /stop
            }
          }
        }
    26. Lessons Learned
        • Use HBase like a HashMap, not like a relational database
        • Denormalize HBase schemas, no foreign keys
        • To scale Solr indexes past one server, use SolrCloud/Cloudera Search, don’t rebuild it on your own
    27. Questions