Managing Genomes At Scale: What We Learned - StampedeCon 2014

At StampedeCon 2014, Rob Long (Monsanto) presented "Managing Genomes At Scale: What We Learned."

Monsanto generates large amounts of genomic sequence data every year. Agronomists and other scientists use this data as input for predictive analytics to aid breeding and the discovery of new traits such as disease or drought resistance. To enable the broadest possible use of this valuable data, scientists want to query genomic data by species, chromosome, position, and many other categories. We present our solutions to these problems, as realized on top of HBase at Monsanto. We discuss what we learned about flat/wide vs. tall/narrow HBase schema design, preprocessing and caching windows of data for web-based visualizations, approaches to complex multi-join queries across deep data sets, and distributed indexing via SolrCloud.

Speaker Notes

  • Good afternoon. Rob Long. Today I'll talk about big data at Monsanto and some of the lessons we've learned. I started just over a year ago knowing little about the company beyond "something with plants," but it turned out to be fascinating.
  • We've been doing big data since 2009 (it wasn't so big back then). Big data is less about size than about how you handle it: multiple machines, denormalized NoSQL stores, MapReduce. It's also about variety: unstructured and semi-structured data, and integrating it all.
  • Monsanto isn't usually associated with big data. We sell seeds: recipes for turning soil, air, and water into food. That requires understanding genomics, agronomy, breeding, and chemistry, and feeding data back into breeding for year-over-year improvement.
  • US average corn yields, 1863-2002 (x-axis: year; y-axis: bushels per acre, 0-160). Up to the 1930s yields sat around 30 bushels per acre. Breeding, treatments, automation, weather prediction, and other advances have pushed them above 140 bushels per acre, roughly a 5x increase. The goal is 300 bushels per acre by 2030. How?
  • Knowledge of the molecular machinery answers that "how?": high-school biology, applied. The squiggles in the cell diagram are organelles; at a smaller scale still are the ACGTs. Genes are recipes.
  • How many people are familiar with the Human Genome Project? Some of that work was done on the east campus here. There are actual printed volumes on shelves: a map for a genome. If you shred a book, you need a guide to put it back together, a reference, just like the library's. We keep many references, one per species.
  • We need to store individual genomes in a warehouse / data mart. Today they live in an archipelago of sources with a lot of bureaucratic friction. Our solution is GenRe.
  • The app servers run WebLogic. We run Cloudera CDH 4.6; the Hadoop cluster has 30 data nodes and 3 edge nodes, and the edge nodes host services. A compute farm runs BLAST to find sequences, and a graph database stores lineage.
  • A VCF header carries standard and custom metadata, then one variant per line: a position where the individual differs from the reference. The address is chromosome and position (the chromosome is like the file, the position is the offset, counting from 1), one line per position. Maize has about 2.3B bases and roughly 1 in 1,000 positions is a variant, so keeping only variants cuts storage to about 1/1000th; there is a price for that, explained later. Three main access patterns: dump an individual; build a matrix of sites passing filters, over regions or the whole genome, for individuals aligned to the same reference; and pull flanking regions. None of this needs to be real time.
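    As a rough illustration of the addressing described above, here is a minimal sketch (not from the talk) of pulling the chrom:pos address and alleles out of a single VCF data line; it assumes a well-formed VCF 4.1 line like the ones shown later in the transcript.

        // Minimal sketch: parse one tab-separated VCF data line into its address
        // (CHROM, POS) and alleles (REF, ALT). Assumes a well-formed VCF 4.1 line.
        public final class VcfLine {
            public final String chrom;  // the "file" in the library analogy
            public final long pos;      // the offset within the chromosome, counting from 1
            public final String ref;
            public final String alt;    // may hold multiple alleles, e.g. "G,T"

            public VcfLine(String line) {
                String[] fields = line.split("\t");
                this.chrom = fields[0];
                this.pos = Long.parseLong(fields[1]);
                this.ref = fields[3];
                this.alt = fields[4];
            }
        }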
  • How many of you are familiar with HBase? If you're not, watch "Introduction to NoSQL" by Martin Fowler on YouTube. HBase follows the Bigtable design: a distributed, persistent hashmap. The keyspace is partitioned into regions, and region servers host the regions. In CAP terms, HBase is CP.
  • Data is organized into rows and column families. Rowkeys identify rows; column families group similar data; columns are sparse, so row 1's c1 and row 2's c1 need not both be populated.
  • Our first approach was a tall/narrow schema: a fixed number of columns and many rows, one row per variant per VCF. The rowkey ind:chr:pos groups an individual's data together, which works well for dumping an individual and for flanking regions. But…
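    A minimal sketch of what one tall/narrow write could look like with the CDH4-era HBase client API; the table name, column family, and qualifiers below are assumptions for illustration, not the actual GenRe schema.

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.hbase.HBaseConfiguration;
        import org.apache.hadoop.hbase.client.HTable;
        import org.apache.hadoop.hbase.client.Put;
        import org.apache.hadoop.hbase.util.Bytes;

        // Tall/narrow: one row per variant per VCF, keyed by individual:chromosome:position.
        public class TallNarrowWrite {
            public static void main(String[] args) throws Exception {
                Configuration conf = HBaseConfiguration.create();
                HTable table = new HTable(conf, "variants");            // assumed table name

                String rowkey = "1a2b3c:chr1:1000";                     // ind:chr:pos
                Put put = new Put(Bytes.toBytes(rowkey));
                put.add(Bytes.toBytes("V"), Bytes.toBytes("id"), Bytes.toBytes("rs1243"));
                put.add(Bytes.toBytes("V"), Bytes.toBytes("ref"), Bytes.toBytes("A"));
                put.add(Bytes.toBytes("V"), Bytes.toBytes("snp"), Bytes.toBytes("true"));
                table.put(put);
                table.close();
            }
        }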
  • Reviewing the matrix use case: filter to positions with at least one variant. The tall/narrow layout means more joins; in the worst case, M * N scanners.
  • We used a TableMapper per region per individual and joined on chrom:pos to gather the individuals present at each position. At that point we can apply the filter and emit only what's needed, writing intermediate results as map files.
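    A sketch of the mapper side of that join, assuming the tall/narrow rowkey above: each TableMapper re-keys its rows by chrom:pos so a reducer sees every individual present at a position and can apply the filter. The class, column family, and qualifier names are illustrative.

        import java.io.IOException;
        import org.apache.hadoop.hbase.client.Result;
        import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
        import org.apache.hadoop.hbase.mapreduce.TableMapper;
        import org.apache.hadoop.hbase.util.Bytes;
        import org.apache.hadoop.io.Text;

        // Re-key tall/narrow rows (ind:chr:pos) by chr:pos so the reducer can gather
        // every individual's data at a position and decide whether it passes the filter.
        public class ChromPosMapper extends TableMapper<Text, Text> {
            @Override
            protected void map(ImmutableBytesWritable rowkey, Result row, Context context)
                    throws IOException, InterruptedException {
                String[] parts = Bytes.toString(rowkey.get()).split(":"); // ind, chr, pos
                String individual = parts[0];
                String chromPos = parts[1] + ":" + parts[2];
                String variantId = Bytes.toString(
                        row.getValue(Bytes.toBytes("V"), Bytes.toBytes("id")));
                context.write(new Text(chromPos), new Text(individual + "=" + variantId));
            }
        }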
  • In the diagram, blue is the reference and green is the data. Of the vertical boxes, position 8 is a variant, 12 is reference, and 24 has no data. Position 24 is the complication: with no row stored, is it reference or no data? We solved this by storing the gaps.
  • We group by individual and chromosome and address the gaps that way: intersect the gap intervals with the positions, then, with gaps applied, pivot back to a chr:pos orientation and dump the results to n files.
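    A toy sketch of the gap check, assuming gaps are stored as closed intervals per individual and chromosome: a position with no variant row is reported as reference unless it falls inside a stored gap.

        import java.util.Map;
        import java.util.TreeMap;

        // Toy gap check for one individual and chromosome: positions with no variant row
        // count as reference unless they fall inside a stored coverage gap (no data).
        public class GapCheck {
            private final TreeMap<Long, Long> gaps = new TreeMap<>(); // gap start -> gap end (inclusive)

            public void addGap(long start, long end) {
                gaps.put(start, end);
            }

            public boolean isNoData(long pos) {
                Map.Entry<Long, Long> candidate = gaps.floorEntry(pos);
                return candidate != null && pos <= candidate.getValue();
            }
        }

    With variant rows, gap intervals, and everything else treated as reference, the reducer can fill in a complete column for each individual.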
  • The chart plots the number of individuals on the x-axis against minutes on the y-axis. Growth is exponential: too many joins.
  • So we swapped the individual for the genome build in the rowkey: one row per position per genome build. Now a single get returns everything at a position; this is HBase used correctly.
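    Under the flat/wide layout (one row per genome build and position, one set of columns per individual) the whole matrix row comes back from a single get; a minimal sketch with assumed table, family, and qualifier names.

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.hbase.HBaseConfiguration;
        import org.apache.hadoop.hbase.client.Get;
        import org.apache.hadoop.hbase.client.HTable;
        import org.apache.hadoop.hbase.client.Result;
        import org.apache.hadoop.hbase.util.Bytes;

        // Flat/wide: one row per genome build and position; each individual's call lives
        // in its own columns, so a single Get returns every individual at that position.
        public class FlatWideRead {
            public static void main(String[] args) throws Exception {
                Configuration conf = HBaseConfiguration.create();
                HTable table = new HTable(conf, "variants_wide");       // assumed table name

                Get get = new Get(Bytes.toBytes("b73:chr1:1000"));      // genome:chr:pos
                Result row = table.get(get);
                byte[] alt = row.getValue(Bytes.toBytes("d"), Bytes.toBytes("123:alt"));
                System.out.println("individual 123 alt allele: " + Bytes.toString(alt));
                table.close();
            }
        }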
  • A single pass now serves M individuals.
  • Variants have a context: a location, whether they fall in a gene, known effects. We want to search on any field, and we use JBrowse, an open-source visualization, on top.
  • Indexing followed an established pattern: features are dumped from HBase, Solr runs embedded in the reducers, and the resulting index zips land in HDFS.
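    A rough sketch of the reducer-side indexing step in that original workflow: each feature becomes a SolrJ SolrInputDocument handed to an embedded Solr core (core setup omitted). Field names are assumptions that mirror the Morphline mapping shown later.

        import org.apache.solr.client.solrj.SolrServer;
        import org.apache.solr.common.SolrInputDocument;

        // Turn one genomic feature into a Solr document and hand it to an embedded
        // Solr core (e.g. an EmbeddedSolrServer built once per reducer; setup omitted).
        public class FeatureIndexer {
            private final SolrServer solr;

            public FeatureIndexer(SolrServer solr) {
                this.solr = solr;
            }

            public void index(String name, String type, long start, long stop) throws Exception {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("feature_name", name);  // field names assumed, matching the
                doc.addField("feature_type", type);  // Morphline example later in the deck
                doc.addField("start", start);
                doc.addField("stop", stop);
                solr.add(doc);
            }
        }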
  • A Solr server runs a cron job that pulls the zipped indexes and merges them; incremental updates can skip the MapReduce phase. Problems: it's brittle, there's a lot of code to maintain, it isn't distributed (a single Solr server), and data has to be moved off HDFS.
  • As we ramped up to billions of features, the old Solr server was choking. Enter SolrCloud: indexes live in HDFS, coordination goes through ZooKeeper, there's a collection concept, and it keeps the same REST interface.
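    Because SolrCloud keeps the same SolrJ/REST interface, query code barely changes; a minimal SolrJ 4.x sketch, where the ZooKeeper quorum, collection name, and field names are placeholders.

        import org.apache.solr.client.solrj.SolrQuery;
        import org.apache.solr.client.solrj.impl.CloudSolrServer;
        import org.apache.solr.client.solrj.response.QueryResponse;

        // Query a SolrCloud collection through ZooKeeper; the quorum address,
        // collection name, and fields below are placeholders.
        public class FeatureSearch {
            public static void main(String[] args) throws Exception {
                CloudSolrServer solr = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181/solr");
                solr.setDefaultCollection("features");

                SolrQuery query = new SolrQuery("feature_type:gene");
                query.setRows(10);
                QueryResponse response = solr.query(query);
                System.out.println("matches: " + response.getResults().getNumFound());
                solr.shutdown();
            }
        }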
  • Cloudera Search gives us the Lily HBase Indexer service for near real-time indexing and an MR indexer for batch; a Morphline file defines the mappings between inputs and Solr documents. There is a foreign-key problem, which we handle by adding a step.
  • The mapping goes from Avro to Solr and can handle multi-level records.
  • Use HBase like a hashmap, denormalize, and scale with SolrCloud rather than rebuilding it yourself.
  • Credit to Jeff, the GenRe team, and Amandeep Khurana.

Managing Genomes At Scale: What We Learned - StampedeCon 2014: Presentation Transcript

  • Managing Genomes at Scale: What We Learned. Rob Long, Big Data Engineer, Monsanto. Twitter: @plantimals. 5-29-14
  • Big Data At Monsanto What do we mean when we say “big data”?
  • Monsanto is a (Big) Data Company
  • A-Maizing
  • Molecular Machinery
  • A Reference
  • Genomic Data Warehouse
  • Architecture (diagram): an application layer and compute farm query genomic data services hosted on edge nodes; the Hadoop cluster stores genomic data in HBase with the Hadoop MapReduce engine; Solr indexes serve unstructured queries, and a graph DB holds lineage.
  • VCF – Variant Call Format
    ##fileformat=VCFv4.1
    ##fileDate=20090805
    ##source=myImputationProgramV3.1
    ##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
    ##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
    ##phasing=partial
    ##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
    ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
    ##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
    ##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
    ##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
    ##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
    ##FILTER=<ID=q10,Description="Quality below 10">
    ##FILTER=<ID=s50,Description="Less than 50% of samples have data">
    ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
    ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
    ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
    ##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
    #CHROM  POS      ID         REF  ALT     QUAL  FILTER  INFO                               FORMAT       NA00001
    20      14370    rs6054257  G    A       29    PASS    NS=3;DP=14;AF=0.5;DB;H2            GT:GQ:DP:HQ  0|0:48:1:51,51
    20      17330    .          T    A       3     q10     NS=3;DP=11;AF=0.017                GT:GQ:DP:HQ  0|0:49:3:58,50
    20      1110696  rs6040355  A    G,T     67    PASS    NS=2;DP=10;AF=0.333,0.667;AA=T;DB  GT:GQ:DP:HQ  1|2:21:6:23,27
    20      1230237  .          T    .       47    PASS    NS=3;DP=13;AA=T                    GT:GQ:DP:HQ  0|0:54:7:56,60
    20      1234567  microsat1  GTC  G,GTCT  50    PASS    NS=3;DP=9;AA=G                     GT:GQ:DP     0/1:35:4
    20      14370    rs6054257  G    A       29    PASS    …
  • Let’s Try HBase
  • HBase Layout (diagram): rows addressed by rowkey (Rowkey1, Rowkey2, ...); within ColumnFamily1, sparse columns C1, C2, ..., Cn hold cell values V1, V2, ..., Vn.
  • Tall Narrow
                            ColumnFamily:V
    Ind:chr:pos             Id      Ref  SNP
    1a2b3c:chr1:1001 ->     rs1243  A    true
    1a2b3c:chr1:1000 ->     rs321   C    true
    456def:chr1:1000 ->     rs1243  A    true
  • Matrix Report Use Case
    Pos       VCF1  VCF2  VCF3  VCF4
    chr1:100  A     A     .     .
    chr1:101  .     C     C     C
    chr1:102  .     .     .     A
    chr1:103  T     T     .     T
    chr1:104  .     .     .     .
  • First Try (diagram): a mapper per individual per region (Individual 1 / Region 1 ... Individual n / Region m) feeds reducers keyed by chrom:pos (chr1:1000, chr1:1001, ..., chrN:M), each writing an intermediate result.
  • Mind the Gaps (diagram: position 8 is a variant, 12 is reference, 24 has no data)
  • Check For Gaps (diagram): mappers per individual per chromosome (Individual 1 / chr 1 ... Individual n / chr m) feed reducers keyed by individual and chromosome (abc123:chr1, abc124:chr1, ..., n:m); each intermediate result is checked for gaps against a coverage table.
  • This Takes Some Time: Matrix Report Running Time (x-axis: # of VCFs, 1-11; y-axis: time in minutes, 0-35)
  • Flat Wide
    genome:chr:pos      123:alt  123:qual  456:alt  456:qual
    b73:chr1:1000 ->    A        30        C        50
    b73:chr1:1001 ->    T        35        T        45
  • Improved Workflow (diagram): mappers per position range (chr1:100, ..., chrN:M) join on positions and feed reducers keyed by individual (123abc:chr1, 456def, ...), each writing an intermediate result.
  • Feature Search Find genomic features like promoter regions, UTR, exons, genes, etc.
  • Original Indexing Workflow (diagram): TableMappers read the HBase feature table; Solr index reducers build embedded indexes and write index zips.
  • Solr Pulls In Updates (diagram): a cron job on the Solr server pulls index zips from HDFS.
  • SolrCloud
  • Cloudera Search (diagram): the HBase MapReduce Indexer Tool, driven by a Morphline file, reads the HBase feature table and writes indexes to HDFS.
  • Morphline File
    {
      extractAvroPaths {
        flatten : true
        paths : {
          feature_name : /featureName
          feature_type : /type
          start : /start
          stop : /stop
        }
      }
    }
  • Lessons Learned
    • Use HBase like a HashMap, not like a relational database
    • Denormalize HBase schemas, no foreign keys
    • To scale Solr indexes past one server, use SolrCloud/Cloudera Search; don't rebuild it on your own
  • Questions