Managing Genomes at Scale:
What We Learned
Rob Long
Big Data Engineer, Monsanto
Twitter: @plantimals
5-29-14
Big Data At Monsanto
What do we mean when we say “big data”?
Monsanto is a (Big) Data Company
A-Maizing
Molecular Machinery
A Reference
Genomic Data Warehouse
Architecture
[Diagram: an application layer and a compute farm query genomic data through data services hosted on the Hadoop cluster's edge nodes; genomic data lives in HBase on the Hadoop MapReduce engine, with Solr indexes for unstructured query and a graph DB for lineage.]
VCF – Variant Call Format
##fileformat=VCFv4.1
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60
20 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4
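The records above can be pulled apart with a few string splits. A minimal sketch, assuming whitespace-delimited fields as rendered here (real VCFs are tab-delimited, and a production pipeline would use an established parser):

```python
# Minimal sketch: parse one VCF data line into a dict.
# Fields are split on whitespace to match the rendering above; real VCF
# files are tab-delimited.

VCF_COLS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]

def parse_vcf_line(line):
    fields = line.split()
    rec = dict(zip(VCF_COLS, fields[:8]))
    rec["POS"] = int(rec["POS"])
    # INFO is a semicolon-separated list of key=value pairs; bare keys
    # (like DB or H2) are boolean flags.
    info = {}
    for item in rec["INFO"].split(";"):
        key, _, value = item.partition("=")
        info[key] = value if value else True
    rec["INFO"] = info
    # Any remaining columns are FORMAT plus one entry per sample.
    if len(fields) > 9:
        keys = fields[8].split(":")
        rec["SAMPLES"] = [dict(zip(keys, s.split(":"))) for s in fields[9:]]
    return rec

rec = parse_vcf_line(
    "20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51"
)
```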
Let’s Try HBase
HBase Layout
[Diagram: ColumnFamily1 spans columns C1, C2, ..., Cn; Rowkey1 -> holds values V1 ... Vn while Rowkey2 -> holds V2 ... Vn, i.e. rows are sparse and need not fill the same columns.]
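The speaker notes describe HBase as a distributed, persistent, sorted hashmap. A toy model (not the HBase API) of the two operations the schemas that follow rely on, point gets and lexicographic prefix scans:

```python
# Illustrative sketch only: HBase as a sorted map from (rowkey, column)
# to value. This is a single-process stand-in, not the HBase client API.
import bisect

class ToySortedMap:
    def __init__(self):
        self._keys = []   # sorted (rowkey, column) pairs
        self._vals = {}

    def put(self, rowkey, column, value):
        key = (rowkey, column)
        if key not in self._vals:
            bisect.insort(self._keys, key)
        self._vals[key] = value

    def get(self, rowkey, column):
        return self._vals.get((rowkey, column))

    def scan(self, rowkey_prefix):
        """Yield (rowkey, column, value) for rowkeys starting with prefix."""
        i = bisect.bisect_left(self._keys, (rowkey_prefix, ""))
        while i < len(self._keys) and self._keys[i][0].startswith(rowkey_prefix):
            rk, col = self._keys[i]
            yield rk, col, self._vals[(rk, col)]
            i += 1

table = ToySortedMap()
table.put("1a2b3c:chr1:1000", "V:Id", "rs1243")
table.put("1a2b3c:chr1:1001", "V:Id", "rs321")
table.put("456def:chr1:1000", "V:Id", "rs1243")
```

Because rowkeys are kept in lexicographic order, everything for one individual sits in one contiguous key range, which is why rowkey design drives the schema choices on the next slides.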
Tall Narrow
                        ColumnFamily:V
Ind:chr:pos             Id      Ref  SNP
1a2b3c:chr1:1000 ->     rs1243  A    true
1a2b3c:chr1:1001 ->     rs321   C    true
456def:chr1:1000 ->     rs1243  A    true
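The tall-narrow rowkey can be built so that HBase's lexicographic ordering matches numeric position order, which makes flanking-region reads a simple start/stop scan. A sketch; the 9-digit zero padding is an assumption (chromosome lengths are well under 1e9 bases):

```python
# Sketch of the tall-narrow ind:chr:pos rowkey. Zero-padding the position
# makes byte order agree with numeric order, so a flanking region is one
# contiguous rowkey range. POS_WIDTH = 9 is an assumed padding width.
POS_WIDTH = 9

def tall_narrow_rowkey(individual, chrom, pos):
    return "%s:%s:%s" % (individual, chrom, str(pos).zfill(POS_WIDTH))

def flanking_scan_range(individual, chrom, pos, flank):
    """Start/stop rowkeys covering pos +/- flank for one individual."""
    start = tall_narrow_rowkey(individual, chrom, max(1, pos - flank))
    stop = tall_narrow_rowkey(individual, chrom, pos + flank + 1)
    return start, stop
```

Without the padding, "chr1:999" would sort after "chr1:1000" and range scans would return the wrong rows.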
Matrix Report Use Case
Pos       VCF1  VCF2  VCF3  VCF4
chr1:100  A     A     .     .
chr1:101  .     C     C     C
chr1:102  .     .     .     A
chr1:103  T     T     .     T
chr1:104  .     .     .     .
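The use case above, in miniature: given per-VCF maps of position to allele (only variants are stored), emit one row per position with at least one variant. A sketch, not the production MapReduce job:

```python
# Sketch of the matrix report. A "." means no stored variant, which, as
# the gap slides show, is ambiguous between "matches reference" and
# "no data".

def matrix_report(vcfs):
    """vcfs: dict of vcf_name -> {position: allele}. Returns list of rows."""
    names = sorted(vcfs)
    positions = sorted({p for calls in vcfs.values() for p in calls})
    rows = [["Pos"] + names]
    for pos in positions:
        rows.append([pos] + [vcfs[n].get(pos, ".") for n in names])
    return rows

report = matrix_report({
    "VCF1": {"chr1:100": "A", "chr1:103": "T"},
    "VCF2": {"chr1:100": "A", "chr1:101": "C", "chr1:103": "T"},
    "VCF3": {"chr1:101": "C"},
    "VCF4": {"chr1:101": "C", "chr1:102": "A", "chr1:103": "T"},
})
```

Note that chr1:104, which has no variant in any VCF, produces no row at all: the "filter to positions with >= 1 variant" behavior the speaker notes describe.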
First Try
[Diagram: one table mapper per individual per region (Individual 1 / Region 1 through Individual n / Region m) emits records keyed by Chrom:Pos; one reducer per position (chr1:1000, chr1:1001, ..., chrN:M) writes an intermediate result.]
Mind the gaps
[Diagram: a stretch of one genome with a variant at position 8, a reference match at 12, and no data at 24. Since only variants get rows, positions 12 and 24 both simply have no row and cannot be told apart without coverage information.]
Check For Gaps
[Diagram: one mapper per individual per chromosome (Individual 1 / chr 1, Individual 2 / chr 1, ..., Individual n / chr m); reducers keyed by individual:chromosome (abc123:chr1, abc124:chr1, ..., n:m) check missing positions against a coverage table of gaps, then write intermediate results.]
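The gap check reduces to an interval lookup: a position with no stored variant matches the reference if sequencing covered it, and is truly "no data" if it falls in a gap. A sketch with hypothetical names, assuming coverage is kept as sorted (start, stop) intervals per individual and chromosome:

```python
# Sketch of the gap check. `classify_position` is a hypothetical helper,
# not code from the talk; coverage intervals are inclusive and sorted.
import bisect

def classify_position(pos, variants, coverage):
    """variants: {pos: allele}; coverage: sorted list of (start, stop)."""
    if pos in variants:
        return variants[pos]
    # Find the rightmost interval starting at or before pos.
    i = bisect.bisect_right(coverage, (pos, float("inf"))) - 1
    if i >= 0 and coverage[i][0] <= pos <= coverage[i][1]:
        return "ref"
    return "no data"

variants = {8: "A"}
coverage = [(1, 20)]   # positions 1-20 were sequenced; 21 onward is a gap
```

With the slide's example, position 8 is a variant, 12 is covered but variant-free (reference), and 24 is uncovered (no data).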
This Takes Some Time
[Chart: Matrix Report Running Time. x-axis: # of VCFs (1 to 11); y-axis: time in minutes (0.00 to 35.00). Running time grows sharply as VCFs are added.]
Flat Wide
genome:chr:pos     123:alt  123:qual  456:alt  456:qual
b73:chr1:1000 ->   A        30        C        50
b73:chr1:1001 ->   T        35        T        45
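With the flat-wide layout there is one row per position per genome build and one column pair per VCF, so reading a position's calls for every individual is a single get rather than one join per VCF. A sketch, with a plain dict standing in for the HBase table:

```python
# Sketch of the flat-wide genome:chr:pos layout. `store` is a toy
# stand-in for an HBase table: rowkey -> {column: value}.

def flat_wide_row(store, genome, chrom, pos):
    """All per-VCF columns for one position, in a single lookup."""
    return store.get("%s:%s:%d" % (genome, chrom, pos), {})

store = {
    "b73:chr1:1000": {"123:alt": "A", "123:qual": "30",
                      "456:alt": "C", "456:qual": "50"},
    "b73:chr1:1001": {"123:alt": "T", "123:qual": "35",
                      "456:alt": "T", "456:qual": "45"},
}
row = flat_wide_row(store, "b73", "chr1", 1000)
```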
Improved workflow
[Diagram: mappers scan by position (chr1:100, ..., chrN:M) and join on positions in a single pass; reducers keyed by individual (123abc:chr1, 456def, ...) write intermediate results.]
Feature Search
Find genomic features like promoter regions, UTRs, exons, genes, etc.
Original Indexing Workflow
[Diagram: table mappers read the HBase feature table; each reducer runs an embedded Solr to build an index and writes it out as an index zip.]
Solr Pulls In Updates
[Diagram: a cron job on the Solr server pulls the zipped indexes from HDFS and merges them into the live index.]
SolrCloud
Cloudera Search
[Diagram: the HBase MapReduce Indexer Tool reads the HBase feature table, applies the mappings in a morphline file, and writes Solr indexes to HDFS.]
Morphline File
{
  extractAvroPaths {
    flatten : true
    paths : {
      feature_name : /featureName
      feature_type : /type
      start : /start
      stop : /stop
    }
  }
}
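What that extractAvroPaths command accomplishes can be shown in a few lines. An illustrative sketch only (not the morphline engine): pull fields out of a nested record by path and rename them into flat Solr document fields.

```python
# Sketch of path extraction, mirroring the morphline config above.
# The record contents are made-up example data, not real feature rows.
PATHS = {
    "feature_name": "/featureName",
    "feature_type": "/type",
    "start": "/start",
    "stop": "/stop",
}

def extract_paths(record, paths=PATHS):
    """Walk each /a/b path into the nested record; key results by Solr field."""
    doc = {}
    for solr_field, path in paths.items():
        value = record
        for part in path.strip("/").split("/"):
            value = value[part]
        doc[solr_field] = value
    return doc

doc = extract_paths({"featureName": "featureA", "type": "gene",
                     "start": 1000, "stop": 2400})
```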
Lessons Learned
• Use HBase like a HashMap, not like a relational database
• Denormalize HBase schemas; no foreign keys
• To scale Solr indexes past one server, use SolrCloud/Cloudera Search; don't rebuild it on your own
Questions

Managing Genomes At Scale: What We Learned - StampedeCon 2014


Editor's Notes

  • #2 Good afternoon. Rob Long. Today: big data at Monsanto, and some of the lessons we’ve learned. Started > a year ago with little knowledge, just “something with plants.” But then: so cool.
  • #3 Been doing big data since 2009. It wasn’t so big then, but… Big data is not about size but handling: multiple machines, denormalized NoSQL, MapReduce. Also variety: unstructured and semistructured integration.
  • #4 Monsanto not prev. assoc. w/ big data Sell seeds Recipes for soil air water into food We need to understand: Genomics, agronomy, breeding, chemistry, Data feedback for breeding Year over year improvement
  • #5 US average corn yields 1863 - 2002. Axes: x = year, y = bushels per acre (0 – 160). Up to the 1930’s, yield was 30 bushels. Increasing yields due to breeding, treatments, automation, weather prediction, etc. Now > 140 bushels/acre, a 5x increase. Goal of 300 bushels/acre by 2030. How?
  • #6 Knowledge of the molecular machinery, answers “how?” Highschool biology, applied Squiggles are important – called organelles Smaller scale ACGT’s Genes as recipes
  • #7 How many people familiar w/ the Human Genome Project? Some done on the east campus, actual books on shelves. A map for a genome: shred a book, you need a guide. A reference, just like the library. Many references, one per species.
  • #8 Store individual genomes Warehouse / data mart Archipelago of sources Bureaucratic friction Solution: GenRe
  • #9 <point out the parts> App servers are web logic, Using Cloudera, CDH4.6 hadoop cluster has 30 data nodes, 3 edge nodes Edge nodes host services Compute farm - blast to find sequences Graph db for lineage
  • #10 Header: standard/custom data. 1 variant per line; a variant is where the individual != ref. Address = chrom/pos; chrom = file, pos = offset, counting from 1. 1 line per pos. Maize: 2.3B bases, ~1/1000 variants. Keeping variants only means 1/1000th the storage requirement; pay a price, explained later. Three main access patterns: dump individual; matrix of sites passing filters (regions or whole genome, aligned to same ref); flanking regions. Not real-time.
  • #11 How many familiar w/ HBase? Watch “Introduction to NoSQL” by Martin Fowler on YouTube. HBase, Bigtable: a distributed persistent hashmap. Keyspace partitioned into regions; region servers host the regions. CAP: CP.
  • #12 Rows, cf rowkeys Column families, similar datas Sparse columns, 1 c1 vs 2 c1
  • #13 Our approach Tall/narrow schema Fixed # cols Many rows 1 row / variant / VCF Id:chr:pos, groups indiv. data Good for indiv. And flanking (explain) But…
  • #14 Review matrix use case. Filter: pos w/ >= 1 variant. More joins in tall/narrow; M * N scanners, worst case.
  • #15 TableMapper, region/individual Join this on chrom:pos, get individuals at a pos Now we can filter Emit if needed Intermediate result, map files
  • #16 Blue: ref Green: data vert.boxes, 8 variant, 12 ref, 24 no data 24 complicates, no row Either ref or no data? Solved: store gaps
  • #17 Group by individual/chrom Gaps addressed this way Intersect gaps Then w/gaps, back to chr:pos orientation Dump results to n files
  • #18 X: num individuals Y: minutes Exponential growth Too many joins
  • #19 Swap individual with genome build One row per pos per genome One get HBase used correctly
  • #20 Single pass for M individ.
  • #21 Variants have a context Location, in a gene, known effects Search on any field Jbrowse, open source visualization
  • #22 An established pattern Features dumped from hbase Embedded solr in reducers Zips in hdfs
  • #23 Solr server runs cron Pulls zipped indexes Merges Incremental updates skip MR phase Problems: Brittle, lots of code to maintain Not distributed, a single solr server Must move data off HDFS
  • #24 Ramping up to billions of features The old solr server was choking. Enter, solrcloud. Indexes in HDFS, Coordinates through zookeeper Collection concept Same REST interface
  • #25 Cloudera search lily hbase indexer service (real time) MR indexer Morphline file defines mappings between inputs and solr docs Foreign key problem Add a step
  • #26 Mapping from avro to solr Can have multi-level records
  • #27 Hbase like hashmap Denormalize Scale with solrcloud, don’t rebuild
  • #28 Credit to: Jeff GenRe team Amandeep Khurana