
Scaling up genomic analysis with ADAM


Slides from DNAnexus about engineering/system design principles from ADAM, as well as emerging work in long-read OLC assembly.

Published in: Engineering

  1. 1. Scaling up genomic analysis with ADAM Frank Austin Nothaft, UC Berkeley AMPLab fnothaft@berkeley.edu, @fnothaft 12/8/2014
  2. 2. Data Intensive Genomics • Scale of genomic analyses is growing rapidly: • New experiments sequence 10-100k samples • Use high coverage, WGS for variant analyses • 100k samples @ 60x WGS will generate ~20PB of read data and ~300TB of genotype data
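The ~20PB figure on this slide can be sanity-checked with a rough calculation. The sketch below is only a back-of-envelope estimate: the genome length (3.2 Gbp) and the on-disk cost of one byte per sequenced base are assumptions chosen for illustration and are not taken from the slides.

```python
# Back-of-envelope estimate of read-data volume for a sequencing cohort.
# Both constants are illustrative assumptions, not values from the talk.
GENOME_BASES = 3.2e9   # assumed human genome length in bases
BYTES_PER_BASE = 1.0   # assumed storage cost per sequenced base (base + quality, compressed)

PB = 1e15
TB = 1e12


def read_data_bytes(samples, coverage,
                    genome_bases=GENOME_BASES,
                    bytes_per_base=BYTES_PER_BASE):
    """Total bytes of read data for `samples` genomes sequenced at `coverage`x."""
    return samples * coverage * genome_bases * bytes_per_base


def per_sample_bytes(total_bytes, samples):
    """Average bytes per sample given a cohort-wide total."""
    return total_bytes / samples


if __name__ == "__main__":
    total = read_data_bytes(samples=100_000, coverage=60)
    print(f"read data: {total / PB:.1f} PB")
    # The slide's ~300TB of genotype data spread over 100k samples:
    print(f"genotypes per sample: {per_sample_bytes(300 * TB, 100_000) / 1e9:.1f} GB")
```

Under these assumptions the read data comes to roughly 19 PB, in line with the slide's ~20PB, and 300TB of genotypes works out to about 3 GB per sample.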
  3. 3. Petabytes Cause Problems 1. Analysis systems must be horizontally scalable without substantial programmer overhead 2. Data storage format must compress well while providing good read performance 3. Need to efficiently slice and dice dataset: not all users want the same views or subsets of data
  4. 4. Analysis Characteristics • Current genomics pipelines are limited by I/O • Most genomics algorithms can be formulated as a data or graph parallel computation • Analysis algorithms use iteration and pipelining • Reference genome/experiment metadata access must be cheap! —> impacts analysis performance
  5. 5. What is ADAM? • An open source, high performance, distributed platform for genomic analysis • ADAM defines a: 1. Data schema and layout on disk* 2. A Scala API 3. A command line interface * Via Avro and Parquet
  6. 6. Principles for Scalable Design in ADAM • Reuse commodity horizontally scalable systems • Parallel FS and data representation (HDFS + Parquet) combined with in-memory computing eliminates disk bandwidth bottleneck • Spark provides horizontally scalable iterative/ pipelined Map-Reduce • Minimize data movement: send code to data, efficiently encode metadata
  7. 7. Spark • An in-memory data-parallel computing framework • Optimized for iterative jobs, unlike Hadoop • Data maintained in memory unless inter-node movement is needed (e.g., on repartitioning) • Presents a functional programming API, along with support for iterative programming via REPL • Set Daytona GraySort record (100TB in 23 min, 206 nodes)
  8. 8. Data Format • Avro schema encoded by Parquet • Schema can be updated without breaking backwards compatibility • Normalize metadata fields into schema for O(1) metadata access • Genotype schema is strictly biallelic, a “cell in the matrix”

      record AlignmentRecord {
        union { null, Contig } contig = null;
        union { null, long } start = null;
        union { null, long } end = null;
        union { null, int } mapq = null;
        union { null, string } readName = null;
        union { null, string } sequence = null;
        union { null, string } mateReference = null;
        union { null, long } mateAlignmentStart = null;
        union { null, string } cigar = null;
        union { null, string } qual = null;
        union { null, string } recordGroupName = null;
        union { int, null } basesTrimmedFromStart = 0;
        union { int, null } basesTrimmedFromEnd = 0;
        union { boolean, null } readPaired = false;
        union { boolean, null } properPair = false;
        union { boolean, null } readMapped = false;
        union { boolean, null } mateMapped = false;
        union { boolean, null } firstOfPair = false;
        union { boolean, null } secondOfPair = false;
        union { boolean, null } failedVendorQualityChecks = false;
        union { boolean, null } duplicateRead = false;
        union { boolean, null } readNegativeStrand = false;
        union { boolean, null } mateNegativeStrand = false;
        union { boolean, null } primaryAlignment = false;
        union { boolean, null } secondaryAlignment = false;
        union { boolean, null } supplementaryAlignment = false;
        union { null, string } mismatchingPositions = null;
        union { null, string } origQual = null;
        union { null, string } attributes = null;
        union { null, string } recordGroupSequencingCenter = null;
        union { null, string } recordGroupDescription = null;
        union { null, long } recordGroupRunDateEpoch = null;
        union { null, string } recordGroupFlowOrder = null;
        union { null, string } recordGroupKeySequence = null;
        union { null, string } recordGroupLibrary = null;
        union { null, int } recordGroupPredictedMedianInsertSize = null;
        union { null, string } recordGroupPlatform = null;
        union { null, string } recordGroupPlatformUnit = null;
        union { null, string } recordGroupSample = null;
        union { null, Contig } mateContig = null;
      }
  9. 9. Parquet • ASF Incubator project, based on Google Dremel • http://www.parquet.io • High performance columnar store with support for projections and push-down predicates • 3 layers of parallelism: • File/row group • Column chunk • Page Image from Parquet format definition: https://github.com/Parquet/parquet-format
  10. 10. Big Data in Parquet • ADAM in Parquet provides a 25% improvement over compressed BAM • Enables efficient slice-and-dice: • Can select column projections —> reduce I/O • Support pushdown predicates for efficient filtering • Have Parquet/S3 integration to push computing down into remote block stores for cold data
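To make column projection and predicate pushdown concrete, here is a toy, pure-Python model of a columnar store. It is not Parquet's API or on-disk layout: `RowGroup`, `scan`, and the sample data are hypothetical names invented for this sketch. The idea it illustrates is that per-row-group min/max statistics let a reader skip whole row groups, and that only the requested columns are materialized.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """Toy row group: column name -> list of values, all of equal length."""
    columns: dict

    def stats(self, name):
        """Min/max of one column, standing in for stored column statistics."""
        col = self.columns[name]
        return min(col), max(col)


def scan(row_groups, projection, filter_column, lo, hi):
    """Return (rows, groups_read) for rows with lo <= row[filter_column] < hi.

    Only the columns named in `projection` are materialized, and row groups
    whose min/max range cannot match the predicate are skipped entirely.
    """
    rows, groups_read = [], 0
    for rg in row_groups:
        cmin, cmax = rg.stats(filter_column)
        if cmax < lo or cmin >= hi:
            continue  # predicate pushdown: no row in this group can match
        groups_read += 1
        keep = [i for i, v in enumerate(rg.columns[filter_column]) if lo <= v < hi]
        for i in keep:
            # column projection: touch only the requested columns
            rows.append({name: rg.columns[name][i] for name in projection})
    return rows, groups_read


# Hypothetical alignment data using three field names from the schema above.
groups = [
    RowGroup({"start": [100, 200, 300], "mapq": [60, 20, 60],
              "sequence": ["ACGT", "TTGA", "CCAT"]}),
    RowGroup({"start": [5000, 5100], "mapq": [30, 60],
              "sequence": ["GGCA", "ATAT"]}),
]

rows, groups_read = scan(groups, ["start", "mapq"], "start", 150, 400)
print(rows, groups_read)
```

In this example the second row group is never read, and the `sequence` column, typically the largest, is never touched.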
  11. 11. Scalability • Evaluated on 1000G WGS NA12878, 234GB dataset • Used 32-128 m2.4xlarge, 1 cr1.8xlarge from AWS • Achieve linear scalability out to 128 nodes for most tasks • 2-4x improvement vs {GATK, samtools/Picard} on single machine for most tasks
  12. 12. Long-read assembly with PacMin
  13. 13. The State of Analysis • Conventional short-read alignment based pipelines are really good at calling SNPs • Need improvement at calling INDELs and SVs • And are slow: 2 weeks to sequence, 1 week to analyze. Not fast enough. • If we move away from short reads, do we have other options?
  14. 14. Opportunities • New read technologies are available • Provide much longer reads (250bp vs. >10kbp) • Different error model… (15% INDEL errors, vs. 2% SNP errors) • Generally, lower sequence specific bias Left: PacBio homepage, Right: Wired, http://www.wired.com/2012/03/oxford-nanopore-sequencing-usb/
  15. 15. If long reads are available… • We can use conventional methods: Carneiro et al, Genome Biology 2012
  16. 16. But! • Why not make raw assemblies out of the reads? 1. Find overlapping reads: for all pairs of reads (i, j), does i overlap j? 2. Find consensus sequence: …ACACTGCGACTCATCGACTC… • Problems: 1. Overlapping is O(n^2), and a single evaluation is expensive anyway 2. Typical algorithms find a single consensus sequence; what if we’ve got polymorphisms?
  17. 17. Fast Overlapping with MinHashing • Wonderful realization by Berlin et al. [1]: overlapping is similar to the document similarity problem • Use MinHashing to approximate similarity:
      • Per document/read, compute signature: 1. Cut into shingles 2. Apply random hashes to shingles 3. For each random hash, take the min over all shingles
      • Hash into buckets: signatures of length l can be hashed into b buckets, so we expect to compare all elements with similarity ≥ (1/b)^(b/l)
      • Compare: for two documents with signatures of length l, Jaccard similarity is estimated by (# equal hashes) / l
      • Easy to implement in Spark: map, groupBy, map, filter
      [1] Berlin et al, bioRxiv 2014
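The MinHash steps on this slide can be sketched in plain Python. This is a minimal single-machine illustration, not the PacMin or Spark implementation: all function names, the use of salted BLAKE2b as the "random hashes", and the parameter defaults are assumptions made for the example. The comments mark which stage of the slide's map, groupBy, map, filter pipeline each step would correspond to.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(read, k):
    """All length-k substrings of a read (assumes len(read) >= k)."""
    return {read[i:i + k] for i in range(len(read) - k + 1)}


def _hash(seed, shingle):
    """One member of a hash family, selected by `seed` (salted BLAKE2b)."""
    digest = hashlib.blake2b(shingle.encode(), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")


def signature(read, k, num_hashes):
    """MinHash signature: for each hash function, the min over all shingles."""
    sh = shingles(read, k)
    return tuple(min(_hash(seed, s) for s in sh) for seed in range(num_hashes))


def lsh_threshold(bands, length):
    """Approximate similarity above which pairs are expected to collide."""
    return (1.0 / bands) ** (bands / length)


def candidate_pairs(signatures, bands):
    """Group reads that share at least one band of their signature."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for name, sig in signatures.items():
        for b in range(bands):
            # groupBy: key is (band index, band contents)
            buckets[(b, sig[b * rows:(b + 1) * rows])].add(name)
    pairs = set()
    for members in buckets.values():
        # map: expand each bucket into the pairs it contains
        pairs.update(combinations(sorted(members), 2))
    return pairs


def estimated_jaccard(sig_a, sig_b):
    """(# equal hashes) / l, the estimator given on the slide."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def overlaps(reads, k=4, num_hashes=64, bands=16, threshold=0.3):
    """Return {(read_a, read_b): estimated similarity} above `threshold`."""
    # map: read -> signature
    sigs = {name: signature(seq, k, num_hashes) for name, seq in reads.items()}
    result = {}
    for a, b in candidate_pairs(sigs, bands):
        est = estimated_jaccard(sigs[a], sigs[b])
        if est >= threshold:  # filter: keep only sufficiently similar pairs
            result[(a, b)] = est
    return result


reads = {
    "r1": "ACGTACGTTGCATGCA",
    "r2": "ACGTACGTTGCATGCA",   # identical to r1
    "r3": "GGGGGGGGGGGGGGGG",   # shares no 4-mer with r1 or r2
}
print(overlaps(reads))
```

Only reads that land in a shared bucket are ever compared, which is what avoids evaluating all O(n^2) pairs. With 64 hashes in 16 bands, `lsh_threshold(16, 64)` is about 0.5.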
  18. 18. Overlaps to Assemblies • Finding pairwise overlaps gives us a directed graph between reads (lots of edges!)
  19. 19. Transitive Reduction • We can find a consensus between clique members • Or, we can reduce down: • Via two iterations of Pregel!
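As an illustration of what transitive reduction does to an overlap graph, here is a small sequential sketch. It is not the Pregel formulation mentioned on the slide, and it assumes the overlap graph is acyclic; the function name and the example reads are hypothetical.

```python
def transitive_reduction(graph):
    """Remove edges implied by longer paths.

    graph: dict mapping node -> set of successor nodes (assumed acyclic).
    Returns a new dict with the same nodes and only non-redundant edges.
    """
    def reachable(start):
        """All nodes reachable from `start` via one or more edges."""
        seen, stack = set(), [start]
        while stack:
            node = stack.pop()
            for nxt in graph.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    reduced = {}
    for u, succs in graph.items():
        # Edge u -> w is redundant if w can be reached through
        # some other direct successor of u.
        redundant = set()
        for v in succs:
            redundant |= reachable(v)
        reduced[u] = set(succs) - redundant
    return reduced


# Four reads tiling a region: every read overlaps all later reads.
overlap_graph = {
    "r1": {"r2", "r3", "r4"},
    "r2": {"r3", "r4"},
    "r3": {"r4"},
    "r4": set(),
}
print(transitive_reduction(overlap_graph))
```

The dense graph of six overlap edges collapses to the chain r1, r2, r3, r4, which is the shape an assembler wants to walk.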
  20. 20. Monoallelic Sequence Model • Traditional probabilistic models assume independence at each site and a good reference model • This discards information about local sequence context • Can consider a different formulation of the problem: • Per reduced segment, build a graph of the alleles • Find the allelic copy numbers that maximize segment probability
  21. 21. Allele Graphs • [Figure: allele graph with nodes ACACTCG, C, A, TCTCA, G, C, TCCACACT] • Edges of graph define conditional probabilities • Can efficiently marginalize probabilities over graph using Eliminate algorithm [1], exactly solve for argmax • Notes: X = copy number of this allele; Y = copy number of preceding allele; k = number of reads observed; j = number of reads supporting Y → X transition; P_i = probability that read i supports Y → X transition [1] Jordan, “Probabilistic Graphical Models.”
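To show how elimination can solve exactly for an argmax, the sketch below runs max-product elimination over the simplest possible case: a chain of alleles where each copy number depends only on the preceding one. The slide's actual conditional probability formula is not reproduced in this transcript, so the tables here are made-up numbers, and the chain structure, the function name, and the two-state copy-number space are all assumptions for illustration.

```python
def map_assignment(prior, transitions):
    """Most probable copy-number assignment along a chain of alleles.

    prior: dict state -> P(first allele has this copy number)
    transitions: list of dicts mapping (prev, cur) -> P(cur | prev),
                 one table per edge of the chain.
    Returns (list of states, probability of that assignment).
    """
    best = dict(prior)   # best[s]: max probability of any prefix ending in s
    back = []            # back[t][s]: best predecessor of state s at step t
    for table in transitions:
        states = {cur for _, cur in table}
        new_best, pointers = {}, {}
        for cur in states:
            # Eliminate the previous variable by maximizing over it.
            prev, score = max(
                ((p, best[p] * table.get((p, cur), 0.0)) for p in best),
                key=lambda item: item[1],
            )
            new_best[cur], pointers[cur] = score, prev
        best = new_best
        back.append(pointers)

    last = max(best, key=best.get)
    prob = best[last]
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1], prob


# Hypothetical tables over copy numbers {1, 2} for a three-allele chain.
prior = {1: 0.6, 2: 0.4}
edge_1 = {(1, 1): 0.9, (1, 2): 0.1, (2, 1): 0.2, (2, 2): 0.8}
edge_2 = {(1, 1): 0.5, (1, 2): 0.5, (2, 1): 0.1, (2, 2): 0.9}

path, prob = map_assignment(prior, [edge_1, edge_2])
print(path, round(prob, 3))
```

Each step eliminates one variable, so the cost grows linearly with the number of alleles rather than exponentially with it. Note that the best joint assignment here starts from copy number 2 even though copy number 1 has the higher prior, which is why the maximization has to be done jointly.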
  22. 22. Output • Current assemblers emit FASTA contigs • We’ll emit “multigs”, which we’ll map back to reference graph • Multig = multi-allelic (polymorphic) contig • Will include a confidence score per base • Working with UCSC, who’ve done some really neat work [1] deriving formalisms & building software for mapping between sequence graphs, and GA4GH ref. variation team [1] Paten et al, “Mapping to a Reference Genome Structure”, arXiv 2014.
  23. 23. Acknowledgements • UC Berkeley: Matt Massie, André Schumacher, Jey Kottalam, Christos Kozanitis, Adam Bloniarz • Mt. Sinai: Arun Ahuja, Neal Sidhwaney, Michael Linderman, Jeff Hammerbacher • GenomeBridge: Timothy Danford, Carl Yeksigian • Cloudera: Uri Laserson • Microsoft Research: Jeremy Elson, Ravi Pandya • And many other open source contributors: 26 contributors to ADAM/BDG from >11 institutions
