
Scaling up genomic analysis with ADAM


Slides from AMPCamp 5 about the engineering and system design principles behind ADAM.


  1. Scaling up genomic analysis with ADAM Frank Austin Nothaft, UC Berkeley AMPLab fnothaft@berkeley.edu, @fnothaft 11/20/2014
  2. Credit: Matt Massie & NHGRI
  3. The Sequencing Abstraction (diagram: the sentence “It was the best of times, it was the worst of times…” broken into overlapping substrings such as “It was the”, “the best of”, “best of times”, “times, it was”, “was the worst”, “the worst of”, “worst of times”) • Humans have 46 chromosomes, and each chromosome looks like a long string • We get randomly distributed substrings, and want to reassemble the original, whole string • Metaphor borrowed from Michael Schatz (a toy Scala sketch of this sampling appears below the slide list)
  4. Genomics = Big Data • A sequencing run produces >100 GB of raw data • Want to process 1,000s of samples at once to improve statistical power • Current pipelines take about a week to run and are not horizontally scalable
  5. How do we process a genome?
  6. What’s our goal? • The human genome is 3.3B letters long, but our reads are only 50-250 letters long • The sequence of the average human genome is known • Insight: each human genome only differs at about 1 in 1,000 positions, so we can align short reads to the average genome and compute the diff (a toy alignment sketch appears below the slide list)
  7. Align Reads (slides 7-14 animate the alignment step: each substring, such as “best of times” and “was the worst”, is placed at its position in the reference sentence “It was the best of times, it was the worst of times…”)
  15. Assemble Reads (slides 15-20 animate the assembly step: the overlapping, aligned substrings are merged back into the full sentence “It was the best of times, it was the worst of times”)
  21. Overall Pipeline Structure From “GATK Best Practices”, https://www.broadinstitute.org/gatk/guide/best-practices
  22. Overall Pipeline Structure The end-to-end pipeline takes ~120 hours; the highlighted stages take ~100 hours, and that is where ADAM works From “GATK Best Practices”, https://www.broadinstitute.org/gatk/guide/best-practices
  23. Making Genomics Horizontally Scalable
  24. Key Observations • Current genomics pipelines are I/O limited • Most genomics algorithms can be formulated as data-parallel or graph-parallel computation • Genomics is heavy on iteration/pipelining; the data access pattern is write once, read many times • High-coverage, whole-genome data (>220 GB) will become the main dataset for human genetics
  25. ADAM Principles • Use schema as “narrow waist” • Columnar data representation + in-memory computing eliminates disk bandwidth bottleneck • Minimize data movement: send code to data • (diagram of the ADAM stack, from top to bottom: Application: Transformations; Presentation: Enriched Models; Evidence Access: MapReduce/DBMS; Schema: Data Models; Materialized Data: Columnar Storage; Data Distribution: Parallel FS/Sharding; Physical Storage: Disk) (a plain-Spark sketch of these principles appears below the slide list)
  26. Data Independence • Many current genomics systems require data to be stored and processed in sorted order • This is an abstraction inversion! • Narrow waist at schema forces processing to be abstract from data, data to be abstract from disk • Do tricks at the processing level (fast coordinate-system joins) to give necessary programming abstractions (a region-join sketch appears below the slide list)
  27. Data Format • Genomics algorithms frequently access global metadata • Schema is fully denormalized, allows O(1) access to metadata • Make all fields nullable to allow for arbitrary column projections • Avro enables literate programming:
     record AlignmentRecord {
       union { null, Contig } contig = null;
       union { null, long } start = null;
       union { null, long } end = null;
       union { null, int } mapq = null;
       union { null, string } readName = null;
       union { null, string } sequence = null;
       union { null, string } mateReference = null;
       union { null, long } mateAlignmentStart = null;
       union { null, string } cigar = null;
       union { null, string } qual = null;
       union { null, string } recordGroupName = null;
       union { int, null } basesTrimmedFromStart = 0;
       union { int, null } basesTrimmedFromEnd = 0;
       union { boolean, null } readPaired = false;
       union { boolean, null } properPair = false;
       union { boolean, null } readMapped = false;
       union { boolean, null } mateMapped = false;
       union { boolean, null } firstOfPair = false;
       union { boolean, null } secondOfPair = false;
       union { boolean, null } failedVendorQualityChecks = false;
       union { boolean, null } duplicateRead = false;
       union { boolean, null } readNegativeStrand = false;
       union { boolean, null } mateNegativeStrand = false;
       union { boolean, null } primaryAlignment = false;
       union { boolean, null } secondaryAlignment = false;
       union { boolean, null } supplementaryAlignment = false;
       union { null, string } mismatchingPositions = null;
       union { null, string } origQual = null;
       union { null, string } attributes = null;
       union { null, string } recordGroupSequencingCenter = null;
       union { null, string } recordGroupDescription = null;
       union { null, long } recordGroupRunDateEpoch = null;
       union { null, string } recordGroupFlowOrder = null;
       union { null, string } recordGroupKeySequence = null;
       union { null, string } recordGroupLibrary = null;
       union { null, int } recordGroupPredictedMedianInsertSize = null;
       union { null, string } recordGroupPlatform = null;
       union { null, string } recordGroupPlatformUnit = null;
       union { null, string } recordGroupSample = null;
       union { null, Contig } mateContig = null;
     }
  28. Parquet • ASF Incubator project, based on Google Dremel • http://www.parquet.io • High performance columnar store with support for projections and push-down predicates • 3 layers of parallelism: file/row group, column chunk, and page • Image from Parquet format definition: https://github.com/Parquet/parquet-format (a projection/predicate sketch appears below the slide list)
  29. Access to Remote Data • For genomics, we often have a very large dataset of which we only want to analyze a part • This dataset might be stored in S3 or an equivalent block store • Minimize data movement by allowing Parquet to support predicate pushdown/projections into S3 • Work is in progress, at https://github.com/bigdatagenomics/adam/tree/multi-loader
  30. Performance • Reduced pipeline time from 100 hours to ~1 hour • Linear speedup through 128 nodes when processing 234 GB of data • For flagstat, columnar projection leads to a 5x speedup
  31. ADAM Status • Apache 2 licensed OSS • 25 contributors across 10 institutions • Pushing for a production 1.0 release towards the end of the year • Working with GA4GH to use concepts from ADAM to improve broader genomics data management techniques
  32. Acknowledgements • UC Berkeley: Matt Massie, André Schumacher, Jey Kottalam, Christos Kozanitis, Dave Patterson, Anthony Joseph • Mt. Sinai: Arun Ahuja, Neal Sidhwaney, Ryan Williams, Michael Linderman, Jeff Hammerbacher • GenomeBridge: Timothy Danford, Carl Yeksigian • The Broad Institute: Chris Hartl • Cloudera: Uri Laserson • Microsoft Research: Jeremy Elson, Ravi Pandya • And other open source contributors, including Michael Heuer, Neil Ferguson, Andy Petrella, Xavier Tordoir!
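To make slide 3's sequencing metaphor concrete, here is a toy Scala sketch that samples random fixed-length substrings from a sentence standing in for a genome; reassembling those substrings is exactly the problem the rest of the pipeline solves. The object and method names are invented for this write-up and are not ADAM code.

    // Toy model of the sequencing abstraction: the "genome" is a sentence, and
    // sequencing yields short, randomly positioned substrings ("reads").
    object SequencingMetaphor {
      val genome = "It was the best of times, it was the worst of times"

      // Sample n reads of length readLength starting at random positions.
      def sequence(n: Int, readLength: Int, rng: scala.util.Random): Seq[String] =
        Seq.fill(n) {
          val start = rng.nextInt(genome.length - readLength + 1)
          genome.substring(start, start + readLength)
        }

      def main(args: Array[String]): Unit = {
        val reads = sequence(n = 20, readLength = 12, rng = new scala.util.Random(42))
        reads.foreach(println) // prints the sampled 12-character reads
      }
    }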
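Slide 6's insight (align short reads against a known reference, then keep only the differences) can be shown with a brute-force toy aligner. This is purely illustrative Scala with invented names; real aligners use indexed data structures rather than scanning every offset.

    // Place each read at the reference offset with the fewest mismatches, then
    // report the "diff": the positions where the read disagrees with the reference.
    object ToyAligner {
      val reference = "It was the best of times, it was the worst of times"

      // Number of mismatching characters if the read is placed at this offset.
      def mismatches(read: String, offset: Int): Int =
        read.indices.count(i => read(i) != reference(offset + i))

      // Brute-force alignment: try every offset, keep the best one.
      def align(read: String): Int =
        (0 to reference.length - read.length).minBy(offset => mismatches(read, offset))

      // The diff: (reference position, reference character, read character).
      def diff(read: String): Seq[(Int, Char, Char)] = {
        val offset = align(read)
        read.indices
          .filter(i => read(i) != reference(offset + i))
          .map(i => (offset + i, reference(offset + i), read(i)))
      }

      def main(args: Array[String]): Unit = {
        // A read carrying one "variant": the 'b' in "best" was read as 'r'.
        println(diff("the rest of times")) // Vector((11,b,r))
      }
    }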
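Slide 25's principles can be sketched with plain Spark SQL. Note that this is not ADAM's API and uses a newer Spark interface than the 2014-era code behind the talk: a small case class stands in for a few AlignmentRecord fields, Parquet plays the role of the columnar layer, a cached Dataset plays the role of in-memory computing, and the closures in the query are the code that gets shipped to the data. The input path is hypothetical.

    import org.apache.spark.sql.SparkSession

    // A stand-in for a few fields of the AlignmentRecord schema (the narrow waist).
    case class Read(contigName: String, start: Long, mapq: Int, sequence: String)

    object NarrowWaistSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("narrow-waist-sketch").getOrCreate()
        import spark.implicits._

        // Columnar storage: read the records from Parquet into a typed Dataset.
        val reads = spark.read.parquet("/data/reads.parquet").as[Read]

        // In-memory computing: cache once, reuse across later stages of a pipeline.
        reads.cache()

        // Send code to data: this filter/group runs on the executors holding the rows.
        val highQualityPerContig = reads
          .filter(_.mapq >= 30)
          .groupByKey(_.contigName)
          .count()

        highQualityPerContig.show()
        spark.stop()
      }
    }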
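The "fast coordinate-system joins" mentioned on slide 26 boil down to joining two sets of records by genomic-interval overlap instead of by equality on a key. The sketch below does this with plain Scala collections and invented type names; ADAM's version is distributed over Spark and smarter about partitioning, but the overlap predicate is the heart of it.

    // Join records by coordinate overlap rather than key equality.
    case class Region(contig: String, start: Long, end: Long)

    object RegionJoinSketch {
      // Half-open intervals on the same contig overlap if each starts before the other ends.
      def overlaps(a: Region, b: Region): Boolean =
        a.contig == b.contig && a.start < b.end && b.start < a.end

      // Pair every read with every target region it overlaps. Grouping targets by
      // contig avoids cross-contig comparisons; a real implementation would also
      // sort or partition by coordinate instead of scanning all targets per contig.
      def regionJoin[T](reads: Seq[(Region, T)], targets: Seq[Region]): Seq[(Region, T)] = {
        val targetsByContig = targets.groupBy(_.contig)
        for {
          (readRegion, read) <- reads
          target <- targetsByContig.getOrElse(readRegion.contig, Seq.empty)
          if overlaps(readRegion, target)
        } yield (target, read)
      }

      def main(args: Array[String]): Unit = {
        val reads = Seq(
          (Region("chr1", 100L, 150L), "read1"),
          (Region("chr1", 400L, 450L), "read2"),
          (Region("chr2", 100L, 150L), "read3"))
        val exons = Seq(Region("chr1", 120L, 300L), Region("chr2", 90L, 110L))
        regionJoin(reads, exons).foreach(println) // read1 hits the chr1 exon, read3 the chr2 exon
      }
    }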
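Slide 28's projections and push-down predicates, and the flagstat speedup on slide 30, come from reading only the column chunks a query needs and skipping row groups whose statistics rule them out. The Spark SQL sketch below shows the idea against a hypothetical Parquet file of alignment records; it is illustrative, not ADAM's loader.

    import org.apache.spark.sql.SparkSession

    object ProjectionSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("projection-sketch").getOrCreate()
        import spark.implicits._

        // Hypothetical Parquet data laid out with the AlignmentRecord schema.
        val alignments = spark.read.parquet("/data/alignments.parquet")

        // Projection: touch only three of the ~40 columns in the schema.
        // Predicate: a simple comparison that Parquet can check against row-group stats.
        val projected = alignments
          .select($"readName", $"start", $"mapq")
          .filter($"mapq" > 30)

        projected.explain() // the physical plan lists ReadSchema and PushedFilters
        println(projected.count())
        spark.stop()
      }
    }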
