ADAM

https://github.com/massie/adam
Matt Massie
University of California, Berkeley
massie@berkeley.edu

Introductory talk on ADAM -- a system for storing and analyzing genomic data using Avro, Parquet and Spark.

ADAM

  1. ADAM
     https://github.com/massie/adam
     Matt Massie
     University of California, Berkeley
     massie@berkeley.edu
     Saturday, November 2, 13
  2. SAM, BAM, ADAM
     Sequence Alignment Map (SAM)
     Binary Alignment Map (BAM)
     Avro Data Alignment Map (ADAM)
  3. Pipeline Issues Today: Time and Scale
     • The time to go from reads to answers is too long
     • Processing thousands of BAM files for statistical analysis doesn’t scale
  4. ADAM: Speed and Scale
     • Read BAM once, perform transformations (e.g. sort, mark duplicates, BQSR) in distributed memory, write the analysis-ready ADAM file once
     • Use a distributed filesystem (HDFS), a fast execution system (Spark) and columnar data formats (Parquet) to scale
  5. Unlocking Genomic Data
     [Diagram labels: Shark (SQL), Hadoop M/R, Spark, Impala (SQL); ADAM files on the Hadoop Distributed File System (HDFS); ADAM files on the Local Filesystem; BAM]
  6. http://avro.apache.org/
     record ADAMRecord {
       union { null, string } referenceName = null;
       union { null, int } referenceId = null;
       union { null, long } start = null;
       union { null, int } mapq = null;
       union { null, string } readName = null;
       union { null, string } sequence = null;
       union { null, string } mateReference = null;
       union { null, long } mateAlignmentStart = null;
       union { null, string } cigar = null;
       union { null, string } qual = null;
       union { null, string } recordGroupId = null;

       union { boolean, null } readPaired = false;
       union { boolean, null } properPair = false;
       union { boolean, null } readMapped = false;
       union { boolean, null } mateMapped = false;
       union { boolean, null } readNegativeStrand = false;
       union { boolean, null } mateNegativeStrand = false;
       union { boolean, null } firstOfPair = false;
       union { boolean, null } secondOfPair = false;
       union { boolean, null } primaryAlignment = false;
       union { boolean, null } failedVendorQualityChecks = false;
       union { boolean, null } duplicateRead = false;

       union { null, string } mismatchingPositions = null;
       union { null, string } attributes = null;

       union { null, string } recordGroupSequencingCenter = null;
       union { null, string } recordGroupDescription = null;
       union { null, long } recordGroupRunDateEpoch = null;
       union { null, string } recordGroupFlowOrder = null;
       union { null, string } recordGroupKeySequence = null;
       union { null, string } recordGroupLibrary = null;
       union { null, int } recordGroupPredictedMedianInsertSize = null;
       union { null, string } recordGroupPlatform = null;
       union { null, string } recordGroupPlatformUnit = null;
       union { null, string } recordGroupSample = null;

       union { null, int } mateReferenceId = null;
     }
  7. Parquet
     http://parquet.io
     [Figure labels: Column-oriented layout, Row-oriented layout]
     https://blog.twitter.com/2013/dremel-made-simple-with-parquet
  8. Genomic Data Example
     chrom20   TCGA    4M
     chrom20   GAAT    4M1D
     chrom20   CCGAT   5M

     Column Oriented:
     chrom20 chrom20 chrom20 | TCGA GAAT CCGAT | 4M 4M1D 5M

     Row Oriented:
     chrom20 TCGA 4M | chrom20 GAAT 4M1D | chrom20 CCGAT 5M
  9. http://spark.incubator.apache.org/
  10. Low-Coverage BAM Experiment
      • 14 GB low-coverage BAM with 145M reads
      • 10-node EC2 cluster, m2.4xlarge
      • Reduced to 13 GB with ADAM
      • Conversion/upload to HDFS: 22 minutes
      • Sorted in 7 minutes
  11. High-Coverage BAM Experiment
      • Input: 237 GB NA12878 high-coverage, PCR-free, whole-genome BAM
      • Conversion took 4 hours on EC2 m2.4xlarge (8 CPUs, 68.4 GB memory)
      • Output size: 237 GB BAM reduced to 212 GB ADAM
  12. Current Features
      • Convert BAM to ADAM (read-oriented)
      • Sort an ADAM file by reference
      • Generate ADAMPileups
      • Print mpileup output
      • Very soon ADAM will be able to mark duplicates (initial benchmarks look good)
  13. In progress...
      • Frank is working on a distributed variant caller (https://github.com/fnothaft/avocado), local realignment, adam2bam
      • Chris Hartl is integrating ADAM with GATK (https://github.com/chartl/GAParquet) DiagnoseTargets, adding new VCF formats to ADAM, BQSR
      • Christos Kozanitis has been working on Shark and Impala integration for ad-hoc SQL read queries
      • Collaborations with Mt. Sinai, GenomeBridge and the Broad Institute, which are interested in using ADAM
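
The row-oriented and column-oriented layouts on the Genomic Data Example slide can be illustrated with a small Python sketch. It is not taken from the talk and does not use the ADAM or Parquet APIs: the helper names (`row_oriented`, `column_oriented`, `run_length_encode`) are hypothetical, and the run-length step is only an assumed illustration of why grouping equal values together can help compression, not a description of Parquet's actual on-disk encoding.

```python
from itertools import groupby

# The three example reads from the "Genomic Data Example" slide:
# (reference name, sequence, CIGAR string).
rows = [
    ("chrom20", "TCGA", "4M"),
    ("chrom20", "GAAT", "4M1D"),
    ("chrom20", "CCGAT", "5M"),
]


def row_oriented(records):
    """Flatten records one after another, as a row store would lay them out."""
    return [value for record in records for value in record]


def column_oriented(records):
    """Group values field by field, as a column store would lay them out."""
    return [list(column) for column in zip(*records)]


def run_length_encode(values):
    """Collapse runs of equal neighbours into (value, count) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


row_layout = row_oriented(rows)
columns = column_oriented(rows)

# In the column layout the reference names sit next to each other,
# so the whole column collapses to a single run.
reference_runs = run_length_encode(columns[0])

print(row_layout)
print(columns)
print(reference_runs)
```

In the row layout the repeated reference name is interleaved with sequences and CIGAR strings, so the same run-length step would find no runs longer than one; in the column layout the repeated values are adjacent.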
