CRAM
Version 4 proposal
James Bonfield,
Wellcome Sanger Institute
What is CRAM?
● An alternative for aligned (SAM, BAM) and unaligned
(FASTQ) data, since 2012 (EBI).
– An accepted GA4GH standard.
– Updated as needed: V1 → V2 → V3 → V3.1/V4?
● Can subset by row (region) or column (data type).
– Even without transcoding (cram_filter).
● Flexible: can trade speed vs size vs random access.
● Htslib / Htsjdk offer a unified API to SAM, BAM, CRAM.
– Multiple language bindings: C, Java, JavaScript, Rust, Python,
Perl, C++, R, …
CRAM adoption myths
● From "The Quest To Save Genomics"
Bioinformatics experts already use standard compression tools like gzip to
shrink the size of a file by up to a factor of 20. Some researchers also use
more specialized compression tools that are optimized for genomic data, but
none of these tools have seen wide adoption.
https://spectrum.ieee.org/computing/software/the-desperate-quest-for-genomic-compression-algorithms
● WRONG!! CRAM has “been there, done that, got the archives”.
● ENA archives:
– ~170,000 BAMs
– ~350,000 CRAMs
● EGA archives:
– ~470,000 BAMs
– ~850,000 CRAMs
● Broad data:
– ~80,000 germline genome CRAMs
– ~200,000 germline exome BAMs
(soon to be CRAM)
– All somatic data in BAM.
Low Coverage, Illumina HiSeq 2000: Decompression
[Figure: Time (s, log scale) against Size (MB) for BAM, Deez, Quip, CRAM3 and CRAM4, in two panels, one of them labelled 8 threads.]
High Coverage, Illumina NovaSeq: Decompression
[Figure: Time (s, log scale) against Size (MB) for BAM, Deez, Quip, CRAM3 and CRAM4, in two panels, one of them labelled 8 threads.]
Crumble – lossy aligned data
● Lossy compression of read names
– Keep pairing information only.
● Lossy compression of quality values using “quality budget”
– Vertical: keep quality in regions where variant call is uncertain
(both alt and ref calls).
– Horizontal: smooth quality using libCSAM's P-block method.
– NB: designed for single-sample germline mutations only!
● Lossy compression of auxiliary tags (discard OQ, BI, BD, etc).
● Validate against Syndip (CHM1 + CHM13 synthetic diploid).
– Covers more genome than GIAB / Platinum Genomes, including
problematic STRs.
– Stresses hard-to-call indels!
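The "horizontal" smoothing step can be illustrated with a small sketch. It assumes that "P-block" means grouping consecutive quality values that all stay within a tolerance p of one shared representative value, and then storing only that representative. That reading, the greedy grouping and the midpoint choice are assumptions made for illustration; the function name is hypothetical and this is not code from libCSAM or Crumble.

```python
from typing import List


def p_block_smooth(quals: List[int], p: int) -> List[int]:
    """Greedy block smoothing (illustrative only, not libCSAM's code).

    Consecutive values are grouped while max - min <= 2 * p, so every
    value in a block lies within p of the block midpoint. Each value is
    then replaced by that midpoint.
    """
    out: List[int] = []
    block: List[int] = []
    lo = hi = 0
    for q in quals:
        new_lo = q if not block else min(lo, q)
        new_hi = q if not block else max(hi, q)
        if block and new_hi - new_lo > 2 * p:
            # Adding q would break the tolerance: flush the current block.
            out.extend([(lo + hi) // 2] * len(block))
            block, lo, hi = [q], q, q
        else:
            block.append(q)
            lo, hi = new_lo, new_hi
    if block:
        out.extend([(lo + hi) // 2] * len(block))
    return out


if __name__ == "__main__":
    print(p_block_smooth([30, 31, 32, 10, 11, 40], p=1))
```

Longer runs of identical values compress better, which is the point of smoothing; a larger p gives longer runs at the cost of more distortion.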
Crumble Sizes
[Figure: bar charts of Size (Gb), Raw against Crumble, for BAM, CRAM3, CRAM4 and CRAM4-max.]
Syndip full (~50x): 165.9 to 10.7Gb
Syndip 15x: 51.3 to 3.3Gb
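As a quick check on the two size ranges quoted on this slide (165.9 to 10.7Gb and 51.3 to 3.3Gb), the reduction can be expressed as a ratio and as a percentage saved. The sketch below only does that arithmetic. The pairing of each range with a dataset follows the order in which they appear on the slide and is an assumption, as is the helper name.

```python
def compression_ratio(before_gb: float, after_gb: float) -> float:
    """Return how many times smaller the output is than the input."""
    return before_gb / after_gb


def percent_saved(before_gb: float, after_gb: float) -> float:
    """Return the percentage of the original size that was removed."""
    return 100.0 * (1.0 - after_gb / before_gb)


if __name__ == "__main__":
    # Figures from the slide; dataset labels assumed from slide order.
    for label, before, after in [
        ("Syndip full (~50x)", 165.9, 10.7),
        ("Syndip 15x", 51.3, 3.3),
    ]:
        ratio = compression_ratio(before, after)
        saved = percent_saved(before, after)
        print(f"{label}: {ratio:.1f}x smaller, {saved:.1f}% saved")
```

Both ranges come out at roughly the same ratio, about 15.5 times smaller.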
Variant accuracy with GATK (Chr1 only)
● Lossless & unary binning (Q=25) provide baselines.
● Both QVZ2 (-t4) and Crumble are in the smaller/better box.
● CALQ poorer than 2-binning. (High coverage issues?)
[Figure: Qual size vs Error (FN+FP, SNP+Indel); Syndip Chr1 only, ~50x. Errors plotted against Quality Size (Mb) for Lossless, Q=4/28, Q=25, Crumble, CALQ and QVZ2, with the lossless and 100% loss levels marked and a “Smaller, better” box.]
Crumble calling accuracy
(Raw GATK – no VQSR step)
[Figure: True Positive % against False Positive %, with SNP and Indel panels at 50x and at 15x. Series: Lossless GATK, Lossless GATK filt, Crumble -9p8, Crumble -9p8 filt.]
● Vary VCF QUAL threshold to get lines
● With / without filtering.
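The threshold sweep behind these curves can be sketched as follows: each call carries a QUAL value, calls below a threshold are dropped, and the true and false positive percentages are recomputed at each threshold. The function name, the toy data and the exact percentage definitions (TP% as a share of truth-set variants, FP% as a share of retained calls) are assumptions for illustration, not taken from the slides.

```python
from typing import List, Tuple


def sweep_qual(
    calls: List[Tuple[float, bool]],
    n_truth: int,
    thresholds: List[float],
) -> List[Tuple[float, float, float]]:
    """Return (threshold, TP%, FP%) for each QUAL threshold.

    calls   -- (QUAL, matches_truth_set) per called variant
    n_truth -- number of variants in the truth set
    """
    points = []
    for t in thresholds:
        kept = [is_true for qual, is_true in calls if qual >= t]
        tp = sum(kept)
        fp = len(kept) - tp
        tp_pct = 100.0 * tp / n_truth
        # Assumed definition: false positives as a share of retained calls.
        fp_pct = 100.0 * fp / len(kept) if kept else 0.0
        points.append((t, tp_pct, fp_pct))
    return points


if __name__ == "__main__":
    # Hypothetical calls, not real data.
    toy_calls = [(50, True), (40, True), (30, False), (20, True), (10, False)]
    for point in sweep_qual(toy_calls, n_truth=4, thresholds=[0, 25, 45]):
        print(point)
```

Raising the threshold moves along the line: fewer false positives, but also fewer true positives.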
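MPEG-G: ERR174324 chr 11
[Figure: bar chart of BAM, MPEG-G and MPEG-G, split into Sequences, QV and AUX. Size labels in order: 1586MB, 666MB, 109MB. Compression factor labels: CF ~ 15, CF ~ 1.3, CF ~ 11.]
Source: https://mpeg.chiariglione.org/sites/default/files/events/Mattavelli.pdf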
ERR174324 chr 11
[Figure: the MPEG-G bars (BAM, MPEG-G, MPEG-G; 1586MB, 666MB, 109MB) beside BAM, CRAM 3, CRAM 4 and Crumble bars, split into Sequences, QV, AUX and Names. Further labels: 1462MB (← plot scaled up) and 58MB.]
ERR174324_[12].fastq; bwa mem; chr11 only
ERR174324; unknown alignments.
Patents pending (Check license)
Conclusion
● CRAM 4 gives a 10-20% size reduction over CRAM 3, but
sometimes at extra CPU cost. (On the Pareto frontier.)
– https://github.com/jkbonfield/io_lib (Scramble v1.14.10)
● Minimal amount of new invention; easy to adopt?
● Crumble is independent of format; controlled loss of
quality based on variant call confidence.
– Independent verification by DNAnexus:
https://blog.dnanexus.com/2018-07-23-breaking-down-crumble/
– Avoid usage on subclonal samples with somatic mutations, or
mixed sample datasets.
● More info: https://datageekdom.blogspot.com
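For readers unfamiliar with the term, a format is on the Pareto frontier of a size-against-time plot when no other format is at least as good on both axes and strictly better on one. The sketch below shows that test on made-up points. The names and numbers are hypothetical and are not measurements from these slides.

```python
from typing import Dict, List, Tuple


def pareto_frontier(points: Dict[str, Tuple[float, float]]) -> List[str]:
    """Return the names of non-dominated (size, time) points.

    Smaller is better on both axes. A point is dominated if another
    point is no worse on both axes and strictly better on at least one.
    """
    frontier = []
    for name, (size, time) in points.items():
        dominated = any(
            s <= size and t <= time and (s < size or t < time)
            for other, (s, t) in points.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return sorted(frontier)


if __name__ == "__main__":
    # Hypothetical (size in MB, time in s) values for illustration only.
    toy = {
        "fmt_a": (100.0, 10.0),
        "fmt_b": (80.0, 12.0),
        "fmt_c": (90.0, 15.0),  # larger and slower than fmt_b
        "fmt_d": (120.0, 9.0),
    }
    print(pareto_frontier(toy))
```

Here fmt_c drops out because fmt_b beats it on both axes, while the other three each win on either size or time.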
Acknowledgements
● Sanger colleagues
– Rob Davies
– David Jackson
● EBI CRAM authors
– Vadim Zalunin
– Markus Fritz (doi: 10.1101/gr.114819.110)
● Compression tools
– DeeZ (Faraz Hach)
doi: 10.1038/nmeth.3133
– Quip (Daniel Jones)
doi: 10.1093/nar/gks754
– Samtools (Heng Li)
doi: 10.1093/bioinformatics/btp352
– Scramble
doi: 10.1093/bioinformatics/btu390
– Crumble
doi: 10.1093/bioinformatics/bty608
– QVZ
doi: 10.1093/bioinformatics/btv330
– CALQ
doi: 10.1093/bioinformatics/btx737
– ANS (Jarek Duda)
http://arxiv.org/abs/1311.2540
– Zlib (Jean-loup Gailly, Mark Adler)
– Bzip2 (Julian Seward; Burrows & Wheeler)
– Lzma (Igor Pavlov)
– Bsc (Ilya Grebnov)
QVZ2 -t4 calling accuracy
[Figure: True Positive % against False Positive %, with SNP and Indel panels at 50x and at 15x. Series: Lossless GATK, Lossless GATK filt, QVZ2 -t4 + GATK, QVZ2 -t4 + GATK filt.]
CALQ calling accuracy
[Figure: True Positive % against False Positive %, with SNP and Indel panels at 50x and at 15x. Series: Lossless GATK, Lossless GATK filt, Calq+GATK, Calq+GATK filt.]
