PacMin: rethinking genome 
analysis with long reads 
Frank Austin Nothaft, AMPLab 
Joint work with Adam Bloniarz 
10/14/2014
Note: 
• This talk is mostly speculative. 
• I.e., the methods we’ll talk about are 
partially* implemented. 
• This means you have an opportunity to steer the 
direction of this work! 
* I’m being generous to myself.
Sequencing 101 
• Most sequence data today comes from Illumina 
machines, which perform sequencing-by-synthesis 
• We get short (100-250 bp) reads, with high accuracy 
• Reads are (usually) paired 
http://en.wikipedia.org/wiki/File:Sequencing_by_synthesis_Reversible_terminators.png
Current Pipelines are 
Reference Based 
• Map subsequences to a “reference genome” 
• Compute variants (diffs) against the reference 
From “GATK Best Practices”, https://www.broadinstitute.org/gatk/guide/best-practices
An aside: What is the 
reference genome? 
• Pool together n individuals, and assemble their genomes 
together 
• A few problems: 
• How does the reference genome handle polymorphisms? 
• What about structural rearrangements? 
• Subpopulation specific alternate haplotypes? 
• It has gaps. 14 years after the first human reference 
genome was released, it is still incomplete.* 
* This problem is Hard.
The Sequencing Abstraction 
It was the best of times, it was the worst of times…
[Figure: overlapping reads sampled from the sentence above: "It was the", "the best of", "times, it was", "worst of times", "the worst of", "best of times", "was the worst"]
• Sample Poisson-distributed substrings from a larger string
• Reads are more or less unique and correct
Metaphor borrowed from Michael Schatz
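The sampling abstraction above can be sketched in a few lines. This is a minimal illustration, not anything from PacMin: the function name `sample_reads` and its parameters are hypothetical, reads are fixed-length, and start positions are drawn uniformly at random, which gives roughly Poisson coverage at each position.

```python
import random

def sample_reads(text, coverage, read_len, seed=0):
    """Sample fixed-length substrings ("reads") at uniformly random starts.

    Uniform start positions make the number of reads covering any one
    position approximately Poisson distributed.
    """
    rng = random.Random(seed)
    # Number of reads needed to reach the requested average coverage.
    n_reads = coverage * len(text) // read_len
    last_start = len(text) - read_len
    starts = (rng.randint(0, last_start) for _ in range(n_reads))
    return [text[s:s + read_len] for s in starts]

genome = "It was the best of times, it was the worst of times"
reads = sample_reads(genome, coverage=5, read_len=12)
for read in reads[:5]:
    print(repr(read))
```

Note that a read such as "of times" alone could come from either half of the sentence, which is the non-uniqueness problem the following slides discuss.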
…is a leaky abstraction 
• We frequently encounter “gaps” in the sequence 
Ross et al, Genome Biology 2013
…is a leakier abstraction 
• We preferentially sequence from “biased” regions: 
Ross et al, Genome Biology 2013
A very leaky abstraction! 
• Reads aren’t actually correct 
• >2% error (expect 0.1% variation) 
• Error probability estimates are cruddy 
• Reads aren’t actually unique 
• >7% of the genome is not unique (K. Curtis, SiRen)
The State of Analysis 
• We’re really good at calling SNPs! 
• But, we’re still pretty bad at calling INDELs, and SVs 
• And we’re also bad at expressing diffs 
• Hence, SMaSH! But really, reference + diff format need to be burnt to the 
ground and redesigned. 
• And, it's slow. 2 weeks to sequence, 1 week to 
analyze. Not fast enough for practical clinical use.
Opportunities 
• New read technologies are available 
• Provide much longer reads (>10 kbp vs. 250 bp) 
• Different error model… (15% INDEL errors, vs. 2% 
SNP errors) 
• Generally, lower sequence specific bias 
Left: PacBio homepage, Right: Wired, http://www.wired.com/2012/03/oxford-nanopore-sequencing-usb/
If long reads are available… 
• We can use conventional methods: 
Carneiro et al, Genome Biology 2012
But! 
• Why not make raw assemblies out of the reads? 
[Figure: two steps. (1) Find overlapping reads: for all pairs of reads (i, j), test whether i and j overlap. (2) Find consensus sequence: …ACACTGCGACTCATCGACTC…]
• Problems: 
1. Overlapping is O(n^2), and a single evaluation is expensive anyway 
2. Typical algorithms find a single consensus sequence; what if we’ve got 
polymorphisms?
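To put a number on the O(n^2) problem, here is a small back-of-the-envelope calculation. The inputs are assumptions chosen for illustration, not figures from the talk: a 3 Gbp genome, 30x coverage, and 10 kbp reads.

```python
def read_pairs(genome_len, coverage, read_len):
    """Return (number of reads, number of unordered read pairs)."""
    n_reads = genome_len * coverage // read_len
    return n_reads, n_reads * (n_reads - 1) // 2

# Assumed inputs: 3 Gbp genome, 30x coverage, 10 kbp reads.
n, pairs = read_pairs(3_000_000_000, 30, 10_000)
print(n)      # 9000000 reads
print(pairs)  # 40499995500000 candidate pairs, about 4 x 10^13
```

Even at a microsecond per comparison, that many pairwise evaluations would take more than a year on a single core, which is why an approximate filter is attractive.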
Fast Overlapping with 
MinHashing 
• Wonderful realization by Berlin et al [1]: overlapping is 
similar to the document similarity problem 
• Use MinHashing to approximate similarity: 
Per document/read, compute signature: 
1. Cut into shingles 
2. Apply random hashes to shingles 
3. For each random hash, take the min over all shingles 
Hash into buckets: 
Signatures of length l can be hashed into b buckets, so we expect 
to compare all elements with similarity ≥ (1/b)^(b/l) 
Compare: 
For two documents with signatures of length l, Jaccard similarity is 
estimated by (# equal hashes) / l 
• Easy to implement in Spark: map, groupBy, map, filter
[1] Berlin et al, bioRxiv 2014
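The three steps (signature, bucketing, comparison) can be sketched in plain Python. This is an illustrative single-machine version, not the PacMin or Spark implementation: all function names are hypothetical, the "random hashes" are simulated by salting BLAKE2b with a seed, and bucketing assumes the usual banding scheme in which a signature of length l is split into b bands.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

def shingles(read, k):
    """Set of all length-k substrings of a read (assumes len(read) >= k)."""
    return {read[i:i + k] for i in range(len(read) - k + 1)}

def seeded_hash(shingle, seed):
    """One member of a hash family, selected by `seed`."""
    data = f"{seed}:{shingle}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def signature(read, k, length):
    """MinHash signature: for each hash, the minimum over all shingles."""
    sh = shingles(read, k)
    return [min(seeded_hash(s, seed) for s in sh) for seed in range(length)]

def candidate_pairs(signatures, bands):
    """Reads whose signatures agree on at least one whole band."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for read_id, sig in signatures.items():           # the "groupBy" step
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(read_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

def estimate_jaccard(sig_a, sig_b):
    """(# equal hashes) / l."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

reads = {
    "r1": "ACACTGCGACTCATCGACTC",
    "r2": "ACACTGCGACTCATCGACTC",   # identical to r1
    "r3": "TTTTGGGGCCCCAAAATTTT",   # shares no 4-mers with r1
}
sigs = {rid: signature(seq, k=4, length=32) for rid, seq in reads.items()}
pairs = candidate_pairs(sigs, bands=8)
print(sorted(pairs))
print(estimate_jaccard(sigs["r1"], sigs["r2"]))
print(estimate_jaccard(sigs["r1"], sigs["r3"]))
```

In Spark terms, `signature` would be the first map, the bucket construction the groupBy, `estimate_jaccard` over each candidate pair the second map, and a similarity threshold the final filter.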
Overlaps to Assemblies 
• Finding pairwise overlaps gives us a directed 
graph between reads (lots of edges!)
Transitive Reduction 
• We can find a consensus between clique members 
• Or, we can reduce down: 
• Via two iterations of Pregel!
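For reference, transitive reduction itself can be sketched serially. This is not the Pregel formulation mentioned above, which the slides do not spell out; it is a simple single-machine version that assumes the overlap graph is acyclic, with a hypothetical adjacency-dict representation.

```python
def transitive_reduction(edges):
    """Remove every edge u -> v for which a longer path u -> ... -> v exists.

    `edges` maps each node to the set of its direct successors and is
    assumed to describe a directed acyclic graph.
    """
    def reachable(start):
        # All nodes reachable from `start` by one or more edges.
        seen, stack = set(), list(edges.get(start, ()))
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(edges.get(node, ()))
        return seen

    reduced = {}
    for u, succs in edges.items():
        redundant = set()
        for w in succs:
            # Any direct successor of u that w can also reach is implied.
            redundant |= reachable(w) & succs
        reduced[u] = succs - redundant
    return reduced

# Reads a, b, c overlap in order; the a -> c edge is implied by a -> b -> c.
overlaps = {"a": {"b", "c"}, "b": {"c"}, "c": set()}
reduced = transitive_reduction(overlaps)
print(reduced)
```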
Actually Making Calls 
• From here, we need to call copy number per edge 
• Probably via Newton-Raphson based on coverage; we’re not sure yet. 
• Then, per position in each edge, call alleles: 
Notes: 
Equation is from Li, Bioinformatics 2011 
g = genotype state 
m = ploidy 
ε = probability allele was erroneously observed 
k = number of reads observed 
l = number of reads observed matching “reference” allele 
TBD: equation assumes biallelic observations at site and reference allele; we won’t have either of those conveniences…
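The equation image did not survive in this text. As an assumption, the biallelic genotype likelihood from Li 2011 that matches the symbols listed above is L(g) = (1/m^k) · ∏_{j=1..l} [(m−g)·ε_j + g·(1−ε_j)] · ∏_{j=l+1..k} [(m−g)·(1−ε_j) + g·ε_j], where g is taken to be the number of reference alleles in the genotype. A minimal sketch of evaluating it, with hypothetical function and argument names:

```python
def genotype_likelihood(g, m, ref_errors, alt_errors):
    """Likelihood of a genotype with g reference alleles out of ploidy m.

    ref_errors: per-read error probabilities for the l reads matching the
                reference allele.
    alt_errors: per-read error probabilities for the remaining k - l reads.
    """
    k = len(ref_errors) + len(alt_errors)
    likelihood = 1.0 / m ** k
    for eps in ref_errors:
        likelihood *= (m - g) * eps + g * (1.0 - eps)
    for eps in alt_errors:
        likelihood *= (m - g) * (1.0 - eps) + g * eps
    return likelihood

# Diploid site, 5 reads matching the reference and 5 not, 1% error each.
ref_errors = [0.01] * 5
alt_errors = [0.01] * 5
likelihoods = [genotype_likelihood(g, 2, ref_errors, alt_errors) for g in range(3)]
best_g = max(range(3), key=lambda g: likelihoods[g])
print(best_g)  # the heterozygous state, g = 1, is most likely here
```

For real coverage depths the product underflows quickly, so a practical version would work in log space.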
Output 
• Current assemblers emit FASTA contigs 
• In layperson’s speak: long strings 
• We’ll emit “multigs”, which we’ll map back to reference 
graph 
• Multig = multi-allelic (polymorphic) contig 
• Working with UCSC, who’ve done some really neat work [1] 
deriving formalisms & building software for mapping 
between sequence graphs, and GA4GH ref. variation team 
[1] Paten et al, “Mapping to a Reference Genome Structure”, arXiv 2014.

PacMin @ AMPLab All-Hands
