
Strata-Hadoop 2015 Presentation

The document discusses advancements in genomics using Spark and ADAM, highlighting the inefficiencies of current bioinformatics pipelines and file formats. It emphasizes the potential improvements of using Parquet for data models and introduces core genomic primitives aided by Spark. The presentation concludes with a critique of common myths surrounding bioinformatics software development.

Next-Generation
Genomics
Using Spark and ADAM
Timothy Danford
Tamr Inc.
AMPLab

One chromosome
per person
defines a
reference
chromosome
and
location

Lambert et al. “Meta-analysis of 74,046 individuals identifies 11 new susceptibility loci for Alzheimer's disease” (2013)

Down the
Long Slide,
To
Happiness
Endlessly

We often treat
‘bioinformatics’ as a
black box
Vials into Files

A Tale of Three File Formats
BAM Files: Do You Read
Me?
Compressed text files & custom index formats
User-defined attributes
Multi-record structure

“Not wishing to be outdone
by Amazon, Sanger
Institute develops drone
deliver system for BAM
files.”

Bioinformaticians
❤️
Probabilistic
Models
Our Data Scattered Back and
Forth
Across Space by this Gadget

Why Are We Still Defining
File Formats By Hand?
• Instead of defining custom file
formats for each data type and
access pattern…
• Parquet creates a
compressed format for each
Avro-defined data model.
• Improvement over existing
formats [1]
• 20-22% for BAM
• ~95% for VCF
[1] Compression % quoted from 1K Genomes
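
The idea on this slide, that a schema-defined record type can be laid out column by column instead of in a hand-designed file format, can be sketched in a few lines. This is only an illustration in plain Python: the `READ_SCHEMA` fields are hypothetical (they are not ADAM's actual schema), and `to_columns` merely stands in for what a columnar format such as Parquet does at far larger scale.

```python
# Hypothetical, Avro-style record schema (illustrative field names only).
READ_SCHEMA = {
    "type": "record",
    "name": "Read",
    "fields": [
        {"name": "contig", "type": "string"},
        {"name": "start", "type": "long"},
        {"name": "sequence", "type": "string"},
    ],
}


def to_columns(schema, records):
    """Pivot row-oriented records into one list per schema field."""
    names = [field["name"] for field in schema["fields"]]
    return {name: [record[name] for record in records] for name in names}


def to_rows(schema, columns):
    """Inverse of to_columns: rebuild the row-oriented records."""
    names = [field["name"] for field in schema["fields"]]
    return [dict(zip(names, values)) for values in zip(*(columns[n] for n in names))]


reads = [
    {"contig": "chr1", "start": 100, "sequence": "ACGT"},
    {"contig": "chr1", "start": 104, "sequence": "TTGA"},
]
columns = to_columns(READ_SCHEMA, reads)
```

Keeping similar values next to each other (all contigs, then all start positions) is what tends to make columnar files compress well, and it lets a query read only the fields it needs.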

Spark + Genomics =
ADAM
• Hosted at Berkeley and the
AMPLab
• Apache 2 License
• Contributors from both
research and commercial
organizations
• Core spatial primitives,
variant calling
• Avro and Parquet for data
models and file formats

Core Genomics Primitives:
The Needs of the Many

The Terrible Trouble
with Existing Pipelines
Cibulskis et al. “Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples” (2013)

“I think you know what the
problem is, just as well as I
do.”
A single piece of a
filtering stage for a
somatic variant caller
“11-base-pair window
centered on a candidate
mutation” actually turns
out to be optimized for
a particular file format
and sort order
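
A window query like the one quoted above can be written as a plain interval-overlap predicate, with no assumption about how the reads are sorted or stored. The sketch below is illustrative only: `Read`, `window_around`, and `reads_in_window` are hypothetical names, coordinates are assumed to be 0-based with half-open intervals, and this is not the code from the variant caller cited on the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Read:
    contig: str
    start: int  # 0-based, inclusive
    end: int    # exclusive


def window_around(position, width=11):
    """Half-open [start, end) window of `width` bases centered on `position`."""
    half = width // 2
    return position - half, position + half + 1


def reads_in_window(reads, contig, position, width=11):
    """Return reads overlapping the window, regardless of input order."""
    w_start, w_end = window_around(position, width)
    return [
        r for r in reads
        if r.contig == contig and r.start < w_end and r.end > w_start
    ]


reads = [
    Read("chr1", 200, 250),  # far from the candidate
    Read("chr1", 90, 96),    # ends inside the window
    Read("chr2", 95, 105),   # wrong contig
    Read("chr1", 105, 140),  # starts at the window's last base
]
hits = reads_in_window(reads, "chr1", 100)
```

Written this way, the filter scans every read; a sorted, indexed file makes the same query fast, which is presumably how such logic ends up tied to one format and sort order.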

“Myths of Bioinformatics
Software”
1. Somebody will build on your code
2. You should have assembled a team to build your software
3. If you choose the right license, more people will use and build on your software.
4. Making software free for commercial use shows you are not against companies.
5. You should maintain your software indefinitely
6. Your “stable URL” can exist forever
7. You should make your software “idiot proof”
8. You used the right programming language for the task.
Lior Pachter
https://liorpachter.wordpress.com/2015/07/10/the-myths-of-bioinformatics-software/
We Can Make Our Own Myths

Strata-Hadoop 2015 Presentation

1.
Next-Generation Genomics Using Spark and ADAM Timothy Danford Tamr Inc. AMPLab
2.
Next Generation? We come in peace.
3.
What even is genomics?
4.
Organism Cell Genome
5.
One chromosome
6.
One chromosome per person
7.
One chromosome per person defines a reference chromosome
8.
One chromosome per person defines a reference chromosome and location
9.
“… decoding the Book of Life”
10.
Ortelius, 1570
11.
Google, 2005
15.
Lambert et al. “Meta-analysis of 74,046 individuals identifies 11 new susceptibility loci for Alzheimer's disease” (2013)
16.
Down the Long Slide, To Happiness Endlessly
17.
We often treat ‘bioinformatics’ as a black box Vials into Files
18.
What’s In The Box?
20.
My God, It’s Full of Pipelines
21.
My God, It’s Full of Pipelines
22.
A Tale of Three File Formats BAM Files: Do You Read Me? Compressed text files & custom index formats User-defined attributes Multi-record structure
23.
“Not wishing to be outdone by Amazon, Sanger Institute develops drone deliver system for BAM files.”
24.
Open the Pod Bay Doors, Pal
25.
I Had a Dream It Would End This Way
26.
What to do, what to do?
27.
Bioinformaticians ❤️ Probabilistic Models Our Data Scattered Back and Forth Across Space by this Gadget
28.
Why Are We Still Defining File Formats By Hand? • Instead of defining custom file formats for each data type and access pattern… • Parquet creates a compressed format for each Avro-defined data model. • Improvement over existing formats [1] • 20-22% for BAM • ~95% for VCF [1] Compression % quoted from 1K Genomes
29.
Spark + Genomics = ADAM • Hosted at Berkeley and the AMPLab • Apache 2 License • Contributors from both research and commercial organizations • Core spatial primitives, variant calling • Avro and Parquet for data models and file formats
30.
Core Genomics Primitives: The Needs of the Many
31.
The Terrible Trouble with Existing Pipelines Cibulskis et al. “Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples” (2013)
32.
“I think you know what the problem is, just as well as I do.” A single piece of a filtering stage for a somatic variant caller “11-base-pair window centered on a candidate mutation” actually turns out to be optimized for a particular file format and sort order
33.
“Myths of Bioinformatics Software” 1. Somebody will build on your code 2. You should have assembled a team to build your software 3. If you choose the right license, more people will use and build on your software. 4. Making software free for commercial use shows you are not against companies. 5. You should maintain your software indefinitely 6. Your “stable URL” can exist forever 7. You should make your software “idiot proof” 8. You used the right programming language for the task. Lior Pachter https://liorpachter.wordpress.com/2015/07/10/the-myths-of-bioinformatics-software/ We Can Make Our Own Myths
34.
Thanks to... And thank you! Questions?

Editor's Notes

#2 I’m nervous, so I’ll be speaking fast. Before we dive in, let me ask a couple of questions: biologists? Spark experts? This entire presentation is a lie. There are always at least three different constituencies in the room: biologists, programmers, and someone thinking about how to build a business around this. I am going to try to split the difference, but I won’t be able to satisfy everyone. In all the places where I have to skip over the truth, maybe there will be at least a breadcrumb back to the truth. This isn’t a technical talk. Let’s talk about the title…
#3 Next generation? I didn’t realize that there was a *first* generation! Bioinformatics is a field with a long history, thirty or more years as a separate discipline. At the same time, the fundamental technology is changing. So if I talk about ‘problems’ today, it’s OK. [animation] I come in peace! Bioinformatics software development has been *remarkably* effective, for decades. If there are problems to be solved, these are the result of new technologies, new conceptions of scale. So that’s “next generation,” but what about…
#4 Genomics? What even is genomics? Who here has heard the terms ‘chromosome’ and ‘gene’ before, and could explain the difference? So before we dive into the main part of the talk, I’m going to spend a few minutes discussing some of the basic biological concepts.
#5 Fundamentally, we’re interested in studying individuals (and populations of individuals). Each individual is *itself* a population: of cells. But each of those cells has, ideally, an identical genome. The genome is a collection of 23 molecules. These are called ‘polymers’; they’re built (like Legos) out of a small number of repeated interlocking parts: these are the A, T, G, and C you’ve probably heard about. The content of the genome is determined by the linear order in which these letters are arranged. (Linear is important!)
#6 Now, not only do all the cells in your body have identical genomes… [ANIMATE] But individual humans have genomes that are very similar to each other. So similar that I can define “the same” chromosome between individuals… and that means [ANIMATE] That we can define a ‘base’ or a ‘reference’ chromosome [ANIMATE] And a concept of ‘location’ across chromosomes. This is maybe the most important concept in genome informatics, the idea that DNA defines a common linear coordinate system. This means that we can talk about differences between individuals in terms of diffs to a common reference genome. But where does this reference genome come from?
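
The "diffs to a common reference" idea can be sketched as follows. Everything here is a toy: the `Variant` type, the `apply_variants` helper, and the ten-base reference string are made up for illustration, and only single-base substitutions are handled.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    contig: str
    position: int  # 0-based offset into the reference contig
    ref: str       # base expected in the reference
    alt: str       # base observed in the individual


def apply_variants(reference, contig, variants):
    """Rebuild one individual's sequence from the reference plus diffs."""
    bases = list(reference[contig])
    for v in variants:
        if v.contig != contig:
            continue
        if bases[v.position] != v.ref:
            raise ValueError(f"reference mismatch at {contig}:{v.position}")
        bases[v.position] = v.alt
    return "".join(bases)


reference = {"chr1": "ACGTACGTAC"}
individual = [Variant("chr1", 2, "G", "T"), Variant("chr1", 7, "T", "C")]
sequence = apply_variants(reference, "chr1", individual)
```

Because every variant names a contig and a position, two individuals can be compared by comparing their lists of diffs, without ever materializing either full sequence.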
#10 Here is Bill Clinton (with Craig Venter and Francis Collins), announcing in June of 2000 the “rough draft” of the Human Genome: this is the Human Genome Project.
#11 1570: Theatrum Orbis Terrarum, “Theater of the World.” The first modern atlas. A direct byproduct of the first 100 years of PRINTING, and a tool for describing and exploring the world around us. Its direct descendants are still with us today!
#12 Google maps! But what does the genomic version of this look like?
#15 Mapmakers today focus on *annotation* of the maps themselves. The core technologies are 2D planar and spherical geometry, geometric operations composed out of latitudes and longitudes.
#16 This is a Manhattan plot of Alzheimer’s-related genes and sequence markers. Now let’s shift gears, and talk about how this was performed: through sequencers. Sequencers are microscopes that read the genome.
#17 If there’s one graph you should remember, in order to understand the last (and the next) ten years of bioinformatics and genomics, it’s this one. The Human Genome Project was thousands of researchers, billions of dollars, spent over a decade, all to sequence on the order of half a dozen individuals. Today, we’re close to the “thousand dollar genome,” and already we’re seeing prototype sequencers with the form factor of a USB stick. So sequencers will drive everything before them, but sequencers are only ever half the story.
#18 Bioinformatics is a computational reversal of the sequencing process. [ANIMATE] But to most
#19 So… what’s in the box?
#20 It’s a pipeline! (Makes sense, since I’m also name-checking Spark, right?) It’s never *one* pipeline; we do this once for every person. Let me talk a little bit about the structure of one of these pipelines. Each step is typically written as a standalone program, passing files from stage to stage, often using something like Unix pipes. These are written as part of a globally distributed research program, by researchers and grad students around the world, who have to assume the lowest common denominator: command line and filesystem. But of course, it’s never one pipeline. [ANIMATE] It’s a pipeline per person. [ANIMATE] But since each pipeline runs (essentially) serially, scaling up is easy: scale out! [ANIMATE]
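
The structure described here, a fixed chain of stages run once per person with independent runs in parallel, can be sketched as below. The stage functions are placeholders invented for this example (real stages would be separate command-line programs exchanging files), and a thread pool stands in for a cluster.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


# Placeholder stages: each takes the previous stage's output and returns its own.
def align(sample):
    return f"{sample}.aligned"


def sort_reads(aligned):
    return f"{aligned}.sorted"


def call_variants(sorted_reads):
    return f"{sorted_reads}.variants"


STAGES = [align, sort_reads, call_variants]


def run_pipeline(sample):
    """Run every stage serially for a single sample."""
    return reduce(lambda data, stage: stage(data), STAGES, sample)


def run_all(samples, workers=4):
    """Scale out: one independent pipeline per sample, run concurrently."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_pipeline, samples))


results = run_all(["person_a", "person_b"])
```

No pipeline ever looks at another person's data, which is why this kind of parallelism is easy, and also why it offers no natural place to combine information across people partway through.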
#25 That was the data side, but let’s open up the computation as well. Take one of those boxes that I drew earlier. Here’s alignment, but it could be… [ANIMATE] any bioinformatics tool. I assert that there are *two* things going on inside any bioinformatics tool: [ANIMATE] there is the method, and there is the implementation of that method. I think this is an important distinction to make… But even that is a lie, because there is a third thing… [ANIMATE] “Platform.” That’s why I’ve included this code snippet up above. So what’s the problem? Faster sequencers mean we sequence more people, but we have tools that work and a natural path to parallelism! Why does there need to be a “next generation?” The answer, of course, is that when you have all that data, you want to *USE* all that data.
#26 When you want to *use* all the data, your entire system will start to show cracks. This is an example: variant calling. But [ANIMATE] God help you if you want to combine statistical information at an earlier phase of the process. But this is by no means a unique problem. And what is one solution? You might have guessed it from the title of my talk…
#28 There’s more parallelism that we can extract from our pipelines.
#30 Spark. The reason this works is that Spark naturally handles pipelines, and automatically performs shuffles when appropriate, but also…
