Next-Generation
Genomics
Using Spark and ADAM
Timothy Danford
Tamr Inc.
AMPLab
Next
Generation?
We come in
peace.
What even is
genomics?
Organism Cell Genome
One chromosome
per person
defines a
reference
chromosome
and
location
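The "reference chromosome and location" idea can be made concrete with a small sketch: an individual is stored as a list of diffs against a shared reference, each diff addressed by (chromosome, position). The `Variant` class and `apply_variants` function are hypothetical names invented for this illustration, and the sketch assumes 0-based positions and same-length substitutions only.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Variant:
    contig: str    # chromosome name, e.g. "chr1"
    position: int  # 0-based offset into the reference chromosome
    ref: str       # bases the reference has at that position
    alt: str       # bases observed in the individual

def apply_variants(reference, variants):
    """Rebuild an individual's sequence by applying diffs to the reference."""
    result = {name: list(seq) for name, seq in reference.items()}
    for v in variants:
        # Assumption: substitutions only, so coordinates never shift.
        if len(v.ref) != len(v.alt):
            raise ValueError("this sketch only handles same-length substitutions")
        end = v.position + len(v.ref)
        if reference[v.contig][v.position:end] != v.ref:
            raise ValueError(f"reference mismatch at {v.contig}:{v.position}")
        result[v.contig][v.position:end] = list(v.alt)
    return {name: "".join(bases) for name, bases in result.items()}

reference = {"chr1": "ACGTACGTAC"}
person = [Variant("chr1", 3, "T", "G"), Variant("chr1", 8, "A", "C")]
individual = apply_variants(reference, person)
print(individual["chr1"])
```

Because every individual is expressed against the same coordinates, two people can be compared by comparing their variant lists rather than their full sequences.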
“… decoding the Book of Life”
Ortelius, 1570
Google, 2005
Lambert et al. “Meta-analysis of 74,046 individuals identifies 11 new susceptibility loci for Alzheimer's disease” (2013)
Down the
Long Slide,
To
Happiness
Endlessly
We often treat
‘bioinformatics’ as a
black box
Vials into Files
What’s
In
The Box?
My God, It’s Full of
Pipelines
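A rough sketch of the "pipeline per person" shape: each stage is a standalone step, one person's pipeline runs its stages in order, and scaling out means running many pipelines side by side. The stage names and the dictionary standing in for a sample's files are assumptions made for illustration, not real tools.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

# Hypothetical stages: each takes a sample's state and returns a new state,
# the way a command-line tool reads one file and writes the next.
def align(sample):
    return {**sample, "stages": sample["stages"] + ["align"]}

def sort_reads(sample):
    return {**sample, "stages": sample["stages"] + ["sort"]}

def call_variants(sample):
    return {**sample, "stages": sample["stages"] + ["call_variants"]}

PIPELINE = [align, sort_reads, call_variants]

def run_pipeline(sample):
    """Run the stages serially for one person."""
    return reduce(lambda state, stage: stage(state), PIPELINE, sample)

# Scale out: one independent pipeline per person.
samples = [{"id": f"sample{i}", "stages": []} for i in range(3)]
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(run_pipeline, samples))

for r in results:
    print(r["id"], r["stages"])
```

Note what this structure makes hard: no stage ever sees more than one person's data.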
A Tale of Three File Formats
BAM Files: Do You Read
Me?
Compressed text files & custom index formats
User-defined attributes
Multi-record structure
“Not wishing to be outdone
by Amazon, Sanger
Institute develops drone
deliver system for BAM
files.”
Open the Pod Bay Doors,
Pal
I Had a Dream
It Would End This Way
What to do, what to do?
Bioinformaticians
❤️
Probabilistic
Models
Our Data Scattered Back and
Forth
Across Space by this Gadget
Why Are We Still Defining
File Formats By Hand?
• Instead of defining custom file
formats for each data type and
access pattern…
• Parquet creates a
compressed format for each
Avro-defined data model.
• Improvement over existing
formats¹
• 20-22% for BAM
• ~95% for VCF
¹ compression % quoted from 1K Genomes
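To illustrate the idea of deriving a storage layout from a data model instead of hand-writing a file format, here is a minimal standard-library sketch. It is not Parquet or Avro: the `SCHEMA` dictionary, the field names, and the JSON-plus-zlib encoding are all stand-ins chosen for illustration. It shows only the general shape, where a schema drives a column-by-column layout and each column is compressed and read independently.

```python
import json
import zlib

# Hypothetical data model, loosely in the spirit of a schema definition.
SCHEMA = {"name": "AlignedRead", "fields": ["contig", "start", "sequence"]}

def to_columns(records, schema):
    """Pivot row-oriented records into one list per schema field."""
    return {f: [r[f] for r in records] for f in schema["fields"]}

def encode(columns):
    """Compress each column separately (stand-in for a real columnar encoder)."""
    return {
        name: zlib.compress(json.dumps(values).encode("utf-8"))
        for name, values in columns.items()
    }

def read_column(encoded, name):
    """Read back a single column without touching the others."""
    return json.loads(zlib.decompress(encoded[name]).decode("utf-8"))

records = [
    {"contig": "chr1", "start": 100 + i, "sequence": "ACGT"} for i in range(1000)
]
encoded = encode(to_columns(records, SCHEMA))
starts = read_column(encoded, "start")
print(len(starts), {name: len(blob) for name, blob in encoded.items()})
```

The access pattern is the point: a query that needs only `start` never decodes `sequence`, and no format was specified by hand beyond the schema.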
Spark + Genomics =
ADAM
• Hosted at Berkeley and the
AMPLab
• Apache 2 License
• Contributors from both
research and commercial
organizations
• Core spatial primitives,
variant calling
• Avro and Parquet for data
models and file formats
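As an illustration of what a "core spatial primitive" might look like, the sketch below defines a genomic region and an overlap join between two sets of regions. The `Region` class and `region_join` function are hypothetical names for this example, not ADAM's API, and the join is deliberately naive (every pair is compared).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Region:
    contig: str
    start: int  # inclusive, 0-based
    end: int    # exclusive

    def overlaps(self, other):
        """True when both regions share at least one base on the same contig."""
        return (
            self.contig == other.contig
            and self.start < other.end
            and other.start < self.end
        )

def region_join(left, right):
    """Naive O(n*m) overlap join; a real system would partition by location."""
    return [(a, b) for a in left for b in right if a.overlaps(b)]

reads = [Region("chr1", 0, 50), Region("chr1", 40, 90), Region("chr2", 10, 60)]
genes = [Region("chr1", 45, 100), Region("chr2", 100, 200)]
pairs = region_join(reads, genes)
for read, gene in pairs:
    print(read, "overlaps", gene)
```

Many genomics operations (annotating reads with genes, finding reads near a variant) reduce to this one join over the shared coordinate system.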
Core Genomics Primitives:
The Needs of the Many
The Terrible Trouble
with Existing Pipelines
Cibulskis et al. “Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples” (2013)
“I think you know what the
problem is, just as well as I
do.”
A single piece of a
filtering stage for a
somatic variant caller
“11-base-pair window
centered on a candidate
mutation” actually turns
out to be optimized for
a particular file format
and sort order
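One way to state the same window as a query on coordinates, rather than as a position in a sorted file, is sketched below. This is not the published caller's implementation: the tuple layout `(contig, start, end)`, the half-open interval convention, and the function names are assumptions for illustration.

```python
def window(position, width=11):
    """Half-open interval of `width` bases centered on `position`."""
    half = width // 2
    return (position - half, position + half + 1)

def reads_in_window(reads, contig, position, width=11):
    """Select reads overlapping the window, whatever order the input is in."""
    start, end = window(position, width)
    return [r for r in reads if r[0] == contig and r[1] < end and start < r[2]]

# Reads as (contig, start, end) with half-open ends, deliberately unsorted.
reads = [
    ("chr1", 200, 250),
    ("chr1", 90, 140),
    ("chr1", 96, 99),
    ("chr2", 95, 145),
    ("chr1", 106, 150),
]
hits = reads_in_window(reads, "chr1", 100)
print(window(100), hits)
```

Written this way, the filter depends only on the coordinate system, so the same logic applies to any storage format or sort order.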
“Myths of Bioinformatics
Software”
1. Somebody will build on your code
2. You should have assembled a team to build your software
3. If you choose the right license, more people will use and build on your software.
4. Making software free for commercial use shows you are not against companies.
5. You should maintain your software indefinitely
6. Your “stable URL” can exist forever
7. You should make your software “idiot proof”
8. You used the right programming language for the task.
Lior Pachter
https://liorpachter.wordpress.com/2015/07/10/the-myths-of-bioinformatics-software/
We Can Make Our Own Myths
Thanks to...
And thank you! Questions?

Strata-Hadoop 2015 Presentation

Editor's Notes

  • #2 I’m nervous, so I’ll be speaking fast. Before we dive in, let me ask a couple of questions: biologists? Spark experts? This entire presentation is a lie. There are always at least three different constituencies in the room: biologists, programmers, and someone thinking about how to build a business around this. I am going to try to split the difference, but I won’t be able to satisfy everyone. In all the places where I have to skip over the truth, maybe there will be at least a breadcrumb back to it. This isn’t a technical talk. Let’s talk about the title –
  • #3 Next generation? I didn’t realize that there was a *first* generation! Bioinformatics is a field with a long history, thirty or more years as a separate discipline. At the same time, the fundamental technology is changing. So if I talk about ‘problems’ today, it’s OK. [animation] I come in peace! Bioinformatics software development has been *remarkably* effective, for decades. If there are problems to be solved, these are the result of new technologies, new conceptions of scale. So that’s “next generation,” but what about…
  • #4 Genomics? What even is genomics? Who here has heard the terms ‘chromosome’ and ‘gene’ before, and could explain the difference? So before we dive into the main part of the talk, I’m going to spend a few minutes discussing some of the basic biological concepts.
  • #5 Fundamentally, we’re interested in studying individuals (and populations of individuals). Each individual is *itself* a population: of cells. But each of those cells has, ideally, an identical genome. The genome is a collection of 23 molecules. These are called ‘polymers’; they’re built (like Legos) out of a small number of repeated interlocking parts – these are the A, T, G, and C you’ve probably heard about. The content of the genome is determined by the linear order in which these letters are arranged. (Linear is important!)
  • #6-#9 Now, not only do all the cells in your body have identical genomes… [ANIMATE] But individual humans have genomes that are very similar to each other. So similar that I can define “the same” chromosome between individuals… and that means [ANIMATE] That we can define a ‘base’ or a ‘reference’ chromosome [ANIMATE] And a concept of ‘location’ across chromosomes. This is maybe the most important concept in genome informatics, the idea that DNA defines a common linear coordinate system. This means that we can talk about differences between individuals in terms of diffs to a common reference genome. But where does this reference genome come from?
  • #10 Here is Bill Clinton (with Craig Venter and Francis Collins), announcing in June of 2000 the “rough draft” of the Human Genome – this is the Human Genome Project.
  • #11 1570: Theatrum Orbis Terrarum, “Theater of the World.” The first modern atlas. A direct byproduct of the first 100 years of PRINTING, and a tool for describing and exploring the world around us. Its direct descendants are still with us today!
  • #12 Google maps! But what does the genomic version of this look like?
  • #15 Mapmakers today focus on *annotation* of the maps themselves. The core technologies are 2D planar and spherical geometry, geometric operations composed out of latitudes and longitudes.
  • #16 This is a Manhattan plot of Alzheimer’s-related genes and sequence markers. Now let’s shift gears, and talk about how this was performed – through sequencers. Sequencers are microscopes that read the genome.
  • #17 If there’s one graph you should remember, in order to understand the last (and the next) ten years of bioinformatics and genomics, it’s this one. The Human Genome Project was thousands of researchers, billions of dollars, spent over a decade, all to sequence on the order of half a dozen individuals. Today, we’re close to the “thousand dollar genome” – and already we’re seeing prototype sequencers with the form factor of a USB stick. So sequencers will drive everything before them – but sequencers are only ever half the story.
  • #18 Bioinformatics is a computational reversal of the sequencing process. [ANIMATE] But to most
  • #19 So… what’s in the box?
  • #20-#22 It’s a pipeline! (Makes sense, since I’m also name-checking Spark, right?) It’s never *one* pipeline; we do this once for every person. Let me talk a little bit about the structure of one of these pipelines. Each step is typically written as a standalone program – passing files from stage to stage – often using something like Unix pipes. These are written as part of a globally distributed research program, by researchers and grad students around the world, who have to assume the lowest common denominator: command line and filesystem. But of course, it’s never one pipeline [ANIMATE] It’s a pipeline per person [ANIMATE] But since each pipeline runs (essentially) serially, scaling up is easy: scale out! [ANIMATE]
  • #25 That was the data side, but let’s open up the computation as well. Take one of those boxes that I drew earlier. Here’s alignment, but it could be… [ANIMATE] any bioinformatics tool. I assert that there are *two* things going on inside any bioinformatics tool – [ANIMATE] There is the method, and there is the implementation of that method. I think this is an important distinction to make… But even that is a lie, because there is a third thing… [ANIMATE] “Platform.” That’s why I’ve included this code snippet up above. So what’s the problem? Faster sequencers mean we sequence more people, but we have tools that work and a natural path to parallelism! Why does there need to be a “next generation”? The answer, of course, is that when you have all that data, you want to *USE* all that data.
  • #26 When you want to *use* all the data, your entire system will start to show cracks. This is an example: variant calling. But [ANIMATE] God help you if you want to combine statistical information at an earlier phase of the process. But this is by no means a unique problem. And what is one solution? You might have guessed it from the title of my talk…
  • #28 There’s more parallelism that we can extract from our pipelines.
  • #30 Spark. The reason this works is that Spark naturally handles pipelines, and automatically performs shuffles when appropriate, but also…
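
The notes above point at two things per-person pipelines make awkward: combining evidence across samples, and regrouping data by location (a shuffle). The sketch below imitates that regrouping in plain Python. It uses no Spark API; `shuffle_by_key`, the sample names, and the record layout are hypothetical and exist only to show the shape of the operation.

```python
from collections import defaultdict

def shuffle_by_key(records, key):
    """Regroup records so everything sharing a key ends up together."""
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record)
    return dict(groups)

# Per-sample observations as (contig, position, base), as a per-person
# pipeline would leave them: one output per sample.
per_sample = {
    "s1": [("chr1", 100, "A"), ("chr1", 200, "G")],
    "s2": [("chr1", 100, "T"), ("chr1", 200, "G")],
    "s3": [("chr1", 100, "A")],
}

# Flatten, then regroup by site so each site sees every sample at once.
flattened = [
    (contig, pos, sample, base)
    for sample, observations in per_sample.items()
    for contig, pos, base in observations
]
by_site = shuffle_by_key(flattened, key=lambda r: (r[0], r[1]))
alleles = {site: sorted({r[3] for r in rows}) for site, rows in by_site.items()}
print(alleles)
```

On a cluster the regrouping step is where data moves between machines, which is the part a framework can perform automatically.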