Hadoop for Genomics__HadoopSummit2010



Hadoop Summit 2010 - Research Track
Hadoop for Genomics
Jeremy Bruestle, Spiral Genetics

Published in: Technology


  1. 1. Hadoop for Genomics <ul><li>Jeremy Bruestle </li></ul>Spiral Genetics, Inc.
  2. 2. <ul><li>About Genomics </li></ul><ul><li>About Hadoop and Spiral </li></ul><ul><li>Specific technical detail </li></ul>Session Agenda
  3. 3. <ul><li>Study of genomes (total genetic material) of humans and other organisms </li></ul><ul><li>Requires reading an individual’s DNA sequence, fully or partially </li></ul><ul><li>We can compare genomes within species or across species lines </li></ul><ul><li>Many levels of analysis and uses </li></ul>Genomics Source: Nature Publishing Group
  4. 4. <ul><li>What is it? </li></ul><ul><ul><li>Reading the entire DNA sequence of individual humans </li></ul></ul><ul><ul><li>Analyzing the data </li></ul></ul><ul><ul><li>Interpreting the results for biological significance </li></ul></ul>Full Human Genome DNA Sequencing <ul><li>Why do it? </li></ul><ul><ul><li>Biomedical Research - Find disease-causing mutations for better screening and preventative care </li></ul></ul><ul><ul><li>Personalized Pharmacology - Inform drug development, determine if a particular drug is appropriate for an individual (Warfarin). </li></ul></ul><ul><ul><li>Cancer Treatment – Find differences between DNA of normal and cancer cells to predict prognosis and recommend best treatment. </li></ul></ul>Source: AAAS
  5. 5. <ul><li>All DNA is composed of 4 bases (A, C, T, and G) in long, linear sequences, with a complementary strand </li></ul><ul><li>Similar to a file (with 2 bits per base) </li></ul><ul><li>Either strand can be read (in different directions) </li></ul>DNA 101 Source: BBC Science & Technology
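The "file with 2 bits per base" analogy and the two readable strands can be sketched in a few lines of Python. This is only an illustration: the bit assignment (A=00, C=01, G=10, T=11) and the function names `pack`, `unpack`, and `reverse_complement` are assumptions made for this example, not something taken from the slides.

```python
# Illustrative 2-bit encoding; the specific bit assignment is an assumption.
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def pack(seq: str) -> bytes:
    """Pack a DNA string into bytes, four bases per byte (last byte zero-padded)."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | BASE_TO_BITS[base]
        byte <<= 2 * (4 - len(chunk))  # left-align a short final chunk
        out.append(byte)
    return bytes(out)


def unpack(data: bytes, length: int) -> str:
    """Recover the first `length` bases from packed bytes."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BITS_TO_BASE[(byte >> shift) & 0b11])
    return "".join(bases[:length])


def reverse_complement(seq: str) -> str:
    """Return the sequence as read from the complementary strand."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))
```

The original length has to be stored alongside the packed bytes, because the padding in the final byte is indistinguishable from a run of `A` bases.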
  6. 6. <ul><li>Human DNA is split up into 23 chromosomes </li></ul><ul><li>There are two copies of each </li></ul><ul><li>Chromosome pairs are mostly identical (exception: sex chromosomes) </li></ul>DNA 101, continued Computer Analogy: 46 files, chr1-a.dna, chr1-b.dna … chr22-a.dna, chr22-b.dna Diff of chr1-a.dna and chr1-b.dna shows only minor changes Source: NIH Genetics Home Reference
  7. 7. <ul><li>How BIG is human DNA? </li></ul><ul><ul><li>Total DNA - 6 billion bp (~1.5 GB) </li></ul></ul><ul><ul><li>Just one copy of each chromosome - 3 billion bp (~750 MB) </li></ul></ul><ul><ul><li>Largest chromosome - 247 million bp (~62 MB) </li></ul></ul><ul><ul><li>Smallest chromosome - 47 million bp (~12 MB) </li></ul></ul><ul><li>Your entire genome will fit on a flash drive! </li></ul>DNA 101, continued
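The size figures follow from the 2 bits per base mentioned on the earlier slide. A minimal sketch of the arithmetic, assuming decimal units (1 MB = 10^6 bytes) and a hypothetical helper name `packed_size_bytes`:

```python
def packed_size_bytes(num_bases: int, bits_per_base: int = 2) -> int:
    """Bytes needed to store `num_bases` bases, rounded up to a whole byte."""
    return (num_bases * bits_per_base + 7) // 8


if __name__ == "__main__":
    for label, bases in [("Total DNA", 6_000_000_000),
                         ("One copy of each chromosome", 3_000_000_000)]:
        size_mb = packed_size_bytes(bases) / 1e6
        print(f"{label}: {size_mb:,.0f} MB")
```

This counts only the raw sequence; quality scores and read metadata from a sequencer are not included in these figures.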
  8. 8. <ul><li>How do we sequence a genome? </li></ul><ul><ul><li>Break DNA into many small fragments (~100 bases) </li></ul></ul><ul><ul><li>Read sequences (with errors!) </li></ul></ul><ul><ul><li>Oversample (by 40x - 100x) to handle error </li></ul></ul><ul><ul><li>Assemble the original full sequence from the pieces </li></ul></ul>Whole Genome Sequencing <ul><li>Analogy: </li></ul><ul><ul><li>Make lots of somewhat inaccurate copies of the Complete Works of Shakespeare, shred them, and try to reassemble the entire text. </li></ul></ul>Source: Illumina, Inc.
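To get a feel for the oversampling numbers, the shredding step can be simulated. The sketch below is a toy model under stated assumptions: reads start at uniformly random positions, every read has the same length, and errors are independent single-base substitutions. The names `reads_needed` and `shred` are hypothetical and do not describe how a real sequencer behaves.

```python
import random


def reads_needed(genome_length: int, read_length: int, coverage: int) -> int:
    """Reads required so that total bases read is about coverage * genome_length."""
    return (genome_length * coverage + read_length - 1) // read_length


def shred(genome: str, read_length: int, coverage: int,
          error_rate: float, rng: random.Random) -> list:
    """Return (start, read) pairs sampled from `genome` with substitution errors."""
    reads = []
    for _ in range(reads_needed(len(genome), read_length, coverage)):
        start = rng.randrange(len(genome) - read_length + 1)
        read = list(genome[start:start + read_length])
        for i, base in enumerate(read):
            if rng.random() < error_rate:
                # Replace the base with one of the three other bases.
                read[i] = rng.choice([b for b in "ACGT" if b != base])
        reads.append((start, "".join(read)))
    return reads
```

With 100-base reads and 40x oversampling, one copy of the genome (3 billion bp) works out to `reads_needed(3_000_000_000, 100, 40)`, which is 1.2 billion reads. That volume is why the assembly step is a data-parallel problem.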
  9. 9. Dropping costs - a problem! <ul><li>The physical costs of sequencing are dropping </li></ul><ul><ul><li>2007 - $1M </li></ul></ul><ul><ul><li>2009 - $100K </li></ul></ul><ul><ul><li>2010 - $5K - $10K </li></ul></ul><ul><ul><li>2012 - $1K (estimated) </li></ul></ul><ul><li>Lower costs drive an increase in sequencing activity and data produced </li></ul><ul><li>Computation for sequence assembly and analysis is becoming a bottleneck </li></ul>
  10. 10. <ul><li>Bioinformatics is a relatively new field </li></ul><ul><li>Big datasets are a new challenge - focus has not yet shifted to parallel methods </li></ul><ul><li>Most existing software is command line only; most users are biologists who find it difficult to use </li></ul><ul><li>There are usually separate programs for each analysis step - file formats are not standardized </li></ul>The problem, continued
  11. 11. <ul><li>The Spiral Platform </li></ul><ul><ul><li>A common, coherent, web based interface </li></ul></ul><ul><ul><li>Strong, extensible backend built on open source (Hadoop) </li></ul></ul><ul><ul><li>Shared data sets (Human reference sequence) </li></ul></ul><ul><ul><li>Commercial support available </li></ul></ul><ul><ul><li>Standard data formats and import/export capability with existing formats </li></ul></ul><ul><ul><li>Modular design, since new methods are still being invented </li></ul></ul>The solution, a genomics platform
  12. 12. <ul><li>Support for parallelization </li></ul><ul><li>Support for large datasets </li></ul><ul><li>Good composability </li></ul><ul><li>Existing developer community </li></ul><ul><li>Most genomics problems map well to map/reduce </li></ul>Why Hadoop
  13. 13. <ul><li>The Pipeline </li></ul><ul><ul><li>Assembly - Putting the parts together </li></ul></ul><ul><ul><ul><li>Alignment to reference </li></ul></ul></ul><ul><ul><ul><li>De novo assembly </li></ul></ul></ul><ul><ul><li>Annotation </li></ul></ul><ul><ul><ul><li>Comparison to existing DNA (human reference) </li></ul></ul></ul><ul><ul><ul><li>Comparison to reference data </li></ul></ul></ul><ul><ul><ul><li>Looking for certain structure (protein coding sequences) </li></ul></ul></ul>Algorithms - Mapping to MapReduce
  14. 14. <ul><li>Primarily limited to alignment to reference </li></ul><ul><li>Crossbow </li></ul><ul><ul><li>Connects existing software via Perl scripts and streaming </li></ul></ul><ul><li>CloudBurst </li></ul><ul><ul><li>True native Hadoop implementation </li></ul></ul><ul><ul><li>Performs basic assembly; additional features are still a TODO </li></ul></ul><ul><li>To reuse or not to reuse </li></ul><ul><ul><li>Always reuse where possible </li></ul></ul><ul><ul><li>Customer comes first </li></ul></ul>Existing use of Hadoop in Genomics
  15. 15. <ul><li>Uses an existing reference sequence </li></ul><ul><li>Performs a ‘near’ match lookup </li></ul><ul><ul><li>Changes in individual DNA / Read errors are usually small (1 bp at a time) </li></ul></ul><ul><li>Once aligned, use redundancy of reads to construct ‘consensus sequence’ </li></ul><ul><li>Two methods, seed based and Burrows–Wheeler transform </li></ul><ul><li>Complicated by the existence of two chromosomes / read direction, etc. </li></ul>Alignment to reference based assembly
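Of the two methods named above, the Burrows-Wheeler transform is the less self-explanatory. The sketch below shows only the transform and its inverse, using the naive sorted-rotations construction. It is for illustration: aligners built on this idea use an index over the transformed text instead of this quadratic approach, and the `$` end marker and the function names are assumptions made for this example.

```python
def bwt(text: str) -> str:
    """Burrows-Wheeler transform via sorted rotations (naive, small inputs only)."""
    text += "$"  # end marker; sorts before A, C, G and T
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column: str) -> str:
    """Rebuild the original text by repeatedly prepending and sorting."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted(last_column[i] + table[i] for i in range(len(last_column)))
    # The row ending in the marker is the original text plus "$".
    return next(row for row in table if row.endswith("$"))[:-1]
```

The transform is reversible and tends to group identical characters together, which is what makes compact, searchable indexes over a whole reference sequence practical.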
  16. 16. <ul><li>Based on the assumption that every read exactly matches the reference for PART of the read (e.g. 15 bp) </li></ul><ul><li>Input = Unaligned reads + reference </li></ul><ul><li>Map </li></ul><ul><ul><li>Reference - output all 15 bp subsequences, as well as surrounding data and location within reference </li></ul></ul><ul><ul><li>Reads - output 15 bp subsequence and read data </li></ul></ul><ul><li>Reduce </li></ul><ul><ul><li>Find collisions, extend match and judge quality and output if good </li></ul></ul><ul><li>Output = Aligned reads </li></ul>Step 1, Alignment (seed based)
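The map and reduce steps above can be simulated in a single Python process. This is a minimal sketch under stated assumptions, not the Spiral implementation: the shuffle is a plain dictionary instead of Hadoop, only the forward strand is considered, reads are assumed to be at least `k` bases long, reads emit non-overlapping seeds, and match quality is just a mismatch count against a hypothetical `max_mismatches` threshold. All function names are made up for this example.

```python
from collections import defaultdict


def map_reference(reference, k, flank):
    """Emit every k-long seed with its position and surrounding context."""
    for pos in range(len(reference) - k + 1):
        ctx_start = max(0, pos - flank)
        context = reference[ctx_start:pos + k + flank]
        yield reference[pos:pos + k], ("ref", pos, ctx_start, context)


def map_read(read_id, read, k):
    """Emit non-overlapping k-long seeds together with the full read."""
    for offset in range(0, len(read) - k + 1, k):
        yield read[offset:offset + k], ("read", read_id, offset, read)


def reduce_seed(seed, values, max_mismatches):
    """For each reference/read collision on a seed, extend and score the match."""
    refs = [v for v in values if v[0] == "ref"]
    reads = [v for v in values if v[0] == "read"]
    for _, pos, ctx_start, context in refs:
        for _, read_id, offset, read in reads:
            start = pos - offset          # implied alignment start in the reference
            lo = start - ctx_start        # same position inside the context string
            if lo < 0 or lo + len(read) > len(context):
                continue                  # read would hang off the reference
            window = context[lo:lo + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                yield read_id, start, mismatches


def align(reference, reads, k=15, max_mismatches=2):
    """Run map, an in-memory shuffle, and reduce; return sorted unique hits."""
    flank = max(0, max(len(r) for r in reads.values()) - k)
    groups = defaultdict(list)            # stands in for the Hadoop shuffle
    for key, value in map_reference(reference, k, flank):
        groups[key].append(value)
    for read_id, read in reads.items():
        for key, value in map_read(read_id, read, k):
            groups[key].append(value)
    hits = set()
    for seed, values in groups.items():
        hits.update(reduce_seed(seed, values, max_mismatches))
    return sorted(hits)
```

Emitting the surrounding context with each reference seed lets the reducer extend a match without access to the whole reference, at the cost of duplicating reference data across keys. A read with one error still aligns as long as at least one of its seeds is error-free, which is the assumption stated on the slide.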
  17. 17. <ul><li>Input = Aligned read data </li></ul><ul><li>Map - Group aligned reads by location (reads overlapping two regions map to both) </li></ul><ul><li>Reduce - For a given location, determine most likely value based on multiple reads, taking into account accuracy of read and alignment </li></ul><ul><li>Output = Sequence data for the region </li></ul>Step 2, Consensus Calling
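Consensus calling can be sketched the same way. The example below assumes fixed-size regions, a single numeric `weight` per read standing in for "accuracy of read and alignment", and a weighted majority vote per position. These are simplifications chosen for illustration, and the function names are hypothetical. Positions with no coverage are reported as `N`.

```python
from collections import Counter, defaultdict


def map_aligned_read(start, read, weight, region_size):
    """Emit the read once for every region it overlaps."""
    first = start // region_size
    last = (start + len(read) - 1) // region_size
    for region in range(first, last + 1):
        yield region, (start, read, weight)


def reduce_region(region, values, region_size):
    """Pick the highest-weighted base at each position of the region."""
    region_start = region * region_size
    votes = [Counter() for _ in range(region_size)]
    for start, read, weight in values:
        for i, base in enumerate(read):
            pos = start + i - region_start
            if 0 <= pos < region_size:    # ignore bases belonging to other regions
                votes[pos][base] += weight
    return "".join(v.most_common(1)[0][0] if v else "N" for v in votes)


def call_consensus(aligned_reads, region_size=8):
    """Run map, an in-memory shuffle, and reduce over (start, read, weight) tuples."""
    groups = defaultdict(list)            # stands in for the Hadoop shuffle
    for start, read, weight in aligned_reads:
        for region, value in map_aligned_read(start, read, weight, region_size):
            groups[region].append(value)
    return {region: reduce_region(region, values, region_size)
            for region, values in sorted(groups.items())}
```

A read that straddles a region boundary is sent to both reducers, and each reducer uses only the bases that fall inside its own region, so no position is counted twice.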
  18. 18. Questions? <ul><li>[email_address] </li></ul>