Hadoop for Genomics__HadoopSummit2010

Hadoop Summit 2010 - Research Track
Hadoop for Genomics
Jeremy Bruestle, Spiral Genetics

Published in: Technology

    1. Hadoop for Genomics
       Jeremy Bruestle
       Spiral Genetics, Inc.

    2. Session Agenda
         • About Genomics
         • About Hadoop and Spiral
         • Specific technical detail

    3. Genomics
         • Study of genomes (total genetic material) of humans and other organisms
         • Requires reading an individual’s DNA sequence, fully or partially
         • We can compare genomes within species or across species lines
         • Many levels of analysis and uses
       Source: Nature Publishing Group

    4. Full Human Genome DNA Sequencing
         • What is it?
             • Reading the entire DNA sequence of individual humans
             • Analyzing the data
             • Interpreting the results for biological significance
         • Why do it?
             • Biomedical Research - Find disease-causing mutations for better screening and preventative care
             • Personalized Pharmacology - Inform drug development, determine if a particular drug is appropriate for an individual (Warfarin)
             • Cancer Treatment - Find differences between DNA of normal and cancer cells to predict prognosis and recommend best treatment
       Source: AAAS

    5. DNA 101
         • All DNA is composed of 4 bases (A, C, T, and G) in long, linear sequences, with a complementary strand
         • Similar to a file (with 2 bits per base)
         • Either strand can be read (in different directions)
       Source: BBC Science & Technology
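
The "file with 2 bits per base" idea can be sketched in a few lines of Python. The particular bit assignment (A=00, C=01, G=10, T=11) and the helper names `pack`, `unpack`, and `reverse_complement` are assumptions made for illustration; the slide only states that two bits per base suffice and that either strand can be read.

```python
# Hypothetical 2-bit code; the slide does not specify which base gets which bits.
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def pack(seq: str) -> bytes:
    """Pack a DNA string into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | BASE_TO_BITS[base]
        byte <<= 2 * (4 - len(chunk))  # left-align a short final chunk
        out.append(byte)
    return bytes(out)


def unpack(data: bytes, length: int) -> str:
    """Recover the first `length` bases from packed bytes."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BITS_TO_BASE[(byte >> shift) & 0b11])
    return "".join(bases[:length])


def reverse_complement(seq: str) -> str:
    """Return the same stretch of DNA as read from the opposite strand."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))
```

Because the final byte is zero-padded, the sequence length has to be stored alongside the packed bytes, which is why `unpack` takes it as an argument.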

    6. DNA 101, continued
         • Human DNA is split up into 23 chromosomes
         • There are two copies of each
         • Chromosome pairs are mostly identical (exception: sex chromosomes)
       Computer analogy: 46 files, chr1-a.dna, chr1-b.dna … chr22-a.dna, chr22-b.dna
       Diff of chr1-a.dna and chr1-b.dna shows only minor changes
       Source: NIH Genetics Home Reference

    7. DNA 101, continued
         • How BIG is human DNA?
             • Total DNA - 6 billion bp (~1.5 GB at 2 bits per base)
             • Just one copy of each chromosome - 3 billion bp (~750 MB)
             • Largest chromosome - 247 million bp (~62 MB)
             • Smallest chromosome - 47 million bp (~12 MB)
         • Your entire genome will fit on a flash drive!
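
The sizes on this slide follow from the 2-bits-per-base encoding mentioned earlier in the deck. A minimal sketch of the arithmetic, assuming decimal megabytes (10^6 bytes) and no compression or metadata; `packed_megabytes` is a name introduced here for illustration:

```python
BITS_PER_BASE = 2  # from the "2 bits per base" analogy earlier in the deck


def packed_megabytes(base_pairs: int) -> float:
    """Size in decimal megabytes of a sequence stored at 2 bits per base."""
    return base_pairs * BITS_PER_BASE / 8 / 1_000_000


# Base-pair counts as quoted on the slide.
for label, bp in [
    ("Total DNA (both copies)", 6_000_000_000),
    ("One copy of each chromosome", 3_000_000_000),
    ("Largest chromosome", 247_000_000),
    ("Smallest chromosome", 47_000_000),
]:
    print(f"{label}: {packed_megabytes(bp):,.2f} MB")
```

The slide quotes rounded figures, so small differences from the computed values are expected.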

    8. Whole Genome Sequencing
         • How do we sequence a genome?
             • Break DNA into many small fragments (~100 bases)
             • Read sequences (with errors!)
             • Oversample (by 40x to 100x) to handle errors
             • Assemble the original full sequence from the pieces
         • Analogy:
             • Make lots of somewhat inaccurate copies of the Complete Works of Shakespeare, shred them, and try to reassemble the entire text.
       Source: Illumina, Inc.
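
To get a feel for the data volume this implies, the oversampling factor can be turned into a read count: coverage is roughly (number of reads × read length) / genome length. The sketch below computes that and simulates noisy reads from a toy genome. The function names, the uniform random fragment positions, and the substitution-only error model are simplifying assumptions, not a description of how any real sequencer behaves.

```python
import random


def reads_needed(genome_length: int, read_length: int, coverage: int) -> int:
    """Number of reads so that reads * read_length >= coverage * genome_length."""
    return (coverage * genome_length + read_length - 1) // read_length


def shotgun_reads(genome, read_length, coverage, error_rate, rng):
    """Sample (start, read) pairs uniformly, flipping bases at `error_rate`."""
    reads = []
    for _ in range(reads_needed(len(genome), read_length, coverage)):
        start = rng.randrange(len(genome) - read_length + 1)
        noisy = []
        for base in genome[start:start + read_length]:
            if rng.random() < error_rate:
                # Substitution error: replace with one of the other three bases.
                base = rng.choice([b for b in "ACGT" if b != base])
            noisy.append(base)
        reads.append((start, "".join(noisy)))
    return reads


# One copy of the human genome (3 billion bp), 100-base reads, 40x oversampling.
print(reads_needed(3_000_000_000, 100, 40))
```

At 40x this comes to 1.2 billion reads for a single genome, which is the scale the rest of the talk is concerned with.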

    9. Dropping costs - a problem!
         • The physical costs of sequencing are dropping
             • 2007 - $1M
             • 2009 - $100K
             • 2010 - $5K - $10K
             • 2012 - $1K (estimated)
         • Lower costs drive an increase in sequencing activity and data produced
         • Computation for sequence assembly and analysis is becoming a bottleneck

    10. The problem, continued
         • Bioinformatics is a relatively new field
         • Big datasets are a new challenge - focus has not yet shifted to parallel methods
         • Most existing software is command-line only; most users are biologists, who find it difficult to use
         • There are usually separate programs for each analysis step - file formats are not standardized

    11. The solution, a genomics platform
         • The Spiral Platform
             • A common, coherent, web-based interface
             • Strong, extensible backend built on open source (Hadoop)
             • Shared data sets (Human reference sequence)
             • Commercial support available
             • Standard data formats and import/export capability with existing formats
             • Modular design, since new methods are still being invented

    12. Why Hadoop
         • Support for parallelization
         • Support for large datasets
         • Good composability
         • Existing developer community
         • Most genomics problems map well to MapReduce

    13. Algorithms - Mapping to MapReduce
         • The Pipeline
             • Assembly - Putting the parts together
                 • Alignment to reference
                 • De novo assembly
             • Annotation
                 • Comparison to existing DNA (human reference)
                 • Comparison to reference data
                 • Looking for certain structure (protein coding sequences)

    14. Existing use of Hadoop in Genomics
         • Primarily limited to alignment to reference
         • Crossbow
             • Connects existing software via Perl scripts and streaming
         • CloudBurst
             • True native Hadoop implementation
             • Performs basic assembly; additional features are still a TODO
         • To reuse or not to reuse
             • Always reuse where possible
             • Customer comes first

    15. Alignment to reference based assembly
         • Uses an existing reference sequence
         • Performs a ‘near’ match lookup
             • Changes in individual DNA / read errors are usually small (1 bp at a time)
         • Once aligned, use redundancy of reads to construct a ‘consensus sequence’
         • Two methods: seed-based and Burrows-Wheeler transform
         • Complicated by the existence of two chromosomes, read direction, etc.
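
To make the "near match" idea concrete, here is a deliberately naive sketch: it slides the read along the reference, counts mismatches, and also tries the reverse complement to account for read direction. This brute-force scan is neither of the two methods the slide names (seed-based or Burrows-Wheeler transform) and would be far too slow for a real genome; the function names and the `max_mismatches` threshold are assumptions for illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def best_near_match(reference: str, read: str, max_mismatches: int = 2):
    """Return (position, strand, mismatches) for the best placement, or None.

    Strand "+" means the read matches the reference as given; "-" means its
    reverse complement does, i.e. the read came from the opposite strand.
    """
    candidates = {"+": read, "-": read.translate(COMPLEMENT)[::-1]}
    best = None
    for strand, seq in candidates.items():
        for pos in range(len(reference) - len(seq) + 1):
            distance = hamming(reference[pos:pos + len(seq)], seq)
            if distance <= max_mismatches and (best is None or distance < best[2]):
                best = (pos, strand, distance)
    return best


reference = "TTGACCGTAAGCTTAGGCAT"
print(best_near_match(reference, "CCGTTAGC"))  # one substitution vs. reference[4:12]
print(best_near_match(reference, "GCTTACGG"))  # reverse complement of reference[4:12]
```

The scan costs time proportional to reference length × read length per read, which is the motivation for indexing approaches such as seeds.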

    16. Step 1, Alignment (seed based)
         • Based on the assumption that every read exactly matches the reference for PART of the read (e.g. 15 bp)
         • Input = Unaligned reads + reference
         • Map
             • Reference - output every 15-bp subsequence, along with surrounding data and its location within the reference
             • Reads - output each 15-bp subsequence along with the read data
         • Reduce
             • Find collisions, extend each match, judge its quality, and output it if good
         • Output = Aligned reads
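
The map and reduce steps on this slide can be sketched as an in-process simulation, with a dictionary standing in for Hadoop's shuffle. This is a sketch under stated assumptions rather than the Spiral implementation: the function names, the record layouts, the choice to emit a seed at every read offset, and the mismatch-count quality test (`max_mismatches`) are all hypothetical.

```python
from collections import defaultdict

SEED = 15  # seed length in bases, as in the slide's example


def map_reference(reference, flank):
    """Emit (seed, record) for every seed in the reference, with nearby context."""
    for pos in range(len(reference) - SEED + 1):
        lo = max(0, pos - flank)
        context = reference[lo:pos + SEED + flank]  # the "surrounding data"
        yield reference[pos:pos + SEED], ("ref", pos, lo, context)


def map_read(read_id, read):
    """Emit (seed, record) for every seed in a read, carrying the whole read."""
    for offset in range(len(read) - SEED + 1):
        yield read[offset:offset + SEED], ("read", read_id, offset, read)


def reduce_seed(seed, values, max_mismatches):
    """For one seed, pair reference hits with read hits and extend each match."""
    refs = [v for v in values if v[0] == "ref"]
    reads = [v for v in values if v[0] == "read"]
    for _, pos, lo, context in refs:
        for _, read_id, offset, read in reads:
            start = pos - offset  # where the read would begin in the reference
            if start < lo:
                continue  # read would hang off the start of the context
            window = context[start - lo:start - lo + len(read)]
            if len(window) < len(read):
                continue  # read would hang off the end of the context
            mismatches = sum(a != b for a, b in zip(window, read))
            if mismatches <= max_mismatches:  # crude stand-in for "judge quality"
                yield read_id, start, mismatches


def align(reference, reads, max_mismatches=2):
    """Run map, group by seed (the shuffle), then reduce; return sorted hits."""
    flank = max(len(read) for read in reads.values())
    groups = defaultdict(list)
    for key, value in map_reference(reference, flank):
        groups[key].append(value)
    for read_id, read in reads.items():
        for key, value in map_read(read_id, read):
            groups[key].append(value)
    hits = set()  # several seeds of one read usually agree on the same placement
    for seed, values in groups.items():
        hits.update(reduce_seed(seed, values, max_mismatches))
    return sorted(hits)
```

A read with a single substitution still aligns because at least one of its seeds avoids the error, which is exactly the assumption stated in the first bullet. Reads shorter than the seed length produce no seeds and therefore no alignments.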

    17. Step 2, Consensus Calling
         • Input = Aligned read data
         • Map - Group aligned reads by location (reads overlapping two regions map to both)
         • Reduce - For a given location, determine most likely value based on multiple reads, taking into account accuracy of read and alignment
         • Output = Sequence data for the region
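
A minimal in-process sketch of this second job follows. Reads are keyed by fixed-size region, a read spanning a boundary is emitted to every region it overlaps, and the reducer takes a confidence-weighted vote per position. The region size, the per-base confidence values between 0 and 1, and the simple weighted vote are assumptions for illustration; a production caller would use a proper probabilistic model and would also weigh alignment quality.

```python
from collections import defaultdict

REGION = 10  # hypothetical region size in bases; real jobs would use far larger regions


def map_aligned_read(start, bases, confidences):
    """Emit the read once for every region it overlaps."""
    first = start // REGION
    last = (start + len(bases) - 1) // REGION
    for region in range(first, last + 1):
        yield region, (start, bases, confidences)


def reduce_region(region, reads):
    """Pick the highest-weighted base at each covered position of one region."""
    lo, hi = region * REGION, (region + 1) * REGION
    votes = defaultdict(lambda: defaultdict(float))
    for start, bases, confidences in reads:
        for i, (base, weight) in enumerate(zip(bases, confidences)):
            pos = start + i
            if lo <= pos < hi:  # ignore the part of the read outside this region
                votes[pos][base] += weight
    return {pos: max(tally, key=tally.get) for pos, tally in votes.items()}


def call_consensus(aligned_reads):
    """Run map, group by region, reduce, and join the calls in position order.

    Assumes the reads cover a contiguous stretch; gaps are simply skipped.
    """
    groups = defaultdict(list)
    for start, bases, confidences in aligned_reads:
        for region, value in map_aligned_read(start, bases, confidences):
            groups[region].append(value)
    calls = {}
    for region, reads in groups.items():
        calls.update(reduce_region(region, reads))
    return "".join(calls[pos] for pos in sorted(calls))


reads = [
    (0, "ACGTACGTACGT", [0.9] * 12),
    (4, "ACGTACGTAC", [0.9] * 10),
    (2, "GTACTTACGT", [0.9] * 4 + [0.2] + [0.9] * 5),  # low-confidence error at position 6
]
print(call_consensus(reads))
```

Emitting boundary-spanning reads to both regions lets each reducer work independently, at the cost of some duplicated input.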

    18. Questions?
         • www.spiralgenetics.com
         • [email_address]
