Hadoop for Genomics__HadoopSummit2010
Hadoop Summit 2010 - Research Track

Hadoop for Genomics
Jeremy Bruestle, Spiral Genetics

© All Rights Reserved

Hadoop for Genomics__HadoopSummit2010 Presentation Transcript

  • 1. Hadoop for Genomics
    • Jeremy Bruestle
    Spiral Genetics, Inc.
  • 2. Session Agenda
    • About Genomics
    • About Hadoop and Spiral
    • Specific technical detail
  • 3. Genomics
    • Study of genomes (total genetic material) of humans and other organisms
    • Requires reading an individual’s DNA sequence, fully or partially
    • We can compare genomes within species or across species lines
    • Many levels of analysis and uses
    Source: Nature Publishing Group
  • 4. Full Human Genome DNA Sequencing
    • What is it?
      • Reading the entire DNA sequence of individual humans
      • Analyzing the data
      • Interpreting the results for biological significance
    • Why do it?
      • Biomedical Research - Find disease-causing mutations for better screening and preventative care
      • Personalized Pharmacology - Inform drug development and determine whether a particular drug is appropriate for an individual (e.g. Warfarin)
      • Cancer Treatment - Find differences between the DNA of normal and cancer cells to predict prognosis and recommend the best treatment
    Source: AAAS
  • 5. DNA 101
    • All DNA is composed of 4 bases (A, C, T, and G) in long, linear sequences, with a complementary strand
    • Similar to a file (with 2 bits per base)
    • Either strand can be read (in different directions)
    Source: BBC Science & Technology
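The "file with 2 bits per base" analogy on this slide can be made concrete with a small sketch. The bit assignment (A=00, C=01, G=10, T=11) and the function names `pack`, `unpack`, and `reverse_complement` are assumptions chosen for illustration, not something the talk specifies. Reading "the other strand" corresponds to taking the reverse complement.

```python
# Hypothetical 2-bit code; any fixed assignment of the four bases would work.
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def pack(seq: str) -> bytes:
    """Pack a DNA string into bytes, 4 bases per byte (last byte zero-padded)."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | BASE_TO_BITS[base]
        byte <<= 2 * (4 - len(chunk))  # left-align a short final chunk
        out.append(byte)
    return bytes(out)


def unpack(data: bytes, length: int) -> str:
    """Inverse of pack(); `length` is the number of bases originally packed."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BITS_TO_BASE[(byte >> shift) & 0b11])
    return "".join(bases[:length])


def reverse_complement(seq: str) -> str:
    """Sequence of the complementary strand, read in its own direction."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


if __name__ == "__main__":
    seq = "ACGTA"
    packed = pack(seq)
    print(len(seq), "bases ->", len(packed), "bytes")
    print(unpack(packed, len(seq)), reverse_complement(seq))
```

Because padding bits are indistinguishable from the base A in this scheme, the sequence length has to be stored alongside the packed bytes.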
  • 6. DNA 101, continued
    • Human DNA is split up into 23 chromosomes
    • There are two copies of each
    • Chromosome pairs are mostly identical (exception: sex chromosomes)
    • Computer analogy:
      • 46 files: chr1-a.dna, chr1-b.dna … chr22-a.dna, chr22-b.dna, plus the two sex chromosomes
      • A diff of chr1-a.dna and chr1-b.dna shows only minor changes
    Source: NIH Genetics Home Reference
  • 7. DNA 101, continued
    • How BIG is human DNA? (sizes assume 2 bits per base)
      • Total DNA - 6 billion bp (~1.5 GB)
      • Just one copy of each chromosome - 3 billion bp (~750 MB)
      • Largest chromosome - 247 million bp (~62 MB)
      • Smallest chromosome - 47 million bp (~12 MB)
    • Your entire genome will fit on a flash drive!
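As a back-of-the-envelope check, the storage figures follow from the 2-bits-per-base encoding mentioned earlier in the deck: bytes = base pairs × 2 / 8. The sketch below assumes decimal units (1 MB = 10^6 bytes) and uses the base-pair counts from the slide; the helper name `packed_size_bytes` is illustrative.

```python
def packed_size_bytes(base_pairs: int, bits_per_base: int = 2) -> float:
    """Bytes needed to store `base_pairs` bases at `bits_per_base` bits each."""
    return base_pairs * bits_per_base / 8


# Base-pair counts taken from the slide.
SIZES_BP = {
    "total DNA (both copies)": 6_000_000_000,
    "one copy of each chromosome": 3_000_000_000,
    "largest chromosome": 247_000_000,
    "smallest chromosome": 47_000_000,
}

if __name__ == "__main__":
    for name, bp in SIZES_BP.items():
        megabytes = packed_size_bytes(bp) / 1e6  # decimal megabytes (assumption)
        print(f"{name}: {bp:,} bp is about {megabytes:,.0f} MB")
```

Real sequence files are usually larger than this lower bound, since plain-text formats spend a full byte per base.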
  • 8. Whole Genome Sequencing
    • How do we sequence a genome?
      • Break DNA into many small fragments (~100 bases)
      • Read sequences (with errors!)
      • Oversample (40x to 100x) to handle errors
      • Assemble the original full sequence from the pieces
    • Analogy:
      • Make lots of somewhat inaccurate copies of the Complete Works of Shakespeare, shred them, and try to reassemble the entire text.
    Source: Illumina, Inc.
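The "shred and oversample" idea can be illustrated with a toy simulation. Coverage is the total number of bases read divided by the genome length, so the number of reads needed is roughly coverage × genome length / read length. The function names, the uniform random placement of reads, and the substitution-only error model below are simplifying assumptions for illustration and are not a description of any real sequencer.

```python
import random


def reads_needed(genome_length: int, read_length: int, coverage: int) -> int:
    """Number of reads required to reach the given average coverage (rounded up)."""
    return (coverage * genome_length + read_length - 1) // read_length


def sample_reads(genome, read_length, coverage, error_rate, rng):
    """Draw reads at uniformly random positions, substituting bases at `error_rate`."""
    reads = []
    for _ in range(reads_needed(len(genome), read_length, coverage)):
        start = rng.randrange(len(genome) - read_length + 1)
        bases = []
        for base in genome[start:start + read_length]:
            if rng.random() < error_rate:
                # Simulated read error: replace with one of the other three bases.
                base = rng.choice([b for b in "ACGT" if b != base])
            bases.append(base)
        reads.append((start, "".join(bases)))
    return reads


if __name__ == "__main__":
    # 40x coverage of a 3-billion-bp genome with 100-base reads.
    print(f"{reads_needed(3_000_000_000, 100, 40):,} reads")
    rng = random.Random(42)
    toy_genome = "".join(rng.choice("ACGT") for _ in range(500))
    toy_reads = sample_reads(toy_genome, read_length=25, coverage=10,
                             error_rate=0.01, rng=rng)
    print(len(toy_reads), "toy reads, first:", toy_reads[0])
```

The start position is kept here only so the simulation can be checked; a real assembler sees the reads without their positions, which is what makes reassembly hard.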
  • 9. Dropping costs - a problem!
    • The physical costs of sequencing are dropping
      • 2007 - $1M
      • 2009 - $100K
      • 2010 - $5K - $10K
      • 2012 - $1K (estimated)
    • Lower costs drive an increase in sequencing activity and data produced
    • Computation for sequence assembly and analysis is becoming a bottleneck
  • 10. The problem, continued
    • Bioinformatics is a relatively new field
    • Big datasets are a new challenge - focus has not yet shifted to parallel methods
    • Most existing software is command-line only, and most users are biologists who find it difficult to use
    • There are usually separate programs for each analysis step, and file formats are not standardized
  • 11. The solution, a genomics platform
    • The Spiral Platform
      • A common, coherent, web-based interface
      • Strong, extensible backend built on open source (Hadoop)
      • Shared data sets (human reference sequence)
      • Commercial support available
      • Standard data formats and import/export capability with existing formats
      • Modular design, since new methods are still being invented
  • 12. Why Hadoop
    • Support for parallelization
    • Support for large datasets
    • Good Composability
    • Existing developer community
    • Most genomics problems map well to map/reduce
  • 13. Algorithms - Mapping to MapReduce
    • The Pipeline
      • Assembly - Putting the parts together
        • Alignment to reference
        • De novo assembly
      • Annotation
        • Comparison to existing DNA (human reference)
        • Comparison to reference data
        • Looking for certain structure (protein coding sequences)
  • 14. Existing use of Hadoop in Genomics
    • Primarily limited to alignment to reference
    • Crossbow
      • Connects existing software via Perl scripts and Hadoop Streaming
    • CloudBurst
      • True native Hadoop implementation
      • Performs basic assembly; additional features are still a TODO
    • To reuse or not to reuse
      • Always reuse where possible
      • Customer comes first
  • 15. Alignment to reference based assembly
    • Uses an existing reference sequence
    • Performs a ‘near’ match lookup
      • Changes in individual DNA / Read errors are usually small (1 bp at a time)
    • Once aligned, use redundancy of reads to construct ‘consensus sequence’
    • Two methods, seed based and Burrows–Wheeler transform
    • Complicated by the existence of two chromosomes / read direction, etc.
  • 16. Step 1, Alignment (seed based)
    • Based on the assumption that every read exactly matches the reference for PART of the read (e.g. 15 bp)
    • Input = Unaligned reads + reference
    • Map
      • Reference - output every 15 bp subsequence, along with surrounding data and its location within the reference
      • Reads - output a 15 bp subsequence and the read data
    • Reduce
      • Find collisions, extend each match, judge its quality, and output it if good
    • Output = Aligned reads
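The map and reduce steps on this slide can be sketched in plain Python, with an in-memory `shuffle` standing in for Hadoop's grouping by key. This is a minimal illustration under stated assumptions, not Spiral's implementation: all function names are hypothetical, each read emits its non-overlapping seeds (the slide does not say which subsequence is used), "surrounding data" is modeled as a fixed flank of reference bases, and match quality is judged by a simple mismatch count.

```python
from collections import defaultdict

K = 15  # seed length, as in the slide's example


def map_reference(reference, k=K, flank=K):
    """Emit every k-long subsequence with its location and surrounding bases."""
    for pos in range(len(reference) - k + 1):
        left = max(0, pos - flank)
        window = reference[left:pos + k + flank]
        yield reference[pos:pos + k], ("ref", pos, left, window)


def map_read(read_id, read, k=K):
    """Emit the non-overlapping k-long seeds of a read, plus the read data."""
    for offset in range(0, len(read) - k + 1, k):
        yield read[offset:offset + k], ("read", read_id, offset, read)


def shuffle(pairs):
    """Group values by key, as the MapReduce framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_seed(values, max_mismatches=2):
    """Find reference/read collisions on one seed, extend, and keep good matches."""
    refs = [v for v in values if v[0] == "ref"]
    reads = [v for v in values if v[0] == "read"]
    for _, pos, left, window in refs:
        for _, read_id, offset, read in reads:
            start = pos - offset   # implied start of the read in the reference
            w0 = start - left      # the same start, relative to the window
            if w0 < 0 or w0 + len(read) > len(window):
                continue           # window does not cover the whole read
            mismatches = sum(a != b for a, b in zip(read, window[w0:w0 + len(read)]))
            if mismatches <= max_mismatches:
                yield read_id, start, mismatches


def align(reference, reads, k=K, flank=K, max_mismatches=2):
    """Run map, shuffle, and reduce locally; returns (read_id, start, mismatches)."""
    pairs = list(map_reference(reference, k, flank))
    for read_id, read in reads.items():
        pairs.extend(map_read(read_id, read, k))
    hits = set()
    for values in shuffle(pairs).values():
        hits.update(reduce_seed(values, max_mismatches))
    return sorted(hits)
```

In this sketch the flank must be at least the read length minus the seed length, otherwise the reducer cannot extend a match across the whole read. A read whose only error falls inside one seed is still found through another seed, which is the point of the "exact match for PART of the read" assumption.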
  • 17. Step 2, Consensus Calling
    • Input = Aligned read data
    • Map - Group aligned reads by location (reads overlapping two regions map to both)
    • Reduce - For a given location, determine most likely value based on multiple reads, taking into account accuracy of read and alignment
    • Output = Sequence data for the region
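A minimal sketch of this step, again in plain Python rather than on Hadoop. The names, the fixed-size regions, and the per-base quality weights are assumptions for illustration. In particular, a quality-weighted majority vote is swapped in here for whatever statistical model a production consensus caller would use to weigh read and alignment accuracy.

```python
from collections import Counter, defaultdict


def map_aligned_read(start, read, quals, region_size):
    """Emit the read under every region it overlaps (possibly more than one)."""
    first = start // region_size
    last = (start + len(read) - 1) // region_size
    for region in range(first, last + 1):
        yield region, (start, read, quals)


def reduce_region(region, values, region_size):
    """Pick the highest-weighted base at each position inside one region."""
    lo, hi = region * region_size, (region + 1) * region_size
    votes = defaultdict(Counter)  # position -> base -> summed quality weight
    for start, read, quals in values:
        for i, (base, quality) in enumerate(zip(read, quals)):
            pos = start + i
            if lo <= pos < hi:  # ignore the parts of the read outside this region
                votes[pos][base] += quality
    return {pos: counts.most_common(1)[0][0] for pos, counts in sorted(votes.items())}


def call_consensus(aligned_reads, region_size=1000):
    """Run map, group-by-region, and reduce locally; returns {position: base}."""
    groups = defaultdict(list)
    for start, read, quals in aligned_reads:
        for region, value in map_aligned_read(start, read, quals, region_size):
            groups[region].append(value)
    consensus = {}
    for region, values in groups.items():
        consensus.update(reduce_region(region, values, region_size))
    return consensus


if __name__ == "__main__":
    reads = [
        (0, "ACGT", [1.0, 1.0, 1.0, 1.0]),
        (2, "GTAA", [1.0, 1.0, 1.0, 1.0]),
        (2, "CTAA", [0.2, 1.0, 1.0, 1.0]),  # low-quality first base disagrees
    ]
    calls = call_consensus(reads, region_size=4)
    print("".join(calls[p] for p in sorted(calls)))
```

Each reducer only writes positions inside its own region, so a read that was mapped to two regions contributes to each position exactly once.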
  • 18. Questions?
    • www.spiralgenetics.com
    • [email_address]