Hadoop for Genomics__HadoopSummit2010
Hadoop Summit 2010 - Research Track
Hadoop for Genomics
Jeremy Bruestle, Spiral Genetics

Published in: Technology
  • Transcript

    • 1. Hadoop for Genomics
      • Jeremy Bruestle
      Spiral Genetics, Inc.
    • 2. Session Agenda
      • About Genomics
      • About Hadoop and Spiral
      • Specific technical detail
    • 3. Genomics
      • Study of genomes (total genetic material) of humans and other organisms
      • Requires reading an individual’s DNA sequence, fully or partially
      • We can compare genomes within species or across species lines
      • Many levels of analysis and uses
      Source: Nature Publishing Group
    • 4. Full Human Genome DNA Sequencing
      • What is it?
        • Reading the entire DNA sequence of individual humans
        • Analyzing the data
        • Interpreting the results for biological significance
      • Why do it?
        • Biomedical Research - Find disease-causing mutations for better screening and preventative care
        • Personalized Pharmacology - Inform drug development, determine if a particular drug is appropriate for an individual (e.g. Warfarin)
        • Cancer Treatment - Find differences between DNA of normal and cancer cells to predict prognosis and recommend best treatment
      Source: AAAS
    • 5. DNA 101
      • All DNA is composed of 4 bases (A, C, T, and G) in long, linear sequences, with a complementary strand
      • Similar to a file (with 2 bits per base)
      • Either strand can be read (in different directions)
      Source: BBC Science & Technology
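The "2 bits per base" and "either strand can be read" points can be made concrete with a small sketch. The particular bit assignment below (A=00, C=01, G=10, T=11) and the function names are assumptions for illustration; the slide only states that four bases fit in two bits each.

```python
# Hypothetical 2-bit code; the slide does not specify which base gets which bits.
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}

# Standard base pairing: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def pack(seq):
    """Pack a DNA string into one integer, 2 bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | BASE_TO_BITS[base]
    return value


def unpack(value, length):
    """Recover a DNA string of the given length from its packed form."""
    bases = []
    for _ in range(length):
        bases.append(BITS_TO_BASE[value & 0b11])
        value >>= 2
    return "".join(reversed(bases))


def reverse_complement(seq):
    """Sequence obtained by reading the complementary strand in its own direction."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


packed = pack("ACGT")
print(bin(packed), unpack(packed, 4), reverse_complement("AACG"))
```

Because a read can come from either strand, an aligner generally has to consider both a read and its reverse complement.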
    • 6. DNA 101, continued
      • Human DNA is split up into 23 chromosomes
      • There are two copies of each
      • Chromosome pairs are mostly identical (exception: sex chromosomes)
      • Computer Analogy:
        • 46 files, chr1-a.dna, chr1-b.dna … chr22-a.dna, chr22-b.dna
        • Diff of chr1-a.dna and chr1-b.dna shows only minor changes
      Source: NIH Genetics Home Reference
    • 7. DNA 101, continued
      • How BIG is human DNA?
        • Total DNA - 6 billion bp (~1.5 GB)
        • Just one copy of each chromosome - 3 billion bp (~750 MB)
        • Largest chromosome - 247 million bp (~62 MB)
        • Smallest chromosome - 47 million bp (~12 MB)
      • Your entire genome will fit on a flash drive!
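The sizes on this slide follow from the 2-bits-per-base encoding mentioned earlier. A minimal sketch of the arithmetic, assuming exactly 2 bits per base, no metadata or compression, and decimal megabytes (10^6 bytes); the function name is made up for illustration:

```python
def packed_size_mb(base_pairs, bits_per_base=2):
    """Storage needed for a sequence, in decimal megabytes (10**6 bytes)."""
    return base_pairs * bits_per_base / 8 / 1_000_000


# Base-pair counts as quoted on the slide.
LENGTHS = [
    ("Total DNA (both copies)", 6_000_000_000),
    ("One copy of each chromosome", 3_000_000_000),
    ("Largest chromosome", 247_000_000),
    ("Smallest chromosome", 47_000_000),
]

for label, base_pairs in LENGTHS:
    print(f"{label}: {packed_size_mb(base_pairs):,.2f} MB")
```

Real sequence files are usually larger than this lower bound, since common text formats spend a full byte per base and often carry quality scores as well.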
    • 8. Whole Genome Sequencing
      • How do we sequence a genome?
        • Break DNA into many small fragments (~100 bases)
        • Read sequences (with errors!)
        • Oversample (40x to 100x) to handle errors
        • Assemble the original full sequence from the pieces
      • Analogy:
        • Make lots of somewhat inaccurate copies of the Complete Works of Shakespeare, shred them, and try to reassemble the entire text.
      Source: Illumina, Inc.
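To get a feel for the data volume, the oversampling factor can be turned into a read count, and the "shredding" step can be simulated on a toy genome. This is only a sketch: the uniform random fragment positions, the substitution-only error model, and the names `reads_needed` and `shred` are assumptions, not a description of any real sequencing instrument.

```python
import random


def reads_needed(genome_length, read_length, coverage):
    """Reads required so that total bases read = coverage * genome length (rounded up)."""
    return (genome_length * coverage + read_length - 1) // read_length


def shred(genome, read_length, coverage, error_rate, rng):
    """Sample fixed-length reads at random positions, with random substitution errors."""
    reads = []
    for _ in range(reads_needed(len(genome), read_length, coverage)):
        start = rng.randrange(len(genome) - read_length + 1)
        bases = list(genome[start:start + read_length])
        for i, base in enumerate(bases):
            if rng.random() < error_rate:
                # Replace the base with one of the three other bases.
                bases[i] = rng.choice([b for b in "ACGT" if b != base])
        reads.append((start, "".join(bases)))
    return reads


# One copy of the human genome, 100-base reads, 40x oversampling.
print(reads_needed(3_000_000_000, 100, 40))

# Toy run on a 200-base "genome".
toy_reads = shred("ACGT" * 50, read_length=10, coverage=5, error_rate=0.01,
                  rng=random.Random(0))
print(len(toy_reads))
```

At 40x, a 3 billion bp genome works out to over a billion 100-base reads, which is why assembly becomes a large-scale data problem.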
    • 9. Dropping costs - a problem!
      • The physical costs of sequencing are dropping
        • 2007 - $1M
        • 2009 - $100K
        • 2010 - $5K - $10K
        • 2012 - $1K (estimated)
      • Lower costs drive an increase in sequencing activity and data produced
      • Computation for sequence assembly and analysis is becoming a bottleneck
    • 10. The problem, continued
      • Bioinformatics is a relatively new field
      • Big datasets are a new challenge - focus has not yet shifted to parallel methods
      • Most existing software is command line only, and most users are biologists who find it difficult to use
      • There are usually separate programs for each analysis step - file formats are not standardized
    • 11. The solution, a genomics platform
      • The Spiral Platform
        • A common, coherent, web-based interface
        • Strong, extensible backend built on open source (Hadoop)
        • Shared data sets (Human reference sequence)
        • Commercial support available
        • Standard data formats and import/export capability with existing formats
        • Modular design, since new methods are still being invented
    • 12. Why Hadoop
      • Support for parallelization
      • Support for large datasets
      • Good composability
      • Existing developer community
      • Most genomics problems map well to map/reduce
    • 13. Algorithms - Mapping to MapReduce
      • The Pipeline
        • Assembly - Putting the parts together
          • Alignment to reference
          • De novo assembly
        • Annotation
          • Comparison to existing DNA (human reference)
          • Comparison to reference data
          • Looking for certain structure (protein coding sequences)
    • 14. Existing use of Hadoop in Genomics
      • Primarily limited to alignment to reference
      • Crossbow
        • Connects existing software via Perl scripts and streaming
      • CloudBurst
        • True native Hadoop implementation
        • Performs basic assembly; additional features are still a TODO
      • To reuse or not to reuse
        • Always reuse where possible
        • Customer comes first
    • 15. Alignment to reference based assembly
      • Uses an existing reference sequence
      • Performs a ‘near’ match lookup
        • Changes in individual DNA / read errors are usually small (1 bp at a time)
      • Once aligned, use redundancy of reads to construct a ‘consensus sequence’
      • Two methods: seed-based and Burrows-Wheeler transform
      • Complicated by the existence of two chromosomes, read direction, etc.
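The slide names the Burrows-Wheeler transform without going into detail. As background only, here is a naive sketch of the transform and its inverse. It shows what the transform is, not how an aligner searches with it; the sorted-rotations construction is far too slow and memory-hungry for a real genome, and the `$` sentinel is an illustrative convention.

```python
def bwt(text, sentinel="$"):
    """Naive Burrows-Wheeler transform: last column of the sorted rotations."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed, sentinel="$"):
    """Naive inverse: repeatedly prepend the transformed column and re-sort."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(c + row for c, row in zip(transformed, table))
    # The original text is the row that ends with the sentinel.
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]


encoded = bwt("GATTACA")
print(encoded, inverse_bwt(encoded))
```

The transform is reversible and tends to group identical characters together, which is part of why it is useful as the basis for compact, searchable indexes of a reference sequence.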
    • 16. Step 1, Alignment (seed based)
      • Based on the assumption that every read exactly matches the reference for PART of the read (e.g. 15 bp)
      • Input = Unaligned reads + reference
      • Map
        • Reference - output all 15 bp long subsequences, as well as surrounding data and location within reference
        • Reads - output each 15 bp long subsequence and the read data
      • Reduce
        • Find collisions, extend each match, judge its quality, and output it if good
      • Output = Aligned reads
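The map and reduce steps on this slide can be sketched in plain Python, with a dictionary standing in for Hadoop's shuffle. This is a toy simulation under stated assumptions, not the presenter's implementation: the seed length is shortened from 15 bp to 4 so the example stays readable, "surrounding data" is modeled as a fixed flank of reference bases, "judge quality" is reduced to a mismatch-count threshold, and all names and constants are hypothetical. It also ignores reverse-complement reads and insertions or deletions.

```python
from collections import defaultdict

SEED_LEN = 4        # the slide uses 15 bp; shortened for a toy example
FLANK = 8           # assumed: reference bases emitted on each side of a seed
MAX_MISMATCHES = 1  # assumed quality threshold


def map_reference(reference):
    """Emit (seed, ("ref", position, context_start, context)) for every seed."""
    for pos in range(len(reference) - SEED_LEN + 1):
        ctx_start = max(0, pos - FLANK)
        context = reference[ctx_start:pos + SEED_LEN + FLANK]
        yield reference[pos:pos + SEED_LEN], ("ref", pos, ctx_start, context)


def map_read(read_id, read):
    """Emit (seed, ("read", read_id, offset, read)) for every seed in the read."""
    for offset in range(len(read) - SEED_LEN + 1):
        yield read[offset:offset + SEED_LEN], ("read", read_id, offset, read)


def reduce_seed(seed, values):
    """Pair reference and read records sharing a seed, then extend and score."""
    refs = [v for v in values if v[0] == "ref"]
    reads = [v for v in values if v[0] == "read"]
    for _, pos, ctx_start, context in refs:
        for _, read_id, offset, read in reads:
            start = pos - offset          # where the read would begin in the reference
            lo = start - ctx_start        # same position, relative to the context
            if lo < 0 or lo + len(read) > len(context):
                continue                  # read extends past the emitted context
            window = context[lo:lo + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= MAX_MISMATCHES:
                yield read_id, start, mismatches


def align(reference, reads):
    """Run map, group by seed (the shuffle), and reduce; return unique alignments."""
    groups = defaultdict(list)
    for seed, value in map_reference(reference):
        groups[seed].append(value)
    for read_id, read in reads.items():
        for seed, value in map_read(read_id, read):
            groups[seed].append(value)
    hits = set()
    for seed, values in groups.items():
        hits.update(reduce_seed(seed, values))
    return sorted(hits)


reference = "GATTACAGGCTTAACCGGTA"
reads = {"r1": "ACAGGCTT",   # exact copy of reference[4:12]
         "r2": "TTAACTGG"}   # reference[10:18] with one substituted base
print(align(reference, reads))
```

Emitting every seed of a read, rather than a single one, lets a read with an error inside one seed still collide with the reference through another seed, at the cost of more shuffle traffic.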
    • 17. Step 2, Consensus Calling
      • Input = Aligned read data
      • Map - Group aligned reads by location (reads overlapping two regions map to both)
      • Reduce - For a given location, determine most likely value based on multiple reads, taking into account accuracy of read and alignment
      • Output = Sequence data for the region
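Consensus calling can be sketched the same way. Again this is a toy simulation with assumptions: regions are fixed-size windows of `REGION_SIZE` bases, "accuracy of read and alignment" is collapsed into a single per-read weight, and the most likely base is simply the one with the largest total weight. The slide does not specify the actual scoring model, and the names below are hypothetical.

```python
from collections import defaultdict

REGION_SIZE = 10  # assumed width of the regions used as map output keys


def map_aligned_read(start, read, weight):
    """Emit (region, record) for every region the aligned read overlaps."""
    first = start // REGION_SIZE
    last = (start + len(read) - 1) // REGION_SIZE
    for region in range(first, last + 1):
        yield region, (start, read, weight)


def reduce_region(region, values):
    """For each position in the region, pick the base with the highest total weight."""
    lo = region * REGION_SIZE
    hi = lo + REGION_SIZE
    votes = defaultdict(lambda: defaultdict(float))  # position -> base -> weight
    for start, read, weight in values:
        for i, base in enumerate(read):
            pos = start + i
            if lo <= pos < hi:  # a read spanning two regions votes in each separately
                votes[pos][base] += weight
    return {pos: max(bases, key=bases.get) for pos, bases in votes.items()}


def call_consensus(aligned_reads):
    """Run map, group by region (the shuffle), and reduce; join covered positions."""
    groups = defaultdict(list)
    for start, read, weight in aligned_reads:
        for region, value in map_aligned_read(start, read, weight):
            groups[region].append(value)
    consensus = {}
    for region, values in groups.items():
        consensus.update(reduce_region(region, values))
    return "".join(consensus[pos] for pos in sorted(consensus))


aligned = [
    (0, "GATTACAG", 1.0),
    (2, "TTACAGGC", 1.0),
    (4, "ACAGGCTT", 1.0),
    (4, "ACTGGCTT", 0.5),  # low-weight read with an error at reference position 6
    (8, "GCTTAACC", 1.0),
]
print(call_consensus(aligned))
```

Sending a boundary-spanning read to both regions means each reducer sees every read that covers its positions, so regions can be processed independently.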
    • 18. Questions?
      • www.spiralgenetics.com
      • [email_address]
