AWS Customer Presentation- University of Maryland

Michael Schatz, a researcher at the University of Maryland, talks about using AWS for DNA research.

Presentation Transcript

  • CloudBurst: Highly Sensitive Read Mapping with MapReduce
    • Michael Schatz
    • May 27, 2009
    • AWS Start-Up Event
  • Searching Wikipedia
    • How do you find all pages containing your name in Wikipedia?
      • 4M pages x 250 words / page = 1B words to search
    • Sequentially searching every word is too slow; we need an index
      • Is the query Q present, and if so, where?
      • Are there any partial or approximate occurrences of Q?
    • Example variants (figure): Michael Schatz, Michel Schatz, Michal Schatz, …, Michael Shatz, Michael Schats, Michael Schantz
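One way to make "approximate occurrence" concrete is edit distance. The sketch below assumes Levenshtein distance as the similarity measure; the slides do not say which measure is meant, and the function name `edit_distance` is illustrative only. It shows that each name variant listed on the slide is a single edit away from the query.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, or substitutions that turn a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


# Query and variants taken from the slide.
query = "Michael Schatz"
variants = ["Michel Schatz", "Michal Schatz", "Michael Shatz",
            "Michael Schats", "Michael Schantz"]
distances = {v: edit_distance(query, v) for v in variants}
```

Running this dynamic program against every word of a 1B-word corpus is exactly the sequential scan the slide calls too slow, which is why an index is used to narrow the candidates first.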
  • Indexing with MapReduce
    • Inverted Index
      • Record source of every word in the corpus
      • Index size may be significantly larger than the original text
    • Construction Algorithm
      • Map function emits (word, position)
      • Different pages can be indexed in parallel on different machines
      • Similar to word counting example
    • Searching
      • Online search queries answered efficiently
        • Index stored on multiple disks
      • Bulk queries answered in reducer
        • Already cataloged words shared by Apple and Zebra
    • Example (figure): two pages and their inverted index
      • Apple: "Apples are the fruit of the apple tree with black seeds."
      • Zebra: "Zebras are equids with black and white stripes."
      • …
      • Index (word: positions): Apples 1; Apple 7; And 6; Are 2, 2; Black 10, 5; Equids 3; Fruit 4; Of 5; Seeds 11; Stripes 8; The 3, 6; Tree 8; White 7; With 9, 4; Zebras 1
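The construction described on this slide can be sketched in a few lines of plain Python that imitate the map, shuffle, and reduce phases in memory. This is an illustration under assumptions, not Hadoop code: the helper names (`map_page`, `shuffle`, `reduce_word`) are hypothetical, words are lowercased, and each posting also carries the page title so that shared words can be detected.

```python
import re
from collections import defaultdict


def map_page(title, text):
    """Map step: emit (word, (page, position)) for every word, 1-based."""
    words = re.findall(r"[a-z]+", text.lower())
    for position, word in enumerate(words, start=1):
        yield word, (title, position)


def shuffle(pairs):
    """Group emitted values by key, as the framework's shuffle would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_word(word, postings):
    """Reduce step: return the sorted posting list for one word."""
    return word, sorted(postings)


# The two example pages from the slide.
pages = {
    "Apple": "Apples are the fruit of the apple tree with black seeds.",
    "Zebra": "Zebras are equids with black and white stripes.",
}

emitted = [pair for title, text in pages.items()
           for pair in map_page(title, text)]
index = dict(reduce_word(w, p) for w, p in shuffle(emitted).items())

# Bulk query answered from the grouped data: words shared by both pages.
shared = sorted(w for w, p in index.items()
                if len({page for page, _ in p}) > 1)
```

Because each page is mapped independently, the calls to `map_page` could run on different machines; only the shuffle needs to bring postings for the same word together.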
  • Indexing DNA with MapReduce
    • The genome of an organism encodes its genetic information in a long sequence of the 4 DNA nucleotides: A, C, G, T
      • Bacteria: ~5 million bp
      • Humans: ~3 billion bp
    • Current DNA sequencing machines generate 1-2 Gbp of sequence per day
      • Millions of short reads (25-300bp)
    • Recent studies of individual human genomes analyzed 3.3 billion (Wang et al., 2008) and 4.0 billion (Bentley et al., 2008) 36 bp reads
      • Mapped reads to reference human genome to discover variations between people
      • Many more studies underway
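    • Example (figure): two sequences and their 3-mer index
      • Chr 1: CATGCTGCGAATA
      • Chr X: TATGAATTCC
      • …
      • Index (3-mer: positions): AAT 10, 5; ATA 11; ATG 2, 2; ATT 6; CAT 1; CGA 8; CTG 5; GAA 9, 4; GCG 7; GCT 4; TAT 1; TCC 8; TGA 3; TGC 3, 6; TTC 7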
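The same inverted-index idea carries over to DNA by treating every length-k substring (k-mer) as a "word". The sketch below uses the two toy sequences and k = 3 from the slide; the function name `kmer_index` and the choice to store the sequence name alongside each 1-based offset are assumptions made for illustration.

```python
from collections import defaultdict


def kmer_index(sequences, k):
    """Map each k-mer to the (sequence name, 1-based offset) pairs
    at which it occurs."""
    index = defaultdict(list)
    for name, seq in sequences.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].append((name, i + 1))
    return dict(index)


# Toy sequences from the slide.
toy_genome = {"Chr 1": "CATGCTGCGAATA", "Chr X": "TATGAATTCC"}
index = kmer_index(toy_genome, k=3)
```

Here `index["ATG"]` lists offset 2 in both sequences and `index["TGC"]` lists offsets 3 and 6 in Chr 1, which agrees with the entries recorded on the slide.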
  • Personal Genomics
    • How does your genome compare to Craig’s?
    • Figure labels: Heart Disease; Cancer; Loves Portuguese Water Dogs
  • CloudBurst Architecture
      • 1. Map: Catalog k-mers
        • Emit every k-mer in the genome and non-overlapping k-mers in the reads
        • Simultaneously index the genome and join with the reads
      • 2. Shuffle: Coalesce Seeds
        • Hadoop internal shuffle groups together k-mers shared by the reads and the reference
        • Conceptually build a hash table of k-mers and their occurrences
      • 3. Reduce: End-to-end alignment
        • Locally extend alignment beyond seeds by counting mismatches, or with the Landau-Vishkin k-difference algorithm to allow for indels
        • If read aligns end-to-end, record the alignment
    • Figure: Human chromosome 1, Read 1, and Read 2 enter map, then shuffle, …, then reduce; output records: Read 1, Chromosome 1, 12345-12365; Read 2, Chromosome 1, 12350-12370
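The three phases above can be imitated in memory with a short Python sketch. It is not the CloudBurst source: all function names, the toy sequences, and the parameters `k` and `max_mismatches` are hypothetical, offsets are 0-based, and only the mismatch-counting extension is shown (the Landau-Vishkin variant for indels is left out).

```python
from collections import defaultdict


def map_reference(name, seq, k):
    """Map: emit every (overlapping) k-mer in the reference."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k], ("ref", name, i)


def map_read(name, seq, k):
    """Map: emit non-overlapping k-mers (seeds) from a read."""
    for i in range(0, len(seq) - k + 1, k):
        yield seq[i:i + k], ("read", name, i)


def shuffle(pairs):
    """Shuffle: group reference and read occurrences by shared k-mer."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_seed(hits, reference, reads, max_mismatches):
    """Reduce: extend each seed hit to an end-to-end alignment by
    counting mismatches over the whole read."""
    ref_hits = [(n, i) for kind, n, i in hits if kind == "ref"]
    read_hits = [(n, i) for kind, n, i in hits if kind == "read"]
    for ref_name, ref_pos in ref_hits:
        for read_name, read_pos in read_hits:
            start = ref_pos - read_pos  # implied start of the read
            read = reads[read_name]
            if start < 0 or start + len(read) > len(reference[ref_name]):
                continue  # read would hang off the reference
            window = reference[ref_name][start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(window, read))
            if mismatches <= max_mismatches:
                yield read_name, ref_name, start, mismatches


def align(reference, reads, k, max_mismatches):
    pairs = []
    for name, seq in reference.items():
        pairs.extend(map_reference(name, seq, k))
    for name, seq in reads.items():
        pairs.extend(map_read(name, seq, k))
    found = set()  # the same alignment can be reached from several seeds
    for hits in shuffle(pairs).values():
        found.update(reduce_seed(hits, reference, reads, max_mismatches))
    return sorted(found)


# Hypothetical toy data, not from the slides.
reference = {"chr": "ACGTACGTTAGCCGATAGGC"}
reads = {"read1": "TAGCCGAT", "read2": "TAGCAGAT"}
alignments = align(reference, reads, k=4, max_mismatches=1)
```

In this toy run, `read1` matches exactly at offset 8 and `read2` aligns at the same offset with one mismatch. Emitting non-overlapping seeds from the reads keeps the number of shuffled keys small while still letting any alignment with few mismatches be found through at least one exact seed, provided the seed length is chosen short enough for the allowed mismatch count.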
  • Amazon EC2 Evaluation
    • CloudBurst running times for mapping 7M reads to human chromosome 22 with at most 4 mismatches on the local and EC2 clusters
      • Local cluster shows near-linear speedup over an optimized C program
    • The 24-core Amazon High-CPU Medium Instance EC2 cluster is faster than both the 24-core Small Instance EC2 cluster and the 24-core local dedicated cluster
      • As the number of cores increases, the running time decreases with near-linear speedup
      • The 96-core cluster is 3.5x faster than the 24-core cluster and 100x faster than the original
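To put "near linear" in numbers: going from 24 to 96 cores would ideally give a 4x speedup, and the slide reports 3.5x. The small helper below (the name `parallel_efficiency` is hypothetical) computes the ratio of observed to ideal speedup from those two figures.

```python
def parallel_efficiency(speedup: float, cores_before: int,
                        cores_after: int) -> float:
    """Observed speedup divided by the ideal (linear) speedup."""
    ideal = cores_after / cores_before
    return speedup / ideal


# Figures from the slide: 3.5x faster when scaling from 24 to 96 cores.
efficiency = parallel_efficiency(3.5, cores_before=24, cores_after=96)
```

With the slide's figures this evaluates to 0.875, that is, 87.5% of ideal linear scaling.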
  • Grand Challenge of Biology
    • “NextGen sequencing has completely outrun the ability of good bioinformatics people to keep up with the data and use it well… We need a MASSIVE effort in the development of tools for “normal” biologists to make better use of massive sequence databases.”
      • Jonathan Eisen – JGI Users Meeting – 3/28/09
    • Moving Forward
      • More sophisticated genome indexing & matching with Hadoop Streaming
      • Large scale DNA network analysis with Hadoop
    • More Information
      • http://cloudburst-bio.sourceforge.net
  • Acknowledgements
    • Mihai Pop
    • Ben Langmead
    • Jimmy Lin
    • Steven Salzberg
  • Thank You!