AWS Customer Presentation- University of Maryland


Michael Schatz, a researcher at the University of Maryland, talks about using AWS for DNA research.

Published in: Technology

    1. CloudBurst: Highly Sensitive Read Mapping with MapReduce. Michael Schatz, May 27, 2009, AWS Start-Up Event
    2. Searching Wikipedia
         • How do you find all pages with your name in Wikipedia?
             • 4M pages x 250 words / page = 1B words to search
         • Sequentially searching every word is too slow; we need an index
             • Is the query Q present, and if so, where?
             • Are there any partial or approximate occurrences of Q?
         • Example variants: Michael Schatz, Michel Schatz, Michal Schatz, …, Michael Shatz, Michael Schats, Michael Schantz
    3. Indexing with MapReduce
         • Inverted Index
             • Record the source of every word in the corpus
             • Index size may be significantly larger than the original text
         • Construction Algorithm
             • Map function emits (word, position)
             • Different pages can be indexed in parallel on different machines
             • Similar to the word counting example
         • Searching
             • Online search queries answered efficiently
                 • Index stored on multiple disks
             • Bulk queries answered in the reducer
                 • Already cataloged words shared by Apple and Zebra
         • Example pages
             • Apple: "Apples are the fruit of the apple tree with black seeds."
             • Zebra: "Zebras are equids with black and white stripes."
         • Example index (word: positions): Apples 1; Apple 7; And 6; Are 2, 2; Black 10, 5; Equids 3; Fruit 4; Of 5; Seeds 11; Stripes 8; The 3, 6; Tree 8; White 7; With 9, 4; Zebras 1
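The construction algorithm on this slide can be sketched in a few lines of Python. This is a minimal single-process illustration, not Hadoop code: the names `map_page`, `build_index`, `PAGES`, and `INDEX` are hypothetical, and the grouping that a MapReduce shuffle would perform is simulated with a dictionary. Positions are 1-based to match the slide's example index.

```python
import re
from collections import defaultdict


def map_page(page_id, text):
    """Map step: emit (word, (page_id, position)) for every word on a page."""
    words = re.findall(r"[a-z]+", text.lower())
    for position, word in enumerate(words, start=1):
        yield word, (page_id, position)


def build_index(pages):
    """Simulated shuffle and reduce: collect every posting under its word."""
    index = defaultdict(list)
    for page_id, text in pages.items():
        # Each page is mapped independently, so on a real cluster
        # different pages could be handled by different machines.
        for word, posting in map_page(page_id, text):
            index[word].append(posting)
    return dict(index)


# The two example pages from the slide.
PAGES = {
    "Apple": "Apples are the fruit of the apple tree with black seeds.",
    "Zebra": "Zebras are equids with black and white stripes.",
}
INDEX = build_index(PAGES)
```

Looking up `INDEX["black"]` returns the postings for both pages, which is the "is Q present, and if so, where?" query answered without rescanning the text.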
    4. Indexing DNA with MapReduce
         • The genome of an organism encodes its genetic information in a long sequence of 4 DNA nucleotides: ACGT
             • Bacteria: ~5 million bp
             • Humans: ~3 billion bp
         • Current DNA sequencing machines generate 1-2 Gbp of sequence per day
             • Millions of short reads (25-300bp)
         • Recent studies of individual human genomes analyzed 3.3 (Wang, et al., 2008) and 4.0 (Bentley, et al., 2008) billion 36bp reads
             • Mapped reads to the reference human genome to discover variations between people
             • Many more studies underway
         • Example sequences
             • Chr 1: CATGCTGCGAATA
             • Chr X: TATGAATTCC
         • Example index (3-mer: positions): AAT 10, 5; ATA 11; ATG 2, 2; ATT 6; CAT 1; CGA 8; CTG 5; GAA 9, 4; GCG 7; GCT 4; TAT 1; TCC 8; TGA 3; TGC 3, 6; TTC 7
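The same indexing idea applied to DNA can be sketched as follows, treating every overlapping substring of length k (a k-mer) as a "word". This is an illustrative single-process sketch; `kmer_index`, `GENOME`, and `KMERS` are hypothetical names, and positions are 1-based to match the example index on the slide.

```python
from collections import defaultdict


def kmer_index(sequences, k):
    """Record the 1-based start of every overlapping k-mer in each sequence."""
    index = defaultdict(list)
    for name, seq in sequences.items():
        for start in range(len(seq) - k + 1):
            index[seq[start:start + k]].append((name, start + 1))
    return dict(index)


# The two example sequences from the slide.
GENOME = {"Chr1": "CATGCTGCGAATA", "ChrX": "TATGAATTCC"}
KMERS = kmer_index(GENOME, 3)
```

With k = 3 this reproduces the slide's entries, for example `TGC` at positions 3 and 6 of Chr 1, and `AAT` at position 10 of Chr 1 and position 5 of Chr X.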
    5. Personal Genomics
         • How does your genome compare to Craig’s?
         • Figure labels: Heart Disease; Cancer; Loves Portuguese Water Dogs
    6. CloudBurst Architecture
         • 1. Map: Catalog k-mers
             • Emit every k-mer in the genome and non-overlapping k-mers in the reads
             • Simultaneously index the genome and join with the reads
         • 2. Shuffle: Coalesce Seeds
             • Hadoop internal shuffle groups together k-mers shared by the reads and the reference
             • Conceptually build a hash table of k-mers and their occurrences
         • 3. Reduce: End-to-end alignment
             • Locally extend alignment beyond seeds by counting mismatches, or with the Landau-Vishkin k-difference algorithm to allow for indels
             • If read aligns end-to-end, record the alignment
         • Figure: Human chromosome 1, Read 1, and Read 2 pass through map, shuffle, and reduce. Example output: Read 1, Chromosome 1, 12345-12365; Read 2, Chromosome 1, 12350-12370
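The three stages above can be sketched as a toy, single-process Python pipeline. This is not the CloudBurst implementation: it does not use Hadoop, it extends seeds only by counting mismatches (the Landau-Vishkin k-difference extension for indels is omitted), and all function names and the sample data are hypothetical. Positions here are 0-based.

```python
from collections import defaultdict


def map_kmers(reference, reads, k):
    """Map: emit every reference k-mer and the non-overlapping k-mers of each read."""
    for pos in range(len(reference) - k + 1):
        yield reference[pos:pos + k], ("ref", pos)
    for read_id, read in reads.items():
        for offset in range(0, len(read) - k + 1, k):
            yield read[offset:offset + k], ("read", read_id, offset)


def shuffle(pairs):
    """Shuffle: group records by k-mer, as a hash table of k-mers and occurrences."""
    groups = defaultdict(list)
    for kmer, record in pairs:
        groups[kmer].append(record)
    return groups


def reduce_seeds(groups, reference, reads, max_mismatches):
    """Reduce: extend each shared seed end-to-end by counting mismatches."""
    alignments = set()
    for records in groups.values():
        ref_hits = [r[1] for r in records if r[0] == "ref"]
        read_hits = [(r[1], r[2]) for r in records if r[0] == "read"]
        for ref_pos in ref_hits:
            for read_id, offset in read_hits:
                read = reads[read_id]
                start = ref_pos - offset  # where the whole read would begin
                if start < 0 or start + len(read) > len(reference):
                    continue  # read would hang off the end of the reference
                window = reference[start:start + len(read)]
                mismatches = sum(a != b for a, b in zip(read, window))
                if mismatches <= max_mismatches:
                    alignments.add((read_id, start, mismatches))
    return sorted(alignments)


def align(reference, reads, k, max_mismatches):
    """Run map, shuffle, and reduce in sequence."""
    groups = shuffle(map_kmers(reference, reads, k))
    return reduce_seeds(groups, reference, reads, max_mismatches)


# Hypothetical toy data: one short reference and two 6 bp reads.
REFERENCE = "CATGCTGCGAATA"
READS = {"read1": "GCTGCG", "read2": "TGCGTA"}
RESULT = align(REFERENCE, READS, k=3, max_mismatches=1)
```

Here `RESULT` is `[("read1", 3, 0), ("read2", 5, 1)]`: read1 matches exactly at offset 3, and read2 aligns at offset 5 with one mismatch. Splitting a 6 bp read into two non-overlapping 3-mers means that an alignment with at most one mismatch must contain at least one exact seed, which is why the seed lookup does not miss it.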
    7. Amazon EC2 Evaluation
         • CloudBurst running times for mapping 7M reads to human chromosome 22 with at most 4 mismatches on the local and EC2 clusters
             • Local cluster shows near linear speedup over optimized C program
         • The 24-core Amazon High-CPU Medium Instance EC2 cluster is faster than the 24-core Small Instance EC2 cluster and the 24-core local dedicated cluster
             • As the number of cores increases, the running time decreases with near linear speedup
             • The 96-core cluster is 3.5x faster than the 24-core, 100x faster than original
    8. Grand Challenge of Biology
         • “NextGen sequencing has completely outrun the ability of good bioinformatics people to keep up with the data and use it well… We need a MASSIVE effort in the development of tools for “normal” biologists to make better use of massive sequence databases.”
             • Jonathan Eisen, JGI Users Meeting, 3/28/09
         • Moving Forward
             • More sophisticated genome indexing and matching with Hadoop Streaming
             • Large scale DNA network analysis with Hadoop
         • More Information
    9. Acknowledgements: Mihai Pop, Ben Langmead, Jimmy Lin, Steven Salzberg
    10. Thank You!