Hw09 Protein Alignment


Published in: Technology, Education

  1. Sequence Alignment and Hadoop. Hadoop World 2009, New York, Oct 2, 2009. Paul Brown, Associate, Booz Allen Hamilton Inc., 134 National Business Parkway, Annapolis Junction, MD 20701. Tel (301) 543-4665. [email_address]
  2. The Impact of Hadoop
       • Booz Allen Hamilton, a leading strategy and technology consulting firm, works with clients to deliver results that endure. Every day, government agencies, corporations, institutions, and not-for-profit organizations rely on Booz Allen’s expertise and objectivity, and on the combined capabilities and dedication of our exceptional people, to find solutions and seize opportunities.
       • Hadoop dramatically lowers the cost of entry to distributed computing and opens up a wide range of computational applications that were previously out of reach to those who need them.
       • Frequently these applications are faster, cheaper, and more accurate than their predecessors, which completely changes the playing field.
       • The intention of this work, and this talk, is to explore how technology like Hadoop can have a game-changing impact on bioinformatics.
  3. Biological Information (figure slide). Author note, Paul Brown, 9/21/09: “Need to verify all these.”
  4. Biology + Computer Science = Bioinformatics
     http://bioinformatics.ubc.ca/about/what_is_bioinformatics
     Sequence letters from the slide graphic: A Y N A R N A N R N Y A Y N N R N A A N R N
  5. Bioinformatics: The Pain
       • We are obtaining biological data at a steadily increasing rate: an exponential curve tilting steeply toward vertical.
       • Converting that data into usable information is proceeding, but it is not fully keeping up with acquisition.
       • Leveraging all that information to create knowledge is an open challenge, lagging far behind our rate of data collection.
       • Unique opportunities abound in creating an environment that enables true understanding of this rich sea of data. Hadoop promises to be such an environment.
     Why Hadoop:
       • Hadoop combines scalable, distributed data storage (HDFS) with an easily accessible analytic framework (MapReduce).
       • Used on demand with a cloud provider, it makes the most of available resources.
  6. So What? Querying a database of sequences for similar sequences
       • “One to many” comparisons:
           • ~58,000 proteins in the PDB (Protein Data Bank).
           • Protein alignment is frequently used in the development of medicines.
           • Looking for a certain sequence across species helps indicate function.
       • Implementation in Hadoop:
           • Distribute the database sequences across the nodes.
           • Send the query sequence inside the MapReduce job (or via the distributed cache).
           • Scales well with the number of nodes.
           • Existing algorithms port easily.
           • Individual sequences can’t be too long.
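The one-to-many pattern on this slide can be sketched in plain Python, without Hadoop, as a map over database records followed by a sort. This is a minimal sketch and not the talk's implementation: the function names, the small in-memory `database` list, and the use of `difflib.SequenceMatcher` as a stand-in similarity score are all assumptions made for illustration. In a real job the map step would run on each node against its share of the database, and the scoring function would be an actual alignment algorithm.

```python
from difflib import SequenceMatcher


def similarity(query, candidate):
    """Stand-in score in [0, 1]; NOT a biological alignment score (assumption)."""
    return SequenceMatcher(None, query, candidate).ratio()


def map_sequence(query, record):
    """Map step: score one database record against the broadcast query."""
    seq_id, sequence = record
    return seq_id, similarity(query, sequence)


def one_to_many(query, database, top_n=3):
    """Run the map step over every record and keep the best-scoring hits."""
    scored = [map_sequence(query, record) for record in database]
    return sorted(scored, key=lambda hit: hit[1], reverse=True)[:top_n]


if __name__ == "__main__":
    # Hypothetical toy database of (id, sequence) records.
    database = [("p1", "ANANANRA"), ("p2", "AYNANANA"), ("p3", "RRRRRRRR")]
    for seq_id, score in one_to_many("AYNANANA", database, top_n=2):
        print(seq_id, round(score, 3))
```

Because each record is scored independently of the others, the map step needs no communication between nodes, which is why this case scales well with the number of nodes.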
  7. So What? Comparing sequences in bulk
       • “Many to many”:
           • DNA hybridization (reconstruction)
       • Hadoop implementation:
           • If the whole dataset fits on one computer, use the distributed cache and assign each node a piece of the list.
           • But if the whole dataset does not fit on one computer…
     Figure label: Reconstructed Sequence
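When the whole list fits on every node, "assign each node a piece of the list" can be read as giving each worker a subset of rows of the upper-triangular pair matrix. The sketch below is one possible way to do that, assuming a round-robin row assignment; the function name and the round-robin choice are illustrative assumptions, not something the slide specifies.

```python
def pairs_for_worker(num_sequences, worker, num_workers):
    """Yield the index pairs (i, j), i < j, that one worker is responsible for.

    Rows are dealt out round-robin so that long rows (small i) and short
    rows (large i) are spread across workers.
    """
    for i in range(worker, num_sequences, num_workers):
        for j in range(i + 1, num_sequences):
            yield i, j


if __name__ == "__main__":
    for worker in range(3):
        print(worker, list(pairs_for_worker(5, worker, 3)))
```

Every unordered pair is owned by exactly one worker, so no deduplication step is needed afterwards.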
  8. What if the dataset of sequences doesn't fit on one machine?
       • “Pre-join” all possible pairs with one MapReduce job.
       • Once the pairs are precomputed, alignment algorithms can be applied relatively easily.
     Example: sequences A, B, C, D yield the pairs AB, AC, AD, BC, BD, CD.
     Diagram: Input data → MapReduce (pre-join) → Pre-joined data → MapReduce (Alignment Algorithm 1, 2, …, N) → Alignment results
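The slide does not say how the pre-join job is written. One common way to generate all pairs without any single reducer seeing the whole dataset is to split the records into blocks and key each record by every block pair it takes part in. The sketch below simulates that map, shuffle, and reduce in plain Python; the block scheme, the function names, and `num_blocks` are assumptions for illustration only.

```python
from collections import defaultdict
from itertools import combinations


def map_record(index, seq_id, num_blocks):
    """Map: emit the record once for every block pair involving its own block."""
    own = index % num_blocks
    for other in range(num_blocks):
        yield (min(own, other), max(own, other)), (own, seq_id)


def reduce_block_pair(key, records):
    """Reduce: emit each unordered pair belonging to this block pair once."""
    low, high = key
    if low == high:
        # Pairs inside a single block.
        yield from combinations(sorted(s for _, s in records), 2)
    else:
        # Pairs with one member in each of the two blocks.
        left = [s for blk, s in records if blk == low]
        right = [s for blk, s in records if blk == high]
        for a in left:
            for b in right:
                yield tuple(sorted((a, b)))


def pre_join(seq_ids, num_blocks=2):
    """Simulate map, shuffle (group by key), and reduce in memory."""
    groups = defaultdict(list)
    for index, seq_id in enumerate(seq_ids):
        for key, value in map_record(index, seq_id, num_blocks):
            groups[key].append(value)
    pairs = []
    for key in sorted(groups):
        pairs.extend(reduce_block_pair(key, groups[key]))
    return sorted(pairs)


if __name__ == "__main__":
    print(pre_join(["A", "B", "C", "D"]))
```

For the four sequences in the slide's example this produces the same six pairs, AB, AC, AD, BC, BD, and CD. Each reducer only ever holds two blocks, so the block count can be raised until a block pair fits in memory.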
  9. So What? Analyzing really big sequences
       • One big sequence to many small sequences:
           • Scanning DNA for structures that may or may not indicate function
           • Population genetics
       • Hadoop implementation:
           • Sequences can be billions of characters in length.
           • Distribute pieces of the larger sequence across machines.
           • Similar to the “many to many” solution; just keep track of the original sequence each piece came from.
           • May need a method for combining local scores into an overall score.
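Distributing pieces of a long sequence while remembering where each piece came from might look like the sketch below. It tags every chunk with the source sequence id and its start offset. The overlap between neighbouring chunks is an assumption added here so that a match spanning a chunk boundary is not lost; the slide does not mention it, and the function name and parameters are hypothetical.

```python
def split_with_overlap(seq_id, sequence, chunk_size, overlap):
    """Yield (seq_id, start_offset, chunk) records covering the whole sequence.

    Consecutive chunks share `overlap` characters (an assumption, see above).
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    start = 0
    while True:
        yield seq_id, start, sequence[start:start + chunk_size]
        if start + chunk_size >= len(sequence):
            break
        start += step


if __name__ == "__main__":
    for record in split_with_overlap("chr_demo", "ABCDEFGHIJ", chunk_size=4, overlap=1):
        print(record)
```

A local hit at position `p` inside a chunk maps back to position `start_offset + p` in the original sequence, which is the bookkeeping a later step would need to merge local scores.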
  10. Demonstration Implementation: Smith-Waterman Alignment
       • One of the more computationally intense matching and alignment techniques
       • Provides both a match score and an alignment
       • Optimizations exist that were not implemented
       • Did both the “one to many” alignment and the “many to many” alignment
  11. Smith-Waterman Algorithm

            -   A   Y   N   A   N   A   N   A
        -   0   0   0   0   0   0   0   0   0
        A   0   2   1   0   2   1   2   1   2
        N   0   1   1   3   2   4   3   4   3
        A   0   2   1   2   5   4   6   5   6
        N   0   1   1   3   4   7   6   8   7
        A   0   2   1   2   5   6   9   8  10
        N   0   1   1   3   4   7   8  11  10
        R   0   0   0   2   3   6   7  10  10
        A   0   2   1   1   4   5   8   9  12
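The matrix on this slide is consistent with the basic Smith-Waterman recurrence with a linear gap penalty, using match = +2, mismatch = -1, and gap = -1. Those parameters are inferred from the cell values; the slide does not state them, so treat them as an assumption. The sketch below fills the scoring matrix only and leaves out the traceback that recovers the alignment itself. The function name and return shape are illustrative.

```python
def smith_waterman(rows, cols, match=2, mismatch=-1, gap=-1):
    """Fill the local-alignment score matrix H for two sequences.

    Returns (H, best_score, best_position). Scoring defaults are inferred
    from the slide's matrix, not stated by the author.
    """
    H = [[0] * (len(cols) + 1) for _ in range(len(rows) + 1)]
    best, best_pos = 0, (0, 0)
    for i, a in enumerate(rows, start=1):
        for j, b in enumerate(cols, start=1):
            diagonal = H[i - 1][j - 1] + (match if a == b else mismatch)
            up = H[i - 1][j] + gap      # gap in the column sequence
            left = H[i][j - 1] + gap    # gap in the row sequence
            H[i][j] = max(0, diagonal, up, left)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    return H, best, best_pos


if __name__ == "__main__":
    H, best, pos = smith_waterman("ANANANRA", "AYNANANA")
    for row in H:
        print(" ".join(f"{value:3d}" for value in row))
    print("best score", best, "at", pos)
```

With the row sequence ANANANRA and the column sequence AYNANANA, the best score is 12 in the bottom-right cell, as in the slide's matrix. Filling the matrix takes time proportional to the product of the two sequence lengths, which is why the talk calls the technique computationally intense.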
  12. Hadoop and EC2 Implementation
       • Amazon EC2
       • 250 machines
       • Runs in 10 minutes for a single sequence; runs in 24 hours for an N×N comparison
       • Cost: ~$40/hr
  13. Ready for What’s Next…
       • Hadoop provides an accessible “big data” infrastructure:
           • Easy to administer
           • Flexible hardware demands
           • Easy to use
       • Reduces design and implementation impediments by orders of magnitude.
       • With next-generation sequencing systems, each with custom applications, the amount of sequence data is growing at an exponential rate.
       • A wide array of applications, of which bioinformatics is just one, results in a strong, fast-moving technical community and talent pool.