Hw09   Protein Alignment
Published in: Technology, Education
Transcript

  • 1. Hadoop World 2009, New York, Oct 2, 2009. Sequence Alignment and Hadoop. Booz Allen Hamilton Inc., 134 National Business Parkway, Annapolis Junction, MD 20701. Tel (301) 543-4665 [email_address]. Paul Brown, Associate
  • 2. The Impact of Hadoop
      • Booz Allen Hamilton, a leading strategy and technology consulting firm, works with clients to deliver results that endure. Every day, government agencies, corporations, institutions, and not-for-profit organizations rely on Booz Allen’s expertise and objectivity, and on the combined capabilities and dedication of our exceptional people to find solutions and seize opportunities.
      • Hadoop dramatically lowers the cost of entry to distributed computing and opens up a wide range of computational applications that were previously out of reach to those who need them.
      • Frequently these applications are faster, cheaper, and more accurate than their predecessors, which completely changes the playing field.
      • The intention of this work, and this talk, is to explore how technology like Hadoop can have a game-changing impact on bioinformatics.
  • 3. Biological Information [Slide note, Paul Brown, 9/21/09: Need to verify all these.]
  • 4. Biology + Computer Science = Bioinformatics. http://bioinformatics.ubc.ca/about/what_is_bioinformatics [Figure: example sequences A Y N A R N A N R N Y A Y N N R N A A N R N]
  • 5. Bioinformatics: The Pain
      • We are obtaining biological data at a steadily increasing rate: an exponential curve tilting steeply toward vertical.
      • Converting that data to usable information is proceeding, but not completely keeping up with its acquisition.
      • Leveraging all that information to create knowledge is an open challenge, lagging far behind our rate of data collection.
      • Unique opportunities abound in creating an environment that enables true understanding of this rich sea of data. Hadoop promises to be such an environment.
    Why Hadoop:
      • Hadoop provides scalable data storage and processing: a distributed file system with an easily accessible analytic framework.
      • Used “on demand” with a cloud provider, it maximizes resources.
  • 6. So What? Querying a database of sequences for similar sequences
      • “One to many” comparisons:
          • ~58,000 proteins in the PDB.
          • Protein alignment is frequently used in the development of medicines.
          • Looking for a certain sequence across species helps indicate function.
      • Implementation in Hadoop:
          • Distribute database sequences across each node
          • Send the query sequence inside the MapReduce job (or via the distributed cache)
          • Scales well with number of nodes
          • Existing algorithms port easily
          • Individual sequences can’t be too long
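The "one to many" pattern on this slide can be sketched in plain Python without a cluster. This is only an illustration, not the speaker's code: the shards stand in for nodes, the function names are hypothetical, and `kmer_overlap` is a deliberately simple stand-in scorer (shared 2-mers), not an alignment algorithm. Any real scorer, such as Smith-Waterman, could be passed in its place.

```python
from collections import Counter
from heapq import nlargest


def kmer_overlap(a, b, k=2):
    """Stand-in similarity: number of shared k-mers (NOT an alignment score)."""
    ca = Counter(a[i:i + k] for i in range(len(a) - k + 1))
    cb = Counter(b[i:i + k] for i in range(len(b) - k + 1))
    return sum((ca & cb).values())


def map_shard(shard, query, score=kmer_overlap):
    """Map step: score every database sequence held by one node against the query."""
    for seq_id, seq in shard:
        yield seq_id, score(query, seq)


def reduce_top(scored, n=3):
    """Reduce step: keep the n best-scoring database sequences."""
    return nlargest(n, scored, key=lambda kv: kv[1])


# Toy database, split into two shards that stand in for two nodes.
database = [("p1", "AYNANANA"), ("p2", "RRRRYYYY"),
            ("p3", "ANANANRA"), ("p4", "NANA")]
shards = [database[0::2], database[1::2]]

# The query is small, so every "node" receives its own copy.
query = "ANANAN"
scored = [pair for shard in shards for pair in map_shard(shard, query)]
best = reduce_top(scored, n=2)
print(best)
```

Because each shard is scored independently, adding nodes only shortens the map step, which is consistent with the slide's point that this case scales well with the number of nodes.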
  • 7. So What? Comparing sequences in bulk
      • “Many to many”:
          • DNA hybridization (reconstruction)
      • Hadoop implementation:
          • If the whole dataset fits on one computer: use the distributed cache and assign each node a piece of the list.
          • But if the whole dataset does not fit on one computer…
    [Figure label: Reconstructed Sequence]
  • 8. What if the dataset of sequences doesn't fit on one machine?
      • “Pre-join” all possible pairs with one MapReduce job.
      • Once pairs are pre-computed, alignment algorithms can be applied relatively easily.
    Example: sequences A, B, C, D give the pairs AB, AC, AD, BC, BD, CD.
    [Figure: Input data → MapReduce (pre-join) → pre-joined data → MapReduce (Alignment Algorithm 1, 2, … N) → alignment results]
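One way a pre-join could be arranged so that no single machine ever holds the whole dataset is block partitioning. The slide does not say how the pre-join was actually implemented, so the scheme below is an assumption for illustration, and all names (`block_of`, `map_record`, `reduce_block_pair`, `pre_join`) are hypothetical. Each sequence is assigned to a block, the map step sends it to every block pair it belongs to, and each reducer pairs up only the two blocks it received.

```python
from collections import defaultdict
from itertools import combinations, product


def block_of(seq_id, num_blocks):
    # Deterministic toy partitioner; a real job would use a stable hash.
    return sum(seq_id.encode()) % num_blocks


def map_record(seq_id, seq, num_blocks):
    """Map step: emit the record once for every block pair that includes its block."""
    b = block_of(seq_id, num_blocks)
    for other in range(num_blocks):
        yield (min(b, other), max(b, other)), (b, seq_id, seq)


def reduce_block_pair(key, records):
    """Reduce step: emit each pair of sequences this block pair is responsible for."""
    lo, hi = key
    if lo == hi:
        # Same block on both sides: all pairs within the block.
        members = sorted(r[1:] for r in records)
        yield from combinations(members, 2)
    else:
        # Two different blocks: only cross-block pairs.
        left = sorted(r[1:] for r in records if r[0] == lo)
        right = sorted(r[1:] for r in records if r[0] == hi)
        yield from product(left, right)


def pre_join(sequences, num_blocks=2):
    groups = defaultdict(list)  # stands in for the shuffle phase
    for seq_id, seq in sequences:
        for key, value in map_record(seq_id, seq, num_blocks):
            groups[key].append(value)
    pairs = []
    for key in sorted(groups):
        pairs.extend(reduce_block_pair(key, groups[key]))
    return pairs


sequences = [("A", "AYNA"), ("B", "RNAN"), ("C", "RNYA"), ("D", "YNNR")]
pairs = pre_join(sequences, num_blocks=2)
print(sorted("".join(sorted((left[0], right[0]))) for left, right in pairs))
```

Every unordered pair is produced exactly once: within-block pairs by the reducer for key (b, b), cross-block pairs by the reducer for key (i, j). The pre-joined output can then be fed to a second job that runs an alignment algorithm on each pair.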
  • 9. So What? Analyzing really big sequences
      • One big sequence to many small sequences:
          • Scanning DNA for structures that may or may not indicate function
          • Population genetics
      • Hadoop implementation:
          • Sequences can be billions of characters in length
          • Distribute pieces of the larger sequence across machines
          • Similar to the “many to many” solution; just need to keep track of the original sequence from which each piece came
          • May need a method for combining local scores into an overall score
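The bookkeeping described here (distribute pieces, remember where each piece came from) might look like the sketch below. The overlap between neighbouring pieces is an assumption added for illustration; the slide does not mention it, but without it a match that straddles a piece boundary would be missed. An exact substring search stands in for a real alignment step, and the function names are hypothetical. Combining local alignment scores into an overall score is not attempted.

```python
def split_with_offsets(seq_id, sequence, chunk_size, overlap):
    """Yield (seq_id, start_offset, piece) so each piece can be traced to its origin."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    start = 0
    while True:
        yield seq_id, start, sequence[start:start + chunk_size]
        if start + chunk_size >= len(sequence):
            break
        start += step


def find_in_chunks(chunks, pattern):
    """Search each piece independently, then map hits back to original coordinates."""
    hits = set()  # a set, because overlapping pieces can report the same hit twice
    for seq_id, offset, piece in chunks:
        pos = piece.find(pattern)
        while pos != -1:
            hits.add((seq_id, offset + pos))
            pos = piece.find(pattern, pos + 1)
    return sorted(hits)


big = "ANANANRAAYNANANA"
# Overlap must be at least len(pattern) - 1 for no hit to be lost at a boundary.
chunks = list(split_with_offsets("seq1", big, chunk_size=6, overlap=3))
hits = find_in_chunks(chunks, "ANA")
print(hits)
```

In a Hadoop job the `(seq_id, start_offset)` pair would travel with each piece as part of its key or value, so any node can report positions in the coordinates of the original sequence.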
  • 10. Demonstration Implementation: Smith-Waterman Alignment
      • One of the more computationally intense matching and alignment techniques
      • Provides both a match score and an alignment
      • Optimizations exist that were not implemented
      • Did both the “one to many” alignment and the “many to many” alignment
  • 11. Smith-Waterman Algorithm

             -   A   Y   N   A   N   A   N   A
         -   0   0   0   0   0   0   0   0   0
         A   0   2   1   0   2   1   2   1   2
         N   0   1   1   3   2   4   3   4   3
         A   0   2   1   2   5   4   6   5   6
         N   0   1   1   3   4   7   6   8   7
         A   0   2   1   2   5   6   9   8  10
         N   0   1   1   3   4   7   8  11  10
         R   0   0   0   2   3   6   7  10  10
         A   0   2   1   1   4   5   8   9  12
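The matrix on this slide can be reproduced with a short, unoptimized Smith-Waterman sketch. The slide does not state the scoring scheme; match = +2, mismatch = -1 and a linear gap penalty of -1 are inferred here because they reproduce the slide's values, so treat them as an assumption. The function names are illustrative, and this is not the implementation used in the talk.

```python
def smith_waterman_matrix(rows, cols, match=2, mismatch=-1, gap=-1):
    """Fill the local-alignment score matrix; row 0 and column 0 are the gap row/column."""
    h = [[0] * (len(cols) + 1) for _ in range(len(rows) + 1)]
    for i, a in enumerate(rows, start=1):
        for j, b in enumerate(cols, start=1):
            diag = h[i - 1][j - 1] + (match if a == b else mismatch)
            # Scores never drop below zero: that is what makes the alignment local.
            h[i][j] = max(0, diag, h[i - 1][j] + gap, h[i][j - 1] + gap)
    return h


def local_alignment(rows, cols, match=2, mismatch=-1, gap=-1):
    """Return (best score, aligned rows string, aligned cols string)."""
    h = smith_waterman_matrix(rows, cols, match, mismatch, gap)
    score, i, j = max((h[i][j], i, j)
                      for i in range(len(h)) for j in range(len(h[0])))
    top, bottom = [], []
    # Trace back from the best cell until the score reaches zero.
    while i > 0 and j > 0 and h[i][j] > 0:
        sub = match if rows[i - 1] == cols[j - 1] else mismatch
        if h[i][j] == h[i - 1][j - 1] + sub:
            top.append(rows[i - 1]); bottom.append(cols[j - 1])
            i -= 1; j -= 1
        elif h[i][j] == h[i - 1][j] + gap:
            top.append(rows[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(cols[j - 1])
            j -= 1
    return score, "".join(reversed(top)), "".join(reversed(bottom))


# Sequences as laid out on the slide: rows down the side, columns across the top.
matrix = smith_waterman_matrix("ANANANRA", "AYNANANA")
score, aligned_rows, aligned_cols = local_alignment("ANANANRA", "AYNANANA")
for line in matrix:
    print(" ".join(f"{v:3d}" for v in line))
print(score, aligned_rows, aligned_cols)
```

The fill step costs time proportional to the product of the two sequence lengths, which is why the previous slide calls the technique computationally intense and why long individual sequences are a concern in the "one to many" case.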
  • 12. Hadoop and EC2 Implementation
      • Amazon EC2
      • 250 machines
      • Runs in 10 minutes for a single sequence; runs in 24 hours for an N×N comparison
      • Cost: ~$40/hr
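The figures on this slide allow a rough cost estimate. Two assumptions are made here that the slide does not confirm: that ~$40/hr is the price of the whole 250-machine cluster, and that usage is billed in whole hours, rounded up. The constant and function names are illustrative.

```python
import math

CLUSTER_RATE_USD_PER_HOUR = 40.0  # from the slide; assumed to cover all machines
MACHINES = 250                    # from the slide


def billed_cost(run_hours, rate=CLUSTER_RATE_USD_PER_HOUR):
    """Estimated cost, assuming billing rounds up to whole hours (an assumption)."""
    return math.ceil(run_hours) * rate


per_machine_hour = CLUSTER_RATE_USD_PER_HOUR / MACHINES
single_query_cost = billed_cost(10 / 60)  # the 10-minute single-sequence run
n_by_n_cost = billed_cost(24)             # the 24-hour N×N comparison

print(f"per machine-hour: ${per_machine_hour:.2f}")
print(f"single query:     ${single_query_cost:.2f}")
print(f"N x N comparison: ${n_by_n_cost:.2f}")
```

Under these assumptions the full N×N comparison comes to roughly $960, and a single query costs at most one cluster-hour.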
  • 13. Ready for What’s Next…
      • Hadoop provides an accessible “big data” infrastructure:
          • Easy to administer
          • Flexible hardware demands
          • Easy to use
      • Reduces design and implementation impediments by orders of magnitude.
      • With next-generation sequencing systems, each with custom applications, the amount of sequence data is growing at an exponential rate.
      • Wide array of applications, bioinformatics being just one. This results in a strong, fast-moving technical community and talent pool.
