Hadoop World 2009, New York, Oct 2, 2009
Sequence Alignment and Hadoop
Paul Brown, Associate
Booz Allen Hamilton Inc., 134 National Business Parkway, Annapolis Junction, MD 20701
Tel (301) 543-4665 [email_address]
The Impact of Hadoop
Booz Allen Hamilton, a leading strategy and technology consulting firm, works with clients to deliver results that endure. Every day, government agencies, corporations, institutions, and not-for-profit organizations rely on Booz Allen’s expertise and objectivity, and on the combined capabilities and dedication of our exceptional people, to find solutions and seize opportunities.
Hadoop dramatically lowers the cost of entry to distributed computing and opens up a wide range of computational applications that were previously out of reach for those who need them. These applications are frequently faster, cheaper, and more accurate than their predecessors, which changes the playing field completely. The intention of this work, and of this talk, is to explore how technology like Hadoop can have a game-changing impact on bioinformatics.
Biological Information
[Note from Paul Brown, 9/21/09: need to verify all of these.]
Biology + Computer Science = Bioinformatics
http://bioinformatics.ubc.ca/about/what_is_bioinformatics
[Figure: sequence letters A Y N A R N A N R N Y A Y N N R N A A N R N]
Bioinformatics: The Pain
  • We are obtaining biological data at a steadily increasing rate: an exponential curve tilting steeply toward vertical.
  • Converting that data to usable information is proceeding, but it is not fully keeping up with acquisition.
  • Leveraging all that information to create knowledge is an open challenge that lags far behind our rate of data collection.
  • Unique opportunities abound in creating an environment that enables true understanding of this rich sea of data. Hadoop promises to be such an environment.
Why Hadoop:
  • Hadoop combines scalable distributed storage (HDFS) with an easily accessible analytic framework (MapReduce).
  • Used “on demand” with a cloud provider, it maximizes resource utilization.
So What? Querying a database of sequences for similar sequences
“One to many” comparisons:
  • ~58,000 proteins in the PDB (Protein Data Bank).
  • Protein alignment is frequently used in the development of medicines.
  • Looking for a certain sequence across species helps indicate function.
Implementation in Hadoop:
  • Distribute the database sequences across the nodes.
  • Send the query sequence along with the MapReduce job (or via the distributed cache).
  • Scales well with the number of nodes.
  • Existing algorithms port easily.
  • Individual sequences can’t be too long.
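The one-to-many scheme can be sketched as a map step that scores each database sequence against a shared query and a reduce step that keeps the best hits. This is a minimal local simulation, not Hadoop API code: `QUERY`, `mapper`, `reducer`, `run_job`, the toy database, and the scoring values (match +2, mismatch -1, gap -1) are all assumptions made for illustration.

```python
from collections import defaultdict

QUERY = "AYNANANA"  # hypothetical query; in a real job it would ship with the job


def local_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best Smith-Waterman local alignment score (score only, two rows kept)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            curr.append(max(0, diag, prev[j] + gap, curr[j - 1] + gap))
        best = max(best, max(curr))
        prev = curr
    return best


def mapper(seq_id, sequence):
    """Map step: score one database sequence against the shared query."""
    yield "hits", (local_score(QUERY, sequence), seq_id)


def reducer(key, values, top_n=2):
    """Reduce step: keep the top_n highest-scoring database sequences."""
    return sorted(values, reverse=True)[:top_n]


def run_job(database):
    """Local stand-in for the framework's shuffle between map and reduce."""
    grouped = defaultdict(list)
    for seq_id, sequence in database.items():
        for key, value in mapper(seq_id, sequence):
            grouped[key].append(value)
    return {key: reducer(key, values) for key, values in grouped.items()}


database = {"s1": "ANANANRA", "s2": "YYYY", "s3": "AYNANANA"}  # toy data
top_hits = run_job(database)["hits"]
print(top_hits)
```

Because each database sequence is scored independently of the others, the map step needs no coordination between nodes, which is why this case scales well with node count.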
So What? Comparing sequences in bulk
“Many to many”:
  • DNA hybridization (reconstruction)
Hadoop implementation:
  • If the whole dataset fits on one computer: use the distributed cache and assign each node a piece of the list.
  • But if the whole dataset does not fit on one computer….
[Figure: Reconstructed Sequence]
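When the full list is available on every node through the distributed cache, each node only needs to know which comparisons it owns. The sketch below assumes a round-robin assignment of rows to nodes; the function name `pairs_for_node` and the assignment scheme are illustrative and not taken from the talk.

```python
def pairs_for_node(node_id, num_nodes, n):
    """Index pairs (i, j) with i < j owned by one node; rows dealt round-robin."""
    return [(i, j)
            for i in range(node_id, n, num_nodes)
            for j in range(i + 1, n)]


sequences = ["A", "B", "C", "D", "E"]  # stands in for the cached sequence list
num_nodes = 2

assignment = {node: pairs_for_node(node, num_nodes, len(sequences))
              for node in range(num_nodes)}

for node, pairs in assignment.items():
    print(node, [(sequences[i], sequences[j]) for i, j in pairs])
```

Row `i` contributes `n - i - 1` pairs, so a round-robin deal is only roughly balanced; a production job would likely want a more even split.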
What if the dataset of sequences doesn't fit on one machine?
  • “Pre-join” all possible pairs with one MapReduce job.
  • Once the pairs are pre-computed, alignment algorithms can be applied relatively easily.
Sequence: A B C D
Pair: AB AC AD BC BD CD
[Diagram: Input data -> MapReduce (pre-join) -> Pre-joined data -> MapReduce (Alignment Algorithm 1, 2, ..., N) -> Alignment results]
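The two-stage flow can be illustrated locally: first materialize every unordered pair, then run any number of pairwise algorithms over the pre-joined records. This sketch uses `itertools.combinations` in place of a distributed self-join, and the two "algorithms" are deliberately trivial placeholders, not real aligners; all names and sample sequences are hypothetical.

```python
from itertools import combinations


def pre_join(sequences):
    """Stage 1: emit every unordered pair once as (pair_id, (left, right))."""
    for (id_a, seq_a), (id_b, seq_b) in combinations(sorted(sequences.items()), 2):
        yield id_a + id_b, (seq_a, seq_b)


def shared_letters(a, b):
    """Placeholder 'algorithm': number of distinct letters in common."""
    return len(set(a) & set(b))


def length_difference(a, b):
    """Placeholder 'algorithm': absolute difference in length."""
    return abs(len(a) - len(b))


sequences = {"A": "AYNA", "B": "NANA", "C": "RNAN", "D": "YANR"}  # toy data
pairs = dict(pre_join(sequences))
pair_ids = list(pairs)

# Stage 2: each algorithm consumes the same pre-joined records independently.
algorithms = {"shared_letters": shared_letters,
              "length_difference": length_difference}
results = {name: {pid: fn(left, right) for pid, (left, right) in pairs.items()}
           for name, fn in algorithms.items()}

print(pair_ids)
print(results["shared_letters"])
```

The number of pairs grows as n(n-1)/2, so the pre-joined data is much larger than the input; that is the price paid for making the second stage embarrassingly parallel.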
So What? Analyzing really big sequences
One big sequence to many small sequences:
  • Scanning DNA for structures that may or may not indicate function
  • Population genetics
Hadoop implementation:
  • Sequences can be billions of characters in length.
  • Distribute pieces of the larger sequence across machines.
  • Similar to the “many to many” solution; each piece just needs to keep track of the original sequence it came from.
  • May need a method for combining local scores into an overall score.
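One way to keep track of where each piece came from is to tag every chunk with its source ID and start offset, and to overlap neighbouring chunks so that a feature spanning a boundary is not lost. The sketch below is an assumption-laden illustration: `chunk_sequence`, `find_motif`, the chunk sizes, and the use of exact substring search as a stand-in for alignment are all hypothetical.

```python
def chunk_sequence(seq_id, sequence, chunk_size, overlap):
    """Split a long sequence into overlapping windows tagged with their origin."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    for start in range(0, max(len(sequence) - overlap, 1), step):
        yield seq_id, start, sequence[start:start + chunk_size]


def find_motif(chunks, motif):
    """Search each chunk, then map local hits back to global coordinates."""
    hits = set()  # a set removes duplicates found in overlapping regions
    for seq_id, start, chunk in chunks:
        pos = chunk.find(motif)
        while pos != -1:
            hits.add((seq_id, start + pos))
            pos = chunk.find(motif, pos + 1)
    return hits


chunks = list(chunk_sequence("big", "ABCDEFGHIJ", chunk_size=4, overlap=2))
hits = find_motif(chunks, "EF")
print(chunks)
print(hits)
```

For exact matching, an overlap of at least `len(motif) - 1` guarantees no hit is split across two chunks. Combining gapped alignment scores across chunks is harder and is left open here, as it is in the slide.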
Demonstration Implementation: Smith-Waterman Alignment
  • One of the more computationally intensive matching and alignment techniques.
  • Provides both a match score and an alignment.
  • Optimizations exist, but they were not implemented.
  • Did both the “one to many” alignment and the “many to many” alignment.
Smith-Waterman Algorithm

        -    A    Y    N    A    N    A    N    A
   -    0    0    0    0    0    0    0    0    0
   A    0    2    1    0    2    1    2    1    2
   N    0    1    1    3    2    4    3    4    3
   A    0    2    1    2    5    4    6    5    6
   N    0    1    1    3    4    7    6    8    7
   A    0    2    1    2    5    6    9    8   10
   N    0    1    1    3    4    7    8   11   10
   R    0    0    0    2    3    6    7   10   10
   A    0    2    1    1    4    5    8    9   12
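The scoring matrix for the two sequences on this slide (rows ANANANRA, columns AYNANANA) can be reproduced with the standard Smith-Waterman recurrence. The scoring values used here (match +2, mismatch -1, linear gap -1) are not stated in the slides; they are inferred because they are consistent with the values shown. The function name is hypothetical, and traceback to recover the alignment itself is omitted.

```python
def smith_waterman_matrix(rows, cols, match=2, mismatch=-1, gap=-1):
    """Fill the local-alignment score matrix H for two sequences.

    H[i][j] = max(0,
                  H[i-1][j-1] + s(rows[i-1], cols[j-1]),  # match or mismatch
                  H[i-1][j] + gap,                        # gap in cols
                  H[i][j-1] + gap)                        # gap in rows
    """
    h = [[0] * (len(cols) + 1) for _ in range(len(rows) + 1)]
    for i, r in enumerate(rows, start=1):
        for j, c in enumerate(cols, start=1):
            diag = h[i - 1][j - 1] + (match if r == c else mismatch)
            h[i][j] = max(0, diag, h[i - 1][j] + gap, h[i][j - 1] + gap)
    return h


matrix = smith_waterman_matrix("ANANANRA", "AYNANANA")
best_score = max(max(row) for row in matrix)

for row in matrix:
    print(" ".join(f"{value:3d}" for value in row))
print("best local score:", best_score)
```

Filling the matrix costs time proportional to the product of the two sequence lengths, which is what makes the method computationally heavy and a natural fit for distribution across many nodes.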
Hadoop and EC2 Implementation
  • Amazon EC2, 250 machines
  • Runs in 10 minutes for a single sequence.
  • Runs in 24 hrs for an N x N comparison.
  • Cost: ~$40/hr
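As rough arithmetic on these figures, the sketch below assumes that ~$40/hr is the rate for the whole 250-machine cluster and that cost scales linearly with run time (ignoring any rounding up to whole billing hours). Both are assumptions; the slide does not say how the figure was derived.

```python
CLUSTER_COST_PER_HOUR_USD = 40.0  # assumed to cover all 250 machines
MACHINES = 250


def run_cost(hours):
    """Estimated cost of a run, assuming simple linear pro-rata pricing."""
    return CLUSTER_COST_PER_HOUR_USD * hours


single_query_cost = run_cost(10 / 60)  # 10-minute single-sequence run
n_by_n_cost = run_cost(24)             # 24-hour N x N comparison
per_machine_hour = CLUSTER_COST_PER_HOUR_USD / MACHINES

print(f"single query: ${single_query_cost:.2f}")
print(f"N x N run:    ${n_by_n_cost:.2f}")
print(f"per machine-hour: ${per_machine_hour:.2f}")
```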
Ready for What’s Next….
  • Hadoop provides an accessible “big data” infrastructure:
      - Easy to administer
      - Flexible hardware demands
      - Easy to use
  • Reduces design and implementation impediments by orders of magnitude.
  • With next-generation sequencing systems, each with custom applications, the amount of sequence data is growing at an exponential rate.
  • Wide array of applications, bioinformatics being just one. This results in a strong, fast-moving technical community and talent pool.
