Hw09   Protein Alignment

    Presentation Transcript

    • Sequence Alignment and Hadoop. Hadoop World 2009, New York, Oct 2, 2009. Paul Brown, Associate, Booz Allen Hamilton Inc., 134 National Business Parkway, Annapolis Junction, MD 20701. Tel (301) 543-4665 [email_address]
    • The Impact of Hadoop
      • Booz Allen Hamilton, a leading strategy and technology consulting firm, works with clients to deliver results that endure. Every day, government agencies, corporations, institutions, and not-for-profit organizations rely on Booz Allen’s expertise and objectivity, and on the combined capabilities and dedication of our exceptional people to find solutions and seize opportunities.
      • Hadoop dramatically lowers the cost of entry to distributed computing and opens up a wide range of computational applications that were previously out of reach of those who need them.
      • These applications are frequently faster, cheaper, and more accurate than their predecessors, which completely changes the playing field.
      • The intention of this work, and this talk, is to explore how technology like Hadoop can have a game-changing impact on bioinformatics.
    • Biological Information
    • Biology + Computer Science = Bioinformatics (source: http://bioinformatics.ubc.ca/about/what_is_bioinformatics) [Figure: example sequences written in single-letter codes]
    • Bioinformatics: The Pain
      • We are obtaining biological data at an ever-increasing rate: an exponential curve tilting steeply toward vertical
      • Converting that data into usable information is proceeding, but not quite keeping pace with acquisition
      • Leveraging all that information to create knowledge is an open challenge, lagging far behind our rate of data collection
      • Unique opportunities abound in creating an environment that enables true understanding of this rich sea of data: Hadoop promises to be such an environment
      Why Hadoop:
      • Hadoop pairs a scalable distributed file system for data storage with an easily accessible analytic framework for processing.
      • Used “on demand” with a cloud provider, it makes the most of available resources.
    • So What? Querying a database of sequences for similar sequences
      • “One to Many” comparisons
        • ~58,000 proteins in the PDB (Protein Data Bank).
        • Protein alignment is frequently used in the development of medicines.
        • Looking for a certain sequence across species helps indicate function.
      • Implementation in Hadoop:
        • Distribute database sequences across each node
        • Send the query sequence with the MapReduce job (or via the distributed cache)
        • Scales well with number of nodes
        • Existing algorithms port easily
        • Individual sequences can’t be too long
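
The “one to many” layout above can be sketched in plain Python, without Hadoop itself. This is a minimal sketch under stated assumptions: `map_one_to_many`, `top_hits`, and the toy `shared_kmers` scorer are hypothetical names that do not come from the talk. In a real job the query would reach each node through the job configuration or the distributed cache, and the scorer would be a real alignment algorithm.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (sequence id, sequence); hypothetical record layout
Hit = Tuple[str, int]     # (sequence id, score)


def shared_kmers(a: str, b: str, k: int = 3) -> int:
    """Toy similarity: number of distinct length-k substrings in common.

    Stand-in for a real alignment score such as Smith-Waterman.
    """
    kmers_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    kmers_b = {b[i:i + k] for i in range(len(b) - k + 1)}
    return len(kmers_a & kmers_b)


def map_one_to_many(
    query: str,
    records: Iterable[Record],
    score: Callable[[str, str], int] = shared_kmers,
) -> Iterator[Hit]:
    """Map step: a node scores the query against its local slice of the database."""
    for seq_id, seq in records:
        yield seq_id, score(query, seq)


def top_hits(scored: Iterable[Hit], n: int = 2) -> List[Hit]:
    """Reduce step: merge per-node results and keep the n best-scoring ids."""
    return sorted(scored, key=lambda hit: hit[1], reverse=True)[:n]


# Two simulated "nodes", each holding part of the database.
node_a = [("p1", "AYNANANA"), ("p2", "RRRRYYYY")]
node_b = [("p3", "ANANANRA")]
query = "NANAN"

scored = [hit for node in (node_a, node_b) for hit in map_one_to_many(query, node)]
best = top_hits(scored)
```

Because every database sequence is scored independently, adding nodes only shrinks each node's slice, which is why this pattern scales with the number of nodes.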
    • So What? Comparing sequences in bulk
      • “Many to many”:
        • DNA hybridization (reconstruction)
      • Hadoop Implementation
        • If the whole dataset fits on one computer: use the distributed cache and assign each node a piece of the list.
        • But if the whole dataset does not fit on one computer…
      [Figure: reconstructed sequence]
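
For the case where the whole dataset fits on one computer, the idea of caching the full list everywhere and giving each node a piece of it might look like the sketch below. The names (`node_rows`, `compare_slice`, `identity`) and the round-robin split are assumptions for illustration, not details from the talk; `identity` is a toy score standing in for a real alignment.

```python
from typing import Callable, Iterator, List, Tuple


def node_rows(n_sequences: int, node: int, n_nodes: int) -> range:
    """Indices owned by one node: a round-robin slice of the list."""
    return range(node, n_sequences, n_nodes)


def compare_slice(
    cached: List[str],
    node: int,
    n_nodes: int,
    score: Callable[[str, str], int],
) -> Iterator[Tuple[int, int, int]]:
    """Compare each sequence this node owns against every later one in the cached list.

    Only pairs with i < j are emitted, so each unordered pair is scored once.
    """
    for i in node_rows(len(cached), node, n_nodes):
        for j in range(i + 1, len(cached)):
            yield i, j, score(cached[i], cached[j])


def identity(a: str, b: str) -> int:
    """Toy score: number of positions where the two sequences agree."""
    return sum(x == y for x, y in zip(a, b))


# The full list is available on every node (the distributed-cache role).
cached = ["AYNA", "ANAN", "AYNN", "RNAA"]
n_nodes = 2
results = [
    r for node in range(n_nodes) for r in compare_slice(cached, node, n_nodes, identity)
]
pairs = sorted((i, j) for i, j, _ in results)
```

The round-robin split is a design choice: early indices have more partners than late ones, so interleaving them spreads the work more evenly than contiguous blocks would.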
    • What if the dataset of sequences doesn't fit on one machine?
        • “Pre-join” all possible pairs with one MapReduce job
        • Once pairs are pre-computed, alignment algorithms can be applied relatively easily
      [Figure: input sequences A, B, C, D are pre-joined by one MapReduce job into the pairs AB, AC, AD, BC, BD, CD; a second MapReduce job applies alignment algorithms 1 through N to the pre-joined data to produce the alignment results.]
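
The two-stage flow can be simulated in a few lines. This is only a sketch: `pre_join` and `align_all` are hypothetical names, `itertools.combinations` stands in for the pre-join MapReduce job, and the two scoring functions are toy placeholders. How a real job would partition the pairs across machines is not described in the slides and is not modeled here.

```python
from itertools import combinations
from typing import Callable, Dict, Iterable, Iterator, Tuple


def pre_join(sequences: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Stage 1: emit every unordered pair of input records exactly once."""
    yield from combinations(sequences, 2)


def align_all(
    pairs: Iterable[Tuple[str, str]],
    algorithms: Dict[str, Callable[[str, str], int]],
) -> Iterator[Tuple[str, str, str, int]]:
    """Stage 2: apply each named alignment algorithm to each pre-joined pair."""
    for a, b in pairs:
        for name, algorithm in algorithms.items():
            yield name, a, b, algorithm(a, b)


def identity(a: str, b: str) -> int:
    """Toy score: number of positions where the two sequences agree."""
    return sum(x == y for x, y in zip(a, b))


def length_gap(a: str, b: str) -> int:
    """Toy score: difference in sequence length."""
    return abs(len(a) - len(b))


# The labels from the slide: A, B, C, D become AB, AC, AD, BC, BD, CD.
labels = ["".join(pair) for pair in pre_join(["A", "B", "C", "D"])]

sequences = ["AYNA", "ANAN", "AYNANA"]
pairs = list(pre_join(sequences))
results = list(align_all(pairs, {"identity": identity, "length_gap": length_gap}))
```

Pre-joining grows the data from n records to n(n-1)/2 pairs, so the trade-off is more storage in exchange for a second stage in which every record is independent.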
    • So What? Analyzing really big sequences
      • One big sequence to many small sequences:
        • Scanning DNA for structures that may or may not indicate function
        • Population Genetics
      • Hadoop Implementation:
        • Sequences can be billions of characters in length
        • Distribute pieces of the larger sequence across machines
        • Similar to the “many to many” solution, except that each piece must keep track of the original sequence from which it came.
        • May need a method for combining local scores into an overall score.
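
One way to picture the bookkeeping is to tag every piece with its source id and offset, then merge local scores afterwards. The sketch below is an illustration under assumptions that are not in the talk: the names `chunk_sequence` and `best_per_sequence` are hypothetical, pieces overlap so that a match straddling a boundary is not lost, and "overall score" is taken to mean the best local score, which is only one possible combining rule.

```python
from typing import Dict, Iterable, Iterator, Tuple

Piece = Tuple[str, int, str]   # (source id, offset in source, piece text)
Scored = Tuple[str, int, int]  # (source id, offset in source, local score)


def chunk_sequence(seq_id: str, seq: str, size: int, overlap: int) -> Iterator[Piece]:
    """Split one long sequence into pieces that share `overlap` characters."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    for offset in range(0, max(len(seq) - overlap, 1), step):
        yield seq_id, offset, seq[offset:offset + size]


def best_per_sequence(scored: Iterable[Scored]) -> Dict[str, Tuple[int, int]]:
    """Reduce step: keep the highest local score per source and where it occurred."""
    best: Dict[str, Tuple[int, int]] = {}
    for seq_id, offset, score in scored:
        if seq_id not in best or score > best[seq_id][1]:
            best[seq_id] = (offset, score)
    return best


chunks = list(chunk_sequence("chr", "ACGTACGTAC", size=4, overlap=2))

# Toy local score: occurrences of the motif "TA" inside each piece.
scored = [(seq_id, offset, piece.count("TA")) for seq_id, offset, piece in chunks]
best = best_per_sequence(scored)
```

The overlap should be at least as long as the longest small sequence being searched for, minus one, otherwise a hit can still fall between two pieces.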
    • Demonstration Implementation: Smith-Waterman Alignment
      • One of the more computationally intensive matching and alignment techniques
      • Provides both a match score and an alignment
      • Optimizations exist which were not implemented
      • Implemented both the “one to many” alignment and the “many to many” alignment
    • Smith-Waterman Algorithm
      Example scoring matrix (rows: A N A N A N R A; columns: A Y N A N A N A; “-” is the empty prefix):
            -   A   Y   N   A   N   A   N   A
        -   0   0   0   0   0   0   0   0   0
        A   0   2   1   0   2   1   2   1   2
        N   0   1   1   3   2   4   3   4   3
        A   0   2   1   2   5   4   6   5   6
        N   0   1   1   3   4   7   6   8   7
        A   0   2   1   2   5   6   9   8  10
        N   0   1   1   3   4   7   8  11  10
        R   0   0   0   2   3   6   7  10  10
        A   0   2   1   1   4   5   8   9  12
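
The matrix on this slide can be reproduced with a basic, unoptimized Smith-Waterman implementation. The scoring values used below (match +2, mismatch -1, linear gap penalty -1) are inferred from the numbers in the matrix, not stated in the talk, and the function is a sketch rather than the speaker's code.

```python
from typing import List, Tuple


def smith_waterman(
    a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -1
) -> Tuple[List[List[int]], int, str, str]:
    """Local alignment of a (rows) against b (columns).

    Returns the score matrix H, the best score, and the two aligned strings.
    H has len(a) + 1 rows and len(b) + 1 columns; row 0 and column 0 are zeros.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill: each cell is the best of "start over", diagonal, up, or left.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + sub, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Traceback: walk from the best cell until a zero cell is reached.
    i, j = best_pos
    out_a: List[str] = []
    out_b: List[str] = []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return H, best, "".join(reversed(out_a)), "".join(reversed(out_b))


row_seq = "ANANANRA"  # row labels of the slide's matrix
col_seq = "AYNANANA"  # column labels of the slide's matrix
H, best, aligned_rows, aligned_cols = smith_waterman(row_seq, col_seq)
```

With these inputs the best score is the 12 in the bottom-right cell of the matrix. The fill step costs time proportional to the product of the two sequence lengths, which is why the technique is expensive and why distributing the comparisons is attractive.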
    • Hadoop and EC2 Implementation
      • Amazon EC2
      • 250 Machines
      • Runs in 10 minutes for a single sequence; runs in 24 hours for an N×N comparison
      • Cost ~$40/hr
    • Ready for What’s Next….
      • Hadoop provides an accessible “big data” infrastructure
        • Easy to administer
        • Flexible hardware demands
        • Easy to use
      • Reduces design and implementation impediments by orders of magnitude.
      • With next-generation sequencing systems, each with custom applications, the amount of sequence data is growing at an exponential rate.
      • Wide array of applications, of which bioinformatics is just one; this results in a strong, fast-moving technical community and talent pool.