Everything comes in 3's


A talk given at BioIT World conference 2010 Cloud Computing Workshop

  • REFERENCE: Semantic-based Distributed I/O with the ParaMEDIC Framework. P. Balaji, W. Feng, H. Lin. ACM/IEEE International Symposium on High-Performance Distributed Computing, April 2008. http://www.mpiblast.org/About/Publications

    1. Everything Comes in 3's
       Angel Pizarro
       Director, ITMAT Bioinformatics Facility
       University of Pennsylvania School of Medicine
    2. Outline
       This talk looks at the practical aspects of cloud computing, diving into specific examples:
       - 3 pillars of systems design
       - 3 storage implementations
       - 3 areas of bioinformatics, and how they are affected by clouds
       - 3 interesting internal projects
       There are 2 hard problems in computer science: caching, naming, and off-by-1 errors.
    3. Pillars of Systems Design
       - Provisioning: API access (AWS, Microsoft, RackSpace, GoGrid, etc.). Not discussed further, since this is the WHOLE POINT of cloud computing.
       - Configuration: how to get a system to the point where you can do something with it.
       - Command and Control: how to tell the system what to do.
    4. System Configuration with Chef
       - Automatic installation of packages, service configuration, and initialization.
       - Specifications use a real programming language with known behavior.
       - Brings the system to an idempotent state.
       http://opscode.com/chef/
    5. Chef Recipes & Cookbooks
       - The specification for installing and configuring a system component.
       - Able to support more than one platform.
       - Has access to system-wide information: hostname, IP address, RAM, # of processors, etc.
       - Contains templates, documentation, static files & assets.
       - Can define dependencies on other recipes.
       - Recipes are executed in order; execution stops at the first failure.
    6. Simple Recipe: Rsync
       - Installs rsync on the system.
       - A metadata file states which platforms are supported.
       - Note that Chef is a Linux-centric system.
       - BUT, the WikiWiki is MessyMessy; look at Chef Solo & Resources.
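The slide's code is not reproduced in this transcript; a minimal sketch of what such an rsync cookbook looks like in Chef's Ruby DSL (the file layout and platform list here are assumptions, not the slide's actual code):

```ruby
# cookbooks/rsync/recipes/default.rb -- install the rsync package
package "rsync" do
  action :install
end

# cookbooks/rsync/metadata.rb -- declare which platforms the recipe supports
%w{ ubuntu debian centos redhat fedora }.each do |os|
  supports os
end
```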
    7. More Complex Recipe: Heartbeat
       - Installs the heartbeat package.
       - Registers the service and specifies that it can be restarted and provides a status message.
       - Finally, it starts the service.
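The slide's recipe is not in the transcript, but the three steps above map onto Chef's package and service resources; a sketch of the shape it likely takes:

```ruby
# cookbooks/heartbeat/recipes/default.rb
# Install the heartbeat package.
package "heartbeat"

# Register the service, stating that it supports restart and status,
# then enable it at boot and start it.
service "heartbeat" do
  supports :restart => true, :status => true
  action [:enable, :start]
end
```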
    8. Command and Control
       Traditional grid computing:
       - QSUB: SGE, PBS, Torque
       - Usually requires tightly coupled and static systems: shared file systems, firewalls, user accounts, shared exe & lib locations.
       - Best for capability processes (e.g. MPI).
       Map-Reduce is the new hotness:
       - Best for data-parallel processes.
       - Assumes loosely coupled, non-static components.
       - Job staging is a critical component.
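For contrast, a traditional grid submission is an SGE-style job script handed to qsub (the job name, parallel environment name, and slot count here are illustrative; these are site-specific):

```shell
#!/bin/bash
#$ -N mpi_job          # job name
#$ -cwd                # run in the submission directory (assumes a shared FS)
#$ -pe mpi 16          # request 16 slots in the "mpi" parallel environment

mpirun -np 16 ./my_mpi_app input.dat
```

Submitted with `qsub job.sh`; note how much the script assumes about the cluster (shared storage, a working MPI install on every node), which is exactly the tight coupling described above.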
    9. Map-Reduce in a Nutshell
       - Algorithm pioneered by Google for distributed data analysis.
       - Data-parallel analyses fit well into this model: split the data, work on each part in parallel, then merge the results.
       - Implementations: Hadoop, Disco, CloudCrowd, ...
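The split / map / merge pattern above can be illustrated with a toy word count in Ruby (the classic map-reduce example; the data and names here are illustrative, not from the talk):

```ruby
# Toy illustration of the split / map / merge pattern.
docs = ["the cat sat", "the dog sat", "the cat ran"]

# Map: each document emits (word, 1) pairs; in a real cluster each
# document would be processed on a different node.
mapped = docs.flat_map { |doc| doc.split.map { |word| [word, 1] } }

# Reduce/merge: group by key and sum the counts.
counts = Hash.new(0)
mapped.each { |word, n| counts[word] += n }

puts counts.inspect
```

The distributed frameworks handle the hard parts this toy skips: shipping the map work to nodes, shuffling pairs by key, and tolerating node failure.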
    10. Serial Execution of Proteomics Search (figure)
    11. Parallel Proteomics Search (figure)
    12. Roll-Your-Own MR on AWS
        - Define small scripts to (1) split a FASTA file and (2) run a BLAT search. The first script defines the inputs of the second.
        - Submit the input FASTA to S3.
        - Start a master node as the central communication hub.
        - Start slave nodes, configured to ask the master for work and save results back to S3.
        - Press "Play".
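The first step, splitting a FASTA file into work units, can be sketched like this (function name, record format, and chunk size are assumptions; the talk's actual scripts are not in the transcript):

```ruby
# Split a list of FASTA records into fixed-size chunks; each chunk
# becomes one BLAT work unit. In practice the records would be streamed
# from a file and each chunk uploaded to S3.
def split_fasta(records, chunk_size)
  records.each_slice(chunk_size).to_a
end

# Fake records in ">header\nsequence" form, for illustration.
records = (1..10).map { |i| ">seq#{i}\nACGTACGT" }
chunks  = split_fasta(records, 4)
puts chunks.length  # number of BLAT work units
```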
    13. Workflow of Distributed BLAT (diagram)
        - From the PC, upload inputs to S3 and submit the BLAT job.
        - Boot master & slaves.
        - The initial process splits the FASTA file; subsequent jobs BLAT the smaller files, and each result is saved to S3 as it goes.
        - Download results from S3 back to the PC.
    14. Master Node => Resque
        - Background job processing framework developed by GitHub.
        - Jobs are attached to a class from your application and stored as JSON.
        - Uses the Redis key-value store.
        - Simple front end for viewing job queue status and failed jobs.
        - Resque can invoke any class that has a class method perform().
        http://github.com/defunkt/resque
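The perform() convention above means a worker job is just a plain Ruby class; a minimal sketch (the class name, queue name, and BLAT-specific details are illustrative, not the talk's code):

```ruby
# A minimal Resque-style job: any class with a class-level perform()
# can be enqueued. Resque serializes the arguments to JSON in Redis,
# and a worker later calls BlatJob.perform(chunk_key).
class BlatJob
  @queue = :blat  # Resque reads this to pick the queue

  def self.perform(chunk_key)
    # A real job would fetch the chunk from S3 and run BLAT on it.
    "blat search on #{chunk_key}"
  end
end

# Enqueuing (requires a running Redis and the resque gem):
#   Resque.enqueue(BlatJob, "chunk-0001.fa")
puts BlatJob.perform("chunk-0001.fa")
```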
    15. The scripts
    16. Storage in the Cloud: S3
        - Permanent storage for your data; pay as you go for transmission and holding.
        - Eliminates backups.
        - Pretty good CDN; able to hook into a better CDN SLA via CloudFront.
        - Can be slow at times: reports of 10-second delays, but the average response is 300 ms.
    17. S3 Costs
    18. Storage 2: Distributed FS on EC2
        - Hadoop HDFS, GigaSpaces, etc.
        - Network latency may be an issue for traditional DFSs (Gluster, GPFS, etc.).
        - Tighter integration with the execution framework, better performance?
    19. DFS on EC2 m1.xlarge Costs
        * Does not take into account transmission fees or data redundancy. Final cost is probably >= S3.
    20. Storage 3: Memory Grids
        - "RAM is the new disk": application-level RAM clustering.
        - Terracotta, GemStone GemFire, Oracle, Cisco, GigaSpaces.
        - Performance for capability jobs?
        * There is also a "disk is the new RAM" camp, where redundant disk is used to mitigate seek times on subsequent reads.
    21. Memory Grid Cost
        Take-home message: unless your needs are small, you may be better off procuring bare-metal resources.
    22. Cloud Influence on Bioinformatics
        - Computational biology: algorithms will need to account for large I/O latency; statistical tests will need to account for incomplete information or incremental results.
        - Software engineering: built-for-the-cloud algorithms are popping up; CloudBurst is a featured example in AWS EMR!
        - Application to life sciences: deploy ready-made images for use (Cycle Computing, ViPDAC, others soon to follow).
    23. Algorithms Need to Be I/O-Centric
        Incur a slightly higher computational burden to reduce I/O across non-optimal networks (P. Balaji, W. Feng, H. Lin 2008).
    24. Some Internal Projects
        - Resource Manager: service for on-demand provisioning and release of EC2 nodes. Utilizes Chef to define and apply roles (compute node, DB server, etc.). Terminates idle compute nodes at 52 minutes.
        - Workflow Manager: defines and executes data analysis workflows. Relies on the RM to provision nodes; once appropriate worker nodes are available, acts as the central work queue.
        - RUM (RNA-Seq Ultimate Mapper): a Map-Reduce RNA-Seq analysis pipeline. Combines Bowtie + BLAT and feeds results into a decision tree for more accurate mapping of sequence reads.
    25. Bowtie Alone (figure)
    26. RUM (Bowtie + BLAT + processing)
        Significantly increases the confidence of your data.
    27. RUM Costs
        - Computational cost: ~$100-$200 (6-8 hours per lane on m2.4xlarge at $2.40/hour).
        - Cost of reagents: ~$10,000, so compute is about 1% of the total.
    28. Acknowledgements
        Garret FitzGerald, Ian Blair, John Hogenesch, Greg Grant, Tilo Grosser
        NIH & UPENN for support
        My team: David Austin, Andrew Brader, Weichen Wu
        Rate me! http://speakerrate.com/talks/3041-everything-comes-in-3-s