Everything comes in 3's
A talk given at BioIT World conference 2010 Cloud Computing Workshop

Published in: Technology

  • REFERENCE Semantic-based Distributed I/O with the ParaMEDIC Framework
P. Balaji, W. Feng, H. Lin
ACM/IEEE International Symposium on High-Performance Distributed Computing,
April 2008.
  • Transcript

    • 1. Everything Comes in 3’s
      Angel Pizarro
      Director, ITMAT Bioinformatics Facility
      University of Pennsylvania School of Medicine
    • 2. Outline
      This talk looks at the practical aspects of Cloud Computing
      We will be diving into specific examples
      3 pillars of systems design
      3 storage implementations
      3 areas of bioinformatics
      And how they are affected by clouds
      3 interesting internal projects
      There are 2 hard problems in computer science: caching, naming, and off-by-1 errors
    • 3. Pillars of Systems Design
      API access (AWS, Microsoft, RackSpace, GoGrid, etc.)
      Not discussing further, since this is the WHOLE POINT of cloud computing.
      System Configuration
      How to get a system up to the point where you can do something with it
      Command and Control
      How to tell the system what to do
    • 4. System Configuration with Chef
      Automatic installation of packages, service configuration and initialization
      Specifications use a real programming language with known behavior
      Brings the system to the desired state idempotently (repeated runs change nothing further)
    • 5. Chef Recipes & Cookbooks
      The specification for installing and configuring a system component
      Able to support more than one platform
      Has access to system-wide information
      hostname, IP addr, RAM, # processors, etc.
      Contain templates, documentation, static files & assets
      Can define dependencies on other recipes
      Executed in order, execution stops at first failure
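
The behavior described on slides 4 and 5 (idempotent runs, ordered execution, stop at first failure) can be illustrated with a toy model. This is a minimal Python sketch and not Chef's actual API or DSL (Chef recipes are written in Ruby); the `Resource` class, the `converge` function, and the in-memory `system` dictionary are all hypothetical names invented for illustration.

```python
"""Toy model of an ordered, idempotent configuration run (not Chef's API)."""


class Resource:
    """A hypothetical resource that checks current state before acting."""

    def __init__(self, name, is_satisfied, apply):
        self.name = name
        self.is_satisfied = is_satisfied  # callable returning True if nothing to do
        self.apply = apply                # callable that changes the system


def converge(resources):
    """Apply resources in order and return the names of those that changed state.

    An exception raised by any resource propagates, so later resources never run.
    """
    changed = []
    for res in resources:
        if res.is_satisfied():
            continue  # already in the desired state: do nothing
        res.apply()
        changed.append(res.name)
    return changed


# An in-memory "system" stands in for a real package manager and init system.
system = {"packages": set(), "services": set()}

recipe = [
    Resource("install rsync",
             lambda: "rsync" in system["packages"],
             lambda: system["packages"].add("rsync")),
    Resource("start heartbeat",
             lambda: "heartbeat" in system["services"],
             lambda: system["services"].add("heartbeat")),
]

first_run = converge(recipe)   # both resources need to act
second_run = converge(recipe)  # nothing left to change
print(first_run)
print(second_run)
```

Running the same recipe twice changes the system only on the first pass, which is the property that makes repeated configuration runs safe.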
    • 6. Simple Recipe : Rsync
      Install rsync to the system
      Metadata file states what platforms are supported
      Note that Chef is a Linux-centric system
      BUT, the WikiWiki is MessyMessy
      Look at Chef Solo & Resources
    • 7. More Complex Recipe: Heartbeat
      Installs heartbeat package
      Registers the service, specifies that it can be restarted, and provides a status message
      Finally it starts the service
    • 8. Command and Control
      Traditional grid computing
      QSUB – SGE, PBS, Torque
      Usually requires tightly coupled and static systems
      Shared file systems, firewalls, user accounts, shared exe & lib locations
      Best for capability processes (e.g. MPI)
      Map-Reduce is the new hotness
      Best for data-parallel processes
      Assumes loosely coupled non-static components
      Job staging is a critical component
    • 9. Map Reduce in a Nutshell
      Algorithm pioneered by Google for distributed data analysis
      Data-parallel analysis fits well into this model
      Split data, work on each part in parallel, then merge results
      Hadoop, Disco, CloudCrowd, …
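
The split / work-in-parallel / merge pattern from slide 9 can be sketched in a few lines of Python. This is an illustration only: it uses threads on one machine in place of separate cluster nodes, and the function names (`split`, `map_chunk`, `reduce_counts`) and the toy sequences are assumptions, not part of the talk.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split(records, n_parts):
    """Split a list into n_parts roughly equal chunks."""
    return [records[i::n_parts] for i in range(n_parts)]


def map_chunk(chunk):
    """Map step: count nucleotide occurrences in one chunk of sequences."""
    counts = Counter()
    for seq in chunk:
        counts.update(seq)
    return counts


def reduce_counts(partials):
    """Reduce step: merge the per-chunk counts into one total."""
    total = Counter()
    for partial in partials:
        total += partial
    return total


sequences = ["ACGT", "AACC", "GGTT", "ACAC"]

# Each chunk is processed independently; a real framework would ship
# chunks to different machines instead of threads.
with ThreadPoolExecutor(max_workers=2) as pool:
    partials = list(pool.map(map_chunk, split(sequences, 2)))

totals = reduce_counts(partials)
print(dict(totals))
```

Because each chunk is processed without reference to the others, the map step scales out by adding workers; only the final merge needs to see every partial result.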
    • 10. Serial Execution of Proteomics Search
    • 11. Parallel Proteomics Search
    • 12. Roll-Your-Own MR on AWS
      Define small scripts to
      Split a FASTA file
      Run a BLAT search
      The first script defines the inputs of the second
      Submit the input FASTA to S3
      Start a master node as the central communication hub
      Start slave nodes, configured to ask for work from master and save results back to S3
      Press “Play”
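
The first of the two scripts on slide 12, splitting a FASTA file, might look roughly like the sketch below. The actual scripts from the talk are not reproduced in this transcript, so the function names, the chunk size, and the example records here are assumptions made for illustration.

```python
def read_fasta(lines):
    """Yield (header, sequence) records from an iterable of FASTA lines."""
    header, seq = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line, []
        else:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def split_fasta(lines, records_per_chunk):
    """Group FASTA records into chunks of at most records_per_chunk."""
    chunks, current = [], []
    for record in read_fasta(lines):
        current.append(record)
        if len(current) == records_per_chunk:
            chunks.append(current)
            current = []
    if current:
        chunks.append(current)  # keep the final, possibly smaller chunk
    return chunks


example = """>seq1
ACGT
ACGT
>seq2
GGCC
>seq3
TTAA
""".splitlines()

chunks = split_fasta(example, records_per_chunk=2)
for i, chunk in enumerate(chunks):
    print(i, [header for header, _ in chunk])
```

Each chunk would then become the input of one search job, which is how the first script defines the inputs of the second.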
    • 13. Workflow of Distributed BLAT
      Boot master & slaves
      Submit the BLAT job
      Initial process splits the FASTA file. Subsequent jobs BLAT the smaller files and save each result as they go
      Upload inputs
      Download results
    • 14. Master Node => Resque
      GitHub-developed background job processing framework
      Jobs attached to a class from your application, stored as JSON
      Uses Redis key-value store
      Simple front end for viewing job queue status and failed jobs
      Resque can invoke any class that has a class method “perform()”
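
The queueing model described on slide 14 (jobs stored as JSON, each naming a class with a `perform` class method) can be sketched as follows. This is not Resque itself, which is a Ruby library backed by Redis; the sketch is Python with an in-memory `deque` standing in for the Redis list, and `BlatChunk`, `JOB_CLASSES`, `enqueue`, and `work_one` are hypothetical names.

```python
import json
from collections import deque


class BlatChunk:
    """Hypothetical job class; a real one would run a search on the chunk."""

    results = []

    @classmethod
    def perform(cls, chunk_name):
        cls.results.append(f"searched {chunk_name}")


JOB_CLASSES = {"BlatChunk": BlatChunk}  # registry: class name -> class
queue = deque()                         # stand-in for a Redis list


def enqueue(job_class, *args):
    """Store the job as a JSON string naming the class and its arguments."""
    queue.append(json.dumps({"class": job_class.__name__, "args": list(args)}))


def work_one():
    """Pop one job, look up its class, and call perform(). False if queue is empty."""
    if not queue:
        return False
    job = json.loads(queue.popleft())
    JOB_CLASSES[job["class"]].perform(*job["args"])
    return True


enqueue(BlatChunk, "chunk_000.fa")
enqueue(BlatChunk, "chunk_001.fa")
while work_one():
    pass
print(BlatChunk.results)
```

Keeping the payload as plain JSON means the master only needs to hold class names and arguments, while the workers hold the code that does the work.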
    • 15. The scripts
    • 16. Storage in the Cloud : S3
      Permanent storage for your data
      Pay as you go for transmission and holding
      Eliminates backups
      Pretty good CDN
      Able to hook into better CDN SLA via CloudFront
      Can be slow at times
      Reports of 10-second delays, but the average response time is 300 ms
      [Diagram label: Your Data]
    • 17. S3 Costs
    • 18. Storage 2: Distributed FS on EC2
      Hadoop HDFS, Gigaspaces, etc.
      Network latency may be an issue for traditional DFSs
      Gluster, GPFS, etc.
      Tighter integration with execution framework, better performance?
      [Diagram labels: Your Data; EC2 Node (x4); EC2 Node Disk]
    • 19. DFS on EC2 m1.xlarge Costs
      * Does not take into account transmission fees or data redundancy. Final cost is probably >= S3
    • 20. Storage 3: Memory Grids
      “RAM is the new Disk”
      Application level RAM clustering
      Terracotta, Gemstone Gemfire, Oracle, Cisco, Gigaspaces
      Performance for capability jobs?
      [Diagram labels: Your Data; EC2 RAM (x6)]
      * There are also the “Disk is the new RAM” groups, where redundant disk is used to mitigate seek times on subsequent reads
    • 21. Memory Grid Cost
      Take home message: Unless your needs are small, you may be better off procuring bare-metal resources
    • 22. Cloud Influence on Bioinformatics
      Computational Biology
      Algorithms will need to account for large I/O latency
      Statistical tests will need to account for incomplete information, or incremental results
      Software Engineering
      Built-for-the-cloud algorithms are popping up
      CloudBurst is a featured example in AWS EMR!
      Application to Life Sciences
      Deploy ready-made images for use
      Cycle Computing, ViPDAC, others soon to follow
    • 23. Algorithms need to be I/O centric
      Incur a slightly higher computational burden to reduce I/O across non-optimal networks
      P. Balaji, W. Feng, H. Lin 2008
    • 24. Some Internal Projects
      Resource Manager
      Service for on-demand provisioning and release of EC2 nodes
      Utilizes Chef to define and apply roles (compute node, DB server, etc)
      Terminates idle compute nodes at 52 minutes
      Workflow Manager
      Defines and executes data analysis workflows
      Relies on RM to provision nodes
      Once appropriate worker nodes are available, acts as the central work queue
      RNA-Seq Ultimate Mapper (RUM)
      Map Reduce RNA-Seq analysis pipeline
      Combines Bowtie + BLAT and feeds results into a decision tree for more accurate mapping of sequence reads
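
The Resource Manager rule "terminates idle compute nodes at 52 minutes" could be implemented along these lines. The slide does not say what the 52 minutes are measured from; this sketch assumes it means 52 minutes into each hour-long period since launch, and the function names and timestamps are hypothetical.

```python
from datetime import datetime, timedelta

TERMINATE_AT_MINUTE = 52  # from the slide; the interpretation is an assumption


def minutes_into_current_hour(launched_at, now):
    """Minutes elapsed in the current hour-long period since launch."""
    elapsed = (now - launched_at).total_seconds()
    return (elapsed % 3600) / 60


def should_terminate(launched_at, now, is_idle):
    """True for an idle node that has passed the cutoff in its current hour."""
    if not is_idle:
        return False
    return minutes_into_current_hour(launched_at, now) >= TERMINATE_AT_MINUTE


launch = datetime(2010, 4, 20, 9, 0, 0)
checks = [
    should_terminate(launch, launch + timedelta(minutes=30), True),
    should_terminate(launch, launch + timedelta(minutes=53), True),
    should_terminate(launch, launch + timedelta(minutes=53), False),
    should_terminate(launch, launch + timedelta(hours=2, minutes=55), True),
]
print(checks)
```

Under this reading, an idle node is kept available for most of an hour in case new work arrives, and a busy node is never terminated regardless of the clock.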
    • 25. Bowtie Alone
    • 26. RUM (Bowtie + BLAT + processing)
      Significantly increases the confidence of your data
    • 27. RUM Costs
      Computational cost ~$100 - $200
      6-8 hours per lane on m2.4xlarge ($2.40 / hour)
      Cost of reagents ~= $10,000
      1% of total
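
The figures on slide 27 can be checked with a little arithmetic. The dollar amounts and hourly rate below are taken from the slide; the slide does not state how many instances or lanes the $100 - $200 figure covers, so the instance-hour numbers are only what that budget buys at the quoted rate.

```python
hourly_rate = 2.40                     # m2.4xlarge, USD per hour (from the slide)
compute_low, compute_high = 100, 200   # USD (from the slide)
reagents = 10_000                      # USD (from the slide)

# Compute cost as a share of the total (compute + reagents).
share_low = compute_low / (compute_low + reagents)
share_high = compute_high / (compute_high + reagents)

# Instance-hours that the compute budget buys at the quoted rate.
instance_hours_low = compute_low / hourly_rate
instance_hours_high = compute_high / hourly_rate

print(f"compute share of total: {share_low:.1%} to {share_high:.1%}")
print(f"instance-hours: {instance_hours_low:.1f} to {instance_hours_high:.1f}")
```

Either way, the computation is a small fraction of the cost of generating the data in the first place.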
    • 28. Acknowledgements
      Garret FitzGerald
      Ian Blair
      John Hogenesch
      Greg Grant
      Tilo Grosser
      NIH & UPENN for support
      My Team
      David Austin
      Andrew Brader
      Weichen Wu
      Rate me!