Everything comes in 3's

A talk given at the BioIT World conference 2010 Cloud Computing Workshop

  • REFERENCE: Semantic-based Distributed I/O with the ParaMEDIC Framework
P. Balaji, W. Feng, H. Lin
ACM/IEEE International Symposium on High-Performance Distributed Computing,
April 2008. http://www.mpiblast.org/About/Publications


  • 1. Everything Comes in 3’s
    Angel Pizarro
    Director, ITMAT Bioinformatics Facility
    University of Pennsylvania School of Medicine
  • 2. Outline
    This talk looks at the practical aspects of Cloud Computing
    We will be diving into specific examples
    3 pillars of systems design
    3 storage implementations
    3 areas of bioinformatics
    And how they are affected by clouds
    3 interesting internal projects
    There are 2 hard problems in computer science: caching, naming, and off-by-1 errors
  • 3. Pillars of Systems Design
    API access (AWS, Microsoft, RackSpace, GoGrid, etc.)
    Not discussing further, since this is the WHOLE POINT of cloud computing.
    System Configuration
    How to get a system up to the point you can do something with it
    Command and Control
    How to tell the system what to do
  • 4. System Configuration with Chef
    Automatic installation of packages, service configuration and initialization
    Specifications use a real programming language with known behavior
    Brings the system to a known state idempotently: re-running a recipe on a configured system changes nothing
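The idea of an idempotent configuration step can be sketched in a few lines. This is a minimal Python illustration, not Chef code (Chef recipes are written in Ruby); the helper `ensure_line`, the file name `rsyncd.conf` and the setting it writes are all hypothetical.

```python
import os
import tempfile


def ensure_line(path: str, line: str) -> bool:
    """Make sure `line` is present in the file at `path`.

    Returns True if the file had to be changed, False if it was already
    in the desired state.
    """
    existing = []
    if os.path.exists(path):
        with open(path) as handle:
            existing = handle.read().splitlines()
    if line in existing:
        return False  # already converged: do nothing
    with open(path, "a") as handle:
        handle.write(line + "\n")
    return True


# Hypothetical config file in a scratch directory.
config_path = os.path.join(tempfile.mkdtemp(), "rsyncd.conf")
first_run = ensure_line(config_path, "max connections = 4")
second_run = ensure_line(config_path, "max connections = 4")
print(first_run, second_run)
```

The first call changes the file and the second call is a no-op, which is the property that makes it safe to re-run a configuration pass on a node that is already set up.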
  • 5. Chef Recipes & Cookbooks
    The specification for installing and configuring a system component
    Able to support more than one platform
    Has access to system-wide information
    hostname, IP addr, RAM, # processors, etc.
    Contain templates, documentation, static files & assets
    Can define dependencies on other recipes
    Executed in order, execution stops at first failure
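The last two points (dependencies between recipes, ordered execution that stops at the first failure) can be sketched as follows. This is a simplified Python model, not Chef's implementation: the `Recipe` class, `resolve_order`, `converge`, and the dependency of a heartbeat recipe on an rsync recipe are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Sequence


class Recipe:
    """A named action plus the names of the recipes it depends on."""

    def __init__(self, name: str, action: Callable[[], None],
                 depends: Sequence[str] = ()) -> None:
        self.name = name
        self.action = action
        self.depends = list(depends)


def resolve_order(run_list: Sequence[str],
                  recipes: Dict[str, Recipe]) -> List[str]:
    """Expand dependencies depth-first so each recipe follows its dependencies.

    Dependency cycles are not detected in this sketch.
    """
    ordered: List[str] = []

    def visit(name: str) -> None:
        if name in ordered:
            return
        for dependency in recipes[name].depends:
            visit(dependency)
        ordered.append(name)

    for name in run_list:
        visit(name)
    return ordered


def converge(run_list: Sequence[str],
             recipes: Dict[str, Recipe]) -> List[str]:
    """Run recipes in order and stop at the first failure."""
    completed: List[str] = []
    for name in resolve_order(run_list, recipes):
        try:
            recipes[name].action()
        except Exception as error:
            print(f"{name} failed: {error}; stopping run")
            break
        completed.append(name)
    return completed


def fail() -> None:
    raise RuntimeError("package not found")


recipes = {
    "rsync": Recipe("rsync", lambda: None),
    "heartbeat": Recipe("heartbeat", lambda: None, depends=["rsync"]),
    "broken": Recipe("broken", fail),
    "monitoring": Recipe("monitoring", lambda: None),
}
completed = converge(["heartbeat", "broken", "monitoring"], recipes)
print(completed)
```

Here `rsync` runs before `heartbeat` because of the declared dependency, and `monitoring` never runs because `broken` fails first.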
  • 6. Simple Recipe : Rsync
    Install rsync to the system
    Metadata file states which platforms are supported
    Note that Chef is a Linux-centric system
    BUT, the WikiWiki is MessyMessy
    Look at Chef Solo & Resources
  • 7. More Complex Recipe: Heartbeat
    Installs heartbeat package
    Registers the service and specifies that it can be restarted and provides a status message
    Finally it starts the service
  • 8. Command and Control
    Traditional grid computing
    QSUB – SGE, PBS, Torque
    Usually requires tightly coupled and static systems
    Shared file systems, firewalls, user accounts, shared exe & lib locations
    Best for capability processes (e.g. MPI)
    Map-Reduce is the new hotness
    Best for data-parallel processes
    Assumes loosely coupled non-static components
    Job staging is a critical component
  • 9. Map Reduce in a Nutshell
    Algorithm pioneered by Google for distributed data analysis
    Data-parallel analyses fit well into this model
    Split data, work on each part in parallel, then merge results
    Hadoop, Disco, CloudCrowd, …
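The split / work in parallel / merge pattern can be sketched on a single machine. This is a toy Python illustration, not how Hadoop or the other frameworks are implemented; the function names and the residue-counting task are assumptions chosen only to keep the example small.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import List


def split(records: List[str], parts: int) -> List[List[str]]:
    """Split the input into `parts` roughly equal chunks."""
    return [records[i::parts] for i in range(parts)]


def map_chunk(chunk: List[str]) -> Counter:
    """Map step: count the letters in one chunk, independently of the others."""
    counts: Counter = Counter()
    for sequence in chunk:
        counts.update(sequence)
    return counts


def reduce_counts(partials: List[Counter]) -> Counter:
    """Reduce step: merge the per-chunk counts into one result."""
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)
    return total


sequences = ["ACGT", "AACC", "GGTT", "ACAC"]
with ThreadPoolExecutor(max_workers=2) as pool:
    partials = list(pool.map(map_chunk, split(sequences, 2)))
totals = reduce_counts(partials)
print(dict(totals))
```

Because each chunk is processed without reference to any other chunk, the map step can be moved to separate machines without changing the reduce step.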
  • 10. Serial Execution of Proteomics Search
  • 11. Parallel Proteomics Search
  • 12. Roll-Your-Own MR on AWS
    Define small scripts to
    Split a FASTA file
    Run a BLAT search
    The first script defines the inputs of the second
    Submit the input FASTA to S3
    Start a master node as the central communication hub
    Start slave nodes, configured to ask for work from master and save results back to S3
    Press “Play”
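The "split a FASTA file" script could look roughly like the sketch below. It is written in Python for illustration and is not the script shown in the talk; `read_fasta_records`, `split_fasta` and the chunk size are hypothetical, and the sketch works on lists of lines in memory rather than on S3 objects.

```python
from typing import Iterable, Iterator, List


def read_fasta_records(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield one FASTA record (header line plus sequence lines) at a time."""
    record: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        if line.startswith(">") and record:
            yield record
            record = []
        record.append(line)
    if record:
        yield record


def split_fasta(lines: Iterable[str], records_per_chunk: int) -> List[List[str]]:
    """Group records into chunks; each chunk becomes one search job's input."""
    chunks: List[List[str]] = []
    current: List[str] = []
    count = 0
    for record in read_fasta_records(lines):
        current.extend(record)
        count += 1
        if count == records_per_chunk:
            chunks.append(current)
            current, count = [], 0
    if current:
        chunks.append(current)  # final, possibly smaller, chunk
    return chunks


example = [">seq1", "ACGT", "ACGT", ">seq2", "GGCC", ">seq3", "TTAA"]
chunks = split_fasta(example, 2)
print(len(chunks))
```

Splitting on record boundaries rather than on byte counts keeps every sequence intact, so each chunk is a valid FASTA input for the second script.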
  • 13. Workflow of Distributed BLAT
    Boot master & slaves
    Submit the BLAT job
    The initial process splits the FASTA file. Subsequent jobs BLAT the smaller files and save each result as they go
    Upload inputs
    Download results
  • 14. Master Node => Resque
    GitHub-developed background job processing framework
    Jobs attached to a class from your application, stored as JSON
    Uses the Redis key-value store
    Simple front end for viewing job queue status and failed jobs
    Resque can invoke any class that has a class method “perform()”
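The job model described above (a class name and arguments stored as JSON, and a worker that calls the class method `perform`) can be mimicked in a few lines. This is a Python stand-in, not Resque itself, which is a Ruby library backed by Redis; the in-memory `QUEUE`, the `register` decorator, the `BlatChunk` class and the file names are all hypothetical.

```python
import json
from collections import deque

QUEUE: deque = deque()  # stand-in for the Redis list a real queue would use
JOB_CLASSES: dict = {}  # class name -> class, so workers can look jobs up


def register(cls):
    """Record a job class under its name so a JSON payload can refer to it."""
    JOB_CLASSES[cls.__name__] = cls
    return cls


def enqueue(job_class, *args) -> None:
    """Store the job as JSON: the class name plus its arguments."""
    QUEUE.append(json.dumps({"class": job_class.__name__, "args": list(args)}))


def work_one():
    """Pop one job, find its class, and call the class method perform()."""
    if not QUEUE:
        return None
    payload = json.loads(QUEUE.popleft())
    return JOB_CLASSES[payload["class"]].perform(*payload["args"])


@register
class BlatChunk:
    @classmethod
    def perform(cls, chunk_key: str, database: str) -> str:
        # A real job would fetch the chunk, run the search and save the result.
        return f"would run BLAT on {chunk_key} against {database}"


enqueue(BlatChunk, "chunks/part-0001.fa", "hg19.2bit")
stored = QUEUE[0]
result = work_one()
print(stored)
print(result)
```

Since the queue holds only a class name and plain arguments, any worker that has the same application code loaded can pick up the job, which is what lets the slave nodes stay loosely coupled to the master.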
  • 15. The scripts
  • 16. Storage in the Cloud : S3
    Permanent storage for your data
    Pay as you go for transmission and holding
    Eliminates backups
    Pretty good CDN
    Able to hook into better CDN SLA via CloudFront
    Can be slow at times
    Reports of 10-second delays, but the average response time is 300 ms
  • 17. S3 Costs
  • 18. Storage 2: Distributed FS on EC2
    Hadoop HDFS, Gigaspaces, etc.
    Network latency may be an issue for traditional DFSs
    Gluster, GPFS, etc.
    Tighter integration with execution framework, better performance?
    [Diagram: Your Data spread across the disks of several EC2 nodes]
  • 19. DFS on EC2 m1.xlarge Costs
    * Does not take into account transmission fees or data redundancy. Final cost is probably >= S3
  • 20. Storage 3: Memory Grids
    “RAM is the new Disk”
    Application level RAM clustering
    Terracotta, Gemstone Gemfire, Oracle, Cisco, Gigaspaces
    Performance for capability jobs?
    [Diagram: Your Data held in the RAM of several EC2 instances]
    * There are also the “Disk is the new RAM” groups, where redundant disk is used to mitigate seek times on subsequent reads
  • 21. Memory Grid Cost
    Take home message: Unless your needs are small, you may be better off procuring bare-metal resources
  • 22. Cloud Influence on Bioinformatics
    Computational Biology
    Algorithms will need to account for large I/O latency
    Statistical tests will need to account for incomplete information, or incremental results
    Software Engineering
    Built-for-the-cloud algorithms are popping up
    CloudBurst is a featured example in AWS EMR!
    Application to Life Sciences
    Deploy ready-made images for use
    Cycle Computing, ViPDAC, others soon to follow
  • 23. Algorithms need to be I/O centric
    Incur a slightly higher computational burden to reduce I/O across non-optimal networks
    P. Balaji, W. Feng, H. Lin 2008
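A generic way to see the trade of extra computation for less I/O is to compress results before they cross the network. The sketch below uses plain zlib compression in Python; it is not the semantic-based approach of the ParaMEDIC paper cited above, and the tab-separated hit line is made-up example data.

```python
import zlib


def pack_for_transfer(results: str) -> bytes:
    """Spend CPU time compressing so fewer bytes cross the slow link."""
    return zlib.compress(results.encode("utf-8"), 9)


def unpack_after_transfer(payload: bytes) -> str:
    """Spend CPU time again on the receiving side to restore the text."""
    return zlib.decompress(payload).decode("utf-8")


# Made-up, highly repetitive search output standing in for real results.
report = "chr1\t1000\t1050\tread_17\t+\n" * 2000
payload = pack_for_transfer(report)
raw_size = len(report.encode("utf-8"))
print(raw_size, len(payload))
```

How much is saved depends entirely on how compressible the output is, so the trade is worth measuring on real data before committing to it.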
  • 24. Some Internal Projects
    Resource Manager
    Service for on-demand provisioning and release of EC2 nodes
    Utilizes Chef to define and apply roles (compute node, DB server, etc)
    Terminates idle compute nodes at 52 minutes
    Workflow Manager
    Defines and executes data analysis workflows
    Relies on RM to provision nodes
    Once appropriate worker nodes are available, acts as the central work queue
    RNA-Seq Ultimate Mapper (RUM)
    Map Reduce RNA-Seq analysis pipeline
    Combines Bowtie + BLAT and feeds results into a decision tree for more accurate mapping of sequence reads
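The Resource Manager's rule for idle nodes might be sketched as below. The slide gives only the 52-minute figure; reading it as "52 minutes into each billed hour", on the assumption that instance time is charged per started hour, is an interpretation, and `should_terminate`, `BILLING_PERIOD` and `CUTOFF` are hypothetical names.

```python
from datetime import datetime, timedelta

BILLING_PERIOD = timedelta(hours=1)  # assumption: billed per started hour
CUTOFF = timedelta(minutes=52)       # figure taken from the slide


def should_terminate(launched_at: datetime, now: datetime, is_idle: bool) -> bool:
    """Release an idle node only near the end of its current billed hour."""
    if not is_idle:
        return False
    into_period = (now - launched_at) % BILLING_PERIOD
    return into_period >= CUTOFF


launched = datetime(2010, 4, 20, 9, 0)
print(should_terminate(launched, launched + timedelta(minutes=30), True))
print(should_terminate(launched, launched + timedelta(minutes=53), True))
```

Under that assumption, waiting until late in the hour keeps an already-paid-for node available for new work, while still leaving a few minutes for shutdown before the next hour starts.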
  • 25. Bowtie Alone
  • 26. RUM (Bowtie + BLAT + processing)
    Significantly increases confidence in your data
  • 27. RUM Costs
    Computational cost ~$100 - $200
    6-8 hours per lane on m2.4xlarge ($2.40 / hour)
    Cost of reagents ~= $10,000
    1% of total
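The arithmetic behind these figures can be checked with a short calculation. The rate, hours and reagent cost come from the slide; the slide does not say how many lanes or nodes the ~$100 - $200 total covers, so this Python sketch only works out the node cost for one lane and the ratio of compute cost to reagent cost. Amounts are kept in whole cents to avoid floating-point rounding.

```python
HOURLY_RATE_CENTS = 240            # m2.4xlarge at $2.40 / hour (from the slide)
REAGENT_COST_CENTS = 10_000 * 100  # ~$10,000 of reagents (from the slide)


def lane_cost_cents(hours: int) -> int:
    """Node cost of processing one lane for the given number of hours."""
    return hours * HOURLY_RATE_CENTS


def share_of_reagents_percent(compute_cents: int) -> float:
    """Compute cost expressed as a percentage of the reagent cost."""
    return 100 * compute_cents / REAGENT_COST_CENTS


low = lane_cost_cents(6)   # 6 hours per lane
high = lane_cost_cents(8)  # 8 hours per lane
print(low / 100, high / 100)
print(share_of_reagents_percent(100 * 100), share_of_reagents_percent(200 * 100))
```

One lane therefore costs $14.40 to $19.20 in node time, and a $100 to $200 compute bill is 1% to 2% of a $10,000 reagent cost.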
  • 28. Acknowledgements
    Garret FitzGerald
    Ian Blair
    John Hogenesch
    Greg Grant
    Tilo Grosser
    NIH & UPENN for support
    My Team
    David Austin
    Andrew Brader
    Weichen Wu
    Rate me! http://speakerrate.com/talks/3041-everything-comes-in-3-s