Virtual Proteomics Analysis Cluster in the Cloud


This talk was presented at Supercomputing '09 in the Cloud Computing for Systems and Computational Biology workshop. It describes the proteomics analysis package we built using Amazon's cloud computing architecture. More information is available in our paper in the Journal of Proteome Research.

Published in: Health & Medicine, Technology
  • Internally we now use a hybrid solution: Sequest and Mascot run on local clusters, while X!Tandem and OMSSA run on AWS. Raw data can be sent to any or all of these algorithms through an integrated workflow system.
  • Virtual Proteomics Analysis Cluster in the Cloud

    1. It’s always sunny on top of the Cloud! An intro to Amazon Web Services
        • Simon Twigger, Ph.D.
        • Medical College of Wisconsin, Milwaukee
       ViPDAC, a stand-alone Proteomics Analysis Suite in the Cloud
    2. ‘How the humble pipette tip helped us rethink our computing strategy...’
    3. Meet Joe ‘the’ Researcher...
    4. Proteomics - Finding and identifying proteins (diagram labels: DB; Rat/Tissue Sample; LC MS/MS; Peptide Identification; Results & Analysis)
    5. Current architecture (diagram labels: Windows (head node, preprocessing, storage); Raw File; .dtas; Protein IDs; IBM Blade Cluster (Sequest))
    6. Finite Resource, wait your turn (diagram label: 1 MCW Cluster)
    7. Here’s the lab’s pipette tip. Let me have it when you’re done...
    8. What would you do if there was only one tip?
        • Wait in line to use it
        • Run fewer experiments (due to waiting in line)
        • Do small scale things (It’s a small tip, pipetting 5l takes all week!)
        • Try fewer things (it’s a real pain to keep washing it up)
        • Not try anything weird (What happens if it gets permanently clogged!?)
    9. OK, more computers might be better... but...
        • we don’t have the money!
        • we don’t have an IT guy/gal
        • we don’t have a sysadmin
        • we don’t know how to install a cluster
        • we won’t use it all the time
    10. Virtual Proteomics Analysis Cluster (ViPDAC)
    11. Current architecture with Sequest (diagram labels: Raw File; .dtas; Protein IDs; IBM Blade Cluster (Sequest); Windows (head node, preprocessing, storage))
    12. ViPDAC & Amazon Components (diagram labels: S3 (Data Store); Raw File; .dtas; Protein IDs; EC2 (OMSSA, X!Tandem))
    13. ViPDAC & Amazon Components (diagram labels: S3 (Data Store); Raw File; .dtas; Protein IDs; 2x; 3x; 20x)
    14. ViPDAC: Create a new analysis job
    15. Job in progress
    16. Wait in line vs On Demand (diagram labels: 1 MCW Cluster vs Molly’s ViPDAC, Shama’s ViPDAC, Brian’s ViPDAC, Bassam’s ViPDAC)
    17. Equal-opportunity computing - Clusters for All (diagram labels: 1 PC vs 1 ViPDAC or n ViPDACs)
    18. Observations: Sign up & start up is hard for biologists.
    19. Now what?
        • No need to wait in line to use it
        • No need to run fewer analyses
        • No need to do small scale things
        • No need to try fewer things
        • No need to not try anything weird
        (diagram labels: Molly’s ViPDAC, Shama’s ViPDAC, Brian’s ViPDAC, Bassam’s ViPDAC)
    20. Internal Hybrid Solution: Local and Cloud (diagram label: Scale up/down/off)
    21. Clouds & Bioinformatics: Our observations so far
        • Use it as a software delivery method
        • Use it to provide computing to virtually anyone
        • Get fast access to large data files (Ensembl, GenBank, etc.)
        • Use it to COMPLEMENT existing clusters/grids
        • AMIs/Apps not easy for non-informatics folks to get going
        • ‘Cloud-friendly’ licensing structures for commercial software?
        • ‘Grant-friendly’ billing options
        • Data transfer for large datasets (NextGen sequencing?)
    22. Acknowledgements
        • Joey Geiger, Brian Halligan and Andrew Vallejos
        • Molly Pellitteri-Hahn, Shama Mirsa
        • Mike Olivier, Andy Greene
        • NHLBI National Proteomics Center
        Low Cost, Scalable Proteomics Data Analysis Using Amazon’s Cloud Computing Services and Open Source Search Algorithms. J. Proteome Res., 2009, 8 (6), pp 3148–3153.
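
The hybrid arrangement mentioned in the slide notes and on slide 20 (Sequest and Mascot on local clusters, X!Tandem and OMSSA on AWS) could be pictured as a simple routing table. The Python below is only an illustrative sketch, not the workflow system used at MCW: `ENGINE_BACKEND`, `SearchJob` and `plan_jobs` are hypothetical names, and the only thing taken from the talk is which search engine runs where.

```python
from dataclasses import dataclass

# Engine-to-backend assignment as stated in the slide notes.
# The dictionary itself and its key spellings are assumptions for illustration.
ENGINE_BACKEND = {
    "sequest": "local",
    "mascot": "local",
    "xtandem": "aws",
    "omssa": "aws",
}


@dataclass(frozen=True)
class SearchJob:
    """One database search of one raw file with one engine (hypothetical type)."""
    raw_file: str
    engine: str
    backend: str


def plan_jobs(raw_file, engines):
    """Return one SearchJob per requested engine, routed to its backend."""
    jobs = []
    for name in engines:
        # Normalise names such as "X!Tandem" to the dictionary key "xtandem".
        key = name.lower().replace("!", "")
        if key not in ENGINE_BACKEND:
            raise ValueError(f"unknown search engine: {name}")
        jobs.append(SearchJob(raw_file, key, ENGINE_BACKEND[key]))
    return jobs


if __name__ == "__main__":
    for job in plan_jobs("sample.raw", ["Sequest", "X!Tandem", "OMSSA"]):
        print(job)
```

Sending the same raw file to "any or all" of the algorithms then amounts to passing a longer or shorter engine list; actually submitting each job to a local scheduler or to EC2 is outside this sketch.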