Your SlideShare is downloading. ×
0
Fast Exploration of the QSARModel Space with e-ScienceCentral and Windows AzureSimon WoodmanJacek CalaHugo HidenPaul Watson
VENUS-C  Developing technology to ease scientific      adoption of Cloud Computing• EU Funded Project  – 11 Partners  – 5 ...
Architrave                                                                                           AEGEAN               ...
The Problem           What are the properties of this molecule?Toxicity                                               Biol...
The alternative to Experiments                Predict likely properties based on similar moleculesCHEMBL Database:       d...
QSAR                              QSARQuantitative Structure Activity Relationship                 Activity ≈             ...
Method
Branching WorkflowsPartition training & test   Random split           data             80:20 split Calculate descriptors  ...
e-Science Central             Platform for cloud based data analysisAzure                                             Java...
Architecture                                                                             <<web role>>                     ...
Workflow Architecture                       Worker Role• Single Message            Install JRE  Queue                   In...
Results• 250k models             • QSAR Explorer  –   Linear Regression     – Browse  –   PLS                   – Search  ...
Scalability: Large Scale QSAR                                                                                             ...
Cloud Applicability• Bursty  – ChEMBLdb updates (delta 10%)  – New Modelling Methods (???)• Performance depends on how cha...
Performance is great but …Drug Development requires us to capture       the data and the process
Provenance/Audit            Requirements• How was a model generated?  – What algorithm?  – What descriptors• Are these res...
Storing Provenance• Neo4j  – Open Source Graph Database  – Nodes/Relationships + properties  – Querying/traversing• Access...
Provenance Model• Based on OPM  – Processes, Artifacts, Age    nts• Directed Graph• Multiple views of  provenance  – Depen...
Adding new model builders1. Add new block                                                Enumerate2. Mine the provenance  ...
Future Work• Scalability and reliability  – SQL Azure  – Application server replication• Provenance visualization• Meta-QS...
MOVEeCloud Project•   Investigating the links between    physical activity and common    diseases – type 2    diabetes, ca...
MOVEeCloud Process                                    Analysis and                                    Classification      ...
Data Sizes100 samples / second                       100 rows3600 seconds / hour                        360,000 rows24 hou...
Working with larger data sets• As we add more workflow engines server  load increases  – One server can cope 200 engines i...
HDFS• Implemented prior to Native HDFS on Azure• Easy to integrate with e-sc   – Java system just requires libraries inclu...
Drawbacks to HDFS• Needs a NAMENODE to co-ordinate everything   – Single point of failure (in current HDFS)• Metadata stor...
e-SC WorkflowsChunking WorkflowChunk Processing Workflow
System Setup• One machine for the e-sc server     • 4 CPUs, 7GB RAM, 1TB Local storage• One machine for Namenode     • 4 C...
Initial Results  For a single data set processing went from 60 to 16 minutes            using 4 workflow engines running H...
Questions?• Thank you to our generous funders  – Microsoft Cloud 4 Science  – EU FP7 - VENUS-C (RI-261565)  – RCUK – SiDE ...
Microsoft HPC User Group
Upcoming SlideShare
Loading in...5
×

Microsoft HPC User Group

158

Published on

Presentation given at the Microsoft HPC User Group, Salt Lake City, 2012

0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total Views
158
On Slideshare
0
From Embeds
0
Number of Embeds
0
Actions
Shares
0
Downloads
1
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide
  • [JC] Shouldn’t the title be “e-SC in Azure”?
  • Work Stealing vs Work scheduling
  • Important for drug discovery due to traceability requirements
  • Natural fit for storing provenance as it’s a graph to start with
  • C4S – Mutation Detection, NGS, e-SC available on Azure, data sets for bioinformatics
  • Large datasets need smart moving of data around – HDFS? otherwise hit the limit on storage account b/w
  • Single POFNot great for small files or 1M+ filesInstances can terminate resulting in loss of data
  • Transcript of "Microsoft HPC User Group "

    1. 1. Fast Exploration of the QSARModel Space with e-ScienceCentral and Windows AzureSimon WoodmanJacek CalaHugo HidenPaul Watson
    2. 2. VENUS-C Developing technology to ease scientific adoption of Cloud Computing• EU Funded Project – 11 Partners – 5 Technology/Infrastructure – 6 Scenario Partners – Open Call – 20 Pilot Studies – May 2010 to May 2012
    3. 3. Architrave AEGEAN UPV.Bio UNEW COLB CoSBI CNR Scenario / Algorithm Users Programming .NE .NE C++ C++ Java C++ Java Language T T Type of Parameter Map / Batch HTC Workflow Data flow CEP workload sweep Reduce ExecutionVENUS- Environments EMIC Generic Worker BSC COMPS C Operating Windows Windows Linux BSC Super Computer System Windows (not in the cloud) Azure EMOTIV Infra- (not in the cloud) Cloud OpenNebula … On Premises Technology Estructure Cloud Paradigm PaaS IaaS Cloud Custome Provider MSFT ENG KTH BSC r
    4. 4. The Problem What are the properties of this molecule?Toxicity Biological ActivitySolubility Perform experiments Time consuming Expensive Ethical constraints
    5. 5. The alternative to Experiments Predict likely properties based on similar moleculesCHEMBL Database: data on 622,824 compounds, collected from 33,956 publicationsWOMBAT Database: data on 251,560 structures, for over 1,966 targetsWOMBAT-PK Database: data on 1230 compounds, for over 13,000 clinical measurements All these databases contain structure information and numerical activity data
    6. 6. QSAR QSARQuantitative Structure Activity Relationship Activity ≈ f( )More accurately, Activity related to a quantifiable structural attributeActivity ≈ f( logP, number of atoms, shape....) Currently > 3,000 recognised attributes http://www.qsarworld.com/
    7. 7. Method
    8. 8. Branching WorkflowsPartition training & test Random split data 80:20 split Calculate descriptors Java CDK descriptors C++ CDL descriptors Correlation analysis Select descriptors Genetic algorithms Random selection Build model Linear regression Neural Network Partial Least Squares Classification Trees Add to database
    9. 9. e-Science Central Platform for cloud based data analysisAzure JavaEC2 ROn Premise Octave Javascript
    10. 10. Architecture <<web role>> Generic Worker e-SC control data Workflow engine web web browser browser rich client app <<web role>> <<web role>> Generic Worker QSAR workflow data <<Azure VM>> Workflow Explorer engine Web UI REST API e-Sciencee-SC blob Central main server JMS queue <<web role>> Azure Blob store Generic Worker store Workflow engine workflow invocations e-SC db backend <<Azure VM>>
    11. 11. Workflow Architecture Worker Role• Single Message Install JRE Queue Install wf engine – Worker Failure Execute the engine Semantics Get Job from Queue – Elasticity Deploy Runtime?• Runtime Get Data Environments Execute Job –R – Octave Put Data – Java Put Next Jobs on Queue• Deployed only once
    12. 12. Results• 250k models • QSAR Explorer – Linear Regression – Browse – PLS – Search – RPartitioning – Get Predictions – Neural Net• 460K workflow executions• 4.4M service calls
    13. 13. Scalability: Large Scale QSAR 16:48480 datasets sequential time: 11 days GW 100 Nodes 200 Nodes 14:24 Azur e Response Time 3hr 19mins 1hr 50mins Speedup 94x 156x 12:00 Execution time [hh:mm] Efficiency 94% 78% 09:36 Cost $55.68 $51.84 250.0 07:12 200.0Relative processing speed-up 04:48 150.0 100.0 02:24 50.0 Azure ideal 00:00 GW 0 50 100 150 200 250 0.0 Number of processors 0 50 100 150 200 250 Number of processors
    14. 14. Cloud Applicability• Bursty – ChEMBLdb updates (delta 10%) – New Modelling Methods (???)• Performance depends on how chatty the problem is – Deploy (incl download) dependencies once – Avoid storage bottlenecks
    15. 15. Performance is great but …Drug Development requires us to capture the data and the process
    16. 16. Provenance/Audit Requirements• How was a model generated? – What algorithm? – What descriptors• Are these results reproducible?• How have bugs manifested? – Which models affected – How do we regenerate affected models?• Performance Characteristics• How do we deal with new data?
    17. 17. Storing Provenance• Neo4j – Open Source Graph Database – Nodes/Relationships + properties – Querying/traversing• Access – Java lib for OPM – e-SC library built on top of OPM lib – REST interface• Options for HA and Sharding for performance
    18. 18. Provenance Model• Based on OPM – Processes, Artifacts, Age nts• Directed Graph• Multiple views of provenance – Dependent on security privileges
    19. 19. Adding new model builders1. Add new block Enumerate2. Mine the provenance descriptors 1 n n3. Dynamically create Build and cross-validate Build and cross- validate a new RPart-m kind of model virtual workflows 1 14. One invocation per cross-validation data set 1 1 Test RPart-m Test ?-m• Work in progress…
    20. 20. Future Work• Scalability and reliability – SQL Azure – Application server replication• Provenance visualization• Meta-QSAR – Provenance Mining• Cloud4Science – Applying lessons learned to new scenarios
    21. 21. MOVEeCloud Project• Investigating the links between physical activity and common diseases – type 2 diabetes, cardiovascular disease,…• Wrist accelerometers worn over 1 week period• Measures movement at 100Hz in three axes• Processing ideal for Azure – Bursty data processing as new data gathered – Embarrassingly parallel – Large datasets
    22. 22. MOVEeCloud Process Analysis and Classification R, Java, Octave Walkin Sedentar Sleep Sedentary Activity g y Methodology Clinician’s Patient Section for Report Interventions Papers
    23. 23. Data Sizes100 samples / second 100 rows3600 seconds / hour 360,000 rows24 hours / day 8,640,000 rows7 days / study 60,480,000 rows / patient / visit Cohort size of 800 patients and multiple visits
    24. 24. Working with larger data sets• As we add more workflow engines server load increases – One server can cope 200 engines if files are small• This is not the case with movement data – Only support 4 engines• Increase the bandwidth to the engines – Clustering appserver /database?
    25. 25. HDFS• Implemented prior to Native HDFS on Azure• Easy to integrate with e-sc – Java system just requires libraries included in e-sc• Distributed store where bandwidth increases with number of machines – Bits of data spread around lots of machines• Concept of data location – Potential to route workflows to execute as close as possible to storage• Other applications also also built on top of HDFS – Open TSDB to store timeseries for movement data
    26. 26. Drawbacks to HDFS• Needs a NAMENODE to co-ordinate everything – Single point of failure (in current HDFS)• Metadata stored in RAM – doesn’t scale beyond a million or so files – Bad for drug discovery)• Not particularly efficient for small files – There is an overhead to connecting to filesystem• If instances terminate can lose data if not backed up – Redundancy helps – Backing in Cloud and stage to HDFS for experiments – Might use it is a cache of “Hot” files and use Blobstore/S3 to back it all up
    27. 27. e-SC WorkflowsChunking WorkflowChunk Processing Workflow
    28. 28. System Setup• One machine for the e-sc server • 4 CPUs, 7GB RAM, 1TB Local storage• One machine for Namenode • 4 CPUs, 7GB RAM, 1TB local storage• Four workflow engines • 2 CPUs, 3.5 GB RAM, 500GB local storage• 2TB HDFS storage using workflow engines • Mounts up quickly• Increase priority of HDFS on engines • Competing for resources with workflow
    29. 29. Initial Results For a single data set processing went from 60 to 16 minutes using 4 workflow engines running HDFS• 4 engines the limit for one e-sc server • Main server hit 100% CPU delivering data • No further improvements with more engines• Using HDFS CPU was consistently below 5% • More like our earlier scalability results• Once data had been chunked processing was the same for each chunk• The improvement lay entirely in staging and uploading results
    30. 30. Questions?• Thank you to our generous funders – Microsoft Cloud 4 Science – EU FP7 - VENUS-C (RI-261565) – RCUK – SiDE (EP/G066019/1)• The Team – Jacek Cala – Paul Watson – Hugo Hiden
    1. A particular slide catching your eye?

      Clipping is a handy way to collect important slides you want to go back to later.

    ×