Federated HPC Clouds applied to
Radiation Therapy
A. Gómez, L.M. Carril, R. Valin,
J.C. Mouriño, C. Cotelo
ISC Cloud'13, Heidelberg (Germany)
Sep. 23-24, 2013
Overview
Context.
Virtual Cluster Architecture.
Experiments on BonFIRE.
Conclusions.
The research leading to these results has received funding from the European Commission's
Seventh Framework Programme (FP7/2007-2013) under grant agreement number 257386.
Context: eIMRT service
[Diagram: CT images and treatment plan go into the TPS; eIMRT recomputes the results as an independent second calculation]
• Second calculation.
• Personalized: one patient, one treatment.
eIMRT architecture
[Diagram: SaaS layer (eIMRT service) on top of an IaaS layer]
Workflow based on Monte Carlo simulations.
eIMRT Workflow
1. eIMRT code: prepares the inputs for BEAMnrc MC. Seconds on the master computer.
2. BEAMnrc MC simulations: independent jobs on the CEs.
3. eIMRT code: collects the outputs and prepares the inputs for DOSXYZnrc. Seconds on the master computer.
4. DOSXYZnrc MC simulations: independent jobs on the CEs.
5. eIMRT code: collects the outputs and generates the final output. Seconds on the master computer.
(A sketch of how one of the fan-out stages could be submitted to the cluster follows below.)
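The two MC stages fan out as many independent jobs on the CEs; only input preparation and output collection run serially on the master. A minimal sketch, assuming an OGS/SGE-managed virtual cluster, of how the master could drive one such fan-out stage (the script name, paths and job count are illustrative, not the actual eIMRT code):

    import subprocess

    def submit_mc_array(job_script, n_jobs, input_dir):
        """Submit n_jobs independent Monte Carlo tasks as an OGS/SGE array job
        and block until every task has finished (-sync y)."""
        subprocess.run([
            "qsub",
            "-sync", "y",                      # return only when the whole array is done
            "-t", "1-%d" % n_jobs,             # one array task per independent MC input
            "-v", "INPUT_DIR=%s" % input_dir,  # tell each task where its input lives
            job_script,
        ], check=True)

    # BEAMnrc fan-out; the DOSXYZnrc stage would be driven the same way.
    submit_mc_array("run_beamnrc.sh", n_jobs=64, input_dir="/nfs/case-001/beamnrc")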
SaaS issues
Local cluster:
– May not be large enough when there are many clients.
– Interference between customers' requests.
– Shared resources: time-to-solution not guaranteed.
Grid:
– Interference between clients.
– Shared resources: time-to-solution not guaranteed.
Cloud:
– One treatment, one virtual cluster.
– No interference between treatments or customers.
– But: how do we guarantee the time-to-solution on a multi-tenant infrastructure that is out of our control?
IaaS issues for HPC/HTC SaaS
Site failures: require a fault-tolerant design.
Application performance variability between deployments: requires elasticity.
– Different IaaS back-end servers.
– Multi-tenancy: resources shared among IaaS customers.
– Different Cloud providers.
– Evolution of the IaaS infrastructure.
J. Schad et al., "Runtime Measurements in the Cloud: Observing, Analyzing, and Reducing Variance," Proceedings of the VLDB Endowment, Vol. 3, No. 1, 2010.
Proposal: Autonomous Virtual
Cluster Architecture
Virtual Cluster Architecture
Virtual Cluster, single site
[Diagram: master node and CEs sharing an NFS area]
Cluster management: OGS (Open Grid Scheduler) + custom scripts.
(A sketch of the kind of custom script that registers a new CE with OGS follows below.)
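A minimal sketch of the kind of custom script that could register a freshly booted CE with the OGS master; the real scripts used on the virtual cluster are not shown in the slides, so the commands and hostname here are assumptions:

    import subprocess

    def register_ce(hostname):
        """Make a new CE known to the OGS qmaster (illustrative steps only)."""
        # allow the host to communicate with / submit to the qmaster
        subprocess.run(["qconf", "-as", hostname], check=True)
        # add the host to the default host group so queues start scheduling on it
        subprocess.run(["qconf", "-aattr", "hostgroup", "hostlist",
                        hostname, "@allhosts"], check=True)
        # the CE itself must still mount the shared NFS area and start sge_execd

    register_ce("ce-05.vc.local")   # hypothetical CE hostname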
Virtual Cluster, two sites
Fault-tolerant VC, two sites
Elasticity Engine
Controls the number of CEs based on key application performance measurements.
Enlarges the cluster to maintain performance and meet deadlines.
Shrinks the cluster when application performance is higher than needed, to reduce cost.
(A sketch of such a control step is shown below.)
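A minimal sketch of such a control step, assuming the key performance indicator is the aggregate MC throughput (histories per second); the margin, bounds and sizing rule are illustrative, not the actual Elasticity Engine:

    def control_step(measured_rate, remaining_work, time_left_s, n_ces,
                     margin=1.1, min_ces=1, max_ces=64):
        """Return the new number of CEs given the measured key application
        performance and the work still to be done before the deadline."""
        required_rate = remaining_work / time_left_s          # rate needed to finish on time
        per_ce_rate = max(measured_rate / max(n_ces, 1), 1e-9)

        if measured_rate < required_rate * margin:            # too slow: enlarge the cluster
            deficit = required_rate * margin - measured_rate
            n_ces += max(1, round(deficit / per_ce_rate))
        elif measured_rate > required_rate * margin * 1.5:    # well above target: shrink to save cost
            surplus = measured_rate - required_rate * margin
            n_ces -= int(surplus // per_ce_rate)

        return max(min_ces, min(n_ces, max_ces))

The same loop covers both directions: it grows the cluster to meet a deadline and shrinks it when the deployment happens to perform better than required.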
Proof-of-Concept Experiments
BonFIRE Infrastructure

INRIA:
  Vendor       Freq. (GHz)   Cores   RAM (GB)
  Intel        2.33          2*2     4
  AMD          1.7           2*12    48
  Intel        2.5           2*4     32
  Intel        2.93          2*4     24

HLRS:
  Vendor       Freq. (GHz)   Cores   RAM (GB)
  Intel        3.2           2*2     2
  Intel        2.66          2*2     8
  AMD          2.6           4*12    196
  AMD          2             2       4
  Intel i7     2.53          2       4
  Intel i7     2.1           4       8
  Intel Atom                 1       2
  AMD T56N     1.65          2       2

Cloud Manager: OpenNebula 3.0
DISTRIBUTED VIRTUAL CLUSTER
EXPERIMENT
VCOC, FIRE Engineering Workshop, Ghent, Nov. 6th – 7th 2012
Application execution: one vs. two sites
• VC configuration: distributed VC (_dist).
• BonFIRE sites:
– INRIA: master + CEs.
– HLRS: CEs.
• Deployment time decreases.
• Application: two sites faster than one site, but because the second site has better CPUs.
• Impact of deployment: ~10% of total time.
SPECIFIC DEADLINE OBJECTIVE
EXPERIMENT
Horizontal elasticity
• Monitoring application performance works.
• We modified the software to produce performance information more frequently.
• Execution with a deadline.
• Elasticity works.
FAULT TOLERANCE EXPERIMENT
WITH ELASTICITY
Virtual Cluster
[Diagram: fault-tolerant virtual cluster spanning two sites, with the master and shadow kept in SYNC]
Fault-tolerance
• BonFIRE sites:
– HLRS (master + 4 CEs)
– INRIA (shadow + 4 CEs)
• Demanded performance: 500 H/s (MC histories per second).
• Fault simulated by putting the HLRS VMs in the CANCEL state.
• The INRIA shadow took control of the cluster.
• Elasticity worked, requesting more CEs from INRIA.
(A sketch of the shadow's watchdog logic is shown below.)
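A minimal sketch of the shadow-side watchdog behind this experiment, assuming the shadow probes the master's scheduler port and promotes itself after a few missed checks; the hostname, port, timings and promotion script are assumptions, not the actual implementation:

    import socket, subprocess, time

    MASTER = ("master.hlrs.vc", 6444)    # hypothetical master address / OGS qmaster port
    CHECK_EVERY_S = 10
    MAX_MISSED = 3

    def master_alive():
        try:
            with socket.create_connection(MASTER, timeout=5):
                return True
        except OSError:
            return False

    def promote_shadow():
        # hypothetical promotion step: start the qmaster/NFS services on this node
        # and repoint the surviving CEs to the shadow
        subprocess.run(["/opt/vc/bin/promote_shadow.sh"], check=True)

    missed = 0
    while True:
        missed = 0 if master_alive() else missed + 1
        if missed >= MAX_MISSED:
            promote_shadow()
            break
        time.sleep(CHECK_EVERY_S)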
CONCLUSIONS
Conclusions
• Distributed VCs can be used to speed up HTC applications.
• An Elasticity Engine driven by a key application performance indicator works for HTC.
• High QoS can be provided on a VC by combining a distributed VC with elasticity.
• BonFIRE provides infrastructure for experimenting with new Cloud concepts and services.
THANKS
Questions?
agomez@cesga.es
