Magellan Experiences with OpenStack




Narayan Desai
desai@mcs.anl.gov
Argonne National Lab
The Challenge of High Performance Computing


   Scientific progress is predicated on the use of computational models, simulation,
    or large scale data analysis
     – Conceptually similar to (or enabling of) traditional experiments
   Progress is also limited by the computational capacities usable by applications
   Applications often use large quantities of resources
     –   100s to 100000s of processors in concert
     –   High bandwidth network links
     –   Low latency communication between processors
     –   Massive data sets
   Largest problems often ride the ragged edge of available resources
      – Inefficiency reduces the scope and efficacy of computational approaches to particular
        large-scale problems
   Historically driven by applications, not services
The Technical Computing Bottleneck
DOE Magellan Project (2009-2011)


   Joint project between Argonne and Berkeley Labs
   ARRA Funded
   Goal: To assess “cloud” approaches for mid-range technical computing
     –   Comparison of private/public clouds to HPC systems
     –   Evaluation of Hadoop for scientific computing
     –   Application performance comparison
     –   User productivity assessment
   Approach: Build a system with an HPC configuration, but operate as a private
    cloud
      –   504 IBM iDataPlex nodes
      –   200 IBM 3650 storage nodes (8 disks, 4 SSDs)
      –   12 HP 1 TB memory nodes
      –   133 NVIDIA Fermi GPU nodes
      –   QDR InfiniBand
     –   Connected to the ESNet Research Network
Initial Approach


   Set up Magellan as a testbed
      – Several hardware types, many software configurations
   Chose Eucalyptus 1.6 as the cloud software stack
      – Mindshare leader in 2009
      – Had previous deployment experience
      – Supported the widest range of EC2 APIs at the time (see the EC2 API sketch after this slide)
   Planned to deploy 500 nodes into the private cloud portion of the system
      – Bare metal provisioning for the rest, due to lack of virtualization support for GPUs, etc.
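To make the EC2 compatibility point concrete, here is a sketch of how a user could drive an EC2-compatible private cloud endpoint with boto. The hostname, credentials, and image ID are placeholders rather than Magellan values; port 8773 and the /services/Cloud path are the conventional Eucalyptus/Nova EC2 front-end defaults of that era.

    import boto
    from boto.ec2.regioninfo import RegionInfo

    # Placeholder endpoint and credentials -- not the actual Magellan values.
    region = RegionInfo(name="private-cloud", endpoint="cloud.example.gov")
    conn = boto.connect_ec2(
        aws_access_key_id="EC2_ACCESS_KEY",
        aws_secret_access_key="EC2_SECRET_KEY",
        is_secure=False,
        region=region,
        port=8773,                 # conventional EC2 API port for Eucalyptus/Nova
        path="/services/Cloud",
    )

    # Launch one instance from an (assumed) registered image ID.
    reservation = conn.run_instances("emi-00000001",
                                     instance_type="m1.small",
                                     min_count=1, max_count=1)
    print(reservation.instances)

The attraction of broad EC2 API support was that existing EC2 tooling could be pointed at the private cloud with only an endpoint change.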
Initial Results
Detailed Initial Experiences (2009-2010)


   Had serious stability and scalability problems once we hit 84 nodes
   Eucalyptus showed its research project heritage
     – Implemented in multiple languages
     – Questionable architecture decisions
   Managed to get the system into a usable state, but barely
   Began evaluating potential replacements (11/2010)
      – Eucalyptus 2.0
      – Nimbus
      – OpenStack (Bexar+)
Evaluation Results


   Eucalyptus 2.0 was better, but more of the same
   OpenStack fared much better
     –   Poor documentation
     –   Solid architecture
     –   Good scalability
     –   High quality code
          • Good enough to function as documentation surrogate in many cases
     – Amazing community
          • (Thanks Vish!)
   Decided to deploy OpenStack Nova in 1/2011
      –   Started with the Cactus beta codebase and tracked changes through release
      –   By February, we had deployed 168 nodes and began moving users over
      –   Turned off the old system by 3/2011
      –   Scaled to 336, then 420, nodes over the following few months
Early OpenStack Compute Operational Experiences

   Cactus
     – Our configuration was unusual, due to scale
           • Multiple network servers
           • Splitting services out to individual service nodes
      – Once things were set up, the system mainly ran
     – Little administrative intervention required to keep the system running
   User productivity
     –   Most scientific users aren’t used to managing systems
      –   Typical usage model is application-centric, not service-centric
     –   Private cloud model has a higher barrier to entry
     –   Model also enabled aggressive disintermediation, which users liked
     –   It also turned out there was a substantial unmet demand for services in scientific
         computing
   Due to the user productivity benefits, we decided to transition the system to
    production at the end of the testbed project, in support of the DOE Systems
    Biology Knowledgebase project
Enable DOE Mission Science Communities
   (Image slide: example communities include Plants and Microbes)
Transitioning into Production (11/2011)
   Production meant new priorities
     – Stability
     – Serviceability
     – Performance
   And a new operations team
   Initial build based on Diablo
     –   Nova
     –   Glance
     –   Keystone*
     –   Horizon*
   Started to develop operational processes
     – Maintenance
     – Troubleshooting
      – Appropriate monitoring (a minimal service-check sketch follows this slide)
   Performed a full software stack shakedown
     – Scaled rack by rack up to 504 compute nodes
   Vanilla system ready by late 12/2011
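As an illustration of what "appropriate monitoring" meant at this stage, here is a minimal service-level check sketch. It assumes the Diablo-era nova-manage service list output format (healthy services marked ':-)', dead ones 'XXX' in the State column); alerting and scheduling are left out.

    import subprocess

    def down_nova_services():
        """Return (binary, host) pairs that nova-manage reports as dead.

        Assumes the Diablo-era output format, where a dead service is
        marked 'XXX' in the State column.
        """
        output = subprocess.check_output(["nova-manage", "service", "list"])
        down = []
        for line in output.decode().splitlines():
            fields = line.split()
            if len(fields) >= 2 and "XXX" in fields:
                down.append((fields[0], fields[1]))  # binary, host
        return down

    if __name__ == "__main__":
        for binary, host in down_nova_services():
            print("DOWN: %s on %s" % (binary, host))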
Building Towards HPC Efficiency


   HPC platforms target peak performance
     – Virtualization is not a natural choice
   How close can we get to HPC performance while maintaining cloud feature
    benefits?
   Several major areas of concern
     –   Storage I/O
     –   Network Bandwidth
     –   Network latency
     –   Driver support for accelerators/GPUs
   Goal is to build multi-tenant, on-demand, high performance computational
    infrastructure
      – Support wide area data movement
      – Large scale computations
      – Scalable services hosting bioinformatics data integrations
Network Performance Expedition


   Goal: To determine the limits of OpenStack infrastructure for wide area network
    transfers
     – Want small numbers of large flows as opposed to large numbers of slow flows
   Built a new Essex test deployment
     –   15 compute nodes, with 1x10GE link each
     –   Had 15 more in reserve
     –   Expected to need 20 nodes
     –   KVM hypervisor
   Used the FlatManager network setup (an illustrative nova.conf excerpt follows this slide)
      – Multi-host configuration
      – Each hypervisor ran Ethernet bridging and IP firewalling for its guest(s)
   Nodes connected to the DOE ESNet Advanced Networking Initiative
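For reference, a nova.conf excerpt in the spirit of this setup. This is a sketch, not the actual Magellan configuration, and the bridge and interface names are assumptions; with multi_host set, nova-network runs on every compute node, which is what keeps the Ethernet bridging and iptables rules for each guest on its own hypervisor.

    [DEFAULT]
    network_manager=nova.network.manager.FlatManager
    multi_host=True                     # run nova-network on every compute node
    flat_network_bridge=br100           # assumed bridge name
    flat_interface=eth1                 # assumed interface carrying the 10GE link
    public_interface=eth0               # assumed
    libvirt_type=kvm
    firewall_driver=nova.virt.libvirt.firewall.IptablesFirewallDriver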
ESNet Advanced Networking Infrastructure
Setup and Tuning


   Standard instance type
      – 8 vCPUs
      – 4 vNICs bridged to the same 10GE Ethernet
      – virtio
   Standard tuning for wide area high bandwidth transfers
      –   Jumbo frames (9K MTU)
      –   Increased TX queue length on the hypervisor
      –   Increased buffer sizes on the guest
      –   32-64 MB window size on the guest (see the bandwidth-delay worked example after this slide)
      –   Fasterdata.es.net rocks!
   Remote data sinks
     – 3 nodes with 4x10GE
     – No virtualization
   Settled on 10 VMs for testing
     – 4 TCP flows each (ANL -> LBL)
     – Memory to memory
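The 32-64 MB window figure follows from the bandwidth-delay product of the ANL-to-LBL path, taking the ~50 ms figure on the results slide as the round-trip time. For a single stream in the few-Gb/s range the tests actually reached:

\[
\text{BDP} = \text{bandwidth} \times \text{RTT} = 4\,\text{Gb/s} \times 0.05\,\text{s} = 0.2\,\text{Gb} \approx 25\,\text{MB},
\]

so a 32-64 MB window is enough to keep one stream from stalling on a 50 ms path. The aggregate number on the results slide is just a unit conversion: 95 Gb/s ÷ 8 ≈ 12 GB/s.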
Network Performance Results
Results and comments


   95 Gb/s consistently
      – 98 Gb/s peak!
      – ~12 GB/s across 50 ms of latency!
   Single node performance was way higher than we expected
     – CPU utilization even suggests we could handle more bandwidth (5-10 more?)
     – Might be able to improve more with EoIB or SR-IOV
   Single stream performance was worse than native
     – Topped out at 3.5-4 gigabits
   Exotic tuning wasn’t really required
   OpenStack performed beautifully
     – Was able to cleanly configure this networking setup
     – All of the APIs are usable in their intended ways
     – No duct tape involved!
Conclusions


   OpenStack has been a key enabler of on-demand computing for us
      – Even in technical computing, where these techniques are less common
   OpenStack is definitely ready for prime time
      – Even supports crazy experimentation
   Experimental results show that on-demand, high bandwidth data transfers are
    feasible
      – Our next step is to build OpenStack storage that can source/sink data at that rate
   Eventually, multi-tenant data transfer infrastructure will be possible
   This is just one example of the potential of mixed cloud/HPC systems
Acknowledgements


   Argonne Team
      – Jason Hedden
      – Linda Winkler
   ESNet
      – Jon Dugan
      – Brian Tierney
      – Patrick Dorn
      – Chris Tracy
   Original Magellan Team
      – Susan Coghlan
      – Adam Scovel
      – Piotr Zbiegel
      – Rick Bradshaw
      – Anping Liu
      – Ed Holohan
