Lxcloud

  1. CERN's Cloud Computing Infrastructure
     Tony Cass, Sebastien Goasguen, Belmiro Moreira, Ewan Roche, Ulrich Schwickerath, Romain Wartel
     Cloudview conference, Porto, 2010
     See also related presentations:
     - HEPIX spring and autumn meetings, 2009 and 2010
     - Virtualization vision, Grid Deployment Board (GDB), 9/9/2009
     - Batch virtualization at CERN, EGEE09 conference, Barcelona
  2. Outline and disclaimer
     - An introduction to CERN
     - Why virtualization and cloud computing?
     - Virtualization of batch resources at CERN: building blocks and current status
     - Image management systems: ISF and ONE
     - Status of the project and first numbers
     Disclaimer: we are still in the testing and evaluation phase. No final decision has been taken yet on what we are going to use in the future. All given numbers and figures are preliminary.
  3. Introduction to CERN
     - European Organization for Nuclear Research, the world's largest particle physics laboratory
     - Located on the Swiss/French border
     - Founded in 1954; funded and staffed by 20 member states, with many contributors in the USA
     - Birthplace of the World Wide Web
     - Made popular by the movie "Angels and Demons"
     - Flagship accelerator: the LHC
     http://www.cern.ch
  4. Introduction to CERN: the LHC and the experiments
     Experiments: ALICE, ATLAS, CMS, LHCb, LHCf, TOTEM
     - Circumference of the LHC: 26 659 m
     - Magnets: 9300
     - Temperature: -271.3 °C (1.9 K)
     - Cooling: ~60 t of liquid He
     - Maximum beam energy: 7 TeV
     - Current beam energy: 3.5 TeV
  5. Introduction to CERN: data
     - Signal/noise ratio: 10^-9
     - Data volume: high rate x large number of channels x 4 experiments -> 15 PetaBytes of new data each year
     - Compute power: event complexity x number of events x thousands of users -> 100k CPUs (cores)
     - Worldwide analysis and funding: computing funded locally in major regions and countries; efficient analysis everywhere -> Grid technology
  6. The LCG computing Grid
     - Required computing capacity: ~100 000 processors
     - Number of sites:
       - Tier-0: 1 (CERN), providing ~20% of the capacity
       - Tier-1: 11 around the world
       - Tier-2: ~160
     http://lcg.web.cern.ch/lcg/public/
  7. The CERN Computer Centre
     Disk and tape:
     - 1500 disk servers, 5 PB of disk space
     - 16 PB of tape storage
     Computing facilities:
     - >20 000 CPU cores (batch only)
     - Up to ~10 000 concurrent jobs
     - Job throughput: ~200 000 jobs/day
     http://it-dep.web.cern.ch/it-dep/communications/it_facts__figures.htm
  8. Why virtualization and cloud computing?
     Service consolidation:
     - Improve resource usage by consolidating mostly idle machines onto single large hypervisors
     - Decouple the hardware life cycle from the applications running on the box
     - Ease management by supporting live migration
     Virtualization of batch resources:
     - Decouple jobs from physical resources
     - Ease management of the batch farm resources
     - Prepare the computer centre for new computing models
     This presentation is about the virtualization of batch resources only.
  9. Batch virtualization
     CERN batch farm (lxbatch):
     - ~3000 physical hosts
     - ~20 000 CPU cores
     - >70 queues
     Virtualization models:
     - Type 1: run my jobs in your VM
     - Type 2: run my jobs in my VM
  10. Towards cloud computing
     - Type 3: give me my infrastructure, i.e. a VM or a batch of VMs
  11. Philosophy (SLC = Scientific Linux CERN)
     [Diagram] Near future: physical SLC4 and SLC5 worker nodes run alongside virtual SLC4 and SLC5 worker nodes on a hypervisor cluster, all serving batch. (Far) future: an internal cloud hosting batch, T0, development and other/cloud applications.
  12. Visions beyond the current plan
     - Reusing/sharing images between different sites (phase 2): a HEPIX working group was founded in autumn 2009 to define rules and boundary conditions (https://www.hepix.org/)
     - Experiment-specific images (phase 3): use of images customized for specific experiments
     - Use of resources in a cloud-like operation mode (phase 4)
     - Images directly join experiment-controlled scheduling systems (phase 5): controlled by the experiment, spread across sites
  13. Virtual batch: basic ideas
     Virtual batch worker nodes:
     - Clones of real worker nodes, with the same setup
     - Mixed with physical resources
     - Dynamically join the batch farm as normal worker nodes
     - Limited lifetime: stop accepting jobs after 24h, destroyed when empty
     - Only one user job per VM at a time
     Note: the limited lifetime allows for a fully automated system which dynamically adapts to the current needs and automatically deploys intrusive updates.
  14. Virtual batch: basic ideas (technical)
     Images:
     - Staged on the hypervisors: master images, with instances created as LVM snapshots
     - Start with only a few different flavours
     Image creation:
     - Derived from a centrally managed "golden node"
     - Regularly updated and redistributed to get updates "in"
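     To make the LVM-snapshot idea concrete, here is a minimal sketch (not CERN's actual tooling) of how a hypervisor could instantiate a worker-node disk as a copy-on-write snapshot of the staged master image; the volume group and logical volume names are assumptions.

     # Sketch: instantiate a VM disk as an LVM snapshot of a staged master image.
     # Volume group and LV names below are illustrative only.
     import subprocess

     VG = "vg_vm"               # assumed volume group holding the master images
     MASTER_LV = "slc5_master"  # assumed LV containing the golden-node image

     def create_instance_disk(instance_name, snapshot_size="10G"):
         """Create a copy-on-write snapshot of the master image for one VM."""
         subprocess.run(
             ["lvcreate", "--snapshot",
              "--size", snapshot_size,   # space reserved for copy-on-write blocks
              "--name", instance_name,
              f"/dev/{VG}/{MASTER_LV}"],
             check=True,
         )
         # Hand this block device to the hypervisor as the VM's disk.
         return f"/dev/{VG}/{instance_name}"

     def destroy_instance_disk(instance_name):
         """Remove the snapshot once the VM has destroyed itself."""
         subprocess.run(["lvremove", "-f", f"/dev/{VG}/{instance_name}"], check=True)

     Because only changed blocks are copied, creating such a snapshot takes seconds regardless of the image size, which is what makes per-job worker-node VMs practical.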
  15. Virtual batch: basic ideas (image distribution)
     - The only shared file system available is AFS
     - Peer-to-peer methods are preferred (more on that later):
       - SCP wave
       - rtorrent
  16. Virtual batch: basic ideas (VM placement and management)
     - Use existing solutions for high-level VM management
     - Testing both a free and a commercial solution:
       - OpenNebula (ONE)
       - Platform's Infrastructure Sharing Facility (ISF)
  17. Batch virtualization: architecture
     [Diagram; grey: physical resources, coloured: different VMs] Components: centrally managed golden nodes, a VM kiosk, job submission via CE/interactive nodes, the batch system management, the VM management system, the hypervisors/hardware resources, and VM worker nodes with limited lifetime.
  18. Status of building blocks (test system)
     - Submission and batch cluster: initial deployment OK; central management OK; monitoring and alarming OK
     - Hypervisor management: initial deployment OK; central management OK; monitoring and alarming switched off for tests
     - VM kiosk and image distribution: initial deployment OK; central management mostly implemented; monitoring and alarming missing
     - VM management system: initial deployment OK; central management ISF OK, ONE missing; monitoring and alarming missing
  19. Image distribution: SCP wave versus rtorrent
     [Plot, preliminary; BT = BitTorrent] Slow nodes are under investigation.
  20. VM placement and management: OpenNebula (ONE)
     Basic model:
     - A single ONE master
     - Communication with the hypervisors via ssh only
     - (Currently) no special tools on the hypervisors
     Experience:
     - Some scalability issues at the beginning (around 50 VMs initially)
     - Issues are addressed as they turn up, in close collaboration with the developers; ideas for improvements
     - Managed to start more than 7 500 VMs
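     As an illustration of the single-master model, below is a minimal sketch of how one virtual worker node could be requested from the ONE master via its command-line interface; the template attributes, image path and bridge name are placeholders, not the actual CERN configuration.

     # Sketch: submit one virtual worker node to the OpenNebula master.
     # Assumes the ONE CLI is installed and ONE_AUTH/ONE_XMLRPC are configured.
     import subprocess
     import tempfile

     VM_TEMPLATE = """
     NAME   = "vbatch-worker"
     CPU    = 1
     MEMORY = 2048
     DISK   = [ SOURCE   = "/var/lib/one/images/slc5_worker.img",
                TARGET   = "hda",
                READONLY = "no" ]
     NIC    = [ BRIDGE = "br0" ]
     """

     def submit_worker_node():
         """Write a template file and hand it to the ONE master via the CLI."""
         with tempfile.NamedTemporaryFile("w", suffix=".one", delete=False) as f:
             f.write(VM_TEMPLATE)
             template_path = f.name
         # 'onevm create <template>' asks the single ONE master to schedule the VM;
         # the master then drives the chosen hypervisor over ssh.
         subprocess.run(["onevm", "create", template_path], check=True)

     if __name__ == "__main__":
         submit_worker_node()

     The test described on the next slide amounts to calling such a submission in a loop and watching how many of the resulting VMs actually register with the batch system.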
  21. Scalability tests: some first numbers
     "One shot" test with OpenNebula:
     - Inject virtual machine requests and let them die
     - Record the number of alive machines seen by LSF every 30 s
     [Plot; time axis in units of 1h05min]
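     The measurement side of this test is simple; a minimal sketch is given below, assuming LSF's standard bhosts command is available and that the virtual worker nodes report status "ok" once they have joined the farm (the host-name prefix is hypothetical).

     # Sketch: every 30 s, count the worker nodes that LSF currently sees as alive.
     import subprocess
     import time

     def count_alive_nodes(prefix="vbatch"):
         """Count hosts whose LSF status is 'ok' and whose name matches the VM prefix."""
         out = subprocess.run(["bhosts", "-w"], capture_output=True, text=True, check=True)
         alive = 0
         for line in out.stdout.splitlines()[1:]:   # skip the header line
             fields = line.split()
             if len(fields) >= 2 and fields[0].startswith(prefix) and fields[1] == "ok":
                 alive += 1
         return alive

     if __name__ == "__main__":
         while True:
             print(time.strftime("%H:%M:%S"), count_alive_nodes())
             time.sleep(30)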
  22. VM placement and management: Platform's Infrastructure Sharing Facility (ISF)
     - One active ISF master node, plus fail-over candidates
     - The hypervisors run an agent which talks to Xen; this needed to be packaged for CERN
     - Resource management layer similar to LSF
     - Scalability is expected to be good but needs verification
     - Tested with 2 racks (96 machines) so far, ramping up
     - Filled with ~2 000 VMs so far (which is the maximum for this setup)
     See: http://www.platform.com
  23. Screen shots: ISF
     [Screenshot: VM status; 1 of 2 racks available] Note: only one of the two racks is enabled, for demonstration purposes.
  24. Screen shots: ISF
     [Screenshot]
  25. Summary
     Virtualization efforts at CERN are proceeding, but there is still work to be done. Main challenges:
     - Scalability of the provisioning systems and of the batch system
     - No decision yet on which provisioning system will be used
     - Reliability and speed of image distribution
     - General readiness for production (hardening)
     - Seamless integration into the existing infrastructure
  26. Outlook
     What's next?
     - Continue testing ONE and ISF in parallel
     - Solve the remaining (known) issues
     - Release the first VMs for testing by our users soon
  27. Questions?
  28. Philosophy (backup)
     [Diagram] VM provisioning system(s) for high-level VM management: Platform's Infrastructure Sharing Facility (ISF), OpenNebula or pVMO, running on top of the hypervisor cluster (the physical resources).
  29. Details: some additional explanations
     "Golden node": a centrally managed (i.e. Quattor-controlled) standard worker node which
     - is a virtual machine
     - does not accept jobs
     - receives regular updates
     - exists purely for the creation of VM images
     "Virtual machine worker node": a virtual machine derived from a golden node which
     - is not updated during its lifetime
     - dynamically adds itself to the batch farm
     - accepts jobs for only 24h
     - runs only one user job at a time
     - destroys itself when empty
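     A minimal sketch of this self-draining lifecycle is shown below; it is not the actual CERN implementation, it assumes standard LSF commands (badmin, bhosts) and simply powers the machine off at the end so the hypervisor can reclaim it.

     # Sketch of the VM worker node lifecycle: accept jobs for 24h, then drain
     # and destroy itself. The joining step at boot is site-specific and omitted.
     import socket
     import subprocess
     import time

     LIFETIME = 24 * 3600        # stop accepting new jobs after 24 hours
     POLL_INTERVAL = 60          # seconds between checks while draining

     def running_jobs(host):
         """Return the number of jobs LSF reports as running on this host."""
         out = subprocess.run(["bhosts", "-w", host],
                              capture_output=True, text=True, check=True)
         fields = out.stdout.splitlines()[-1].split()
         return int(fields[5])   # RUN column of 'bhosts -w'

     def main():
         host = socket.gethostname()
         # 1. Wait out the allowed lifetime, then close the host to new jobs.
         time.sleep(LIFETIME)
         subprocess.run(["badmin", "hclose", host], check=True)
         # 2. Drain: wait until the last user job has finished.
         while running_jobs(host) > 0:
             time.sleep(POLL_INTERVAL)
         # 3. Destroy the (now empty) VM.
         subprocess.run(["poweroff"], check=True)

     if __name__ == "__main__":
         main()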
  30. VM kiosk and image distribution
     Boundary conditions at CERN:
     - The only available shared file system is AFS
     - Network infrastructure with a single 1 GE connection per node; no dedicated fast network (e.g. 10 GE, InfiniBand or similar) that could be used for transfers
     Tested options:
     - SCP wave: developed at Clemson University, based on simple scp
     - rtorrent: infrastructure developed at CERN; each node starts serving the blocks it already hosts
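     The SCP-wave idea is simple enough to sketch: every node that has already received the image immediately becomes a source for further copies, so the number of simultaneous transfers roughly doubles in each round. The sketch below only illustrates the principle (it is not the Clemson tool) and assumes password-less ssh/scp between the hosts and an agreed image path.

     # Sketch of the scp-wave principle: completed targets join the pool of sources.
     # Image path and host names are placeholders.
     import subprocess
     from concurrent.futures import ThreadPoolExecutor

     IMAGE = "/var/lib/vmimages/slc5_worker.img.gz"   # assumed image location

     def push(source, target):
         """Copy the image from one node that has it to one node that does not."""
         subprocess.run(["ssh", source, "scp", IMAGE, f"{target}:{IMAGE}"], check=True)
         return target

     def scp_wave(seed, targets):
         sources = [seed]                    # nodes that already hold the image
         pending = list(targets)             # nodes still waiting for it
         while pending:
             batch = pending[:len(sources)]  # one transfer per available source
             pending = pending[len(sources):]
             with ThreadPoolExecutor(max_workers=len(batch)) as pool:
                 done = list(pool.map(push, sources, batch))
             sources.extend(done)            # finished targets become new sources

     if __name__ == "__main__":
         scp_wave("kiosk.example.org", [f"hv{i:03d}.example.org" for i in range(452)])

     The torrent-based approach generalises this further: instead of whole-file copies, every node serves the individual blocks it already holds.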
  31. Details: VM kiosk and image distribution
     Local image distribution model at CERN: virtual machines are instantiated as LVM snapshots of a base image; this process is very fast.
     Replacing a production image:
     - Approved images are moved to a central image repository (the "kiosk")
     - The hypervisors regularly check the kiosk node for new images
     - A new image is transferred to a temporary area on the hypervisor
     - When the transfer has finished, a new LV is created
     - The new image is unpacked into the new LV
     - The current production image is renamed (via lvrename)
     - The new image is renamed to become the production image
     Note: this process may need a sophisticated locking strategy.
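     A minimal sketch of the swap step described above is given below, under the assumption of a volume group named vg_vm and a gzip-compressed raw image already fetched from the kiosk; a real implementation would also need the locking mentioned on the slide.

     # Sketch of the image replacement on a hypervisor: unpack the new image into
     # a fresh LV, then swap names with lvrename. VG/LV names and sizes are assumed.
     import subprocess

     VG = "vg_vm"
     PROD_LV = "slc5_master"                          # current production master image
     TMP_IMAGE = "/var/tmp/slc5_master_new.img.gz"    # image fetched from the kiosk

     def replace_production_image(lv_size="20G"):
         # 1. Create a fresh LV for the incoming image.
         subprocess.run(["lvcreate", "--size", lv_size, "--name", f"{PROD_LV}_new", VG],
                        check=True)
         # 2. Unpack the compressed raw image straight onto the new LV.
         with open(f"/dev/{VG}/{PROD_LV}_new", "wb") as dev:
             subprocess.run(["gunzip", "-c", TMP_IMAGE], stdout=dev, check=True)
         # 3. Swap: keep the old image under a different name, promote the new one.
         subprocess.run(["lvrename", VG, PROD_LV, f"{PROD_LV}_old"], check=True)
         subprocess.run(["lvrename", VG, f"{PROD_LV}_new", PROD_LV], check=True)
         # Note: in production this sequence needs locking against concurrent
         # snapshot creation from the master image.

     if __name__ == "__main__":
         replace_production_image()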
  32. Image distribution with torrent: transfer speed
     [Plot, preliminary] 7 GB compressed file, 452 target nodes; 90% of the nodes finished after 25 min. Slow nodes are under investigation.
  33. Image distribution: total distribution speed
     [Plot, preliminary] 7 GB compressed file, 452 target nodes; all done after 1.5 h. Unpacking still needs some tuning.
