Accelerating Science with OpenStack.pptx



  1. Accelerating Science with OpenStack. Tim Bell (@noggin143), OpenStack Summit, San Diego, 17th October 2012
  2. What is CERN?
     • Conseil Européen pour la Recherche Nucléaire, aka European Laboratory for Particle Physics
     • Between Geneva and the Jura mountains, straddling the Swiss-French border
     • Founded in 1954 with an international treaty
     • Our business is fundamental physics: what is the universe made of and how does it work?
     OpenStack Summit October 2012, Tim Bell, CERN
  3. Answering fundamental questions…
     • How to explain that particles have mass? We have theories and accumulating experimental evidence. Getting close…
     • What is 96% of the universe made of? We can only see 4% of its estimated mass!
     • Why isn’t there anti-matter in the universe? Nature should be symmetric…
     • What was the state of matter just after the « Big Bang »? Travelling back to the earliest instants of the universe would help…
  4. Community collaboration on an international scale
  5. The Large Hadron Collider
  6. The Large Hadron Collider (LHC) tunnel
  8. Accumulating events in 2009-2011
  10. Heavy Ion Collisions
  12. Tier-0 (CERN):
      • Data recording
      • Initial data reconstruction
      • Data distribution
      Tier-1 (11 centres):
      • Permanent storage
      • Re-processing
      • Analysis
      Tier-2 (~200 centres):
      • Simulation
      • End-user analysis
      Overall:
      • Data is recorded at CERN and Tier-1s and analysed in the Worldwide LHC Computing Grid
      • In a normal day, the grid provides 100,000 CPU days executing over 2 million jobs
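As a rough check on the two grid figures quoted on this slide, the average CPU time per job can be worked out directly. This is only a back-of-the-envelope sketch: it assumes each job runs on a single core, and since the slide says "over 2 million jobs", the result is an upper bound on the average.

```python
# Figures quoted on the slide
cpu_days_per_day = 100_000
jobs_per_day = 2_000_000  # "over 2 million", so treat this as a lower bound

# Average CPU time per job, assuming one core per job (assumption)
avg_cpu_hours_per_job = cpu_days_per_day * 24 / jobs_per_day

# 100,000 CPU-days delivered per day is equivalent to this many cores busy all day
cores_busy_all_day = cpu_days_per_day

print(f"about {avg_cpu_hours_per_job:.1f} CPU-hours per job on average")
print(f"equivalent to {cores_busy_all_day:,} cores busy around the clock")
```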
  13. Data Centre by Numbers
      • Hardware installation & retirement: ~7,000 hardware movements/year; ~1,800 disk failures/year
      Racks: 828
      Servers: 11,728
      Processors: 15,694
      Cores: 64,238
      HEPSpec06: 482,507
      Disks: 64,109
      Raw disk capacity (TiB): 63,289
      Memory modules: 56,014
      Memory capacity (TiB): 158
      RAID controllers: 3,749
      Tape Drives: 160
      Tape Cartridges: 45,000
      Tape slots: 56,000
      Tape Capacity (TiB): 73,000
      High Speed Routers (640 Mbps → 2.4 Tbps): 24
      Ethernet Switches: 350
      10 Gbps ports: 2,000
      Switching Capacity: 4.8 Tbps
      1 Gbps ports: 16,939
      10 Gbps ports: 558
      IT Power Consumption: 2,456 kW
      Total Power Consumption: 3,890 kW
      [Pie charts: share of processor models (Xeon 3GHz, 5150, 5160, E5335, E5345, E5405, E5410, L5420, L5520, Other) and of disk vendors (Fujitsu, Hitachi, HP, Seagate, Maxtor, Western Digital)]
  15. Our Challenges - Data storage
      • >20 years retention
      • 6GB/s average
      • 25GB/s peaks
      • 30PB/year to record
  16. 45,000 tapes holding 73PB of physics data
  17. New data centre to expand capacity
      • Data centre in Geneva at the limit of electrical capacity at 3.5MW
      • New centre chosen in Budapest, Hungary
      • Additional 2.7MW of usable power
      • Hands off facility
      • Deploying from 2013 with 200Gbit/s network to CERN
  18. Time to change strategy
      • Rationale
        – Need to manage twice as many servers as today
        – No increase in staff numbers
        – Tools becoming increasingly brittle and will not scale as-is
      • Approach
        – CERN is no longer a special case for compute
        – Adopt an open source tool chain model
        – Our engineers rapidly iterate
          • Evaluate solutions in the problem domain
          • Identify functional gaps and challenge them
          • Select first choice but be prepared to change in future
        – Contribute new function back to the community
  19. Building Blocks
      [Diagram components: mcollective, yum; Bamboo; Puppet; AIMS/PXE; Foreman; JIRA; OpenStack Nova; git; Koji, Mock; Yum repo / Pulp; Active Directory / LDAP; Lemon / Hadoop; Hardware database; Puppet-DB]
  20. Training and Support
      • Buy the book rather than guru mentoring
      • Follow the mailing lists to learn
      • Newcomers are rapidly productive (and often know more than us)
      • Community and Enterprise support means we’re not on our own
  21. Staff Motivation
      • Skills valuable outside of CERN when an engineer’s contract ends
  22. Prepare the move to the clouds
      • Improve operational efficiency
        – Machine ordering, reception and testing
        – Hardware interventions with long running programs
        – Multiple operating system demand
      • Improve resource efficiency
        – Exploit idle resources, especially waiting for disk and tape I/O
        – Highly variable load such as interactive or build machines
      • Enable cloud architectures
        – Gradual migration to cloud interfaces and workflows
      • Improve responsiveness
        – Self-Service with coffee break response time
  23. Public Procurement Purchase Model
      Step                                  Time (Days)                       Elapsed (Days)
      User expresses requirement                                              0
      Market Survey prepared                15                                15
      Market Survey for possible vendors    30                                45
      Specifications prepared               15                                60
      Vendor responses                      30                                90
      Test systems evaluated                30                                120
      Offers adjudicated                    10                                130
      Finance committee                     30                                160
      Hardware delivered                    90                                250
      Burn in and acceptance                30 typical, 380 worst case        280
      Total                                                                   280+ Days
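The "Elapsed" column of the procurement table is a running total of the "Time" column, which a short sketch can reproduce. The step names and durations are taken from the slide; the variable names are illustrative, and only the typical 30-day burn-in is modelled, not the worst case.

```python
from itertools import accumulate

# (step, duration in days) as listed on the slide; the first step starts the clock.
STEPS = [
    ("User expresses requirement", 0),
    ("Market Survey prepared", 15),
    ("Market Survey for possible vendors", 30),
    ("Specifications prepared", 15),
    ("Vendor responses", 30),
    ("Test systems evaluated", 30),
    ("Offers adjudicated", 10),
    ("Finance committee", 30),
    ("Hardware delivered", 90),
    ("Burn in and acceptance", 30),  # typical case only
]

# Running total of the durations gives the elapsed days after each step.
elapsed = list(accumulate(days for _, days in STEPS))

for (step, days), total in zip(STEPS, elapsed):
    print(f"{step:<36}{days:>4}{total:>6}")
```

The final running total matches the slide's 280 days, more than nine months from expressing a requirement to usable hardware in the typical case.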
  24. Service Model
      • Pets are given names like …
      • They are unique, lovingly hand raised and cared for
      • When they get ill, you nurse them back to health
      • Cattle are given numbers like …
      • They are almost identical to other cattle
      • When they get ill, you get another one
      • Future application architectures should use Cattle but Pets with strong configuration management are viable and still needed
  25. Supporting the Pets with OpenStack
      • Network
        – Interfacing with legacy site DNS and IP management
        – Ensuring Kerberos identity before VM start
      • Puppet
        – Ease use of configuration management tools with our users
        – Exploit mcollective for orchestration/delegation
      • External Block Storage
        – Currently using nova-volume with Gluster backing store
      • Live migration to maximise availability
        – KVM live migration using Gluster
        – KVM and Hyper-V block migration
  26. Current Status of OpenStack at CERN
      • Working on an Essex code base from the EPEL repository
        – Excellent experience with the Fedora cloud-sig team
        – Cloud-init for contextualisation, oz for images with RHEL/Fedora
      • Components
        – Current focus is on Nova with KVM and Hyper-V
        – Tests with Swift are ongoing but require significant experiment code changes
      • Pre-production facility with around 150 hypervisors and 2,000 VMs, integrated with CERN infrastructure, deployed with Puppet, and used for simulation of magnet placement using LHC@Home and for batch
  28. When communities combine…
      • OpenStack’s many components and options make configuration complex out of the box
      • Puppet forge module from PuppetLabs does our configuration
      • The Foreman adds OpenStack provisioning: from user kiosk to a configured machine in 15 minutes
  29. Foreman to manage Puppetized VM
  30. Active Directory Integration
      • CERN’s Active Directory
        – Unified identity management across the site
        – 44,000 users
        – 29,000 groups
        – 200 arrivals/departures per month
      • Full integration with Active Directory via LDAP
        – Uses the OpenLDAP backend with some particular configuration settings
        – Aim for minimal changes to Active Directory
        – 7 patches submitted around hard coded values and additional filtering
      • Now in use in our pre-production instance
        – Map project roles (admins, members) to groups
        – Documentation in the OpenStack wiki
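The idea of mapping project roles (admins, members) to directory groups can be illustrated with a small lookup. This is a sketch of the concept only, not CERN's configuration or the OpenStack identity API: the group names, project name and the `roles_for` helper are all hypothetical, and a real deployment would read group membership from LDAP rather than from a list.

```python
# Hypothetical group-to-role table; every name here is invented for illustration.
GROUP_ROLES = {
    "cloud-project-x-admins": ("project-x", "admin"),
    "cloud-project-x-members": ("project-x", "member"),
}


def roles_for(user_groups):
    """Return {project: set of roles} for the directory groups a user belongs to."""
    roles = {}
    for group in user_groups:
        if group in GROUP_ROLES:
            project, role = GROUP_ROLES[group]
            roles.setdefault(project, set()).add(role)
    return roles


# A user in the admins group plus an unrelated group gets only the admin role.
print(roles_for(["cloud-project-x-admins", "some-unrelated-group"]))
```

Keeping the mapping on the cloud side is one way to meet the stated aim of minimal changes to Active Directory itself, since arrivals and departures then only need group membership updates.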
  31. Welcome Back Hyper-V!
      • We currently use Hyper-V/System Centre for our server consolidation activities
        – But need to scale to 100x current installation size
      • Choice of hypervisors should be tactical
        – Performance
        – Compatibility/Support with integration components
        – Image migration from legacy environments
      • CERN is working closely with the Hyper-V OpenStack team
        – Puppet to configure hypervisors on Windows
        – Most functions work well but further work on Console, Ceilometer, …
  32. Opportunistic Clouds in online experiment farms
      • The CERN experiments have farms of 1000s of Linux servers close to the detectors to filter the 1PByte/s down to 6GByte/s to be recorded to tape
      • When the accelerator is not running, these machines are currently idle
        – Accelerator has regular maintenance slots of several days
        – Long Shutdown due from March 2013 to November 2014
      • One of the experiments is deploying OpenStack on its farm
        – Simulation (low I/O, high CPU)
        – Analysis (high I/O, high CPU, high network)
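The scale of the online filtering quoted on this slide (1PByte/s from the detectors down to 6GByte/s recorded) is easier to appreciate as a ratio. The sketch below assumes decimal units, which the slide does not specify.

```python
PB = 10**15  # decimal petabyte (assumption; the slide does not say decimal or binary)
GB = 10**9   # decimal gigabyte (same assumption)

detector_rate = 1 * PB  # bytes/s arriving at the online farms
recorded_rate = 6 * GB  # bytes/s recorded to tape

reduction_factor = detector_rate / recorded_rate
kept_fraction = recorded_rate / detector_rate

print(f"reduction factor ~{reduction_factor:,.0f}x, fraction kept {kept_fraction:.4%}")
```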
  33. Federated European Clouds
      • Two significant European projects around Federated Clouds
        – European Grid Initiative Federated Cloud as a federation of grid sites providing IaaS
        – HELiX Nebula European Union funded project to create a scientific cloud based on commercial providers
      EGI Federated Cloud Sites: CESGA, CESNET, INFN, SARA, Cyfronet, FZ Jülich, SZTAKI, IPHC, GRIF, GRNET, KTH, Oxford, GWDG, IGI, TCD, IN2P3, STFC
  34. Federated Cloud Commonalities
      • Basic building blocks
        – Each site gives an IaaS endpoint with an API and common security policy
          • OCCI? CDMI? Libcloud? Jclouds?
        – Image stores available across the sites
        – Federated identity management based on X.509 certificates
        – Consolidation of accounting information to validate pledges and usage
      • Multiple cloud technologies
        – OpenStack
        – OpenNebula
        – Proprietary
  35. Next Steps
      • Deploy into production at the start of 2013 with Folsom running the Grid software on top of OpenStack IaaS
      • Support multi-site operations with 2nd data centre in Hungary
      • Exploit new functionality
        – Ceilometer for metering
        – Bare metal for non-virtualised use cases such as high I/O servers
        – X.509 user certificate authentication
        – Load balancing as a service
      Ramping to 15,000 hypervisors with 100,000 to 300,000 VMs by 2015
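The 2015 target implies a range of VM densities per hypervisor, which can be compared with the pre-production figures quoted earlier in the talk (around 150 hypervisors with 2,000 VMs). A quick sketch of the arithmetic, using only those slide figures:

```python
# 2015 target from this slide
target_hypervisors = 15_000
target_vms_low, target_vms_high = 100_000, 300_000

# Pre-production figures from the "Current Status" slide (approximate)
current_hypervisors = 150
current_vms = 2_000

density_low = target_vms_low / target_hypervisors
density_high = target_vms_high / target_hypervisors
density_now = current_vms / current_hypervisors

print(f"target: {density_low:.1f} to {density_high:.1f} VMs per hypervisor")
print(f"pre-production today: about {density_now:.1f} VMs per hypervisor")
```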
  36. What are we missing (or haven’t found yet)?
      • Best practice for
        – Monitoring and KPIs as part of core functionality
        – Guest disaster recovery
        – Migration between versions of OpenStack
      • Roles within multi-user projects
        – VM owner allowed to manage their own resources (start/stop/delete)
        – Project admins allowed to manage all resources
        – Other members should not have high rights over other members’ VMs
      • Global quota management for non-elastic private cloud
        – Manage resource prioritisation and allocation centrally
        – Capacity management / utilisation for planning
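The role model wished for on this slide can be stated as a small permission check. This is a sketch of the desired behaviour under assumed names (`VM`, `can_manage`, and the sample users and project are hypothetical), not an existing OpenStack policy mechanism.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VM:
    name: str
    owner: str
    project: str


def can_manage(user, admin_projects, vm):
    """Return True if `user` may start/stop/delete `vm`.

    admin_projects: set of project names in which `user` is a project admin.
    """
    if vm.project in admin_projects:
        return True          # project admins manage all resources in the project
    return vm.owner == user  # everyone else manages only their own VMs


vm = VM(name="vm-a", owner="alice", project="physics")
print(can_manage("alice", set(), vm))        # the VM owner
print(can_manage("bob", set(), vm))          # another project member
print(can_manage("carol", {"physics"}, vm))  # a project admin
```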
  37. Conclusions
      • Production at CERN in next few months on Folsom
        – Our emphasis will shift to focus on stability
        – Integrate CERN legacy integrations via formal user exits
        – Work together with others on scaling improvements
      • Community is key to shared success
        – Our problems are often resolved before we raise them
        – Packaging teams are producing reliable builds promptly
      • CERN contributes and benefits
        – Thanks to everyone for their efforts and enthusiasm
        – Not just code but documentation, tests, blogs, …
  39. Backup Slides
  41. CERN’s tools
      • The world’s most powerful accelerator: LHC
        – A 27 km long tunnel filled with high-tech instruments
        – Equipped with thousands of superconducting magnets
        – Accelerates particles to energies never before obtained
        – Produces particle collisions creating microscopic “big bangs”
      • Very large sophisticated detectors
        – Four experiments each the size of a cathedral
        – Hundred million measurement channels each
        – Data acquisition systems treating Petabytes per second
      • Top level computing to distribute and analyse the data
        – A Computing Grid linking ~200 computer centres around the globe
        – Sufficient computing power and storage to handle 25 Petabytes per year, making them available to thousands of physicists for analysis
  42. Our Infrastructure
      • Hardware is generally based on commodity, white-box servers
        – Open tendering process based on SpecInt/CHF, CHF/Watt and GB/CHF
        – Compute nodes typically dual processor, 2GB per core
        – Bulk storage on 24x2TB disk storage-in-a-box with a RAID card
      • Vast majority of servers run Scientific Linux, developed by Fermilab and CERN, based on Red Hat Enterprise Linux
        – Focus is on stability in view of the number of centres on the WLCG
  43. New architecture data flows
  44. Virtualisation on SCVMM/Hyper-V
      [Chart: monthly values from Mar-10 to Oct-12, y-axis 0 to 3,500; series: Linux, Windows]
  45. Scaling up with Puppet and OpenStack
      • Use LHC@Home based on BOINC for simulating the magnets guiding particles around the LHC
      • Naturally, there is a puppet module puppet-boinc
      • 1000 VMs spun up to stress test the hypervisors with Puppet, Foreman and OpenStack