The CERN computing infrastructure has a number of immediate problems to solve. Specifically, how do we continue providing the capacity for the LHC experiments? The CERN computer centre was built in the 1970s for a Cray and a mainframe and has been gradually adapted to meet the needs of modern commodity computing. However, we have reached the limits and can no longer adapt because of constraints on cooling and electricity. At the same time, the tool set that we wrote during the past 10 years to manage the 8,000 servers is getting old. There were no alternatives when we started, so a hand-crafted set of tools such as Quattor, Sindes, SMS and Lemon was developed. These now require more work to maintain than we have resources available. Equally, we know that we are not getting the most out of our installed capacity, but identifying the root causes behind inefficiencies is difficult with the vast amounts of data to work through and correlate.
Deployment is staged, starting small in 2013 and growing through 2017 in gradual steps. This allows us to debug processes and facilities early in 2013.
CERN’s computing capacity is no longer at the leading edge. Companies like Rackspace have over 80,000 servers and Google is rumoured to have 500,000 worldwide. Upcoming changes such as the new data centre, IPv6, scalability pressures and even new operating system versions require substantial effort to support on our own. We need to share the effort with others and benefit from their work. We started a project to make the CERN tools more agile, with these basic principles:
• We are not unique; there are solutions out there
• If we find our requirements are not met, question the requirement (and question it again)
• If we need to develop something, find the nearest tool and submit a patch rather than write from scratch
We followed a Google toolchain approach of small tools glued together. The aim is to try things quickly and change if we find limitations. Tool selection is based on maintenance, community support, packaging and documentation quality.
We arrange bulk deliveries of 1000s of servers a year and these can currently take weeks to install and get into production. The large web companies do this automatically using network boot and automatic registration. We will do the same, so that the only necessary step for installation is to boot the machines on a switch which is declared as an installation target. The machines then register themselves into the network and inventory tools. This saves the manual effort of entering MAC addresses. After this, the burn-in tests are started to stress the machines to the full. This finds both near-failing parts and major issues such as firmware problems. The end results are automatically analysed and failures reported to the vendor. All serial numbers and firmware versions are tracked so that failure analysis is easier and statistics on exchanges can be maintained over the life of the machines.
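As a rough illustration of the serial number and firmware tracking described above, the sketch below groups burn-in results by firmware version to derive failure statistics. The record fields, the sample values and the `failure_rate_by_firmware` helper are all hypothetical and do not reflect the actual CERN inventory schema.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnInResult:
    """One burn-in outcome for one machine (hypothetical schema)."""
    serial: str
    firmware: str
    passed: bool


def failure_rate_by_firmware(results):
    """Return {firmware version: fraction of machines that failed burn-in}."""
    total = Counter(r.firmware for r in results)
    failed = Counter(r.firmware for r in results if not r.passed)
    # Counter returns 0 for firmware versions with no failures.
    return {fw: failed[fw] / total[fw] for fw in total}


# Made-up sample data, for illustration only.
results = [
    BurnInResult("SN0001", "1.0", True),
    BurnInResult("SN0002", "1.0", False),
    BurnInResult("SN0003", "1.1", True),
    BurnInResult("SN0004", "1.1", True),
]
rates = failure_rate_by_firmware(results)
```

Keeping the serial number in each record means the same data could also feed per-machine exchange statistics over the life of the hardware.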
When it comes to configuring the boxes after they have been tested, we have a large variety of configurations. There are 100s of different configurations in the computer centre, with only a few, such as the batch farm, being large. Many of the smaller ones have fewer than 10 boxes, where the effort to configure them is high for newly trained service managers. In addition, over 700 boxes in the computer centre are not under configuration management control.
At the same time, the tools we use are not commonly used elsewhere. New arrivals do not learn them at university and require individual training (as no formal training courses are available). By adopting a tool commonly used elsewhere, our staff can learn from books and courses and if they leave after a few years at CERN, they have directly re-usable skills for their CVs.
So, we chose a set of software and integrated it with some standard CERN services which are stable and scalable. Puppet is key for us as there are many off-the-shelf components, from NTP through to web services, which can be re-used. We chose Git although the default source code version control system at CERN is svn. This is because the toolset we are investigating comes with out-of-the-box support for Git. Thus, we should do what the others do…
We are currently managing 160 boxes on the pilot platform.
Managing 15,000 servers is not simple… there are pets which need looking after and cattle which are disposable. The aim is to create the infrastructure to support both pets and cattle but manage the hardware like cattle, so that hardware management staff can work without needing to keep scheduling interventions.
Our objective for the coming years is to have 90% of the hardware resources in the data centres virtualized. In this scenario we will have around 300,000 virtual machines. The CERN IT department currently has two small projects covering this (LxCloud and CVI). The main objective is to build a common solution combining these two projects in terms of scalability and efficiency. We need to support multiple centres, scale to 300K VMs and reduce the running effort via live migration and platform as a service.
After some research into IaaS solutions, we found OpenStack. It provides a set of tools that let us build our private cloud. It is multi-site, scalable and uses standard cloud interfaces. Clients can build their applications using these cloud interfaces. This will position our cloud solution as another cloud provider.
OpenStack has different components such as Nova, Glance, Keystone, Horizon and Swift. This is the architecture of the different services and how they interact with each other. The internal components of Glance and Nova are also shown. If you look at Nova, there are two key components: the queue and the database. Nova is built on a shared-nothing, messaging-based architecture. All the major Nova components can be run on multiple servers. This means that most component-to-component communication must go via the message queue. Our implementation at CERN uses these main components.
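To make the messaging-based idea concrete, here is a minimal sketch in which one component hands work to another only through a queue. It uses Python's in-process `queue.Queue` as a stand-in for a real message broker. It is not Nova code, and the names `api_request`, `compute_worker` and the message fields are assumptions made for illustration.

```python
import queue
import threading

# Hypothetical stand-in for the message queue: components never call each
# other directly, they only publish and consume messages.
message_queue = queue.Queue()
handled = []


def compute_worker():
    """Consume requests from the queue until a None sentinel arrives."""
    while True:
        message = message_queue.get()
        if message is None:
            break
        handled.append(("started", message["vm"]))


def api_request(vm_name):
    """The API side only enqueues a message; it knows nothing about workers."""
    message_queue.put({"action": "start", "vm": vm_name})


worker = threading.Thread(target=compute_worker)
worker.start()
api_request("vm0042")
api_request("vm0043")
message_queue.put(None)  # tell the worker to stop
worker.join()
```

Because the sender only knows about the queue, more workers could be added on other servers without changing the sender, which is the property the shared-nothing design relies on.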
So, final tool summary …
Early adopters such as dev boxes, non-quattorised boxes, lxplus
20120524 cern data centre evolution v2
CERN Infrastructure Evolution
Tim Bell
CHEP, 24th May 2012
CERN IT Department, CH-1211 Genève 23, Switzerland
www.cern.ch/it
Agenda
• Problems to address
• Ongoing projects
  – Data centre expansion
  – Configuration management
  – Infrastructure as a Service
  – Monitoring
• Timelines
• Summary
Problems to address
• CERN data centre is reaching its limits
• IT staff numbers remain fixed but more computing capacity is needed
• Tools are high maintenance and becoming increasingly brittle
• Inefficiencies exist but root cause cannot be easily identified
Data centre extension history
• Studies in 2008 into building a new computer centre on the CERN Prevessin site
  – Too costly
• In 2011, tender run across CERN member states for remote hosting
  – 16 bids
• In March 2012, Wigner Institute in Budapest, Hungary selected
Data Centre Selection
• Wigner Institute in Budapest, Hungary
Data Centre Layout & Ramp-up
Data Centre Plans
• Timescales
  – Prototyping in 2012
  – Testing in 2013
  – Production in 2014
• This will be a “hands-off” facility for CERN
  – They manage the infrastructure and hands-on work
  – We do everything else remotely
• Need to be reasonably dynamic in our usage
  – Will pay power bill
• Various possible models of usage – sure to evolve
  – The less specific the hardware installed there, the easier to change function
  – Disaster Recovery for key services in Meyrin data centre becomes a realistic scenario
Infrastructure Tools Evolution
• We had to develop our own toolset in 2002
• Nowadays,
  – CERN compute capacity is no longer leading edge
  – Many options available for open source fabric management
  – We need to scale to meet the upcoming capacity increase
• If there is a requirement which is not available through an open source tool, we should question the need
  – If we are the first to need it, contribute it back to the open source tool
Where to Look?
• Large community out there taking the “tool chain” approach whose scaling needs match ours: O(100k) servers and many applications
• Become standard and join this community
Bulk Hardware Deployment
• Adapt processes for minimal physical intervention
• New machines register themselves automatically into network databases and inventory
• Burn-in test uses standard Scientific Linux and production monitoring tools
• Detailed hardware inventory with serial number tracking to support accurate analysis of failures
Current Configurations
• Many, diverse applications (“clusters”) managed by different teams
• ..and 700+ other “unmanaged” Linux nodes in VMs that could benefit from a simple configuration system
Job Adverts from Indeed.com
Index of millions of worldwide job posts across thousands of job sites
[Chart series: Puppet, Quattor]
These are the sort of posts our departing staff will be applying for.
Agile Infrastructure - Configuration and Operation Tools
Configuration Management
• Using off the shelf components
  – Puppet – configuration definition
  – Foreman – GUI and Data store
  – Git – version control
  – Mcollective – remote execution
• Integrated with
  – CERN Single Sign On
  – CERN Certificate Authority
  – Installation Server
Different Service Models
• Pets are given names like lsfmaster.cern.ch
  • They are unique, lovingly hand raised and cared for
  • When they get ill, you nurse them back to health
• Cattle are given numbers like vm0042.cern.ch
  • They are almost identical to other cattle
  • When they get ill, you get another one
• Future application architectures tend towards Cattle
• .. But users now need support for both modes of working
Infrastructure as a Service
• Goals
  – Improve repair processes with virtualisation
  – More efficient use of our hardware
  – Better tracking of usage
  – Enable remote management for new data centre
  – Support potential new use cases (PaaS, Cloud)
  – Sustainable support model
• At scale for 2015
  – 15,000 servers
  – 90% of hardware virtualized
  – 300,000 VMs needed
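The three scale figures on this slide imply an average VM density per virtualized server. The short calculation below works it out, under the simplifying assumption (not stated in the slides) that every virtualized server acts as a hypervisor and the VMs are spread evenly.

```python
servers = 15_000          # from the slide
virtualised_percent = 90  # from the slide
vms_needed = 300_000      # from the slide

# Assumption: each virtualized server is one hypervisor.
hypervisors = servers * virtualised_percent // 100

# Assumption: VMs are spread evenly across hypervisors.
vms_per_hypervisor = vms_needed / hypervisors

print(f"{hypervisors} hypervisors, about {vms_per_hypervisor:.1f} VMs each")
```

That gives 13,500 hypervisors and roughly 22 VMs per hypervisor on average.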
OpenStack
• Open source cloud software
• Supported by 173 companies including IBM, RedHat, Rackspace, HP, Cisco, AT&T, …
• Vibrant development community and ecosystem
• Infrastructure as a Service to our scale
• Started in 2010 but maturing rapidly
Scheduling and Accounting
• Multiple uses of IaaS
  – Server consolidation
  – Classic batch (single or multi-core)
  – Cloud VMs such as CERNVM
• Scheduling options
  – Availability zones for disaster recovery
  – Quality of service options to improve efficiency such as build machines, public login services
  – Batch system scalability is likely to be an issue
• Accounting
  – Use underlying services of IaaS and Hypervisors for reporting and quotas
Monitoring in IT
• >30 monitoring applications
  – Number of producers: ~40k
  – Input data volume: ~280 GB per day
• Covering a wide range of different resources
  – Hardware, OS, applications, files, jobs, etc.
• Application-specific monitoring solutions
  – Using different technologies (including commercial tools)
  – Sharing similar needs: aggregate metrics, get alarms, etc.
• Limited sharing of monitoring data
  – Hard to implement complex monitoring queries
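For a sense of scale, the approximate monitoring figures above can be turned into per-producer and per-second averages. This is a back-of-the-envelope sketch that assumes 1 GB = 10^9 bytes and a perfectly even load, neither of which is stated in the slides.

```python
producers = 40_000             # "~40k" from the slide
bytes_per_day = 280 * 10**9    # "~280 GB per day", assuming 1 GB = 10^9 bytes
seconds_per_day = 24 * 60 * 60

# Average daily volume per producer, in megabytes (10^6 bytes).
mb_per_producer_per_day = bytes_per_day / producers / 10**6

# Average aggregate input rate, in megabytes per second.
mb_per_second = bytes_per_day / seconds_per_day / 10**6

print(f"{mb_per_producer_per_day:.0f} MB per producer per day")
print(f"{mb_per_second:.2f} MB/s aggregate average")
```

On these assumptions each producer contributes about 7 MB per day and the aggregate input averages a little over 3 MB/s. Real traffic is likely to be burstier than the average suggests.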
Current Tool Snapshot (Liable to Change)
[Diagram; tools shown: Puppet; Foreman; mcollective, yum; Jenkins; AIMS/PXE; Foreman; JIRA; Openstack Nova; git, SVN; Koji, Mock; Yum repo, Pulp; Hardware database; Lemon; Puppet stored config DB]
Timelines
• 2012
  – Actions: Prepare formal project plan; Establish IaaS in CERN Data Centre; Monitoring Implementation as per WG; Migrate lxcloud users; Early adopters to use new tools
• 2013
  – What: LS 1; New Data Centre
  – Actions: Extend IaaS to remote Data Centre; Business Continuity; Migrate CVI users; General migration to new tools with SLC6 and Windows 8
• 2014
  – What: LS 1 (to November)
  – Actions: Phase out legacy tools such as Quattor
Conclusions
• New data centre provides opportunity to expand the Tier-0 capacity
• New tools are required to manage these systems and improve processes given fixed staff levels
• Integrated monitoring is key to identify inefficiencies, bottlenecks and usage
• Moving to an open source tool chain has allowed a rapid proof of concept and will be a more sustainable solution
CERN Data Centre Numbers
Systems                   7,899    Hard disks               62,023
Processors               14,972    Raw disk capacity (TiB)  62,660
Cores                    64,623    Tape capacity (PiB)          47
Memory (TiB)                165    Ethernet 1Gb ports       16,773
Racks                     1,070    Ethernet 10Gb ports         622
Power Consumption (KiW)   2,345
From http://sls.cern.ch/sls/service.php?id=CCBYNUM
Computer Centre Capacity
Capacity to end of LEP
Foreman Dashboard
Industry Context - Cloud
• Cloud computing models are now standardising
  – Facilities as a Service – such as Equinix, Safehost
  – Infrastructure as a Service - Amazon EC2, CVI or lxcloud
  – Platform as a Service - Microsoft Azure or CERN Web Services
  – Software as a Service – Salesforce, Google Mail, Service-Now, Indico
• Different customers want access to different layers
  – Both our users and the IT Service Managers
[Diagram layers: Applications, Platform, Infrastructure, Facilities]