CERN Infrastructure Evolution




                                 Tim Bell
                                   CHEP
CERN IT Department
                              24th May 2012
 CH-1211 Genève 23
        Switzerland
www.cern.ch/it
Agenda

• Problems to address
• Ongoing projects
  – Data centre expansion
  – Configuration management
  – Infrastructure as a Service
  – Monitoring
• Timelines
• Summary

          CERN Infrastructure Evolution   2
Problems to address

• CERN data centre is reaching its
  limits
• IT staff numbers remain fixed but
  more computing capacity is needed
• Tools are high maintenance and
  becoming increasingly brittle
• Inefficiencies exist but root cause
  cannot be easily identified

          CERN Infrastructure Evolution   3
Data centre extension history

• Studies in 2008 into building a new
  computer centre on the CERN
  Prevessin site
  – Too costly
• In 2011, tender run across CERN
  member states for remote hosting
  – 16 bids
• In March 2012, Wigner Institute in
  Budapest, Hungary selected
          CERN Infrastructure Evolution   4
Data Centre Selection




• Wigner Institute in Budapest, Hungary

           CERN Infrastructure Evolution   5
Data Centre Layout & Ramp-up




          CERN Infrastructure Evolution   6
Data Centre Plans
•   Timescales
     – Prototyping in 2012
     – Testing in 2013
     – Production in 2014
•   This will be a “hands-off” facility for CERN
     – They manage the infrastructure and hands-on work
     – We do everything else remotely
•   Need to be reasonably dynamic in our usage
     – Will pay power bill
•   Various possible models of usage – sure to evolve
     – The less specific the hardware installed there, the
       easier to change function
     – Disaster Recovery for key services in Meyrin data
       centre becomes a realistic scenario

                 CERN Infrastructure Evolution               7
Infrastructure Tools Evolution

• We had to develop our own toolset in 2002
• Nowadays,
   – CERN compute capacity is no longer leading edge
   – Many options available for open source fabric
     management
   – We need to scale to meet the upcoming capacity
     increase
• If there is a requirement which is not available
  through an open source tool, we should question
  the need
   – If we are the first to need it, contribute it back to the
     open source tool


                CERN Infrastructure Evolution                    8
Where to Look?
• Large community out there taking the “tool chain”
  approach whose scaling needs match ours:
  O(100k) servers and many applications
• Become standard and join this community
Bulk Hardware Deployment
• Adapt processes for minimal physical
  intervention
• New machines register themselves
  automatically into network databases
  and inventory
• Burn-in test uses standard Scientific
  Linux and production monitoring
  tools
• Detailed hardware inventory with
  serial number tracking to support
  accurate analysis of failures

           CERN Infrastructure Evolution   10
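To make the self-registration step on the slide above concrete, here is a minimal hedged sketch (Python) of how a freshly booted machine could report its serial number and MAC address to an inventory service. The dmidecode call and sysfs path are standard Linux; the inventory URL, payload fields and the use of the requests library are illustrative assumptions, not CERN's actual workflow.

# Hedged sketch only: a newly delivered machine registering itself into
# the network/inventory databases, as described on the slide above.
# The inventory URL, payload fields and 'requests' usage are illustrative
# placeholders; the real CERN workflow is not detailed in this talk.
import json
import subprocess  # check_output needs Python 2.7+

import requests

def serial_number():
    # dmidecode is a standard Linux tool (requires root).
    return subprocess.check_output(
        ["dmidecode", "-s", "system-serial-number"]).decode().strip()

def mac_address(interface="eth0"):
    # Read the MAC address of the primary interface from sysfs.
    with open("/sys/class/net/%s/address" % interface) as f:
        return f.read().strip()

def register(url="https://inventory.example.cern.ch/api/nodes"):
    payload = {"serial": serial_number(),
               "mac": mac_address(),
               "state": "burn-in"}   # hand over to burn-in testing next
    r = requests.post(url, data=json.dumps(payload),
                      headers={"Content-Type": "application/json"})
    r.raise_for_status()
    return r.text

if __name__ == "__main__":
    print(register())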
Current Configurations




• Many, diverse applications (“clusters”) managed by different teams
• …and 700+ other “unmanaged” Linux nodes in VMs that could benefit
  from a simple configuration system


                  CERN Infrastructure Evolution                 11
Job Adverts from Indeed.com

[Chart: Indeed.com job-post trends for “Puppet” and “Quattor”.
 Indeed.com indexes millions of worldwide job posts across thousands of
 job sites. These are the sort of posts our departing staff will be
 applying for.]

          Agile Infrastructure - Configuration and Operation Tools
           CERN Infrastructure Evolution                         12
Configuration Management

• Using off the shelf components
  – Puppet – configuration definition
  – Foreman – GUI and Data store
  – Git – version control
  – Mcollective – remote execution
• Integrated with
  – CERN Single Sign On
  – CERN Certificate Authority
  – Installation Server
           CERN Infrastructure Evolution   13
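As a concrete illustration of the Foreman piece above, the hedged sketch below queries Foreman's REST API for the hosts it knows about. The Foreman hostname and credentials are placeholders and the JSON layout varies between Foreman versions, so treat it as a starting point rather than the CERN setup.

# Hedged sketch: list the hosts known to Foreman through its REST API.
# The Foreman URL and credentials are placeholders; the JSON returned
# differs between Foreman versions, so adapt the parsing as needed.
import json

import requests  # assumes the 'requests' HTTP library is available

FOREMAN = "https://foreman.example.cern.ch"

def list_hosts(user, password):
    r = requests.get(FOREMAN + "/api/hosts",
                     auth=(user, password),
                     headers={"Accept": "application/json"},
                     verify=False)  # or point verify at the CERN CA bundle
    r.raise_for_status()
    return json.loads(r.text)

if __name__ == "__main__":
    for host in list_hosts("admin", "changeme"):
        print(host)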
Status




         CERN Infrastructure Evolution   14
Different Service Models
                      • Pets are given names like
                        lsfmaster.cern.ch
                      • They are unique, lovingly hand
                        raised and cared for
                      • When they get ill, you nurse them
                        back to health

                      • Cattle are given numbers like
                        vm0042.cern.ch
                      • They are almost identical to other
                        cattle
                      • When they get ill, you get another
                        one


• Future application architectures tend towards Cattle
• …but users now need support for both modes of working

                 CERN Infrastructure Evolution               15
Infrastructure as a Service
• Goals
   –   Improve repair processes with virtualisation
   –   More efficient use of our hardware
   –   Better tracking of usage
   –   Enable remote management for new data centre
   –   Support potential new use cases (PaaS, Cloud)
   –   Sustainable support model
• At scale for 2015
   – 15,000 servers
   – 90% of hardware virtualized
   – 300,000 VMs needed

               CERN Agile Infrastructure
                                              16
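A quick back-of-the-envelope check of the 2015 figures above (plain Python, using only the numbers quoted on the slide):

# Back-of-the-envelope check of the 2015 targets quoted above.
servers = 15000
virtualised_fraction = 0.90
target_vms = 300000

hypervisors = servers * virtualised_fraction            # ~13,500 hosts
vms_per_hypervisor = target_vms / float(hypervisors)    # ~22 VMs per host

print("hypervisors: %d, VMs per hypervisor: %.1f"
      % (hypervisors, vms_per_hypervisor))

Roughly 13,500 hypervisors hosting about 22 VMs each is the density the IaaS layer has to sustain.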
OpenStack
• Open source cloud software
• Supported by 173 companies including
  IBM, RedHat, Rackspace, HP, Cisco,
  AT&T, …
• Vibrant development community and
  ecosystem
• Infrastructure as a Service to our scale
• Started in 2010 but maturing rapidly




             CERN Agile Infrastructure
                                             17
OpenStack @ CERN

[Architecture diagram of the OpenStack components deployed at CERN:
 Horizon (dashboard), Keystone (identity), Glance (image service, with
 Registry and Image sub-components) and Nova (compute service, with
 Compute, Scheduler, Volume and Network sub-components).]

                       CERN Agile Infrastructure                   18
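To show how a client exercises the Keystone/Glance/Nova services sketched above, here is a hedged example using the python-novaclient bindings contemporary with this talk (v1.1 API). The endpoint, credentials and the image and flavor names are placeholders; the client library has evolved since, so this is illustrative only.

# Hedged sketch: booting a VM through the python-novaclient v1.1 bindings.
# Credentials, Keystone endpoint and the image/flavor names are placeholders.
from novaclient.v1_1 import client

nova = client.Client("username", "password", "my-project",
                     "https://keystone.example.cern.ch:5000/v2.0/")

image = nova.images.find(name="SLC6 base")    # image registered in Glance
flavor = nova.flavors.find(name="m1.small")

server = nova.servers.create(name="vm0042",
                             image=image,
                             flavor=flavor)
print(server.id)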
Scheduling and Accounting
• Multiple uses of IaaS
  – Server consolidation
  – Classic batch (single or multi-core)
  – Cloud VMs such as CERNVM
• Scheduling options
  – Availability zones for disaster recovery
  – Quality of service options to improve efficiency
    such as build machines, public login services
  – Batch system scalability is likely to be an issue
• Accounting
  – Use underlying services of IaaS and
    Hypervisors for reporting and quotas
             CERN Infrastructure Evolution          19
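In the spirit of the accounting bullet above ("use underlying services of IaaS … for reporting and quotas"), here is a hedged sketch of reading per-project quota limits through the same python-novaclient bindings; the credentials and tenant ID are placeholders.

# Hedged sketch: reading per-project quota limits from Nova, again via
# the python-novaclient v1.1 bindings; credentials and tenant ID are
# placeholders.
from novaclient.v1_1 import client

nova = client.Client("username", "password", "my-project",
                     "https://keystone.example.cern.ch:5000/v2.0/")

quotas = nova.quotas.get("my-tenant-id")
print("instances: %s, cores: %s, ram (MB): %s"
      % (quotas.instances, quotas.cores, quotas.ram))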
Monitoring in IT

• >30 monitoring applications
   – Number of producers: ~40k
   – Input data volume: ~280 GB per day
• Covering a wide range of different resources
   – Hardware, OS, applications, files, jobs, etc.
• Application-specific monitoring solutions
   – Using different technologies (including commercial
     tools)
   – Sharing similar needs: aggregate metrics, get alarms,
     etc
• Limited sharing of monitoring data
   – Hard to implement complex monitoring queries



                         20
Monitoring Architecture

[Architecture diagram: sensors feed producers (currently Lemon), which
 publish to a messaging broker (Apollo); operations consumers and a
 storage consumer read from the broker; data lands in a storage and
 analysis engine (Hadoop), exposed through dashboards and APIs (Splunk)
 and used by operations tools.]

                    CERN Infrastructure Evolution                     21
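To make the producer → broker → consumer flow concrete, below is a hedged sketch of a producer publishing a single metric to a STOMP-speaking broker such as Apollo, using the stomp.py client library. The broker address, credentials, destination name and message schema are assumptions, not the actual CERN monitoring format.

# Hedged sketch: a monitoring producer publishing one metric to the
# messaging broker (Apollo speaks STOMP). Broker address, credentials,
# destination and message schema are placeholders, not CERN's format.
# Assumes the 'stomp.py' client; older versions also need conn.start()
# before connect().
import json
import socket
import time

import stomp

metric = {
    "host": socket.getfqdn(),
    "metric": "load.1min",
    "value": 0.42,
    "timestamp": int(time.time()),
}

conn = stomp.Connection([("broker.example.cern.ch", 61613)])
conn.connect("producer", "secret", wait=True)
conn.send(destination="/topic/monitoring.metrics", body=json.dumps(metric))
conn.disconnect()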
Current Tool Snapshot (Liable to Change)

[Tool-chain diagram: Puppet and Foreman for configuration; mcollective
 and yum for remote execution and updates; AIMS/PXE and Foreman for
 installation; OpenStack Nova for IaaS; Jenkins for continuous
 integration; JIRA for issue tracking; git and SVN for version control;
 Koji and Mock for package building; a Pulp-managed yum repository; plus
 the hardware database, Lemon and the Puppet stored config DB.]
Timelines
Year   What                Actions
2012                       Prepare formal project plan
                           Establish IaaS in CERN Data Centre
                           Monitoring Implementation as per WG
                           Migrate lxcloud users
                           Early adopters to use new tools


2013   LS 1                Extend IaaS to remote Data Centre
       New Data Centre     Business Continuity
                           Migrate CVI users
                           General migration to new tools with SLC6
                           and Windows 8


2014   LS 1 (to November)  Phase out legacy tools such as Quattor




            CERN Infrastructure Evolution                           23
Conclusions
• New data centre provides opportunity to
  expand the Tier-0 capacity
• New tools are required to manage these
  systems and improve processes given
  fixed staff levels
• Integrated monitoring is key to identify
  inefficiencies, bottlenecks and usage
• Moving to an open source tool chain has
  allowed a rapid proof of concept and will
  be a more sustainable solution


            CERN Infrastructure Evolution     24
Background Information

• HEPiX Agile Talks
  – http://cern.ch/go/99Ck
• Tier-0 Upgrade
  – http://cern.ch/go/NN98

Other information, get in touch
         Tim.Bell@cern.ch
         CERN Infrastructure Evolution   25
Backup Slides




CERN Infrastructure Evolution   26
CERN Data Centre Numbers

Systems                    7,899     Hard disks                62,023
Processors                14,972     Raw disk capacity (TiB)   62,660
Cores                     64,623     Tape capacity (PiB)           47
Memory (TiB)                 165     Ethernet 1Gb ports        16,773
Racks                      1,070     Ethernet 10Gb ports          622
Power consumption (kW)     2,345




From http://sls.cern.ch/sls/service.php?id=CCBYNUM

               CERN Infrastructure Evolution                 27
Computer Centre Capacity




         CERN Infrastructure Evolution   28
Capacity to end of LEP




          CERN Infrastructure Evolution   29
Foreman Dashboard




Agile Infrastructure - Configuration and Operation Tools
Industry Context - Cloud

•   Cloud computing models are now standardising
     –   Facilities as a Service – such as Equinix, Safehost
     –   Infrastructure as a Service - Amazon EC2, CVI or lxcloud
     –   Platform as a Service - Microsoft Azure or CERN Web Services
     –   Software as a Service – Salesforce, Google Mail, Service-Now, Indico
•   Different customers want access to different layers
     – Both our users and the IT Service Managers




[Layer diagram: Applications on top of Platform, on top of Infrastructure, on top of Facilities]


                     CERN Infrastructure Evolution                         31
SLC 4 to SLC 5 migration




          CERN Infrastructure Evolution   32


Editor's Notes

  • #4 The CERN computing infrastructure has a number of immediate problems to be solved. Specifically, how do we continue providing the capacity for the LHC experiments? The CERN computer centre was built in the 1970s for a Cray and a mainframe and has been gradually adapted to meet the needs of modern commodity computing. However, we have reached the limits and can no longer adapt, due to constraints on cooling and electricity. At the same time, the tool set that we wrote during the past 10 years to manage the 8,000 servers is getting old. There were no alternatives when we started, and so a hand-crafted set of tools such as Quattor, Sindes, SMS and Lemon has been developed. These now require more work to maintain than the available resources allow. Equally, we know that we are not getting the most out of our installed capacity, but identifying the root causes behind inefficiencies is difficult given the vast amounts of data to work through and correlate.
  • #7 Deployment is staged, starting small in 2013 and growing through 2017 in gradual steps. This allows us to debug processes and facilities early in 2013.
  • #9 CERN’s computing capacity is no longer at the leading edge. Companies like Rackspace have over 80,000 servers and Google is rumoured to have 500,000 worldwide. Upcoming changes such as the new data centre, IPv6, scalability pressures and even new operating system versions require substantial effort to support on our own. We need to share the effort with others and benefit from their work. We started a project to make the CERN tools more agile, with the basic principles: we are not unique, there are solutions out there; if we find our requirements are not met, question the requirement (and question it again); if we need to develop something, find the nearest tool and submit a patch rather than write from scratch.
  • #10 We followed a Google-style tool-chain approach of small tools glued together; the aim is to try quickly and change if we find limitations. Tool selection is based on maintenance, community support, packaging and documentation quality.
  • #11 We arrange bulk deliveries of thousands of servers a year, and these currently can take weeks to install and get into production. The large web companies do this automatically using network boot and automatic registration. We will do the same, so that the only step needed for installation is to boot the machines on a switch that is declared as an installation target. The machines then register themselves into the network and inventory tools. This saves manual entry of MAC addresses. After this, the burn-in tests are started to stress the machines to the full. This finds near-failing parts as well as major issues such as firmware problems. The end results are automatically analysed and failures reported to the vendor. All serial numbers and firmware versions are tracked so that failure analysis is easier and statistics on exchanges can be maintained over the life of the machines.
  • #12 When it comes to configuring the boxes after they have been tested, we have a large variety of configurations. There are hundreds of different configurations in the computer centre, with only a few, such as the batch farm, being large. Many of the smaller ones have fewer than 10 boxes, where the configuration effort is high for newly trained service managers. Over 700 boxes in the computer centre are not under configuration management control either.
  • #13 At the same time, the tools we use are not commonly used elsewhere. New arrivals do not learn them at university and require individual training (as no formal training courses are available). By adopting a tool commonly used elsewhere, our staff can learn from books and courses and if they leave after a few years at CERN, they have directly re-usable skills for their CVs.
  • #14 So, we chose a set of software and integrated it with some standard CERN services which are stable and scalable. Puppet is key for us, as there are many off-the-shelf components, from NTP through to web services, which can be re-used. We chose Git although the default source code version control system at CERN is SVN. This is because the toolsets we are investigating come with out-of-the-box support for Git. Thus, we should do what the others do.
  • #15 Currently, managing 160 boxes on the pilot platform.
  • #16 Managing 15,000 servers is not simple: there are pets, which need looking after, and cattle, which are disposable. The aim is to create an infrastructure that supports both pets and cattle, but to manage the hardware like cattle so that hardware management staff can work without constantly needing to schedule interventions.
  • #17 Our objective for the coming years is to have 90% of the data centres’ hardware resources virtualized. In this scenario we will have around 300,000 virtual machines. The CERN IT department currently has two small projects covering this (LxCloud and CVI). The main objective is to build a common solution combining these two projects in terms of scalability and efficiency. We need to support multiple centres, scale to 300K VMs, and help reduce the running effort via live migration and Platform as a Service.
  • #18 After some research into IaaS solutions, we chose OpenStack. It provides a set of tools that let us build our private cloud. The solution is multi-site, scalable and uses standard cloud interfaces. Clients can build their applications against these cloud interfaces, which positions our cloud solution as just another cloud provider.
  • #19 OpenStack has different components such as Nova, Glance, Keystone, Horizon and Swift. This is the architecture of the different services and how they interact with each other, including the internal components of Glance and Nova. If you look at Nova, there are two key components: the queue and the database. Nova is built on a shared-nothing, messaging-based architecture. All the major Nova components can be run on multiple servers, which means that most component-to-component communication must go via the message queue. Our implementation at CERN uses these main components.
  • #23 So, final tool summary …
  • #24 Early adopters such as dev boxes, non-quattorised boxes, lxplus