SlideShare a Scribd company logo
1 of 31
Configuration Management
Evolution at CERN
Gavin McCance
gavin.mccance@cern.ch
@gmccance
Agile Infrastructure
•
•
•
•
•

Why we changed the stack
Current status
Technology challenges
People challenges
Puppe
t
Community

T Agile
he
Infras ture
truc
Making ITope
rationsbe r s e201
tte inc
3

Ope tac
ns k

Fore an
m

m c
colle tive

Koji

git

14/10/2013

J nkins
e

Ac MQ
tive

CHEP 2013

3
Why?
•

Homebrew stack of tools
•
•
•

•

“We’re not special”
•

•

Twice the number machines, no new staff
New remote data-centre
Adopting more dynamic Cloud model

Existence of open source tool chain: OpenStack,
Puppet, Foreman, Kibana

Staff turnover
•

Use standard tools – we can hire for it and people
can be hired for it when they leave
14/10/2013

CHEP 2013

4
Agile Infrastructure “stack”
•

Our current stack has been stable for one year now
•

•

Virtual server provisioning
–

•

Cloud “operating system”: OpenStack -> (Belmiro, next)

Configuration management
–
–

•

See plenary talk at last CHEP (Tim Bell et al)

Puppet + ecosystem as configuration management system
Foreman as machine inventory tool and dashboard

Monitoring improvements
•

Flume + Elasticsearch + Kibana -> (Pedro, next++)

14/10/2013

CHEP 2013

5
Puppet
•
•

Puppet manages nodes’ configuration via
“manifests” written in Puppet DSL
All nodes check in frequently (~1-2 hours)
and ask for configuration
•

•

Configuration applied frequently to
minimise drift

Using the central puppet master model
•
•

..rather than masterless model
No shipping of code, central caching and ACLs

14/10/2013

CHEP 2013

6
Separation of data and code
•

Puppet “Hiera” splits configuration “data”
from “code”
•
•

Treat Puppet manifests really as code
More reusable manifests
• Heira is quite new: old manifests are catching up

•

Hiera can use multiple sources for lookup
•
•

Currently we store the data in git
Investigating DB for “canned” operations
14/10/2013

CHEP 2013

7
Modules and Git
•

Manifests (code) and hiera (data) are version
controlled

•

Puppet can use git’s easy branching to support
parallel environments
•

Later…
14/10/2013

CHEP 2013

8
Foreman
•

Lifecycle management tool for VMs and
physical servers

•

External Node Classifier – tells the puppet
master what a node should look like
Receives reports from Puppet runs and
provides dashboard

•

14/10/2013

CHEP 2013

9
14/10/2013

CHEP 2013

10
14/10/2013

CHEP 2013

11
Deployment at CERN
•
•
•
•

Puppet 3.2
Foreman 1.2
Been in real production for 6 months
Over 4000 hosts currently managed by Puppet
•
•
•
•
•
•

SLC5, SLC6, Windows
~100 distinct hostgroups in CERN IT + Expts
New EMI Grid service instances puppetised
Batch/Lxplus service moving as fast as we can drain it
Data services migrating with new capacity
AI services (Openstack, Puppet, etc) 2013
CHEP
12
14/10/2013
Key technical challenges
•

Service stability and scaling

•

Service monitoring

•

Foreman improvements

•

Site integration

14/10/2013

CHEP 2013

13
Scalability experiences
•
•

Most stability issues we had were down to scaling
issues
Puppet masters are easy to load-balance
•
•

•

•

We use standard apache mod_proxy_balancer
We currently have 16 masters
Fairly high IO and CPU requirements

Split up services
•

Puppet – critical vs. non critical

12 backend nodes
“Bulk”
14/10/2013

4 backend nodes
“Interactive”
CHEP 2013

14
Scalability guidelines

14/10/2013

CHEP 2013

15
Scalability experiences
•
•

Foreman is easy to load-balance
Also split into different services
•

That way Puppet and Foreman UI don’t get
affected by e.g. massive installation bursts
Load balancer

ENC

UI/API

14/10/2013

Reports
processing

CHEP 2013

16
PuppetDB
•

All puppet data sent to PuppetDB
•

Querying at compile time for Puppet manifests
• e.g. configure load-balancer for all workers

•

Scaling is still a challenge
•
•

Single instance – manual failover for now
Postgres scaling
• Heavily IO bound (we moved to SSDs)
• Get the book

14/10/2013

CHEP 2013

17
Monitor response times
•

Monitor response, errors
and identify bottlenecks

•

Currently using Splunk – will likely migrate to
Elasticsearch and Kibana
14/10/2013

CHEP 2013

18
Upstream improvements
•

CERN strategy is to run the main-line
upstream code
•
•

Any developments we do gets pushed upstream
e.g Foreman power operations, CVE reported

IPMI

Physical
box

IPMI

Physical
box

IPMI

Physical
box

Foreman
Proxy

Openstack
Nova API
VM
14/10/2013

VM

VM
CHEP 2013

19
Site integration
•

Using Opensource doesn’t get completely get you
away from coding your own stuff

•

We’ve found every time Puppet touches our
existing site infrastructure a new “service” or
“plugin” is born
•
•

Implementing our CA audit policy
Integrating with our existing PXE setup and burnin/hardware allocation process - possible convergence on
tools in the future – Razor?

•

Implementing Lemon monitoring “masking” use-cases –
nothing upstream, yet..

14/10/2013

CHEP 2013

20
People challenges
•

Debugging tools and docu needed!
•

•

Can we have X’, Y’ and Z’ ?
•
•
•

•

PuppetDB helpful here

Just because the old thing did it like that, doesn’t mean it
was the only way to do it
Real requirements are interesting to others too
Re-understanding requirements and documentation and
training

Great tools – how do 150 people use them without
stepping on each other?
•

Workflow and process
14/10/2013

CHEP 2013

21
Your test box

Your special
feature
My special feature

QA
“QA”
machines

Production

Most machines

Use git branches to define isolated puppet environments
14/10/2013

CHEP 2013

22
Easy git cherry pick
14/10/2013

CHEP 2013

23
Git workflow
Git model and flexible environments
•

For simplicity we made it more complex
•

Each Puppet module / hostgroup now has its
own git repo (~200 in all)
• Simple git-merge process within module
• Delegated ACLs to enhance security

•

Standard “QA” and “production” branches
that machines can subscribe to
•

Flexible tool (Jens, to be open-sourced by
CERN) for defining “feature” developments
• Everything from “production” except for the change

I’m testing on my module
14/10/2013

CHEP 2013

25
Strong QA process
•

Mandatory QA process for “shared” modules
•
•
•

•

Recommended for non-shared modules
Everyone is expected to have some nodes from their
service in the QA environment
Normally changes are QA’d for at least 1 week. Hit the
button if it breaks your box!

Still iterating on the process
•
•

Not bound by technology
Is one week enough? Can people “freeze”?
14/10/2013

CHEP 2013

26
Community collaboration
•

Traditionally one of HEPs strong points

•

There’s a large existing Puppet community with a
good model - we can join it and open-source our
modules

•

New HEPiX working group being formed now
•
•
•
•
•

Engage with existing Puppet community
Advice on best practices
Common modules for HEP/Grid-specific software
https://twiki.cern.ch/twiki/bin/view/HEPIX/ConfigManagem
ent
https://lists.desy.de/sympa/info/hepix-config-wg
14/10/2013

CHEP 2013

27
http://github.com/cernops
for the modules we share
Pull requests welcome!
14/10/2013

CHEP 2013

28
Summary
•

The Puppet / Foreman / Git / Openstack model is
working well for us
•

•

•

Key technical challenges are scaling and integration
which are under control
Main challenge now is people and process
•

•

4000 hosts in production, migration ongoing

How to maximise the utility of the tools

The HEP and Puppet communities are both strong and
we can benefit if we join them together
https://twiki.cern.ch/twiki/bin/view/HEPIX/ConfigManagement
http://github.com/cernops
14/10/2013

CHEP 2013

29
Backup slides

14/10/2013

CHEP 2013

30
mcollective, yum

Bamboo

Puppet
AIMS/PXE
Foreman

JIRA

OpenStack
Nova

Koji, Mock
Yum repo
Pulp

Active Directory /
LDAP

git

Lemon /
Hadoop

Hardware
database
Puppet-DB

14/10/2013

CHEP 2013

31

More Related Content

What's hot

OpenStack Deployments with Chef
OpenStack Deployments with ChefOpenStack Deployments with Chef
OpenStack Deployments with ChefMatt Ray
 
OpenStack Deployment in the Enterprise
OpenStack Deployment in the Enterprise OpenStack Deployment in the Enterprise
OpenStack Deployment in the Enterprise Cisco Canada
 
Orchestration Tool Roundup - Arthur Berezin & Trammell Scruggs
Orchestration Tool Roundup - Arthur Berezin & Trammell ScruggsOrchestration Tool Roundup - Arthur Berezin & Trammell Scruggs
Orchestration Tool Roundup - Arthur Berezin & Trammell ScruggsCloud Native Day Tel Aviv
 
Tips Tricks and Tactics with Cells and Scaling OpenStack - May, 2015
Tips Tricks and Tactics with Cells and Scaling OpenStack - May, 2015Tips Tricks and Tactics with Cells and Scaling OpenStack - May, 2015
Tips Tricks and Tactics with Cells and Scaling OpenStack - May, 2015Belmiro Moreira
 
Hostvn ceph in production v1.1 dungtq
Hostvn   ceph in production v1.1 dungtqHostvn   ceph in production v1.1 dungtq
Hostvn ceph in production v1.1 dungtqViet Stack
 
What's New in Grizzly & Deploying OpenStack with Puppet
What's New in Grizzly & Deploying OpenStack with PuppetWhat's New in Grizzly & Deploying OpenStack with Puppet
What's New in Grizzly & Deploying OpenStack with PuppetMark Voelker
 
Open stack in action enovance-quantum in action
Open stack in action enovance-quantum in actionOpen stack in action enovance-quantum in action
Open stack in action enovance-quantum in actioneNovance
 
Cloud Architect Alliance #15: Openstack
Cloud Architect Alliance #15: OpenstackCloud Architect Alliance #15: Openstack
Cloud Architect Alliance #15: OpenstackMicrosoft
 
Openstack In Real Life
Openstack In Real LifeOpenstack In Real Life
Openstack In Real LifePaul Guth
 
Chef for OpenStack: Grizzly Roadmap
Chef for OpenStack: Grizzly RoadmapChef for OpenStack: Grizzly Roadmap
Chef for OpenStack: Grizzly RoadmapMatt Ray
 
Managing Complexity at Velocity
Managing Complexity at VelocityManaging Complexity at Velocity
Managing Complexity at VelocityMatt Ray
 
Openstack architecture for the enterprise (Openstack Ireland Meet-up)
Openstack architecture for the enterprise (Openstack Ireland Meet-up)Openstack architecture for the enterprise (Openstack Ireland Meet-up)
Openstack architecture for the enterprise (Openstack Ireland Meet-up)Keith Tobin
 
OpenStack High Availability
OpenStack High AvailabilityOpenStack High Availability
OpenStack High AvailabilityJakub Pavlik
 
OpenStack in action 4! Alessandro Pilotti - OpenStack, Hyper-V and Windows
OpenStack in action 4! Alessandro Pilotti - OpenStack, Hyper-V and WindowsOpenStack in action 4! Alessandro Pilotti - OpenStack, Hyper-V and Windows
OpenStack in action 4! Alessandro Pilotti - OpenStack, Hyper-V and WindowseNovance
 
RedHat OpenStack Platform Overview
RedHat OpenStack Platform OverviewRedHat OpenStack Platform Overview
RedHat OpenStack Platform Overviewindevlab
 
OpenStack Best Practices and Considerations - terasky tech day
OpenStack Best Practices and Considerations  - terasky tech dayOpenStack Best Practices and Considerations  - terasky tech day
OpenStack Best Practices and Considerations - terasky tech dayArthur Berezin
 
Red Hat Enteprise Linux Open Stack Platfrom Director
Red Hat Enteprise Linux Open Stack Platfrom DirectorRed Hat Enteprise Linux Open Stack Platfrom Director
Red Hat Enteprise Linux Open Stack Platfrom DirectorOrgad Kimchi
 
Kolla talk at OpenStack Summit 2017 in Sydney
Kolla talk at OpenStack Summit 2017 in SydneyKolla talk at OpenStack Summit 2017 in Sydney
Kolla talk at OpenStack Summit 2017 in SydneyVikram G Hosakote
 

What's hot (20)

OpenStack Deployments with Chef
OpenStack Deployments with ChefOpenStack Deployments with Chef
OpenStack Deployments with Chef
 
OpenStack Deployment in the Enterprise
OpenStack Deployment in the Enterprise OpenStack Deployment in the Enterprise
OpenStack Deployment in the Enterprise
 
Orchestration Tool Roundup - Arthur Berezin & Trammell Scruggs
Orchestration Tool Roundup - Arthur Berezin & Trammell ScruggsOrchestration Tool Roundup - Arthur Berezin & Trammell Scruggs
Orchestration Tool Roundup - Arthur Berezin & Trammell Scruggs
 
Tips Tricks and Tactics with Cells and Scaling OpenStack - May, 2015
Tips Tricks and Tactics with Cells and Scaling OpenStack - May, 2015Tips Tricks and Tactics with Cells and Scaling OpenStack - May, 2015
Tips Tricks and Tactics with Cells and Scaling OpenStack - May, 2015
 
Hostvn ceph in production v1.1 dungtq
Hostvn   ceph in production v1.1 dungtqHostvn   ceph in production v1.1 dungtq
Hostvn ceph in production v1.1 dungtq
 
TripleO
 TripleO TripleO
TripleO
 
What's New in Grizzly & Deploying OpenStack with Puppet
What's New in Grizzly & Deploying OpenStack with PuppetWhat's New in Grizzly & Deploying OpenStack with Puppet
What's New in Grizzly & Deploying OpenStack with Puppet
 
Open stack in action enovance-quantum in action
Open stack in action enovance-quantum in actionOpen stack in action enovance-quantum in action
Open stack in action enovance-quantum in action
 
Cloud Architect Alliance #15: Openstack
Cloud Architect Alliance #15: OpenstackCloud Architect Alliance #15: Openstack
Cloud Architect Alliance #15: Openstack
 
Openstack In Real Life
Openstack In Real LifeOpenstack In Real Life
Openstack In Real Life
 
OpenStack and Windows
OpenStack and WindowsOpenStack and Windows
OpenStack and Windows
 
Chef for OpenStack: Grizzly Roadmap
Chef for OpenStack: Grizzly RoadmapChef for OpenStack: Grizzly Roadmap
Chef for OpenStack: Grizzly Roadmap
 
Managing Complexity at Velocity
Managing Complexity at VelocityManaging Complexity at Velocity
Managing Complexity at Velocity
 
Openstack architecture for the enterprise (Openstack Ireland Meet-up)
Openstack architecture for the enterprise (Openstack Ireland Meet-up)Openstack architecture for the enterprise (Openstack Ireland Meet-up)
Openstack architecture for the enterprise (Openstack Ireland Meet-up)
 
OpenStack High Availability
OpenStack High AvailabilityOpenStack High Availability
OpenStack High Availability
 
OpenStack in action 4! Alessandro Pilotti - OpenStack, Hyper-V and Windows
OpenStack in action 4! Alessandro Pilotti - OpenStack, Hyper-V and WindowsOpenStack in action 4! Alessandro Pilotti - OpenStack, Hyper-V and Windows
OpenStack in action 4! Alessandro Pilotti - OpenStack, Hyper-V and Windows
 
RedHat OpenStack Platform Overview
RedHat OpenStack Platform OverviewRedHat OpenStack Platform Overview
RedHat OpenStack Platform Overview
 
OpenStack Best Practices and Considerations - terasky tech day
OpenStack Best Practices and Considerations  - terasky tech dayOpenStack Best Practices and Considerations  - terasky tech day
OpenStack Best Practices and Considerations - terasky tech day
 
Red Hat Enteprise Linux Open Stack Platfrom Director
Red Hat Enteprise Linux Open Stack Platfrom DirectorRed Hat Enteprise Linux Open Stack Platfrom Director
Red Hat Enteprise Linux Open Stack Platfrom Director
 
Kolla talk at OpenStack Summit 2017 in Sydney
Kolla talk at OpenStack Summit 2017 in SydneyKolla talk at OpenStack Summit 2017 in Sydney
Kolla talk at OpenStack Summit 2017 in Sydney
 

Similar to Configuration Management Evolution at CERN

OpenNebulaConf2015 1.07 Cloud for Scientific Computing @ STFC - Alexander Dibbo
OpenNebulaConf2015 1.07 Cloud for Scientific Computing @ STFC - Alexander DibboOpenNebulaConf2015 1.07 Cloud for Scientific Computing @ STFC - Alexander Dibbo
OpenNebulaConf2015 1.07 Cloud for Scientific Computing @ STFC - Alexander DibboOpenNebula Project
 
Puppet Keynote by Ralph Luchs
Puppet Keynote by Ralph LuchsPuppet Keynote by Ralph Luchs
Puppet Keynote by Ralph LuchsNETWAYS
 
Sanger, upcoming Openstack for Bio-informaticians
Sanger, upcoming Openstack for Bio-informaticiansSanger, upcoming Openstack for Bio-informaticians
Sanger, upcoming Openstack for Bio-informaticiansPeter Clapham
 
State of Puppet 2013 - Puppet Camp DC
State of Puppet 2013 - Puppet Camp DCState of Puppet 2013 - Puppet Camp DC
State of Puppet 2013 - Puppet Camp DCPuppet
 
OpenStack Enabling DevOps
OpenStack Enabling DevOpsOpenStack Enabling DevOps
OpenStack Enabling DevOpsCisco DevNet
 
OpenStack at EBSCO
OpenStack at EBSCOOpenStack at EBSCO
OpenStack at EBSCOTesora
 
Puppet overview
Puppet overviewPuppet overview
Puppet overviewjoshbeard
 
Swimming upstream: OPNFV Doctor project case study
Swimming upstream: OPNFV Doctor project case studySwimming upstream: OPNFV Doctor project case study
Swimming upstream: OPNFV Doctor project case studyOPNFV
 
TechWiseTV Workshop: Open NX-OS and Devops with Puppet Labs
TechWiseTV Workshop: Open NX-OS and Devops with Puppet LabsTechWiseTV Workshop: Open NX-OS and Devops with Puppet Labs
TechWiseTV Workshop: Open NX-OS and Devops with Puppet LabsRobb Boyd
 
Considerations for Operating an OpenStack Cloud
Considerations for Operating an OpenStack CloudConsiderations for Operating an OpenStack Cloud
Considerations for Operating an OpenStack CloudAll Things Open
 
Itsummit2015 blizzard
Itsummit2015 blizzardItsummit2015 blizzard
Itsummit2015 blizzardkevin_donovan
 
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & AlluxioAlluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & AlluxioAlluxio, Inc.
 
Devoxx PL 2018 - Microservices in action at the Dutch National Police
Devoxx PL 2018 - Microservices in action at the Dutch National PoliceDevoxx PL 2018 - Microservices in action at the Dutch National Police
Devoxx PL 2018 - Microservices in action at the Dutch National PoliceBert Jan Schrijver
 
Automate Hadoop Cluster Deployment in a Banking Ecosystem
Automate Hadoop Cluster Deployment in a Banking EcosystemAutomate Hadoop Cluster Deployment in a Banking Ecosystem
Automate Hadoop Cluster Deployment in a Banking EcosystemHellmar Becker
 
iSense Java Summit 2017 - Microservices in action at the Dutch National Police
iSense Java Summit 2017 - Microservices in action at the Dutch National PoliceiSense Java Summit 2017 - Microservices in action at the Dutch National Police
iSense Java Summit 2017 - Microservices in action at the Dutch National PoliceBert Jan Schrijver
 
InteropWG Intro & Vertical Programs (May. 2017)
InteropWG Intro & Vertical Programs (May. 2017)InteropWG Intro & Vertical Programs (May. 2017)
InteropWG Intro & Vertical Programs (May. 2017)Mark Voelker
 
HiPEAC 2019 Tutorial - Maestro RTOS
HiPEAC 2019 Tutorial - Maestro RTOSHiPEAC 2019 Tutorial - Maestro RTOS
HiPEAC 2019 Tutorial - Maestro RTOSTulipp. Eu
 
Get There meetup March 2018 - Microservices in action at the Dutch National P...
Get There meetup March 2018 - Microservices in action at the Dutch National P...Get There meetup March 2018 - Microservices in action at the Dutch National P...
Get There meetup March 2018 - Microservices in action at the Dutch National P...Bert Jan Schrijver
 

Similar to Configuration Management Evolution at CERN (20)

OpenNebulaConf2015 1.07 Cloud for Scientific Computing @ STFC - Alexander Dibbo
OpenNebulaConf2015 1.07 Cloud for Scientific Computing @ STFC - Alexander DibboOpenNebulaConf2015 1.07 Cloud for Scientific Computing @ STFC - Alexander Dibbo
OpenNebulaConf2015 1.07 Cloud for Scientific Computing @ STFC - Alexander Dibbo
 
Puppet Keynote by Ralph Luchs
Puppet Keynote by Ralph LuchsPuppet Keynote by Ralph Luchs
Puppet Keynote by Ralph Luchs
 
Sanger, upcoming Openstack for Bio-informaticians
Sanger, upcoming Openstack for Bio-informaticiansSanger, upcoming Openstack for Bio-informaticians
Sanger, upcoming Openstack for Bio-informaticians
 
Flexible compute
Flexible computeFlexible compute
Flexible compute
 
State of Puppet 2013 - Puppet Camp DC
State of Puppet 2013 - Puppet Camp DCState of Puppet 2013 - Puppet Camp DC
State of Puppet 2013 - Puppet Camp DC
 
OpenStack Enabling DevOps
OpenStack Enabling DevOpsOpenStack Enabling DevOps
OpenStack Enabling DevOps
 
OpenStack at EBSCO
OpenStack at EBSCOOpenStack at EBSCO
OpenStack at EBSCO
 
Puppet overview
Puppet overviewPuppet overview
Puppet overview
 
Swimming upstream: OPNFV Doctor project case study
Swimming upstream: OPNFV Doctor project case studySwimming upstream: OPNFV Doctor project case study
Swimming upstream: OPNFV Doctor project case study
 
TechWiseTV Workshop: Open NX-OS and Devops with Puppet Labs
TechWiseTV Workshop: Open NX-OS and Devops with Puppet LabsTechWiseTV Workshop: Open NX-OS and Devops with Puppet Labs
TechWiseTV Workshop: Open NX-OS and Devops with Puppet Labs
 
Considerations for Operating an OpenStack Cloud
Considerations for Operating an OpenStack CloudConsiderations for Operating an OpenStack Cloud
Considerations for Operating an OpenStack Cloud
 
Itsummit2015 blizzard
Itsummit2015 blizzardItsummit2015 blizzard
Itsummit2015 blizzard
 
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & AlluxioAlluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio
 
Devoxx PL 2018 - Microservices in action at the Dutch National Police
Devoxx PL 2018 - Microservices in action at the Dutch National PoliceDevoxx PL 2018 - Microservices in action at the Dutch National Police
Devoxx PL 2018 - Microservices in action at the Dutch National Police
 
Automate Hadoop Cluster Deployment in a Banking Ecosystem
Automate Hadoop Cluster Deployment in a Banking EcosystemAutomate Hadoop Cluster Deployment in a Banking Ecosystem
Automate Hadoop Cluster Deployment in a Banking Ecosystem
 
iSense Java Summit 2017 - Microservices in action at the Dutch National Police
iSense Java Summit 2017 - Microservices in action at the Dutch National PoliceiSense Java Summit 2017 - Microservices in action at the Dutch National Police
iSense Java Summit 2017 - Microservices in action at the Dutch National Police
 
InteropWG Intro & Vertical Programs (May. 2017)
InteropWG Intro & Vertical Programs (May. 2017)InteropWG Intro & Vertical Programs (May. 2017)
InteropWG Intro & Vertical Programs (May. 2017)
 
OpenVINO introduction
OpenVINO introductionOpenVINO introduction
OpenVINO introduction
 
HiPEAC 2019 Tutorial - Maestro RTOS
HiPEAC 2019 Tutorial - Maestro RTOSHiPEAC 2019 Tutorial - Maestro RTOS
HiPEAC 2019 Tutorial - Maestro RTOS
 
Get There meetup March 2018 - Microservices in action at the Dutch National P...
Get There meetup March 2018 - Microservices in action at the Dutch National P...Get There meetup March 2018 - Microservices in action at the Dutch National P...
Get There meetup March 2018 - Microservices in action at the Dutch National P...
 

Recently uploaded

Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilV3cube
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 

Recently uploaded (20)

Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 

Configuration Management Evolution at CERN

  • 1.
  • 2. Configuration Management Evolution at CERN Gavin McCance gavin.mccance@cern.ch @gmccance
  • 3. Agile Infrastructure • • • • • Why we changed the stack Current status Technology challenges People challenges Puppe t Community T Agile he Infras ture truc Making ITope rationsbe r s e201 tte inc 3 Ope tac ns k Fore an m m c colle tive Koji git 14/10/2013 J nkins e Ac MQ tive CHEP 2013 3
  • 4. Why? • Homebrew stack of tools • • • • “We’re not special” • • Twice the number machines, no new staff New remote data-centre Adopting more dynamic Cloud model Existence of open source tool chain: OpenStack, Puppet, Foreman, Kibana Staff turnover • Use standard tools – we can hire for it and people can be hired for it when they leave 14/10/2013 CHEP 2013 4
  • 5. Agile Infrastructure “stack” • Our current stack has been stable for one year now • • Virtual server provisioning – • Cloud “operating system”: OpenStack -> (Belmiro, next) Configuration management – – • See plenary talk at last CHEP (Tim Bell et al) Puppet + ecosystem as configuration management system Foreman as machine inventory tool and dashboard Monitoring improvements • Flume + Elasticsearch + Kibana -> (Pedro, next++) 14/10/2013 CHEP 2013 5
  • 6. Puppet • • Puppet manages nodes’ configuration via “manifests” written in Puppet DSL All nodes check in frequently (~1-2 hours) and ask for configuration • • Configuration applied frequently to minimise drift Using the central puppet master model • • ..rather than masterless model No shipping of code, central caching and ACLs 14/10/2013 CHEP 2013 6
  • 7. Separation of data and code • Puppet “Hiera” splits configuration “data” from “code” • • Treat Puppet manifests really as code More reusable manifests • Heira is quite new: old manifests are catching up • Hiera can use multiple sources for lookup • • Currently we store the data in git Investigating DB for “canned” operations 14/10/2013 CHEP 2013 7
  • 8. Modules and Git • Manifests (code) and hiera (data) are version controlled • Puppet can use git’s easy branching to support parallel environments • Later… 14/10/2013 CHEP 2013 8
  • 9. Foreman • Lifecycle management tool for VMs and physical servers • External Node Classifier – tells the puppet master what a node should look like Receives reports from Puppet runs and provides dashboard • 14/10/2013 CHEP 2013 9
  • 12. Deployment at CERN • • • • Puppet 3.2 Foreman 1.2 Been in real production for 6 months Over 4000 hosts currently managed by Puppet • • • • • • SLC5, SLC6, Windows ~100 distinct hostgroups in CERN IT + Expts New EMI Grid service instances puppetised Batch/Lxplus service moving as fast as we can drain it Data services migrating with new capacity AI services (Openstack, Puppet, etc) 2013 CHEP 12 14/10/2013
  • 13. Key technical challenges • Service stability and scaling • Service monitoring • Foreman improvements • Site integration 14/10/2013 CHEP 2013 13
  • 14. Scalability experiences • • Most stability issues we had were down to scaling issues Puppet masters are easy to load-balance • • • • We use standard apache mod_proxy_balancer We currently have 16 masters Fairly high IO and CPU requirements Split up services • Puppet – critical vs. non critical 12 backend nodes “Bulk” 14/10/2013 4 backend nodes “Interactive” CHEP 2013 14
  • 16. Scalability experiences • • Foreman is easy to load-balance Also split into different services • That way Puppet and Foreman UI don’t get affected by e.g. massive installation bursts Load balancer ENC UI/API 14/10/2013 Reports processing CHEP 2013 16
  • 17. PuppetDB • All puppet data sent to PuppetDB • Querying at compile time for Puppet manifests • e.g. configure load-balancer for all workers • Scaling is still a challenge • • Single instance – manual failover for now Postgres scaling • Heavily IO bound (we moved to SSDs) • Get the book 14/10/2013 CHEP 2013 17
  • 18. Monitor response times • Monitor response, errors and identify bottlenecks • Currently using Splunk – will likely migrate to Elasticsearch and Kibana 14/10/2013 CHEP 2013 18
  • 19. Upstream improvements • CERN strategy is to run the main-line upstream code • • Any developments we do gets pushed upstream e.g Foreman power operations, CVE reported IPMI Physical box IPMI Physical box IPMI Physical box Foreman Proxy Openstack Nova API VM 14/10/2013 VM VM CHEP 2013 19
  • 20. Site integration • Using Opensource doesn’t get completely get you away from coding your own stuff • We’ve found every time Puppet touches our existing site infrastructure a new “service” or “plugin” is born • • Implementing our CA audit policy Integrating with our existing PXE setup and burnin/hardware allocation process - possible convergence on tools in the future – Razor? • Implementing Lemon monitoring “masking” use-cases – nothing upstream, yet.. 14/10/2013 CHEP 2013 20
  • 21. People challenges • Debugging tools and docu needed! • • Can we have X’, Y’ and Z’ ? • • • • PuppetDB helpful here Just because the old thing did it like that, doesn’t mean it was the only way to do it Real requirements are interesting to others too Re-understanding requirements and documentation and training Great tools – how do 150 people use them without stepping on each other? • Workflow and process 14/10/2013 CHEP 2013 21
  • 22. Your test box Your special feature My special feature QA “QA” machines Production Most machines Use git branches to define isolated puppet environments 14/10/2013 CHEP 2013 22
  • 23. Easy git cherry pick 14/10/2013 CHEP 2013 23
  • 25. Git model and flexible environments • For simplicity we made it more complex • Each Puppet module / hostgroup now has its own git repo (~200 in all) • Simple git-merge process within module • Delegated ACLs to enhance security • Standard “QA” and “production” branches that machines can subscribe to • Flexible tool (Jens, to be open-sourced by CERN) for defining “feature” developments • Everything from “production” except for the change I’m testing on my module 14/10/2013 CHEP 2013 25
  • 26. Strong QA process • Mandatory QA process for “shared” modules • • • • Recommended for non-shared modules Everyone is expected to have some nodes from their service in the QA environment Normally changes are QA’d for at least 1 week. Hit the button if it breaks your box! Still iterating on the process • • Not bound by technology Is one week enough? Can people “freeze”? 14/10/2013 CHEP 2013 26
  • 27. Community collaboration • Traditionally one of HEPs strong points • There’s a large existing Puppet community with a good model - we can join it and open-source our modules • New HEPiX working group being formed now • • • • • Engage with existing Puppet community Advice on best practices Common modules for HEP/Grid-specific software https://twiki.cern.ch/twiki/bin/view/HEPIX/ConfigManagem ent https://lists.desy.de/sympa/info/hepix-config-wg 14/10/2013 CHEP 2013 27
  • 28. http://github.com/cernops for the modules we share Pull requests welcome! 14/10/2013 CHEP 2013 28
  • 29. Summary • The Puppet / Foreman / Git / Openstack model is working well for us • • • Key technical challenges are scaling and integration which are under control Main challenge now is people and process • • 4000 hosts in production, migration ongoing How to maximise the utility of the tools The HEP and Puppet communities are both strong and we can benefit if we join them together https://twiki.cern.ch/twiki/bin/view/HEPIX/ConfigManagement http://github.com/cernops 14/10/2013 CHEP 2013 29
  • 31. mcollective, yum Bamboo Puppet AIMS/PXE Foreman JIRA OpenStack Nova Koji, Mock Yum repo Pulp Active Directory / LDAP git Lemon / Hadoop Hardware database Puppet-DB 14/10/2013 CHEP 2013 31