SlideShare a Scribd company logo
1 of 28
Download to read offline
From Hire to Retire!
Server Life-cycle Management with
Ironic at CERN
Arne Wiebalck & Surya Seetharaman
CERN Cloud Infrastructure Team
CERN and CERN IT in 1 Minute ...
3
➔ Understand the mysteries of the universe!
➢ Large Hadron Collider
➢ 100 m underground
➢ 27 km circumference
➢ 4 main detectors
➢ Cathedral-sized
➢ O(10) GB/s
➢ Initial reconstruction
➢ Permanent storage
➢ World-wide distribution
Arne Wiebalck & Surya Seetharaman
Server Life-cycle Management with Ironic at CERN
The 2½ CERN IT Data Centers
Meyrin (CH)
~12800 servers
Budapest (HU)
~2200 servers
LHCb Point-8 (FR)
~800 servers
4
Arne Wiebalck & Surya Seetharaman
Server Life-cycle Management with Ironic at CERN
OpenStack Deployment in CERN IT
In production since 2013!
● 8’500 hosts with ~300k cores
● ~35k instances in ~80 cells
● 3 main regions (+ test regions)
● Wide use case spectrum
● Control plane a use case as well
Ironic controllers are VMs on compute nodes which
are physical instances Nova created in Ironic ...
ESSEX FOLSOM GRIZZLY HAVANA ICEHOUSE JUNO KILO LIBERTY MITAKA NEWTON OCATA PIKE QUEENS ROCKY STEIN TRAIN
CERN Production Infrastructure
5
Arne Wiebalck & Surya Seetharaman
Server Life-cycle Management with Ironic at CERN
Ironic & CERN’s Ironic Deployment
api / httpd
conductor
inspector
➢ 3x Ironic controllers
➢ in a bare metal cell
(1 CN for 3’500 nodes!)
➢ currently on Stein++
6
Arne Wiebalck & Surya Seetharaman
Server Life-cycle Management with Ironic at CERN
CERN’s Ironic Deployment: Node Growth
➢ New deliveries
● Ironic-only
➢ Data center repatriation
● adoption imminent
➢ Scaling issues
● power sync
● resource tracker
● image conversion
Arne Wiebalck & Surya Seetharaman
Server Life-cycle Management with Ironic at CERN 7
Server Life-cycle Management with Ironic
registration
health
check
burn-in
benchmark
configure
provision
repair
adopt
retire
physical
installation
physical
removal
in production
work in progress
planned
preparation:
racks
power
network
8
Arne Wiebalck & Surya Seetharaman
Server Life-cycle Management with Ironic at CERN
The Early Life-cycle Steps ...
registration
health
check
benchmark
burn-in
in production
work in progress
planned
9
Arne Wiebalck & Surya Seetharaman
Server Life-cycle Management with Ironic at CERN
➔ Currently done by “manually” checking the inventory
➢ should move to Ironic’s introspection rules (S3)
➔ Currently done with a CERN-only auto-registration image
➢ could move to Ironic, unclear if we want to do this
➔ Will become a set of cleaning steps “burnin-
{cpu,disk,memory,network}”
➢ rely on standard tools like badblocks
➢ stops upon failure
➔ Will become a cleaning step “benchmark”
➢ launches a container which will know what to do
Configure: Clean-time Software RAID
10
Arne Wiebalck & Surya Seetharaman
Server Life-cycle Management with Ironic at CERN
➔ Vast majority of our 15’000 nodes rely on Software RAID
➢ Redundancy & space
➔ The lack of support in Ironic required re-installations
➢ Additional installation based on user-provided kickstart file
➢ Other deployments do have similar constructs for such setups
➔ With the upstream team, we added Software RAID support
➢ Available in Train
➢ *Initial* support
➢ In analogy to hardware RAID implemented as part of ‘manual’ cleaning
➢ In-band via the Ironic Python Agent
Set the
raid_interface
Set the
target_raid_config
Trigger manual
cleaning
11
Arne Wiebalck & Surya Seetharaman
Server Life-cycle Management with Ironic at CERN
Configure: Clean-time Software RAID
➔ How does Ironic configure the RAID?
Ironic
IPA
(1) triggers
manual cleaning
(3) boots
(5) configures
RAID
(4) passes clean steps
(& triggers bootloader
install during deploy)
(2) gets
target_raid_config
manageable
manageable
12
Arne Wiebalck & Surya Seetharaman
Server Life-cycle Management with Ironic at CERN
Configure: Clean-time Software RAID
/dev/sda1
/dev/sdb1
/dev/sda2
/dev/sdb2
/dev/md1/dev/md0
(RAID-1)
MBR
MBR
holder disk
/dev/sda
(RAID-N)
md
md
md
md
/dev/md0p{1,2} /dev/md1
config drivedeploy device “payload” device
holder disk
/dev/sdb
13
Arne Wiebalck & Surya Seetharaman
Server Life-cycle Management with Ironic at CERN
Configure: Clean-time Software RAID
➔ What about GPT/UEFI and disk selection?
➢ Initial implementation uses MBR, BIOS, and fixed partition for the root file system ...
➢ GPT works (needed mechanism to find root fs)
➢ UEFI will require additional work … ongoing!
➢ Disk selection not yet possible … ongoing!
➔ How to repair a broken RAID?
➢ “broken” == “broken beyond repair” (!= degraded)
➢ Do it the cloudy way: delete the instance!
➢ At CERN: {delete,create}_configuration steps part of our custom hardware manager
➢ What about ‘nova rebuild’?
Cleaning 400 nodes triggered the creation of Software RAIDs
14
Arne Wiebalck & Surya Seetharaman
Server Life-cycle Management with Ironic at CERN
Provision The Instance Life-cycle
Why Ironic?
➔ Single interface for virtual and physical resources
➔ Same request approval workflow and tools
➔ Satisfies requests where VMs are not apt
➔ Consolidates the accounting
available active
deploying
deletingcleaning
OpenStack
Hypervisors
Other
users
15
Arne Wiebalck & Surya Seetharaman
Server Life-cycle Management with Ironic at CERN
Region ‘Services’ Region ‘Batch’
‘bare metal’ cell
CN=’abc’
node=’abc’
physical instances
‘def’
resource
tracker
RP=’abc’
nova-
compute
Ironic driver
‘compute’ cell
CN=’xyz’
virtual instances
‘pqr’
resource
tracker
RP=’xyz’
nova-
compute
virt driver
inst=’pqr’inst=’def’
Provision Physical Instances as Hypervisors
16
Arne Wiebalck & Surya Seetharaman
Server Life-cycle Management with Ironic at CERN
Adopt “Take over the world!”
➔ Ironic provides “adoption” of nodes, but this does not include instance creation!
➔ Our procedure to adopt production nodes into Ironic:
➢ Enroll the node, including its resource_class (now in ‘manageable’)
➢ Set fake drivers for this node in Ironic
➢ Provide the node (now in ‘available’)
➢ Create the port in Ironic (usually created by inspection)
➢ Let Nova discover the node
➢ Add the node to the placement aggregate
➢ Wait for the resource tracker
➢ Create instance in Nova (with flavor matching the above resource_class)
➢ Set real drivers and interfaces in Ironic
17
Arne Wiebalck & Surya Seetharaman
Server Life-cycle Management with Ironic at CERN
DATA CENTER
Repair Even with Ironic nodes do break!
➔ The OpenStack team does not directly intervene on nodes
➢ Dedicated repair team (plus workflow framework based on Rundeck & Mistral)
➔ Incidents: scheduled vs. unplanned
➢ BMC firmware upgrade campaign vs. motherboard failure
➔ Introduction of Ironic required workflow adaptations, training, and ...
➢ New concepts like “physical instance”
➢ Reinstallations
➔ … upstream code changes in Ironic and Nova
➢ power synchronisation (the “root of all evil”)
18
Arne Wiebalck & Surya Seetharaman
Server Life-cycle Management with Ironic at CERN
physical
instance
DATA CENTER
power
outage
Ironic database
POWER_ONFF
Nova database
POWER_ONFF
DATA
CENTER
Repair The Nova / Ironic Power Sync
Problem Statement
19
Arne Wiebalck & Surya Seetharaman
Server Life-cycle Management with Ironic at CERN
physical
instance
Ironic database
POWER_OFFN
Nova database
POWER_OFF
DATA
CENTER
POWER_ON != POWER_OFF
????
Repair The Nova / Ironic Power Sync
Problem Statement
20
Arne Wiebalck & Surya Seetharaman
Server Life-cycle Management with Ironic at CERN
physical
instance
Ironic database
POWER_ONFF
Nova database
POWER_OFF
DATA
CENTER
POWER_ON != POWER_OFF
????Force power state
update to reflect what
nova feels is right
Repair The Nova / Ironic Power Sync
Problem Statement
21
Arne Wiebalck & Surya Seetharaman
Server Life-cycle Management with Ironic at CERN
Repair The Nova / Ironic Power Sync
Problem Statement
➔ Unforeseen events like power outage:
➢ Physical instance goes down
● Nova puts the instance into SHUTDOWN state
○ through the ``_sync_power_states`` periodic task
○ hypervisor regarded as the source of truth
➢ Physical instance comes back up without Nova knowing
● Nova again puts the instance back into SHUTDOWN state
○ through the ``_sync_power_states`` periodic task
○ database regarded as the source of truth
Nova should not force the instance to POWER_OFF when it comes back up
22
Arne Wiebalck & Surya Seetharaman
Server Life-cycle Management with Ironic at CERN
physical
instance
DATA CENTER
power
outage
POWER_ONFFDATA
CENTER
Repair The Nova / Ironic Power Sync
Implemented Solution
POWER_ONFF
target_power_state = POWER_OFF
Power update event
using
os-server-external-events
POWER_OFF == POWER_OFF
23
Arne Wiebalck & Surya Seetharaman
Server Life-cycle Management with Ironic at CERN
physical
instance
DATA CENTER
POWER_OFFNDATA
CENTER
Repair The Nova / Ironic Power Sync
Implemented Solution
POWER_OFFN
target_power_state = POWER_ON
POWER_ON == POWER_ON
There can be race conditions depending on the sequence of occurrence of events.
Power update event
using
os-server-external-events
24
Arne Wiebalck & Surya Seetharaman
Server Life-cycle Management with Ironic at CERN
Repair The Nova / Ironic Power Sync
Implemented Solution
Ironic sends power state change callbacks to Nova
➔ Operator has to set [nova].send_power_notifications config option to True
➔ JSON request body sent from Ironic
➔ Done via the os-server-external-events Nova api
➔ Read power_update spec and documentation for more details
{
"events": [
{
"name": "power-update",
"server_uuid": "3df201cf-2451-44f2-8d25-a4ca826fc1f3",
"tag": target_power_state
}
]
}
➔ JSON response body sent from Nova
➔ Nova updates its database to reflect the power state change
➢ Ironic regarded as source of truth
25
Arne Wiebalck & Surya Seetharaman
Server Life-cycle Management with Ironic at CERN
Repair The Nova / Ironic Power Sync
Implemented Solution
Ironic sends power state change callbacks to Nova
{
"events": [
{
"code": 200,
"name": "power-update",
"server_uuid": "3df201cf-2451-44f2-8d25-a4ca826fc1f3",
"status": "completed",
"tag": target_power_state
}
]
}
Available
from
26
Arne Wiebalck & Surya Seetharaman
Server Life-cycle Management with Ironic at CERN
Retire The End of the Cycle
➔ A simple procedure for now
➢ Deleting instances triggers cleaning
➢ Setting maintenance avoids instance creation
➢ The maintenance reason marks them for removal
➔ No explicit “node retirement” support in Ironic
➢ Time windows allow for re-use
➢ Explicit tagging would be helpful
➔ Proposals to introduce a retirement flag (or even state)
➢ Spec: “Add support for node retirement” https://review.opendev.org/#/c/656799/
A Pain Point: Resource Tracking
27
Arne Wiebalck & Surya Seetharaman
Server Life-cycle Management with Ironic at CERN
➔ The resource tracker loops sequentially over compute nodes
➢ Holds a semaphore and hence blocks instance creation
➢ OK for virtual machines, a scaling issue for bare metal deployments
➔ This creates a noticeable dead time for instance creation
➢ For 1 node with 3500 physical nodes the turn-around time is ~60 mins
➢ … this is already with upstream and local patches to reduce the overhead
➔ Stop-gap solution: adapt the resource tracker cycle
➢ Compromise between blockage and placement updates
➔ Potential solutions: n-compute sharding and per-instance locking
28
Arne Wiebalck & Surya Seetharaman
Server Life-cycle Management with Ironic at CERN
謝謝啦
Thank you!
Power consumption in container DC during allocation of a new delivery
(partial)
burn-in
nodes
available
physical
Instances
created
virtual
Instances
created
Summary
➔ At CERN, Ironic is used for the majority of a server’s life cycle steps
➔ We work on the remaining steps, more features and we plan to pass +10k servers!

More Related Content

What's hot

Managing an Akka Cluster on Kubernetes
Managing an Akka Cluster on KubernetesManaging an Akka Cluster on Kubernetes
Managing an Akka Cluster on KubernetesMarkus Jura
 
Mark Farnam : Minimizing the Concurrency Footprint of Transactions
Mark Farnam  : Minimizing the Concurrency Footprint of TransactionsMark Farnam  : Minimizing the Concurrency Footprint of Transactions
Mark Farnam : Minimizing the Concurrency Footprint of TransactionsKyle Hailey
 
To Build My Own Cloud with Blackjack…
To Build My Own Cloud with Blackjack…To Build My Own Cloud with Blackjack…
To Build My Own Cloud with Blackjack…Sergey Dzyuban
 
Etcd- Mission Critical Key-Value Store
Etcd- Mission Critical Key-Value StoreEtcd- Mission Critical Key-Value Store
Etcd- Mission Critical Key-Value StoreCoreOS
 
Fosdem 2014 - MySQL & Friends Devroom: 15 tips galera cluster
Fosdem 2014 - MySQL & Friends Devroom: 15 tips galera clusterFosdem 2014 - MySQL & Friends Devroom: 15 tips galera cluster
Fosdem 2014 - MySQL & Friends Devroom: 15 tips galera clusterFrederic Descamps
 
Plmce2k15 15 tips galera cluster
Plmce2k15   15 tips galera clusterPlmce2k15   15 tips galera cluster
Plmce2k15 15 tips galera clusterFrederic Descamps
 
tow nodes Oracle 12c RAC on virtualbox
tow nodes Oracle 12c RAC on virtualboxtow nodes Oracle 12c RAC on virtualbox
tow nodes Oracle 12c RAC on virtualboxjustinit
 
Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 1
Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 1Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 1
Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 1Tanel Poder
 
Ceph Day Shanghai - CeTune - Benchmarking and tuning your Ceph cluster
Ceph Day Shanghai - CeTune - Benchmarking and tuning your Ceph cluster Ceph Day Shanghai - CeTune - Benchmarking and tuning your Ceph cluster
Ceph Day Shanghai - CeTune - Benchmarking and tuning your Ceph cluster Ceph Community
 
Oracle12c data guard farsync and whats new
Oracle12c data guard farsync and whats newOracle12c data guard farsync and whats new
Oracle12c data guard farsync and whats newNassyam Basha
 
Upgrade 11.2.0.1 gi crs to 11.2.0.2 in linux
Upgrade 11.2.0.1 gi crs to 11.2.0.2 in linuxUpgrade 11.2.0.1 gi crs to 11.2.0.2 in linux
Upgrade 11.2.0.1 gi crs to 11.2.0.2 in linuxmaclean liu
 
Advanced Percona XtraDB Cluster in a nutshell... la suite
Advanced Percona XtraDB Cluster in a nutshell... la suiteAdvanced Percona XtraDB Cluster in a nutshell... la suite
Advanced Percona XtraDB Cluster in a nutshell... la suiteKenny Gryp
 
Why Your Apache Spark Job is Failing
Why Your Apache Spark Job is FailingWhy Your Apache Spark Job is Failing
Why Your Apache Spark Job is FailingCloudera, Inc.
 
Oracle 11g R2 RAC setup on rhel 5.0
Oracle 11g R2 RAC setup on rhel 5.0Oracle 11g R2 RAC setup on rhel 5.0
Oracle 11g R2 RAC setup on rhel 5.0Santosh Kangane
 
Oracle Open World 2014: Lies, Damned Lies, and I/O Statistics [ CON3671]
Oracle Open World 2014: Lies, Damned Lies, and I/O Statistics [ CON3671]Oracle Open World 2014: Lies, Damned Lies, and I/O Statistics [ CON3671]
Oracle Open World 2014: Lies, Damned Lies, and I/O Statistics [ CON3671]Kyle Hailey
 
Why is My Spark Job Failing? by Sandy Ryza of Cloudera
Why is My Spark Job Failing? by Sandy Ryza of ClouderaWhy is My Spark Job Failing? by Sandy Ryza of Cloudera
Why is My Spark Job Failing? by Sandy Ryza of ClouderaData Con LA
 
20 Real-World Use Cases to help pick a better MySQL Replication scheme (2012)
20 Real-World Use Cases to help pick a better MySQL Replication scheme (2012)20 Real-World Use Cases to help pick a better MySQL Replication scheme (2012)
20 Real-World Use Cases to help pick a better MySQL Replication scheme (2012)Darpan Dinker
 
MySQL Replication: Pros and Cons
MySQL Replication: Pros and ConsMySQL Replication: Pros and Cons
MySQL Replication: Pros and ConsRachel Li
 
Taming the Cloud Database with Apache jclouds
Taming the Cloud Database with Apache jcloudsTaming the Cloud Database with Apache jclouds
Taming the Cloud Database with Apache jcloudszshoylev
 

What's hot (20)

Managing an Akka Cluster on Kubernetes
Managing an Akka Cluster on KubernetesManaging an Akka Cluster on Kubernetes
Managing an Akka Cluster on Kubernetes
 
Mark Farnam : Minimizing the Concurrency Footprint of Transactions
Mark Farnam  : Minimizing the Concurrency Footprint of TransactionsMark Farnam  : Minimizing the Concurrency Footprint of Transactions
Mark Farnam : Minimizing the Concurrency Footprint of Transactions
 
To Build My Own Cloud with Blackjack…
To Build My Own Cloud with Blackjack…To Build My Own Cloud with Blackjack…
To Build My Own Cloud with Blackjack…
 
Etcd- Mission Critical Key-Value Store
Etcd- Mission Critical Key-Value StoreEtcd- Mission Critical Key-Value Store
Etcd- Mission Critical Key-Value Store
 
Fosdem 2014 - MySQL & Friends Devroom: 15 tips galera cluster
Fosdem 2014 - MySQL & Friends Devroom: 15 tips galera clusterFosdem 2014 - MySQL & Friends Devroom: 15 tips galera cluster
Fosdem 2014 - MySQL & Friends Devroom: 15 tips galera cluster
 
Plmce2k15 15 tips galera cluster
Plmce2k15   15 tips galera clusterPlmce2k15   15 tips galera cluster
Plmce2k15 15 tips galera cluster
 
tow nodes Oracle 12c RAC on virtualbox
tow nodes Oracle 12c RAC on virtualboxtow nodes Oracle 12c RAC on virtualbox
tow nodes Oracle 12c RAC on virtualbox
 
Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 1
Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 1Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 1
Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 1
 
Ceph Day Shanghai - CeTune - Benchmarking and tuning your Ceph cluster
Ceph Day Shanghai - CeTune - Benchmarking and tuning your Ceph cluster Ceph Day Shanghai - CeTune - Benchmarking and tuning your Ceph cluster
Ceph Day Shanghai - CeTune - Benchmarking and tuning your Ceph cluster
 
Oracle12c data guard farsync and whats new
Oracle12c data guard farsync and whats newOracle12c data guard farsync and whats new
Oracle12c data guard farsync and whats new
 
Upgrade 11.2.0.1 gi crs to 11.2.0.2 in linux
Upgrade 11.2.0.1 gi crs to 11.2.0.2 in linuxUpgrade 11.2.0.1 gi crs to 11.2.0.2 in linux
Upgrade 11.2.0.1 gi crs to 11.2.0.2 in linux
 
Advanced Percona XtraDB Cluster in a nutshell... la suite
Advanced Percona XtraDB Cluster in a nutshell... la suiteAdvanced Percona XtraDB Cluster in a nutshell... la suite
Advanced Percona XtraDB Cluster in a nutshell... la suite
 
Why Your Apache Spark Job is Failing
Why Your Apache Spark Job is FailingWhy Your Apache Spark Job is Failing
Why Your Apache Spark Job is Failing
 
Oracle 11g R2 RAC setup on rhel 5.0
Oracle 11g R2 RAC setup on rhel 5.0Oracle 11g R2 RAC setup on rhel 5.0
Oracle 11g R2 RAC setup on rhel 5.0
 
Oracle Open World 2014: Lies, Damned Lies, and I/O Statistics [ CON3671]
Oracle Open World 2014: Lies, Damned Lies, and I/O Statistics [ CON3671]Oracle Open World 2014: Lies, Damned Lies, and I/O Statistics [ CON3671]
Oracle Open World 2014: Lies, Damned Lies, and I/O Statistics [ CON3671]
 
Ansible Automation - Enterprise Use Cases | Juncheng Anthony Lin
Ansible Automation - Enterprise Use Cases | Juncheng Anthony LinAnsible Automation - Enterprise Use Cases | Juncheng Anthony Lin
Ansible Automation - Enterprise Use Cases | Juncheng Anthony Lin
 
Why is My Spark Job Failing? by Sandy Ryza of Cloudera
Why is My Spark Job Failing? by Sandy Ryza of ClouderaWhy is My Spark Job Failing? by Sandy Ryza of Cloudera
Why is My Spark Job Failing? by Sandy Ryza of Cloudera
 
20 Real-World Use Cases to help pick a better MySQL Replication scheme (2012)
20 Real-World Use Cases to help pick a better MySQL Replication scheme (2012)20 Real-World Use Cases to help pick a better MySQL Replication scheme (2012)
20 Real-World Use Cases to help pick a better MySQL Replication scheme (2012)
 
MySQL Replication: Pros and Cons
MySQL Replication: Pros and ConsMySQL Replication: Pros and Cons
MySQL Replication: Pros and Cons
 
Taming the Cloud Database with Apache jclouds
Taming the Cloud Database with Apache jcloudsTaming the Cloud Database with Apache jclouds
Taming the Cloud Database with Apache jclouds
 

Similar to From hire to retire! server life cycle management with ironic at cern - open infrasummit-shanghai_06nov2019

London Ceph Day: Ceph at CERN
London Ceph Day: Ceph at CERNLondon Ceph Day: Ceph at CERN
London Ceph Day: Ceph at CERNCeph Community
 
Integrating Bare-metal Provisioning into CERN's Private Cloud
Integrating Bare-metal Provisioning into CERN's Private CloudIntegrating Bare-metal Provisioning into CERN's Private Cloud
Integrating Bare-metal Provisioning into CERN's Private CloudArne Wiebalck
 
Ceph Deployment at Target: Customer Spotlight
Ceph Deployment at Target: Customer SpotlightCeph Deployment at Target: Customer Spotlight
Ceph Deployment at Target: Customer SpotlightRed_Hat_Storage
 
Ceph Deployment at Target: Customer Spotlight
Ceph Deployment at Target: Customer SpotlightCeph Deployment at Target: Customer Spotlight
Ceph Deployment at Target: Customer SpotlightColleen Corrice
 
Scaling Ceph at CERN - Ceph Day Frankfurt
Scaling Ceph at CERN - Ceph Day Frankfurt Scaling Ceph at CERN - Ceph Day Frankfurt
Scaling Ceph at CERN - Ceph Day Frankfurt Ceph Community
 
Operational War Stories from 5 Years of Running OpenStack in Production
Operational War Stories from 5 Years of Running OpenStack in ProductionOperational War Stories from 5 Years of Running OpenStack in Production
Operational War Stories from 5 Years of Running OpenStack in ProductionArne Wiebalck
 
Unveiling CERN Cloud Architecture - October, 2015
Unveiling CERN Cloud Architecture - October, 2015Unveiling CERN Cloud Architecture - October, 2015
Unveiling CERN Cloud Architecture - October, 2015Belmiro Moreira
 
Meetup 23 - 01 - The things I wish I would have known before doing OpenStack ...
Meetup 23 - 01 - The things I wish I would have known before doing OpenStack ...Meetup 23 - 01 - The things I wish I would have known before doing OpenStack ...
Meetup 23 - 01 - The things I wish I would have known before doing OpenStack ...Vietnam Open Infrastructure User Group
 
VMworld 2013: VMware Disaster Recovery Solution with Oracle Data Guard and Si...
VMworld 2013: VMware Disaster Recovery Solution with Oracle Data Guard and Si...VMworld 2013: VMware Disaster Recovery Solution with Oracle Data Guard and Si...
VMworld 2013: VMware Disaster Recovery Solution with Oracle Data Guard and Si...VMworld
 
Turning OpenStack Swift into a VM storage platform
Turning OpenStack Swift into a VM storage platformTurning OpenStack Swift into a VM storage platform
Turning OpenStack Swift into a VM storage platformwim_provoost
 
OpenStack Tutorial For Beginners | OpenStack Tutorial | OpenStack Training | ...
OpenStack Tutorial For Beginners | OpenStack Tutorial | OpenStack Training | ...OpenStack Tutorial For Beginners | OpenStack Tutorial | OpenStack Training | ...
OpenStack Tutorial For Beginners | OpenStack Tutorial | OpenStack Training | ...Edureka!
 
My Opera meets Varnish, Dec 2009
My Opera meets Varnish, Dec 2009My Opera meets Varnish, Dec 2009
My Opera meets Varnish, Dec 2009Cosimo Streppone
 
VMworld 2014: Virtualizing Databases
VMworld 2014: Virtualizing DatabasesVMworld 2014: Virtualizing Databases
VMworld 2014: Virtualizing DatabasesVMworld
 
Xen Virtualization 2008
Xen Virtualization 2008Xen Virtualization 2008
Xen Virtualization 2008mwlang88
 
Nevmug Lighthouse Automation7.1
Nevmug   Lighthouse   Automation7.1Nevmug   Lighthouse   Automation7.1
Nevmug Lighthouse Automation7.1csharney
 
How to build a winning solution for large scale VDI deployments
How to build a winning solution for large scale VDI deploymentsHow to build a winning solution for large scale VDI deployments
How to build a winning solution for large scale VDI deploymentsNetApp
 
Road show 2015 triangle meetup
Road show 2015 triangle meetupRoad show 2015 triangle meetup
Road show 2015 triangle meetupwim_provoost
 
Critical Attributes for a High-Performance, Low-Latency Database
Critical Attributes for a High-Performance, Low-Latency DatabaseCritical Attributes for a High-Performance, Low-Latency Database
Critical Attributes for a High-Performance, Low-Latency DatabaseScyllaDB
 
Ceph Day Chicago - Ceph Deployment at Target: Best Practices and Lessons Learned
Ceph Day Chicago - Ceph Deployment at Target: Best Practices and Lessons LearnedCeph Day Chicago - Ceph Deployment at Target: Best Practices and Lessons Learned
Ceph Day Chicago - Ceph Deployment at Target: Best Practices and Lessons LearnedCeph Community
 

Similar to From hire to retire! server life cycle management with ironic at cern - open infrasummit-shanghai_06nov2019 (20)

London Ceph Day: Ceph at CERN
London Ceph Day: Ceph at CERNLondon Ceph Day: Ceph at CERN
London Ceph Day: Ceph at CERN
 
Integrating Bare-metal Provisioning into CERN's Private Cloud
Integrating Bare-metal Provisioning into CERN's Private CloudIntegrating Bare-metal Provisioning into CERN's Private Cloud
Integrating Bare-metal Provisioning into CERN's Private Cloud
 
Ceph Deployment at Target: Customer Spotlight
Ceph Deployment at Target: Customer SpotlightCeph Deployment at Target: Customer Spotlight
Ceph Deployment at Target: Customer Spotlight
 
Ceph Deployment at Target: Customer Spotlight
Ceph Deployment at Target: Customer SpotlightCeph Deployment at Target: Customer Spotlight
Ceph Deployment at Target: Customer Spotlight
 
Scaling Ceph at CERN - Ceph Day Frankfurt
Scaling Ceph at CERN - Ceph Day Frankfurt Scaling Ceph at CERN - Ceph Day Frankfurt
Scaling Ceph at CERN - Ceph Day Frankfurt
 
Operational War Stories from 5 Years of Running OpenStack in Production
Operational War Stories from 5 Years of Running OpenStack in ProductionOperational War Stories from 5 Years of Running OpenStack in Production
Operational War Stories from 5 Years of Running OpenStack in Production
 
Unveiling CERN Cloud Architecture - October, 2015
Unveiling CERN Cloud Architecture - October, 2015Unveiling CERN Cloud Architecture - October, 2015
Unveiling CERN Cloud Architecture - October, 2015
 
Meetup 23 - 01 - The things I wish I would have known before doing OpenStack ...
Meetup 23 - 01 - The things I wish I would have known before doing OpenStack ...Meetup 23 - 01 - The things I wish I would have known before doing OpenStack ...
Meetup 23 - 01 - The things I wish I would have known before doing OpenStack ...
 
VMworld 2013: VMware Disaster Recovery Solution with Oracle Data Guard and Si...
VMworld 2013: VMware Disaster Recovery Solution with Oracle Data Guard and Si...VMworld 2013: VMware Disaster Recovery Solution with Oracle Data Guard and Si...
VMworld 2013: VMware Disaster Recovery Solution with Oracle Data Guard and Si...
 
Turning OpenStack Swift into a VM storage platform
Turning OpenStack Swift into a VM storage platformTurning OpenStack Swift into a VM storage platform
Turning OpenStack Swift into a VM storage platform
 
OpenStack Tutorial For Beginners | OpenStack Tutorial | OpenStack Training | ...
OpenStack Tutorial For Beginners | OpenStack Tutorial | OpenStack Training | ...OpenStack Tutorial For Beginners | OpenStack Tutorial | OpenStack Training | ...
OpenStack Tutorial For Beginners | OpenStack Tutorial | OpenStack Training | ...
 
My Opera meets Varnish, Dec 2009
My Opera meets Varnish, Dec 2009My Opera meets Varnish, Dec 2009
My Opera meets Varnish, Dec 2009
 
VMworld 2014: Virtualizing Databases
VMworld 2014: Virtualizing DatabasesVMworld 2014: Virtualizing Databases
VMworld 2014: Virtualizing Databases
 
Xen Virtualization 2008
Xen Virtualization 2008Xen Virtualization 2008
Xen Virtualization 2008
 
Nevmug Lighthouse Automation7.1
Nevmug   Lighthouse   Automation7.1Nevmug   Lighthouse   Automation7.1
Nevmug Lighthouse Automation7.1
 
How to build a winning solution for large scale VDI deployments
How to build a winning solution for large scale VDI deploymentsHow to build a winning solution for large scale VDI deployments
How to build a winning solution for large scale VDI deployments
 
Road show 2015 triangle meetup
Road show 2015 triangle meetupRoad show 2015 triangle meetup
Road show 2015 triangle meetup
 
Critical Attributes for a High-Performance, Low-Latency Database
Critical Attributes for a High-Performance, Low-Latency DatabaseCritical Attributes for a High-Performance, Low-Latency Database
Critical Attributes for a High-Performance, Low-Latency Database
 
Ceph Day Chicago - Ceph Deployment at Target: Best Practices and Lessons Learned
Ceph Day Chicago - Ceph Deployment at Target: Best Practices and Lessons LearnedCeph Day Chicago - Ceph Deployment at Target: Best Practices and Lessons Learned
Ceph Day Chicago - Ceph Deployment at Target: Best Practices and Lessons Learned
 
state-of-compute-vish-grizzly.ppt
state-of-compute-vish-grizzly.pptstate-of-compute-vish-grizzly.ppt
state-of-compute-vish-grizzly.ppt
 

Recently uploaded

Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024The Digital Insurer
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 

Recently uploaded (20)

Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 

From hire to retire! server life cycle management with ironic at cern - open infrasummit-shanghai_06nov2019

  • 1.
  • 2. From Hire to Retire! Server Life-cycle Management with Ironic at CERN Arne Wiebalck & Surya Seetharaman CERN Cloud Infrastructure Team
  • 3. CERN and CERN IT in 1 Minute ... 3 ➔ Understand the mysteries of the universe! ➢ Large Hadron Collider ➢ 100 m underground ➢ 27 km circumference ➢ 4 main detectors ➢ Cathedral-sized ➢ O(10) GB/s ➢ Initial reconstruction ➢ Permanent storage ➢ World-wide distribution Arne Wiebalck & Surya Seetharaman Server Life-cycle Management with Ironic at CERN
  • 4. The 2½ CERN IT Data Centers Meyrin (CH) ~12800 servers Budapest (HU) ~2200 servers LHCb Point-8 (FR) ~800 servers 4 Arne Wiebalck & Surya Seetharaman Server Life-cycle Management with Ironic at CERN
  • 5. OpenStack Deployment in CERN IT In production since 2013! ● 8’500 hosts with ~300k cores ● ~35k instances in ~80 cells ● 3 main regions (+ test regions) ● Wide use case spectrum ● Control plane a use case as well Ironic controllers are VMs on compute nodes which are physical instances Nova created in Ironic ... ESSEX FOLSOM GRIZZLY HAVANA ICEHOUSE JUNO KILO LIBERTY MITAKA NEWTON OCATA PIKE QUEENS ROCKY STEIN TRAIN CERN Production Infrastructure 5 Arne Wiebalck & Surya Seetharaman Server Life-cycle Management with Ironic at CERN
  • 6. Ironic & CERN’s Ironic Deployment api / httpd conductor inspector ➢ 3x Ironic controllers ➢ in a bare metal cell (1 CN for 3’500 nodes!) ➢ currently on Stein++ 6 Arne Wiebalck & Surya Seetharaman Server Life-cycle Management with Ironic at CERN
  • 7. CERN’s Ironic Deployment: Node Growth ➢ New deliveries ● Ironic-only ➢ Data center repatriation ● adoption imminent ➢ Scaling issues ● power sync ● resource tracker ● image conversion Arne Wiebalck & Surya Seetharaman Server Life-cycle Management with Ironic at CERN 7
  • 8. Server Life-cycle Management with Ironic registration health check burn-in benchmark configure provision repair adopt retire physical installation physical removal in production work in progress planned preparation: racks power network 8 Arne Wiebalck & Surya Seetharaman Server Life-cycle Management with Ironic at CERN
  • 9. The Early Life-cycle Steps ... registration health check benchmark burn-in in production work in progress planned 9 Arne Wiebalck & Surya Seetharaman Server Life-cycle Management with Ironic at CERN ➔ Currently done by “manually” checking the inventory ➢ should move to Ironic’s introspection rules (S3) ➔ Currently done with a CERN-only auto-registration image ➢ could move to Ironic, unclear if we want to do this ➔ Will become a set of cleaning steps “burnin- {cpu,disk,memory,network}” ➢ rely on standard tools like badblocks ➢ stops upon failure ➔ Will become a cleaning step “benchmark” ➢ launches a container which will know what to do
  • 10. Configure: Clean-time Software RAID 10 Arne Wiebalck & Surya Seetharaman Server Life-cycle Management with Ironic at CERN ➔ Vast majority of our 15’000 nodes rely on Software RAID ➢ Redundancy & space ➔ The lack of support in Ironic required re-installations ➢ Additional installation based on user-provided kickstart file ➢ Other deployments do have similar constructs for such setups ➔ With the upstream team, we added Software RAID support ➢ Available in Train ➢ *Initial* support ➢ In analogy to hardware RAID implemented as part of ‘manual’ cleaning ➢ In-band via the Ironic Python Agent Set the raid_interface Set the target_raid_config Trigger manual cleaning
  • 11. 11 Arne Wiebalck & Surya Seetharaman Server Life-cycle Management with Ironic at CERN Configure: Clean-time Software RAID ➔ How does Ironic configure the RAID? Ironic IPA (1) triggers manual cleaning (3) boots (5) configures RAID (4) passes clean steps (& triggers bootloader install during deploy) (2) gets target_raid_config manageable manageable
  • 12. 12 Arne Wiebalck & Surya Seetharaman Server Life-cycle Management with Ironic at CERN Configure: Clean-time Software RAID /dev/sda1 /dev/sdb1 /dev/sda2 /dev/sdb2 /dev/md1/dev/md0 (RAID-1) MBR MBR holder disk /dev/sda (RAID-N) md md md md /dev/md0p{1,2} /dev/md1 config drivedeploy device “payload” device holder disk /dev/sdb
  • 13. 13 Arne Wiebalck & Surya Seetharaman Server Life-cycle Management with Ironic at CERN Configure: Clean-time Software RAID ➔ What about GPT/UEFI and disk selection? ➢ Initial implementation uses MBR, BIOS, and fixed partition for the root file system ... ➢ GPT works (needed mechanism to find root fs) ➢ UEFI will require additional work … ongoing! ➢ Disk selection not yet possible … ongoing! ➔ How to repair a broken RAID? ➢ “broken” == “broken beyond repair” (!= degraded) ➢ Do it the cloudy way: delete the instance! ➢ At CERN: {delete,create}_configuration steps part of our custom hardware manager ➢ What about ‘nova rebuild’? Cleaning 400 nodes triggered the creation of Software RAIDs
  • 14. 14 Arne Wiebalck & Surya Seetharaman Server Life-cycle Management with Ironic at CERN Provision The Instance Life-cycle Why Ironic? ➔ Single interface for virtual and physical resources ➔ Same request approval workflow and tools ➔ Satisfies requests where VMs are not apt ➔ Consolidates the accounting available active deploying deletingcleaning OpenStack Hypervisors Other users
  • 15. 15 Arne Wiebalck & Surya Seetharaman Server Life-cycle Management with Ironic at CERN Region ‘Services’ Region ‘Batch’ ‘bare metal’ cell CN=’abc’ node=’abc’ physical instances ‘def’ resource tracker RP=’abc’ nova- compute Ironic driver ‘compute’ cell CN=’xyz’ virtual instances ‘pqr’ resource tracker RP=’xyz’ nova- compute virt driver inst=’pqr’inst=’def’ Provision Physical Instances as Hypervisors
  • 16. 16 Arne Wiebalck & Surya Seetharaman Server Life-cycle Management with Ironic at CERN Adopt “Take over the world!” ➔ Ironic provides “adoption” of nodes, but this does not include instance creation! ➔ Our procedure to adopt production nodes into Ironic: ➢ Enroll the node, including its resource_class (now in ‘manageable’) ➢ Set fake drivers for this node in Ironic ➢ Provide the node (now in ‘available’) ➢ Create the port in Ironic (usually created by inspection) ➢ Let Nova discover the node ➢ Add the node to the placement aggregate ➢ Wait for the resource tracker ➢ Create instance in Nova (with flavor matching the above resource_class) ➢ Set real drivers and interfaces in Ironic
  • 17. 17 Arne Wiebalck & Surya Seetharaman Server Life-cycle Management with Ironic at CERN DATA CENTER Repair Even with Ironic nodes do break! ➔ The OpenStack team does not directly intervene on nodes ➢ Dedicated repair team (plus workflow framework based on Rundeck & Mistral) ➔ Incidents: scheduled vs. unplanned ➢ BMC firmware upgrade campaign vs. motherboard failure ➔ Introduction of Ironic required workflow adaptations, training, and ... ➢ New concepts like “physical instance” ➢ Reinstallations ➔ … upstream code changes in Ironic and Nova ➢ power synchronisation (the “root of all evil”)
  • 18. 18 Arne Wiebalck & Surya Seetharaman Server Life-cycle Management with Ironic at CERN physical instance DATA CENTER power outage Ironic database POWER_ONFF Nova database POWER_ONFF DATA CENTER Repair The Nova / Ironic Power Sync Problem Statement
  • 19. 19 Arne Wiebalck & Surya Seetharaman Server Life-cycle Management with Ironic at CERN physical instance Ironic database POWER_OFFN Nova database POWER_OFF DATA CENTER POWER_ON != POWER_OFF ???? Repair The Nova / Ironic Power Sync Problem Statement
  • 20. 20 Arne Wiebalck & Surya Seetharaman Server Life-cycle Management with Ironic at CERN physical instance Ironic database POWER_ONFF Nova database POWER_OFF DATA CENTER POWER_ON != POWER_OFF ????Force power state update to reflect what nova feels is right Repair The Nova / Ironic Power Sync Problem Statement
  • 21. 21 Arne Wiebalck & Surya Seetharaman Server Life-cycle Management with Ironic at CERN Repair The Nova / Ironic Power Sync Problem Statement ➔ Unforeseen events like power outage: ➢ Physical instance goes down ● Nova puts the instance into SHUTDOWN state ○ through the ``_sync_power_states`` periodic task ○ hypervisor regarded as the source of truth ➢ Physical instance comes back up without Nova knowing ● Nova again puts the instance back into SHUTDOWN state ○ through the ``_sync_power_states`` periodic task ○ database regarded as the source of truth Nova should not force the instance to POWER_OFF when it comes back up
  • 22. 22 Arne Wiebalck & Surya Seetharaman Server Life-cycle Management with Ironic at CERN physical instance DATA CENTER power outage POWER_ONFFDATA CENTER Repair The Nova / Ironic Power Sync Implemented Solution POWER_ONFF target_power_state = POWER_OFF Power update event using os-server-external-events POWER_OFF == POWER_OFF
  • 23. 23 Arne Wiebalck & Surya Seetharaman Server Life-cycle Management with Ironic at CERN physical instance DATA CENTER POWER_OFFNDATA CENTER Repair The Nova / Ironic Power Sync Implemented Solution POWER_OFFN target_power_state = POWER_ON POWER_ON == POWER_ON There can be race conditions depending on the sequence of occurrence of events. Power update event using os-server-external-events
  • 24. 24 Arne Wiebalck & Surya Seetharaman Server Life-cycle Management with Ironic at CERN Repair The Nova / Ironic Power Sync Implemented Solution Ironic sends power state change callbacks to Nova ➔ Operator has to set [nova].send_power_notifications config option to True ➔ JSON request body sent from Ironic ➔ Done via the os-server-external-events Nova api ➔ Read power_update spec and documentation for more details { "events": [ { "name": "power-update", "server_uuid": "3df201cf-2451-44f2-8d25-a4ca826fc1f3", "tag": target_power_state } ] }
  • 25. ➔ JSON response body sent from Nova ➔ Nova updates its database to reflect the power state change ➢ Ironic regarded as source of truth 25 Arne Wiebalck & Surya Seetharaman Server Life-cycle Management with Ironic at CERN Repair The Nova / Ironic Power Sync Implemented Solution Ironic sends power state change callbacks to Nova { "events": [ { "code": 200, "name": "power-update", "server_uuid": "3df201cf-2451-44f2-8d25-a4ca826fc1f3", "status": "completed", "tag": target_power_state } ] } Available from
  • 26. 26 Arne Wiebalck & Surya Seetharaman Server Life-cycle Management with Ironic at CERN Retire The End of the Cycle ➔ A simple procedure for now ➢ Deleting instances triggers cleaning ➢ Setting maintenance avoids instance creation ➢ The maintenance reason marks them for removal ➔ No explicit “node retirement” support in Ironic ➢ Time windows allow for re-use ➢ Explicit tagging would be helpful ➔ Proposals to introduce a retirement flag (or even state) ➢ Spec: “Add support for node retirement” https://review.opendev.org/#/c/656799/
  • 27. A Pain Point: Resource Tracking 27 Arne Wiebalck & Surya Seetharaman Server Life-cycle Management with Ironic at CERN ➔ The resource tracker loops sequentially over compute nodes ➢ Holds a semaphore and hence blocks instance creation ➢ OK for virtual machines, a scaling issue for bare metal deployments ➔ This creates a noticeable dead time for instance creation ➢ For 1 node with 3500 physical nodes the turn-around time is ~60 mins ➢ … this is already with upstream and local patches to reduce the overhead ➔ Stop-gap solution: adapt the resource tracker cycle ➢ Compromise between blockage and placement updates ➔ Potential solutions: n-compute sharding and per-instance locking
  • 28. 28 Arne Wiebalck & Surya Seetharaman Server Life-cycle Management with Ironic at CERN 謝謝啦 Thank you! Power consumption in container DC during allocation of a new delivery (partial) burn-in nodes available physical Instances created virtual Instances created Summary ➔ At CERN, Ironic is used for the majority of a server’s life cycle steps ➔ We work on the remaining steps, more features and we plan to pass +10k servers!