Accelerating Containers at CERN
Tim Bell,
CERN IT
@noggin143
VHPC ‘19
20th June 2019
About Tim
• Responsible for
Compute and
Monitoring in CERN IT
department
• Previously worked at
IBM and Deutsche
Bank
20/06/2019 VHPC '19 2
3
CERN: founded in 1954: 12 European States
“Science for Peace”
Today: 23 Member States
Member States: Austria, Belgium, Bulgaria,
Czech Republic, Denmark, Finland, France,
Germany, Greece, Hungary, Israel, Italy,
Netherlands, Norway, Poland, Portugal,
Romania, Serbia, Slovak Republic, Spain,
Sweden, Switzerland and
United Kingdom
~ 2600 staff
~ 1800 other paid personnel
~ 13000 scientific users
Budget (2018) ~ 1150 MCHF
20/06/2019
Associate Members in the Pre-Stage to Membership: Cyprus,
Slovenia
Associate Member States: India, Lithuania, Pakistan, Turkey,
Ukraine
Applications for Membership or Associate Membership:
Brazil, Croatia, Estonia
Observers to Council: Japan, Russia, United States of America
CERN
World’s largest
particle physics
laboratory
VHPC '19 4
Image credit: CERN
Protons in
LHC
20/06/2019 VHPC '19 5
VHPC '19 6
40 million
pictures
per second
1PB/s
Image credit: CERN
Our Approach: Tool Chain and DevOps
7
• CERN’s requirements are no longer special
• Small dedicated tools
allowed for rapid validation &
prototyping
• Adapted our processes,
policies and work flows
to the tools
• Join (and contribute to)
existing communities
The CERN Cloud Service
8
• Production since July 2013
- Several rolling upgrades since,
now on Rocky
- Many sub services deployed
• Spans two data centers
• Geneva
• Budapest
• Deployed using RDO + Puppet
- Mostly upstream, patched where needed
CPU Performance
9
• Benchmark results on full-node VMs were about 20% lower
than those of the underlying host
- Smaller VMs much better
• Investigated various tuning options
- KSM*, EPT**, PAE, Pinning, … +hardware type dependencies
- Discrepancy down to ~10% between virtual and physical
• Comparison with Hyper-V: no general issue
- Loss w/o tuning ~3% (full-node), <1% for small VMs
- … NUMA-awareness!
*KSM on/off: beware of memory reclaim! **EPT on/off: beware of expensive page table walks!
NUMA
• NUMA-awareness identified as most
efficient setting
• “EPT-off” side-effect
• Small number of hosts, but very
visible there
• Use 2MB Huge Pages
• Keep the “EPT off” performance gain
with “EPT on”
20/06/2019 VHPC '19 10
VM       Overhead Before   Overhead After
4x 8     8%
2x 16    16%
1x 24    20%               5%
1x 32    20%               3%
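As an illustration of how such settings can be applied, a minimal sketch using standard Nova flavor extra specs and a kernel boot parameter (flavor name and page counts are hypothetical, not the exact CERN configuration):

$ # Reserve 2MB huge pages on the hypervisor via the kernel command line:
$ #   hugepagesz=2M hugepages=28000
$ # Expose NUMA awareness and huge pages through flavor metadata:
$ openstack flavor set m2.xlarge \
    --property hw:numa_nodes=2 \
    --property hw:mem_page_size=2MB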
Bare Metal
20/06/2019 VHPC '19 11
• VMs not suitable for all of our use cases
- Storage and database nodes, HPC clusters, bootstrapping,
critical network equipment or specialised network setups,
precise/repeatable benchmarking for s/w frameworks, …
• Complete our service offerings
- Physical nodes (in addition to VMs and containers)
- OpenStack UI as the single pane of glass
• Simplify hardware provisioning workflows
- For users: openstack server create/delete
- For procurement & h/w provisioning team: initial on-boarding, server re-assignments
• Consolidate accounting & bookkeeping
- Resource accounting input will come from fewer sources
- Machine re-assignments will be easier to track
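For users, provisioning a physical node then looks the same as provisioning a VM; a hedged sketch (flavor, image, network and key names are hypothetical):

$ openstack server create \
    --flavor p1.baremetal \
    --image CC7-base \
    --network my-project-net \
    --key-name my-key \
    my-physical-node
$ openstack server delete my-physical-node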
CERN OpenStack Infrastructure
• Production since 2013
• Provides 90% of CERN IT Compute resources
• Institution as well as physics services
20/06/2019 VHPC '19 12
HTC vs HPC
• CERN is mostly a high throughput computing lab:
• File-based parallelism, massive batch system for data
processing
• But we have several HPC use-cases:
• Beam simulations, plasma physics, CFD, QCD, (ASICs)
• Need full POSIX consistency, fast parallel IO
13
20/06/2019 VHPC '19
CERN Container Use Cases
• Batch Processing
• End user analysis / Jupyter Notebooks
• Machine Learning / TensorFlow / Keras
• Infrastructure Management
• Data Movement, Web servers, PaaS …
• Continuous Integration / Deployment
• Run OpenStack :-)
• And many others
Credit: Ricardo Rocha, CERN Cloud
20/06/2019 VHPC '19 16
HTC - Containers all the way down
• HTCondor manages the HTC
throughput queues
• Singularity decouples the host OS from the OS required by the experiment workload
• Not all hardware supports old OSes any more
• Multi-science labs may require a newer OS
20/06/2019 VHPC '19 17
Diagram: the batch system runs pilot jobs, which execute experiment jobs inside a container.
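To illustrate the decoupling, a hedged sketch of a pilot running a payload inside a Singularity container whose OS differs from the host (the image and payload script are hypothetical):

$ singularity exec \
    --bind /cvmfs \
    docker://centos:7 \
    ./run_experiment_payload.sh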
Batch on Storage Services - BEER
18
Diagram: an EOS storage node running a Condor worker inside a container, with cores, memory and local disk constrained by cgroups and monitored by cadvisor/collectd.
Some cores are reserved for EOS; the remaining cores are integrated into Condor and run jobs at low priority, with memory and scratch space restricted by cgroups.
20/06/2019 VHPC '19
https://cds.cern.ch/record/2653012
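As an illustration of the kind of cgroup limits involved (not the actual BEER implementation, which drives cgroups through Condor), a hedged sketch using Docker resource flags with hypothetical values:

$ # Leave the first cores free for EOS; run batch at very low CPU weight,
$ # with memory capped, so the storage service is not disturbed.
$ docker run --rm \
    --cpuset-cpus=8-31 \
    --cpu-shares=2 \
    --memory=16g \
    batch-worker:latest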
Use Case: Spark on K8s
Credit: CERN data analytics working group
20/06/2019 VHPC '19 19
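A minimal sketch of submitting Spark work to a Kubernetes cluster using the standard spark-submit Kubernetes support (API server address and container image are placeholders, not the CERN setup):

$ spark-submit \
    --master k8s://https://<k8s-api-server>:6443 \
    --deploy-mode cluster \
    --name spark-pi \
    --class org.apache.spark.examples.SparkPi \
    --conf spark.executor.instances=5 \
    --conf spark.kubernetes.container.image=<spark-image> \
    local:///opt/spark/examples/jars/spark-examples.jar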
Reusable Analysis Platform
● Workflow Engine (Yadage)
● Each step a Kubernetes Job
● Integrated Monitoring & Logging
● Centralized Log Collection
● “Rediscovering the Higgs” at KubeCon
Use case: REANA / RECAST
Credit: CERN Invenio User Group Workshop
20
Use case: Federated Kubernetes
Batch or other jobs on multiple
clusters
Segment the datacenter
Burst compute capacity
Same deployment in all clouds
Credit: Ricardo Rocha, CERN Cloud
kubefed join --host-cluster-context… --cluster-context … atlas-recast-y
openstack coe federation join cern-condor atlas-recast-x atlas-recast-y
• Many Kubernetes clusters provisioned with OpenStack/Magnum
• Manila shares backed by CephFS for shared storage
• Central GitLab container registry
• Keystone Webhook for user AuthN
Use case: WebLogic on Kubernetes
Credit: Antonio Nappi, CERN IT-DB
20/06/2019 VHPC '19 22
What is Magnum?
An OpenStack API service that allows
creation of container clusters.
● Use your keystone credentials
● You choose your cluster type
● Kubernetes
● Docker Swarm
● Mesos
● Single-tenant clusters
● Quickly create new clusters with advanced features
such as multi-master
23
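A hedged example of what defining a cluster type looks like with the Magnum CLI (image, network and flavor names are hypothetical):

$ openstack coe cluster template create k8s-template \
    --coe kubernetes \
    --image fedora-atomic-latest \
    --external-network ext-net \
    --master-flavor m1.small \
    --flavor m1.small \
    --network-driver flannel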
CERN Magnum Deployment
• In production since 2016
• Running OpenStack Rocky release
• Working closely with upstream development
• Slightly patched to adapt to the CERN network
20/06/2019 VHPC '19 24
Magnum Cluster
A Magnum cluster is composed of:
● compute instances (virtual or physical)
● OpenStack Neutron networks
● security groups
● OpenStack Cinder for block volumes
● other resources (e.g. Load Balancer)
● OpenStack Heat to orchestrate the nodes
● Where your containers run
● Lifecycle operations
○ Scale up/down
○ Autoscale
○ Upgrade
○ Node heal/replace
● Self-contained cluster with its own
monitoring, data store and additional
resources
Why use Magnum?
• Centrally managed self-service like GKE and AKS
• Provide clusters to users with one-click deployment (or one API
call)
• Users don’t need to be system administrators
• Accounting comes for free if you use quotas in your
projects
• Easy entrypoint to containers for new users
• Control your users’ deployments
• OS
• Monitoring
20/06/2019 VHPC '19 26
CERN Storage Integration
• CSI CephFS
• Provides an interface between a CSI-enabled Container Orchestrator and the Ceph cluster
• Provisions and mounts CephFS volumes
• Supports both the kernel CephFS client and the CephFS FUSE driver
• https://github.com/ceph/ceph-csi
• OpenStack Manila External Provisioner
• Provisions new Manila shares, fetches existing ones
• Maps them to Kubernetes PersistentVolume objects
• Currently supports CephFS shares only (both in-tree CephFS plugin and csi-cephfs)
• https://github.com/kubernetes/cloud-provider-openstack/tree/master/pkg/share/manila
Detailed results at https://techblog.web.cern.ch/techblog/post/container-storage-cephfs-scale-part3/
20/06/2019 VHPC '19 27
Credit: Robert Vasek, CERN Cloud
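To give an idea of how this looks from the user side, a minimal sketch of claiming a CephFS-backed volume through a StorageClass (the StorageClass name is hypothetical and depends on how the provisioner is deployed):

$ kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-data
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: csi-cephfs   # hypothetical StorageClass name
  resources:
    requests:
      storage: 10Gi
EOF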
io500 – entered for 2018
• Using CephFS on SSDs and Lazy IO, we
made it onto the io500 list at #21 -
https://www.vi4io.org/io500/start
20/06/2019 VHPC '19 28
Bigbang Scale Tests
• Bigbang scale tests mutually benefit
CERN & Ceph project
• Bigbang I: 30PB, 7200 OSDs, Ceph
hammer. Several osdmap limitations
• Bigbang II: Similar size, Ceph jewel.
Scalability limited by OSD/MON
messaging. Motivated ceph-mgr
• Bigbang III: 65PB, 10800 OSDs
29
https://ceph.com/community/new-luminous-scalability/
20/06/2019 VHPC '19
CERN Storage Integration
• CVMFS provides us with a massively scalable
read-only file system
• Static content like compiled applications and
conditions data
• Provides an interface between a CSI-enabled
Container Orchestrator and the CERN
application appliances
• https://github.com/cernops/cvmfs-csi/
20/06/2019 VHPC '19 30
Credit: Robert Vasek and Ricardo Rocha, CERN Cloud
First Attempt – 1M requests/sec
• 200 Nodes
• Found multiple limits
• Heat Orchestration scaling
• Authentication caches
• Volume deletion
• Site services
20/06/2019 VHPC '19 31
Second Attempt – 7M requests/sec
• Fixes and scale to 1000 Nodes
VHPC '19 32
Cluster Size (Nodes)   Concurrency   Deployment Time (min)
2                      50            2.5
16                     10            4
32                     10            4
128                    5             5.5
512                    1             14
1000                   1             23
20/06/2019
Node Groups
• Define subclusters
• Vary Flavors
• Small/Big VMs
• Bare Metal
• Vary Zones
• Improve redundancy
20/06/2019 VHPC '19 33
Diagram: a master node group plus worker node groups of different kinds: m2.medium VMs, m2.medium in az-B, physical workers, and VMs with GPUs.
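A hedged sketch of adding such a node group with the Magnum CLI in newer releases (cluster, group and flavor names are hypothetical):

$ openstack coe nodegroup create kube gpu-workers \
    --flavor g1.xlarge \
    --node-count 2 \
    --role worker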
Auto Scaling
https://github.com/kubernetes/autoscaler
Optimize resource usage
Dynamically resize the cluster based on current number
of pods, and their required CPU / memory
New cloud provider for Magnum
Docs at autoscaler / cluster-autoscaler / cloudprovider / magnum
Merged PR: https://github.com/kubernetes/autoscaler/pull/1690
1. Pods are created (manually or as an automatic response to higher load)
2. If any pods cannot be scheduled, the autoscaler provisions new nodes
3. Pods are scheduled on the new nodes and run until they are deleted
4. The autoscaler removes any unused nodes
Slide Credit: Thomas Hartland, CERN Cloud
34
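A hedged example of enabling the autoscaler on a Magnum cluster via labels (label names as documented for the Magnum autoscaler support; values are illustrative):

$ openstack coe cluster create my-k8s \
    --cluster-template kubernetes \
    --node-count 1 \
    --labels auto_scaling_enabled=true,min_node_count=1,max_node_count=10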
More enhancements rolling out
• Authentication using OpenStack Keystone
• Native kubectl commands with cloud credentials
• Choice of Ingress controller
• Nginx or Traefik
• Integrated monitoring with Prometheus
• Rolling cluster upgrades for Kubernetes, operating
system and add-ons
• Integrated Node Problem Detector
20/06/2019 VHPC '19 35
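Several of these features are switched on through cluster labels; a hedged sketch (exact label names and values depend on the Magnum release):

$ openstack coe cluster create my-k8s \
    --cluster-template kubernetes \
    --labels ingress_controller=traefik,monitoring_enabled=true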
LHC Schedule
14/03/2019 Tim Bell 36
Run 3: ALICE and LHCb upgrades; Run 4: ATLAS and CMS upgrades
VHPC '19 37
LHC timeline: First run (…2009–2013) → LS1 → Second run (2015–2018) → LS2 → Third run (2021–2024) → LS3 → HL-LHC Run 4 (2026 towards 2030?)
• Significant part of cost comes from global operations
• Even with technology increase of ~15%/year, we still have a big gap if we keep trying to do things with our current compute models
• Raw data volume increases significantly for High Luminosity LHC (2026)
20/06/2019
Yearly data volumes
• LHC – 2016: 50 PB raw data
• LHC science data: ~200 PB
• Google searches: 98 PB
• Facebook uploads: 180 PB
• Google Internet archive: ~15 EB
• SKA Phase 1 – 2023: ~300 PB/year science data
• HL-LHC – 2026: ~600 PB raw data
• HL-LHC – 2026: ~1 EB physics data
• SKA Phase 2 – mid-2020s: ~1 EB science data
14/03/2019 Tim Bell 38
Slide credit: G. Lamanna, Annecy, 22/06/2019
Commercial Clouds
20/06/2019 VHPC '19 40
High Luminosity LHC until 2035
• Ten times more collisions than
the original design
Studies in progress:
Compact Linear Collider (CLIC)
• Up to 50 km long
• Linear e+e- collider, √s up to 3 TeV
Future Circular Collider (FCC)
• ~100 km circumference
• New technology magnets → 100 TeV pp collisions in a 100 km ring
• e+e- collider (FCC-ee) as a first step?
European Strategy for Particle Physics
• Preparing next update in 2020
Future of particle physics ?
Conclusions
• Many teams with diverse workloads -> Many clusters
• Single resource pool with accounting and quota
• Shared effort with other scientific communities via
• Open source collaborations
• Community special interest groups e.g. Scientific SIG
• Common projects e.g. CERN/SKA collaborations
• Containers are becoming a key technology
• Rapid application deployment
• Isolate workloads from underlying infrastructure
• Preserve analysis for the future
20/06/2019 VHPC '19 42
Further Information
• CERN blogs
• https://techblog.web.cern.ch/techblog/
• Recent Talks at OpenStack summits
• https://www.openstack.org/videos/search?search=cern
• Kubecon 2018, 2019
• Source code
• https://github.com/cernops and
https://github.com/openstack
20/06/2019 VHPC '19 43
Backup Slides
Early Prototypes
46
CERN Ceph Clusters                     Size    Version
OpenStack Cinder/Glance Production     5.5PB   jewel
Satellite data centre (1000km away)    0.4PB   luminous
CephFS (HPC+Manila) Production         0.8PB   luminous
Manila testing cluster                 0.4PB   luminous
Hyperconverged HPC                     0.4PB   luminous
CASTOR/XRootD Production               4.2PB   luminous
CERN Tape Archive                      0.8PB   luminous
S3+SWIFT Production                    0.9PB   luminous
+5PB in the pipeline
20/06/2019 VHPC '19
What to consider when running a container service
● Design your network
○ By default, magnum creates a private network per cluster and assigns floating IPs
to nodes
○ LBaaS for multi-master clusters
● Run a container registry
○ DockerHub is usually up but latency will always get you
○ Rebuild or mirror the containers used by magnum
● Provide self-service clusters -> Provide software
○ Upgrade magnum regularly, update its configuration regularly
○ Plan which container and glance images are available to users
48
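Mirroring the images a cluster depends on into a local registry can be as simple as the following hedged sketch (the registry hostname is hypothetical):

$ docker pull k8s.gcr.io/pause:3.1
$ docker tag k8s.gcr.io/pause:3.1 registry.example.org/pause:3.1
$ docker push registry.example.org/pause:3.1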
Cluster Resize
Motivation: remove specific nodes from the cluster (replaces the update command)
Forward compatibility with old clusters
Already in the upstream stable branch
ETA for release: 12 April (upstream)
api-reference:
http://git.openstack.org/cgit/openstack/magnum/tree/api-ref/source/clusters.inc#n268
Thanks to Feilong Wang
49
Cluster Resize
$ openstack coe cluster resize --nodegroup kube-worker kube 3
Request to resize cluster kube has been accepted.
$ openstack coe cluster list
+--------------------------------------+------+---------+------------+--------------+
| uuid | name | keypair | node_count | master_count |
+--------------------------------------+------+---------+------------+--------------+
| ed38e800-5884-4053-9b17-9f80995f1993 | kube | default | 3 | 1 |
+--------------------------------------+------+---------+------------+--------------+
…
--------------------+---------------+
status | health_status |
--------------------+---------------+
UPDATE_IN_PROGRESS | HEALTHY |
--------------------+---------------+
$ openstack coe cluster resize --nodegroup kube-worker \
--nodes-to-remove 05b7b307-18fd-459a-a13a-a1923c2c840d kube 1
Request to resize cluster kube has been accepted.
50
Node Groups
$ openstack coe nodegroup list kube
+--------------------------------------+-------------+-----------+------------+--------+
| uuid | name | flavor_id | node_count | role |
+--------------------------------------+-------------+-----------+------------+--------+
| 14ddaf00-9867-49ca-b10c-106c3656e4f1 | kube-master | m1.small | 1 | master |
| 8a18cc5c-040d-4e67-aa4d-9aaf38241119 | kube-worker | m1.small | 1 | worker |
+--------------------------------------+-------------+-----------+------------+--------+
$ openstack coe nodegroup show kube kube-master
+--------------------+--------------------------------------+
| Field | Value |
+--------------------+--------------------------------------+
| name | kube-master |
| cluster_id | ed38e800-5884-4053-9b17-9f80995f1993 |
| flavor_id | m1.small |
| node_addresses | [u'172.24.4.120'] |
| node_count | 1 |
| role | master |
| max_node_count | None |
| min_node_count | 1 |
| is_default | True |
+--------------------+--------------------------------------+
51
Authentication to OpenStack Keystone
Use OpenStack tokens directly in kubectl
Give kubectl access to users outside the cluster’s OpenStack project
A better (more secure) option than the current TLS certificates
$ openstack coe cluster create ... --labels keystone_auth_enabled=true
$ export OS_TOKEN=$(openstack token issue -c id -f value)
$ kubectl get pod
52
Cluster Metrics Monitoring (Prometheus)
Objectives
Provide an out-of-the-box solution for cluster, node and application metrics monitoring
Services Included
Metrics scraping and storage (Prometheus)
Data visualization (Grafana)
Alarms (Alertmanager)
Upstream Prometheus Operator Helm Chart
Slide Credit: Diogo Guerra, CERN Cloud
53
Cluster Upgrades
Upgrades of Kubernetes, Operating System, Add-ons
Rolling in-place upgrade
Rolling node-replacement
Batch size for rolling upgrade
https://storyboard.openstack.org/#!/story/2002210
54
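Once the work referenced above lands, a cluster upgrade is expected to be a single API call; a hedged sketch of the CLI form:

$ openstack coe cluster upgrade my-k8s <new-cluster-template>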
More Add-ons
Ingress Controllers
Traefik v1.7.x
Can be used with Neutron-lbaas/Octavia or HostNetwork
Octavia 1.13.2-alpha or newer
Node Problem Detector
Customizable detectors for node health
Pod Security Policy
Two modes, privileged or restricted by default
55
Magnum Deployment
● Clusters are described by cluster templates
● Shared/public templates for most common setups, customizable by users
$ openstack coe cluster template list
+------+---------------------------+
| uuid | name |
+------+---------------------------+
| .... | swarm |
| .... | swarm-ha |
| .... | kubernetes |
| .... | kubernetes-ha |
| .... | mesos |
| .... | mesos-ha |
| .... | dcos |
+------+---------------------------+
56
Magnum Deployment
● Clusters are described by cluster templates
● Shared/public templates for most common setups, customizable by users
$ openstack coe cluster create --name my-k8s --cluster-template kubernetes --node-count 100
~ 5 mins later
$ openstack coe cluster list
+------+------+---------+------------+--------------+--------------------+---------------+
| uuid | name | keypair | node_count | master_count | status | health_status |
+------+------+---------+------------+--------------+--------------------+---------------+
| ... | kube | default | 3 | 1 | UPDATE_IN_PROGRESS | HEALTHY |
+------+------+---------+------------+--------------+--------------------+---------------+
$ $(openstack coe cluster config my-k8s --dir clusters/my-k8s --use-keystone)
$ OS_TOKEN=$(openstack token issue -c id -f value)
$ kubectl get ...
57
Resource Provisioning: IaaS
VHPC '19
58
• Based on OpenStack
- Collection of open source projects for cloud orchestration
- Started by NASA and Rackspace in 2010
- Grown into a global software community
20/06/2019
NUMA roll-out
• Rolled out on ~2’000 batch hypervisors (~6’000 VMs)
• Huge page allocation as a boot parameter → reboot required
• VM NUMA awareness as flavor metadata → delete/recreate required
• Cell-by-cell (~200 hosts):
• Queue-reshuffle to minimize resource impact
• Draining & deletion of batch VMs
• Hypervisor reconfiguration (Puppet) & reboot
• Recreation of batch VMs
• Whole update took about 8 weeks
• Organized between batch and cloud teams
• No performance issue observed since
20/06/2019 VHPC '19 59
VM       Before   After
4x 8     8%
2x 16    16%
1x 24    20%      5%
1x 32    20%      3%
Container Orchestrators
60
More Related Content

What's hot

20161025 OpenStack at CERN Barcelona
20161025 OpenStack at CERN Barcelona20161025 OpenStack at CERN Barcelona
20161025 OpenStack at CERN BarcelonaTim Bell
 
Cern Cloud Architecture - February, 2016
Cern Cloud Architecture - February, 2016Cern Cloud Architecture - February, 2016
Cern Cloud Architecture - February, 2016Belmiro Moreira
 
Moving from CellsV1 to CellsV2 at CERN
Moving from CellsV1 to CellsV2 at CERNMoving from CellsV1 to CellsV2 at CERN
Moving from CellsV1 to CellsV2 at CERNBelmiro Moreira
 
20150924 rda federation_v1
20150924 rda federation_v120150924 rda federation_v1
20150924 rda federation_v1Tim Bell
 
Containers on Baremetal and Preemptible VMs at CERN and SKA
Containers on Baremetal and Preemptible VMs at CERN and SKAContainers on Baremetal and Preemptible VMs at CERN and SKA
Containers on Baremetal and Preemptible VMs at CERN and SKABelmiro Moreira
 
20190314 cern register v3
20190314 cern register v320190314 cern register v3
20190314 cern register v3Tim Bell
 
Multi-Cell OpenStack: How to Evolve Your Cloud to Scale - November, 2014
Multi-Cell OpenStack: How to Evolve Your Cloud to Scale - November, 2014Multi-Cell OpenStack: How to Evolve Your Cloud to Scale - November, 2014
Multi-Cell OpenStack: How to Evolve Your Cloud to Scale - November, 2014Belmiro Moreira
 
Learning to Scale OpenStack
Learning to Scale OpenStackLearning to Scale OpenStack
Learning to Scale OpenStackRainya Mosher
 
TOWARDS Hybrid OpenStack Clouds in the Real World
TOWARDS Hybrid OpenStack Clouds in the Real WorldTOWARDS Hybrid OpenStack Clouds in the Real World
TOWARDS Hybrid OpenStack Clouds in the Real WorldAndrew Hickey
 
The OpenStack Cloud at CERN
The OpenStack Cloud at CERNThe OpenStack Cloud at CERN
The OpenStack Cloud at CERNArne Wiebalck
 
Designing HPC & Deep Learning Middleware for Exascale Systems
Designing HPC & Deep Learning Middleware for Exascale SystemsDesigning HPC & Deep Learning Middleware for Exascale Systems
Designing HPC & Deep Learning Middleware for Exascale Systemsinside-BigData.com
 
Best Practices: Large Scale Multiphysics
Best Practices: Large Scale MultiphysicsBest Practices: Large Scale Multiphysics
Best Practices: Large Scale Multiphysicsinside-BigData.com
 
OpenStack Ousts vCenter for DevOps and Unites IT Silos at AVG Technologies
OpenStack Ousts vCenter for DevOps and Unites IT Silos at AVG Technologies OpenStack Ousts vCenter for DevOps and Unites IT Silos at AVG Technologies
OpenStack Ousts vCenter for DevOps and Unites IT Silos at AVG Technologies Jakub Pavlik
 
Real-Time Detection of Anomalies in the Database Infrastructure using Apache ...
Real-Time Detection of Anomalies in the Database Infrastructure using Apache ...Real-Time Detection of Anomalies in the Database Infrastructure using Apache ...
Real-Time Detection of Anomalies in the Database Infrastructure using Apache ...Spark Summit
 
The CMS openstack, opportunistic, overlay, online-cluster Cloud (CMSooooCloud)
The CMS openstack, opportunistic, overlay, online-cluster Cloud (CMSooooCloud)The CMS openstack, opportunistic, overlay, online-cluster Cloud (CMSooooCloud)
The CMS openstack, opportunistic, overlay, online-cluster Cloud (CMSooooCloud)Jose Antonio Coarasa Perez
 
Experiments with Complex Scientific Applications on Hybrid Cloud Infrastructures
Experiments with Complex Scientific Applications on Hybrid Cloud InfrastructuresExperiments with Complex Scientific Applications on Hybrid Cloud Infrastructures
Experiments with Complex Scientific Applications on Hybrid Cloud InfrastructuresRafael Ferreira da Silva
 
OpenStack Toronto Q3 MeetUp - September 28th 2017
OpenStack Toronto Q3 MeetUp - September 28th 2017OpenStack Toronto Q3 MeetUp - September 28th 2017
OpenStack Toronto Q3 MeetUp - September 28th 2017Stacy Véronneau
 
Integrating Bare-metal Provisioning into CERN's Private Cloud
Integrating Bare-metal Provisioning into CERN's Private CloudIntegrating Bare-metal Provisioning into CERN's Private Cloud
Integrating Bare-metal Provisioning into CERN's Private CloudArne Wiebalck
 
Operational War Stories from 5 Years of Running OpenStack in Production
Operational War Stories from 5 Years of Running OpenStack in ProductionOperational War Stories from 5 Years of Running OpenStack in Production
Operational War Stories from 5 Years of Running OpenStack in ProductionArne Wiebalck
 
Iris: Inter-cloud Resource Integration System for Elastic Cloud Data Center
Iris: Inter-cloud Resource Integration System for Elastic Cloud Data CenterIris: Inter-cloud Resource Integration System for Elastic Cloud Data Center
Iris: Inter-cloud Resource Integration System for Elastic Cloud Data CenterRyousei Takano
 

What's hot (20)

20161025 OpenStack at CERN Barcelona
20161025 OpenStack at CERN Barcelona20161025 OpenStack at CERN Barcelona
20161025 OpenStack at CERN Barcelona
 
Cern Cloud Architecture - February, 2016
Cern Cloud Architecture - February, 2016Cern Cloud Architecture - February, 2016
Cern Cloud Architecture - February, 2016
 
Moving from CellsV1 to CellsV2 at CERN
Moving from CellsV1 to CellsV2 at CERNMoving from CellsV1 to CellsV2 at CERN
Moving from CellsV1 to CellsV2 at CERN
 
20150924 rda federation_v1
20150924 rda federation_v120150924 rda federation_v1
20150924 rda federation_v1
 
Containers on Baremetal and Preemptible VMs at CERN and SKA
Containers on Baremetal and Preemptible VMs at CERN and SKAContainers on Baremetal and Preemptible VMs at CERN and SKA
Containers on Baremetal and Preemptible VMs at CERN and SKA
 
20190314 cern register v3
20190314 cern register v320190314 cern register v3
20190314 cern register v3
 
Multi-Cell OpenStack: How to Evolve Your Cloud to Scale - November, 2014
Multi-Cell OpenStack: How to Evolve Your Cloud to Scale - November, 2014Multi-Cell OpenStack: How to Evolve Your Cloud to Scale - November, 2014
Multi-Cell OpenStack: How to Evolve Your Cloud to Scale - November, 2014
 
Learning to Scale OpenStack
Learning to Scale OpenStackLearning to Scale OpenStack
Learning to Scale OpenStack
 
TOWARDS Hybrid OpenStack Clouds in the Real World
TOWARDS Hybrid OpenStack Clouds in the Real WorldTOWARDS Hybrid OpenStack Clouds in the Real World
TOWARDS Hybrid OpenStack Clouds in the Real World
 
The OpenStack Cloud at CERN
The OpenStack Cloud at CERNThe OpenStack Cloud at CERN
The OpenStack Cloud at CERN
 
Designing HPC & Deep Learning Middleware for Exascale Systems
Designing HPC & Deep Learning Middleware for Exascale SystemsDesigning HPC & Deep Learning Middleware for Exascale Systems
Designing HPC & Deep Learning Middleware for Exascale Systems
 
Best Practices: Large Scale Multiphysics
Best Practices: Large Scale MultiphysicsBest Practices: Large Scale Multiphysics
Best Practices: Large Scale Multiphysics
 
OpenStack Ousts vCenter for DevOps and Unites IT Silos at AVG Technologies
OpenStack Ousts vCenter for DevOps and Unites IT Silos at AVG Technologies OpenStack Ousts vCenter for DevOps and Unites IT Silos at AVG Technologies
OpenStack Ousts vCenter for DevOps and Unites IT Silos at AVG Technologies
 
Real-Time Detection of Anomalies in the Database Infrastructure using Apache ...
Real-Time Detection of Anomalies in the Database Infrastructure using Apache ...Real-Time Detection of Anomalies in the Database Infrastructure using Apache ...
Real-Time Detection of Anomalies in the Database Infrastructure using Apache ...
 
The CMS openstack, opportunistic, overlay, online-cluster Cloud (CMSooooCloud)
The CMS openstack, opportunistic, overlay, online-cluster Cloud (CMSooooCloud)The CMS openstack, opportunistic, overlay, online-cluster Cloud (CMSooooCloud)
The CMS openstack, opportunistic, overlay, online-cluster Cloud (CMSooooCloud)
 
Experiments with Complex Scientific Applications on Hybrid Cloud Infrastructures
Experiments with Complex Scientific Applications on Hybrid Cloud InfrastructuresExperiments with Complex Scientific Applications on Hybrid Cloud Infrastructures
Experiments with Complex Scientific Applications on Hybrid Cloud Infrastructures
 
OpenStack Toronto Q3 MeetUp - September 28th 2017
OpenStack Toronto Q3 MeetUp - September 28th 2017OpenStack Toronto Q3 MeetUp - September 28th 2017
OpenStack Toronto Q3 MeetUp - September 28th 2017
 
Integrating Bare-metal Provisioning into CERN's Private Cloud
Integrating Bare-metal Provisioning into CERN's Private CloudIntegrating Bare-metal Provisioning into CERN's Private Cloud
Integrating Bare-metal Provisioning into CERN's Private Cloud
 
Operational War Stories from 5 Years of Running OpenStack in Production
Operational War Stories from 5 Years of Running OpenStack in ProductionOperational War Stories from 5 Years of Running OpenStack in Production
Operational War Stories from 5 Years of Running OpenStack in Production
 
Iris: Inter-cloud Resource Integration System for Elastic Cloud Data Center
Iris: Inter-cloud Resource Integration System for Elastic Cloud Data CenterIris: Inter-cloud Resource Integration System for Elastic Cloud Data Center
Iris: Inter-cloud Resource Integration System for Elastic Cloud Data Center
 

Similar to 20190620 accelerating containers v3

Deep Dive Into the CERN Cloud Infrastructure - November, 2013
Deep Dive Into the CERN Cloud Infrastructure - November, 2013Deep Dive Into the CERN Cloud Infrastructure - November, 2013
Deep Dive Into the CERN Cloud Infrastructure - November, 2013Belmiro Moreira
 
Unveiling CERN Cloud Architecture - October, 2015
Unveiling CERN Cloud Architecture - October, 2015Unveiling CERN Cloud Architecture - October, 2015
Unveiling CERN Cloud Architecture - October, 2015Belmiro Moreira
 
Proto kubernetes onswitc_hengines_tue100418
Proto kubernetes onswitc_hengines_tue100418Proto kubernetes onswitc_hengines_tue100418
Proto kubernetes onswitc_hengines_tue100418inside-BigData.com
 
Ceph for Big Science - Dan van der Ster
Ceph for Big Science - Dan van der SterCeph for Big Science - Dan van der Ster
Ceph for Big Science - Dan van der SterCeph Community
 
Ceph on 64-bit ARM with X-Gene
Ceph on 64-bit ARM with X-GeneCeph on 64-bit ARM with X-Gene
Ceph on 64-bit ARM with X-GeneCeph Community
 
CEPH DAY BERLIN - CEPH ON THE BRAIN!
CEPH DAY BERLIN - CEPH ON THE BRAIN!CEPH DAY BERLIN - CEPH ON THE BRAIN!
CEPH DAY BERLIN - CEPH ON THE BRAIN!Ceph Community
 
Spark China Summit 2015 Guancheng Chen
Spark China Summit 2015 Guancheng ChenSpark China Summit 2015 Guancheng Chen
Spark China Summit 2015 Guancheng ChenGuancheng (G.C.) Chen
 
Sanger OpenStack presentation March 2017
Sanger OpenStack presentation March 2017Sanger OpenStack presentation March 2017
Sanger OpenStack presentation March 2017Dave Holland
 
SC23 : NCHC Hyper Kylin Cloud Platform
SC23 : NCHC Hyper Kylin Cloud PlatformSC23 : NCHC Hyper Kylin Cloud Platform
SC23 : NCHC Hyper Kylin Cloud PlatformChenkai Sun
 
Ceph Day SF 2015 - Deploying flash storage for Ceph without compromising perf...
Ceph Day SF 2015 - Deploying flash storage for Ceph without compromising perf...Ceph Day SF 2015 - Deploying flash storage for Ceph without compromising perf...
Ceph Day SF 2015 - Deploying flash storage for Ceph without compromising perf...Ceph Community
 
Boyan Krosnov - Building a software-defined cloud - our experience
Boyan Krosnov - Building a software-defined cloud - our experienceBoyan Krosnov - Building a software-defined cloud - our experience
Boyan Krosnov - Building a software-defined cloud - our experienceShapeBlue
 
Kubernetes meetup bangalore december 2017 - v02
Kubernetes meetup bangalore   december 2017 - v02Kubernetes meetup bangalore   december 2017 - v02
Kubernetes meetup bangalore december 2017 - v02Kumar Gaurav
 
GRP 19 - Nautilus, IceCube and LIGO
GRP 19 - Nautilus, IceCube and LIGOGRP 19 - Nautilus, IceCube and LIGO
GRP 19 - Nautilus, IceCube and LIGOIgor Sfiligoi
 
OSMC 2019 | Monitoring Alerts and Metrics on Large Power Systems Clusters by ...
OSMC 2019 | Monitoring Alerts and Metrics on Large Power Systems Clusters by ...OSMC 2019 | Monitoring Alerts and Metrics on Large Power Systems Clusters by ...
OSMC 2019 | Monitoring Alerts and Metrics on Large Power Systems Clusters by ...NETWAYS
 
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...Ganesan Narayanasamy
 
CloudLab Overview
CloudLab OverviewCloudLab Overview
CloudLab OverviewEd Dodds
 
Company Presentation - ClusterVision
Company Presentation - ClusterVisionCompany Presentation - ClusterVision
Company Presentation - ClusterVisionRemy L. Overkempe
 
OpenNebula and StorPool: Building Powerful Clouds
OpenNebula and StorPool: Building Powerful CloudsOpenNebula and StorPool: Building Powerful Clouds
OpenNebula and StorPool: Building Powerful CloudsOpenNebula Project
 
StorPool Presents at Cloud Field Day 9
StorPool Presents at Cloud Field Day 9StorPool Presents at Cloud Field Day 9
StorPool Presents at Cloud Field Day 9StorPool Storage
 

Similar to 20190620 accelerating containers v3 (20)

Deep Dive Into the CERN Cloud Infrastructure - November, 2013
Deep Dive Into the CERN Cloud Infrastructure - November, 2013Deep Dive Into the CERN Cloud Infrastructure - November, 2013
Deep Dive Into the CERN Cloud Infrastructure - November, 2013
 
Unveiling CERN Cloud Architecture - October, 2015
Unveiling CERN Cloud Architecture - October, 2015Unveiling CERN Cloud Architecture - October, 2015
Unveiling CERN Cloud Architecture - October, 2015
 
Proto kubernetes onswitc_hengines_tue100418
Proto kubernetes onswitc_hengines_tue100418Proto kubernetes onswitc_hengines_tue100418
Proto kubernetes onswitc_hengines_tue100418
 
Ceph for Big Science - Dan van der Ster
Ceph for Big Science - Dan van der SterCeph for Big Science - Dan van der Ster
Ceph for Big Science - Dan van der Ster
 
Ceph on 64-bit ARM with X-Gene
Ceph on 64-bit ARM with X-GeneCeph on 64-bit ARM with X-Gene
Ceph on 64-bit ARM with X-Gene
 
CEPH DAY BERLIN - CEPH ON THE BRAIN!
CEPH DAY BERLIN - CEPH ON THE BRAIN!CEPH DAY BERLIN - CEPH ON THE BRAIN!
CEPH DAY BERLIN - CEPH ON THE BRAIN!
 
Spark China Summit 2015 Guancheng Chen
Spark China Summit 2015 Guancheng ChenSpark China Summit 2015 Guancheng Chen
Spark China Summit 2015 Guancheng Chen
 
Sanger OpenStack presentation March 2017
Sanger OpenStack presentation March 2017Sanger OpenStack presentation March 2017
Sanger OpenStack presentation March 2017
 
SC23 : NCHC Hyper Kylin Cloud Platform
SC23 : NCHC Hyper Kylin Cloud PlatformSC23 : NCHC Hyper Kylin Cloud Platform
SC23 : NCHC Hyper Kylin Cloud Platform
 
Ceph Day SF 2015 - Deploying flash storage for Ceph without compromising perf...
Ceph Day SF 2015 - Deploying flash storage for Ceph without compromising perf...Ceph Day SF 2015 - Deploying flash storage for Ceph without compromising perf...
Ceph Day SF 2015 - Deploying flash storage for Ceph without compromising perf...
 
Boyan Krosnov - Building a software-defined cloud - our experience
Boyan Krosnov - Building a software-defined cloud - our experienceBoyan Krosnov - Building a software-defined cloud - our experience
Boyan Krosnov - Building a software-defined cloud - our experience
 
Kubernetes meetup bangalore december 2017 - v02
Kubernetes meetup bangalore   december 2017 - v02Kubernetes meetup bangalore   december 2017 - v02
Kubernetes meetup bangalore december 2017 - v02
 
GRP 19 - Nautilus, IceCube and LIGO
GRP 19 - Nautilus, IceCube and LIGOGRP 19 - Nautilus, IceCube and LIGO
GRP 19 - Nautilus, IceCube and LIGO
 
OSMC 2019 | Monitoring Alerts and Metrics on Large Power Systems Clusters by ...
OSMC 2019 | Monitoring Alerts and Metrics on Large Power Systems Clusters by ...OSMC 2019 | Monitoring Alerts and Metrics on Large Power Systems Clusters by ...
OSMC 2019 | Monitoring Alerts and Metrics on Large Power Systems Clusters by ...
 
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
 
CloudLab Overview
CloudLab OverviewCloudLab Overview
CloudLab Overview
 
Company Presentation - ClusterVision
Company Presentation - ClusterVisionCompany Presentation - ClusterVision
Company Presentation - ClusterVision
 
OpenNebula and StorPool: Building Powerful Clouds
OpenNebula and StorPool: Building Powerful CloudsOpenNebula and StorPool: Building Powerful Clouds
OpenNebula and StorPool: Building Powerful Clouds
 
StorPool Presents at Cloud Field Day 9
StorPool Presents at Cloud Field Day 9StorPool Presents at Cloud Field Day 9
StorPool Presents at Cloud Field Day 9
 
Available HPC Resources at CSUC
Available HPC Resources at CSUCAvailable HPC Resources at CSUC
Available HPC Resources at CSUC
 

More from Tim Bell

CERN Status at OpenStack Shanghai Summit November 2019
CERN Status at OpenStack Shanghai Summit November 2019CERN Status at OpenStack Shanghai Summit November 2019
CERN Status at OpenStack Shanghai Summit November 2019Tim Bell
 
20181219 ucc open stack 5 years v3
20181219 ucc open stack 5 years v320181219 ucc open stack 5 years v3
20181219 ucc open stack 5 years v3Tim Bell
 
OpenStack Paris 2014 - Federation, are we there yet ?
OpenStack Paris 2014 - Federation, are we there yet ?OpenStack Paris 2014 - Federation, are we there yet ?
OpenStack Paris 2014 - Federation, are we there yet ?Tim Bell
 
20141103 cern open_stack_paris_v3
20141103 cern open_stack_paris_v320141103 cern open_stack_paris_v3
20141103 cern open_stack_paris_v3Tim Bell
 
CERN Mass and Agility talk at OSCON 2014
CERN Mass and Agility talk at OSCON 2014CERN Mass and Agility talk at OSCON 2014
CERN Mass and Agility talk at OSCON 2014Tim Bell
 
20140509 cern open_stack_linuxtag_v3
20140509 cern open_stack_linuxtag_v320140509 cern open_stack_linuxtag_v3
20140509 cern open_stack_linuxtag_v3Tim Bell
 
Open stack operations feedback loop v1.4
Open stack operations feedback loop v1.4Open stack operations feedback loop v1.4
Open stack operations feedback loop v1.4Tim Bell
 
CERN clouds and culture at GigaOm London 2013
CERN clouds and culture at GigaOm London 2013CERN clouds and culture at GigaOm London 2013
CERN clouds and culture at GigaOm London 2013Tim Bell
 
20130529 openstack cee_day_v6
20130529 openstack cee_day_v620130529 openstack cee_day_v6
20130529 openstack cee_day_v6Tim Bell
 
Academic cloud experiences cern v4
Academic cloud experiences cern v4Academic cloud experiences cern v4
Academic cloud experiences cern v4Tim Bell
 
Ceilometer lsf-intergration-openstack-summit
Ceilometer lsf-intergration-openstack-summitCeilometer lsf-intergration-openstack-summit
Ceilometer lsf-intergration-openstack-summitTim Bell
 
Havana survey results-final-v2
Havana survey results-final-v2Havana survey results-final-v2
Havana survey results-final-v2Tim Bell
 
Havana survey results-final
Havana survey results-finalHavana survey results-final
Havana survey results-finalTim Bell
 
20121205 open stack_accelerating_science_v3
20121205 open stack_accelerating_science_v320121205 open stack_accelerating_science_v3
20121205 open stack_accelerating_science_v3Tim Bell
 
20121115 open stack_ch_user_group_v1.2
20121115 open stack_ch_user_group_v1.220121115 open stack_ch_user_group_v1.2
20121115 open stack_ch_user_group_v1.2Tim Bell
 
20121017 OpenStack Accelerating Science
20121017 OpenStack Accelerating Science20121017 OpenStack Accelerating Science
20121017 OpenStack Accelerating ScienceTim Bell
 
Accelerating science with Puppet
Accelerating science with PuppetAccelerating science with Puppet
Accelerating science with PuppetTim Bell
 
20120524 cern data centre evolution v2
20120524 cern data centre evolution v220120524 cern data centre evolution v2
20120524 cern data centre evolution v2Tim Bell
 
CERN User Story
CERN User StoryCERN User Story
CERN User StoryTim Bell
 

More from Tim Bell (19)

CERN Status at OpenStack Shanghai Summit November 2019
CERN Status at OpenStack Shanghai Summit November 2019CERN Status at OpenStack Shanghai Summit November 2019
CERN Status at OpenStack Shanghai Summit November 2019
 
20181219 ucc open stack 5 years v3
20181219 ucc open stack 5 years v320181219 ucc open stack 5 years v3
20181219 ucc open stack 5 years v3
 
OpenStack Paris 2014 - Federation, are we there yet ?
OpenStack Paris 2014 - Federation, are we there yet ?OpenStack Paris 2014 - Federation, are we there yet ?
OpenStack Paris 2014 - Federation, are we there yet ?
 
20141103 cern open_stack_paris_v3
20141103 cern open_stack_paris_v320141103 cern open_stack_paris_v3
20141103 cern open_stack_paris_v3
 
CERN Mass and Agility talk at OSCON 2014
CERN Mass and Agility talk at OSCON 2014CERN Mass and Agility talk at OSCON 2014
CERN Mass and Agility talk at OSCON 2014
 
20140509 cern open_stack_linuxtag_v3
20140509 cern open_stack_linuxtag_v320140509 cern open_stack_linuxtag_v3
20140509 cern open_stack_linuxtag_v3
 
Open stack operations feedback loop v1.4
Open stack operations feedback loop v1.4Open stack operations feedback loop v1.4
Open stack operations feedback loop v1.4
 
CERN clouds and culture at GigaOm London 2013
CERN clouds and culture at GigaOm London 2013CERN clouds and culture at GigaOm London 2013
CERN clouds and culture at GigaOm London 2013
 
20130529 openstack cee_day_v6
20130529 openstack cee_day_v620130529 openstack cee_day_v6
20130529 openstack cee_day_v6
 
Academic cloud experiences cern v4
Academic cloud experiences cern v4Academic cloud experiences cern v4
Academic cloud experiences cern v4
 
Ceilometer lsf-intergration-openstack-summit
Ceilometer lsf-intergration-openstack-summitCeilometer lsf-intergration-openstack-summit
Ceilometer lsf-intergration-openstack-summit
 
Havana survey results-final-v2
Havana survey results-final-v2Havana survey results-final-v2
Havana survey results-final-v2
 
Havana survey results-final
Havana survey results-finalHavana survey results-final
Havana survey results-final
 
20121205 open stack_accelerating_science_v3
20121205 open stack_accelerating_science_v320121205 open stack_accelerating_science_v3
20121205 open stack_accelerating_science_v3
 
20121115 open stack_ch_user_group_v1.2
20121115 open stack_ch_user_group_v1.220121115 open stack_ch_user_group_v1.2
20121115 open stack_ch_user_group_v1.2
 
20121017 OpenStack Accelerating Science
20121017 OpenStack Accelerating Science20121017 OpenStack Accelerating Science
20121017 OpenStack Accelerating Science
 
Accelerating science with Puppet
Accelerating science with PuppetAccelerating science with Puppet
Accelerating science with Puppet
 
20120524 cern data centre evolution v2
20120524 cern data centre evolution v220120524 cern data centre evolution v2
20120524 cern data centre evolution v2
 
CERN User Story
CERN User StoryCERN User Story
CERN User Story
 

Recently uploaded

"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDGMarianaLemus7
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 

Recently uploaded (20)

"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDG
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 

20190620 accelerating containers v3

  • 1. Accelerating Containers at CERN Tim Bell, CERN IT @noggin143 VHPC ‘19 20th June 2019
  • 2. About Tim • Responsible for Compute and Monitoring in CERN IT department • Previously worked at IBM and Deutsche Bank VHPC '19 220/06/2019
  • 3. 3 CERN: founded in 1954: 12 European States “Science for Peace” Today: 23 Member States Member States: Austria, Belgium, Bulgaria, Czech Republic, Denmark, Finland, France, Germany, Greece, Hungary, Israel, Italy, Netherlands, Norway, Poland, Portugal, Romania, Serbia, Slovak Republic, Spain, Sweden, Switzerland and United Kingdom ~ 2600 staff ~ 1800 other paid personnel ~ 13000 scientific users Budget (2018) ~ 1150 MCHF 20/06/2019 Associate Members in the Pre-Stage to Membership: Cyprus, Slovenia Associate Member States: India, Lithuania, Pakistan, Turkey, Ukraine Applications for Membership or Associate Membership: Brazil, Croatia, Estonia Observers to Council: Japan, Russia, United States of America;VHPC '19 3
  • 6. VHPC '19 6 40 million pictures per second 1PB/s Image credit: CERN
  • 7. Our Approach: Tool Chain and DevOps 7 • CERN’s requirements are no longer special • Small dedicated tools allowed for rapid validation & prototyping • Adapted our processes, policies and work flows to the tools • Join (and contribute to) existing communities
  • 8. The CERN Cloud Service 8 • Production since July 2013 - Several rolling upgrades since, now on Rocky - Many sub services deployed • Spans two data centers • Geneva • Budapest • Deployed using RDO + Puppet - Mostly upstream, patched where needed
  • 9. CPU Performance 9 • The benchmarks on full-node VMs was about 20% lower than the one of the underlying host - Smaller VMs much better • Investigated various tuning options - KSM*, EPT**, PAE, Pinning, … +hardware type dependencies - Discrepancy down to ~10% between virtual and physical • Comparison with Hyper-V: no general issue - Loss w/o tuning ~3% (full-node), <1% for small VMs - … NUMA-awareness! *KSM on/off: beware of memory reclaim! **EPT on/off: beware of expensive page table walks!
  • 10. NUMA • NUMA-awareness identified as most efficient setting • “EPT-off” side-effect • Small number of hosts, but very visible there • Use 2MB Huge Pages • Keep the “EPT off” performance gain with “EPT on” 20/06/2019 VHPC '19 10 VM Overhead Before Overhead After 4x 8 8% 2x 16 16% 1x 24 20% 5% 1x 32 20% 3%
  • 11. Bare Metal 20/06/2019 VHPC '19 11 • VMs not suitable for all of our use cases - Storage and database nodes, HPC clusters, boot strapping, critical network equipment or specialised network setups, precise/repeatable benchmarking for s/w frameworks, … • Complete our service offerings - Physical nodes (in addition to VMs and containers) - OpenStack UI as the single pane of glass • Simplify hardware provisioning workflows - For users: openstack server create/delete - For procurement & h/w provisioning team: initial on-boarding, server re-assignments • Consolidate accounting & bookkeeping - Resource accounting input will come from less sources - Machine re-assignments will be easier to track
  • 12. CERN OpenStack Infrastructure • Production since 2013 • Provides 90% of CERN IT Compute resources • Institution as well as physics services 20/06/2019 VHPC '19 12
  • 13. HTC vs HPC • CERN is mostly a high throughput computing lab: • File-based parallelism, massive batch system for data processing • But we have several HPC use-cases: • Beam simulations, plasma physics, CFD, QCD, (ASICs) • Need full POSIX consistency, fast parallel IO 13 20/06/2019 VHPC '19
  • 14. HTC vs HPC • CERN is mostly a high throughput computing lab: • File-based parallelism, massive batch system for data processing • But we have several HPC use-cases: • Beam simulations, plasma physics, CFD, QCD, (ASICs) • Need full POSIX consistency, fast parallel IO 14 20/06/2019 VHPC '19
  • 15. HTC vs HPC • CERN is mostly a high throughput computing lab: • File-based parallelism, massive batch system for data processing • But we have several HPC use-cases: • Beam simulations, plasma physics, CFD, QCD, (ASICs) • Need full POSIX consistency, fast parallel IO 15 20/06/2019 VHPC '19
  • 16. CERN Container Use Cases • Batch Processing • End user analysis / Jupyter Notebooks • Machine Learning / TensorFlow / Keras • Infrastructure Management • Data Movement, Web servers, PaaS … • Continuous Integration / Deployment • Run OpenStack :-) • And many others Credit: Ricardo Rocha, CERN Cloud 20/06/2019 VHPC '19 16
  • 17. HTC - Containers all the way down • HTCondor manages the HTC throughput queues • Singularity allows host OS to decouple from that required by experiment workload • Not all hardware supports old OSes now • Multiscience labs may require newer OS 20/06/2019 VHPC '19 17 Batch System Pilot job job job job Container
  • 18. Batch on Storage Services - BEER • EOS disk servers double as batch workers: a set of cores is reserved for EOS, the remaining cores are integrated into Condor and run jobs at low priority • Job memory and local scratch space are restricted by cgroups • Monitored by cadvisor/collectd • https://cds.cern.ch/record/2653012 20/06/2019 VHPC '19 18
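  As a hedged sketch of the cgroup-confinement idea (not the actual BEER configuration; the wrapper script and limits are placeholders), a backfill job could be started in a constrained transient unit:
  $ sudo systemd-run --unit=beer-backfill --slice=batch.slice \
        -p CPUShares=128 -p MemoryLimit=16G \
        /usr/local/bin/run_backfill_job.sh   # hypothetical wrapper around the Condor payload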
  • 19. Use Case: Spark on K8s Credit: CERN data analytics working group 20/06/2019 VHPC '19 19
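  As a rough illustration of the Spark-on-Kubernetes model this use case refers to (upstream spark-submit options; the API server address, image and jar path are placeholders, not the CERN deployment):
  $ spark-submit \
        --master k8s://https://<kube-apiserver>:6443 \
        --deploy-mode cluster \
        --name spark-pi \
        --class org.apache.spark.examples.SparkPi \
        --conf spark.executor.instances=10 \
        --conf spark.kubernetes.container.image=<registry>/spark:v2.4.0 \
        local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar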
  • 20. Use case: REANA / RECAST - Reusable Analysis Platform ● Workflow engine (Yadage) ● Each step runs as a Kubernetes Job ● Integrated monitoring & logging ● Centralized log collection ● "Rediscovering the Higgs" demo at KubeCon Credit: CERN Invenio User Group Workshop 20
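  For illustration only, a single workflow step materialized as a Kubernetes Job might look like the sketch below (names and image are placeholders; REANA/Yadage generate these objects on the user's behalf):
  $ cat <<EOF | kubectl apply -f -
  apiVersion: batch/v1
  kind: Job
  metadata:
    name: recast-step-1        # hypothetical step name
  spec:
    backoffLimit: 2
    template:
      spec:
        restartPolicy: Never
        containers:
        - name: step
          image: <registry>/analysis-step:latest   # placeholder image
          command: ["/bin/sh", "-c", "./run_step.sh"]
  EOF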
  • 21. Use case: Federated Kubernetes • Batch or other jobs on multiple clusters • Segment the datacenter • Burst compute capacity • Same deployment in all clouds Credit: Ricardo Rocha, CERN Cloud
  $ kubefed join --host-cluster-context … --cluster-context … atlas-recast-y
  $ openstack coe federation join cern-condor atlas-recast-x atlas-recast-y
  • 22. Use case: WebLogic on Kubernetes • Many Kubernetes clusters provisioned with OpenStack/Magnum • Manila shares backed by CephFS for shared storage • Central GitLab container registry • Keystone webhook for user AuthN Credit: Antonio Nappi, CERN IT-DB 20/06/2019 VHPC '19 22
  • 23. What is Magnum? An OpenStack API service that allows creation of container clusters. ● Use your keystone credentials ● You choose your cluster type ● Kubernetes ● Docker Swarm ● Mesos ● Single-tenant clusters ● Quickly create new clusters with advanced features such as multi-master 23
  • 24. CERN Magnum Deployment • In production since 2016 • Running OpenStack Rocky release • Working closely with upstream development • Slightly patched to adapt to the CERN network 20/06/2019 VHPC '19 24
  • 25. Magnum Cluster • A Magnum cluster is composed of: ● compute instances (virtual or physical) ● OpenStack Neutron networks ● security groups ● OpenStack Cinder block volumes ● other resources (e.g. a load balancer) ● OpenStack Heat to orchestrate the nodes • It is where your containers run • Lifecycle operations: ○ scale up/down ○ autoscale ○ upgrade ○ node heal/replace • A self-contained cluster with its own monitoring, data store and additional resources
  • 26. Why use Magnum? • Centrally managed self-service like GKE and AKS • Provide clusters to users with one-click deployment (or one API call) • Users don’t need to be system administrators • Accounting comes for free if you use quotas in your projects • Easy entrypoint to containers for new users • Control your users’ deployments • OS • Monitoring 20/06/2019 VHPC '19 26
  • 27. CERN Storage Integration • CSI CephFS • Provides an interface between a CSI-enabled Container Orchestrator and the Ceph cluster • Provisions and mounts CephFS volumes • Supports both the kernel CephFS client and the CephFS FUSE driver • https://github.com/ceph/ceph-csi • OpenStack Manila External Provisioner • Provisions new Manila shares, fetches existing ones • Maps them to Kubernetes PersistentVolume objects • Currently supports CephFS shares only (both in-tree CephFS plugin and csi-cephfs) • https://github.com/kubernetes/cloud-provider-openstack/tree/master/pkg/share/manila Detailed results at https://techblog.web.cern.ch/techblog/post/container-storage-cephfs-scale-part3/ 20/06/2019 VHPC '19 27 Credit: Robert Vasek, CERN Cloud
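  A minimal sketch of what consuming such storage looks like from the Kubernetes side (the storage class name "csi-cephfs" is an assumption for illustration; the real name depends on how ceph-csi or the Manila provisioner was deployed):
  $ cat <<EOF | kubectl apply -f -
  apiVersion: v1
  kind: PersistentVolumeClaim
  metadata:
    name: shared-data
  spec:
    accessModes: ["ReadWriteMany"]      # CephFS shares can be mounted by many pods
    storageClassName: csi-cephfs        # assumed class name
    resources:
      requests:
        storage: 100Gi
  EOF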
  • 28. io500 – entered for 2018 • Using CephFS on SSDs and Lazy IO, we made it onto the io500 list at #21 - https://www.vi4io.org/io500/start 20/06/2019 VHPC '19 28
  • 29. Bigbang Scale Tests • Bigbang scale tests mutually benefit CERN & Ceph project • Bigbang I: 30PB, 7200 OSDs, Ceph hammer. Several osdmap limitations • Bigbang II: Similar size, Ceph jewel. Scalability limited by OSD/MON messaging. Motivated ceph-mgr • Bigbang III: 65PB, 10800 OSDs 29 https://ceph.com/community/new-luminous-scalability/ 20/06/2019 VHPC '19
  • 30. CERN Storage Integration • CVMFS provides us with a massively scalable read-only file system • Static content like compiled applications and conditions data • Provides an interface between a CSI-enabled Container Orchestrator and the CERN application appliances • https://github.com/cernops/cvmfs-csi/ 20/06/2019 VHPC '19 30 Credit: Robert Vasek and Ricardo Rocha, CERN Cloud
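  A similar, purely illustrative sketch for CVMFS: a read-only claim against a CVMFS-backed storage class (the class name "csi-cvmfs-atlas" is hypothetical; see the cvmfs-csi repository above for the actual manifests):
  $ cat <<EOF | kubectl apply -f -
  apiVersion: v1
  kind: PersistentVolumeClaim
  metadata:
    name: atlas-software
  spec:
    accessModes: ["ReadOnlyMany"]       # CVMFS is a read-only file system
    storageClassName: csi-cvmfs-atlas   # hypothetical class name
    resources:
      requests:
        storage: 1Gi
  EOF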
  • 31. First Attempt – 1M requests/sec • 200 nodes • Found multiple limits: • Heat orchestration scaling • Authentication caches • Volume deletion • Site services VHPC '19 31 20/06/2019
  • 32. Second Attempt – 7M requests/sec • Fixes applied, scaled to 1000 nodes VHPC '19 32 20/06/2019
  Cluster Size (Nodes) | Concurrency | Deployment Time (min)
  2                    | 50          | 2.5
  16                   | 10          | 4
  32                   | 10          | 4
  128                  | 5           | 5.5
  512                  | 1           | 14
  1000                 | 1           | 23
  • 33. Node Groups • Define subclusters • Vary flavors: small/big VMs, bare metal • Vary availability zones • Improve redundancy 20/06/2019 VHPC '19 33 [Diagram: a master node group (m2.medium) plus worker node groups of m2.medium VMs, VMs with GPUs and physical workers, spread across availability zones (e.g. az-B)]
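  Once node groups are available, adding a dedicated group is expected to look roughly like the sketch below (cluster, flavor and role names are illustrative):
  $ openstack coe nodegroup create kube gpu-workers \
        --flavor g1.xlarge \
        --node-count 2 \
        --role gpu
  $ openstack coe nodegroup list kube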
  • 34. Auto Scaling https://github.com/kubernetes/autoscaler • Optimize resource usage • Dynamically resize the cluster based on the current number of pods and their required CPU/memory • New cloud provider for Magnum • Docs at autoscaler / cluster-autoscaler / cloudprovider / magnum • Merged PR: https://github.com/kubernetes/autoscaler/pull/1690 • Pods are created (manually or as an automatic response to higher load) • If any pods cannot be scheduled, the autoscaler provisions new nodes • Pods are scheduled on the new nodes and run until they are deleted • The autoscaler removes any unused nodes Slide Credit: Thomas Hartland, CERN Cloud 34
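  A hedged sketch of how autoscaling is switched on at cluster creation (label names as documented upstream for Magnum at the time; exact names and defaults may vary by release):
  $ openstack coe cluster create auto-k8s \
        --cluster-template kubernetes \
        --node-count 1 \
        --labels auto_scaling_enabled=true,min_node_count=1,max_node_count=10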
  • 35. More enhancements rolling out • Authentication using OpenStack Keystone • Native kubectl commands with cloud credentials • Choice of Ingress controller • Nginx or Traefik • Integrated monitoring with Prometheus • Rolling cluster upgrades for Kubernetes, operating system and add-ons • Integrated Node Problem Detector 20/06/2019 VHPC '19 35
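  Similarly, a hedged example of opting into these add-ons via cluster labels (label names per the upstream Magnum documentation; availability depends on the release in use):
  $ openstack coe cluster create k8s-addons \
        --cluster-template kubernetes \
        --labels keystone_auth_enabled=true,ingress_controller=nginx,monitoring_enabled=true,auto_healing_enabled=true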
  • 36. LHC Schedule • Run 3: ALICE and LHCb upgrades • Run 4: ATLAS and CMS upgrades [Timeline figure] 14/03/2019 Tim Bell 36
  • 37. [Timeline: Runs 1-3 with long shutdowns LS1-LS3, leading to the HL-LHC (Run 4) around 2026] • Raw data volume increases significantly for the High Luminosity LHC • A significant part of the cost comes from global operations • Even with a technology increase of ~15%/year, we still have a big gap if we keep trying to do things with our current compute models VHPC '19 37 20/06/2019
  • 38. Yearly data volumes 14/03/2019 Tim Bell 38
  • LHC (2016): 50 PB raw data
  • Google searches: 98 PB
  • Facebook uploads: 180 PB
  • LHC science data: ~200 PB
  • SKA Phase 1 (2023): ~300 PB/year science data
  • HL-LHC (2026): ~600 PB raw data, ~1 EB physics data
  • SKA Phase 2 (mid-2020's): ~1 EB science data
  • Google Internet archive: ~15 EB
  • 39. [Figure slide; credit: G. Lamanna, Annecy, 22/06/2019]
  • 41. Future of particle physics? • High Luminosity LHC until 2035: ten times more collisions than the original design • Studies in progress: • Compact Linear Collider (CLIC): up to 50 km long, linear e+e- collider with √s up to 3 TeV • Future Circular Collider (FCC): ~100 km circumference, new technology magnets for 100 TeV pp collisions in a 100 km ring; e+e- collider (FCC-ee) as a possible first step • European Strategy for Particle Physics: next update being prepared for 2020
  • 42. Conclusions • Many teams with diverse workloads -> Many clusters • Single resource pool with accounting and quota • Shared effort with other scientific communities via • Open source collaborations • Community special interest groups e.g. Scientific SIG • Common projects e.g. CERN/SKA collaborations • Containers are becoming a key technology • Rapid application deployment • Isolate workloads from underlying infrastructure • Preserve analysis for the future 20/06/2019 VHPC '19 42
  • 43. Further Information • CERN blogs • https://techblog.web.cern.ch/techblog/ • Recent Talks at OpenStack summits • https://www.openstack.org/videos/search?search=cern • Kubecon 2018, 2019 • Source code • https://github.com/cernops and https://github.com/openstack VHPC '19 4320/06/2019
  • 47. CERN Ceph Clusters (size / version) 20/06/2019 VHPC '19 47
  • OpenStack Cinder/Glance Production: 5.5PB, jewel
  • Satellite data centre (1000km away): 0.4PB, luminous
  • CephFS (HPC+Manila) Production: 0.8PB, luminous
  • Manila testing cluster: 0.4PB, luminous
  • Hyperconverged HPC: 0.4PB, luminous
  • CASTOR/XRootD Production: 4.2PB, luminous
  • CERN Tape Archive: 0.8PB, luminous
  • S3+SWIFT Production: 0.9PB, luminous
  • +5PB in the pipeline
  • 48. What to consider when running a container service ● Design your network ○ By default, magnum creates a private network per cluster and assigns floating IPs to nodes ○ LBaaS for multi-master clusters ● Run a container registry ○ DockerHub is usually up but latency will always get you ○ Rebuild or mirror the containers used by magnum ● Provide self-service clusters -> Provide software ○ Upgrade magnum regularly, update its configuration regularly ○ Plan which container and glance images are available to users 48
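  A hedged sketch of mirroring the infra containers via a cluster template label (the "container_infra_prefix" label exists upstream for this purpose; the registry URL, image and flavor names here are placeholders):
  $ openstack coe cluster template create kubernetes-local-registry \
        --coe kubernetes \
        --image fedora-atomic-latest \
        --external-network public \
        --flavor m1.small \
        --labels container_infra_prefix=registry.example.org/magnum/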
  • 49. Cluster Resize • Motivation: remove specific nodes from the cluster (replaces the update command) • Forward compatible with old clusters • Already in the upstream stable branch • ETA for upstream release: 12 April • api-reference: http://git.openstack.org/cgit/openstack/magnum/tree/api-ref/source/clusters.inc#n268 • Thanks to Feilong Wang 49
  • 50. Cluster Resize
  $ openstack coe cluster resize --nodegroup kube-worker kube 3
  Request to resize cluster kube has been accepted.
  $ openstack coe cluster list
  +--------------------------------------+------+---------+------------+--------------+--------------------+---------------+
  | uuid                                 | name | keypair | node_count | master_count | status             | health_status |
  +--------------------------------------+------+---------+------------+--------------+--------------------+---------------+
  | ed38e800-5884-4053-9b17-9f80995f1993 | kube | default | 3          | 1            | UPDATE_IN_PROGRESS | HEALTHY       |
  +--------------------------------------+------+---------+------------+--------------+--------------------+---------------+
  $ openstack coe cluster resize --nodegroup kube-worker --nodes-to-remove 05b7b307-18fd-459a-a13a-a1923c2c840d kube 1
  Request to resize cluster kube has been accepted.
  • 51. Node Groups
  $ openstack coe nodegroup list kube
  +--------------------------------------+-------------+-----------+------------+--------+
  | uuid                                 | name        | flavor_id | node_count | role   |
  +--------------------------------------+-------------+-----------+------------+--------+
  | 14ddaf00-9867-49ca-b10c-106c3656e4f1 | kube-master | m1.small  | 1          | master |
  | 8a18cc5c-040d-4e67-aa4d-9aaf38241119 | kube-worker | m1.small  | 1          | worker |
  +--------------------------------------+-------------+-----------+------------+--------+
  $ openstack coe nodegroup show kube kube-master
  +----------------+--------------------------------------+
  | Field          | Value                                |
  +----------------+--------------------------------------+
  | name           | kube-master                          |
  | cluster_id     | ed38e800-5884-4053-9b17-9f80995f1993 |
  | flavor_id      | m1.small                             |
  | node_addresses | [u'172.24.4.120']                    |
  | node_count     | 1                                    |
  | role           | master                               |
  | max_node_count | None                                 |
  | min_node_count | 1                                    |
  | is_default     | True                                 |
  +----------------+--------------------------------------+
  • 52. Authentication to OpenStack Keystone • Use OpenStack tokens directly in kubectl • Give kubectl access to users outside the cluster's OpenStack project • A better (more secure) option than the current TLS certificates
  $ openstack coe cluster create ... --labels keystone_auth_enabled=true
  $ export OS_TOKEN=$(openstack token issue -c id -f value)
  $ kubectl get pod
  • 53. Cluster Metrics Monitoring (Prometheus) • Objective: provide an out-of-the-box solution for cluster, node and application metrics monitoring • Services included: metrics scraping and storage (Prometheus), data visualization (Grafana), alarms (Alertmanager) • Upstream Prometheus Operator Helm chart Slide Credit: Diogo Guerra, CERN Cloud 53
  • 54. Cluster Upgrades • Upgrades of Kubernetes, operating system and add-ons • Rolling in-place upgrade • Rolling node replacement • Batch size for rolling upgrades • https://storyboard.openstack.org/#!/story/2002210 54
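  As a hedged sketch of the intended workflow (the CLI shape follows the upstream story above; the feature was still being rolled out at the time, and the template name is a placeholder):
  $ openstack coe cluster upgrade kube kubernetes-v1.15-template
  $ openstack coe cluster show kube -c status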
  • 55. More Add-ons • Ingress controllers: Traefik v1.7.x (can be used with Neutron-lbaas/Octavia or HostNetwork), Octavia 1.13.2-alpha or newer • Node Problem Detector: customizable detectors for node health • Pod Security Policy: two modes, privileged or restricted by default 55
  • 56. Magnum Deployment ● Clusters are described by cluster templates ● Shared/public templates for the most common setups, customizable by users
  $ openstack coe cluster template list
  +------+---------------+
  | uuid | name          |
  +------+---------------+
  | .... | swarm         |
  | .... | swarm-ha      |
  | .... | kubernetes    |
  | .... | kubernetes-ha |
  | .... | mesos         |
  | .... | mesos-ha      |
  | .... | dcos          |
  +------+---------------+
  • 57. Magnum Deployment ● Clusters are described by cluster templates ● Shared/public templates for the most common setups, customizable by users
  $ openstack coe cluster create --name my-k8s --cluster-template kubernetes --node-count 100
  ~ 5 mins later
  $ openstack coe cluster list
  +------+------+---------+------------+--------------+--------------------+---------------+
  | uuid | name | keypair | node_count | master_count | status             | health_status |
  +------+------+---------+------------+--------------+--------------------+---------------+
  | ...  | kube | default | 3          | 1            | UPDATE_IN_PROGRESS | HEALTHY       |
  +------+------+---------+------------+--------------+--------------------+---------------+
  $ $(openstack coe cluster config my-k8s --dir clusters/my-k8s --use-keystone)
  $ OS_TOKEN=$(openstack token issue -c id -f value)
  $ kubectl get ...
  • 58. Resource Provisioning: IaaS VHPC '19 58 • Based on OpenStack - Collection of open source projects for cloud orchestration - Started by NASA and Rackspace in 2010 - Grown into a global software community 20/06/2019
  • 59. NUMA roll-out • Rolled out on ~2'000 batch hypervisors (~6'000 VMs) • Huge page allocation as a boot parameter → reboot needed • VM NUMA awareness as flavor metadata → delete/recreate needed • Cell-by-cell (~200 hosts): • queue reshuffle to minimize resource impact • draining & deletion of batch VMs • hypervisor reconfiguration (Puppet) & reboot • recreation of batch VMs • Whole update took about 8 weeks, organized between the batch and cloud teams • No performance issue observed since 20/06/2019 VHPC '19 59 • Overhead before → after: 4x 8: 8%; 2x 16: 16%; 1x 24: 20% → 5%; 1x 32: 20% → 3%
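  A minimal sketch of exposing such settings through Nova flavor extra specs (the flavor name is illustrative; hw:numa_nodes and hw:mem_page_size are standard extra specs):
  $ openstack flavor set m2.2xlarge \
        --property hw:numa_nodes=2 \
        --property hw:mem_page_size=2MB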