20181219 ucc open stack 5 years v3

Clouds at CERN : A 5 year perspective
Utility and Cloud Computing Conference, December 19, 2018
Tim Bell
@noggin143UCC 2018 2

About Tim
• Responsible for Compute
and Monitoring in CERN
IT department
• Elected member of the
OpenStack Foundation
management board
• Member of the
OpenStack user
committee from 2013-
2015
UCC 2018 3

UCC 2018 4
CERNa
Worldwide
collaboration
CERN’s primary mission:
SCIENCE
Fundamental research on particle physics,
pushing the boundaries of knowledge and
technology

CERN
World’s largest
particle physics
laboratory
UCC 20185
Image credit: CERN

UCC 20186
The Large Hadron Collider: LHC
1232
dipole magnets
15 metres
35t EACH
27km
Image credit: CERN

Image credit: CERN
COLDER
TEMPERATURES
than outer space
( 120t He )
UCC 20187
LHC: World’s Largest Cryogenic System (1.9 K)

Vacuum?
• Yes
UCC 20188
LHC: Highest Vacuum
104 km
of PIPES
10-11bar (~ moon)
Image credit: CERN

Image credit: CERN
Image credit: CERN
UCC 20189
ATLAS, CMS, ALICE and LHCb
EIFFEL
TOWER
HEAVIER
than the
Image credit: CERN

UCC 2018 10
40 million
pictures
per second
1PB/s
Image credit: CERN

About the CERN IT Department
UCC 2018 11
Enable the laboratory to fulfill its mission
- Main data centre on Meyrin site
- Wigner data centre in Budapest (since 2013)
- Connected via three dedicated 100Gbs links
- Where possible, resources at both sites
(plus disaster recovery)
Drone footage of the CERN CC
About the CERN IT Department
UCC 2018
4
Enable the laboratory to fulfill its mission
- Main data centre on Meyrin site
- Wigner data centre in Budapest (since 2013)
- Connected via three dedicated 100Gbs links
- Where possible, resources at both sites
(plus disaster recovery)
Drone footage of the CERN CC
19/12/2018

Status: Service Level Overview
UCC 2018
12

Outline
UCC 2018
13
• Fabric Management before 2012
• The AI Project
• The three AI areas
- Configuration Management
- Monitoring
- Resource provisioning
• Review

CERN IT Tools up to 2011 (1)
UCC 2018
14
• Developed in series of EU funded projects
- 2001-2004: European DataGrid
- 2004-2010: EGEE
• Work package 4 – Fabric management:
“Deliver a computing fabric comprised of all the necessary tools to
manage a centre providing grid services on clusters of thousands of
nodes.”

CERN IT Tools up to 2011 (2)
UCC 2018
15
• The WP4 software was developed from scratch
- Scale and experience needed for LHC Computing was special
- Config’ mgmt, monitoring, secret store, service status, state mgmt, service databases, …
LEMON – LHC Era Monitoring
- client/server based monitoring
- local agent with sensors
- samples stored in a cache & sent to server
- UDP or TCP, w/ or w/o encryption
- support for remote entities
- system administration toolkit
- automated installation, configuration &
management of clusters
- clients interact with a configuration
database (CMDB) & and an installation
infrastructure (AII)
Around 8’000 servers managed!

2012: A Turning Point for CERN IT
UCC 2018
16
• EU projects finished in 2010: decreasing development and support
• LHC compute and data requirements increasing
- Moore’s law would help, but not enough
• Staff would not grow with managed resources
- Standardization & automation, current tools not apt
• Other deployments have surpassed the CERN one
- Mostly commercial companies like Google, Facebook, Rackspace, Amazon, Yahoo!, …
- We were no longer special! Can we profit?
0
20
40
60
80
100
120
140
160
Run 1 Run 2 Run 3 Run 4
GRID
ATLAS
CMS
LHCb
ALICE
we are
here
what we
can afford
LS1 (2013) ahead, next window for change would only open in 2019 …
2012

UCC 2018
17
How we began …
• Formed a small team of service managers from …
- Large services (e.g. batch, plus)
- Existing fabric services (e.g. monitoring)
- Existing virtualization service
• ... to define project goals
- What issues do we need to address?
- What forward looking features do we need?
http://iopscience.iop.org/article/10.1088/1742-6596/396/4/042002/pdf

Agile Infrastructure Project Goals
UCC 2018
18
New data centre support
- Overcome limits of CC in Meyrin
- Disaster recovery and business continuity
- ‘Smart hands’ approach
1

UCC 2018
19
Sustainable tool support
- Tools to be used at our scale need maintenance
- Tools with a limited community require more time for
newcomers to become productive and are less valuable
for the time after (transferable skills)
2

UCC 2018
20
Improve user response time
- Reduce the resource provisioning time span
(current virtualization service reached scaling limits)
- Self-service kiosk
3

UCC 2018
21
Enable cloud interfaces
- Experiments already started to use EC2
- Enable libraries such as Apache’s libcloud
4

UCC 2018
22
Precise monitoring and
accounting
- Enable timely monitoring for debugging
- Showback usage to the cloud users
- Consolidate accounting data for usage of CPU, network,
storage … across batch, physical nodes and grid
resources
5

UCC 2018
23
Improve resource
efficiency
- Adapt provisioned resources to services’ needs
- Streamline the provisioning workflows
(e.g. burn-in, repair or retirement)
6

Our Approach: Tool Chain and DevOps
UCC 2018
24
• CERN’s requirements are no longer special!
• A set of tools emerged when looking at other places
• Small dedicated tools
allowed for rapid validation &
prototyping
• Adapted our processes,
policies and work flows
to the tools!
• Join (and contribute to)
existing communities!

IT Policy Changes for Services
UCC 2018
25
• Services shall be virtual …
- Within reason
- Exceptions are costly!
• Puppet managed, and …
• … monitored!
- (Semi-)automatic with Puppet
Decrease provisioning time
Increase resource efficiency
Simplify infrastructure mgmt
Profit from others’ work
Speed up deployment
‘Automatic’ documentation
Centralized monitoring
Integrated alarm handling

UCC 2018
26
Tools + Policies:
Sounds simple!
From tools to services is complex!
- Integration w/ sec services?
- Incident handling?
- Request work flows?
- Change management?
- Accounting and charging?
- Life cycle management?
- … Image: Subbu Allamaraju

Public Procurement Timelines
UCC 2018 27

Resource Provisioning: IaaS
UCC 2018
28
• Based on OpenStack
- Collection of open source projects for cloud orchestration
- Started by NASA and Rackspace in 2010
- Grown into a global software community

The CERN Cloud Service
UCC 2018
30
• Production since July 2013
- Several rolling upgrades since,
now on Rocky
- Many sub services deployed
• Spans two data centers
- One region, one API entry point
• Deployed using RDO + Puppet
- Mostly upstream, patched where needed
• Many sub services run on VMs!
- Boot strapping

Agility in the Cloud
UCC 2018
32
• Use case spectrum
- Batch service (physics analysis)
- IT services (built on each other)
- Experiment services (build)
- Engineering (chip design)
- Infrastructure (hotel, bikes)
- Personal (development)
• Hardware spectrum
- Processor archs (features, NUMA, …)
- Core-to-RAM ratio (1:2, 1:3, 1:5, …)
- Core-to-disk ratio (2x or 4x SSDs)
- Disk layout (2, 3, 4, mixed)
- Network (1/10GbE, FC, domain)
- Location (DC, power)
- SLC6, CC7, RHEL, Windows
- …

What about our initial goals?
UCC 2018
33
• The remote DC is seamlessly
integrated
- No difference from provisioning PoV
- Easily accessible by users
- Local DC limits overcome (business continuity?)
• Sustainable tools
- Number of managed machines has multiplied
- Good collaboration with upstream communities
- Newcomers know tools, can use knowledge
afterwards
• Provisioning time span is ~minutes
- Was several months before
- Self-service kiosk with automated workflows
• Cloud interfaces
- Good OpenStack adoption, EC2 support
• Flexible monitoring infra
- Automatic in for simple cases
- Powerful tool set for more complex ones
- Accounting for local and grid resources
• Increased resource efficiency
- ‘Packing’ of services
- Overcommit
- Adapted to services’ needs
- Quick draining & back filling
So … 100% success?

Cloud Architecture Overview
UCC 2018
34
• Top and child cells for scaling
- API, DB, MQ, Compute nodes
- Remote DC is set of cells
• Nova HA only on top cell
- Simplicity vs impact
• Other projects global
- Load balanced controllers
- RabbitMQ clusters
• Three Ceph instances
- Volumes (Cinder), images (Glance), shares (Manila)

Tech. Challenge: Scaling
• OpenStack Cells provides composable units
• Cells V1 – Special custom developments
• Cells V2 – Now the standard deployment model
• Broadcast vs Targetted queries
• Handling down cells
• Quota
• Academic and scientific instances push the
limits
• Now many enterprise clouds above 1000
hypervisors
• CERN running 73 Cells in production
UCC 2018 36
https://www.openstack.org/analytics

Tech. Challenge: CPU Performance
UCC 2018
37
• The benchmarks on full-node VMs was about 20% lower
than the one of the underlying host
- Smaller VMs much better
• Investigated various tuning options
- KSM*, EPT**, PAE, Pinning, … +hardware type dependencies
- Discrepancy down to ~10% between virtual and physical
• Comparison with Hyper-V: no general issue
- Loss w/o tuning ~3% (full-node), <1% for small VMs
- … NUMA-awareness!
*KSM on/off: beware of memory reclaim! **EPT on/off: beware of expensive page table walks!

CPU Performance: NUMA
UCC 2018
38
• NUMA-awareness identified as most
efficient setting
• “EPT-off” side-effect
- Small number of hosts, but very
visible there
• Use 2MB Huge Pages
- Keep the “EPT off” performance gain
with “EPT on”

NUMA roll-out
UCC 2018
39
• Rolled out on ~2’000 batch hypervisors (~6’000 VMs)
- HP allocation as boot parameter  reboot
- VM NUMA awareness as flavor metadata  delete/recreate
• Cell-by-cell (~200 hosts):
- Queue-reshuffle to minimize resource impact
- Draining & deletion of batch VMs
- Hypervisor reconfiguration (Puppet) & reboot
- Recreation of batch VMs
• Whole update took about 8 weeks
- Organized between batch and cloud teams
- No performance issue observed since
VM Before After
4x 8 8%
2x 16 16%
1x 24 20% 5%
1x 32 20% 3%

Tech. Challenge: Under used resources
UCC 2018 40

VM Expiry
UCC 2018 41
• Each personal instance will have an expiration date
• Set shortly after creation and evaluated daily
• Configured to 180 days, renewable
• Reminder mails starting 30 days before expiration

Expiry results
UCC 2018 42
• Results exceeded
expectations
• Expired
• >1000 VMs
• >3000 cores

Tech. Challenge: Bare Metal
UCC 2018 43
• VMs not suitable for all of our use cases
- Storage and database nodes, HPC clusters, boot strapping,
critical network equipment or specialised network setups,
precise/repeatable benchmarking for s/w frameworks, …
• Complete our service offerings
- Physical nodes (in addition to VMs and containers)
- OpenStack UI as the single pane of glass
• Simplify hardware provisioning workflows
- For users: openstack server create/delete
- For procurement & h/w provisioning team: initial on-boarding, server re-assignments
• Consolidate accounting & bookkeeping
- Resource accounting input will come from less sources
- Machine re-assignments will be easier to track

Adapt the Burn In process
• “Burn-in” before acceptance
- Compliance with technical spec (e.g. performance)
- Find failed components (e.g. broken RAM)
- Find systematic errors (e.g. bad firmware)
- Provoke early failing due to stress
- Tests include
- CPU: burnK7, burnP6, burnMMX (cooling)
- RAM: memtest, Disk: badblocks
- Network: iperf(3) between pairs of nodes
- automatic node pairing
- Benchmarking: HEPSpec06 (& fio)
- derivative of SPEC06
- we buy total compute capacity (not newest processors)
UCC 2018 44

Exploiting cloud services for burn in
UCC 2018 45

Tech. Challenge: Containers
UCC 2018 46
An OpenStack API Service that allows creation of container
clusters
● Use your OpenStack credentials, quota and roles
● You choose your cluster type
● Multi-Tenancy
● Quickly create new clusters with advanced features
such as multi-master
● Integrated monitoring and CERN storage access
● Making it easy to do the right thing

Scale Testing using Rally
• An Openstack benchmark test tool
• Easily extended by plugin
• Test result in HTML reports
• Used by many projects
• Context: set up environment
• Scenario: run benchmark
• Recommended for a production
service
to verify that the service behaves as
expected at all time
UCC 2018 47
Kubernetes
Cluster
pods,
contai
ners
Rally
report

First Attempt – 1M requests/Seq
• 200 Nodes
• Found multiple limits
• Heat Orchestration scaling
• Authentication caches
• Volume deletion
• Site services
UCC 2018 48

Second Attempt – 7M requests/Seq
• Fixes and scale to 1000 Nodes
UCC 2018 49
Cluster Size
(Nodes)
Concurrency Deployment
Time (min)
2 50 2.5
16 10 4
32 10 4
128 5 5.5
512 1 14
1000 1 23

Tech. Challenge: Meltdown
UCC 2018 50
• In January 2018, a security vulnerability was
disclosed a new kernel everywhere
• Staged campaign
• 7 reboot days, 7 tidy up days
• By availability zone
• Benefits
• Automation now to reboot the cloud if needed -
33,000 VMs on 9,000 hypervisors
• Latest QEMU and RBD user code on all VMs
• Then L1TF came along
• And we had to do it all again......
06/06/2018

UCC 2018 51
First run LS1 Second run Third run LS3 HL-LHC Run4
…2009 2013 2014 2015 2016 2017 201820112010 2012 2019 2023 2024 2030?20212020 2022 …2025
LS2
 Significant part of cost comes
from global operations
 Even with technology increase of
~15%/year, we still have a big
gap if we keep trying to do things
with our current compute models
Raw data volume
increases significantly
for High Luminosity LHC
2026

Non-Technical Challenges (1)
UCC 2018
53
• Agile Infrastructure Paradigm Adoption
- ‘VMs are slower than physical machines.’
- ‘I need to keep control on the full stack.’
- ‘This would not have happened with physical machines.’
- ‘It’s the cloud, so it should be able to do X!’
- ‘Using a config’ management tool is too dangerous!’
- ‘They are my machines’

UCC 2018
54
• Agility can bring great benefits …
• … but mind (adapted) Hooke’s Law!
- Avoid irreversible deformations
• Ensure the tail is moving as well as
the head
- Application support
- Cultural changes
- Workflow adoption
- Open source community culture can help

• Contributor License Agreements
• Patches needed but merges/review time
• Regular staff changes limits Karma
• Need to be a polyglot
• Python, Ruby, Go, … and legacy Perl etc.
• Keep riding the release wave
• Avoid the end-of-life scenarios
UCC 2018 55

Ongoing Work Areas
• Spot Market / Pre-emptible instances
• Software Defined Networking
• Regions
• GPUs
• Containers on Bare Metal
• …
UCC 2018 56

Summary
UCC 2018
57
Positive results after 5 years into the project!
- LHC needs met without additional staff
- Tools and workflows widely adopted and accepted
- Many technical challenges were mastered and returned upstream
- Integration with open source communities successful
- Use of common tools increased CERN’s attraction of talents
Further enhancements in function & scale needed for HL-LHC

Further Information
• CERN information outside the auditorium
• Jobs at CERN – wide range of options
• http://jobs.cern
• CERN blogs
• http://openstack-in-production.blogspot.ch
• https://techblog.web.cern.ch/techblog/
• Recent Talks at OpenStack summits
• https://www.openstack.org/videos/search?search=cern
• Source code
• https://github.com/cernops and https://github.com/openstack
UCC 2018 58

Agile Infrastructure Core Areas
UCC 2018
61
• Resource provisioning (IaaS)
- Based on OpenStack
• Centralized Monitoring
- Based on Collectd (sensor) + ‘ELK’ stack
• Configuration Management
- Based on Puppet

Configuration Management
UCC 2018
62
• Client/server architecture
- ‘agents’ running on hosts plus horizontally scalable ‘masters’
• Desired state of hosts described in ‘manifests’
- Simple, declarative language
- ‘resource’ basic unit for system modeling, e.g. package or service
• ‘agent’ discovers system state using ‘facter’
- Sends current system state to masters
• Master compiles data and manifests into ‘catalog’
- Agent applies catalog on the host

Status: Config’ Management (1)
UCC 2018
63
(virtual and physical, private and public cloud)
(‘base’ is what every Puppet node gets)
(compilations are spread out)
(this number includes dev changes)
(number Puppet code committers)

UCC 2018
64

UCC 2018
65
• Changes to QA are
announced publicly
• QA duration: 1 week
• All Service Managers
can stop a change!

Monitoring: Scope
UCC 2018
66
Data Centre Monitoring
• Two DCs at CERN and Wigner
• Hardware, O/S, and services
• PDUs, temp sensors, …
• Metrics and logs
Experiment Dashboards
- WLCG Monitoring
- Sites availability, data transfers,
job information, reports
- Used by WLCG, experiments,
sites and users

UCC 2018
67
Status: (Unified) Monitoring (1)
• Offering: monitor, collect, aggregate, process,
visualize, alarm … for metrics and logs!
• ~400 (virtual) servers, 500GB/day, 1B docs/day
- Mon data management from CERN IT and WLCG
- Infrastructure and tools for CERN IT and WLCG
• Migrations ongoing (double maintenance)
- CERN IT: From Lemon sensor to collectd
- WLCG: From former infra, tools, and dashboards

Status: (Unified) Monitoring (2)
UCC 2018
68
Kafka cluster
(buffering) *
Processing
Data enrichment
Data aggregation
Batch Processing
Transport
FlumeKafkasink
Flumesinks
FTS
Data
Sources
Rucio
XRootD
Jobs
…
Lemon
syslog
app log
DB
HTTP
feed
AMQ
Flume
AMQ
Flume
DB
Flume
HTTP
Flume
Log
GW
Flume
Metric
GW
Logs
Lemon
metrics
HDFS
Elastic
Search
…
Storage &
Search
Others
(influxdb)
Data
Access
CLI, API
User
Views
User
Jobs
User
Data
Today: > 500 GB/day, 72h buffering

20181219 ucc open stack 5 years v3

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to 20181219 ucc open stack 5 years v3

Similar to 20181219 ucc open stack 5 years v3 (20)

More from Tim Bell

More from Tim Bell (20)

Recently uploaded

Recently uploaded (20)

20181219 ucc open stack 5 years v3

Editor's Notes