This document summarizes Tim Bell's presentation on OpenStack at CERN. It discusses how CERN began adopting OpenStack in 2011 to manage its growing computing infrastructure needs for processing massive data sets from the Large Hadron Collider. OpenStack has since been scaled up to manage over 300,000 CPU cores and 500,000 physics jobs per day across CERN's private cloud. The document also briefly outlines CERN's use of other open source technologies like Ceph and Kubernetes.
10 Years of OpenStack at CERN - From 0 to 300k cores - Belmiro Moreira
CERN, the European Laboratory for Particle Physics, provides the infrastructure and resources to thousands of scientists all around the world to uncover the mysteries of the Universe. In the quest to build a private cloud infrastructure to support its users, CERN started evaluating the OpenStack project early, building several prototypes and engaging with the community. Finally, in 2013 CERN released its production cloud infrastructure using OpenStack. Since then we have moved from a few hundred cores to a multi-cell deployment spread between different regions. After 7 years of deploying and managing OpenStack in production at large scale, we now look back and discuss the challenges of building a massive scale infrastructure from 0 to +300K cores. In this talk we will dive into the history, architecture, tools and technical decisions behind the CERN Cloud Infrastructure over the years.
Multi-Cell OpenStack: How to Evolve Your Cloud to Scale - November, 2014 - Belmiro Moreira
Multi-Cell OpenStack: How to Evolve Your Cloud to Scale
OpenStack Design Summit, Paris - November, 2014
Belmiro Moreira - CERN
Matt Van Winkle - Rackspace
Sam Morrison - NeCTAR, University of Melbourne
Containers on Baremetal and Preemptible VMs at CERN and SKA - Belmiro Moreira
CERN, the European Organization for Nuclear Research, and SKA, the Square Kilometre Array, are preparing the next generation of research infrastructure for new large-scale scientific instruments that will produce data of new magnitudes. At the Sydney OpenStack Summit we presented the collaboration and the platform that we plan to develop for scaling science.
In this talk we will present the work done on Preemptible VMs and Containers on Baremetal.
Preemptible VMs are instances that use idle allocated resources in the infrastructure and can be terminated when this capacity is required. Containers on baremetal eliminate the virtualization overhead, enabling the full container performance required for scientific workloads.
We will present the current state, development and integration decisions and how these functionalities can be used in a common OpenStack infrastructure.
CERN, the European Organization for Nuclear Research, has for several years been running a large OpenStack cloud that helps thousands of scientists to analyze the data from the LHC.
In 2012, early in the design phase of the CERN Cloud, we decided to use Nova Cells to enable the infrastructure to scale to thousands of nodes. Now, with more than 280K cores spread across 70 cells hosted in two data centres, we were faced with the challenge of migrating to Nova Cells V2, required in the Pike release.
In this presentation, we will describe how Nova Cells allowed CERN to scale to thousands of nodes, its advantages, and how we mitigated the implementation issues of Nova Cells V1. Next, we will cover how we upgraded Nova from Newton with Cells V1 to Pike with Cells V2. We will explain the steps that we followed and the issues that we faced during the upgrade. Finally, we will report our experience with Cells V2 at scale, its caveats and how we are mitigating them.
What can I expect to learn?
This presentation describes how CERN migrated from Cells V1 to Cells V2 when upgrading from the Newton to the Pike release.
You will learn the procedures followed by CERN in order to migrate from Cells V1 to Cells V2 in a large production environment.
The issues found during the upgrade and how we mitigated them will be discussed.
Also, we will present how Cells V2 behaves in a large-scale deployment with several thousand nodes in 70 cells.
CERN is the European Centre for Particle Physics based in Geneva. The home of the Large Hadron Collider and the birthplace of the world wide web is expanding its computing resources with a second data centre to process over 35PB/year from one of the largest scientific experiments ever constructed.
Within the constraints of fixed budget and manpower, agile computing techniques and common open source tools are being adopted to support over 11,000 physicists in their search for how the universe works and what it is made of.
By challenging special requirements and understanding how other large computing infrastructures are built, we have deployed a 50,000-core cloud-based infrastructure built on tools such as Puppet, OpenStack and Kibana.
In moving to a cloud model, this has also required close examination of the IT processes and culture. Finding the right approach between Enterprise and DevOps techniques has been one of the greatest challenges of this transformation.
This talk will cover the requirements, tools selected, results achieved so far and the outlook for the future.
Learning to Scale Openstack: A Case Study in Rackspace's Open Cloud Deployment was presented at the OpenStack Design Summit in Portland, OR on April 17, 2013. Watch the recording of the presentation on YouTube at the following link: http://www.youtube.com/watch?v=3x8X6f5mnzc
Demonstrating a Pre-Exascale, Cost-Effective Multi-Cloud Environment for Scie... - Igor Sfiligoi
Presented at PEARC20.
This talk presents expanding IceCube's production HTCondor pool using cost-effective GPU instances in preemptible mode from the three major cloud providers, namely Amazon Web Services, Microsoft Azure and the Google Cloud Platform. Using this setup, we sustained about 15k GPUs for a whole workday, corresponding to around 170 PFLOP32s, integrating over one EFLOP32-hour worth of science output for a price tag of about $60k. In this paper, we provide the reasoning behind the cloud instance selection, a description of the setup and an analysis of the provisioned resources, as well as a short description of the actual science output of the exercise.
Slides from our Q3 meetup held in Montreal on September 27th 2017 at the Cloud.ca Center.
Video recording can be seen at: https://www.youtube.com/watch?v=_1btwHW39ms&list=PLSsQodeQD6LPyqrvvczcC5mkOOnPt469o
Overlay Opportunistic Clouds in CMS/ATLAS at CERN: The CMSooooooCloud in Detail - Jose Antonio Coarasa Perez
Overlay opportunistic clouds in CMS/ATLAS at CERN: The CMSooooooCloud in detail
The CMS and ATLAS online clusters consist of more than 3000 computers each. They have been exclusively used for the data acquisition that led to the Higgs particle discovery, handling 100Gbytes/s data flows and archiving 20Tbytes of data per day.
An OpenStack cloud layer has been deployed on the newest part of the clusters (totalling 1300 hypervisors and more than 13000 cores in CMS alone) as a minimal overlay so as to leave the primary role of the computers untouched while allowing an opportunistic usage of the cluster.
This presentation will show how to share resources with a minimal impact on the existing infrastructure. We will present the architectural choices made to deploy an unusual, as opposed to dedicated, "overlaid cloud infrastructure". These architectural choices ensured a minimal impact on the running cluster configuration while giving a maximal segregation of the overlaid virtual computer infrastructure. The use of Open vSwitch to avoid changes on the network infrastructure and encapsulate the virtual machines' traffic will be illustrated, as well as the networking configuration adopted due to the nature of our private network. The design and performance of the OpenStack cloud controlling layer will be presented. We will also show the integration carried out to allow the cluster to be used in an opportunistic way while giving full control to the CMS online run control.
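For a flavor of the overlay technique described above, here is a minimal sketch of encapsulating VM traffic with Open vSwitch over a GRE tunnel between two hypervisors; the bridge name, port name and peer address are illustrative assumptions, not the CMS configuration:
$ ovs-vsctl add-br br-int    # integration bridge carrying the VM traffic
$ ovs-vsctl add-port br-int gre0 -- set interface gre0 type=gre options:remote_ip=192.0.2.2    # tunnel VM traffic to the peer hypervisor without touching the physical network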
Overview of what has happened in HNSciCloud over the last five months, delivered by Helge Meinhard of CERN at the HEPiX Workshop on October 21st, 2016, in Berkeley, California, USA.
Burst data retrieval after 50k GPU Cloud run - Igor Sfiligoi
We ran a 50k-GPU multi-cloud simulation to support IceCube science. This talk provided an overview of what happened to the associated data.
Presented at the Internet2 booth at SC19.
For IceCube, a large amount of photon propagation simulation is needed to properly calibrate the natural ice. The simulation is compute intensive and ideal for GPU compute. This cloud run was more data intensive than previous ones, producing 130 TB of output data. To keep egress costs in check, we created dedicated network links via the Internet2 Cloud Connect Service.
The CMS openstack, opportunistic, overlay, online-cluster Cloud (CMSooooCloud) - Jose Antonio Coarasa Perez
The CMS online cluster consists of more than 3000 computers. It has been exclusively used for the Data Acquisition of the CMS experiment at CERN, archiving around 20Tbytes of data per day.
An OpenStack cloud layer has been deployed on part of the cluster (totalling more than 13000 cores) as a minimal overlay so as to leave the primary role of the computers untouched while allowing an opportunistic usage of the cluster. This allows running offline computing jobs on the online infrastructure while it is not (fully) used.
We will present the architectural choices made to deploy an unusual, as opposed to dedicated, "overlaid cloud infrastructure". These architectural choices ensured a minimal impact on the running cluster configuration while giving a maximal segregation of the overlaid virtual computer infrastructure. Open vSwitch was chosen during the proof-of-concept phase in order to avoid changes on the network infrastructure. Its use will be illustrated as well as the final networking configuration used. The design and performance of the OpenStack cloud controlling layer will also be presented, together with new developments and experience from the first year of usage.
Experience of Running Spark on Kubernetes on OpenStack for High Energy Physic... - Databricks
The physicists at CERN are increasingly turning to Spark to process large physics datasets in a distributed fashion, with the aim of reducing time-to-physics with increased interactivity. The physics data itself is stored in CERN's mass storage system, EOS, and CERN's IT department runs an on-premises private cloud based on OpenStack as a way to provide on-demand compute resources to physicists. This presents both opportunities and challenges to the Big Data team at CERN in providing elastic, scalable, reliable Spark-as-a-service on OpenStack.
The talk focuses on the design choices made and the challenges faced while developing Spark-as-a-service over Kubernetes on OpenStack to simplify provisioning, automate management, and minimize the operating burden of managing Spark clusters. In addition, the service tooling simplifies submitting applications on behalf of users, mounting user-specified ConfigMaps, copying application logs to S3 buckets for troubleshooting, performance analysis and accounting of Spark applications, and support for stateful Spark streaming applications. We will also share results from running large-scale sustained workloads over terabytes of physics data.
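For context, this is roughly what submitting a Spark application to Kubernetes looks like with upstream spark-submit; a minimal sketch, where the API server address, container image and application jar are placeholder assumptions rather than CERN's actual endpoints:
$ spark-submit \
    --master k8s://https://<k8s-api-server>:6443 \
    --deploy-mode cluster \
    --name physics-analysis \
    --conf spark.executor.instances=10 \
    --conf spark.kubernetes.container.image=<spark-image> \
    local:///opt/spark/examples/jars/spark-examples.jar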
CERN, the European Organization for Nuclear Research, is one of the world's largest centres for scientific research. Its business is fundamental physics, finding out what the universe is made of and how it works. At CERN, accelerators such as the 27km Large Hadron Collider are used to study the basic constituents of matter. This talk reviews the challenges to record and analyse the 25 Petabytes/year produced by the experiments and the investigations into how OpenStack could help to deliver a more agile computing infrastructure.
CERN and Huawei, one of the companies that is working with T-Systems to develop a prototype for HNSciCloud, in the context of CERN openlab, will jointly work on improvements to OpenStack for running large scale scientific workloads.
This collaboration was announced at the CERN openlab open day on 21st September 2017.
Over 90% of CERN’s compute resources are delivered using OpenStack. OpenStack provides software for private and public clouds, and CERN and Huawei are among the major contributors to the open source project (Huawei is a platinum member of the OpenStack Foundation).
With the needs of LHC computing in future years, efficient and flexible delivery of compute resources will be key. That's why CERN and Huawei have joined forces to jointly work on improvements to OpenStack for running large scale scientific workloads.
The developments will be done within the OpenStack community following the standard open source processes.
Focus areas will be:
Flexible resource management
Quotas
Bare metal allocation
Compute cells
Changes, resulting from this activity, will then be included into the CERN private cloud and Huawei’s private and public cloud offerings.
This is a presentation by Prof. Anne Elster at the International Workshop on Open Source Supercomputing held in conjunction with the 2017 ISC High Performance Computing Conference.
How HPC and large-scale data analytics are transforming experimental science - inside-BigData.com
In this deck from DataTech19, Debbie Bard from NERSC presents: Supercomputing and the scientist: How HPC and large-scale data analytics are transforming experimental science.
"Debbie Bard leads the Data Science Engagement Group NERSC. NERSC is the mission supercomputing center for the USA Department of Energy, and supports over 7000 scientists and 700 projects with supercomputing needs. A native of the UK, her career spans research in particle physics, cosmology and computing on both sides of the Atlantic. She obtained her PhD at Edinburgh University, and has worked at Imperial College London as well as the Stanford Linear Accelerator Center (SLAC) in the USA, before joining the Data Department at NERSC, where she focuses on data-intensive computing and research, including supercomputing for experimental science and machine learning at scale."
Watch the video: https://wp.me/p3RLHQ-kLV
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Tech... - Databricks
In this session, you will learn how CERN easily applied end-to-end deep learning and analytics pipelines on Apache Spark at scale for High Energy Physics using BigDL and Analytics Zoo open source software running on Intel Xeon-based distributed clusters.
Technical details and development learnings will be shared using an example of topology classification to improve real-time event selection at the Large Hadron Collider experiments. The classifier has demonstrated very good performance figures for efficiency, while also reducing the false positive rate compared to the existing methods. It could be used as a filter to improve the online event selection infrastructure of the LHC experiments, where one could benefit from a more flexible and inclusive selection strategy while reducing the amount of downstream resources wasted in processing false positives.
This is part of CERN’s research on applying Deep Learning and Analytics using open source and industry standard technologies as an alternative to the existing customized rule based methods. We show how we could quickly build and implement distributed deep learning solutions and data pipelines at scale on Apache Spark using Analytics Zoo and BigDL, which are open source frameworks unifying Analytics and AI on Spark with easy to use APIs and development interfaces seamlessly integrated with Big Data Platforms.
HNSciCloud shared with the IT experts in scientific computing of the HEPiX forum the status of the ongoing Pre-Commercial Procurement of innovative cloud services and what the expected results might be.
High Performance Cyberinfrastructure Enabling Data-Driven Science in the Biom... - Larry Smarr
11.04.06
Joint Presentation
UCSD School of Medicine Research Council
Larry Smarr, Calit2 & Phil Papadopoulos, SDSC/Calit2
Title: High Performance Cyberinfrastructure Enabling Data-Driven Science in the Biomedical Sciences
Academic research institutions are at a precipice. They have historically been constrained to supporting classic “job” style workloads. With the growth of new workflow practices such as streaming data, science gateways, and more “dynamic” research using lambda-like functions, they must now support a variety of workloads.
In this talk, Lindsey and Bob will discuss some difficulties faced by academic institutions and how Kubernetes offers an extensible solution to support the future of research. They will present a selection of projects currently benefiting from Kubernetes enabled tools, like Argo, Kubeflow, and kube-batch. These workflows will be demonstrated using specific examples from two large research institutions: Compute Canada, Canada’s national computation research consortium and the University of Michigan, one of the largest public Universities in the United States.
KubeCon EU 2019
A “meta‑cloud” for building clouds
Build your own cloud on our hardware resources
Agnostic to specific cloud software
Run existing cloud software stacks (like OpenStack, Hadoop, etc.)
... or new ones built from the ground up
Control and visibility all the way to the bare metal
“Sliceable” for multiple, isolated experiments at once
Review of CERN's objectives and how the computing infrastructure is evolving to address the challenges at scale using community supported software such as Puppet and OpenStack.
2. Grappling with Massive Data Sets
Gavin McCance, CERN IT - Digital Energy 2018, 1 May 2018 | Aberdeen
OpenStack at CERN: A 5 year perspective
Tim Bell - tim.bell@cern.ch - @noggin143
OpenStack Days Budapest 2018
3. About Me - @noggin143
• Responsible for Compute and Monitoring at CERN
• Elected member of the OpenStack Foundation board
• Member of the OpenStack user committee from 2013-2015
4. CERN: a worldwide collaboration
CERN’s primary mission: SCIENCE
Fundamental research on particle physics, pushing the boundaries of knowledge and technology
6. Evolution of the Universe
Test the Standard Model? What’s matter made of? What holds it together? Anti-matter? (Gravity?)
7. The Large Hadron Collider: LHC
27km ring, 1232 dipole magnets, 15 metres and 35t EACH
Image credit: CERN
9. LHC: Highest Vacuum
Vacuum? Yes.
104 km of PIPES at 10^-11 mbar (~ the Moon)
Image credit: CERN
10. ATLAS, CMS, ALICE and LHCb
HEAVIER than the EIFFEL TOWER
Image credit: CERN
11. 40 million pictures per second, 1PB/s
Image credit: CERN
12. Data Flow to Storage and Processing (Run 2 → CERN DC)
ALICE: 4GB/s
ATLAS: 1GB/s
CMS: 600MB/s
LHCb: 750MB/s
13. CERN Data Centre: Primary Copy of LHC Data
Data Centre on Google Street View
90k disks, 15k servers, > 200 PB on TAPES
Image credit: CERN
14. WLCG: LHC Computing Grid
About WLCG:
• A community of 10,000 physicists
• ~250,000 jobs running concurrently
• 600,000 processing cores
• 700 PB storage available worldwide
• 20-40 Gbit/s connect CERN to the Tier-1s
Tier-0 (CERN): initial data reconstruction, data recording & archiving, data distribution to rest of world
Tier-1s (14 centres worldwide): permanent storage, re-processing, Monte Carlo simulation, end-user analysis
Tier-2s (>150 centres worldwide): Monte Carlo simulation, end-user analysis
170 sites WORLDWIDE, > 10,000 users
Image credit: CERN
15. CERN in 2017
230 PB on tape, 550 million files
55 PB produced in 2017
16. CERN Data Centre: Private OpenStack Cloud
More than 300,000 cores
More than 500,000 physics jobs per day
17. Infrastructure in 2011
• Data centre managed by home-grown toolset: Quattor, Lemon, …
• Initial development funded by EU projects
• Development environment based on CVS
• 100K or so lines of Perl
• At the limit for power and cooling in Geneva
• No simple expansion options
21. 2011 - First OpenStack summit talk
https://www.slideshare.net/noggin143/cern-user-story
22. The Agile Infrastructure Project
2012, a turning point for CERN IT:
- LHC computing and data requirements were increasing … Moore’s law would help, but not enough
- EU-funded projects for the fabric management toolset ended
- Staff numbers fixed, but resources must grow
- LS1 (2013) ahead, next window only in 2019!
- Other deployments had surpassed CERN’s
Three core areas:
- Centralized monitoring
- Config’ management
- IaaS based on OpenStack
“All servers shall be virtual!”
31. OpenStack Magnum
An OpenStack API service that allows creation of container clusters
● Use your Keystone credentials
● You choose your cluster type
● Multi-tenancy
● Quickly create new clusters with advanced features such as multi-master
32. OpenStack Magnum: single command cluster creation
$ openstack coe cluster create --cluster-template kubernetes --node-count 100 … mycluster
$ openstack coe cluster list
+------+----------------+------------+--------------+-----------------+
| uuid | name           | node_count | master_count | status          |
+------+----------------+------------+--------------+-----------------+
| .... | mycluster      | 100        | 1            | CREATE_COMPLETE |
+------+----------------+------------+--------------+-----------------+
$ $(magnum cluster-config mycluster --dir mycluster)
$ kubectl get pod
$ openstack coe cluster update mycluster replace node_count=200
33. Why Bare-Metal Provisioning?
• VMs not sensible/suitable for all of our use cases
- Storage and database nodes, HPC clusters, bootstrapping, critical network equipment or specialised network setups, precise/repeatable benchmarking for s/w frameworks, …
• Complete our service offerings
- Physical nodes (in addition to VMs and containers)
- OpenStack UI as the single pane of glass
• Simplify hardware provisioning workflows (see the sketch after this slide)
- For users: openstack server create/delete
- For procurement & h/w provisioning team: initial on-boarding, server re-assignments
• Consolidate accounting & bookkeeping
- Resource accounting input will come from fewer sources
- Machine re-assignments will be easier to track
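As a minimal sketch of that single pane of glass: with Ironic behind Nova, a physical node is requested and released with the same commands as a VM; the flavor, image and key names below are illustrative assumptions:
$ openstack server create --flavor baremetal --image centos7 --key-name mykey mynode    # lands on a physical machine instead of a hypervisor
$ openstack server delete mynode    # returns the node to the pool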
34. Compute Intensive Workloads on VMs
• Up to 20% performance loss on very large VMs!
• “Tuning”: KSM*, EPT**, pinning, … → 10%
• Compare with Hyper-V: no issue
• NUMA-aware flavors & node pinning … <3%!
• Cross-over: patches from Telecom
(*) Kernel Shared Memory
(**) Extended Page Tables
VM config   Before   After
4x 8        7.8%
2x 16       16%
1x 24       20%      5%
1x 32       20%      3%
35. A new use case: Containers on Bare-Metal
• OpenStack managed containers and bare metal, so put them together
• General service offer: managed clusters (see the sketch after this slide)
- Users get only K8s credentials
- Cloud team manages the cluster and the underlying infra
• Batch farm runs in VMs as well
- Evaluating federated Kubernetes for hybrid cloud integration
- Federation of 7 clouds demonstrated at KubeCon
- OpenStack and non-OpenStack clouds transparently managed
Integration: seamless! (based on a specific template)
Monitoring (metrics/logs)? Pod in the cluster; logs: fluentd + ES; metrics: cadvisor + influx
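A minimal sketch of how such a managed cluster might be requested through Magnum, assuming a bare-metal cluster template has been published (the template name kubernetes-baremetal is hypothetical):
$ openstack coe cluster create --cluster-template kubernetes-baremetal --node-count 10 k8s-bm    # users then receive only the K8s credentials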
36. Hardware Burn-in in the CERN Data Centre (1)
• h/w purchases: formal procedure compliant with public procurements
- Market survey identifies potential bidders
- Tender spec is sent to ask for offers
- Larger deliveries 1-2 times / year
• “Burn-in” before acceptance (“bathtub curve”)
- Compliance with technical spec (e.g. performance)
- Find failed components (e.g. broken RAM)
- Find systematic errors (e.g. bad firmware)
- Provoke early failures by stress
Whole process can take weeks!
37. Hardware Burn-in in the CERN Data Centre (2)
• Initial checks: Serial Asset Tag and BIOS settings
- Purchase order ID and unique serial no. to be set in the BMC (node name!)
• “Burn-in” tests (see the sketch after this slide)
- CPU: burnK7, burnP6, burnMMX (cooling)
- RAM: memtest; Disk: badblocks
- Network: iperf(3) between pairs of nodes, with automatic node pairing
- Benchmarking: HEPSpec06 (& fio), a derivative of SPEC06; we buy total compute capacity (not newest processors)
$ ipmitool fru print 0 | tail -2
Product Serial : 245410-1
Product Asset Tag : CD5792984
$ openstack baremetal node show CD5792984-245410-1
“Double peak” structure in the benchmark results due to slower hardware threads (OpenAccess paper)
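The burn-in tests named above map onto standard tools; a minimal sketch of running two of the checks by hand, where the device path and peer hostname are placeholders:
$ badblocks -sv /dev/sda    # disk: read-only scan for bad blocks, showing progress
$ iperf3 -s    # network: server side on the paired node
$ iperf3 -c <peer-node>    # network: client side, measures throughput within the pair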
39. Network Migration
Phase 1. Nova Network + Linux Bridge (already running, but still used in 2018)
Phase 2. Neutron + Linux Bridge (new region coming in 2018)
Phase 3. SDN: Tungsten Fabric (testing)
40. Spectre / Meltdown
In January, a security vulnerability was disclosed, requiring a new kernel everywhere
Campaign over two weeks from 15th January: 7 reboot days, 7 tidy-up days, by availability zone
Benefits
- Automation now in place to reboot the cloud if needed: 33,000 VMs on 9,000 hypervisors
- Latest QEMU and RBD user code on all VMs
Downside
- Discovered a kernel bug in XFS which may mean we have to do it again soon
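A rough sketch of how one hypervisor in such a campaign could be cycled with standard Nova operations; the hostname is a placeholder and this is an illustration, not CERN's actual automation:
$ openstack compute service set --disable --disable-reason "kernel upgrade" <hypervisor> nova-compute    # stop scheduling new VMs here
$ openstack server list --all-projects --host <hypervisor>    # VMs that the reboot will touch
(reboot the hypervisor into the patched kernel)
$ openstack compute service set --enable <hypervisor> nova-compute    # return it to the pool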
41. Community Experience
Open source collaboration sets the model for in-house teams
External recognition by the community is highly rewarding for contributors
Reviewing and being reviewed is a constant learning experience
Productive for the job market for staff
Working groups, like the Scientific and Large Deployment teams, discuss a wide range of topics
Effective knowledge transfer mechanisms consistent with the CERN mission
110 outreach talks since 2011
Dojos at CERN bring good attendance: Ceph, CentOS, Elastic, OpenStack CH, …
42. HL-LHC: More collisions!
Increased complexity due to much higher pile-up and higher trigger rates will bring several challenges to reconstruction algorithms
CMS had to cope with monster pile-up: 8b4e bunch structure → pile-up of ~60 events/x-ing (instead of ~20 events/x-ing)
CMS: event from 2017 with 78 reconstructed vertices
ATLAS: simulation for HL-LHC with 200 vertices
43. [Timeline figure: First run, LS1, Second run, LS2, Third run, LS3, HL-LHC Run 4, spanning 2009 to ~2030]
Raw data volume increases significantly for High Luminosity LHC (2026)
Significant part of cost comes from global operations
Even with a technology increase of ~15%/year, we still have a big gap if we keep trying to do things with our current compute models
45. Development areas going forward
Spot Market
Cells V2
Neutron scaling
Magnum rolling upgrades
Block Storage Performance
Federated Kubernetes
Collaborations with Industry and SKA
46. Summary
OpenStack has provided flexible infrastructure at CERN since 2013
The open infrastructure toolchain has been stable at scale
Clouds are part, but not all, of the solution
Open source collaborations have been fruitful for CERN, industry and the communities
Further efforts will be needed to ensure that physics is not limited by the computing resources available
47. Thanks for all your help .. Some links
CERN OpenStack blog at http://openstack-in-production.blogspot.com
Recent CERN OpenStack talks at the Vancouver summit at https://www.openstack.org/videos/search?search=cern
CERN Tools at https://github.com/cernops
49. Hardware Evolution
Looking at new hardware platforms to reduce the upcoming resource gap
Explorations have been made in low cost and low power ARM processors
Interesting R&Ds in high performance hardware:
- GPUs for deep learning network training and fast simulation
- FPGAs for neural network inference and data transformations
Significant algorithm changes needed to benefit from the potential
Editor's Notes
Reference: Fabiola’s talk @ Univ of Geneva
https://www.unige.ch/public/actualites/2017/le-boson-de-higgs-et-notre-vie/
European Organization for Nuclear Research
Founded in 1954, today 22 member states
World's largest particle physics laboratory
~2,300 staff, 13k users on site
Budget ~1,000 MCHF
Mission:
Answer fundamental questions on the universe
Advance the technology frontiers
Train the scientists of tomorrow
Bring nations together
https://communications.web.cern.ch/fr/node/84
For all this fundamental research, CERN provides different facilities to scientists, for example the LHC.
It's a ring 27 km in circumference that crosses 2 countries, 100 m underground; it accelerates 2 particle beams to near the speed of light and makes them collide at 4 different points, where detectors observe the fireworks.
2,500 people employed by CERN, >10k users on site
Talk about the LHC here, describe the experiments, Lake Geneva, Mont Blanc, and then jump in
The big ring is the LHC, the small one is the SPS; the computer centre is not far away.
Pushing the boundary of technology
CERN facilitates research: we just run the accelerators; the experiments are done by institutes, member states, universities
On the Franco-Swiss border, very close to Geneva
Our flagship program is the LHC
Trillions of protons race around the 27km ring in opposite directions over 11,000 times a second, travelling at 99.9999991 per cent of the speed of light.
Largest machine on Earth
With an operating temperature of about -271 degrees Celsius, just 1.9 degrees above absolute zero, the LHC is one of the coldest places in the universe
120 t of helium; only at that temperature is there no electrical resistance
https://home.cern/about/engineering/vacuum-empty-interstellar-space
Inside, the beams travel in a very high vacuum, comparable to the vacuum of the Moon; there are actually 2 proton beams, going in 2 directions, and the vacuum avoids the protons interacting with other particles
The detectors are very advanced beasts, 4 of them; ATLAS and CMS are the most well known, general purpose, testing Standard Model properties; in those detectors the Higgs particle was discovered in 2012
In the picture you can see physicists. ALICE and LHCb
To sample and record the debris from up to 600 million proton collisions per second, scientists are building gargantuan devices that measure particles with micron precision.
100 Mpixel camera, 40 million pictures per second
https://www.ethz.ch/en/news-and-events/eth-news/news/2017/03/new-heart-for-cerns-cms.html
https://home.cern/about/computing/processing-what-record
First run: about 5GB/s
Size of the fibres from the pits to the DC?
What do we do with all this data? First we store it; the analysis is done offline and can go on for years.
A tiered system where Tier-0 is CERN: data is recorded, reconstructed and distributed.
All these detectors will generate loads of data… about 1 PB (petabyte = a million gigabytes) per… SECOND!
Impossible to store so much data. Anyway, not needed.
The events the experiments are trying to create and observe are very rare.
That's why we make so many collisions but keep only the interesting ones.
Therefore next to each detector is a «trigger», a kind of filter made of various layers (first electronics, then computers) which selects and keeps only 1 collision out of a million on average.
In the end we still generate dozens of petabytes of data each year. We need about 200'000 computer CPUs to analyze this data.
As CERN has only about 100'000 CPUs, we share the data over more than 100 computer centres across the planet (usually located in the physics institutes participating in the LHC collaboration). This is the Computing Grid, a gigantic planetary computer and hard drive!
Biggest scientific Grid project in the world
~170 computer centres (sites)
1 Tier-0 (distributed in two locations)
14 bigger centres (Tier-1)
~160 Tier-2
42 countries
10,000 users
Running since Oct 2008
3 million jobs per day
~600,000 cores
300 PB data
Do you want to contribute?
http://lhcathome.web.cern.ch/
Optimized the usage of resources and computing (~2012: private cloud based on OpenStack), focusing on virtualization and scaling options.