Clouds at CERN
Tim Bell
tim.bell@cern.ch
Academic Cloud Experiences, 29th April 2013
T. Bell 1
CERN was founded in 1954: 12 European States
“Science for Peace”
Today: 20 Member States
Member States: Austria, Belgium, Bulgaria, the Czech Republic, Denmark,
Finland, France, Germany, Greece, Hungary, Italy, the Netherlands, Norway,
Poland, Portugal, Slovakia, Spain, Sweden, Switzerland and
the United Kingdom
Candidate for Accession: Romania
Associate Members in Pre-Stage to Membership: Israel, Serbia
Applicant States for Membership or Associate Membership:
Brazil, Cyprus (awaiting ratification), Pakistan, Russia, Slovenia, Turkey, Ukraine
Observers to Council: India, Japan, Russia, Turkey, United States of America;
European Commission and UNESCO
~ 2300 staff
~ 1000 other paid personnel
> 11000 users
Budget (2013) ~1000 MCHF
Is the Higgs boson the source of mass of our
fundamental particles?
Why is the universe made of matter and not equal amounts of matter/antimatter?
Dark Matter and Dark Energy?
We do not know the composition of 95% of the universe
Temperature of the universe (WMAP satellite)
Blue tubes contain the two beam pipes and magnets at 1.8 K
ATLAS detector during construction in 2005
Search for Higgs decays to 4 “leptons” (electrons or muons)
Plot: number of candidates (vertical axis) against mass of the candidates (horizontal axis)
We observe an excess of candidates with a mass of 125 proton masses
Also observed in the CMS experiment
July 4, 2012
The Worldwide LHC Computing Grid
Tier-0 (CERN): data recording, reconstruction and distribution
Tier-1: permanent storage, re-processing, analysis
Tier-2: simulation, end-user analysis
> 2 million jobs/day
~250,000 cores
173 PB of storage
Nearly 160 sites, 35 countries
10 Gb links
WLCG:
An international collaboration to distribute and analyse LHC data
Integrates computer centres worldwide that provide computing and storage resources into a single infrastructure accessible by all LHC physicists
IT Infrastructure Challenges
• Staff numbers fixed
• Materials budget decreasing
• Increasing users of CERN’s facilities
• Legacy tools are high maintenance and brittle
• Additional data centre in Budapest now online, doubling potential capacity, with a 200 Gbit/s network
How do we scale from our current 11,000 servers within these constraints?
Approach
• Remodel IT services on Cloud layered models
    • IaaS, PaaS, SaaS
• Move to commonly used open source tools
    • Puppet, OpenStack, Foreman, Koji, Oz, Kibana, …
• Implement clouds at scale
    • IT aims for 15,000 hypervisors with 150,000 VMs by 2015
• Exploit ecosystem solutions such as LBaaS, DBaaS, MQaaS rather than build our own
Clouds in High Energy Physics
Long-term preservation of software and data of HEP experiments
Utilize special computing resources attached to the detectors
Simplify the management of heterogeneous in-house resources
Use commercial clouds for exceptional computing demands
Distributed cloud computing using HEP and non-HEP clouds
Service Models
• Pets are given names like pussinboots.cern.ch
    • They are unique, lovingly hand raised and cared for
    • When they get ill, you nurse them back to health
• Cattle are given numbers like vm0042.cern.ch
    • They are almost identical to other cattle
    • When they get ill, you get another one
Future application architectures tend towards Cattle but Pet support is needed for some specific zones of the cloud
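The pets-versus-cattle distinction can be made concrete with a minimal Python sketch. The `Server` type, the `handle_failure` function and the numbering scheme are hypothetical illustrations, not CERN tooling; only the two example hostnames come from the slide.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Server:
    hostname: str
    is_pet: bool  # True: uniquely named, hand-managed; False: numbered, disposable


def cattle_names(start, domain="cern.ch"):
    """Yield numbered cattle hostnames such as vm0042.cern.ch."""
    for n in count(start):
        yield f"vm{n:04d}.{domain}"


def handle_failure(server, spare_names):
    """Return (action, server) describing how a failed machine is treated.

    Pets are repaired in place and keep their identity; cattle are
    discarded and replaced by a fresh machine with the next number.
    """
    if server.is_pet:
        return ("repair", server)
    return ("replace", Server(next(spare_names), is_pet=False))


names = cattle_names(43)
print(handle_failure(Server("pussinboots.cern.ch", True), names))
print(handle_failure(Server("vm0042.cern.ch", False), names))
```

The point of the sketch is that nothing about a cattle machine's identity survives a failure, which is why its applications must tolerate replacement.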
Refine Service Levels?
• Hippos are cattle with bulk storage. Useful where Cassandra or MongoDB ensures redundancy
• Canaries are cattle at high risk to give early warning of failures. Deploy early, fail fast and fix
Infrastructure Overview
Diagram labels:
Microsoft Active Directory
CERN DB on Demand
CERN Network Database
Account mgmt. system
Horizon
Keystone
Network
Compute
Glance
Scheduler
Cinder
Nova
CERN Block Storage provider
Dashboard using Horizon
Timelines
• Deploy as stable release becomes available in EPEL
• Keep up to date but not too close
• Benefit from continuous integration testing of other companies
Timeline:
Folsom: Sep 27, 2012
Ibex: Feb, 2013
Grizzly: Apr 4, 2013
Grizzly Service: May, 2013
Havana: Oct, 2013
Havana Service: Nov/Dec, 2013
Status
• CERN IT OpenStack Cloud
    • Running Folsom on around 500 hypervisors with KVM and Hyper-V
    • High availability using load balancing
    • 75 users creating around 50 new VMs/day
• Experiment farms
    • CMS currently running 1,300 hypervisors with 50,000 cores using Essex
    • ATLAS starting to ramp up to a similar size
• Other HEP sites moving to private cloud
    • Brookhaven, IN2P3, FutureGrid, NeCTAR, IHEP, …
Next Steps (I)
• Move to Grizzly
    • Target end May 2013
• Enable Kerberos and X.509 authentication
    • Avoids users having to enter passwords
• Recycle existing hardware and scale using cells
    • Can recycle around 100 batch machines into hypervisors per week
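As a back-of-the-envelope illustration, the sketch below combines the recycling rate above with the roughly 500 hypervisors in service and the 15,000-hypervisor target quoted elsewhere in this talk. It assumes a constant rate and that recycling is the only source of new hypervisors, neither of which the slides state; `weeks_to_target` is a hypothetical helper.

```python
import math


def weeks_to_target(current, target, per_week):
    """Weeks needed to grow from `current` to `target` hypervisors
    at a constant conversion rate of `per_week` machines."""
    if per_week <= 0:
        raise ValueError("per_week must be positive")
    return max(0, math.ceil((target - current) / per_week))


# Figures from the slides: ~500 hypervisors today, 15,000 targeted,
# ~100 batch machines recycled per week.
print(weeks_to_target(500, 15_000, 100))

# Planned consolidation ratio: 150,000 VMs on 15,000 hypervisors.
print(150_000 / 15_000)
```

Under these simplifying assumptions the calculation gives 145 weeks and an average of 10 VMs per hypervisor.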
Cells
We’re not alone …
Already 6 sites running more than 10,000 hypervisors
according to the latest OpenStack user survey
Next Steps (II)
• Block Storage for Hippos and Pets
    • Cinder with Ceph, NetApp or GlusterFS
• Heat for Orchestration and auto-scaling
• Load Balancing as a Service
• Bare-Metal to bring all servers under OpenStack
• Move Ceilometer into production
    • Accounting by project
    • Move to wall-clock, vCPU metering
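One way to read wall-clock vCPU metering with accounting by project is sketched below: each VM is charged for the time it exists, multiplied by its vCPU count, regardless of how busy it is. The record layout and function name are assumptions for illustration and do not reflect Ceilometer's actual data model or API.

```python
from collections import defaultdict
from datetime import datetime


def vcpu_hours_by_project(records, period_start, period_end):
    """Sum wall-clock vCPU-hours per project within a reporting period.

    Each record is (project, vcpus, created_at, deleted_at); deleted_at
    is None for VMs that are still running.
    """
    usage = defaultdict(float)
    for project, vcpus, created, deleted in records:
        # Clip the VM lifetime to the reporting period.
        start = max(created, period_start)
        end = min(deleted or period_end, period_end)
        if end > start:
            hours = (end - start).total_seconds() / 3600
            usage[project] += vcpus * hours
    return dict(usage)


# Hypothetical sample data, not taken from the talk.
records = [
    ("atlas", 4, datetime(2013, 4, 1, 0), datetime(2013, 4, 1, 12)),
    ("atlas", 2, datetime(2013, 4, 1, 6), None),
    ("cms", 8, datetime(2013, 3, 31, 0), datetime(2013, 4, 1, 3)),
]
print(vcpu_hours_by_project(records, datetime(2013, 4, 1), datetime(2013, 4, 2)))
```

Charging by wall-clock time rather than consumed CPU time means an idle VM costs its project as much as a busy one, which gives an incentive to release unused capacity.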
Cost Model
• CERN computing is funded from CERN central budgets, no billing but quotas
IT resource manager
Experiment resource managers
Project Management
Quota Management
• What to do when quota is exceeded?
    • No credit card
• If capacity is not used?
    • Spot market on low SLA conditions
• Fair share across the cloud?
    • Worked for supercomputers but heavy for clouds at scale
• Bursting to public clouds an option?
    • IT provisioned or experiment decision
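The questions above can be turned into a toy admission policy: with no billing, a request inside the project's quota is accepted, and a request beyond it is accepted only as a reclaimable, low-SLA instance when spare capacity exists. This is a sketch of one possible answer, not a policy described in the talk; all names and numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Project:
    name: str
    quota_cores: int
    used_cores: int = 0


def admit(project, cores, free_cores):
    """Decide how to treat a request for `cores` from `project`.

    Returns "standard" if it fits in the quota, "low-sla" if it exceeds
    the quota but idle capacity is available (the instance may be
    reclaimed), and "rejected" if the cloud has no room at all.
    """
    if cores > free_cores:
        return "rejected"
    if project.used_cores + cores <= project.quota_cores:
        project.used_cores += cores
        return "standard"
    return "low-sla"


cms = Project("cms", quota_cores=100, used_cores=90)
print(admit(cms, 8, free_cores=50))   # fits within quota
print(admit(cms, 8, free_cores=50))   # over quota, spare capacity exists
print(admit(cms, 80, free_cores=50))  # more than the cloud has free
```

Low-SLA usage is deliberately not counted against the quota here, since it only consumes capacity that would otherwise sit idle.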
Cloud of clouds: the next big step
• What is required to get to a cloud of clouds?
    • Federated identity
    • Image conversion and sharing
    • API standardisation
    • SLAs
    • Security models
• Many initiatives investigating this at different levels
    • Public/Private bursting
    • Private/Private sharing (as the grid)
    • Homogeneous and Heterogeneous
• We will see intensive efforts in this area over the coming year
Conclusions
• Clouds provide a framework for re-engineering how IT delivers responsive services to the physicists
• OpenStack and the ecosystem provide a suitable solution with flexibility and opportunity to contribute as well as benefit from the work of others
• Migration via recycling bare-metal servers into hypervisors provides a smooth transition
• Cloud of clouds has potential to replace grid computing models in the future
Questions?
BACKUP SLIDES
Job Opportunities
Science is getting more and more global
CERN: x staff, x fellows
