
CERN Data Centre Evolution


  1. CERN Data Centre Evolution
     Gavin McCance, @gmccance
     SDCD12: Supporting Science with Cloud Computing, Bern, 19th November 2012
  2. What is CERN?
     • Conseil Européen pour la Recherche Nucléaire, aka the European Laboratory for Particle Physics
     • Between Geneva and the Jura mountains, straddling the Swiss-French border
     • Founded in 1954 with an international treaty
     • Our business is fundamental physics: what is the universe made of and how does it work?
  3. Answering fundamental questions…
     • How do we explain that particles have mass? We have theories and accumulating experimental evidence… getting close…
     • What is 96% of the universe made of? We can only see 4% of its estimated mass!
     • Why isn't there anti-matter in the universe? Nature should be symmetric…
     • What was the state of matter just after the "Big Bang"? Travelling back to the earliest instants of the universe would help…
  4. The Large Hadron Collider (LHC) tunnel
  5. (photo slide)
  6. Data Centre by Numbers
     • Hardware installation & retirement: ~7,000 hardware movements/year; ~1,800 disk failures/year
     • CPU mix: Xeon L5520 33%, Xeon E5410 16%, Xeon E5345 14%, Xeon 5160 10%, Xeon L5420 8%, Xeon E5335 7%, Xeon E5405 6%, Xeon 3GHz 4%, Xeon 5150 2%
     • Disk vendor mix: Western Digital 59%, Hitachi 23%, Seagate 15%, Fujitsu 3%, HP/Maxtor/other ~0%
     • Network: 24 high-speed routers (640 Mbps → 2.4 Tbps); 350 Ethernet switches; 2,000 10 Gbps ports; 4.8 Tbps switching capacity; 16,939 1 Gbps ports; 558 10 Gbps ports
     • Compute: 828 racks; 11,728 servers; 15,694 processors; 64,238 cores; 482,507 HEPSpec06
     • Storage: 64,109 disks; 63,289 TiB raw disk capacity; 56,014 memory modules; 158 TiB memory capacity; 3,749 RAID controllers
     • Tape: 160 tape drives; 45,000 tape cartridges; 56,000 tape slots; 73,000 TiB tape capacity
     • Power: 2,456 kW IT power consumption; 3,890 kW total power consumption
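The disk figures on this slide imply an annualised failure rate near 3%; a quick sanity check from the slide's own numbers (an editorial calculation, not CERN's):

```python
# Quick sanity check on the slide's figures (not CERN's own calculation):
# ~1,800 disk failures/year out of 64,109 disks is an annualised failure
# rate near 3%, higher than typical manufacturer MTBF figures would suggest.
disks = 64_109
failures_per_year = 1_800

annual_failure_rate = failures_per_year / disks * 100  # percent
print(f"Annualised disk failure rate: {annual_failure_rate:.1f}%")  # ~2.8%
```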
  7. Current infrastructure
     • Around 12k servers: dedicated compute, dedicated disk servers, dedicated service nodes
     • Majority Scientific Linux (RHEL5/6 clone), mostly running on real hardware
     • In the last couple of years, we've consolidated some of the service nodes onto Microsoft Hyper-V; various other virtualisation projects around
     • In 2002 we developed our own management toolset: Quattor / CDB for configuration, Lemon for computer monitoring
     • Open source, but a small community
  8. Many diverse applications ("clusters"), managed by different teams (CERN IT + experiment groups)
  9. New data centre to expand capacity
     • The data centre in Geneva is at the limit of its electrical capacity at 3.5 MW
     • New centre chosen in Budapest, Hungary: an additional 2.7 MW of usable power
     • Hands-off facility, deploying from 2013 with a 200 Gbit/s network to CERN
  10. Time to change strategy
      • Rationale: we need to manage twice the servers we have today, with no increase in staff numbers, and our tools are becoming increasingly brittle and will not scale as-is
      • Approach: CERN is no longer a special case for compute; adopt an open source tool chain model
      • Our engineers rapidly iterate: evaluate solutions in the problem domain, identify functional gaps and challenge old assumptions, select a first choice but be prepared to change in future
      • Contribute new function back to the community
  11. Building Blocks
      • Bamboo; Koji, Mock; AIMS/PXE; Foreman; Yum repo, Pulp; PuppetDB; mcollective, yum; JIRA; Lemon / Hadoop; git; OpenStack Nova; hardware database; Puppet; Active Directory / LDAP
  12. Choose Puppet for Configuration
      • The tool space has exploded in the last few years, in configuration management and operations
      • Puppet and Chef are the clear leaders for 'core tools'
      • Many large enterprises now use Puppet; its declarative approach fits what we're used to at CERN
      • Large installations; a friendly, broad-based community
      • You can buy books on it, and you can employ people who know it better than we do
  13. Puppet Experience
      • Excellent: basic Puppet is easy to set up and scales up well
      • Well documented; configuring services with it is easy
      • Handles our cluster diversity and dynamic clouds well
      • Lots of resources ("modules") online, though of varying quality
      • Large, responsive community to help
      • Lots of nice tooling for free: configuration version control and branching integrate well with git; for a dashboard we use Foreman
      • We're moving all our production services over in 2013
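The declarative approach mentioned on the previous slide means you state the desired end state and the tool computes and applies only the changes needed. A minimal sketch of that idea in Python (illustrative only; the function and state names are invented and this is not Puppet's actual API or behaviour):

```python
# Illustrative sketch of the declarative model Puppet uses: describe the
# desired state, and the engine applies only the changes needed.
# Names here are invented, not Puppet's API.

def converge(current, desired):
    """Return the actions taken to move `current` state to `desired`."""
    actions = []
    for resource, state in desired.items():
        if current.get(resource) != state:
            actions.append((resource, state))
            current[resource] = state  # apply the change in place
    return actions

current = {"ntpd": "stopped", "httpd": "running"}
desired = {"ntpd": "running", "httpd": "running"}

actions = converge(current, desired)   # only ntpd needs touching
rerun = converge(current, desired)     # second run is a no-op
```

The second call returning no actions is the idempotence property that declarative configuration tools rely on for safe repeated runs.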
  14. (image slide)
  15. Preparing the move to cloud
      • Improve operational efficiency and flexibility: dynamic multiple operating system demand; dynamic temporary load spikes for special activities; hardware interventions with long-running programs (live migration)
      • Improve resource efficiency: exploit idle resources, especially those waiting for disk and tape I/O; highly variable load such as interactive or build machines
      • Enable cloud architectures: gradual migration from traditional batch + disk to cloud interfaces and workflows
      • Improve responsiveness: self-service with coffee-break response time
  16. What is OpenStack?
      • OpenStack is a cloud operating system that controls large pools of compute, storage, and networking resources throughout a datacenter, all managed through a dashboard that gives administrators control while empowering their users to provision resources through a web interface
  17. Service Model
      • Pets are given names. They are unique, lovingly hand-raised and cared for; when they get ill, you nurse them back to health
      • Cattle are given numbers. They are almost identical to other cattle; when they get ill, you get another one
      • Future application architectures should use cattle, but pets with strong configuration management are viable and still needed
      • Borrowed from @randybias at Cloudscaling (…-cloud-revolution-cyber-press-forum-philippines)
  18. Basic OpenStack Components
      • NOVA: Compute, Scheduler, Network, Volume
      • GLANCE: Image, Registry
      • KEYSTONE, HORIZON
      • Each component has an API and is pluggable
      • Other non-core projects interact with these components
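The "pluggable" point is the key design property here: each component exposes a stable interface and delegates policy to a backend chosen by configuration. A toy sketch of that pattern (illustrative only, not OpenStack code; all names are invented):

```python
# Toy sketch (not OpenStack code) of the pluggable pattern: a component
# exposes a stable interface and delegates the policy to a backend
# selected at configuration time. All names are invented.

class Scheduler:
    """Picks a host for a new VM; the placement policy is pluggable."""
    def __init__(self, policy):
        self.policy = policy

    def pick_host(self, hosts):
        return self.policy(hosts)

def fewest_vms(hosts):
    # One possible policy: spread load onto the least-busy hypervisor.
    return min(hosts, key=lambda h: h["vms"])

hosts = [{"name": "hv01", "vms": 12}, {"name": "hv02", "vms": 3}]
chosen = Scheduler(fewest_vms).pick_host(hosts)["name"]
```

Swapping in a different policy function changes placement behaviour without touching the component's interface, which is what makes the components independently replaceable.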
  19. Supporting the Pets with OpenStack
      • Network: interfacing with legacy site DNS and IP management; ensuring Kerberos identity before VM start
      • Puppet: ease the use of configuration management tools for our users; exploit mcollective for orchestration/delegation
      • External block storage: currently using nova-volume with a Gluster backing store
      • Live migration to maximise availability: KVM live migration using Gluster; KVM and Hyper-V block migration
  20. Current Status of OpenStack at CERN
      • Working on an Essex code base from the EPEL repository; excellent experience with the Fedora cloud-sig team; cloud-init for contextualisation, Oz for images with RHEL/Fedora
      • Components: current focus is on Nova with KVM and Hyper-V; Keystone running with Active Directory, and Glance for Linux and Windows images
      • Pre-production facility with around 200 hypervisors and 2,000 VMs integrated with the CERN infrastructure, used for simulation of magnet placement using LHC@Home and batch physics programs
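The cloud-init contextualisation mentioned above works by feeding each VM a user-data document at boot. A hypothetical fragment in cloud-init's standard cloud-config format (the `packages` and `runcmd` keys are standard cloud-init; the specific package and command are invented for illustration and are not CERN's actual configuration):

```yaml
#cloud-config
# Hypothetical contextualisation snippet: install the configuration
# agent and trigger a first run at boot (illustrative, not CERN's setup).
packages:
  - puppet
runcmd:
  - [puppet, agent, --onetime, --no-daemonize]
```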
  21. (image slide)
  22. Next Steps
      • Deploy into production at the start of 2013 with Folsom, running production services and compute on top of OpenStack IaaS
      • Support multi-site operations with the 2nd data centre in Hungary
      • Exploit new functionality: Ceilometer for metering; bare metal for non-virtualised use cases such as high-I/O servers; X.509 user certificate authentication; load balancing as a service
      • Ramping to 15K hypervisors with 100K VMs by 2015
  23. Conclusions
      • The CERN computer centre is expanding
      • We're in the process of refurbishing the tools we use to manage the centre, based on OpenStack for IaaS and Puppet for configuration management
      • Production at CERN in the next few months on Folsom, with gradual migration of all our services
      • Community is key to shared success: CERN contributes and benefits
  24. BACKUP SLIDES
  25. Training and Support
      • Buy the book rather than rely on guru mentoring
      • Follow the mailing lists to learn
      • Newcomers are rapidly productive (and often know more than us)
      • Community and enterprise support mean we're not on our own
  26. Staff Motivation
      • Skills remain valuable outside of CERN when an engineer's contract ends
  27. When communities combine…
      • OpenStack's many components and options make configuration complex out of the box
      • A Puppet Forge module from Puppet Labs does our configuration
      • The Foreman adds OpenStack provisioning: from user kiosk to a configured machine in 15 minutes
  28. Foreman to manage Puppetized VMs
  29. Active Directory Integration
      • CERN's Active Directory: unified identity management across the site; 44,000 users; 29,000 groups; 200 arrivals/departures per month
      • Full integration with Active Directory via LDAP: uses the OpenLDAP backend with some particular configuration settings; aim for minimal changes to Active Directory; 7 patches submitted around hard-coded values and additional filtering
      • Now in use in our pre-production instance: map project roles (admins, members) to groups; documentation in the OpenStack wiki
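Mapping project roles to directory groups, as described above, is conceptually a lookup from a user's group membership to a set of roles. A hypothetical sketch (the group names, role names, and mapping are invented for illustration, not CERN's actual configuration):

```python
# Hypothetical sketch of deriving OpenStack project roles from Active
# Directory / LDAP group membership. Group names and roles are invented.

GROUP_ROLE_MAP = {
    "cn=cloud-admins": "admin",
    "cn=cloud-members": "member",
}

def roles_for(user_groups):
    """Return the set of project roles implied by a user's groups."""
    return {GROUP_ROLE_MAP[g] for g in user_groups if g in GROUP_ROLE_MAP}

roles = roles_for(["cn=cloud-admins", "cn=unrelated-group"])  # {"admin"}
```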
  30. What are we missing (or haven't found yet)?
      • Best practice for: monitoring and KPIs as part of core functionality; guest disaster recovery; migration between versions of OpenStack
      • Roles within multi-user projects: VM owners allowed to manage their own resources (start/stop/delete); project admins allowed to manage all resources; other members should not have rights over other members' VMs
      • Global quota management for a non-elastic private cloud: manage resource prioritisation and allocation centrally; capacity management / utilisation for planning
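The global quota gap described above amounts to a single site-wide allocator that refuses requests beyond a fixed capacity rather than bursting, as a public cloud would. A minimal sketch under that assumption (class, project names, and numbers are all illustrative):

```python
# Minimal sketch of a central quota allocator for a non-elastic private
# cloud: capacity is fixed, so over-capacity requests are refused.
# All names and numbers are illustrative.

class QuotaManager:
    def __init__(self, total_cores):
        self.total = total_cores
        self.granted = {}

    def request(self, project, cores):
        used = sum(self.granted.values())
        if used + cores > self.total:
            return False  # refuse: capacity is fixed, no bursting
        self.granted[project] = self.granted.get(project, 0) + cores
        return True

qm = QuotaManager(total_cores=100)
ok1 = qm.request("projectA", 60)   # fits within capacity: granted
ok2 = qm.request("projectB", 50)   # 60 + 50 > 100: refused
```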
  31. Opportunistic Clouds in online experiment farms
      • The CERN experiments have farms of 1000s of Linux servers close to the detectors to filter the 1 PByte/s down to the 6 GByte/s to be recorded to tape
      • When the accelerator is not running, these machines are currently idle; the accelerator has regular maintenance slots of several days, and a Long Shutdown is due from March 2013 to November 2014
      • One of the experiments is deploying OpenStack on their farm: simulation (low I/O, high CPU) and analysis (high I/O, high CPU, high network)
  32. New architecture data flows

Editor's Notes

  • Established by an international treaty at the end of the Second World War as a place where scientists could work together on fundamental research. "Nuclear" is part of the name, but our world is particle physics.
  • Our current understanding of the universe is incomplete. A theory, called the Standard Model, proposes particles and forces, many of which have been experimentally observed. However, there are open questions. Why do some particles have mass and others not? The Higgs boson is a theory, but we need experimental evidence. Our theory of forces does not explain how gravity works. Cosmologists can only find 4% of the matter in the universe; we have lost the other 96%. We should have 50% matter and 50% anti-matter… why is there an asymmetry (although it is a good thing that there is, since the two annihilate each other)? When we go back through time 13 billion years towards the Big Bang, we move back through planets, stars, atoms, and protons/electrons towards a soup-like quark-gluon plasma. What were the properties of this?
  • The ring consists of two beam pipes, with a vacuum pressure 10 times lower than on the moon, which contain the beams of protons accelerated to just below the speed of light. These go round 11,000 times per second, bent by superconducting magnets cooled to 2 K by liquid helium (-450 °F), colder than outer space. The beams themselves have a total energy similar to a high-speed train, so care needs to be taken to make sure they turn the corners correctly and don't bump into the walls of the pipe.
  • To improve the statistics, we send round beams of multiple bunches; as they cross, there are multiple collisions, with 100 billion protons per bunch passing through each other. Software close to the detector, and later offline in the computer centre, then has to examine the tracks to understand the particles involved.
  • So, to the Tier-0 computer centre at CERN… we are unusual in that we are public about our environment, as there is no competitive advantage for us. We have thousands of visitors a year coming for tours and education, and the computer centre is a popular visit. The data centre has around 2.9 MW of usable power looking after 12,000 servers; in comparison, the accelerator uses 120 MW, like a small town. With 64,000 disks, we have around 1,800 failing each year… this is much higher than the manufacturers' MTBFs, which is consistent with results from Google. Servers are mainly Intel processors, some AMD, with dual-core Xeon being the most common configuration.
  • We asked member states for offers. 200 Gbit/s links connect the centres. We expect to double computing capacity compared to today by 2015.
  • Double the capacity, same manpower. We need to rethink how to solve the problem and look at how others approach it. We had our own tools in 2002, and as they became more sophisticated, it was not possible to take advantage of other developments elsewhere without a major break. Our engineers are doing this while doing their 'day' jobs, so it reinforces the approach of taking what we can from the community.
  • The model is based on the Google toolchain; Puppet is key for many operations. We've only had to write one significant new custom CERN software component, which is the certificate authority. Other parts, such as Lemon for monitoring, are from our previous implementation, as we did not want to change everything at once and they scale.
  • Standardise hardware: buy in bulk and pile it up, then work out what to use it for. Interventions cover memory, motherboards, cables or disks. Users waiting for I/O mean wasted cycles; build machines run at night and sit unused during the day, while interactive machines are used mainly during the day. Move to cloud APIs: we need to support them but also maintain our existing applications. Details later on reception and testing.
  • Puppet applies well to the cattle model, but we're also using it to handle the pet cases that can't yet move over due to software limitations. So they get cloud provisioning and flexible configuration management.
  • Complex to configure… take advantage of the experience of others
  • We’ve been very pleased with our choices. Along with the obvious benefits of the functionality, there are soft benefits from the community model.
  • Many staff at CERN are on short-term contracts… it is a good benefit for those staff to leave with skills that are in demand.
  • Communities integrating… when a new option is used at CERN in OpenStack, we contribute the changes back to the Puppet Forge, such as certificate handling. We are even looking at Hyper-V/Windows OpenStack configuration…