CERN Mass and Agility talk at OSCON 2014

CERN is the European Centre for Particle Physics based in Geneva. The home of the Large Hadron Collider and the birthplace of the World Wide Web is expanding its computing resources with a second data centre to process over 35PB/year from one of the largest scientific experiments ever constructed.

Within the constraints of fixed budget and manpower, agile computing techniques and common open source tools are being adopted to support over 11,000 physicists in their search for how the universe works and what it is made of.

By challenging special requirements and understanding how other large computing infrastructures are built, we have deployed a 50,000 core cloud-based infrastructure building on tools such as Puppet, OpenStack and Kibana.

Moving to a cloud model has also required close examination of the IT processes and culture. Finding the right approach between Enterprise and DevOps techniques has been one of the greatest challenges of this transformation.

This talk will cover the requirements, tools selected, results achieved so far and the outlook for the future.


  1. Tim Bell @noggin143 tim.bell@cern.ch 23/07/2014 OSCON - CERN Mass and Agility
  2. About Tim
      • Runs IT Infrastructure group at CERN
      • Member of OpenStack management board and user committee
      • Previously worked at
          • Deutsche Bank running European Private Banking Infrastructure
          • IBM as a consultant and kernel developer
  3. CERN was founded 1954: 12 European States, “Science for Peace”
      Today: 21 Member States
      Member States: Austria, Belgium, Bulgaria, the Czech Republic, Denmark, Finland, France, Germany, Greece, Hungary, Israel, Italy, the Netherlands, Norway, Poland, Portugal, Slovakia, Spain, Sweden, Switzerland and the United Kingdom
      Candidate for Accession: Romania
      Associate Members in Pre-Stage to Membership: Serbia
      Applicant States for Membership or Associate Membership: Brazil, Cyprus (awaiting ratification), Pakistan, Russia, Slovenia, Turkey, Ukraine
      Observers to Council: India, Japan, Russia, Turkey, United States of America; European Commission and UNESCO
      ~ 2,300 staff
      ~ 1,000 other paid personnel
      > 11,000 users
      Budget (2013) ~1,000 MCHF
  4. What are the Origins of Mass?
  5. Matter/Anti Matter Symmetric?
  6. Where is 95% of the Universe?
  10. Collisions
  11. A Big Data Challenge
      In 2014,
      • ~ 100PB archive with additional 35PB/year
      • ~ 11,000 servers
      • ~ 75,000 disk drives
      • ~ 45,000 tapes
      • Data should be kept for at least 20 years
      In 2015, we start the accelerator again
      • Upgrade to double the energy of the beams
      • Expect a significant increase in data rate
  12. LHC data growth
      • Plan to record 400PB/year by 2023
      • Compute needs expected to be around 50x current levels if budget available
      [Chart: PB per year (axis 0 to 450) for CMS, ATLAS, ALICE and LHCb over Run 1 to Run 4; time axis 2010, 2015, 2018, 2023]
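As a rough worked example of the growth on this slide: slide 11 gives about 35PB/year in 2014, and this slide plans for 400PB/year by 2023. The sketch below computes the compound annual growth rate that would connect those two figures. Smooth year-on-year growth is an assumption made purely for illustration (the chart shows growth by run, not by year), and the function name is hypothetical.

```python
def implied_annual_growth(start_pb: float, end_pb: float, years: int) -> float:
    """Compound annual growth rate that takes start_pb to end_pb in `years` years."""
    if start_pb <= 0 or end_pb <= 0 or years <= 0:
        raise ValueError("inputs must be positive")
    return (end_pb / start_pb) ** (1 / years) - 1


# Figures from the slides: ~35 PB/year in 2014, 400 PB/year planned for 2023.
RATE = implied_annual_growth(35, 400, 2023 - 2014)
print(f"Implied growth: {RATE:.1%} per year")
```

Under that assumption the recording rate would have to grow by roughly 31% each year.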
  13. Tier-0 (CERN):
      • Data recording
      • Initial data reconstruction
      • Data distribution
      Tier-1 (11 centres):
      • Permanent storage
      • Re-processing
      • Analysis
      Tier-2 (~200 centres):
      • Simulation
      • End-user analysis
      • Data is recorded at CERN and Tier-1s and analysed in the Worldwide LHC Computing Grid
      • In a normal day, the grid provides 100,000 CPU days executing over 2 million jobs
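To put the last bullet in perspective, the two daily figures can be combined into an average job size. This is a back-of-the-envelope sketch using only the numbers on the slide; the constant names are illustrative.

```python
CPU_DAYS_PER_DAY = 100_000  # CPU days delivered by the grid in a normal day
JOBS_PER_DAY = 2_000_000    # slide says "over 2 million", so the average below is an upper bound

# Average CPU time consumed per job, in hours.
AVG_CPU_HOURS_PER_JOB = CPU_DAYS_PER_DAY * 24 / JOBS_PER_DAY

# One CPU day delivered per day corresponds to one core kept busy around the clock.
EQUIVALENT_BUSY_CORES = CPU_DAYS_PER_DAY

print(f"Average job: at most {AVG_CPU_HOURS_PER_JOB:.1f} CPU hours")
print(f"Equivalent to about {EQUIVALENT_BUSY_CORES:,} cores busy 24 hours a day")
```

In other words, the typical grid job is short (a little over an hour of CPU time), while the aggregate load is comparable to about 100,000 continuously busy cores.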
  14. The CERN Meyrin Data Centre
  15. New Data Centre in Budapest
  16. Good News, Bad News
      • Additional data centre in Budapest now online
      • Increasing use of facilities as data rates increase
      But…
      • Staff numbers are fixed, no more people
      • Materials budget decreasing, no more money
      • Legacy tools are high maintenance and brittle
      • User expectations are for fast self-service
  17. Public Procurement Cycle
      Step                                  Time (Days)                   Elapsed (Days)
      User expresses requirement                                          0
      Market Survey prepared                15                            15
      Market Survey for possible vendors    30                            45
      Specifications prepared               15                            60
      Vendor responses                      30                            90
      Test systems evaluated                30                            120
      Offers adjudicated                    10                            130
      Finance committee                     30                            160
      Hardware delivered                    90                            250
      Burn in and acceptance                30 typical (380 worst case)   280
      Total                                                               280+
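The elapsed column in the procurement table is a running total of the step durations. A minimal sketch that reproduces it from the per-step times on the slide follows; the variable and function names are illustrative, and the typical 30-day burn-in is used rather than the 380-day worst case.

```python
from itertools import accumulate

# (step, duration in days) as listed on the slide.
STEPS = [
    ("User expresses requirement", 0),
    ("Market Survey prepared", 15),
    ("Market Survey for possible vendors", 30),
    ("Specifications prepared", 15),
    ("Vendor responses", 30),
    ("Test systems evaluated", 30),
    ("Offers adjudicated", 10),
    ("Finance committee", 30),
    ("Hardware delivered", 90),
    ("Burn in and acceptance", 30),  # typical case; the slide quotes 380 as worst case
]


def elapsed_days(steps):
    """Running total of step durations, one entry per step."""
    return list(accumulate(days for _, days in steps))


ELAPSED = elapsed_days(STEPS)
for (name, days), total in zip(STEPS, ELAPSED):
    print(f"{name:<36}{days:>4}{total:>6}")
```

The final running total matches the 280 days on the slide, which is why the total is quoted as "280+": any step that overruns pushes delivery out further.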
  18. Approach
      • There is no Moore’s Law for people
      • Automation needs APIs, not documented procedures
      • Focus on high people effort activities
      • Are those requirements really justified?
      • Accumulating technical debt stifles agility
      • Find open source communities and contribute
      • Understand ethos and architecture
      • Stay mainstream
  19. O’Reilly Consideration
  20. Indeed.Com Consideration
  21. [Tool chain diagram; labels: Bamboo; Koji, Mock; AIMS/PXE; Foreman; Yum repo; Pulp; Puppet-DB; mcollective, yum; JIRA; Lemon / Hadoop / LogStash / Kibana; git; OpenStack Nova; Hardware database; Puppet; Active Directory / LDAP]
  22. Puppet Configuration
      • Over 10,000 hosts in Puppet
      • 160 different hostgroups
      • Tool chain using
          • PuppetDB
          • Foreman
          • Git
      • Scaling issues resolved with the communities
  23. Monitoring - Flume, Elastic Search, Kibana
      [Diagram; labels: OpenStack infrastructure; Flume gateway; HDFS; elasticsearch; Kibana]
  24. [Diagram; labels: Microsoft Active Directory; CERN DB on Demand; CERN Network Database; Account mgmt system; Horizon; Keystone; Glance; Network; Compute; Scheduler; Cinder; Nova; Block Storage Ceph & NetApp; CERN Accounting; Ceilometer]
  25. Scaling Architecture Overview
      • Load Balancer: Geneva, Switzerland
      • Top Cell - controllers: Geneva, Switzerland
      • Child Cell: Geneva, Switzerland (controllers, compute-nodes)
      • Child Cell: Budapest, Hungary (controllers, compute-nodes)
  26. Status
      • Multi-data centre cloud in production since July 2013 (Geneva and Budapest) with nearly 1,000 users
      • Currently running OpenStack Havana
      • KVM and Hyper-V deployed
      • All configured automatically with Puppet
      • ~70,000 cores on ~3,000 servers
      • 3PB Ceph pool available for volumes, images and other physics storage
  27. The Agile Experience
  28. Cultural Barriers
  29. Agility and Elasticity Limits
      • Communities help to set good behaviour
      • Internal demonstrations build momentum
      • Finding the right speed is key
      • Keeping up with releases takes focus
      • Coping with legacy requires compromise
      • Travel budget needs significant increase!
  30. Next Steps: Scale with Physics
      • Scaling to >100,000 cores by 2015
      • Around 100 hypervisors per week with fixed staff
      • Deploying and configuring latest releases
      • Need to stay close … but not too close
      • Legacy systems retirement
      • Server consolidation
      • Home grown configuration and monitoring
      • Analytics of processor, disk and network
      • Focus on efficiency
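To give a feel for what the first two bullets imply together, here is a rough estimate of how long the stated install rate would take to reach the core target. It combines the figures on this slide with the ~70,000 cores on ~3,000 servers from slide 26, and it assumes new hypervisors have the same average core count as the existing fleet, which the slides do not state. All names are illustrative.

```python
import math

CURRENT_CORES = 70_000      # slide 26: ~70,000 cores
CURRENT_SERVERS = 3_000     # slide 26: ~3,000 servers
TARGET_CORES = 100_000      # this slide: >100,000 cores by 2015
HYPERVISORS_PER_WEEK = 100  # this slide: around 100 hypervisors per week


def weeks_to_target(current_cores, current_servers, target_cores, per_week):
    """Return (extra servers needed, weeks of installs) to reach target_cores."""
    # Assumption: new servers match the current average core density.
    cores_per_server = current_cores / current_servers
    extra_servers = math.ceil((target_cores - current_cores) / cores_per_server)
    return extra_servers, math.ceil(extra_servers / per_week)


EXTRA_SERVERS, WEEKS = weeks_to_target(
    CURRENT_CORES, CURRENT_SERVERS, TARGET_CORES, HYPERVISORS_PER_WEEK
)
print(f"{EXTRA_SERVERS} more servers, about {WEEKS} weeks at {HYPERVISORS_PER_WEEK}/week")
```

With these inputs the estimate is on the order of a quarter of a year of sustained installs, which is only feasible with fixed staff if deployment and configuration are fully automated.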
  31. Next Steps: Federated Clouds
      • CERN Private Cloud: 70K cores
      • ATLAS Trigger: 28K cores
      • CMS Trigger: 12K cores
      • IN2P3 Lyon
      • Brookhaven National Labs
      • NecTAR Australia
      • Public Cloud such as Rackspace
      • Many Others on Their Way
  32. Summary
      • Open source tools have successfully replaced CERN’s legacy fabric management system
      • Scaling to 100,000s of cores with OpenStack and Puppet is in sight
      • Cultural change to an Agile approach has required time and patience but is paying off
      Community collaboration needed to reach 400PB/year
  33. Questions?
      • Details at http://openstack-in-production.blogspot.fr
      • Previous presentations at http://information-technology.web.cern.ch/book/cern-private-cloud-user-guide/openstack-information
      • CERN code is at http://github.com/cernops
  36. http://www.eucalyptus.com/blog/2013/04/02/cy13-q1-community-analysis-%E2%80%94-openstack-vs-opennebula-vs-eucalyptus-vs-cloudstack
  38. Monitoring - Kibana
  39. Monitoring - Kibana
  41. Architecture Components
      [Diagram; labels in extraction order] rabbitmq - Keystone - Nova api - Nova conductor - Nova scheduler - Nova network - Nova cells - Glance api - Ceilometer agent-central - Ceilometer collector Controller - Flume - Nova compute - Ceilometer agent-compute Compute node - Flume - HDFS - Elastic Search - Kibana - MySQL - MongoDB - Glance api - Glance registry - Keystone - Nova api - Nova consoleauth - Nova novncproxy - Nova cells - Horizon - Ceilometer api - Cinder api - Cinder volume - Cinder scheduler rabbitmq Controller Top Cell Children Cells - Stacktach - Ceph - Flume
  42. Upgrade Strategy
      • Surely “OpenStack can’t be upgraded”
      • Our Essex, Folsom and Grizzly clouds were ‘tear-down’ migrations
          • Puppet managed VMs are typical Cattle cases – re-create
          • User VMs snapshot, download image and upload to new instance
          • One month window to migrate
      • Users of production services expect more
          • Physicists accept not creating/changing VMs for a short period
          • Running VMs must not be affected
  43. Phased Migration
      • Migrated by Component
          • Choose an approach (online with load balancer, offline)
          • Spin up ‘teststack’ instance with production software
          • Clone production databases to test environment
          • Run through upgrade process
          • Validate existing functions, Puppet configuration and monitoring
      • Order by complexity and need
          • Ceilometer, Glance, Keystone
          • Cinder, Client CLIs, Horizon
          • Nova
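The ordering on this slide can be sketched as a simple plan generator. This is an illustration only, not CERN's tooling: treating each line of components as one phase, and applying every listed step to every component, are assumptions, and the names below are hypothetical.

```python
# Component order taken from the slide; one phase per line is an assumption.
PHASES = [
    ["Ceilometer", "Glance", "Keystone"],
    ["Cinder", "Client CLIs", "Horizon"],
    ["Nova"],
]

# Per-component steps, paraphrased from the slide.
STEPS = [
    "choose an approach (online with load balancer, or offline)",
    "spin up a 'teststack' instance with production software",
    "clone production databases to the test environment",
    "run through the upgrade process",
    "validate existing functions, Puppet configuration and monitoring",
]


def migration_plan(phases, steps):
    """Return (phase number, component, step) tuples in execution order."""
    return [
        (number, component, step)
        for number, phase in enumerate(phases, start=1)
        for component in phase
        for step in steps
    ]


PLAN = migration_plan(PHASES, STEPS)
for number, component, step in PLAN:
    print(f"phase {number}: {component}: {step}")
```

Putting Nova last reflects the slide's "order by complexity and need": the simpler services are upgraded and validated before the component that running VMs depend on most.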
  44. Upgrade Experience
      • No significant outage of the cloud
      • During upgrade window, creation not possible
      • Small incidents (see blog for details)
      • Puppet can be enthusiastic! - we told it to be :-)
      • Community response has been great
      • Bugs fixed and points are in Juno design summit
      • Rolling upgrades in Icehouse will make it easier
  45. Duplication and Divergence
      Service Silos: Network, Hardware, Facilities, Storage, Compute, Windows, Web, Database, Custom
      Functional Layers: Network, Hardware, Facilities, Infrastructure as a Service, Platform as a Service, Storage, Compute, Windows
  46. Service Models
      • Pets are given names like pussinboots.cern.ch
      • They are unique, lovingly hand raised and cared for
      • When they get ill, you nurse them back to health
      • Cattle are given numbers like vm0042.cern.ch
      • They are almost identical to other cattle
      • When they get ill, you get another one
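The pets and cattle distinction maps directly onto how a failure is handled. The sketch below is a toy illustration: the hostname pattern is a guess based on the single example vm0042.cern.ch, and the function names are hypothetical rather than anything CERN runs.

```python
import re

# Hypothetical pattern: "vm" followed by digits, modelled on vm0042.cern.ch.
CATTLE_NAME = re.compile(r"^vm\d+\.cern\.ch$")


def service_model(hostname: str) -> str:
    """Classify a host as 'cattle' (numbered) or 'pet' (individually named)."""
    return "cattle" if CATTLE_NAME.match(hostname) else "pet"


def on_failure(hostname: str) -> str:
    """Cattle are replaced with a fresh instance; pets are repaired in place."""
    return "replace" if service_model(hostname) == "cattle" else "repair"


for host in ("pussinboots.cern.ch", "vm0042.cern.ch"):
    print(f"{host}: {service_model(host)} -> {on_failure(host)}")
```

In practice the decision would come from how a service is built (for example, whether it is fully Puppet-managed) rather than from its name; the name is used here only because it is the detail the slide provides.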
