Tim Bell 
@noggin143 
tim.bell@cern.ch 
Answering Fundamental Questions… 
How to explain that particles have mass? 
Brout-Englert-Higgs boson 
Answering Fundamental Questions… 
Where has all the anti-matter gone? 
Answering Fundamental Questions… 
What is the mass of the Universe made of? 
We can only see 5% of its estimated mass 
~25% Dark matter? 
~70% Dark energy? 
Answering Fundamental Questions… 
Why is Gravity so weak? 
Extra dimensions? 
Gravitons? 
Collisions 
A Big Data Challenge 
In 2014, 
• ~ 100PB archive with additional 27PB/year 
• ~ 11,000 servers 
• ~ 75,000 disk drives 
• ~ 45,000 tapes 
• Data should be kept for at least 20 years 
In 2015, we start the accelerator again 
• Upgrade to double the energy of the beams 
• Expect a significant increase in data rate 
LHC data growth 
• Estimating 400PB/year by 2023 (a rough projection of the cumulative archive is sketched after the chart below) 
• Compute needs expected to be around 50x current levels if budget available 
[Chart: projected data volume in PB per year for ALICE, ATLAS, CMS and LHCb, from Run 1 (2010) through Run 2 (2015) and Run 3 (2018) to Run 4 (2023), on a scale from 0 to 450 PB/year.]
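
To make these growth figures concrete, here is a minimal back-of-the-envelope sketch (not from the original slides) that projects the cumulative archive size from the numbers quoted: roughly 100 PB archived by 2014, ingest rising from about 27 PB/year towards the estimated 400 PB/year by 2023, and data retained for at least 20 years. The linear ramp between the two rates is purely an assumption for illustration.

```python
# Back-of-the-envelope projection of the LHC archive size (illustrative only).
# Assumptions: ~100 PB archived by 2014, ingest growing linearly from
# ~27 PB/year (2014) to ~400 PB/year (2023), then flat; nothing is deleted.

def ingest_pb_per_year(year):
    """Assumed ingest rate: linear ramp between the 2014 and 2023 figures."""
    if year <= 2014:
        return 27.0
    if year >= 2023:
        return 400.0
    return 27.0 + (400.0 - 27.0) * (year - 2014) / (2023 - 2014)

archive_pb = 100.0  # approximate archive size at the start of 2014
for year in range(2014, 2035):
    archive_pb += ingest_pb_per_year(year)
    print(f"{year}: ~{archive_pb:,.0f} PB (~{archive_pb / 1000:.1f} EB)")
```

Under these assumptions the archive passes the exabyte mark in the early 2020s, which is the "heading towards exabytes" point made in the speaker notes.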
The CERN Meyrin Data Centre 
http://goo.gl/maps/K5SoG 
Good News, Bad News 
• Additional data centre in Budapest now online 
• Increasing use of facilities as data rates increase 
But… 
• Staff numbers are fixed, no more people 
• Materials budget decreasing, no more money 
• Legacy tools are high maintenance and brittle 
• User expectations are for fast self-service 
We are not Special! 
• Challenge the must-have lists at project start 
• Are those requirements really justified? 
• Accumulating technical debt stifles agility 
• There is no Moore’s Law for people 
• Automation needs APIs, not documented procedures (see the sketch after this list) 
• Find open source communities and contribute 
• Understand ethos and architecture 
• Stay mainstream, stay up to date 
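
To illustrate the "APIs, not documented procedures" point, here is a minimal sketch of self-service provisioning through the OpenStack compute API with python-novaclient, roughly as it looked in the Icehouse era. It is not CERN's actual tooling; the credentials, endpoint, image and flavor names are placeholders.

```python
# Minimal sketch: replacing a written procedure with an API call.
# All names, credentials and the endpoint below are placeholders.
from novaclient import client

nova = client.Client(
    "2",                                   # compute API version
    "demo_user", "demo_password",          # placeholder credentials
    "demo_project",                        # placeholder project/tenant
    auth_url="https://keystone.example.org:5000/v2.0",
)

# Look up a flavor and an image by name, then boot a VM; no ticket required.
flavor = nova.flavors.find(name="m1.small")
image = nova.images.find(name="base-image")      # hypothetical image name
server = nova.servers.create(name="analysis-node-01",
                             image=image.id,
                             flavor=flavor.id)
print("Requested server:", server.id, server.status)
```

Anything a sysadmin used to do by following a wiki page can be scripted against the same API, which is what lets capacity grow faster than the team.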
CERN Tool Chain 
Status 
• Started project in 2011 with Cactus 
• In production since July 2013 with Grizzly 
• 2 upgrades without major incidents or VM downtime 
• 4 OpenStack Icehouse clouds at CERN 
• Largest is ~70,000 cores on ~3,000 servers 
• 3 other instances with 45,000 cores total 
• Expected to pass 150,000 cores in total by Q1 2015 (one way to track this is sketched below) 
• All non-CERN specific code is upstream 
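
The core counts above span several clouds; one way to keep track of them is to query each cloud's hypervisor statistics through the compute API. A minimal sketch with python-novaclient follows; the cloud list, endpoints and credentials are placeholders, not CERN's actual monitoring.

```python
# Minimal sketch: summing vCPUs across several OpenStack clouds.
# The cloud list, credentials and endpoints are placeholders.
from novaclient import client

clouds = {
    "main":  "https://keystone-main.example.org:5000/v2.0",
    "other": "https://keystone-other.example.org:5000/v2.0",
}

total_cores = 0
for name, auth_url in clouds.items():
    nova = client.Client("2", "monitor_user", "monitor_password",
                         "monitor_project", auth_url=auth_url)
    stats = nova.hypervisors.statistics()   # aggregate over all hypervisors
    print(f"{name}: {stats.vcpus} cores on {stats.count} hypervisors")
    total_cores += stats.vcpus

print("Total cores across clouds:", total_cores)
```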
Nova Cells Scaling Architecture 
[Diagram: a load balancer in Geneva, Switzerland fronts the top-cell controllers in Geneva; child cells in Geneva, Switzerland and Budapest, Hungary each have their own controllers and compute nodes.]
Onwards the Federated Clouds 
• CERN Private Cloud: 72K cores 
• ATLAS Trigger: 28K cores 
• ALICE Trigger: 12K cores 
• CMS Trigger: 12K cores 
• IN2P3 Lyon 
• Brookhaven National Labs 
• NecTAR Australia 
• Public cloud such as Rackspace 
• Many others on their way 
Hooke’s Law for Cultural Change 
• Under load, an organisation can extend in proportion to the external force 
• Too much stretching leads to permanent deformation (see the equation below) 
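
For reference, the physics behind the metaphor (a standard textbook formula, not from the slides): within the elastic limit the extension is proportional to the applied force, and beyond that limit the deformation becomes permanent.

```latex
% Hooke's law; x_e denotes the elastic limit of the material.
F = k\,x \qquad \text{for } x \le x_e
```

Past x_e the extension no longer reverses when the force is removed, which is the "permanent deformation" the slide warns about when an organisation is stretched too hard.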
The Agile Experience 
Cultural Barriers 
Standing on the Shoulders of Giants 
Thanks to all Community Members! 
• Details at http://openstack-in-production.blogspot.fr 
• CERN code is upstream or at http://github.com/cernops 
• CERN & Industry Collaboration at http://cern.ch/openlab 


Editor's Notes

  • #3 Hello everyone… it's wonderful to be here for our first European summit. Thanks for the opportunity to share the experiences of OpenStack at CERN. I've been working with the OpenStack community since 2011 (I think the Diablo summit in Boston was my first) and have been an elected member of the management board and user committee since then. It's great to see the summit come to Europe. CERN is the European Organization for Nuclear Research and the home of the Large Hadron Collider. We are funded by the taxpayers of the European member states, with an annual budget of around 1B USD, to provide facilities for 11,000 physicists worldwide. The goal is to understand what the universe is made of and how it works. My job is to give them the computing infrastructure they need to do this.
  • #4 A theory from the 1960s, proposed by Professors Higgs and Englert, suggested that a boson, a force-carrying particle, could be responsible for why we have mass. 50 years later, the LHC was able to show its existence experimentally in 2012, and the Nobel prize was awarded in 2013: the physics equivalent of landing a man on the moon. This discovery provided one of the pieces of the jigsaw called the Standard Model, but we now need to determine the exact properties of the Higgs boson… its mass raises some interesting questions about the nature and stability of the universe.
  • #5 The Big Bang was a spontaneous explosion around 13.8 billion years ago which should have produced equal amounts of matter and anti-matter, yet the majority of the world around us is matter. At CERN, we do produce small amounts of anti-matter to study, and we also participate in other experiments such as AMS, a bus-sized detector on the side of the space station.
  • #6 When we look out into the universe and see how it is expanding, and compare that to what we can see, stars and planets, we find that around 95% of the universe is missing. Some theories suggest dark matter and dark energy, which we cannot observe but which are needed to match theory with reality. How can we find something that we cannot observe?
  • #7 The other forces in the Standard Model, such as the electromagnetic, strong and weak forces, are much stronger than gravity. Why is it so weak, yet able to operate over such large distances? One theory suggests that gravitons could be the key; however, they would likely be created and then disappear rapidly into other dimensions, the evidence for them primarily being a loss of energy, which must be conserved according to basic physics laws.
  • #8 To try to answer these questions, we have built one of the largest experiments on the planet: the Large Hadron Collider, 100 metres underground, spanning the border between Switzerland and France. The protons cross the border 11,000 times a second at 3 metres/second below the speed of light, which is faster than many international border controls…
  • #9 Over 1,600 magnets were lowered down shafts and cooled to -271 °C to become superconducting. Two beam pipes, with a vacuum 10 times lower than that of the moon, are being cooled at this moment, ready to start the experiment again after 18 months of work to upgrade the accelerator.
  • #10 At 4 places around the ring, we collide the beams and observe the results using detectors which are like digital cameras, except that they weigh 7,000 tonnes, the same as the Eiffel Tower, contain 3,000 km of cables and are the height of Notre Dame, housed in the largest man-made caverns in the world. These cameras have 100 million pixels and take 40 million pictures a second.
  • #11 Colliding high-energy beams allows us to probe the nature of matter. 600 million collisions per second produce 1 petabyte of data every second. This is filtered by massive computing farms of thousands of servers down to only 5-25 Gbytes/s. The highest data rates are when we collide lead ions, around 200 protons and neutrons each, to create quark-gluon plasma, 100,000 times hotter than the centre of the sun, which was the material in the universe just after the Big Bang.
  • #12 These collisions produce data, lots of it: over 100PB currently, with data rates of up to 27 PB/year during the past years, expected to increase significantly in the next run in 2015. This currently uses 11,000 servers and 45,000 tapes, and 75,000 disk drives mean the hardware repair teams are kept busy. The data must be kept for at least 20 years, so that's heading towards exabytes, however…
  • #13 We're just finishing Run 1… Run 2 starts in spring 2015, and we're planning the upgrades for Runs 3 and 4… estimates are 400 PB/year by 2023 and a corresponding need for 50 times the current computing capacity. Given limited budgets, we hope for continued improvements in processor, memory, disk and tape technologies so that physics analysis does not become limited by computing resources.
  • #14 Recording and analysing the data takes a lot of computing power. The CERN computer centre was built in the 1970s for mainframes and Crays. Now running at 3.5 MW of power, the power density is limited, so we can't fill the racks. CERN itself uses around 120 MW of power, like a small town. Along with 80,000 visitors, the Google Street View car came through last year, if you want to look in more detail.
  • #15 The CERN main entrance is in Switzerland, but the data centre is actually in France; we walk across the road to the restaurant for lunch, which is in Switzerland… So while much of the CERN OpenStack cloud is in France, the rest is further away…
  • #16 We could not ask for more power to the computer centre, so we looked for capacity elsewhere. After a competitive tender across CERN member states, we chose Budapest in Hungary and built up a 2.7 MW facility there to cope with the LHC computing challenges for Run 2.
  • #17 With the new data centre in Budapest, we could now look at addressing the upcoming data increases, but there were a number of constraints. In the current economic climate, CERN cannot ask for additional staff to run the computer systems. At the same time, the budget for hardware is also restricted. Prices are coming down gradually, so we can get more for the same money, but we need to find ways to maximise the efficiency of the hardware. Our management tools were written in the 2000s, consist of on the order of 100,000 lines of Perl accumulated over 10 years, often written by students, and are in need of maintenance. Changes such as IPv6 or new operating systems would require major effort just to keep up. Finally, users expect a more responsive central IT service: their expectations are set by the services they use at home, and you don't have to fill out a ticket to get a Dropbox account, so why should you need to at work?
  • #18 We came up with a number of guiding principles. We took the approach that CERN was not special. Culturally, for a research organisation, this is a big challenge. Many continue to feel that our requirements would be best met by starting again from scratch, but with the modern requirements. In the past, we had extensive written procedures for sysadmins to execute, with lots of small tools to run. These were error-prone, and often people did not read the latest version before they performed the operation. We needed to find ways to scale the productivity of the team to match the additional servers. One of the highest people costs was the tooling. We had previously been constructing requirements lists, with detailed must-have needs for acceptance. Instead, we asked ourselves how the other big centres could run using these open source tools when we apparently had special requirements. Often, the root cause was that we did not understand the best way to use the tools, rather than that we were special. The maintenance cost of our tools was high: the skilled and experienced staff were spending more and more of their time on the custom code, so we took an approach of deploy rather than develop. This meant finding the open source tools that made sense for us and trying them out. Where we found something missing, we challenged it again and again. Finally, we would develop, in collaboration with the community, generalised solutions to the problems that can be maintained by the community afterwards. Long-term forking is not sustainable.
  • #20 Cactus was early, but we felt the project had long-term potential. Two years of clouds, snapshotting and moving to new versions. Grizzly was ready for our service levels: we can keep VMs live. All CERN patches are upstream unless there is a really CERN-specific change. Good for the community, and good for CERN, as our other HEP clouds follow the same evolution. We even have users going upstream to raise tickets and patches :-)
  • #21 HAProxy load balancers ensure high availability. Redundant controllers for the compute nodes. Cells are used by the largest sites, such as Rackspace and NeCTAR; they are the recommended configuration above 1,000 hypervisors.
  • #22 The trigger farms are the servers nearest the accelerator, which are not needed while the accelerator is shut down until 2015. Public clouds are interesting for burst load (such as in the run-up to a conference) or when prices drop, such as on the spot market. Private clouds allow universities and other research labs to collaborate in processing the LHC data.
  • #23 Even in a research-oriented environment like CERN, we've found tensions between the needs of different services: hostnames, software version pinning, automatic updates at weekends, ... There are no easy answers, especially in an organisation with a culture of academic freedom.
  • #24 So, we assembled a team made up of experienced service managers and new students. By freezing developments on legacy projects, we were able to make resources available, but only as long as we could rapidly implement new functions. Many of the staff had to do their 'day' jobs as well as work on the new implementations. Several effects: newcomers often had experience of the tools from university; people learnt very rapidly by following mailing lists, going to conferences and interacting with the community; contributions included governance, use cases and testing in addition to standard development contributions; and short-term staff saw major improvements in their post-CERN job prospects, as they left with very relevant skills.
  • #25 The agile approach is a major cultural change, and it is an ongoing process. To illustrate this, there are some characteristics to watch out for, which I show with extreme examples from Tolkien… Luckily, we never had characters like this at CERN. "Don't be hasty, let's go slowly": transformations such as this cannot be done in a reasonable time by incremental change, and running two parallel infrastructures was not compatible with staffing. Move away from silos: from a top-to-bottom model, application to hardware managed by a single team, to a layered model with shared budget and resources; quota is the new budget. Knowledge management responsibilities change: the guru who wrote the tool and trains others on how to use it is replaced by the outside community in which people participate. Everything can appear to be research if you start with a blank piece of paper. The server or application manager of 'precious' applications that need special handling and care has to be understood; some cases are inevitable, but many reflect non-technical aspects of the application or server management and may justify changes of process. The days of checking your server status lights are now past.
  • #26 Newton described his role in science as a dwarf standing on the shoulders of giants. Each scientist takes the results of the previous generation and builds new discoveries and inventions, which are then passed on. CERN's experiments, with thousands of collaborators working together, are building our understanding of the fundamental nature of the universe. The web started off with the simple line-mode browser built by a student at CERN and has evolved into the worldwide communication vehicle of today. Imagine a world without Mosaic and just the line-mode browser; imagine a world with Internet Explorer 5 and no Chrome or Firefox… Competition on a strong core encourages innovation. All of these are based on principles of meritocracy, transparent development processes, intellectual competition and the contribution of thousands of people towards a shared goal… Some activities are just too big for small teams to tackle… sounds familiar?