
20120524 cern data centre evolution v2

CHEP 2012 Presentation on CERN infrastructure evolution


  1. CERN Infrastructure Evolution
     Tim Bell, CERN IT Department
     CHEP, 24th May 2012
     CH-1211 Genève 23, Switzerland
     www.cern.ch/it
  2. Agenda
     • Problems to address
     • Ongoing projects
       – Data centre expansion
       – Configuration management
       – Infrastructure as a Service
       – Monitoring
     • Timelines
     • Summary
  3. Problems to address
     • CERN data centre is reaching its limits
     • IT staff numbers remain fixed but more computing capacity is needed
     • Tools are high maintenance and becoming increasingly brittle
     • Inefficiencies exist but root causes cannot be easily identified
  4. Data centre extension history
     • Studies in 2008 into building a new computer centre on the CERN Prevessin site
       – Too costly
     • In 2011, tender run across CERN member states for remote hosting
       – 16 bids
     • In March 2012, Wigner Institute in Budapest, Hungary selected
  5. Data Centre Selection
     • Wigner Institute in Budapest, Hungary
  6. Data Centre Layout & Ramp-up
  7. Data Centre Plans
     • Timescales
       – Prototyping in 2012
       – Testing in 2013
       – Production in 2014
     • This will be a “hands-off” facility for CERN
       – They manage the infrastructure and hands-on work
       – We do everything else remotely
     • Need to be reasonably dynamic in our usage
       – We will pay the power bill
     • Various possible models of usage – sure to evolve
       – The less specific the hardware installed there, the easier it is to change its function
       – Disaster recovery for key services in the Meyrin data centre becomes a realistic scenario
  8. Infrastructure Tools Evolution
     • We had to develop our own toolset in 2002
     • Nowadays,
       – CERN compute capacity is no longer leading edge
       – Many options are available for open source fabric management
       – We need to scale to meet the upcoming capacity increase
     • If there is a requirement that no open source tool provides, we should question the need
       – If we are the first to need it, contribute it back to the open source tool
  9. Where to Look?
     • Large community out there taking the “tool chain” approach whose scaling needs match ours: O(100k) servers and many applications
     • Become standard and join this community
  10. Bulk Hardware Deployment
      • Adapt processes for minimal physical intervention
      • New machines register themselves automatically into network databases and inventory
      • Burn-in test uses standard Scientific Linux and production monitoring tools
      • Detailed hardware inventory with serial number tracking to support accurate analysis of failures
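      The self-registration step above could look something like this minimal sketch; the inventory endpoint and payload fields are hypothetical illustrations, not the actual CERN network database interface.

      ```python
      # Sketch: a freshly installed node registering itself in an inventory
      # service. The endpoint URL and payload fields are hypothetical, not
      # the actual CERN network database interface.
      import socket
      import requests  # third-party HTTP client

      def register_node(inventory_url, serial_number):
          payload = {
              "hostname": socket.getfqdn(),  # node announces itself
              "serial": serial_number,       # tracked for failure analysis
          }
          resp = requests.post(inventory_url, json=payload)
          resp.raise_for_status()
          return resp.json()

      # Example call (hypothetical endpoint and serial number):
      # register_node("https://inventory.example.org/api/nodes", "SN-0042")
      ```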
  11. Current Configurations
      • Many, diverse applications (“clusters”) managed by different teams
      • … and 700+ other “unmanaged” Linux nodes in VMs that could benefit from a simple configuration system
  12. Job Adverts from Indeed.com
      [Chart: Indeed.com index of millions of worldwide job posts across thousands of job sites, comparing the trend for Puppet vs Quattor]
      • These are the sort of posts our departing staff will be applying for
  13. Configuration Management
      • Using off-the-shelf components (see the sketch after this slide)
        – Puppet – configuration definition
        – Foreman – GUI and data store
        – Git – version control
        – Mcollective – remote execution
      • Integrated with
        – CERN Single Sign On
        – CERN Certificate Authority
        – Installation server
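      To illustrate how such off-the-shelf components can be scripted, the sketch below lists hosts via Foreman's JSON REST API. The server URL and credentials are placeholders, and the exact response shape depends on the Foreman API version.

      ```python
      # Sketch: querying Foreman's REST API (served under /api) for the
      # hosts it knows about. URL and credentials are placeholders.
      import requests  # third-party HTTP client

      FOREMAN_URL = "https://foreman.example.org"  # placeholder instance

      def list_hosts(user, password):
          resp = requests.get(f"{FOREMAN_URL}/api/hosts", auth=(user, password))
          resp.raise_for_status()
          data = resp.json()
          # Older Foreman API versions return a bare list; newer ones wrap
          # the entries in a "results" key.
          return data.get("results", data) if isinstance(data, dict) else data
      ```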
  14. Status
  15. Different Service Models
      • Pets are given names like lsfmaster.cern.ch
        – They are unique, lovingly hand raised and cared for
        – When they get ill, you nurse them back to health
      • Cattle are given numbers like vm0042.cern.ch
        – They are almost identical to other cattle
        – When they get ill, you get another one
      • Future application architectures tend towards Cattle
      • … but users now need support for both modes of working
  16. Infrastructure as a Service
      • Goals
        – Improve repair processes with virtualisation
        – More efficient use of our hardware
        – Better tracking of usage
        – Enable remote management for the new data centre
        – Support potential new use cases (PaaS, Cloud)
        – Sustainable support model
      • At scale for 2015
        – 15,000 servers
        – 90% of hardware virtualised
        – 300,000 VMs needed
  17. OpenStack
      • Open source cloud software
      • Supported by 173 companies including IBM, RedHat, Rackspace, HP, Cisco, AT&T, …
      • Vibrant development community and ecosystem
      • Infrastructure as a Service at our scale
      • Started in 2010 but maturing rapidly
  18. OpenStack @ CERN
      [Diagram: Horizon dashboard; Keystone identity service; Glance image registry; Nova compute with scheduler, compute nodes, volume and network]
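      As a minimal sketch of how these components fit together, the snippet below uses the python-novaclient library of that era: Keystone verifies the credentials, Glance supplies the image, and Nova schedules and boots the instance. The credentials, URLs, image and flavor names are placeholders, not CERN's actual configuration.

      ```python
      # Sketch: booting a "cattle"-style VM through the OpenStack services
      # shown above, with the 2012-era python-novaclient. All credentials
      # and names below are placeholders.
      from novaclient.v1_1 import client

      nova = client.Client(
          "username", "password", "tenant",  # checked against Keystone
          auth_url="https://keystone.example.org:5000/v2.0",
      )

      image = nova.images.find(name="SLC6-base")   # registered in Glance
      flavor = nova.flavors.find(name="m1.small")  # CPU/RAM/disk template

      # Nova's scheduler picks a hypervisor and boots the instance.
      server = nova.servers.create(name="vm0042", image=image, flavor=flavor)
      print(server.id, server.status)
      ```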
  19. Scheduling and Accounting
      • Multiple uses of IaaS
        – Server consolidation
        – Classic batch (single or multi-core)
        – Cloud VMs such as CERNVM
      • Scheduling options (see the sketch after this slide)
        – Availability zones for disaster recovery
        – Quality of service options to improve efficiency, such as build machines, public login services
        – Batch system scalability is likely to be an issue
      • Accounting
        – Use underlying services of IaaS and hypervisors for reporting and quotas
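      Continuing the hypothetical novaclient session from the earlier sketch, the snippet below illustrates the availability-zone option: asking Nova's scheduler to place a standby VM in a named zone. The zone name "remote-dc" is a placeholder, not CERN's actual zone layout.

      ```python
      # Continuation of the novaclient sketch above: request a specific
      # availability zone so a standby copy of a service can land in the
      # remote data centre. The zone name "remote-dc" is a placeholder.
      standby = nova.servers.create(
          name="vm0043",
          image=image,
          flavor=flavor,
          availability_zone="remote-dc",
      )
      ```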
  20. Monitoring in IT
      • >30 monitoring applications
        – Number of producers: ~40k
        – Input data volume: ~280 GB per day
      • Covering a wide range of different resources
        – Hardware, OS, applications, files, jobs, etc.
      • Application-specific monitoring solutions
        – Using different technologies (including commercial tools)
        – Sharing similar needs: aggregate metrics, get alarms, etc.
      • Limited sharing of monitoring data
        – Hard to implement complex monitoring queries
  21. Monitoring Architecture
      [Diagram: Lemon sensors feed producers, which publish to an Apollo messaging broker; consumers store the data in Hadoop for a storage and analysis engine, which serves operations tools, Splunk dashboards and APIs]
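      Apollo is a STOMP-capable broker, so a producer in an architecture like this could publish metrics along the following lines, using the stomp.py client. The broker address, destination name and metric fields are placeholder assumptions, not the actual CERN setup.

      ```python
      # Sketch: a monitoring producer publishing one metric to an Apollo
      # broker over STOMP (stomp.py 4.x API). Broker host, destination and
      # metric fields are placeholder assumptions.
      import json
      import time
      import stomp  # third-party STOMP client

      conn = stomp.Connection([("broker.example.org", 61613)])
      conn.connect("producer", "secret", wait=True)

      metric = {
          "host": "vm0042.cern.ch",
          "metric": "load_one",
          "value": 0.42,
          "timestamp": int(time.time()),
      }
      # Consumers subscribed to this destination feed the storage layer.
      conn.send(destination="/topic/monitoring.metrics", body=json.dumps(metric))
      conn.disconnect()
      ```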
  22. Current Tool Snapshot (Liable to Change)
      [Diagram of the tool chain: Puppet, Foreman, mcollective, yum, Jenkins, AIMS/PXE, JIRA, OpenStack Nova, git, SVN, Koji, Mock, Yum repo, Pulp, hardware database, Lemon, Puppet stored config DB]
  23. Timelines

      Year   What                   Actions
      2012                          Prepare formal project plan
                                    Establish IaaS in CERN Data Centre
                                    Monitoring implementation as per WG
                                    Migrate lxcloud users
                                    Early adopters to use new tools
      2013   LS1,                   Extend IaaS to remote Data Centre
             New Data Centre        Business continuity
                                    Migrate CVI users
                                    General migration to new tools with SLC6 and Windows 8
      2014   LS1 (to November)      Phase out legacy tools such as Quattor
  24. Conclusions
      • The new data centre provides an opportunity to expand the Tier-0 capacity
      • New tools are required to manage these systems and improve processes given fixed staff levels
      • Integrated monitoring is key to identifying inefficiencies, bottlenecks and usage
      • Moving to an open source tool chain has allowed a rapid proof of concept and will be a more sustainable solution
  25. Background Information
      • HEPiX Agile talks – http://cern.ch/go/99Ck
      • Tier-0 upgrade – http://cern.ch/go/NN98
      • Other information: get in touch – Tim.Bell@cern.ch
  26. Backup Slides
  27. CERN Data Centre Numbers

      Systems                    7,899    Hard disks                62,023
      Processors                14,972    Raw disk capacity (TiB)   62,660
      Cores                     64,623    Tape capacity (PiB)           47
      Memory (TiB)                 165    Ethernet 1Gb ports        16,773
      Racks                      1,070    Ethernet 10Gb ports          622
      Power consumption (kW)     2,345

      From http://sls.cern.ch/sls/service.php?id=CCBYNUM
  28. Computer Centre Capacity
  29. Capacity to end of LEP
  30. Foreman Dashboard
  31. Industry Context – Cloud
      • Cloud computing models are now standardising
        – Facilities as a Service – such as Equinix, Safehost
        – Infrastructure as a Service – Amazon EC2, CVI or lxcloud
        – Platform as a Service – Microsoft Azure or CERN Web Services
        – Software as a Service – Salesforce, Google Mail, Service-Now, Indico
      • Different customers want access to different layers
        – Both our users and the IT Service Managers
      [Diagram: layered stack – Applications / Platform / Infrastructure / Facilities]
  32. SLC 4 to SLC 5 migration
