
20181219 ucc open stack 5 years v3


Presentation to Big Data Zurich Conference in December 2018


  1. 1. Clouds at CERN: A 5-Year Perspective. Utility and Cloud Computing Conference, December 19, 2018. Tim Bell @noggin143. UCC 2018 2
  2. 2. About Tim • Responsible for Compute and Monitoring in CERN IT department • Elected member of the OpenStack Foundation management board • Member of the OpenStack user committee from 2013- 2015 UCC 2018 3
  3. 3. UCC 2018 4 CERN: a Worldwide collaboration. CERN’s primary mission: SCIENCE. Fundamental research on particle physics, pushing the boundaries of knowledge and technology
  4. 4. CERN: World’s largest particle physics laboratory UCC 2018 5 Image credit: CERN
  5. 5. UCC 2018 6 The Large Hadron Collider: LHC. 1232 dipole magnets, 15 metres and 35t EACH, 27km Image credit: CERN
  6. 6. Image credit: CERN. COLDER TEMPERATURES than outer space (120t He) UCC 2018 7 LHC: World’s Largest Cryogenic System (1.9 K)
  7. 7. Vacuum? • Yes UCC 2018 8 LHC: Highest Vacuum. 10^4 km of pipes at 10^-11 bar (~ the Moon) Image credit: CERN
  8. 8. UCC 2018 9 ATLAS, CMS, ALICE and LHCb: HEAVIER than the EIFFEL TOWER Image credit: CERN
  9. 9. UCC 2018 10 40 million pictures per second 1PB/s Image credit: CERN
  10. 10. About the CERN IT Department UCC 2018 11 Enable the laboratory to fulfill its mission - Main data centre on the Meyrin site - Wigner data centre in Budapest (since 2013) - Connected via three dedicated 100 Gbps links - Where possible, resources at both sites (plus disaster recovery) Drone footage of the CERN CC
  11. 11. Status: Service Level Overview UCC 2018 12
  12. 12. Outline UCC 2018 13 • Fabric Management before 2012 • The Agile Infrastructure (AI) Project • The three AI areas - Configuration Management - Monitoring - Resource provisioning • Review
  13. 13. CERN IT Tools up to 2011 (1) UCC 2018 14 • Developed in series of EU funded projects - 2001-2004: European DataGrid - 2004-2010: EGEE • Work package 4 – Fabric management: “Deliver a computing fabric comprised of all the necessary tools to manage a centre providing grid services on clusters of thousands of nodes.”
  14. 14. CERN IT Tools up to 2011 (2) UCC 2018 15 • The WP4 software was developed from scratch - Scale and experience needed for LHC Computing was special - Config’ mgmt, monitoring, secret store, service status, state mgmt, service databases, … LEMON – LHC Era Monitoring - client/server based monitoring - local agent with sensors - samples stored in a cache & sent to server - UDP or TCP, w/ or w/o encryption - support for remote entities - system administration toolkit - automated installation, configuration & management of clusters - clients interact with a configuration database (CMDB) & an installation infrastructure (AII) Around 8’000 servers managed!
  15. 15. 2012: A Turning Point for CERN IT UCC 2018 16 • EU projects finished in 2010: decreasing development and support • LHC compute and data requirements increasing - Moore’s law would help, but not enough • Staff would not grow with managed resources - Standardization & automation needed; current tools not apt • Other deployments had surpassed the CERN one - Mostly commercial companies like Google, Facebook, Rackspace, Amazon, Yahoo!, … - We were no longer special! Can we profit? [Chart: projected GRID, ATLAS, CMS, LHCb and ALICE compute needs from Run 1 to Run 4 vs. what we can afford, with a “we are here” marker at 2012] LS1 (2013) ahead; the next window for change would only open in 2019 …
  16. 16. UCC 2018 17 How we began … • Formed a small team of service managers from … - Large services (e.g. batch, plus) - Existing fabric services (e.g. monitoring) - Existing virtualization service • ... to define project goals - What issues do we need to address? - What forward looking features do we need? http://iopscience.iop.org/article/10.1088/1742-6596/396/4/042002/pdf
  17. 17. Agile Infrastructure Project Goals UCC 2018 18 New data centre support - Overcome limits of CC in Meyrin - Disaster recovery and business continuity - ‘Smart hands’ approach 1
  18. 18. Agile Infrastructure Project Goals UCC 2018 19 Sustainable tool support - Tools used at our scale need maintenance - Tools with a limited community require more time for newcomers to become productive, and the skills gained are less transferable afterwards 2
  19. 19. Agile Infrastructure Project Goals UCC 2018 20 Improve user response time - Reduce the resource provisioning time span (current virtualization service reached scaling limits) - Self-service kiosk 3
  20. 20. Agile Infrastructure Project Goals UCC 2018 21 Enable cloud interfaces - Experiments already started to use EC2 - Enable libraries such as Apache’s libcloud 4
  21. 21. Agile Infrastructure Project Goals UCC 2018 22 Precise monitoring and accounting - Enable timely monitoring for debugging - Showback usage to the cloud users - Consolidate accounting data for usage of CPU, network, storage … across batch, physical nodes and grid resources 5
  22. 22. Agile Infrastructure Project Goals UCC 2018 23 Improve resource efficiency - Adapt provisioned resources to services’ needs - Streamline the provisioning workflows (e.g. burn-in, repair or retirement) 6
  23. 23. Our Approach: Tool Chain and DevOps UCC 2018 24 • CERN’s requirements are no longer special! • A set of tools emerged when looking at other places • Small dedicated tools allowed for rapid validation & prototyping • Adapted our processes, policies and work flows to the tools! • Join (and contribute to) existing communities!
  24. 24. IT Policy Changes for Services UCC 2018 25 • Services shall be virtual … - Within reason - Exceptions are costly! • Puppet managed, and … • … monitored! - (Semi-)automatic with Puppet Decrease provisioning time Increase resource efficiency Simplify infrastructure mgmt Profit from others’ work Speed up deployment ‘Automatic’ documentation Centralized monitoring Integrated alarm handling
  25. 25. UCC 2018 26 Tools + Policies: Sounds simple! From tools to services is complex! - Integration w/ sec services? - Incident handling? - Request work flows? - Change management? - Accounting and charging? - Life cycle management? - … Image: Subbu Allamaraju
  26. 26. Public Procurement Timelines UCC 2018 27
  27. 27. Resource Provisioning: IaaS UCC 2018 28 • Based on OpenStack - Collection of open source projects for cloud orchestration - Started by NASA and Rackspace in 2010 - Grown into a global software community
  28. 28. Early Prototypes UCC 2018 29
  29. 29. The CERN Cloud Service UCC 2018 30 • Production since July 2013 - Several rolling upgrades since, now on Rocky - Many sub-services deployed • Spans two data centres - One region, one API entry point • Deployed using RDO + Puppet - Mostly upstream, patched where needed • Many sub-services run on VMs! - Bootstrapping
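For illustration, provisioning against such a cloud is a couple of API calls; below is a minimal self-service sketch using the openstacksdk Python client. The cloud entry, image, flavor and key names are placeholders, not CERN's actual configuration.

```python
# Minimal self-service provisioning sketch with openstacksdk (pip install openstacksdk).
# Assumes a cloud entry named "mycloud" in clouds.yaml; image/flavor/key names are examples only.
import openstack

conn = openstack.connect(cloud="mycloud")

# Boot a VM; wait=True blocks until the instance is ACTIVE.
server = conn.create_server(
    name="test-vm",
    image="CC7 - x86_64",          # placeholder image name
    flavor="m2.medium",            # placeholder flavor name
    key_name="mykey",              # pre-registered SSH keypair
    wait=True,
    auto_ip=False,
)
print(server.status, server.addresses)

# Clean up when no longer needed.
conn.delete_server(server.id, wait=True)
```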
  30. 30. UCC 2018 31
  31. 31. Agility in the Cloud UCC 2018 32 • Use case spectrum - Batch service (physics analysis) - IT services (built on each other) - Experiment services (build) - Engineering (chip design) - Infrastructure (hotel, bikes) - Personal (development) • Hardware spectrum - Processor archs (features, NUMA, …) - Core-to-RAM ratio (1:2, 1:3, 1:5, …) - Core-to-disk ratio (2x or 4x SSDs) - Disk layout (2, 3, 4, mixed) - Network (1/10GbE, FC, domain) - Location (DC, power) - SLC6, CC7, RHEL, Windows - …
  32. 32. What about our initial goals? UCC 2018 33 • The remote DC is seamlessly integrated - No difference from a provisioning PoV - Easily accessible by users - Local DC limits overcome (business continuity?) • Sustainable tools - Number of managed machines has multiplied - Good collaboration with upstream communities - Newcomers know the tools, and can use that knowledge afterwards • Provisioning time span is ~minutes - Was several months before - Self-service kiosk with automated workflows • Cloud interfaces - Good OpenStack adoption, EC2 support • Flexible monitoring infra - Automatic for simple cases - Powerful tool set for more complex ones - Accounting for local and grid resources • Increased resource efficiency - ‘Packing’ of services - Overcommit - Adapted to services’ needs - Quick draining & back filling So … 100% success?
  33. 33. Cloud Architecture Overview UCC 2018 34 • Top and child cells for scaling - API, DB, MQ, Compute nodes - Remote DC is set of cells • Nova HA only on top cell - Simplicity vs impact • Other projects global - Load balanced controllers - RabbitMQ clusters • Three Ceph instances - Volumes (Cinder), images (Glance), shares (Manila)
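To illustrate the Cinder part of this architecture (volumes backed by Ceph), here is a small hedged sketch with openstacksdk that creates a volume and attaches it to an existing server; the cloud entry, names and size are illustrative only.

```python
# Sketch: create a Cinder volume (Ceph-backed in this architecture) and attach it to a server.
# Names and sizes are illustrative; assumes a clouds.yaml entry called "mycloud"
# and an already-running instance named "test-vm".
import openstack

conn = openstack.connect(cloud="mycloud")

server = conn.get_server("test-vm")                      # look up an existing instance
volume = conn.create_volume(size=100, name="data-vol",   # 100 GB volume
                            wait=True)

# Attach the volume; the guest will see it as an extra block device.
conn.attach_volume(server, volume, wait=True)
```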
  34. 34. UCC 2018 35 HL-LHC SKA
  35. 35. Tech. Challenge: Scaling • OpenStack Cells provides composable units • Cells V1 – special custom developments • Cells V2 – now the standard deployment model • Broadcast vs targeted queries • Handling down cells • Quota • Academic and scientific instances push the limits • Now many enterprise clouds above 1000 hypervisors • CERN running 73 cells in production UCC 2018 36 https://www.openstack.org/analytics
  36. 36. Tech. Challenge: CPU Performance UCC 2018 37 • The benchmark results on full-node VMs were about 20% lower than those of the underlying host - Smaller VMs much better • Investigated various tuning options - KSM*, EPT**, PAE, pinning, … plus hardware type dependencies - Discrepancy down to ~10% between virtual and physical • Comparison with Hyper-V: no general issue - Loss w/o tuning ~3% (full-node), <1% for small VMs - … NUMA-awareness! *KSM on/off: beware of memory reclaim! **EPT on/off: beware of expensive page table walks!
  37. 37. CPU Performance: NUMA UCC 2018 38 • NUMA-awareness identified as most efficient setting • “EPT-off” side-effect - Small number of hosts, but very visible there • Use 2MB Huge Pages - Keep the “EPT off” performance gain with “EPT on”
  38. 38. NUMA roll-out UCC 2018 39 • Rolled out on ~2’000 batch hypervisors (~6’000 VMs) - Huge page allocation as boot parameter → reboot - VM NUMA awareness as flavor metadata → delete/recreate (see the sketch below) • Cell-by-cell (~200 hosts): - Queue-reshuffle to minimize resource impact - Draining & deletion of batch VMs - Hypervisor reconfiguration (Puppet) & reboot - Recreation of batch VMs • Whole update took about 8 weeks - Organized between batch and cloud teams - No performance issue observed since • VM layout, overhead before → after: 4x 8 cores: 8%; 2x 16: 16%; 1x 24: 20% → 5%; 1x 32: 20% → 3%
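The flavor-metadata step mentioned above corresponds to standard Nova extra specs; a hedged sketch of setting them via the openstack CLI from Python is shown below. The flavor name is hypothetical and the exact values used in production may differ.

```python
# Sketch: tag a flavor so Nova exposes a guest NUMA topology and backs it with 2 MB huge pages.
# "m2.xlarge.numa" is a hypothetical flavor name; requires admin credentials in the environment.
import subprocess

flavor = "m2.xlarge.numa"
subprocess.run(
    ["openstack", "flavor", "set",
     "--property", "hw:numa_nodes=2",        # expose two NUMA nodes to the guest
     "--property", "hw:mem_page_size=2MB",   # back guest RAM with 2 MB huge pages
     flavor],
    check=True,
)
# Existing VMs must be deleted and recreated to pick up the new topology,
# matching the cell-by-cell rollout described on this slide.
```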
  39. 39. Tech. Challenge: Underused resources UCC 2018 40
  40. 40. VM Expiry UCC 2018 41 • Each personal instance will have an expiration date • Set shortly after creation and evaluated daily • Configured to 180 days, renewable • Reminder mails starting 30 days before expiration
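A minimal sketch of what such a daily expiry evaluation could look like, assuming the expiration date is stored as instance metadata; the metadata key and the deletion/mail handling below are hypothetical, only the 30-day reminder window comes from the slide.

```python
# Hypothetical daily expiry check for personal instances.
# Assumes each VM carries an "expire-at" metadata entry (ISO date); not the actual CERN implementation.
import datetime
import openstack

REMINDER_DAYS = 30            # start reminders 30 days before expiration
conn = openstack.connect(cloud="mycloud")
today = datetime.date.today()

for server in conn.compute.servers(all_projects=True):
    expire_raw = server.metadata.get("expire-at")   # hypothetical metadata key
    if not expire_raw:
        continue
    expires = datetime.date.fromisoformat(expire_raw)
    if expires <= today:
        print(f"{server.name}: expired, deleting")
        conn.compute.delete_server(server.id)
    elif (expires - today).days <= REMINDER_DAYS:
        print(f"{server.name}: expires on {expires}, send reminder mail to owner")
```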
  41. 41. Expiry results UCC 2018 42 • Results exceeded expectations • Expired • >1000 VMs • >3000 cores
  42. 42. Tech. Challenge: Bare Metal UCC 2018 43 • VMs not suitable for all of our use cases - Storage and database nodes, HPC clusters, bootstrapping, critical network equipment or specialised network setups, precise/repeatable benchmarking for s/w frameworks, … • Complete our service offerings - Physical nodes (in addition to VMs and containers) - OpenStack UI as the single pane of glass • Simplify hardware provisioning workflows - For users: openstack server create/delete - For the procurement & h/w provisioning team: initial on-boarding, server re-assignments • Consolidate accounting & bookkeeping - Resource accounting input will come from fewer sources - Machine re-assignments will be easier to track
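As an illustration of the "single pane of glass" point, openstacksdk also exposes the bare-metal (Ironic) API behind the same credentials used for VMs; a hedged sketch, with the cloud name as a placeholder:

```python
# Sketch: inspect bare-metal (Ironic) nodes through the same OpenStack credentials
# used for VMs, illustrating the "single pane of glass" idea. Cloud name is a placeholder.
import openstack

conn = openstack.connect(cloud="mycloud")

# List physical nodes and their provisioning state (e.g. available, active, cleaning).
for node in conn.baremetal.nodes(details=True):
    print(node.name, node.provision_state, node.power_state)

# End users still just run "openstack server create" against a bare-metal flavor;
# Nova then schedules the request onto one of these Ironic nodes.
```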
  43. 43. Adapt the Burn-In process • “Burn-in” before acceptance - Compliance with the technical spec (e.g. performance) - Find failed components (e.g. broken RAM) - Find systematic errors (e.g. bad firmware) - Provoke early failures through stress - Tests include - CPU: burnK7, burnP6, burnMMX (cooling) - RAM: memtest; Disk: badblocks - Network: iperf(3) between pairs of nodes - automatic node pairing (see sketch below) - Benchmarking: HEPSpec06 (& fio) - derivative of SPEC06 - we buy total compute capacity (not newest processors) UCC 2018 44
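A rough sketch of the automatic node pairing for the network test; the hostnames, SSH access and exact iperf3 invocation are assumptions for illustration, not the production burn-in tooling.

```python
# Sketch: pair up freshly delivered nodes and run iperf3 between each pair over SSH.
# Hostnames and SSH access are assumptions; the real burn-in workflow is more elaborate.
import subprocess
import time

nodes = ["node001", "node002", "node003", "node004"]   # delivery batch (example names)
pairs = list(zip(nodes[0::2], nodes[1::2]))            # simple automatic pairing

for server, client in pairs:
    # Start an iperf3 server on one node of the pair (-1: handle one test, then exit).
    subprocess.Popen(["ssh", server, "iperf3", "-s", "-1"])
    time.sleep(2)                                      # give the server a moment to listen
    # Run the client on the other node and capture the throughput report.
    result = subprocess.run(["ssh", client, "iperf3", "-c", server, "-t", "30"],
                            capture_output=True, text=True, check=True)
    print(f"{client} -> {server}:\n{result.stdout}")
```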
  44. 44. Exploiting cloud services for burn-in UCC 2018 45
  45. 45. Tech. Challenge: Containers UCC 2018 46 An OpenStack API Service that allows creation of container clusters ● Use your OpenStack credentials, quota and roles ● You choose your cluster type ● Multi-Tenancy ● Quickly create new clusters with advanced features such as multi-master ● Integrated monitoring and CERN storage access ● Making it easy to do the right thing
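For illustration, creating such a cluster is a single call against the Magnum API; a hedged sketch using the standard CLI, where the template, counts and cluster name are placeholders:

```python
# Sketch: create a Kubernetes cluster via OpenStack Magnum using the standard CLI.
# Template and cluster names are placeholders; requires python-magnumclient installed.
import subprocess

subprocess.run(
    ["openstack", "coe", "cluster", "create",
     "--cluster-template", "kubernetes-1.x",   # placeholder template name
     "--master-count", "1",
     "--node-count", "3",
     "my-cluster"],
    check=True,
)

# Once the cluster is CREATE_COMPLETE, fetch its kubeconfig:
subprocess.run(["openstack", "coe", "cluster", "config", "my-cluster"], check=True)
```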
  46. 46. Scale Testing using Rally • An OpenStack benchmarking tool • Easily extended via plugins • Test results as HTML reports • Used by many projects • Context: set up the environment • Scenario: run the benchmark • Recommended for a production service, to verify that the service behaves as expected at all times UCC 2018 47 [Diagram: Rally driving a Kubernetes cluster (pods, containers) and producing a Rally report]
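As a hedged example of such a verification task, the snippet below writes a small Rally task using the generic NovaServers.boot_and_delete_server scenario (not the CERN-specific container scenarios) and starts it; flavor and image names are placeholders.

```python
# Sketch: a minimal Rally task to continuously verify that VM boot/delete works,
# then start it with the Rally CLI. Flavor/image names are placeholders;
# requires rally / rally-openstack with a configured deployment.
import json
import subprocess

task = {
    "NovaServers.boot_and_delete_server": [{
        "args": {"flavor": {"name": "m2.small"}, "image": {"name": "CC7 - x86_64"}},
        "runner": {"type": "constant", "times": 20, "concurrency": 5},
        "context": {"users": {"tenants": 1, "users_per_tenant": 1}},
    }]
}

with open("boot_and_delete.json", "w") as f:
    json.dump(task, f, indent=2)

subprocess.run(["rally", "task", "start", "boot_and_delete.json"], check=True)
# An HTML report can then be produced with: rally task report --out report.html
```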
  47. 47. First Attempt – 1M requests/sec • 200 Nodes • Found multiple limits • Heat orchestration scaling • Authentication caches • Volume deletion • Site services UCC 2018 48
  48. 48. Second Attempt – 7M requests/sec • Fixes applied and scaled to 1000 Nodes UCC 2018 49
     Cluster Size (Nodes) | Concurrency | Deployment Time (min)
     2                    | 50          | 2.5
     16                   | 10          | 4
     32                   | 10          | 4
     128                  | 5           | 5.5
     512                  | 1           | 14
     1000                 | 1           | 23
  49. 49. Tech. Challenge: Meltdown UCC 2018 50 • In January 2018, a security vulnerability was disclosed, requiring a new kernel everywhere • Staged campaign • 7 reboot days, 7 tidy-up days • By availability zone • Benefits • Automation now in place to reboot the cloud if needed - 33,000 VMs on 9,000 hypervisors • Latest QEMU and RBD user code on all VMs • Then L1TF came along • And we had to do it all again......
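A sketch of how such a staged, per-availability-zone campaign can be driven from Python using the standard openstack CLI; the actual host reboot step is site-specific and left as a placeholder.

```python
# Sketch: group hypervisors by availability zone for a staged reboot campaign.
# Uses the standard openstack CLI; the reboot itself is site-specific and only hinted at.
from collections import defaultdict
import json
import subprocess

out = subprocess.run(
    ["openstack", "compute", "service", "list", "-f", "json"],
    capture_output=True, text=True, check=True).stdout

by_zone = defaultdict(list)
for svc in json.loads(out):
    if svc["Binary"] == "nova-compute":
        by_zone[svc["Zone"]].append(svc["Host"])

for zone, hosts in by_zone.items():
    print(f"Zone {zone}: {len(hosts)} hypervisors in this reboot slot")
    for host in hosts:
        # Keep new VMs off the host during its maintenance window.
        subprocess.run(["openstack", "compute", "service", "set", "--disable",
                        "--disable-reason", "kernel reboot campaign",
                        host, "nova-compute"], check=True)
        # reboot_host(host)  # placeholder: actual reboot via site tooling (SSH/IPMI)
```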
  50. 50. UCC 2018 51 [Timeline: LHC schedule from 2009 to ~2030: First run, LS1, Second run, LS2, Third run, LS3, HL-LHC Run 4 from ~2026] • Significant part of cost comes from global operations • Even with technology increase of ~15%/year, we still have a big gap if we keep trying to do things with our current compute models • Raw data volume increases significantly for High Luminosity LHC
  51. 51. Commercial Clouds UCC 2018 52
  52. 52. Non-Technical Challenges (1) UCC 2018 53 • Agile Infrastructure Paradigm Adoption - ‘VMs are slower than physical machines.’ - ‘I need to keep control on the full stack.’ - ‘This would not have happened with physical machines.’ - ‘It’s the cloud, so it should be able to do X!’ - ‘Using a config’ management tool is too dangerous!’ - ‘They are my machines’
  53. 53. Non-Technical Challenges (2) UCC 2018 54 • Agility can bring great benefits … • … but mind (adapted) Hooke’s Law! - Avoid irreversible deformations • Ensure the tail is moving as well as the head - Application support - Cultural changes - Workflow adoption - Open source community culture can help
  54. 54. Non-Technical Challenges (3) • Contributor License Agreements • Patches needed, but merge/review times can be long • Regular staff changes limit karma • Need to be a polyglot • Python, Ruby, Go, … and legacy Perl etc. • Keep riding the release wave • Avoid end-of-life scenarios UCC 2018 55
  55. 55. Ongoing Work Areas • Spot Market / Pre-emptible instances • Software Defined Networking • Regions • GPUs • Containers on Bare Metal • … UCC 2018 56
  56. 56. Summary UCC 2018 57 Positive results 5 years into the project! - LHC needs met without additional staff - Tools and workflows widely adopted and accepted - Many technical challenges were mastered, with fixes contributed back upstream - Integration with open source communities successful - Use of common tools increased CERN’s attractiveness to talent Further enhancements in function & scale needed for HL-LHC
  57. 57. Further Information • CERN information outside the auditorium • Jobs at CERN – wide range of options • http://jobs.cern • CERN blogs • http://openstack-in-production.blogspot.ch • https://techblog.web.cern.ch/techblog/ • Recent Talks at OpenStack summits • https://www.openstack.org/videos/search?search=cern • Source code • https://github.com/cernops and https://github.com/openstack UCC 2018 58
  58. 58. UCC 2018 59
  59. 59. Agile Infrastructure Core Areas UCC 2018 61 • Resource provisioning (IaaS) - Based on OpenStack • Centralized Monitoring - Based on Collectd (sensor) + ‘ELK’ stack • Configuration Management - Based on Puppet
  60. 60. Configuration Management UCC 2018 62 • Client/server architecture - ‘agents’ running on hosts plus horizontally scalable ‘masters’ • Desired state of hosts described in ‘manifests’ - Simple, declarative language - ‘resource’ basic unit for system modeling, e.g. package or service • ‘agent’ discovers system state using ‘facter’ - Sends current system state to masters • Master compiles data and manifests into ‘catalog’ - Agent applies catalog on the host
  61. 61. Status: Config’ Management (1) UCC 2018 63 [Dashboard: managed hosts are virtual and physical, in the private and public cloud; ‘base’ is what every Puppet node gets; catalog compilations are spread out; change counts include dev changes; committer counts refer to the number of Puppet code committers]
  62. 62. Status: Config’ Management (2) UCC 2018 64
  63. 63. Status: Config’ Management (3) UCC 2018 65 • Changes to QA are announced publicly • QA duration: 1 week • All Service Managers can stop a change!
  64. 64. Monitoring: Scope UCC 2018 66 Data Centre Monitoring • Two DCs at CERN and Wigner • Hardware, O/S, and services • PDUs, temp sensors, … • Metrics and logs Experiment Dashboards - WLCG Monitoring - Sites availability, data transfers, job information, reports - Used by WLCG, experiments, sites and users
  65. 65. UCC 2018 67 Status: (Unified) Monitoring (1) • Offering: monitor, collect, aggregate, process, visualize, alarm … for metrics and logs! • ~400 (virtual) servers, 500 GB/day, 1B docs/day - Monitoring data management from CERN IT and WLCG - Infrastructure and tools for CERN IT and WLCG • Migrations ongoing (double maintenance) - CERN IT: from the Lemon sensor to collectd - WLCG: from the former infrastructure, tools, and dashboards
  66. 66. Status: (Unified) Monitoring (2) UCC 2018 68 [Architecture: data sources (FTS, Rucio, XRootD, jobs, Lemon metrics, syslog, app logs, DB, HTTP feeds) feed via AMQ/Flume log and metric gateways into a Kafka cluster for buffering; processing covers data enrichment, data aggregation and batch processing; storage & search in HDFS, Elasticsearch and others (InfluxDB); data access via CLI, API, user views and user jobs] Today: > 500 GB/day, 72h buffering
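As a rough illustration of the transport layer feeding the Kafka buffer, here is a minimal producer sketch using kafka-python; the broker addresses, topic name and document schema are placeholders, not the actual CERN setup.

```python
# Sketch: publish a metric document into the Kafka buffering layer (kafka-python).
# Broker addresses, topic name and document schema are placeholders, not the CERN ones.
import json
import time
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["kafka01:9092", "kafka02:9092"],
    value_serializer=lambda doc: json.dumps(doc).encode("utf-8"),
)

metric = {
    "producer": "collectd",          # source of the metric
    "host": "hypervisor042",
    "metric": "cpu.percent.user",
    "value": 12.3,
    "timestamp": int(time.time()),
}

producer.send("monitoring-metrics", metric)   # placeholder topic name
producer.flush()                              # ensure delivery before exiting
```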
