What HPC can learn from DevOps?

http://walidshaari.blogspot.com/2016/12/devops-and-traditional-hpc.html

Cloud, Web, and Big Data operations and the DevOps mindset are rapidly changing the Internet, IT, and enterprise services and applications scene. What can the HPC community learn from these technologies, processes, and culture, and from the IT unicorns in the lead (Google, Facebook, Twitter, LinkedIn, and Etsy)? What could be applied to tackle HPC operations challenges, such as efficiency and better use of resources? This talk presents a use case of automation and version control in an HPC enterprise data centre, as well as a proposal for using containers and new schedulers to drive better utilization and diversify data centre workloads: not just HPC but also big data, interactive, batch, and short- and long-lived scientific jobs.

Published in: Technology

  1. 1. DevOps and HPC: Saudi Aramco HPC use case. Walid A. Shaari, Ahmed Bu-khamsin. 20th April 2016.
  2. 2. DISCLAIMER OF ENDORSEMENT: Reference in this document to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by Saudi Aramco or the Saudi Aramco HPC group. The ideas and findings of the authors expressed in any slides or other material should not be construed as an official Saudi Aramco or HPC team position and shall not be used for advertising or product endorsement purposes. Information contained in this document is published in the interest of scientific and technical information exchange. 27/10/2014
  3. 3. DevOps: a cultural movement or practice that emphasizes collaboration and communication between application developers and operations professionals. Diagram labels: Development, Business, Operations; adaptive, automated, agile.
  4. 4. Business Drivers
     o Optimization: effective utilization of data center resources
       • Utilization of systems, storage, network, or services.
       • Better use of employees' time and skills.
     o Growth (N x R x P): increasing infrastructure scale
       • N: number of managed nodes/clusters/environments
       • R: number of applications (business roles)
       • P: number of technical services (technology profiles)
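The growth term above can be illustrated with a small calculation. This is only a sketch: it assumes that N x R x P is meant as a plain product giving an upper bound on the number of distinct node/role/profile combinations to manage, and the function name and figures are hypothetical.

```python
def managed_combinations(nodes: int, roles: int, profiles: int) -> int:
    """Upper bound on (node, role, profile) combinations, read as N x R x P."""
    return nodes * roles * profiles


# Hypothetical figures, chosen only to show how the product scales.
before = managed_combinations(nodes=10, roles=3, profiles=4)
after = managed_combinations(nodes=20, roles=3, profiles=4)
print(before, after)
```

Under this reading, doubling any one factor doubles the number of combinations, which is one way to see why manual administration stops scaling as the data center grows.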
  5. 5. Popular DevOps Tools: Docker, Mesos, Git, Puppet
  6. 6. Data Center blueprints
  7. 7. Cluster Deployment: Script, Packages, Files, Services, Mounts, Security
  8. 8. [Figure: the Script, Packages, Files, Services, Mounts, Security stack repeated for each cluster] • Different Hardware • Different Sizes • Different Users • Different Operating Systems
  9. 9. [Figure: the Script, Packages, Files, Services, Mounts, Security stack repeated for each cluster]
     Common Tasks: • Apply security patches • Add new storage • Upgrade the OS • Install new packages
     Common Issues: • Scalability issue • Lack of history • No team collaboration • No drift control • Long development and test cycle
  10. 10. Solution • Do it the DevOps way: infrastructure as code • Definition of infrastructure as code: "Enable the reconstruction of the business from nothing but a source code repository, an application data backup, and bare metal resources"
  11. 11. Configuration Management Tools
      • Domain-specific language: describes the desired state of the infrastructure
      • Data store: stores the configuration specifications and other data
      • Control system: deploys the code and applies the required configuration changes
      • Version control system: keeps history, enforces workflow and peer review, and supports team collaboration
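The desired-state idea behind these tools can be sketched in a few lines. This is a simplified illustration, not how Puppet or any other tool is implemented: the desired state is modeled as a plain dictionary, and all setting names and values are hypothetical.

```python
from typing import Dict, Optional, Tuple

Drift = Dict[str, Tuple[str, Optional[str]]]


def detect_drift(desired: Dict[str, str], actual: Dict[str, str]) -> Drift:
    """Map each drifted key to (desired value, actual value or None if absent)."""
    return {
        key: (want, actual.get(key))
        for key, want in desired.items()
        if actual.get(key) != want
    }


def converge(desired: Dict[str, str], actual: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of `actual` with every desired setting enforced."""
    fixed = dict(actual)
    fixed.update(desired)
    return fixed


# Hypothetical desired state for one node (keys and values are made up).
desired = {"ntp_service": "running", "openssh_version": "7.4", "scratch_mount": "/scratch"}
actual = {"ntp_service": "stopped", "openssh_version": "7.4"}

drift = detect_drift(desired, actual)
repaired = converge(desired, actual)
print(drift)
print(detect_drift(desired, repaired))
```

Applying `converge` a second time changes nothing, which is the idempotence that lets the same code serve both deployment and ongoing compliance checks.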
  12. 12. Puppet
      • Open-source IT automation framework
      • Framework to simplify and automate system configuration and provisioning
      • Replaces ssh for-loops and scripts
      • Hundreds of configuration modules available for download
      • Supports many Linux distributions, Windows, and storage and network devices
  13. 13. Cluster Deployment Project Plan • Hardware Delivery • Power Up and Network Connectivity • OS Installation • Aramco Customization • Benchmarking • Application Testing • Production. Vendor tools shown: HP CMU, IBM xCAT, Dell Bright. Callout: Where Puppet Fits
  14. 14. Benefits • Speeds up cluster deployment, from days to hours - Shorter development cycle - Same code is used for deployment and compliance - Code reuse
  15. 15. Benefits. Commit statistics for production, November 13 2014 - April 22 2015: 697 commits during 160 days, an average of 4.4 commits per day, contributed by 9 authors. Chart panels: Contribution During Puppet Deployment Project; Contribution During First Deployment Project; Contribution During Second Deployment Project
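The figures on this slide can be cross-checked with a short calculation. The dates and commit count come from the slide; the variable names are illustrative.

```python
from datetime import date

start = date(2014, 11, 13)  # first day of the reported period
end = date(2015, 4, 22)     # last day of the reported period
commits = 697               # commit count reported on the slide

days = (end - start).days
rate = commits / days
print(days, round(rate, 1))
```

The period works out to 160 days and the rate rounds to 4.4 commits per day, matching the slide.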
  16. 16. Benefits • Automatic and continuous deployment - Classify the cluster as the right type and Puppet does the rest
  17. 17. Benefits • Advanced reporting capabilities • Self-healing and drift control • Baseline configuration compliance
  18. 18. Benefits • Version control and development workflow • Team collaboration. Diagram labels: Production, Bug-fix, New feature, Merge Request
  19. 19. Git Branches and Commits
  20. 20. How Pervasive is Configuration Management? ASM
  21. 21. Traditional HPC Cluster Management tools https://www.flickr.com/photos/vrogy/514733529
  22. 22. Provisioning. Diagram labels: Workload Scheduler & Metrics; System (user land, kernel modules, devices); Bare-metal Bootstrapping; Configuration; Orchestration; consistency; Provisioning activity. Tools: puppet, Ansible, Chef; Grid Engine, SLURM, TORQUE/MOAB; Mesos/Swarm/Nomad; puppet, Chef, Ansible; foreman, Razor, Digital-rebar, Ironic. Targets: Virtual image, Container. Columns: HPC OPS, Web/Cloud OPS
  23. 23. HPC workload runs on the cloud 25%
  24. 24. Which workloads and frameworks are running on OpenStack? Source: https://www.openstack.org/assets/survey/Public-User-Survey-Report.pdf
  25. 25. HPC in non bare-metal: Experimental? Is it Mature? Vendor trends
  26. 26. Next Generation Provisioning (Puppet Razor, Ironic)
      • No vendor lock-in: open source availability
      • Environment agnostic
        - bare-metal, virtual image, and containers
      • Uses open standards
        - IPMI 2, iPXE, DHCP, REST, HTTPS
      • Handles end-to-end application provisioning
      • Better integration with other tools
        - configuration management, CMDB, monitoring
      • Programmable
      • On-demand provisioning
      • Policy/model based
  27. 27. Data Center current state. [Figure: three separate schedulers, each with its own jobs, on Cluster Management A, B, and C; utilization scale 0%, 50%, 100%]
  28. 28. Data Center: Breaking the Silos. [Figure: a MetaScheduler placed above the three schedulers and their jobs]
  29. 29. Data Center: Efficient Secure Allocation of Resources. [Figure: a DataCenterScheduler above three schedulers and their jobs, serving VC1 Infra, VC2 HPC, and VC3 BigData on 2nd Generation Cluster Management]
  30. 30. Containers: a container encapsulates an application completely, with all of its software dependencies, into a standardized unit of software that is portable across different platforms.* https://www.docker.com/what-docker
  31. 31. Containers: Potential Benefits to HPC
      o High performing
      o Lightweight
      o Portable; could solve software packaging, configuration, and delivery
      o Host kernel and system drivers visibility
      o Composable
      o Targets better scalable monitoring, logging, and security
      o Private in-house repositories
      o Workforce separation of concerns (e.g. Operations, Development, Security, Users)
      o Builds on mature agile application lifecycle management
      o Empowers application support and developers
      o Holistic, yet modular ecosystem
      o Schedulers and cluster managers (traditional: e.g. LSF, UGE, Moab, and Slurm; modern: Mesos, Kubernetes, nextflow)
  32. 32. Docker Performance http://www.theregister.co.uk/2014/08/18/docker_kicks_kvms_butt_in_ibm_tests
  33. 33. NVIDIA Example use case https://github.com/NVIDIA/nvidia-docker
  34. 34. Host possible workload. Diagram labels: Tiny Core Linux (VM); Docker Engine; Bin/libs; Enterprise Linux Distribution; Service; RHEL7: HPCtask (x4); Alpine: MicroService (x4); Ubuntu: Bigdata; Alpine: Redis, Kibana, Logstash, Elasticsearc
  35. 35. HPC Host Reality. Diagram labels: RHEL7; HPCTask (x4); Bin/Libs; HPC service; HPC Job 3; Docker Engine; Docker capable OS; Container Cluster Management/orchestration
  36. 36. Possible HPC Challenges
      o Change of processes and mindset
      o Linux kernel requirements
      o Maturity of the cluster management and scheduling solution
      o Keeping up with the container ecosystem
      o Extremely fast moving target
      o Several architectural and fundamental decisions to make
      o Memory deduplication
      o Necessity of automated tool-chains: "development, integration, and delivery workflows"
      o Security: trusted container libraries
  37. 37. Thank you
  38. 38. Extra Slides
  39. 39. Saudi Docker meetups • http://www.meetup.com/Docker-Riyadh/ • http://www.meetup.com/Docker-Dhahran/
  40. 40. Mesos
      • Mature, open source Apache project
      • Cluster resource manager
      • Scalable to 10,000s of nodes
      • Fault tolerant, no single point of failure
      • Multi-tenancy with strong resource isolation
      • Improved resource utilization
  41. 41. Mesos workload schedulers: "Frameworks"
  43. 43. File system Layers
  44. 44. Devil in the details. Diagram labels: ssh, mpi, Scheduler, Init, musl, glibc; Docker Engine; Docker capable OS; Bin/Libs; HPC service
