A review of CLIP (the Cloud Infrastructure Project) at the Vienna BioCenter. CLIP is an OpenStack deployment for research computing, set up to replace legacy on-site HPC systems. We talk about how we set up our continuous deployment process, our infrastructure-as-code approach, continuous testing and verification, and monitoring, as well as the pitfalls and surprises we encountered along the way.
Extending TripleO for OpenStack Management - Keith Basil
Operational awareness and value for cloud operators has largely been ignored by the OpenStack community. Today with the maturity of TripleO and inclusion of Tuskar, we can now begin to think about TripleO's use as a vehicle for OpenStack infrastructure management.
The question now is: "How do we extend TripleO with additional value?"
Within this context, there are several areas of integration which can be explored. These include an operator dashboard, infrastructure instrumentation agents, bare metal drivers and other supporting services. Hardware and software vendors can gain insight into what integration looks like from a product point of view.
In this session, we will explore:
- Why TripleO works for infrastructure management
- TripleO management integration points
- What TripleO means for hardware/software vendors
- Early work in this area
CERN OpenStack Cloud Control Plane - From VMs to K8s - Belmiro Moreira
CERN is the home of the Large Hadron Collider (LHC), a 27km circular proton accelerator that generates petabytes of physics data every year. To process all this data, CERN runs an OpenStack Cloud (>300K cores) that helps scientists all around the world to unveil the mysteries of the Universe. The Infrastructure is also used to run all the IT services of the Organization.
Delivering these services, with high performance and reliable service levels has been one of the major challenges for the CERN Cloud engineering team. We have been constantly iterating the architecture and deployment model of the Cloud control plane.
In this presentation we will describe the different control plane architecture models that we have relied on over the years. Finally, we will describe all the work done to move the OpenStack Cloud control plane from VMs into a Kubernetes cluster. We will report on our experience running this architecture at scale, its advantages and challenges.
Overview of Kubernetes Network Functions - HungWei Chiu
In these slides, I briefly introduce the network functions in Kubernetes and explain how Kubernetes implements them.
These functions include the Container Network Interface (CNI) and the Kubernetes Service.
Finally, I introduce the Multus CNI, which is designed for attaching multiple networks to a container and is necessary in some use cases, such as SDN/NFV/5G.
- What is Nova?
- Nova architecture
- How are instances spawned in OpenStack?
- Interaction of Nova with other OpenStack projects such as Neutron, Glance and Cinder.
CERN, the European Organization for Nuclear Research, is one of the world’s largest centres for scientific research. Its business is fundamental physics, finding out what the universe is made of and how it works. At CERN, accelerators such as the 27km Large Hadron Collider are used to study the basic constituents of matter. This talk reviews the challenges of recording and analysing the 25 Petabytes/year produced by the experiments, and the investigations into how OpenStack could help to deliver a more agile computing infrastructure.
A One-Stop Solution for Puppet and OpenStack - Puppet
Throughout the last year, we have been using and developing tools that allow us to have an IaaS where our data center is configured by Puppet and our virtualization and authentication needs are catered for by OpenStack. Red Hat's Foreman is our lifecycle management tool, which we configured to support both bare metal and OpenStack virtual machines. We use git to manage environments and hostgroup configurations, and we will tell you how we deal with its security implications and how to store Hiera data secrets. Switching from a homebrew toolchain to open source tools like Facter, Foreman and OpenStack has turned into many contributions to these projects. Nearly everyone at CERN has started to wear the devops hat, which brings new challenges in terms of development workflows and scalability.
Daniel Lobato Garcia
Software Engineer, CERN
Daniel Lobato is a developer who has worked in very different environments, from data centers and mainframes to startups. Nowadays he has dived into the Agile Infrastructure team at CERN, where the design and implementation of the new computing infrastructure is done. On the Puppet side, he currently helps Red Hat develop Foreman, a lifecycle management tool for physical and virtual machines. One of his goals at CERN is to tie this tool to all the relevant parts of the infrastructure, which includes Puppet for configuration management, OpenStack for virtualization and authentication, PuppetDB and others. He is sure the source of all computer problems is between the chair and the keyboard.
CERN is the home of the Large Hadron Collider (LHC), a 27km circular proton accelerator generating tens of petabytes of new data every year. Data is stored and processed using a large pool of resources totaling over 250,000 cores and thousands of storage servers, managed by OpenStack.
Networking is a critical part of our infrastructure and arguably the hardest to evolve. Given the size of CERN’s infrastructure, its flat network is partitioned in segments each representing a separate broadcast domain and potentially offering different levels of service. This fragmentation improves scalability and reduces the impact of misbehaving systems in the datacentre to individual segments. On the other hand, having multiple broadcast domains means features like floating and virtual IPs are much harder to offer.
We will tell the story of OpenStack Networking at CERN. First integration with Nova Network, the migration to Neutron and how we're adding SDN in our infrastructure.
Using Rally for OpenStack Certification at Scale - Boris Pavlovic
It goes without saying that one of the most important things for all OpenStack clouds, from big to small, is to be 100% sure that everything works as expected BEFORE taking on production workloads.
To ensure that production workloads will be successful, we can do a few key things:
1) Generate real load tests from "real" OpenStack users
2) Collect and retain detailed execution data for examination and historical comparison
3) Measure workload performance and failure rate against established SLAs to validate a deployment
4) Visualize results in beautiful graphic detail
Rally can fully automate these steps for you and save dozens if not hundreds of hours (a minimal task sketch follows).
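As a concrete flavor of steps 1-3, here is a minimal Rally task sketch in Rally's standard YAML task format. NovaServers.boot_and_delete_server is a real Rally scenario; the flavor and image names, iteration counts and SLA threshold are placeholders.

```yaml
# Minimal Rally task sketch: boot and delete servers under a simple SLA.
# Flavor/image names, counts and the SLA threshold are placeholders.
---
NovaServers.boot_and_delete_server:
  - args:
      flavor:
        name: "m1.small"
      image:
        name: "cirros"
    runner:
      type: "constant"
      times: 100       # total iterations
      concurrency: 10  # parallel simulated users
    sla:
      failure_rate:
        max: 0         # fail the run on any error
```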
You can find the full talk here:
https://www.youtube.com/watch?v=-g1tuTLcxik
HKG15-204: OpenStack: 3rd Party Testing and Performance Benchmarking - Linaro
HKG15-204: OpenStack: 3rd party testing and performance benchmarking
---------------------------------------------------
Speakers: Andrew McDermott, Clark Laughlin
Date: February 10, 2015
---------------------------------------------------
★ Session Summary ★
Status of Tempest 3rd-party testing, and discussion of scenarios for Rally benchmarking and hypervisor performance.
--------------------------------------------------
★ Resources ★
Pathable: https://hkg15.pathable.com/meetings/250785
Video: https://www.youtube.com/watch?v=-00rTPCYAyg
Etherpad: http://pad.linaro.org/p/hkg15-204
---------------------------------------------------
★ Event Details ★
Linaro Connect Hong Kong 2015 - #HKG15
February 9-13th, 2015
Regal Airport Hotel Hong Kong Airport
---------------------------------------------------
http://www.linaro.org
http://connect.linaro.org
The next generation of research infrastructure and large scale scientific instruments will face new magnitudes of data.
This talk presents two flagship programmes: the next generation of the Large Hadron Collider (LHC) at CERN and the Square Kilometre Array (SKA) radio telescope. Each in their way will push infrastructure to the limit.
The LHC has been one of the significant users of OpenStack in scientific computing. The SKA is now working to a final software architecture design and is focusing on OpenStack as an underlying middleware function.
Together, we plan to develop a common platform for scaling science: to accommodate new applications and software services, to deliver high ingest rate real-time and batch processing, to integrate high performance storage and to unlock the potential of software defined networking.
OpenStack Summit Tokyo 2015 - Building a private cloud to efficiently handle ... - Pierre GRANDIN
What do you do when your usual setup or turnkey solution isn’t suited for your workload?
Most of the documentation and user feedback that you can find about OpenStack is written for the use case of running a public-facing cloud serving several external customers. When you want to host a single tenant with a single application, the problem is completely different: you don't want publicly exposed APIs. You want to ensure optimal resource allocation to maximize your application's performance. You want to leverage the fact that you own the infrastructure layer to optimize your instance placement strategy, to get the best latency, and to avoid creating SPOFs, using affinity (or anti-affinity) rules.
This talk will focus on what we learned during a two-year journey: from getting OpenStack up and running reliably, to investigating performance bottlenecks, to maximizing the performance of our private cloud.
Alibaba is one of the fastest growing cloud platform providers and the world’s third largest public cloud provider, just after Amazon AWS and Microsoft Azure (Gartner, 2017).
This report is the result of an exercise in understanding the capabilities of Alibaba Cloud compared to other major cloud platforms, such as Amazon’s AWS.
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month - Nicolas Brousse
TubeMogul grew from a few servers to over two thousand servers handling over one trillion HTTP requests a month, each processed in less than 50ms. To keep up with this fast growth, the SRE team had to implement an efficient continuous delivery infrastructure that allowed over 10,000 Puppet deployments and 8,500 application deployments in 2014. In this presentation, we will cover the nuts and bolts of the TubeMogul operations engineering team and how they overcame challenges.
Join this info-packed and hands-on workshop where we will cover:
Introduction to Kubernetes & GitOps talk:
We'll cover the most popular path that has brought success to many users already - GitOps as a natural evolution of Kubernetes. We'll give an overview of how you can benefit from Kubernetes and GitOps: greater security, reliability, velocity and more. Importantly, we cover definitions and principles standardized by the CNCF's OpenGitOps group and what it means for you.
Get Started with GitOps:
You'll have GitOps up and running in about 30 minutes using our free and open source tools! We'll give a brief vision of where you want to be with those security, reliability, and velocity benefits, and then we'll support you while you go through the getting-started steps. During the workshop, you'll also see demos of the following in action:
* an opinionated repo structure to minimize decision fatigue
* disaster recovery using GitOps
* Helm charts example
* Multi-cluster example
* all with free and open source tools, mostly in the CNCF (e.g. Flux and Helm).
If you have questions before or after the workshop, talk to us at #weave-gitops http://bit.ly/WeaveGitOpsSlack (If you need to invite yourself to the Slack, visit https://slack.weave.works/)
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2lGNybu.
Stefan Krawczyk discusses how his team at StitchFix uses the cloud to enable over 80 data scientists to be productive. He also talks about prototyping ideas, algorithms and analyses; how they set up and keep schemas in sync between Hive, Presto, Redshift and Spark; and how they make access easy for their data scientists. Filmed at qconsf.com.
Stefan Krawczyk is Algo Dev Platform Lead at StitchFix, where he’s leading development of the algorithm development platform. He spent formative years at Stanford, LinkedIn, Nextdoor & Idibon, working on everything from growth engineering, product engineering, data engineering, to recommendation systems, NLP, data science and business intelligence.
Rise of the Machines: Continuous Delivery at SEEK - YOW! Night Summary Slides - DiUS
The virtues of continuous delivery are widely understood and accepted by organisations which value fast feedback cycles, reduced risk through incremental delivery of smaller changes, and the ability to respond quickly to external factors. Furthermore, if microservices are part of your architecture, the ability to rapidly deploy multiple components of a system becomes increasingly important.
The foundations of scripting, automation and more recently containers made *nix-based systems the first target for automated deployments and subsequently continuous delivery. With the advent of some new tooling and a bit of courage these principles can now be applied to more heterogeneous environments including those from Redmond.
Using their backgrounds in automating large-scale ruby and java-based deployments, Warner and Matt embarked on a journey with SEEK to increase their agility by enabling continuous delivery – typically multiple times per day. This is their story.
microXchg 2019: "Creating an Effective Developer Experience for Cloud-Native ..." - Daniel Bryant
In a productive cloud native development workflow, individual teams can build and ship (micro)services independently from each other. But with a rapidly evolving cloud native landscape, creating an effective developer workflow using a platform based on something like Kubernetes can be challenging.
We are all creating software to support the delivery of value to our customers and to the business, and therefore, the developer experience from idea generation to running (and observing) in production must be fast, reliable, and provide good feedback.
During this talk Daniel will share with you several lessons learned from real world consulting experience working with teams deploying to Kubernetes.
Key takeaways include:
- Why an efficient development workflow is so important
- A series of questions to ask in order to understand if you should attempt to build a PaaS on top of Kubernetes (everyone needs a platform, but how much should be built versus integrated versus bought?)
- A brief overview of developer experience tooling for Kubernetes, and how this domain could evolve in the future
- The role of Kubernetes, Envoy, Prometheus, and other popular cloud-native tools in your workflow
- Key considerations in implementing a cloud-native workflow
Key Trends Shaping the Future of Infrastructure - Cheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open source: exploring how these areas are likely to mature and develop over the short and long term, and considering how organisations can position themselves to adapt and thrive.
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova... - Ramesh Iyer
In today's fast-changing business world, companies that fail to adapt and embrace new ideas often struggle to keep up with the competition. However, fostering a culture of innovation takes much work: it takes vision, leadership, and a willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
Kubernetes & AI - Beauty and the Beast!?! @KCD Istanbul 2024 - Tobias Schneck
As AI technology pushes into IT, I was wondering, as an “infrastructure container Kubernetes guy”, how this fancy AI technology gets managed from an infrastructure operations view. Is it possible to apply our lovely cloud-native principles as well? What benefits could both technologies bring to each other?
Let me take these questions and walk you through existing deployment models and use cases for AI software. Using practical examples, we discuss what cloud/on-premise strategy we may need to apply it to our own infrastructure and get it to work from an enterprise perspective. I want to give an overview of infrastructure requirements and technologies, and what could be beneficial for or limiting to your AI use cases in an enterprise environment. An interactive demo will give you some insights into the approaches I have already got working for real.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl ... - DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses.
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo... - James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. A constant focus on speed to release software to market, along with traditionally slow and manual security checks, has caused gaps in continuous security, an important piece of the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with a passion for making things work, along with a knack for helping others understand how things work. He brings around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations on CI/CD and application security integrated into the software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Generating a custom Ruby SDK for your web service or Rails API using Smithy - g2nightmarescribd
Have you ever wanted a Ruby client API to communicate with your web service? Smithy is a protocol-agnostic language for defining services and SDKs. Smithy Ruby is an implementation of Smithy that generates a Ruby SDK using a Smithy model. In this talk, we will explore Smithy and Smithy Ruby to learn how to generate custom feature-rich SDKs that can communicate with any web service, such as a Rails JSON API.
GraphRAG is All You Need? LLM & Knowledge Graph - Guy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
1. HPC on OpenStack
Review of our Cloud Platform Project
Petar Forai, Erich Birngruber, Uemit Seren
Post FOSDEM tech talk event @UGent 04.02.2019
2. Agenda
Who We Are & General Remarks (Petar Forai)
Cloud Deployment and Continuous Verification (Uemit Seren)
Cloud Monitoring System Architecture (Erich Birngruber)
3. The “Cloudster” and How We’re Building It!
Shamelessly stolen from Damien François’ talk “The convergence of HPC and BigData: What does it mean for HPC sysadmins?”
4. Who Are We
● Part of the Cloud Platform Engineering team at the molecular biology research institutes (IMP, IMBA, GMI) located at the Vienna BioCenter in Vienna, Austria.
● Tasked with delivery and operations of IT infrastructure for ~ 40 research groups (~ 500 scientists).
● The IT department delivers a full stack of services, from workstations and networking to application hosting and development (among many others).
● Part of the IT infrastructure is the delivery of HPC services for our campus.
● 14 people in total for everything.
5. Vienna BioCenter Computing Profile
● Computing infrastructure almost exclusively dedicated to bioinformatics (genomics, image processing, cryo-electron microscopy, etc.)
● Almost all applications are data exploration, analysis and data processing; no simulation workloads
● All machinery for data acquisition is on site (sequencers, microscopes, etc.)
● Operating and running several compute clusters for batch computing and several compute clusters for stateful applications (web apps, databases, etc.)
6. What We Currently Have
● Siloed islands of infrastructure
● Islands can’t talk to each other; users can’t access data from other islands (or only with difficult logistics)
● Nightmare to manage
● No central automation across all resources easily possible
8. Meet the CLIP Project
● OpenStack was chosen to be evaluated further as the platform for this
● Set up a project “CLIP” (Cloud Infrastructure Project) and formed a project team (4.0 FTE) with a multi-phase approach to delivery of the project
● The goal is to implement not only a new HPC platform but a software-defined datacenter strategy based on OpenStack, and to deliver HPC services on top of this platform
● Delivered in multiple phases
9. Tasks Performed within “CLIP”
● Build a PoC environment to explore and develop an understanding of OpenStack (~ 2 months)
● Start a deeper analysis of OpenStack (~ 8 months)
○ Define and develop the architecture of the cloud (understand HPC-specific impact)
○ Develop a deployment strategy and pick tooling for installation, configuration management, monitoring, testing
○ Develop integration into existing data center resources and services
○ Develop an understanding of operational topics like development procedures, upgrades, etc.
○ Benchmark
● Deploy the production cloud (~ 2 months and ongoing)
○ Purchase and install hardware
○ Develop architecture and pick tooling for payload (HPC environments and applications)
○ Payload deployment
10. CLIP Cloud Architecture Hardware
● Heterogeneous nodes (high core count, high clock, high memory, GPU-accelerated, NVMe)
● First phase: ~ 100 compute nodes and ~ 3500 Intel Skylake cores
● 100GbE SDN RDMA-capable Ethernet, some nodes with 2x or 4x ports
● ~ 250TB NVMe IO nodes, ~ 200GByte/s
11. HPC-Specific Adaptations
● Tuning, tuning, tuning required for excellent performance (see the flavor sketch below)
○ NUMA-clean instances (KVM process layout)
○ Huge pages (KSM etc.) setup
○ Core isolation
○ PCI-E passthrough (GPUs, NVMe, …) and SR-IOV (esp. for networking) crucial for good performance
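As a hedged illustration of how this tuning surfaces to users: much of it can be expressed as Nova flavor extra specs. The properties below are real Nova extra specs set via the standard openstack CLI, wrapped in Ansible since Ansible drives the rest of our deployment; the flavor name and sizes are invented examples, not CLIP's actual flavors.

```yaml
# Hypothetical sketch: create a NUMA-clean, core-pinned HPC flavor with
# 1G huge pages. The extra specs are real Nova properties; the flavor
# name and sizes are examples.
- name: Define an HPC-tuned flavor
  hosts: localhost
  gather_facts: false
  tasks:
    - name: Create the flavor
      ansible.builtin.command: >
        openstack flavor create clip.hpc.16c
        --vcpus 16 --ram 65536 --disk 40

    - name: Pin CPUs, keep the guest on one NUMA node, use 1G huge pages
      ansible.builtin.command: >
        openstack flavor set clip.hpc.16c
        --property hw:cpu_policy=dedicated
        --property hw:numa_nodes=1
        --property hw:mem_page_size=1GB
```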
12. Lessons Learned
● OpenStack is incredibly complex
● OpenStack is not a product. It is a framework.
● You need 2-3 OpenStack environments (development, staging, prod in our case) to practice and understand upgrades and updates.
● The out-of-the-box experience and scalability for certain OpenStack subcomponents is not optimal; they should be considered more of a reference implementation
○ Consider plugging in real hardware here
● Cloud networking is really hard (especially in our case)
14. Deployment and Cloud Verification
● Red Hat OpenStack Platform (OSP) uses the upstream “TripleO” (OpenStack on OpenStack) project for the OpenStack deployment.
● The undercloud (Red Hat terminology: Director) is a single-node deployment of OpenStack using Puppet.
● The undercloud uses various OpenStack projects to deploy the overcloud, which is our actual cloud where payload will run.
● TripleO supports deploying an HA overcloud.
● The overcloud can be installed either using a web GUI or entirely from the CLI by customizing YAML files.
16. Deployment and Cloud Verification
● The web GUI is handy to play around with, but not so great for fast iterations and infra as code. → Disable the web UI and deploy from the CLI.
● TripleO internally uses Heat to drive Puppet that drives Ansible ¯\_(ツ)_/¯
● We decided to use Ansible to drive the TripleO OpenStack deployment.
● Deployment split into 4 phases corresponding to 4 git repos:
a. clip-undercloud-prepare: Ansible playbooks that run on a bastion VM to prepare and install the undercloud using PXE and kickstart.
b. clip-tripleo: the customized YAML files for the TripleO configuration (storage, network settings, etc.)
c. clip-bootstrap: Ansible playbooks to initially deploy or update the overcloud using the configuration in the clip-tripleo repo (sketched below)
d. clip-os-infra: post-deployment customizations that are not exposed through TripleO, or are very cumbersome to customize
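A minimal sketch of what phase (c) can look like: `openstack overcloud deploy` and its flags are the real TripleO CLI, while the playbook layout and the environment-file paths are hypothetical (the actual clip-bootstrap repo is not shown in the deck).

```yaml
# Hypothetical clip-bootstrap-style play: drive the TripleO overcloud
# deployment from the undercloud, feeding it the customized environment
# files kept in the clip-tripleo repo (paths are examples).
- name: Deploy or update the overcloud
  hosts: undercloud
  tasks:
    - name: Run the TripleO deployment with our environment files
      ansible.builtin.shell: >
        source ~/stackrc &&
        openstack overcloud deploy --templates
        -e ~/clip-tripleo/network-environment.yaml
        -e ~/clip-tripleo/storage-environment.yaml
      args:
        executable: /bin/bash
```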
17. Deployment and Cloud Verification
● TripleO is slow because Heat → Puppet → Ansible!
○ Small changes require a “stack update” → 20 minutes (even for a simple config stanza change and service restart).
● Why not move all customizations to Ansible (clip-os-infra)? Unfortunately not robust :-(
○ A stack update (scale down/up) will overwrite our changes
○ → services can be down
● Let’s compromise:
○ Iterate on different customizations using Ansible
○ Move finalized changes back to TripleO
● Ansible everywhere else!
○ clip-aci-infra: prepare the networking primitives for the 3 different OpenStack environments
○ Move nodes between environments in the network fabric
18. Deployment and Cloud Verification
● We have 3 different environments (dev, staging and production) to try out updates and configuration changes. We can guarantee reproducibility of the deployment because we have everything as code/YAML, but what about software packages?
● To make sure that we can predictably upgrade and downgrade, we decided to use Red Hat Satellite (Foreman) and create Content Views and Lifecycle Environments for our 3 environments (see the sketch below).
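As a hedged sketch of that package-pinning workflow: the `hammer` subcommands are Satellite's real CLI, while the content view, organization, version and environment names are invented for illustration.

```yaml
# Hypothetical sketch: publish a new Content View version on Satellite
# and promote it to the staging Lifecycle Environment. All names and the
# version number are invented examples.
- name: Roll OSP package versions through lifecycle environments
  hosts: satellite
  tasks:
    - name: Publish a new version of the OSP content view
      ansible.builtin.command: >
        hammer content-view publish
        --name "clip-osp" --organization "VBC"

    - name: Promote that version to staging
      ansible.builtin.command: >
        hammer content-view version promote
        --content-view "clip-osp" --organization "VBC"
        --version "2.0" --to-lifecycle-environment "staging"
```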
19. Deployment and Cloud Verification
● While working on the deployment we ran into various known bugs that are fixed in newer versions of OSP. To keep track of the workarounds and the status of those bugs, we use a dedicated JIRA project (CRE).
20. Deployment and Cloud Verification
● How can we make sure and monitor that the cloud works during operations?
● We leverage OpenStack’s own Tempest testing suite to run verification against our deployed cloud.
● First a smoke test (~ 128 tests); if this is successful, we run the full suite (~ 3000 tests) against the cloud (see the sketch below).
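A minimal sketch of that two-stage run, assuming an already-configured Tempest workspace on a runner host: `tempest run --smoke` is the real Tempest CLI, while the host group and workspace path are examples.

```yaml
# Hypothetical sketch: smoke-test first, full run only if smoke passes
# (a failed task stops the play, so the ordering gives us the gating).
- name: Verify the cloud with Tempest
  hosts: tempest_runner
  tasks:
    - name: Smoke test (~ 128 tests)
      ansible.builtin.command: tempest run --smoke
      args:
        chdir: /home/stack/tempest

    - name: Full run (~ 3000 tests)
      ansible.builtin.command: tempest run
      args:
        chdir: /home/stack/tempest
```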
21. Deployment and Cloud Verification
● OK, the cloud works, but what about performance? How can we make sure that OpenStack performs when upgrading software packages etc.?
● We plan to use Browbeat to run Rally (control plane performance/stress testing), Shaker (network stress testing) and PerfKit Benchmarker (payload performance) tests on a regular basis, or before and after software upgrades or configuration changes (see the sketch below).
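A hedged sketch of the before/after-upgrade comparison, using the real `rally task` CLI directly (Browbeat would orchestrate this for us); the task file and report paths are examples, and the task file itself could be the boot-and-delete sketch shown earlier in this page.

```yaml
# Hypothetical sketch: run a Rally task and export an HTML report so
# results before and after an upgrade can be compared side by side.
- name: Control-plane performance regression check
  hosts: rally_runner
  tasks:
    - name: Run the Rally task
      ansible.builtin.command: >
        rally task start /home/stack/rally/boot-and-delete.yaml

    - name: Export an HTML report of the last run
      ansible.builtin.command: >
        rally task report --out /home/stack/rally/report.html
```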
23. Deployment and Cloud Verification
Lessons learned and pitfalls of OpenStack/TripleO:
● OpenStack and TripleO are complex with many moving parts. → Have a dev/staging environment to test upgrades, and pin the software versions with Satellite or Foreman.
● Upgrades (even minor ones) can break the cloud in unexpected ways. The biggest pain point was upgrading from OSP11 (non-containerized) → OSP12 (containerized).
● Containers are no free lunch. You need a container build pipeline to customize upstream containers to add fixes and workarounds.
● TripleO gives you a supported out-of-the-box installer for HA OpenStack with common customizations. Non-common customizations are hard because of the rigid architecture (Heat, Puppet and Ansible mixed together). TripleO is moving more towards Ansible (config-download).
● “Flying blind through clouds is dangerous”: make sure you have a pipeline for verification and performance regression testing.
● Infra as code (end to end) is great but requires discipline (proper PR reviews) and release management for tracking workarounds and fixes.
25. Monitoring is Difficult
Because it’s hard to get these right:
● The information
○ At the right time
○ For the right people
● The numbers
○ Too few alarms
○ Too many alarms
○ Too many monitoring systems
● The time
○ doing it too late
26. Monitoring: What We Want to Know
● Logs: as structured as possible → Fluentd
○ syslog (unstructured)
○ OpenStack logs (structured)
● Events → RabbitMQ
○ OpenStack RPCs
○ high-level OpenStack interactions
○ CRUD of resources
● Status → Sensu (see the check sketch below)
○ polling: is the service UP?
○ Publish / subscribe
○ modelling service dependencies
● Metrics → Collectd
○ time series, multi dimensional
○ performance metrics
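As a hedged illustration of the Status layer, a minimal check definition: we show Sensu Go's YAML check format for consistency with the other examples on this page (a classic Sensu deployment would use JSON instead), and the check name, URL, subscription and handler are invented.

```yaml
# Hypothetical Sensu Go check: poll whether the Nova API answers HTTP.
# Check name, URL, subscription and handler are invented examples.
type: CheckConfig
api_version: core/v2
metadata:
  name: check-nova-api
spec:
  command: check-http.rb -u http://controller:8774/
  interval: 60
  subscriptions:
    - openstack-control-plane
  handlers:
    - alerta
```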
27. Monitoring: How We Do It
● Architecture
○ Ingest endpoints for all protocols
○ Buffer for peak loads
○ Persistent store
■ Structured data
■ timeseries
● Dashboards
○ Kibana, Grafana
○ Alerta to unify alarms
● Integration with deployment
○ Automatic configuration
● Service catalog integration
○ Service owners
○ Pointers to documentation
28. Monitoring: Outlook
● What changes for cloud deployments
○ Lifecycle, services come and go
○ Services scale up and down
○ No more hosts
● Further improvements
○ infrastructure debugger (tracing)
○ Stream processing (improved log parsing)
○ Dynamically integrate call-duty / notifications / handover
○ Robustness (last resort deployment)