Rudder is a configuration management tool that applies policies to nodes. As Rudder's user base grew, scalability became an issue. Rudder was originally designed for hundreds of nodes but now manages over 10,000 nodes. Rudder underwent major architectural changes and optimizations to its policy generation, compliance computation, agent performance, and database usage to scale up. Current work is focused on testing and tooling to push Rudder to 50,000 nodes by addressing remaining bottlenecks in validation, networking, and multi-tenancy support.
Tech Talk by John Casey (CTO) CPLANE_NETWORKS : High Performance OpenStack Ne... | nvirters
OpenStack is HOT! No doubt about it. A recent survey by The New Stack and The Linux Foundation shows OpenStack as the most popular open source project ahead of other hot projects like Docker and KVM. OpenStack is now taking its rightful place as the open source cloud solution for enterprises and service providers.
To date OpenStack networking has not yet achieved the performance, scalability and reliability that many large enterprises demand. CPLANE NETWORKS solves that problem by delivering secure multi-tenant virtual networking that overcomes the limitations of the standard Neutron networking service. By making all networking services local to the compute node and achieving near line-rate throughput, CPLANE NETWORKS Dynamic Virtual Networks (DVN) delivers mega-scale networking for the most demanding application environments.
In this session John Casey will cover the basics of DVN and explain how CPLANE NETWORKS achieves "at scale" network performance within and across data centers.
About John Casey
John Casey has over 20 years of deep technology leadership. His proven success in a variety of technical leadership roles in Telecom, Enterprise and Government, and in software design and development, provides the foundation for the system architecture and engineering team.
Previously John led worldwide deployment teams for both IBM’s Software Group and Narus, Inc. His work in large scale, high performance system design at Transarc Labs and Walker Interactive Systems brings leadership to the CPLANE NETWORKS product suite.
Why @Loggly Loves Apache Kafka, and How We Use Its Unbreakable Messaging for ... | SolarWinds Loggly
Agenda for this Presentation
• The challenges of Log Management at scale
• Overview of Loggly’s processing pipeline
• Alternative technologies considered
• Why we love Apache Kafka
• How Kafka has added flexibility to our pipeline

The Challenges of Log Management at Scale
• Big data
– >750 billion events logged to date
– Sustained bursts of 100,000+ events per second
– Data space measured in petabytes
• Need for high fault tolerance
• Near real-time indexing requirements
• Time-series index management
This talk will introduce OCP hardware and software. In particular, it will share the past two months of hands-on trial and error with the Wedge ToR switch provided by Facebook, Open Network Linux, FBOSS, and the Indigo OpenFlow agent.
Tech Tutorial by Vikram Dham: Let's build MPLS router using SDN | nvirters
Synopsis
We will start with MPLS 101 and then look into MPLS-related OpenFlow actions. In the second half we will delve into the RouteFlow architecture and extend it to enable Label Distribution Protocol (LDP) and MPLS routing. We will conclude with a Mininet-based test bed switching traffic using MPLS labels instead of IP addresses.
This will be a hands-on workshop. VM images for VirtualBox will be provided. Attendees are expected to bring their laptops with VirtualBox installed.
About Vikram Dham
Vikram is the CTO and co-founder of Kamboi Technologies, LLC, where he advises networking companies, switch vendors and early adopters on SDN technology and distributed software development. He is also the founder of the Bay Area Network Virtualization (BANV) meet-up group, which brings together technologists in the SDN/NFV/NV domain for technical talks and workshops and creates a truly "open" platform for sharing knowledge.
He has used SDN technologies for building software related to traffic engineering, security and routing. In the past, he was the Principal Engineer at Slingbox, where he architected and built the distributed networking software for peer-to-peer connectivity of millions of endpoints. He holds an MS degree in EE with a specialization in Computer Networks from Virginia Tech and has worked on research projects with companies like ECI Telecom, Raytheon and Avaya Research Labs.
DEVNET-1175 OpenDaylight Service Function Chaining | Cisco DevNet
This tutorial will overview the OpenDaylight Service Function Chaining (SFC) architecture, implementation and operation. A description of the SFC components and the Network Service Header (NSH) will be presented. This talk will conclude with a step-by-step demonstration of SFC configuration and operation using the GUI and REST interfaces.
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015 | Monal Daxini
Keystone: processing over half a trillion events per day, with peaks of 8 million events and 17 GB per second, and at-least-once processing semantics. We will explore in detail how we employ Kafka, Samza, and Docker at scale to implement a multi-tenant pipeline. We will also look at its evolution to the current state and where the pipeline is headed next: offering a self-service stream processing infrastructure atop the Kafka-based pipeline, and support for Spark Streaming.
This presentation will walk through the value and benefits of using service chaining technologies in OPNFV for service composition. The presentation will talk through and demonstrate, in real time, platform service chaining features and capabilities.
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec | Peter Bakas
Talk on Netflix Keystone by Peter Bakas at SF Data Engineering Meetup on 2/23/2016.
Topics covered:
- Architectural design and principles for Keystone
- Technologies that Keystone is leveraging
- Best practices
http://www.meetup.com/SF-Data-Engineering/events/228293610/
Design and Implementation of Incremental Cooperative Rebalancing | confluent
Watch this talk here: https://www.confluent.io/online-talks/design-and-implementation-of-incremental-cooperative-rebalancing-on-demand
Since its initial release, the Kafka group membership protocol has offered Connect, Streams and Consumer applications an ingenious and robust way to balance resources among distributed processes. The process of rebalancing, as it’s widely known, allows Kafka APIs to define an embedded protocol for load balancing within the group membership protocol itself.
Until now, rebalancing has been working under the simple assumption that every time a new group generation is created, the members join after first releasing all of their resources, getting a whole new load assignment by the time the new group is formed. This allows Kafka APIs to provide task fault-tolerance and elasticity on top of the group membership protocol.
However, because of its side effects on multi-tenancy and scalability, this simple approach to rebalancing, also known as the stop-the-world effect, limits larger-scale deployments. Because of stop-the-world, application tasks get interrupted only for most of them to receive the same resources after rebalancing. In this technical deep dive, we’ll discuss the proposition of Incremental Cooperative Rebalancing as a way to alleviate stop-the-world and optimize rebalancing in Kafka APIs.
This talk will cover:
- The internals of Incremental Cooperative Rebalancing
- Use cases that benefit from Incremental Cooperative Rebalancing
- Implementation in Kafka Connect
- Performance results in Kafka Connect clusters
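The difference between stop-the-world and incremental rebalancing described in this abstract can be illustrated with a small sketch. This is a toy model under stated assumptions, not Kafka's actual group membership protocol: the function names `eager_rebalance` and `cooperative_rebalance`, the worker and task names, and the ceiling-quota balancing rule are all hypothetical, and the real protocol spreads revocation and reassignment across more than one rebalance round, which this sketch collapses into a single step.

```python
from typing import Dict, List, Tuple

# Maps a worker name to the tasks it currently owns (hypothetical model).
Assignment = Dict[str, List[str]]


def eager_rebalance(current: Assignment, members: List[str]) -> Tuple[Assignment, int]:
    """Stop-the-world: every member releases all tasks, then tasks are dealt out again."""
    tasks = sorted(t for owned in current.values() for t in owned)
    revoked = len(tasks)  # every task is interrupted, even if it ends up on the same worker
    new: Assignment = {m: [] for m in members}
    for i, task in enumerate(tasks):
        new[members[i % len(members)]].append(task)  # round-robin redistribution
    return new, revoked


def cooperative_rebalance(current: Assignment, members: List[str]) -> Tuple[Assignment, int]:
    """Incremental: members keep their tasks; only the surplus above a quota is moved."""
    new: Assignment = {m: list(current.get(m, [])) for m in members}
    # Tasks owned by workers that left the group must be handed to someone else.
    pool = [t for m, owned in current.items() if m not in members for t in owned]
    total = sum(len(owned) for owned in new.values()) + len(pool)
    quota = -(-total // len(members))  # ceiling division
    revoked = 0
    for owned in new.values():
        while len(owned) > quota:
            pool.append(owned.pop())  # revoke only what exceeds the quota
            revoked += 1
    for task in pool:
        min(new.values(), key=len).append(task)  # give to the least-loaded member
    return new, revoked


if __name__ == "__main__":
    current = {"w1": ["t1", "t2", "t3"], "w2": ["t4", "t5", "t6"]}
    members = ["w1", "w2", "w3"]  # w3 has just joined the group
    print(eager_rebalance(current, members))
    print(cooperative_rebalance(current, members))
```

With two workers holding three tasks each and a third worker joining, the eager variant interrupts all six tasks while the cooperative variant revokes only two. The ceiling quota keeps the sketch short but only balances load approximately when the task count is not divisible by the number of members.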
These slides were presented at the 2013 Linux Plumbers Conference in New Orleans by myself and Vina Ermagan. We are doing work to enable LISP and NSH in Open vSwitch, and these slides gave some background on both of these protocols as well as detail on what we've accomplished and future directions.
Open Connect Firmware Delivery With Spinnaker (Spinnaker Summit 2018) | Asher Feldman
Netflix Open Connect is the world's most advanced CDN. Netflix now leverages Spinnaker to deliver firmware updates (encompassing low-level hardware firmware updates, FreeBSD, and the app stack) to Open Connect Appliance servers at thousands of POPs across the world. This talk from Spinnaker Summit 2018 in Seattle, WA explores how we extended Spinnaker to handle this mission-critical bare-metal delivery use case.
OpenStack & OVS: From Love-Hate Relationship to Match Made in Heaven - Erez C... | Cloud Native Day Tel Aviv
"Many developers building OpenStack clouds have a “love-hate” relationship with OVS. They love the flexibility and elasticity offered by OVS, but hate the network performance and scalability. As emerging technologies such as NFV keep pushing for higher network performance, it becomes critical to improve OVS performance without compromising flexibility, network programmability, and cost.
In this session, we will present an approach that Mellanox has devised with input from key partners and customers to accelerate the virtual switch dataplane, using the embedded switch implemented in the hardware of the server's Network Interface Card (NIC). This approach supports both ParaVirt vNIC interfaces and SR-IOV-based vNIC interfaces."
Netflix Keystone Pipeline at Samza Meetup 10-13-2015 | Monal Daxini
The Netflix Keystone Pipeline processes 600 billion events a day. This is a detailed treatise on the modification and use of Samza for real-time routing of events, including Docker.
Netflix keystone streaming data pipeline @scale in the cloud-dbtb-2016 | Monal Daxini
Keystone processes over 700 billion events per day (1 petabyte) with at-least-once processing semantics in the cloud. We will explore in detail how we leveraged Kafka, Samza, Docker, and Linux at scale to implement a multi-tenant pipeline in the AWS cloud within a year. We will also share our plans for offering Stream Processing as a Service for all of Netflix.
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month | Nicolas Brousse
TubeMogul grew from a few servers to over two thousand, handling over one trillion HTTP requests a month, each processed in less than 50 ms. To keep up with this fast growth, the SRE team had to implement an efficient Continuous Delivery infrastructure that enabled over 10,000 Puppet deployments and 8,500 application deployments in 2014. In this presentation, we will cover the nuts and bolts of the TubeMogul operations engineering team and how they overcame challenges.
Log and control all service-to-service traffic in one place (Kelvin Wong) | London Microservices
When working with microservices, network unreliability brings a new dimension of challenges. Two such challenges are: 1) diagnosing network-related faults that span multiple microservices, and 2) managing pre-emptive fault-handling logic with client libraries.
Some solutions exist already, such as API gateways and service meshes. API gateways are designed primarily for client-server traffic, while service meshes are great for service-to-service traffic, but also highly complex.
We built Apex for small teams that are migrating from a monolith to their first few microservices, and starting to experience the above challenges. Apex is an open-source API proxy that provides one place to log and control all service-to-service traffic.
Key takeaways:
- Solutions already exist (e.g. API gateways, service meshes) for teams who must now also diagnose and pre-empt network faults in their systems
- These solutions come with their own trade-offs, and neither is optimal for small teams that are migrating from a monolith to their first few microservices
- A relatively simple solution, Apex's architecture of API proxy + logs database + configuration store can help small microservices teams handle these challenges
- Apex is recognised as a transitional architecture for microservices teams who don't yet have the expertise or bandwidth to deploy and operate a full service mesh
Kelvin is a full-stack Software Engineer based in London, with experience in Ruby, JavaScript / Node.js, PostgreSQL, Docker, AWS, Rails, and React. Prior to becoming a Software Engineer, Kelvin was a start-up founder in Hong Kong. LinkedIn profile: https://www.linkedin.com/in/kjhwong/.
Zero Downtime Architectures based on JEE platform. Almost every big enterprise with online business tries to design its applications in a way that they are always online. But is it also the case when we upgrade the database cluster? When we switch the whole data center? Based on a customer project we try to present common architecture principles that enable you to do all this without any service interruption and the most important: without any stress.
A brief introduction to the world of Software Defined Networking (SDN).
SDN is a technology that can fundamentally change how a network is managed once it is deployed there.
This session endeavors to explain high-speed reactive microservice architecture, a set of patterns for building services that can readily back mobile and web applications at scale. It uses a scale-up and -out versus a scale-out model to do more with less hardware. A scale-up model uses in-memory operational data, efficient queue handoff, and microbatch streaming, plus async calls to handle more calls on a single node. High-speed microservice architecture endeavors to get back to OOP roots, where data and logic live together in a cohesive, understandable representation of the problem domain, and away from separation of data and logic, because data lives with the service logic that operates on it.
Software Architecture for Cloud Infrastructure | Tapio Rautonen
Distributed systems are hard to build. Software architecture must be carefully crafted to suit cloud infrastructure.
Design for failure. Learn from failure. Adopt new cloud compatible design patterns and follow the guidelines during the journey of building cloud native applications.
PLNOG19 - Piotr Marecki - Espresso: Scalable and Programmable Peering Edge | PROIDEA
A presentation of the SDN solution (the Espresso project - https://blog.google/topics/google-cloud/making-google-cloud-faster-more-available-and-cost-effective-extending-sdn-public-internet-espresso/ ) for Google's edge network. It describes the distributed architecture of the control plane and the packet forwarding plane, the mapping system, and the operational experience gathered while supporting the system in production.
This talk focuses on how we used Amazon Kinesis to build the pub-sub infrastructure at Lyft, which ingests more than 100 billion events per day. We'll review the strengths and weaknesses of Kinesis as a choice for streaming events in real time at Lyft's scale, as well as the best practices and lessons learnt over time.
Speaker: Hafiz Hamid (Lyft)
Hafiz Hamid is a software engineer on the Pub-Sub/Streaming Platform team at Lyft. He has built some of the key pieces in the messaging & streaming infrastructure at Lyft. Previously, Hafiz was a technical lead at Bing Search where he worked on data pipelines, relevance and web crawlers.
C. Sotiriou, Vodafone Greece: Adopting Quarkus for the digital experience layer | Uni Systems S.M.S.A.
Christos Sotiriou, Backend Chapter Lead in Digital Engineering at Vodafone Greece, delivers a thorough presentation on how Vodafone Greece moved from Spring to Quarkus and the journey towards a cleaner & faster stack. The webinar was delivered on June 25, 2020.
Slow things down to make them go faster [FOSDEM 2022] | Jimmy Angelakos
Talk from FOSDEM 2022
It's easy to get misled into overconfidence based on the performance of powerful servers, given today's monster core counts and RAM sizes. However, the reality of high concurrency usage is often disappointing, with less throughput than one would expect. Because of its internals and its multi-process architecture, PostgreSQL is very particular about how it likes to deal with high concurrency and in some cases it can slow down to the point where it looks like it's not performing as it should. In this talk we'll take a look at potential pitfalls when you throw a lot of work at your database. Specifically, very high concurrency and resource contention can cause problems with lock waits in Postgres. Very high transaction rates can also cause problems of a different nature. Finally, we will be looking at ways to mitigate these by examining our queries and connection parameters, leveraging connection pooling and replication, or adapting the workload.
Topics:
1. Understand what we mean by high concurrency.
2. Understand ACID & MVCC in Postgres.
3. Understand how high concurrency affects Postgres performance.
4. Understand how locks/latches affect Postgres performance.
5. Understand how high transaction rates can affect Postgres.
6. Mitigation strategies for high concurrency scenarios.
Similar to How we scaled Rudder to 10k, and the road to 50k (20)
What if configuration management didn't need to be lvl60 in dev? | RUDDER
Slides from Alexandre BRIANCEAU's talk at #OSSPARIS19 (Open Source Summit Paris 2019).
Server infrastructure automation is not simple. Several solutions have existed for several years and most of them rely on infra-as-code to achieve their mission. By the way, why infra-as-code?
And unfortunately, these solutions require strong development skills. So how can we do this when the infrastructure team does not have sufficient and, above all, homogeneous expertise? Because otherwise, beware of the "Guru Team" effect, or how the infrastructure automation to save time ends up with a huge SPOF because only one person in the team knows how it works....
I would like to discuss this together and introduce you to RUDDER briefly. RUDDER is a configuration management solution, and therefore infra-as-code, that allows you to automate your systems by relying entirely on a graphical interface to manage your configurations. Because the infrastructure is complex enough to add a layer!
Slides from Alexandre BRIANCEAU's talk at #OSSPARIS19 (Open Source Summit Paris 2019).
Security is everyone's business, an exploited breach is enough. Teams are aware of this and yet it is still as difficult as ever to be able to ensure, be confident, and reassure others (prove) that at least one party is under control.
And when it comes to server infrastructure, especially at the OS / middleware level, everything gets complicated. Even with an operational security team, it is difficult to ensure that the Information System Security Policy and security recommendations are properly implemented on all servers.
How can we be sure that our security policies are properly applied on all our servers other than through a massive and costly audit? Even if they were when they were created, how do you know if they remain perfectly compliant after a few days / weeks / months?
Let's discover together RUDDER, an open-source solution for continuous compliance based on configuration management to automatically audit and/or correct our systems.
OSIS 2019 - Qu’apporte l’observabilité à la gestion de configuration ? | RUDDER
We speak of service observability when services expose internal states and metrics in order to improve overall availability.
What about the observability of the infrastructure on which they are deployed, configured and maintained?
The various logs (centralized, aggregated) are a good starting point for analysis, but systems must also be observed continuously to trace every change and correlate it with monitoring. Today, these IT configuration steps should be handled by configuration management tools, which become the gateway to observability of operations.
We will show the value of this approach for modern IT management, with feedback on the challenges of implementing it in Rudder, our open-source solution for continuous audit and configuration management.
OW2Con - Configurations, do you prove yours? | RUDDER
How can we be sure that continuous configuration management is operating properly? How can we expose factual, topic-related reports to dev, sec, managers, customers...?
We believe that, in order to deliver the full business and collaboration value of continuous configuration management, the solution needs to go further than simply applying policies - it must ensure configuration reliability; prove historized application and status; share it to other teams; notify of any drift with a relevant context.
This talk will present why and how we should be concerned about transmitting factual measures on infrastructure management to all parties involved. We will also guide you through the journey to include a full-fledged reporting feature in a configuration management solution.
The latest major version of the solution has brought a major new feature to the Rudder solution: a plugin ecosystem.
The Rudder software architect will present the reasons for this new feature, how it works, and what are the different plugins available.
Benoit Peccatte, CfgMgmtCamp 2019.
Benoit Peccatte started out as a developer for air traffic control systems but quickly became more interested in writing code generators to automate his job.
After meeting some smart sysadmins on the beach, he switched jobs and has been automating servers for the past decade.
He stumbled across open source in engineering school, and quickly became convinced that free software is the only way to keep software maintainable whatever happens in the future.
Benoit is now trying to automate his job on Rudder, developing features in Rudder to continuously configure and audit more and more servers.
What uses for observing operations of Configuration Management? | RUDDER
Nicolas Charles, CfgMgmtCamp 2019.
More and more services expose their state, internal details and metrics to be observable, and improve overall quality of service.
But what about observing the infrastructure they are deployed, configured and maintained on?
What can we learn from that, and what do we need from configuration management to get these features and metrics?
Logs from installation are a good start, but they need centralization, aggregation and, above all, knowledge derived from them. We also need to observe these features over time, to trace changes and correlate them with monitoring.
Rudder was built around the predicate that all actions of the configuration agent need to be traced, centralized and exposed in a meaningful way - with agents ensuring the continuous configuration of systems, and this talk will show the rationale behind this predicate, how we implemented this solution, and the benefits of this approach for the modern IT world.
UX challenges of a UI-centric config management tool (RUDDER)
Raphaël Gauthier, CfgMgmtCamp 2019.
One of Rudder’s main focuses is its comprehensive graphical user interface, which allows users to view and manage its configurations without writing a line of code.
The user experience and interface considerations for a tool this technical and complex, with such potential to break things, are certainly a challenge, and in some ways uncharted territory. Rudder’s frontend developer will present an analysis of the situation, the issues encountered and the approach adopted for the UX and UI improvements planned for 2019.
What happened in RUDDER in 2018 and what’s next? (RUDDER)
Alexis Mousset, CfgMgmtCamp 2019.
Let’s take a look at Rudder’s new features from 2018, both in terms of the features of versions 4.3 and 5.0 as well as the new documentation and our platform for building and distributing binaries.
We will then present the provisional roadmap for 2019: let’s go to Rudder 5.1 and 5.2!
Alexandre Brianceau, CfgMgmtCamp 2019.
Rudder is an open source configuration management tool that includes continuous auditing (with or without remediation), compliance info and graphs and the possibility to configure everything in the UI and/or APIs.
It has been around for more than six years and has users large (think 10 000 nodes) and small around the world.
Let’s take a moment to look at the vision that led us here, how Rudder is different from similar tools, and what users find invaluable, nice (or annoying - I’ll be honest!).
If you’re not familiar with Rudder this is a great talk to attend to get the basics covered.
How can we be sure that continuous configuration management is operating properly? How can we expose factual, topic-related reports to dev, sec, managers, customers...?
We believe that, in order to deliver the full business and collaboration value of continuous configuration management, the solution needs to go further than simply applying policies: it must ensure configuration reliability, prove the history of policy application and status, share it with other teams, and notify of any drift with relevant context.
This talk will present why and how we should be concerned about transmitting factual measures on infrastructure management to all parties involved. We will also guide you through the journey to include a full-fledged reporting feature in a configuration management solution.
Continuous auditing: the key to demonstrable compliance (#POSS 2018) (RUDDER)
Presentation from the talk given at the Paris Open Source Summit 2018 by Alexandre Brianceau in the Cybersecurity track.
Security policies are increasingly complex and demanding for operations teams to implement. How can we be certain that our security policies are properly applied on all our servers, other than through a massive and costly audit? And even if they were when they were created, how do we know whether they remain fully compliant after a few days, weeks or months?
We will show how to define the technical rules of a security policy in RUDDER, an open source IT compliance automation solution from the devops world, where automated configuration management is already the norm. These rules are then checked every 5 minutes on each server in order to report a global summary, which makes it possible to inspect the problems that need to be fixed.
We will also explain how a successfully deployed audit policy can be enforced on all systems with the same tool, moving from automatic auditing to automatic remediation.
Continuous reliability and compliance in production with Rudder (#BBOOST 2018) (RUDDER)
Presentation from the talk given at BBOOST 2018 by Alexandre Brianceau.
An infrastructure whose configurations are not homogeneous, monitored and kept continuously compliant inevitably ends up drifting, leading to security flaws and production incidents.
Now that IT reliability has become critical, the traditional method of running audits every X months is showing its limits: a drift between two audits can go unnoticed and cause an incident.
RUDDER is a solution that guarantees configuration compliance at all times.
Stay up - journey of a free software vendor (RUDDER)
This is feedback from one of Rudder's founders on what it means to be an entrepreneur in free software, and on the 10-year journey so far through 4 key stages:
- building the team,
- going through an incubator,
- raising funds (or not),
- and searching for a sustainable business model.
Rudder 4.1 was released in March 2017 with:
- an advanced feature to query external APIs and pull in node properties dynamically
- the ability to add "key=value" tags to all Rules and Directives in order to categorize them
- a new API on relay servers to enable node-to-node file sharing and remote run in firewalled environments
- performance improvements
- a new plugin package format
Rudder 4.2 was released in September 2017 and includes support for a new plugin that adds a Windows DSC-based agent.
Rudder 4.3 will include:
- Parameters for Technique Editor techniques
- ACLs on the API accounts
- Many architecture improvements
In parallel, new plugins are being developed:
- A plugin to integrate data from external APIs
- Monitoring integration with Centreon
- CMDB integration with iTop
- A reporting plugin for historized compliance
This talk will introduce these new features and show how to use them, hopefully getting you as excited as we are! Then we will move on to the longer-term feature ideas we have for Rudder, and the general vision behind future developments.
About Nicolas Charles
Nicolas is a tinkerer who likes when things just work, and tries his best to reach this goal. He started as a developer 15 years ago, and often had to reach out of this role to solve issues.
In 2010, he co-founded Normation, and he still enjoys fixing things in Rudder and for its users.
DevOps D-Day 2017 - Configuration management and compliance at a ... (RUDDER)
As a hosting and managed services provider, Jaguar Network faces a twofold evolution:
The market expects a Service Provider to take charge of an ever larger share of information system management.
The company's growth brings greater pressure both quantitatively (scalability) and qualitatively (guaranteeing reliability and security across the entire managed fleet).
Jaguar Network therefore had to find a solution able to solve this twofold problem, which more and more companies face: ensuring the rapid growth of the fleet while improving and guaranteeing reliability.
Thanks to RUDDER, a French open source Continuous Configuration solution dedicated to production constraints, reaching this goal was made much easier. Together with the vendor of RUDDER, Jaguar Network will recount how the project unfolded, from setting up the tool to the results observed, including integration with the other technologies of the information system.
A concrete and complete experience report on the concept of Continuous Configuration and its implementation with RUDDER.
RUDDER is an easy to use, web-driven, role-based solution for IT Infrastructure Automation and Compliance. With a focus on continuously checking configurations and centralising real-time status data, RUDDER can show a high-level summary (“ISO 27001 rules are at 100%!”) and break down noncompliance issues to a deep technical level (“Host prod-web-03: SSH server configuration allows root logins”).
A few things that make RUDDER stand out:
- A simple framework allows you to extend the built-in rules to implement specific low-level configuration patterns, however complex they may be, using simple building blocks (“ensure package installed in version X,” “ensure file content,” “ensure line in file,” etc.). A graphical builder lowers the technical level required to use this.
- Each policy can be independently set to be automatically checked or enforced on a policy or host level. In Enforce mode, each remediation action is recorded, showing the value of these invisible fixes.
- RUDDER works on almost every kind of device, so you’ll be managing physical and virtual servers in the data center, cloud instances, and embedded IoT devices in the same way.
- RUDDER is designed for critical environments where a security breach can mean more than a blip in the sales stats. Built-in features include change requests, audit logs, and strong authentication.
- RUDDER relies on an agent that needs to be installed on all hosts to audit. The agent is very lightweight (10 to 20 MB of RAM at peak) and blazingly fast (it’s written in C and takes less than 10 seconds to verify 100 rules). Installation is self-contained, via a single package, and can auto-update to limit agent management burden.
- RUDDER is a true and professional open source solution—the team behind RUDDER doesn’t believe in the dual-speed licensing approach that makes you reinstall everything and promotes open source as little more than a “demo version.”
RUDDER is an established project with several tens of thousands of nodes managed, in companies from small to biggest-in-their-field. Typical deployments manage hundreds to thousands of nodes. The biggest known deployment in 2016 is about 7000 nodes.
How we scaled Rudder to 10k, and the road to 50k
1. How we scaled Rudder to 10k nodes
And the road to 50k nodes
Nicolas CHARLES
Co-founder and COO
@nico_charles
2. Scalability?
Scalability is the capability of a system, network, or process to handle a growing amount of work, or its potential to be enlarged to accommodate that growth
https://en.wikipedia.org/wiki/Scalability
3. Scalability – why is it an issue in Rudder?
What does Rudder do?
● Users define policies
● Apply them on groups of nodes
● Rudder computes the policies for each node
● Agents apply them, and send back information
● Rudder computes the compliance
4. Scalability – why is it an issue in Rudder?
Each of these points needs to go fast
● Process node inventories quickly
● Have a fast UI
● Generate policies in a reasonable time
● Have fast agents, and don’t overflow the network
● Have the compliance of the actual state available
6. Rudder Architecture
[Architecture diagram: Rudder Server Root (Interfaces: CLI, Web UI, API; Uses; Applications: Compliance, Configuration, Inventory; Plugins; Rudder Engine; Techniques), Nodes running the Rudder Agent, and a Node acting as Rudder relay]
7. The origin of Rudder
● At first, Rudder was designed for hundred(s) of nodes
● No real goal for scalability
● It was, in retrospect, an MVP
8. The origin of Rudder
● Scalability went up, driven by
  ● Users and usages
    – Frustration over slowdowns
    – More managed servers
  ● Features
    – Some features needed much improved performance
    – Some needed massive architectural change
9. First bottlenecks to tackle
● Reporting in Rudder
  ● Display compliance of nodes
    – Change the data model, as everything was Rule Centric in Rudder 2.3
  ● Slow display of reports and compliance
    – Remember, we were supporting PostgreSQL 8.x
    – Adding relevant indexes
● Agent side
  ● Agent was already used in critical systems, but impacted performance of nodes
    – Rewrite some policies
    – Add tooling around agent to prevent clogging
● Rudder 2.5 was not more scalable, but more consistent
10. Scalability – Step by Step
[Rudder architecture diagram: Rudder Server Root (Interfaces: CLI, Web UI, API; Uses; Compliance, Configuration, Inventory; Rudder Engine; Techniques), Nodes running the Rudder Agent, and a Rudder relay]
Bandwidth & Network
- Flag files to detect new policies
- Relay servers
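The flag-file idea on this slide can be illustrated with a minimal sketch. It assumes a single small file that the server rewrites after each policy generation, so that a node only compares that one file before deciding to fetch anything else. The file name `policies.flag`, the directory layout and the function names are hypothetical and are not Rudder's actual protocol.

```python
import tempfile
from pathlib import Path

FLAG_NAME = "policies.flag"  # hypothetical flag file name, not Rudder's


def publish(server_dir: Path, policies: dict, generation_id: str) -> None:
    """Write the policy files, then the flag last, so readers never see a partial set."""
    for name, content in policies.items():
        (server_dir / name).write_text(content)
    (server_dir / FLAG_NAME).write_text(generation_id)


def needs_update(server_dir: Path, node_dir: Path) -> bool:
    """Compare only the small flag file instead of every policy file."""
    server_flag = server_dir / FLAG_NAME
    node_flag = node_dir / FLAG_NAME
    if not server_flag.exists():
        return False
    return not node_flag.exists() or node_flag.read_text() != server_flag.read_text()


def sync(server_dir: Path, node_dir: Path) -> int:
    """Copy policies only when the flag changed; return the number of files copied."""
    if not needs_update(server_dir, node_dir):
        return 0
    copied = 0
    for src in sorted(server_dir.iterdir()):
        if src.name != FLAG_NAME:
            (node_dir / src.name).write_text(src.read_text())
            copied += 1
    (node_dir / FLAG_NAME).write_text((server_dir / FLAG_NAME).read_text())
    return copied
```

With this layout, an agent run that finds an unchanged flag transfers one tiny file instead of walking the whole policy tree, which is where the bandwidth saving comes from.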
11. Scalability – Step by Step
[Rudder architecture diagram: Rudder Server Root (Interfaces: CLI, Web UI, API; Uses; Compliance, Configuration, Inventory; Rudder Engine; Techniques), Nodes running the Rudder Agent, and a Rudder relay]
Scale the uses
- Validation workflow
- Synchronisation of Rudder servers
- API
- More Techniques
12. Scalability – Step by Step
[Rudder architecture diagram: Rudder Server Root (Interfaces: CLI, Web UI, API; Uses; Compliance, Configuration, Inventory; Rudder Engine; Techniques), Nodes running the Rudder Agent, and a Rudder relay]
Improve performance
- Save only changes of Inventories (several orders of magnitude faster)
- Change data model for Compliance (30% faster compliance)
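As an illustration of "save only changes of inventories", here is a minimal sketch that diffs a newly received inventory against the stored one and writes only the differences. The flat key/value inventory and the in-memory `store` dictionary are assumptions made for the example; the real inventory model is richer and persisted in a directory or database.

```python
def inventory_diff(old: dict, new: dict) -> dict:
    """Return the keys added, removed and changed between two inventories."""
    added = {k: v for k, v in new.items() if k not in old}
    removed = sorted(k for k in old if k not in new)
    changed = {k: v for k, v in new.items() if k in old and old[k] != v}
    return {"added": added, "removed": removed, "changed": changed}


def save_inventory(store: dict, node_id: str, new: dict) -> int:
    """Apply only the differences to the stored inventory; return the number of writes."""
    current = store.setdefault(node_id, {})
    diff = inventory_diff(current, new)
    for key in diff["removed"]:
        del current[key]
    current.update(diff["added"])
    current.update(diff["changed"])
    return len(diff["added"]) + len(diff["removed"]) + len(diff["changed"])
```

Since most inventories are identical from one run to the next, the number of writes is usually zero, which is consistent with the large speed-up the slide reports.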
13. Scalability – 2.9 & 2.10
● Improving performance is one of the focuses
  ● Refactoring and code improvements to improve policy generation time
    – Use of hashes and caches
  ● Fighting with the ORM to have lighter queries
    – Far fewer commits
● Make impact on network and node adjustable
  ● Configure agent run frequency: it can be configured based on the performance of nodes and available bandwidth
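The "hashes and caches" bullet can be sketched as a cache keyed by a fingerprint of the generation inputs: if the inputs for a node hash to the same value as last time, the previous result is reused. The class and function names below are hypothetical, and the JSON-based fingerprint is just one convenient way to hash structured inputs.

```python
import hashlib
import json


def fingerprint(inputs: dict) -> str:
    """Stable hash of the generation inputs (keys sorted so ordering does not matter)."""
    canonical = json.dumps(inputs, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


class PolicyCache:
    """Regenerate a node's policy only when the hash of its inputs changes."""

    def __init__(self, generate):
        self._generate = generate  # callable(node_id, inputs) -> policy
        self._entries = {}         # node_id -> (fingerprint, policy)

    def policy_for(self, node_id: str, inputs: dict):
        fp = fingerprint(inputs)
        cached = self._entries.get(node_id)
        if cached is not None and cached[0] == fp:
            return cached[1]
        policy = self._generate(node_id, inputs)
        self._entries[node_id] = (fp, policy)
        return policy
```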
14. Scalability – 2.9 & 2.10
● First industrialized performance tests, with Tsung
  ● Generated inventories automatically, and sent them to the endpoint
  ● Tests with thousands of inventories
● Thank you @cscmeu!
http://tsung.erlang-projects.org/
15. Scalability – 2.11
● Goal: manage thousands of nodes
● Distributed setup
  – Make Rudder scale by adding more servers for components
● UI more responsive to user requests
  – Async
  – LDAP optimizations
    ● No more indexes (everything fits in RAM)
● Much faster policy generation
  – Changed variable lookup, more caching
  – Used a bit of parallelism when it was easy
● More performance tests
  – A big thank you to users pushing the limits
16. Scale the uses – Rudder 2.11
● Technique Editor: everyone can create techniques
  ● Uses ncf
  ● Graphical User Interface to make Techniques easier to write
17. Rudder 3
[Rudder architecture diagram: Rudder Server Root (Interfaces: CLI, Web UI, API; Uses; Compliance, Configuration, Inventory; Rudder Engine; Techniques), Nodes running the Rudder Agent, and a Rudder relay]
Complete change of UI
- Design and layout
Compliance is everywhere
- Everything is async
- Everything is cached
18. Rudder 3
[Rudder architecture diagram: Rudder Server Root (Interfaces: CLI, Web UI, API; Uses; Compliance, Configuration, Inventory; Rudder Engine; Techniques), Nodes running the Rudder Agent, and a Rudder relay]
New data model: Node Centric
- Compliance is per node
- Cached
- And lazily computed
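A per-node, cached and lazily computed compliance value might look like the sketch below. It assumes compliance is simply the percentage of "success" statuses among a node's reports, which is a simplification for illustration; the class name and status strings are hypothetical.

```python
class NodeCompliance:
    """Per-node compliance, computed on first request and cached until new reports arrive."""

    def __init__(self, reports_by_node: dict):
        self._reports = reports_by_node  # node_id -> list of status strings
        self._cache = {}                 # node_id -> compliance percentage
        self.computations = 0            # counter kept only to show the laziness

    def compliance(self, node_id: str) -> float:
        if node_id not in self._cache:
            self.computations += 1
            statuses = self._reports.get(node_id, [])
            ok = sum(1 for s in statuses if s == "success")
            self._cache[node_id] = 100.0 * ok / len(statuses) if statuses else 0.0
        return self._cache[node_id]

    def new_reports(self, node_id: str, statuses: list) -> None:
        """Store fresh reports and drop only this node's cached value."""
        self._reports[node_id] = list(statuses)
        self._cache.pop(node_id, None)
```

Because the cache is keyed by node, new reports from one node invalidate a single entry, and nodes that nobody looks at are never computed at all.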
19. Rudder 3
[Rudder architecture diagram: Rudder Server Root (Interfaces: CLI, Web UI, API; Uses; Compliance, Configuration, Inventory; Rudder Engine; Techniques), Nodes running the Rudder Agent, and a Rudder relay]
Lightweight reports
- Change only reporting
- Send reports only for changes
And much less disk usage
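One possible reading of "send reports only for changes" is sketched below: the agent keeps the statuses from its previous run and sends only the components whose status differs. This is an assumption about the mechanism, not a description of the agent's actual report format; the component names and status strings are made up.

```python
def reports_to_send(previous: dict, current: dict) -> list:
    """Return (component, status) pairs whose status differs from the previous run."""
    return [
        (component, status)
        for component, status in sorted(current.items())
        if previous.get(component) != status
    ]
```

On a stable node the list is empty for most runs, which is what cuts both network traffic and the disk space used by stored reports.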
20. Rudder 3
● For this release, devs had between 1000 and 2000 nodes on their dev systems
● A lot of timing info embedded in Rudder
  ● Made it possible to identify low-hanging fruit
● As a result, everything was much faster
  ● 500ms compute time with 2000 nodes was considered slow, and reported as a bug
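Embedding timing information can be as simple as wrapping each step in a timer and flagging anything above a threshold. The sketch below reuses the 500 ms figure from the slide as an assumed threshold; the `timed` helper, the `timings` list and the injectable `clock` parameter are hypothetical names for the example.

```python
import time
from contextlib import contextmanager

SLOW_THRESHOLD_S = 0.5  # the 500 ms figure from the slide, used here as an assumed threshold

timings = []  # list of (label, elapsed_seconds)


@contextmanager
def timed(label, clock=time.perf_counter):
    """Record how long the wrapped block took, even if it raises."""
    start = clock()
    try:
        yield
    finally:
        timings.append((label, clock() - start))


def slow_steps(threshold=SLOW_THRESHOLD_S):
    """Labels of the recorded steps that exceeded the threshold."""
    return [label for label, elapsed in timings if elapsed > threshold]
```

A step is then instrumented with `with timed("compliance"): ...`, and `slow_steps()` lists the candidates for a bug report.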
21. Rudder 3.1 – 5000 nodes
● Rudder 3.1 – reaching the 5000 nodes limit (well – 7500 at the end of its life)
● This is the land of micro-optimization, pushing the limits of the model
  – Lazy variables to prevent computation of unwanted values
● Micro tuning of techniques to make policy generation faster
  – But we are still talking about 45 minutes for 5000 nodes with policy validation
● Massive performance upgrade of the agent
  – Changed the complexity of managing big policies
22. Rudder 3.1 – 5000 nodes
● Tooling to generate compliance reports from nodes
  ● Load servers, detect issues in compliance computing
● Extensive use of PgBadger to analyze PostgreSQL logs
  – From both test benches and production systems
  – Finding the slow queries and the limits
● Thank you @matya_j!!
https://github.com/dalibo/pgbadger
24. Rudder 4.0: massive changes
● Policies
  ● Each policy is identified by an id
  ● Change database model
    – Use Doobie, an excellent library (a functional JDBC layer rather than a full ORM) that lets you write proper SQL
    – Configuration is stored in JSON rather than JOINs
  ● No "leaking" of policy changes from one node to another
    – Regenerate only for the nodes that have been changed
  ● Policy generation is much faster
    – About 30 times faster (without policy validation)
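Regenerating only for the nodes that changed requires knowing which nodes a change reaches. The sketch below assumes a simplified model in which rules link directives to groups and groups contain nodes; the data shapes and the function name are hypothetical and much simpler than Rudder's real dependency model.

```python
def nodes_to_regenerate(changed_directives: set, rules: dict, groups: dict) -> set:
    """Return the nodes reached by at least one rule that uses a changed directive."""
    affected = set()
    for rule in rules.values():
        if changed_directives & rule["directives"]:
            for group in rule["groups"]:
                affected |= groups.get(group, set())
    return affected
```

Nodes outside the returned set keep their existing policies untouched, so a change to one directive no longer triggers a generation for the whole fleet.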
25. Rudder 4.0: massive changes
● Compliance
  ● Compliance is computed when reports are received server side, and cached
    – Twice as fast display of compliance with 1000 nodes, an order of magnitude faster with 5000 nodes
● Audit mode
● New LDAP backend (lmdb based)
26. Rudder 4.1: the road to 10k
● UI is much faster
  ● All resources are cached
  ● Compress everything (big impact on bad networks with large installs and a distant server)
● Policy generation is pretty fast (if we don’t validate them)
  ● About 3 minutes for 7000 nodes
● External data sources
  ● We can trigger updates from changes in a remote tool
● Hooks on events
  ● Allow fine-tuning the behaviour of node acceptance/deletion/policy generation
● Thank you @FlorianHeigl1!
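The effect of "compress everything" on repetitive payloads is easy to demonstrate with the standard library. This sketch gzips a JSON payload before sending it and restores it on the other side; the payload shape is invented for the example and is not a Rudder API response.

```python
import gzip
import json


def compress_payload(obj) -> bytes:
    """Serialize to JSON and gzip the result before it goes on the wire."""
    raw = json.dumps(obj, sort_keys=True).encode("utf-8")
    return gzip.compress(raw)


def decompress_payload(data: bytes):
    """Inverse of compress_payload."""
    return json.loads(gzip.decompress(data).decode("utf-8"))
```

Compliance and node listings repeat the same keys and status values for every node, so they compress well, which matters most on slow links to a distant server.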
27. Rudder 4.3: 10k
● Policy engine has been rewritten
  ● Pluggable, less mutable, a bit faster
● We can manage 10k nodes on one Rudder server
  ● Recommended configuration is 11 GB for the Web Interface for 10k nodes
  ● Adding more RAM/CPU/IO is enough to go to 15k nodes
● Still not perfect
  ● Policy generation is long with 10k and policy validation activated
  ● UI will be sluggish – because of DOM computations
    – Might be ok with Firefox 59
  ● API will be ok
28. What’s next?
● Improve tooling suite
  ● Working with Florian Heigl to automate a super large test platform
    – Automatically create nodes, rules, reports
    – At high rate
    – Check application response rate and load
  ● Find new bottlenecks using sysdig
29. What’s next?
● Improve tooling suite
  ● Improve usability and documentation of load tools
    – So that more users/contributors can use them
  ● Automated tests of the UI, measuring the response time at each commit
30. The road to 50k nodes
● Several types of bottleneck
  ● Policy validation
    – We can’t realistically validate 50 000 policies on the server
    – Policy validation on client side via two-step policy updates
  ● GUI
    – Paginate results on the server side
      ● Ease client side burden
      ● Improve response rate (especially over slow networks)
    – Switch from Angular to Elm
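Server-side pagination means the server slices the result set and returns one page plus enough metadata for the UI to draw its controls. A minimal sketch, with a hypothetical response shape:

```python
def paginate(items: list, page: int, page_size: int) -> dict:
    """Return one page of items (pages are 1-based) with paging metadata."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be >= 1")
    total = len(items)
    start = (page - 1) * page_size
    return {
        "page": page,
        "page_size": page_size,
        "total": total,
        "total_pages": -(-total // page_size),  # ceiling division
        "items": items[start:start + page_size],
    }
```

The browser then builds DOM nodes for one page only, which addresses both the client-side burden and the transfer time over slow networks mentioned above.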
31. The road to 50k nodes
● Several types of bottleneck
  ● Network
    – Current protocol is not fit to update hundreds of thousands of files
    – Reports are sent back from nodes to Rudder server via syslog
      ● Missing compression
      ● Rsyslog-psql does one insert/commit in database per received log :(
  ● Policy generation
    – Upgrade or replace StringTemplate to lessen IO
    – More static files
  ● Database
    – Use PostgreSQL 10 partitioning to speed up compliance and archiving
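The cost of one insert and commit per received log can be contrasted with batched inserts. The sketch below uses an in-memory SQLite database purely as a stand-in so that it runs anywhere; the table layout and function names are hypothetical, and the real pipeline writes to PostgreSQL through rsyslog.

```python
import sqlite3

INSERT_SQL = "INSERT INTO reports (node_id, component, status) VALUES (?, ?, ?)"


def make_db() -> sqlite3.Connection:
    """In-memory stand-in for the reports database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE reports (node_id TEXT, component TEXT, status TEXT)")
    return conn


def insert_one_by_one(conn, reports) -> int:
    """One INSERT and one COMMIT per report, as described on the slide."""
    commits = 0
    for report in reports:
        conn.execute(INSERT_SQL, report)
        conn.commit()
        commits += 1
    return commits


def insert_batched(conn, reports, batch_size=500) -> int:
    """Group reports into batches and commit once per batch."""
    commits = 0
    for start in range(0, len(reports), batch_size):
        conn.executemany(INSERT_SQL, reports[start:start + batch_size])
        conn.commit()
        commits += 1
    return commits
```

Each commit forces the database to make the transaction durable, so going from one commit per log line to one per few hundred lines removes most of that overhead.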
32. The road to 50k nodes
● Missing features
  ● We can’t expect every user of a given installation to need to manage the whole 50k nodes
    – Fine grained authorization (OrBAC)
    – Multi-tenancy
    – Federation/Synchronisation of different Rudder servers
      ● A lot of thinking needs to be put in there
  ● Improve collaboration
    – Notifications everywhere!
    – Warn if another user is modifying the current object
  ● Change management
    – Canary testing
    – Ramp-up deployment
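Canary testing and ramp-up deployment can be sketched as splitting the nodes into waves of growing size and stopping at the first wave that fails. The percentages, function names and the `apply_wave` callback are hypothetical; this is a planned feature on the slide, so nothing here reflects an existing Rudder mechanism.

```python
def rollout_waves(nodes: list, percentages=(1, 10, 50, 100)) -> list:
    """Split nodes into cumulative waves; the first small wave acts as the canary."""
    waves = []
    done = 0
    total = len(nodes)
    for pct in percentages:
        target = min(total, -(-total * pct // 100))  # ceiling of total * pct / 100
        if target > done:
            waves.append(nodes[done:target])
            done = target
    return waves


def deploy(waves: list, apply_wave) -> int:
    """Apply waves in order; stop at the first failing wave. Returns nodes deployed."""
    deployed = 0
    for wave in waves:
        if not apply_wave(wave):
            break
        deployed += len(wave)
    return deployed
```

If the canary wave fails, only about 1% of the fleet has received the change, which limits the blast radius of a bad policy.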
33. Final words
● We are very lucky to have great users pushing the limits
  ● A special thank you to all of you
    Dennis, Olivier, Florian, Christophe, Janos, Pierre, Stéphane, Marc, Alexander, David, Fabrice, Daniel, Dmitry, Ferenc, François, Vincent, Jean, Lionel, Maxime, Michael, Enrico, Ilan, Jean Marie, Jeremy, …
    (and I’m terribly sorry for all those I did not mention)
● Tools, software and resources evolved during Rudder’s life
  ● They helped improve the scalability as well
34. How we scaled Rudder to 10k nodes
Questions?
Nicolas CHARLES
Co-founder and COO
@nico_charles