How we scaled Rudder to 10k, and the road to 50k

A graphical management interface, real-time compliance and ease of use are some of Rudder's core principles. When Rudder was created in 2010, hundreds of servers were considered a large installation, and the constraints and limits of managing systems were totally different from today's, as IT now speaks in terms of thousands of nodes. I’ll present how we scaled Rudder from hundreds to 10k nodes, on each aspect of the product: changing the way nodes talk with the Rudder server, rewriting the data model, evolving the UI, how we detected new limits further away and removed them, and how we made sure these limits don’t come back through tooling and testing. Finally, I’ll present the planned evolutions in upcoming releases to reach 50k managed nodes.

Published in: Technology

  1. 1. How we scaled Rudder to 10k nodes And the road to 50k nodes Nicolas CHARLES Co-founder and COO @nico_charles
  2. 2. 2 Scalability ? Scalability is the capability of a system, network, or process to handle a growing amount of work, or its potential to be enlarged to accommodate that growth https://en.wikipedia.org/wiki/Scalability
  3. 3. 3 Scalability – why is it an issue in Rudder? What does Rudder do? ● Users define policies ● Apply them on groups of nodes ● Rudder computes the policies for each node ● Agents apply them, and send back information ● Rudder computes the compliance
  4. 4. 4 Scalability – why is it an issue in Rudder? Each of these points needs to go fast ● Process node inventories quickly ● Have a fast UI ● Generate policies in a reasonable time ● Have fast agents, and don’t overflow the network ● Compliance of actual state available
  5. 5. 5 Rudder Architecture
  6. 6. 6 Rudder Architecture [Diagram: Rudder root server with its interfaces (CLI, web UI, API), uses, applications (Compliance, Configuration, Inventory, Plugins), Rudder Engine and Techniques; nodes running the Rudder agent; a node acting as Rudder relay]
  7. 7. 7 The origin of Rudder ● At first, Rudder was designed for hundred(s) of nodes ● No real goal for scalability ● It was, retrospectively, an MVP
  8. 8. 8 The origin of Rudder ● Scalability went up, driven by ● Users and usages – Frustration over slowdowns – More managed servers ● Features – Some features needed much improved performance – Some needed massive architectural change
  9. 9. 9 First bottlenecks to tackle ● Reporting in Rudder ● Display compliance of nodes – Change the data model, as everything was Rule Centric in Rudder 2.3 ● Slow display of reports and compliance – Remember, we were supporting PostgreSQL 8.x – Adding relevant indexes ● Agent side ● Agent was already used in critical systems, but impacted performance of nodes – Rewrite some policies – Add tooling around agent to prevent clogging ● Rudder 2.5 was not more scalable, but more consistent
  10. 10. 10 Scalability – Step by Step [Architecture diagram] Bandwidth & Network - Flag files to detect new policies - Relay servers
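
The flag-file idea on this slide can be illustrated with a minimal sketch: the agent compares a tiny flag file published by the server with its local copy, and only downloads the full policy set when they differ. The function names, the file layout and the `download` callback are hypothetical and are not taken from Rudder's actual agent.

```python
from pathlib import Path
import tempfile
from typing import Callable


def needs_policy_update(server_flag: Path, local_flag: Path) -> bool:
    """Return True when the server flag differs from the agent's local copy."""
    if not server_flag.exists():
        return False  # nothing has been generated for this node yet
    if not local_flag.exists():
        return True  # first run: the agent has never fetched policies
    return server_flag.read_bytes() != local_flag.read_bytes()


def sync_policies(server_flag: Path, local_flag: Path,
                  download: Callable[[], None]) -> bool:
    """Download policies only when the flag changed; return True if we did."""
    if not needs_policy_update(server_flag, local_flag):
        return False  # cheap path: only the small flag file was compared
    download()  # hypothetical callback that fetches the full policy set
    local_flag.write_bytes(server_flag.read_bytes())
    return True
```

With this shape, an agent run that finds nothing new costs one small file comparison instead of a full policy transfer, which is where the bandwidth saving would come from.
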
  11. 11. 11 Scalability – Step by Step [Architecture diagram] Scale the uses - Validation workflow - Synchronisation of Rudder servers - API - More Techniques
  12. 12. 12 Scalability – Step by Step [Architecture diagram] Improve performance - Save only changes of Inventories (several orders of magnitude faster) - Change data model for Compliance (30 % faster compliance)
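
Saving only the changes of an inventory can be sketched as a diff between the stored inventory and the one just received, so that unchanged attributes trigger no write at all. This is an illustration under assumptions: inventories are modelled as flat dictionaries and `inventory_changes` is a hypothetical helper, not Rudder's actual inventory code.

```python
from typing import Any, Dict


def inventory_changes(stored: Dict[str, Any],
                      received: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Split a received inventory into added, removed and modified attributes."""
    added = {k: v for k, v in received.items() if k not in stored}
    removed = {k: v for k, v in stored.items() if k not in received}
    modified = {k: v for k, v in received.items()
                if k in stored and stored[k] != v}
    return {"added": added, "removed": removed, "modified": modified}
```

An inventory that is identical to the stored one yields three empty dictionaries, so the backend has nothing to write.
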
  13. 13. 13 Scalability – 2.9 & 2.10 ● Improving performance is one of the focuses ● Refactoring and code improvements to improve policy generation time – Use of hashes and caches ● Fighting with the ORM to have lighter queries – Far fewer commits ● Make impact on network and node adjustable ● Configure agent run frequency: can be configured based on the performance of nodes and available bandwidth
  14. 14. 14 Scalability – 2.9 & 2.10 ● First industrialized performance tests – With Tsung ● Generated inventories automatically, and sent them to the endpoint ● Tests with thousands of inventories ● Thank you @cscmeu! http://tsung.erlang-projects.org/
  15. 15. 15 Scalability – 2.11 ● Goal: manage thousands of nodes ● Distributed setup – Make Rudder scale by adding more servers for components ● UI more responsive to user requests – Async – LDAP optimizations ● No more indexes (everything fits in RAM) ● Much faster policy generation – Changed variable lookup, more caching – Used a bit of parallelism when it was easy ● More performance tests – A big thank you to users pushing the limits
  16. 16. 16 Scale the uses – Rudder 2.11 ● Technique Editor : everyone can create techniques ● Uses ncf ● Graphical User Interface to make Techniques easier to write
  17. 17. 17 Rudder 3 [Architecture diagram] Complete change of UI - Design and layout Compliance is everywhere - Everything is async - Everything is cached
  18. 18. 18 Rudder 3 [Architecture diagram] New data model: Node Centric - Compliance is per node - Cached - And lazily computed
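
A per-node compliance value that is cached and lazily computed could look like the sketch below: nothing is computed until a node's compliance is requested, the result is kept, and it is dropped again when new reports arrive for that node. `ComplianceCache` and its `compute` callback are hypothetical names used for illustration only.

```python
from typing import Callable, Dict


class ComplianceCache:
    """Cache of per-node compliance, filled only on first request."""

    def __init__(self, compute: Callable[[str], float]) -> None:
        self._compute = compute  # expensive per-node computation (assumed)
        self._cache: Dict[str, float] = {}

    def get(self, node_id: str) -> float:
        # Lazy: compute on first access, then serve the cached value.
        if node_id not in self._cache:
            self._cache[node_id] = self._compute(node_id)
        return self._cache[node_id]

    def invalidate(self, node_id: str) -> None:
        # Call when new reports arrive for this node.
        self._cache.pop(node_id, None)
```

Because the cache is keyed by node, new reports from one node invalidate only that node's entry and leave the rest untouched.
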
  19. 19. 19 Rudder 3 [Architecture diagram] Lightweight reports - Change only reporting - Send reports only for changes And much less disk usage
  20. 20. 20 Rudder 3 ● For this release, devs had between 1000 and 2000 nodes on their dev systems ● A lot of timing info embedded in Rudder ● This made it possible to identify low-hanging fruit ● As a result, everything was much faster ● 500ms compute time with 2000 nodes was considered slow, and reported as a bug
  21. 21. 21 Rudder 3.1 – 5000 nodes ● Rudder 3.1 – reaching the 5000 nodes limit (well – 7500 at the end of its life) ● This is the land of micro-optimization, pushing the limits of the model – Lazy variables to prevent computation of unwanted values ● Micro tuning of techniques to make policy generation faster – But we are still talking about 45 minutes for 5000 nodes with policy validation ● Massive performance upgrade of the agent – Change complexity of managing big policies
  22. 22. 22 Rudder 3.1 – 5000 nodes ● Tooling to generate compliance reports from nodes ● Load servers, detect issues in compliance computing ● Extensive use of PgBadger to analyze PostgreSQL logs – From both test benches and production systems – Finding the slow queries and the limits ● Thank you @matya_j !! https://github.com/dalibo/pgbadger
  23. 23. 23 Rudder 4: going beyond
  24. 24. 24 Rudder 4.0: massive changes ● Policies ● Each policy is identified by an id ● Change database model – Use Doobie, an excellent database library that lets you write proper SQL – Configuration is stored in JSON rather than JOINs ● No « leaking » of policy changes from one node to another – Regenerate only for the nodes that have been changed ● Policy generation is much faster – About 30 times faster (without policy validation)
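
Regenerating policies only for the nodes that changed can be sketched by hashing each node's configuration and comparing it with the hash recorded at the previous generation. The slides do not say how Rudder 4.0 detects changed nodes, so the hashing approach, `config_hash` and `nodes_to_regenerate` are assumptions made for this example.

```python
import hashlib
import json
from typing import Any, Dict, List


def config_hash(config: Dict[str, Any]) -> str:
    """Stable hash of a node configuration (keys sorted for determinism)."""
    payload = json.dumps(config, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def nodes_to_regenerate(node_configs: Dict[str, Dict[str, Any]],
                        previous_hashes: Dict[str, str]) -> List[str]:
    """Return the ids of nodes whose configuration hash changed or is new."""
    return sorted(
        node_id
        for node_id, config in node_configs.items()
        if previous_hashes.get(node_id) != config_hash(config)
    )
```

A change to one node's configuration then leaves every other node's hash, and therefore its generated policies, untouched.
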
  25. 25. 25 Rudder 4.0: massive changes ● Compliance ● Compliance is computed when reports are received server side, and cached – Twice as fast display of compliance with 1000 nodes, an order of magnitude faster with 5000 nodes ● Audit mode ● New LDAP backend (lmdb based)
  26. 26. 26 Rudder 4.1: the road to 10k ● UI is much faster ● All resources are cached ● Compress everything (big impact on bad network with large installs and distant server) ● Policy generation is pretty fast (if we don’t validate them) ● About 3 minutes for 7000 nodes ● External data sources ● We can trigger changes from remote tools ● Hooks on events ● Allow fine tuning of the behaviour of node acceptance/deletion/policy generation ● Thank you @FlorianHeigl1 !
  27. 27. 27 Rudder 4.3: 10k ● Policy engine has been rewritten ● Pluggable, less mutable, a bit faster ● We can manage 10k nodes on one Rudder server ● Recommended configuration is 11GB for the Web Interface for 10k nodes ● Adding more RAM/CPU/IO is enough to go to 15k nodes ● Still not perfect ● Policy generation is long with 10k and policy validation activated ● UI will be sluggish – because of DOM computations – Might be ok with Firefox 59 ● API will be ok
  28. 28. 28 What’s next? ● Improve tooling suite ● Working with Florian Heigl to automate a super large test platform – Automatically create nodes, rules, reports – At high rate – Check application response rate and loads ● Find new bottlenecks using sysdig
  29. 29. 29 What’s next? ● Improve tooling suite ● Improve usability and documentation of load tools – So that more users/contributors can use them ● Automate UI tests and measure the response time at each commit
  30. 30. 30 The road to 50k nodes ● Several types of bottlenecks ● Policy validation – We can’t realistically validate 50 000 policies on the server – Policy validation on client side via two-step policy updates ● GUI – Paginate results on the server side ● Ease client side burden ● Improve response rate (especially over slow networks) – Switch from Angular to Elm
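
Server-side pagination, as proposed for the GUI, means the server returns one page of results plus enough metadata for the client to render paging controls, instead of shipping every row to the browser. The sketch below assumes an in-memory list of rows and a hypothetical `paginate` helper with 1-based page numbers.

```python
from typing import Any, Dict, Sequence


def paginate(rows: Sequence[Any], page: int, page_size: int) -> Dict[str, Any]:
    """Return one page of rows plus paging metadata (pages are 1-based)."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be >= 1")
    total = len(rows)
    start = (page - 1) * page_size
    return {
        "items": list(rows[start:start + page_size]),
        "page": page,
        "page_size": page_size,
        "total": total,
        # Ceiling division without floating point.
        "total_pages": (total + page_size - 1) // page_size,
    }
```

With 50k nodes and a page size of 50, the browser would build 50 table rows per request rather than 50 000, which eases the client-side burden mentioned on the slide.
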
  31. 31. 31 The road to 50k nodes ● Several types of bottlenecks ● Network – Current protocol is not suited to updating hundreds of thousands of files – Reports are sent back from nodes to Rudder server via syslog ● Missing compression ● Rsyslog-psql does one insert/commit in database per received log :( ● Policy generation – Upgrade or replace StringTemplate to lessen IO – More static files ● Database – Use PostgreSQL 10 partitioning to speed up compliance and archiving
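
The cost of one insert and commit per received log can be reduced by grouping reports into batches and committing once per batch. The sketch below uses Python's built-in `sqlite3` as a stand-in for PostgreSQL so that it runs on its own; the `reports` table, its columns and `insert_reports_batched` are hypothetical, and this is not how rsyslog or Rudder are actually configured.

```python
import sqlite3
from itertools import islice
from typing import Iterable, Tuple


def insert_reports_batched(conn: sqlite3.Connection,
                           reports: Iterable[Tuple[str, str]],
                           batch_size: int = 500) -> int:
    """Insert (node_id, message) reports, committing once per batch.

    Returns the number of commits performed.
    """
    iterator = iter(reports)
    commits = 0
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        conn.executemany(
            "INSERT INTO reports (node_id, message) VALUES (?, ?)", batch
        )
        conn.commit()  # one commit for the whole batch, not one per report
        commits += 1
    return commits
```

For 1200 reports and a batch size of 500, this performs 3 commits instead of 1200.
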
  32. 32. 32 The road to 50k nodes ● Missing features ● We can expect every user of a given installation to need to manage the whole 50k nodes – Fine grained authorization (OrBAC) – Multi-tenancy – Federation/Synchronisation of different Rudder servers ● A lot of thinking needs to be put in there ● Improve collaboration – Notifications everywhere! – Warn if another user is modifying the current object ● Change management – Canary testing – Ramp-up deployment
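
Canary testing followed by a ramp-up deployment can be sketched as splitting the node list into successive waves, each covering a larger cumulative share of the nodes. The percentages and the `ramp_up_waves` helper are assumptions chosen for illustration; the slides do not describe how Rudder would implement this.

```python
from typing import List, Sequence


def ramp_up_waves(node_ids: Sequence[str],
                  percentages: Sequence[int] = (1, 10, 100)) -> List[List[str]]:
    """Split nodes into waves; each percentage is a cumulative target."""
    total = len(node_ids)
    waves: List[List[str]] = []
    done = 0
    for pct in percentages:
        # Ceiling of total * pct / 100 using integer arithmetic only.
        target = min(total, (total * pct + 99) // 100)
        if target > done:
            waves.append(list(node_ids[done:target]))
            done = target
    return waves
```

The first wave plays the role of the canary: a change would only move on to the next wave once the nodes in the previous one report the expected compliance.
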
  33. 33. 33 Final words ● We are very lucky to have great users pushing the limits ● A special thank you to all of you: Dennis, Olivier, Florian, Christophe, Janos, Pierre, Stéphane, Marc, Alexander, David, Fabrice, Daniel, Dmitry, Ferenc, François, Vincent, Jean, Lionel, Maxime, Michael, Enrico, Ilan, Jean Marie, Jeremy, … (and I’m terribly sorry for all those that I did not mention) ● Tools, software and resources evolved during Rudder's life ● They helped improve the scalability as well
  34. 34. How we scaled Rudder to 10k nodes Questions? Nicolas CHARLES Co-founder and COO @nico_charles
