2. The Beginning
● Consul version 0.3
● A few dozen agents
● One DC
● Main usage: Internal LB
● Script checks
3. The expansion
● Growing to hundreds, then thousands of agents
● Mapping all infrastructure & services
● Creating automation for JSON + script generation (example below)
● Additional DC
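A minimal sketch of the kind of service definition this automation might emit (one service plus a legacy script check); the service name, port, tags, and script path are placeholders:

{
  "service": {
    "name": "my-backend",
    "port": 8080,
    "tags": ["prod", "lb"],
    "check": {
      "script": "/usr/local/bin/check_my_backend.sh",
      "interval": "10s"
    }
  }
}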
4. The Abuse
● Consul as Vault backend
● KV for state
● Users implement discovery over KV (sketch below)
● Consul locks for huge clusters
● Consul KV for reverse proxy
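One way the KV-based discovery and reverse-proxy pattern can be wired is a keyprefix watch that runs a handler whenever keys under a prefix change; the prefix and handler script below are hypothetical:

{
  "watches": [
    {
      "type": "keyprefix",
      "prefix": "services/my-app/",
      "handler": "/usr/local/bin/rebuild-proxy-upstreams.sh"
    }
  ]
}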
5. First signs of trouble
● IOwait on servers
● Slow KV updates (a few seconds)
● Sporadic DNS query failures
● Restarts cause Raft failures
9. More stability issues
● Network saturation on servers
● Raft failures: “No leader found”
● Service discovery / DNS failures
● Nodes fail to leave
10. The Problem
High register/deregister rates due to auto scaling
Node conflicts increase Raft traffic between Consul servers
Traffic bursts cause network saturation on the leader
Raft heartbeats from the other servers fail
A leader election is initiated
11. Stabilizing
● Better control on join & leave
● Add timestamp to node name
● Increase raft_multiplier (config sketch below)
● Upgrade consul version
● Cleaned up stuff
● Decrease reconnect_timeout
● Add TTL + allow_stale
● Add DNS caching daemon
● Migrate servers to ENA-enabled instances
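A sketch of several of these knobs in agent config JSON; the values, join addresses, and timestamped node name are illustrative only, and raft_multiplier (under performance) only matters on the servers:

{
  "node_name": "web-i-0abc123-20180115120000",
  "retry_join": ["10.0.0.10", "10.0.0.11"],
  "leave_on_terminate": true,
  "skip_leave_on_interrupt": false,
  "reconnect_timeout": "8h",
  "performance": {
    "raft_multiplier": 7
  }
}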
12. More stuff on the way
● Serf queue depth errors
● Reaping old nodes
● Bootstrap vs bootstrap-expect
● Security hole by default…
● enable-script-checks (config sketch below)
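For the last two items, a server-side sketch: bootstrap_expect instead of bootstrap, and script checks (which older releases allowed without any opt-in) left disabled unless they are really needed; the cluster size is illustrative:

{
  "server": true,
  "bootstrap_expect": 5,
  "enable_script_checks": false
}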
13. Successful upgrade to 1.0.3
Start the migration in a small region
Start with clients
Handle node-id cleanup
Gradually migrate agents, service by service
Demote the Raft protocol version for compatibility (sketch below)
Migrate one server
Migrate all servers
Migrate bigger region
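Demoting Raft here means pinning raft_protocol on the new 1.0.3 servers so they can participate alongside the old ones; a sketch, with the exact version depending on what the existing cluster speaks:

{
  "server": true,
  "raft_protocol": 2
}

consul operator raft list-peers is a convenient way to confirm the peer set between each step.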
14. Consul DNS tactics & configurations
1) Use allow_stale, set service_ttl, increase the UDP answer limit:
"dns_config": {
"allow_stale": true,
"max_stale": "28800s",
"node_ttl": "0s",
"udp_answer_limit": 30,
"service_ttl": {
"*": "5s"
}
}
2) Seed BIND with Consul service records as a fallback
3) Forward via BIND for easy fault tolerance (example below)
4) Use a DNS caching daemon (we use pdnsd) to reduce load on Consul and enable negative TTL
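For items 2 and 3, the usual BIND pattern is a forward zone pointing at the agent's DNS port (assuming the default 127.0.0.1:8600); DNSSEC validation normally has to be disabled for this zone to resolve:

// forward the "consul" TLD to the local Consul agent
zone "consul" IN {
    type forward;
    forward only;
    forwarders { 127.0.0.1 port 8600; };
};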