Growing with Consul to 4K agents and beyond
The Beginning
● Consul version 0.3
● A few dozen agents
● One DC
● Main usage: Internal LB
● Script checks
The expansion
● Growing to hundreds, then thousands, of agents
● Mapping all infrastructure & services
● Creating automation to generate JSON and scripts
● Additional DC
The Abuse
● Consul as Vault backend
● KV for state
● Users implement discovery over KV
● Consul locks for huge clusters
● Consul KV for reverse proxy
First signs of trouble
● IOwait on servers
● Slow KV updates (a few seconds)
● Sporadic DNS query failures
● Restarts cause Raft failures
Upgrade to 0.8.5
The Plan:
Upgrade to 0.8.5
The Failure:
Improve Visibility
Example:
"telemetry": {
  "statsd_address": "statsd.eu1.appsflyer.com:8125",
  "metrics_prefix": "app.eu1.consul-20001-002-prod"
}
We monitor:
Consul:
serf, raft, rpc, kv queries, service queries + 404, de/register, Leader change,
locks, reap count, intent queue
System metrics:
Connections, IOwait, Memory used, Network bandwidth, LoadAvg, DiskFree
More stability issues
● Network saturation on servers
● Raft failures: “No leader found”
● Service discovery / DNS failures
● Nodes failing to leave
The Problem
High de/register rates due to auto scaling
Node conflicts increase Raft traffic between Consul servers
High bursts cause network saturation on the leader
Raft heartbeats from the other servers fail
Leader election is initiated
Stabilizing
● Better control on join & leave
● Add timestamp to node name
● Increase raft_multiplier
● Upgrade Consul version
● Clean up stuff
● Decrease reconnect_timeout
● Add TTL + allow_stale
● Add DNS caching daemon
● Migrate servers to ENA-enabled instances
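A few of the stabilizing steps above can be sketched as a generated agent config. This is a minimal sketch, not the configuration used in the talk: the helper name `build_agent_config`, the hostname, and the values chosen for `raft_multiplier` and `reconnect_timeout` are assumptions for illustration. Only the option names `node_name`, `performance.raft_multiplier` and `reconnect_timeout` come from Consul's agent configuration.

```python
import json
import time


def build_agent_config(hostname, now=None, raft_multiplier=5, reconnect_timeout="8h"):
    """Build a Consul agent config fragment (hypothetical helper).

    A timestamp suffix makes the node name unique, so a replacement
    instance that reuses a hostname does not conflict with the old node.
    """
    ts = int(time.time() if now is None else now)
    return {
        # Unique per launch: "<hostname>-<unix timestamp>".
        "node_name": f"{hostname}-{ts}",
        # Server-side setting: a larger multiplier loosens Raft timing.
        # The value 5 is an assumed example, not taken from the talk.
        "performance": {"raft_multiplier": raft_multiplier},
        # Server-side setting: how long a failed node is kept before it
        # is reaped. "8h" is an assumed example value.
        "reconnect_timeout": reconnect_timeout,
    }


if __name__ == "__main__":
    print(json.dumps(build_agent_config("web-01"), indent=2))
```

Generating the name at launch time suits auto-scaled instances, where the same hostname may come back with a different node ID.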
More stuff on the way
● Serf in depth queue errors
● Reaping old nodes
● Bootstrap vs bootstrap-expect
● Security hole by default…
● enable_script_checks
Successful upgrade to 1.0.3
Start the migration on small region
Start with clients
Handle node-id cleanup
Gradually migrate agent by services
Demote raft for compatibility
Migrate one server
Migrate all servers
Migrate bigger region
Consul DNS tactics & Configurations
1) Use allow_stale, set service_ttl, increase server count
"dns_config": {
  "allow_stale": true,
  "max_stale": "28800s",
  "node_ttl": "0s",
  "udp_answer_limit": 30,
  "service_ttl": {
    "*": "5s"
  }
}
2) Seed BIND with Consul service records as a fallback
3) Forward via BIND for easier fault tolerance
4) Use a DNS caching daemon (we use pdnsd) to reduce load on Consul and enable negative TTL
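The effect of a caching layer with positive and negative TTLs (tactic 4) can be modeled in a few lines. This is an illustrative sketch only, not how pdnsd is implemented: the class `TtlCache`, its parameters, and the `resolve` callable are hypothetical names, and the 5s TTL simply mirrors the service_ttl shown above.

```python
import time


class TtlCache:
    """Toy model of a DNS cache with positive and negative TTLs."""

    def __init__(self, resolve, ttl=5.0, negative_ttl=2.0, clock=time.monotonic):
        self._resolve = resolve            # upstream lookup, e.g. a query to Consul
        self._ttl = ttl                    # seconds to keep a successful answer
        self._negative_ttl = negative_ttl  # seconds to keep a failed lookup
        self._clock = clock
        self._entries = {}                 # name -> (expires_at, answer or None)

    def lookup(self, name):
        now = self._clock()
        hit = self._entries.get(name)
        if hit is not None and hit[0] > now:
            return hit[1]                  # served from cache, upstream untouched
        try:
            answer = self._resolve(name)
            expires_at = now + self._ttl
        except LookupError:
            # Negative caching: remember the miss so repeated queries for
            # a missing name do not reach the upstream every time.
            answer = None
            expires_at = now + self._negative_ttl
        self._entries[name] = (expires_at, answer)
        return answer
```

With many clients asking for the same service name, the upstream sees at most one query per name per TTL window instead of one per client request.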
Questions
