Consul Administration At Scale
Pierre Souchay
Discovery Team @Criteo
Twitter: @vizionr
Github: pierresouchay
Our 30-minute presentation
1. Numbers: what we do
2. Make it work 24/24: our pillars
3. Tools to scale: what's new?
4. Consul everywhere: benefits
5. Tools / References, Q&A
Numbers
The discovery team
• Consul in use for 3+ years @Criteo
• Dedicated team is 6 months old, 5 people
• SDK development (JVM / C# / Python), tools (GUIs)
• Handles all infrastructure, on call 24/24 7/7
• Architecture; 1st external Consul contributor (70+ PRs)
Our Infrastructure
• Prod: 35k bare-metal hosts (40/60 Windows/Linux), 8 DCs (2 Hadoop)
• 3,200 kinds of services with 260k instances
• Up to 2.5M req/sec, 100+ PB of data in Hadoop
• More than 300 developers: we MUST scale users too
Consul to rule them all
• Automatic load balancer provisioning (F5/HAProxy)
• SDKs provide discovery for all apps
• DNS provides discovery for systems that are not Consul-aware
• Bare-metal systems / Hadoop / Mesos (~Nomad)
• K/V for configuration of some tools
When Consul is down,
Criteo is down.
Make it work 24/24 - 7/7
Rule #1 - (1/3) Full automation - as predictable as possible
• 35k Consul agents installed by Chef
• Registration of services
  • by Chef with helpers: standardized/easy
  • in Mesos: standardized/automatic
Rule #1 - (2/3) Full automation - as predictable as possible
• More than 3k services; service registration is protected by ACLs
• ACLs as a Service: REST API
• No service conflicts by default; goal: 1 ACL per service
• Help people add service metadata: version, alerts...
• Deploy in preprod, check ACLs, go to prod
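As a rough illustration of the "1 ACL per service" goal, the sketch below renders a Consul ACL rule that grants write access to a single service name. The helper name service_acl_rules is hypothetical, and this is not Criteo's ACLs-as-a-Service API, only the kind of rule such an API could plausibly generate.

```ruby
# Hypothetical helper: render a Consul ACL rule that lets a token
# register exactly one service name (the "1 ACL per service" goal).
def service_acl_rules(service_name)
  name = service_name.to_s
  raise ArgumentError, 'service name must not be empty' if name.empty?
  raise ArgumentError, 'service name must not contain quotes' if name.include?('"')

  # Squiggly heredoc: the common leading indentation is stripped.
  <<~HCL
    service "#{name}" {
      policy = "write"
    }
  HCL
end

puts service_acl_rules('my-web-app')
```

Because each rule names one service only, a team holding the matching token cannot register instances under another team's service name by accident.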
Rule #1 - (3/3) Full automation - as predictable as possible
• Secure by default in order to be predictable
• Nobody can write to the APIs from outside localhost
  • https://github.com/hashicorp/consul/issues/4712
  • Available in Consul 1.4.2+
• Reduce entropy added by humans
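A minimal sketch of the "writes from localhost only" rule. To my knowledge, the agent option that came out of issue 4712 is allow_write_http_from under http_config; treat that name and the CIDR value as assumptions to check against the documentation of your Consul version. The write_allowed? helper is hypothetical and only illustrates the intended semantics, it is not Consul's implementation.

```ruby
require 'ipaddr'
require 'json'

# Assumption: only loopback addresses may call write endpoints.
LOCAL_ONLY = ['127.0.0.0/8'].freeze

# Agent configuration fragment (JSON form) restricting HTTP writes.
def agent_http_config(networks = LOCAL_ONLY)
  { 'http_config' => { 'allow_write_http_from' => networks } }
end

# Illustration of the semantics: a write is accepted when the caller's
# address belongs to one of the allowed networks. An empty list means
# "no restriction".
def write_allowed?(remote_ip, networks = LOCAL_ONLY)
  return true if networks.empty?

  addr = IPAddr.new(remote_ip)
  networks.any? { |cidr| IPAddr.new(cidr).include?(addr) }
end

puts JSON.pretty_generate(agent_http_config)
```

With this in place, registrations have to go through the local agent (Chef, Mesos), which is what keeps the system predictable.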
Rule #2 - Metrics (1/3)
• Blackbox monitoring (5+ probes in each DC)
• Register a service, wait for its publication in the Consul catalog
• SLA: objective of 1s to register a service, 3s max
• When the SLA is violated, wake up the on-call
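The blackbox probe described above can be sketched as follows. This is not Criteo's probe: the function names, the injected callables, and the intermediate :degraded state are assumptions. Only the 1s objective and the 3s maximum come from the slide. The register and visible callables stand for the real calls to the agent and the catalog.

```ruby
SLA_OBJECTIVE_S = 1.0 # objective: service published within 1 second
SLA_MAX_S = 3.0       # beyond 3 seconds the SLA is violated

# Registers a probe service, then polls until it is visible.
# Returns the elapsed time in seconds, or nil on timeout.
# register: callable performing the registration (assumed)
# visible:  callable returning true once the catalog shows the service
# clock:    callable returning a monotonic time in seconds
def measure_publication(register:, visible:, clock:, timeout: 10.0,
                        poll_interval: 0.05, sleeper: ->(s) { sleep(s) })
  start = clock.call
  register.call
  loop do
    return clock.call - start if visible.call
    return nil if clock.call - start >= timeout

    sleeper.call(poll_interval)
  end
end

# Maps a measurement to an action.
def classify(elapsed)
  return :page_on_call if elapsed.nil? || elapsed > SLA_MAX_S
  return :ok if elapsed <= SLA_OBJECTIVE_S

  :degraded
end
```

In a real probe, clock would typically be -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) }, so that wall-clock adjustments do not distort the measurement.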
Rule #2 - Metrics (2/3)
• Consul metrics
• Native Prometheus support
• Additional on-call alerts
• Track new usages (increase in RPCs, DNS calls…)
• Debug when something goes wrong
Rule #2 - Metrics (3/3)
• Consul-templaterb: metrics.erb exports to Prometheus
• Provides the rate of changes
• Provides instance counts: Passing/Warning/Critical
• View from an agent's point of view, not the Consul server's
Rule #3 - Logs, info and History
• Logs in Kibana for Consul servers and a few canary agents
• Analyzed regularly for early error detection
• Expose all data to everybody
• Instant view of all services
• Timeline of changes for all services
Rule #4 - Ready to patch
• Consul fork: mainline with patches
• Ready to go to prod in less than 2 hours
• Compare metrics after deployment
• Preprod → Observe → Prod
• Deploy feature by feature, no bulk updates
Rule #5 - Work on upstream
• Look at all issues on GitHub
• Look for known patterns
• Check whether an issue might impact us
• Open a PR when an issue is potentially critical for us (e.g. #5050)
Tools/Hints to scale
Consul-UI: scalable UI to show all details about a service
Consul-UI: timeline of changes: not an Ops problem anymore
Consul-templaterb metrics.erb: changes/sec on a service
Changes/sec is a good indicator; it lets you detect:
- deployments (right)
- incidents or future incidents
- optimizations to perform
Many optimizations/fixes from Criteo:
- #3889, #4720 and many more merged
- #5050 allowed us to gain more than x100 in performance!
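One way to derive a changes/sec figure is sketched below. It is not how metrics.erb computes it. The assumption here is that each time the index observed for a service differs from the previous one (for instance the X-Consul-Index header of a health query), one change is counted, and the rate is taken over a sliding window. The class name ChangeRate is hypothetical.

```ruby
# Sliding-window counter for "changes per second" on one service.
class ChangeRate
  def initialize(window_seconds)
    @window = window_seconds.to_f
    @last_index = nil
    @events = [] # timestamps (seconds) at which a change was observed
  end

  # Record one observation of the service index at time `now`.
  # The very first observation only sets the baseline.
  def observe(index, now)
    @events << now if !@last_index.nil? && index != @last_index
    @last_index = index
    self
  end

  # Changes per second over the window ending at `now`.
  def rate(now)
    @events.reject! { |t| t <= now - @window }
    @events.size / @window
  end
end

rate = ChangeRate.new(10)
rate.observe(100, 0).observe(100, 1).observe(105, 2).observe(110, 3)
puts rate.rate(5) # two changes seen in the last 10 seconds
```

A sustained rise of this value outside deployment windows is the kind of signal the slide refers to as "incidents or future incidents".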
Consul-Templaterb: script everything! 1/2
<%
# This script cleans up all services tagged `marathon` whose instances have
# fewer than 2 health checks, i.e. nothing besides the node-level SerfHealth check.
instances_to_cleanup = 0
total_instances = 0
datacenters.each do |dc|
  services(dc: dc, tag: 'marathon').each do |service_name, tags|
    service(service_name, dc: dc, tag: 'marathon').each do |snode|
      total_instances += 1
      if snode['Checks'].count < 2
        instances_to_cleanup += 1
        # Emit one deregistration command per instance, run on the node that owns it.
%>ssh $SSH_OPTIONS <%= snode['Node']['Node'] %> "/usr/bin/curl -H X-Consul-Token:$CONSUL_DEREGISTER_TOKEN -fs -XPUT localhost:8500/v1/agent/service/deregister/<%= snode['Service']['ID'] %>"
<%
      end
    end
  end
end %>
echo Found <%= instances_to_cleanup %> / <%= total_instances %> instances to cleanup
Consul-Templaterb: script everything! 2/2
Call it once…
$ consul-templaterb -c <CONSUL_ADDR> ./clean_svcs_without_hc.sh.erb --once && bash ./clean_svcs_without_hc.sh
Or automatically every minute!
--wait 60 --template "clean_svcs_without_hc.sh.erb:./result.sh:bash ./result.sh"
Generated script:
ssh $SSH_OPTIONS mesos-slave017-pa4.central.criteo.preprod "/usr/bin/curl -H X-Consul-Token:$CONSUL_DEREGISTER_TOKEN -fs -XPUT localhost:8500/v1/agent/service/deregister/marathon-app-deepr-pipeline-31510-23d44ebc2b8d11e9b0125065f387ef80"
ssh $SSH_OPTIONS mesos-slave019-pa4.central.criteo.preprod "/usr/bin/curl -H X-Consul-Token:$CONSUL_DEREGISTER_TOKEN -fs -XPUT localhost:8500/v1/agent/service/deregister/marathon-app-jtc-jtc-app-31934-1d5a8e202ba611e9b0125065f387ef80"
echo Found 2 / 1812 instances to cleanup
Useful configuration hints
• See the video from Michael Stewart (To 20,000 Nodes and Beyond)
• We had the same stories, found the same tricks
• Read the Consul docs: everything is an RPC, there is no cache by default
• Use discovery_max_stale to scale servers horizontally
• Use a TTL for DNS and allow_stale = true
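The last two hints can be sketched as an agent configuration fragment, generated here from Ruby so that it can be templated. The option names (discovery_max_stale, dns_config.allow_stale, max_stale, service_ttl, node_ttl) are Consul agent options as I understand them. The 5s values and the helper name stale_read_config are purely illustrative assumptions, not values recommended in the talk.

```ruby
require 'json'

# Builds a Consul agent configuration fragment (JSON form) that lets
# any server answer discovery and DNS queries with slightly stale data,
# and lets DNS clients cache answers for a short TTL.
def stale_read_config(max_stale: '5s', dns_ttl: '5s')
  {
    'discovery_max_stale' => max_stale,
    'dns_config' => {
      'allow_stale' => true,
      'max_stale' => max_stale,
      'service_ttl' => { '*' => dns_ttl }, # TTL for all service lookups
      'node_ttl' => dns_ttl                # TTL for node lookups
    }
  }
end

puts JSON.pretty_generate(stale_read_config)
```

Stale reads spread the load across all servers rather than the leader alone, which is why they allow scaling servers horizontally. The trade-off is that answers may lag behind the leader by up to the configured staleness.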
Consul Everywhere
Benefits (1/2)
• Inversion of Control
• Monitoring can be automated: ratio > 0.5 passing/critical
• Everything is "as a Service"; users are free to experiment (full network automation, for instance)
• Everything is standardized
• ServiceMeta standardization, LB weights...
• Build features on top of services: monitoring, version tracking
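The automated monitoring rule can be sketched as below. The slide only gives "ratio > 0.5 passing/critical"; reading it as "more than half of the instances must be passing" is my assumption, as are the function names. Entries follow the shape returned by Consul's /v1/health/service/<name> endpoint, where each instance carries a list of Checks with a Status.

```ruby
# An instance counts as passing only if all of its checks are passing
# (an instance without any check is treated as passing).
def instance_passing?(entry)
  (entry['Checks'] || []).all? { |check| check['Status'] == 'passing' }
end

# Fraction of instances that are passing, between 0.0 and 1.0.
def passing_ratio(entries)
  return 0.0 if entries.empty?

  entries.count { |entry| instance_passing?(entry) }.to_f / entries.size
end

# Assumed alerting rule: healthy when strictly more than `threshold`
# of the instances are passing.
def service_healthy?(entries, threshold: 0.5)
  passing_ratio(entries) > threshold
end

sample = [
  { 'Checks' => [{ 'Status' => 'passing' }] },
  { 'Checks' => [{ 'Status' => 'passing' }, { 'Status' => 'critical' }] },
  { 'Checks' => [{ 'Status' => 'passing' }] }
]
puts service_healthy?(sample)
```

Because the rule relies only on data every service already publishes in Consul, it can be applied uniformly with no per-service monitoring configuration.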
Benefits (2/2)
• Debugging is easier
• One single place to look for configuration
• LB/API load balancing works the same way
• Nothing is hidden: people can troubleshoot by themselves
• The team is not a SPOF for debugging issues
Tools / references
Open-Source Tools
• Consul-templaterb: https://github.com/criteo/consul-templaterb/
  • Script/hack/automate easily: supports hot reload
  • Provides Consul-UI as well as Consul-timeline
  • Provides additional Prometheus endpoints (service changes)
• https://github.com/pierresouchay/consul-ops-tools
  • Small scripts to help debug Consul (will be enriched)
• A Consul Story: To 20,000 Nodes and Beyond (video)
Q&A
Discovery Team @Criteo
Twitter: @vizionr
Github: pierresouchay

Consul administration at scale

  • 1. Pierre Souchay Discovery Team @Criteo Twitter: @vizionr Github: pierresouchay Consul Administration At Scale
  • 2. 2 • 1 2 3 4 5 Numbers What we do Make it work 24/24 Our pillars Tools to scale What’s new? Consul everywhere Benefits Tools / References Q&A Our 30 minutes presentation
  • 4. The discovery team
      • Consul in use for 3+ years @criteo
      • Dedicated team is 6 months old, 5 people
      • SDK development (JVM / C# / Python), tools (GUIs)
      • Handles all infrastructure, on-call 24/7
      • Architecture, 1st external Consul contributor (70+ PRs)
  • 5. Our Infrastructure
      • Prod: 35k bare-metal hosts (40/60 Windows/Linux), 8 DCs (2 Hadoop)
      • 3,200 kinds of services with 260k instances
      • Up to 2.5M req/sec, 100+ PB of data in Hadoop
      • More than 300 developers: we MUST scale users too
  • 6. Consul to rule them all
      • Automatic load balancer provisioning (F5/HAProxy)
      • SDKs provide discovery for all apps
      • DNS provides discovery for systems that are not Consul-aware
      • Bare-metal systems / Hadoop / Mesos (~Nomad)
      • K/V for configuration of some tools
  • 7. When Consul is down, Criteo is down.
  • 8. Make it work 24/7
  • 9. Rule #1 - (1/3) Full automation - as predictable as possible
      • 35k Consul agents installed by Chef
      • Registration of services
          - by Chef with helpers: standardized/easy
          - in Mesos: standardized/automatic
  • 10. Rule #1 - (2/3) Full automation - as predictable as possible
      • More than 3k services, with service registration protected by ACLs
      • ACLs as a Service: REST API
      • No service conflicts by default; goal: 1 ACL per service
      • Add service metadata and help people add it: version, alerts...
      • Deploy in preprod, check ACLs, go to prod
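The "1 ACL per service" goal can be illustrated with a minimal sketch that renders one write rule per service name. This is not Criteo's ACLs-as-a-Service API, which the slides do not describe; the function name `service_write_rule` is hypothetical, and only the `service "<name>" { policy = "write" }` rule syntax comes from Consul itself.

```python
def service_write_rule(service_name: str) -> str:
    """Return a Consul ACL rule granting write access to exactly one service.

    Hypothetical helper: a token carrying only this rule can register
    `service_name` and nothing else, which avoids service conflicts.
    """
    if not service_name or '"' in service_name:
        raise ValueError("service name must be non-empty and contain no quotes")
    return f'service "{service_name}" {{\n  policy = "write"\n}}\n'


if __name__ == "__main__":
    # "my-web-app" is a placeholder service name.
    print(service_write_rule("my-web-app"))
```

Keeping the rule scoped to a single exact service name is what makes registrations predictable: a team's token cannot overwrite another team's service by accident.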
  • 11. Rule #1 - (3/3) Full automation - as predictable as possible
      • Secure by default in order to be predictable
      • Nobody can write to the APIs from outside localhost
      • https://github.com/hashicorp/consul/issues/4712
      • Available in Consul 1.4.2+
      • Reduce entropy added by humans
  • 12. Rule #2 - Metrics (1/3)
      • Blackbox monitoring (5+ probes in each DC)
      • Register a service, wait for its publication in the Consul catalog
      • SLA: objective of 1s to register a service, 3s max
      • When the SLA is violated, wake up the on-call
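The blackbox probe described on this slide could be sketched as follows. The slides do not show the probe's implementation, so everything here is an assumption except the two thresholds (1s objective, 3s max): `register` and `is_published` are hypothetical callables standing in for the real Consul calls, and the classification labels are illustrative.

```python
import time
from typing import Callable, Optional

SLA_OBJECTIVE_S = 1.0  # objective stated on the slide
SLA_MAX_S = 3.0        # maximum stated on the slide


def measure_publication_delay(
    register: Callable[[], None],
    is_published: Callable[[], bool],
    timeout_s: float = 10.0,
    poll_interval_s: float = 0.05,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Register a service, then poll until it is visible in the catalog.

    Returns the observed delay in seconds, or None if the service was not
    published before `timeout_s` elapsed.
    """
    start = clock()
    register()
    while clock() - start < timeout_s:
        if is_published():
            return clock() - start
        sleep(poll_interval_s)
    return None


def classify(delay_s: Optional[float]) -> str:
    """Map a measured delay to an action (labels are hypothetical)."""
    if delay_s is None or delay_s > SLA_MAX_S:
        return "page-on-call"
    if delay_s > SLA_OBJECTIVE_S:
        return "objective-missed"
    return "ok"
```

The clock and sleep functions are injectable so the probe logic can be tested without a running Consul cluster or real waiting.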
  • 13. Rule #2 - Metrics (2/3)
      • Consul metrics
      • Native Prometheus support
      • Additional on-call alerts
      • Track new usages (increase in RPCs, DNS calls…)
      • Debug when things go wrong
  • 14. Rule #2 - Metrics (3/3)
      • Consul-templaterb: metrics.erb exports to Prometheus
      • Provides the rate of changes
      • Provides instance counts: passing/warning/critical
      • View from an agent's point of view, not the Consul server's
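As an illustration of the passing/warning/critical export, here is a minimal sketch that turns health entries into Prometheus text-format lines. It is not the actual metrics.erb template: the metric name `consul_service_instances` is hypothetical, and the input is assumed to have the shape returned by Consul's `/v1/health/service/<name>` endpoint (each entry carrying a `Checks` list with a `Status` field).

```python
from collections import Counter


def worst_status(checks):
    """Return the most severe status among an instance's checks."""
    statuses = {check["Status"] for check in checks}
    for status in ("critical", "warning"):
        if status in statuses:
            return status
    # Assumption: an instance with no failing check counts as passing.
    return "passing"


def render_metrics(service_name, entries):
    """Render per-state instance counts in Prometheus text exposition format."""
    counts = Counter(worst_status(entry["Checks"]) for entry in entries)
    lines = ["# TYPE consul_service_instances gauge"]  # hypothetical metric name
    for state in ("passing", "warning", "critical"):
        lines.append(
            f'consul_service_instances{{service="{service_name}",state="{state}"}} '
            f"{counts[state]}"
        )
    return "\n".join(lines) + "\n"
```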
  • 15. Rule #3 - Logs, info and history
      • Logs in Kibana for Consul servers and a few canary agents
      • Analyzed regularly for early error detection
      • Expose all data to everybody
      • Instant view of all services
      • Timeline of changes for all services
  • 16. Rule #4 - Ready to patch
      • Consul fork: mainline with patches
      • Ready to go to prod in less than 2 hours
      • Compare metrics after deployment
      • Preprod → Observe → Prod
      • Deploy feature by feature, no bulk updates
  • 17. Rule #5 - Work on upstream
      • Look at all issues on GitHub
      • Look for known patterns
      • Check whether an issue might impact us
      • Submit a PR when an issue is potentially critical for us (ex: #5050)
  • 19. Consul-UI: scalable UI to show all details about a service
  • 20. Consul-UI: timeline of changes: not an OPS problem anymore
  • 21. Consul-templaterb metrics.erb: changes/sec on a service
      • Changes/sec is a good indicator; it lets you detect:
          - deployments (right)
          - incidents or future incidents
          - optimizations to perform
      • Many optimizations/fixes from Criteo:
          - #3889, #4720 and many more merged
          - #5050 gave us a more than 100x performance gain!
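A changes/sec series can be derived from periodic samples of a cumulative change counter, as in the sketch below. The slides do not say how metrics.erb computes the rate, so this is an assumption: `samples` is a hypothetical list of `(timestamp_seconds, cumulative_changes)` pairs, and a decreasing counter is treated as a reset.

```python
def changes_per_second(samples):
    """Compute a rate of change between consecutive counter samples.

    samples: list of (timestamp_seconds, cumulative_changes), sorted by time.
    Returns a list of (timestamp_seconds, changes_per_second).
    """
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        elapsed = t1 - t0
        if elapsed <= 0:
            continue  # skip duplicate or out-of-order samples
        delta = c1 - c0
        if delta < 0:
            delta = c1  # assumption: the counter restarted from zero
        rates.append((t1, delta / elapsed))
    return rates
```

A sudden jump in this series on one service is the kind of signal the slide associates with a deployment or an incident.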
  • 22. Consul-Templaterb: script everything! 1/2
      <%
        # This script cleans up all service instances tagged `marathon` that have
        # fewer than 2 checks, i.e. at most the node-level SerfHealth check.
        instances_to_cleanup = 0
        total_instances = 0
        datacenters.each do |dc|
          services(dc: dc, tag: 'marathon').each do |service_name, tags|
            service(service_name, dc: dc, tag: 'marathon').each do |snode|
              total_instances += 1
              if snode['Checks'].count < 2
                instances_to_cleanup += 1
      %>ssh $SSH_OPTIONS <%= snode['Node']['Node'] %> "/usr/bin/curl -H X-Consul-Token:$CONSUL_DEREGISTER_TOKEN -fs -XPUT localhost:8500/v1/agent/service/deregister/<%= snode['Service']['ID'] %>"
      <%
              end
            end
          end
        end
      %>
      echo Found <%= instances_to_cleanup %> / <%= total_instances %> instances to cleanup
  • 23. Consul-Templaterb: script everything! 2/2
      Call it once…
        $ consul-templaterb -c <CONSUL_ADDR> ./clean_svcs_without_hc.sh.erb --once && bash ./clean_svcs_without_hc.sh
      Or automatically every minute!
        --wait 60 --template "clean_svcs_without_hc.sh.erb:./result.sh:bash ./result.sh"
      Generated script:
        ssh $SSH_OPTIONS mesos-slave017-pa4.central.criteo.preprod "/usr/bin/curl -H X-Consul-Token:$CONSUL_DEREGISTER_TOKEN -fs -XPUT localhost:8500/v1/agent/service/deregister/marathon-app-deepr-pipeline-31510-23d44ebc2b8d11e9b0125065f387ef80"
        ssh $SSH_OPTIONS mesos-slave019-pa4.central.criteo.preprod "/usr/bin/curl -H X-Consul-Token:$CONSUL_DEREGISTER_TOKEN -fs -XPUT localhost:8500/v1/agent/service/deregister/marathon-app-jtc-jtc-app-31934-1d5a8e202ba611e9b0125065f387ef80"
        echo Found 2 / 1812 instances to cleanup
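The selection logic of the cleanup template can also be expressed outside consul-templaterb. The sketch below is an assumption-laden equivalent, not part of the original deck: it takes entries shaped like the template's `snode` values (with `Node`, `Service` and `Checks` keys) and returns, for each instance with fewer than 2 checks, the node name and the agent deregister path used in the template. The function name `instances_to_cleanup` is hypothetical.

```python
def instances_to_cleanup(entries, min_checks=2):
    """Select instances that have fewer than `min_checks` health checks.

    Returns a list of (node_name, deregister_path) pairs; the path is meant
    to be called with PUT on the local agent of that node.
    """
    selected = []
    for entry in entries:
        if len(entry["Checks"]) < min_checks:
            node_name = entry["Node"]["Node"]
            service_id = entry["Service"]["ID"]
            selected.append(
                (node_name, f"/v1/agent/service/deregister/{service_id}")
            )
    return selected


if __name__ == "__main__":
    # Placeholder data: one instance with only serfHealth, one with a real check.
    sample = [
        {"Node": {"Node": "node-a"}, "Service": {"ID": "app-1"},
         "Checks": [{"CheckID": "serfHealth"}]},
        {"Node": {"Node": "node-b"}, "Service": {"ID": "app-2"},
         "Checks": [{"CheckID": "serfHealth"}, {"CheckID": "service:app-2"}]},
    ]
    found = instances_to_cleanup(sample)
    print(f"Found {len(found)} / {len(sample)} instances to cleanup")
```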
  • 24. Useful configuration hints
      • See the video from Michael Stewart (To 20,000 Nodes and Beyond)
      • We had the same stories, found the same tricks
      • Read the Consul docs: everything is an RPC, there is no cache by default
      • Use discovery_max_stale to scale servers horizontally
      • Use a TTL for DNS and allow_stale = true
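The configuration hints above map to a handful of Consul agent options. The sketch below builds such a configuration as JSON; the option names (`discovery_max_stale`, `dns_config.allow_stale`, `dns_config.service_ttl`, `dns_config.node_ttl`) are Consul agent settings, but the `5s` durations are placeholder assumptions, not values from the talk, and the function name is hypothetical.

```python
import json


def scaling_hints_config(max_stale="5s", dns_ttl="5s"):
    """Build an agent configuration fragment for the hints on the slide.

    The default durations are placeholders; tune them to how much staleness
    your applications tolerate.
    """
    return {
        # Lets discovery queries be served by any server, not only the leader.
        "discovery_max_stale": max_stale,
        "dns_config": {
            # Allow DNS answers from followers (stale reads).
            "allow_stale": True,
            # Let DNS clients and resolvers cache answers for a short time.
            "service_ttl": {"*": dns_ttl},
            "node_ttl": dns_ttl,
        },
    }


if __name__ == "__main__":
    print(json.dumps(scaling_hints_config(), indent=2))
```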
  • 26. Benefits (1/2)
      • Inversion of Control
      • Monitoring can be automated: ratio > 0.5 passing/critical
      • Everything is As A Service; users are free to experiment (full network automation, for instance)
      • Everything is standardized
      • ServiceMeta standardization, LB weights...
      • Build features on top of services: monitoring, version tracking
  • 27. Benefits (2/2)
      • Debugging is easier
      • One single place to look for configuration
      • LB/API load balancing works the same way
      • Nothing is hidden: people can troubleshoot by themselves
      • The team is not a SPOF for debugging issues
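The automated monitoring rule mentioned under Benefits ("ratio > 0.5 passing/critical") is not spelled out in the slides. One possible reading, sketched here as an assumption, is that a service is considered healthy while more than half of its instances are passing; the function names and the choice to alert when a service has no instances at all are hypothetical.

```python
def passing_ratio(passing, critical):
    """Share of passing instances among passing + critical ones, or None if empty."""
    total = passing + critical
    if total == 0:
        return None
    return passing / total


def should_alert(passing, critical, threshold=0.5):
    """Alert unless the passing share is strictly above the threshold.

    Assumption: a service with no instances at all also triggers an alert.
    """
    ratio = passing_ratio(passing, critical)
    return ratio is None or ratio <= threshold
```

Because the rule depends only on counts that Consul already exposes for every service, it can be applied uniformly without per-service monitoring configuration.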
  • 29. Open-Source Tools
      • Consul-templaterb: https://github.com/criteo/consul-templaterb/
      • Script/hack/automate easily: supports hot-reload
      • Provides Consul-UI as well as Consul-timeline
      • Provides additional Prometheus endpoints (service changes)
      • https://github.com/pierresouchay/consul-ops-tools
          - small scripts to help debug Consul (will be enriched)
      • A Consul Story: To 20,000 Nodes and Beyond (video)
  • 30. Q&A
      Discovery Team @Criteo
      Twitter: @vizionr
      Github: pierresouchay