Observability with HAProxy

HAProxy Technologies
HAProxy TechnologiesHAProxy Technologies
WWW.HAPROXY.COM
Observability
with HAProxy
WWW.HAPROXY.COM
● Introduction to “observability”
● Deep dive in HAProxy
● Tips (to get more from HAProxy)
● RoX (Return On eXperience)
● Conclusion (because we need one)
Agenda
WWW.HAPROXY.COMWWW.HAPROXY.COM
Introduction
WWW.HAPROXY.COM
Observability:
"control theory, observability is a measure of how well
internal states of a system can be inferred from knowledge
of its external outputs"
(wikipedia)
In short for us: WTF is going on ?
Definition
WWW.HAPROXY.COM
● Monitoring tells you how well something works (or not)
● Observability helps you detect what is not working and why
=> You monitor an observable system.
Observability vs monitoring
WWW.HAPROXY.COM
● can’t rely on impatient users’ complaints anymore
=> detect and fix trouble not reported yet
● stop adding duct tape, address root causes!
● improve users experience where it matters
Observability is crucial
WWW.HAPROXY.COM
● central place
● in distributed systems, one LB per layer, even better when
sidecar
● sees multiple targets, eases comparisons
=> draw references
● Trusted low level component & excellent transparency
● many LB decisions actually depend on metrics related to
performance and observability
● again, logs provide long-term references
The LB as an observation tower
WWW.HAPROXY.COM
● Maintain two types of TCP connections
○ From client to HAPRoxy (1)
○ From HAProxy to server (2)
○ Only access to data payload (not the “packet”)
LB in Reverse-proxy mode
HAProxyClient Server
(1) (2)
WWW.HAPROXY.COM
● you should! Especially in distributed systems (microservices,
etc)!
● but often it's too late when the first incident happens!
● with existing LB's logs, it's already possible to do a lot
Note: check OpenTracing and Prometheus
Why not looking from other points?
WWW.HAPROXY.COM
● global failures (aborts, timeouts)
● abnormal delays caused by network retransmits
● connection failures and retries caused by bad tuning (eg:
conntrack)
● connection slowdowns caused by inefficient firewall policies
(#rules)
● Non exhaustive list ...
What does the LB see?
WWW.HAPROXY.COM
● client-side issues (BW limitations)
● per-URL processing time (application issues, svc partners)
● per-node vs per-cluster variations
=> narrow down to individual node or shared resource
● deployment issues : new occasional error on a specific page,
can be addressed before going full-scale
What does the LB see?
WWW.HAPROXY.COMWWW.HAPROXY.COM
Deep dive in
HAProxy
WWW.HAPROXY.COM
haproxy[14389]: 10.0.1.2:33317 [06/Feb/2018:12:14:14.655] http-in static/srv1 10/0/30/69/109 200 2750 - -SDNN 1/1/1/1/0 0/0 {haproxy.org} {}
"GET /index.html HTTP/1.1"
I need help !!!
WWW.HAPROXY.COM
● Logs :
○ Halog, ELK, Splunk, datadog…
○ Provides unique-id for tracing/event correlation
● Statistics :
○ Stats page, CLI, hatop
○ Prometheus, statsd, ...
● Stick-tables (per arbitrary key like IP, URL, cookie, ...) :
○ Byte count, cumulated/concurrent conns, errors, rates…
Accessing metrics in HAProxy
WWW.HAPROXY.COM
Sequence of events in HAProxy
WWW.HAPROXY.COM
Sequence of events in HAProxy
WWW.HAPROXY.COM
Sequence of events in HAProxy
WWW.HAPROXY.COM
Sequence of events in HAProxy
WWW.HAPROXY.COM
Sequence of events in HAProxy
WWW.HAPROXY.COM
Sequence of events in HAProxy
WWW.HAPROXY.COM
Sequence of events in HAProxy
WWW.HAPROXY.COM
Sequence of events in HAProxy
WWW.HAPROXY.COM
Sequence of events in HAProxy
WWW.HAPROXY.COM
Sequence of events in HAProxy
WWW.HAPROXY.COM
Sequence of events in HAProxy
WWW.HAPROXY.COM
Sequence of events in HAProxy
WWW.HAPROXY.COM
Sequence of events in HAProxy
WWW.HAPROXY.COM
Sequence of events in HAProxy
WWW.HAPROXY.COM
Sequence of events in HAProxy
WWW.HAPROXY.COM
● HAProxy now supports heavier per-request workloads (Lua,
device identification, …)
● Processing times over 200 µs can become noticeable
● Actions:
○ log per-request total CPU time spent in analysers
○ log per-request total CPU time spent in TLS handshake
○ log per-request total latency added by other tasks
○ Ability to kill offending tasks
○ Ability to alert on high latencies
=> make HAProxy as observable as other components
And more to come in HAProxy 1.9 (Nov. 2018)
WWW.HAPROXY.COM
● Timers are averaged in the stats
● Each timer appears in the logs
● halog -rt/-RT/-pct for quick analysis
● Each timer crossing a limit triggers a timeout
● Each abort at a specific step causes a hard error
=> termination codes
haproxy[14389]: 10.0.1.2:33317 [06/Feb/2018:12:14:14.655] http-in
static/srv1 10/0/30/69/109 200 2750 - - SDNN 1/1/1/1/0 0/0 {haproxy.org}
{} "GET /index.html HTTP/1.1"
Event timing reports
Timers Term code Cookie code
WWW.HAPROXY.COM
● Who, when, why and how the session has been terminated
● Distinguish between timeout and abort
● Indicate who (client, server, haproxy, kill, ...)
● Indicate when (req, queue, connect, response, data...)
● Completed by persistence cookie indications
● Filtered and sorted by halog :
halog -tcn|-TCN ... # for filtering
halog -tc # for sorting
=> Graphing the number of errors reported by second helps
detecting an issue is in progress somewhere
Termination codes
WWW.HAPROXY.COM
● Stats page: distribution per frontend/backend/server
● Filter by ranges: halog -hs/-HS
● Sorted output: halog -st
=> graph the distribution and watch for variations between
application deployments
HTTP status code distribution
WWW.HAPROXY.COM
● Uses server maxconn
● Grows exponentially with slowdowns : easy to detect!
● Tells you how many extra servers you need
● Reported by halog -Q/-QS
● Shown in real time on the stats page per backend/srv
=> If you watch only one metric, watch this one!
Other relevant metrics: queues
WWW.HAPROXY.COM
LB algorithm implies fairness between servers :
● Equal request count with roundrobin
=> Higher than average concurrency indicates abnormally
slow server
● Equal load with leastconn
=> Low req count indicates abnormally slow server
=> graph relevant values within the farm
Other relevant metrics: LB fairness
WWW.HAPROXY.COM
● Global: halog -e
● Per server: halog -srv
● Per client IP: halog -e -ic (detect bad CDN nodes)
● Per URL: halog -ue
● Stats page: per frontend/backend/server
● Stick-tables: per arbitrary key using http_err_rate()
=> no threshold, watch for variations
Other relevant metrics: Error rate
WWW.HAPROXY.COM
● Default httplog format is quite rich
● Can be improved using the log-format directive
● Hint: log stick-table stats for similar keys
haproxy[14389]: 10.0.1.2:33317 [06/Feb/2018:12:14:14.655] http-in
static/srv1 10/0/30/69/109 200 2750 - - SDNN 1/1/1/1/0 0/0 {haproxy.org}
{} "GET /index.html HTTP/1.1"
Useful entries in log-format
Timers
HTTP status Byte
count
Term
Code
Cookie
Code
Conn
Count
Queue
Length
WWW.HAPROXY.COMWWW.HAPROXY.COM
Tips !!!!
WWW.HAPROXY.COM
"I can't enable logs, I have too much traffic!"
● an average syslog server can store 20k events/s without
sweating
● that's 1.7B events/day or 350GB of uncompressed haproxy
logs/day
● compresses to 1TB/month
● for $100 you can store 4 months with no loss
● have more traffic / not interested in this level of detail ?
# log only 5% of requests
http-request set-log-level silent unless { rand(100) -lt 5 }
Sampling: why / When
Tips !!!!
WWW.HAPROXY.COM
● you only want to catch suspicious events
● enable logs per url, source IP, etc...
● disable logging unless Tc/Tq/Tr/Tw/... is above a certain
threshold
● on the fly for selected keys from the CLI + stick-table
● also see "option dontlognormal"
WARNING: you'll lose any valid reference
Selective logging: why / When
Tips !!!!
WWW.HAPROXY.COM
● Poorly documented, use halog --help
● response time per url: halog -uat
● errors per server: halog -srv
● Percentiles on req/queue/conn/resp times: halog -pct
● detect stolen CPU / swap : halog -ac … -ad …
● very fast (1-2 GB of log file per second)
=> Use it in production to figure the relevant metrics
Other halog goodies
Tips !!!!
WWW.HAPROXY.COMWWW.HAPROXY.COM
ROX
(Return On eXperience)
WWW.HAPROXY.COM
● Many ops and dev do this every day all around the world
● Users are complaining that the application is slow
● Devops team check HAProxy logs (using halog or ELK) in order to isolate
potential sources of issue:
○ Check server Tc
○ Sort servers by response time
○ Sort URLs by response time
● HAProxy won’t fix the problem (in most cases), but will drastically reduce the
troubleshooting time by isolating the potential issues
Web application is slow!!!!!!!!
Return On eXperience
WWW.HAPROXY.COM
● Retail customer, with shops everywhere in France
● Cash register reports the following error message “Unable to reach central
system” when searching people in the loyalty card database, after waiting up to
10s.
● Argument between customer and “central system” software provider:
○ Customer: your software is slow
○ Provider: your network triggers this error
● After installing HAProxy, its logs reported the following:
○ Time to get the query from the client: 300ms (from North of France to Paris)
○ Server response time: 50s
● Conclusion: provider turned off debug mode in the “central system” software
Unavailable central system ?!?
Return On eXperience
WWW.HAPROXY.COM
● Customer with big TLS traffic, mostly API based
● Each HAProxy server can’t go higher than 20K HTTPs req/s (in 2015), even if
there seem to be a lot of unused resources on the box
● At first, this was qualified as an “HAProxy” issue (including the HW and OS
itself)
● Hard tuning the box did not improve anything
● halog Tc percentile reported higher values when the load increased (up to
30ms, on the LAN!!!)
● After asking customer to double check the Spanning Tree, it seemed a small 24
ports ToR switch became root bridge….
● Fixing spanning tree also fixed HAProxy’s TLS performance
HAProxy TLS performance issue
Return On eXperience
WWW.HAPROXY.COM
● Tc from HA1 to srv 1,2,3,5 always low, srv 4,6 high at 99 pct
● Tc from HA2 to srv 1,2,4,6 always low, srv 3,5 high at 99 pct
=> both haproxy and servers out of cause
● issue rate stable at various traffic levels => not congestion
● inter-switch link apparently at cause but not for all flows
● inter-switch link made of two fibers balanced on MAC tuple
● thanks to long-term logs, origin could even be identified
Spotting a broken fiber channel between 2
core switches
Return On eXperience
WWW.HAPROXY.COM
● Tc abnormally high with lots of random values to several
seconds, and only for TLS
● Tc timer also covers TLS handshake
=> not a network, hardware or performance issue, only
server config.
=> system was regularly running out of entropy due to
mistakenly using /dev/random as a random source for SSL
TIP: use /dev/urandom ;)
Wrong web server configuration using
/dev/random
Return On eXperience
WWW.HAPROXY.COMWWW.HAPROXY.COM
Conclusion
WWW.HAPROXY.COM
● exploit your stats
● Choose the right LB
● enable logs on LBs, no excuse for not doing it!
● process them automatically, manually once in a while
● compare numbers between similar objects
● detect anomalies
● fix problems before they are witnessed
● Enjoy :-)
Conclusion
Conclusion
WWW.HAPROXY.COM
QUESTION &
ANSWER
1 of 50

Recommended

Grafana introduction by
Grafana introductionGrafana introduction
Grafana introductionRico Chen
8.8K views11 slides
HAProxy 1.9 by
HAProxy 1.9HAProxy 1.9
HAProxy 1.9HAProxy Technologies
934 views39 slides
Prometheus design and philosophy by
Prometheus design and philosophy   Prometheus design and philosophy
Prometheus design and philosophy Docker, Inc.
2.7K views32 slides
Monitoring With Prometheus by
Monitoring With PrometheusMonitoring With Prometheus
Monitoring With PrometheusKnoldus Inc.
423 views14 slides
Explore your prometheus data in grafana - Promcon 2018 by
Explore your prometheus data in grafana - Promcon 2018Explore your prometheus data in grafana - Promcon 2018
Explore your prometheus data in grafana - Promcon 2018Grafana Labs
1.6K views33 slides
Prometheus Overview by
Prometheus OverviewPrometheus Overview
Prometheus OverviewBrian Brazil
34.1K views19 slides

More Related Content

What's hot

Introduction to Grafana Loki by
Introduction to Grafana LokiIntroduction to Grafana Loki
Introduction to Grafana LokiJulien Pivotto
198 views11 slides
Introduction to Prometheus by
Introduction to PrometheusIntroduction to Prometheus
Introduction to PrometheusJulien Pivotto
6.7K views55 slides
Logs/Metrics Gathering With OpenShift EFK Stack by
Logs/Metrics Gathering With OpenShift EFK StackLogs/Metrics Gathering With OpenShift EFK Stack
Logs/Metrics Gathering With OpenShift EFK StackJosef Karásek
5.5K views30 slides
NATS vs HTTP by
NATS vs HTTPNATS vs HTTP
NATS vs HTTPApcera
3.1K views9 slides
Introducing Apache Airflow and how we are using it by
Introducing Apache Airflow and how we are using itIntroducing Apache Airflow and how we are using it
Introducing Apache Airflow and how we are using itBruno Faria
10.8K views18 slides
Monitoring using Prometheus and Grafana by
Monitoring using Prometheus and GrafanaMonitoring using Prometheus and Grafana
Monitoring using Prometheus and GrafanaArvind Kumar G.S
3.5K views25 slides

What's hot(20)

Introduction to Grafana Loki by Julien Pivotto
Introduction to Grafana LokiIntroduction to Grafana Loki
Introduction to Grafana Loki
Julien Pivotto198 views
Introduction to Prometheus by Julien Pivotto
Introduction to PrometheusIntroduction to Prometheus
Introduction to Prometheus
Julien Pivotto6.7K views
Logs/Metrics Gathering With OpenShift EFK Stack by Josef Karásek
Logs/Metrics Gathering With OpenShift EFK StackLogs/Metrics Gathering With OpenShift EFK Stack
Logs/Metrics Gathering With OpenShift EFK Stack
Josef Karásek5.5K views
NATS vs HTTP by Apcera
NATS vs HTTPNATS vs HTTP
NATS vs HTTP
Apcera3.1K views
Introducing Apache Airflow and how we are using it by Bruno Faria
Introducing Apache Airflow and how we are using itIntroducing Apache Airflow and how we are using it
Introducing Apache Airflow and how we are using it
Bruno Faria10.8K views
Monitoring using Prometheus and Grafana by Arvind Kumar G.S
Monitoring using Prometheus and GrafanaMonitoring using Prometheus and Grafana
Monitoring using Prometheus and Grafana
Arvind Kumar G.S3.5K views
Systems Monitoring with Prometheus (Devops Ireland April 2015) by Brian Brazil
Systems Monitoring with Prometheus (Devops Ireland April 2015)Systems Monitoring with Prometheus (Devops Ireland April 2015)
Systems Monitoring with Prometheus (Devops Ireland April 2015)
Brian Brazil24.1K views
Scaling Prometheus Metrics in Kubernetes with Telegraf | Chris Goller | Influ... by InfluxData
Scaling Prometheus Metrics in Kubernetes with Telegraf | Chris Goller | Influ...Scaling Prometheus Metrics in Kubernetes with Telegraf | Chris Goller | Influ...
Scaling Prometheus Metrics in Kubernetes with Telegraf | Chris Goller | Influ...
InfluxData2.4K views
Introduction to Haproxy by Shaopeng He
Introduction to HaproxyIntroduction to Haproxy
Introduction to Haproxy
Shaopeng He3.4K views
End to-end monitoring with the prometheus operator - Max Inden by Paris Container Day
End to-end monitoring with the prometheus operator - Max IndenEnd to-end monitoring with the prometheus operator - Max Inden
End to-end monitoring with the prometheus operator - Max Inden
PCF Platform Monitoring with Prometheus and Grafana by VMware Tanzu
PCF Platform Monitoring with Prometheus and GrafanaPCF Platform Monitoring with Prometheus and Grafana
PCF Platform Monitoring with Prometheus and Grafana
VMware Tanzu8.1K views
Server monitoring using grafana and prometheus by Celine George
Server monitoring using grafana and prometheusServer monitoring using grafana and prometheus
Server monitoring using grafana and prometheus
Celine George2K views

Similar to Observability with HAProxy

Observability tips for HAProxy by
Observability tips for HAProxyObservability tips for HAProxy
Observability tips for HAProxyWilly Tarreau
3.4K views40 slides
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016) by
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)Brian Brazil
9.4K views43 slides
Website & Internet + Performance testing by
Website & Internet + Performance testingWebsite & Internet + Performance testing
Website & Internet + Performance testingRoman Ananev
44 views44 slides
Log Files by
Log FilesLog Files
Log FilesHeinrich Hartmann
4.3K views14 slides
Protecting the Web at a scale using consul and Elk / Valentin Chernozemski (S... by
Protecting the Web at a scale using consul and Elk / Valentin Chernozemski (S...Protecting the Web at a scale using consul and Elk / Valentin Chernozemski (S...
Protecting the Web at a scale using consul and Elk / Valentin Chernozemski (S...Ontico
789 views44 slides
HAProxy as Egress Controller by
HAProxy as Egress ControllerHAProxy as Egress Controller
HAProxy as Egress ControllerJulien Pivotto
2.9K views47 slides

Similar to Observability with HAProxy(20)

Observability tips for HAProxy by Willy Tarreau
Observability tips for HAProxyObservability tips for HAProxy
Observability tips for HAProxy
Willy Tarreau3.4K views
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016) by Brian Brazil
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Brian Brazil9.4K views
Website & Internet + Performance testing by Roman Ananev
Website & Internet + Performance testingWebsite & Internet + Performance testing
Website & Internet + Performance testing
Roman Ananev44 views
Protecting the Web at a scale using consul and Elk / Valentin Chernozemski (S... by Ontico
Protecting the Web at a scale using consul and Elk / Valentin Chernozemski (S...Protecting the Web at a scale using consul and Elk / Valentin Chernozemski (S...
Protecting the Web at a scale using consul and Elk / Valentin Chernozemski (S...
Ontico789 views
HAProxy as Egress Controller by Julien Pivotto
HAProxy as Egress ControllerHAProxy as Egress Controller
HAProxy as Egress Controller
Julien Pivotto2.9K views
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ... by Hernan Costante
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...
Hernan Costante1.9K views
Thinking DevOps in the era of the Cloud - Demi Ben-Ari by Demi Ben-Ari
Thinking DevOps in the era of the Cloud - Demi Ben-AriThinking DevOps in the era of the Cloud - Demi Ben-Ari
Thinking DevOps in the era of the Cloud - Demi Ben-Ari
Demi Ben-Ari420 views
SPDY and What to Consider for HTTP/2.0 by Mike Belshe
SPDY and What to Consider for HTTP/2.0SPDY and What to Consider for HTTP/2.0
SPDY and What to Consider for HTTP/2.0
Mike Belshe2.1K views
Security Monitoring for big Infrastructures without a Million Dollar budget by Juan Berner
Security Monitoring for big Infrastructures without a Million Dollar budgetSecurity Monitoring for big Infrastructures without a Million Dollar budget
Security Monitoring for big Infrastructures without a Million Dollar budget
Juan Berner702 views
Docker Logging and analysing with Elastic Stack - Jakub Hajek by PROIDEA
Docker Logging and analysing with Elastic Stack - Jakub Hajek Docker Logging and analysing with Elastic Stack - Jakub Hajek
Docker Logging and analysing with Elastic Stack - Jakub Hajek
PROIDEA73 views
Docker Logging and analysing with Elastic Stack by Jakub Hajek
Docker Logging and analysing with Elastic StackDocker Logging and analysing with Elastic Stack
Docker Logging and analysing with Elastic Stack
Jakub Hajek157 views
Monitoring as an entry point for collaboration by Julien Pivotto
Monitoring as an entry point for collaborationMonitoring as an entry point for collaboration
Monitoring as an entry point for collaboration
Julien Pivotto1.3K views
Zero Downtime JEE Architectures by Alexander Penev
Zero Downtime JEE ArchitecturesZero Downtime JEE Architectures
Zero Downtime JEE Architectures
Alexander Penev6.9K views
'Effective node.js development' by Viktor Turskyi at OdessaJS'2020 by OdessaJS Conf
'Effective node.js development' by Viktor Turskyi at OdessaJS'2020'Effective node.js development' by Viktor Turskyi at OdessaJS'2020
'Effective node.js development' by Viktor Turskyi at OdessaJS'2020
OdessaJS Conf492 views
University of Delaware - Improving Web Protocols (early SPDY talk) by Mike Belshe
University of Delaware - Improving Web Protocols (early SPDY talk)University of Delaware - Improving Web Protocols (early SPDY talk)
University of Delaware - Improving Web Protocols (early SPDY talk)
Mike Belshe920 views
Prometheus (Microsoft, 2016) by Brian Brazil
Prometheus (Microsoft, 2016)Prometheus (Microsoft, 2016)
Prometheus (Microsoft, 2016)
Brian Brazil3.3K views
OSMC 2018 | Stream connector: Easily sending events and/or metrics from the C... by NETWAYS
OSMC 2018 | Stream connector: Easily sending events and/or metrics from the C...OSMC 2018 | Stream connector: Easily sending events and/or metrics from the C...
OSMC 2018 | Stream connector: Easily sending events and/or metrics from the C...
NETWAYS118 views
WTF is Sensu and Monitoring by Toby Jackson
WTF is Sensu and MonitoringWTF is Sensu and Monitoring
WTF is Sensu and Monitoring
Toby Jackson4.6K views
Approaches for application request throttling - Cloud Developer Days Poland by Maarten Balliauw
Approaches for application request throttling - Cloud Developer Days PolandApproaches for application request throttling - Cloud Developer Days Poland
Approaches for application request throttling - Cloud Developer Days Poland
Maarten Balliauw1.1K views

Recently uploaded

Digital Personal Data Protection (DPDP) Practical Approach For CISOs by
Digital Personal Data Protection (DPDP) Practical Approach For CISOsDigital Personal Data Protection (DPDP) Practical Approach For CISOs
Digital Personal Data Protection (DPDP) Practical Approach For CISOsPriyanka Aash
103 views59 slides
CloudStack and GitOps at Enterprise Scale - Alex Dometrius, Rene Glover - AT&T by
CloudStack and GitOps at Enterprise Scale - Alex Dometrius, Rene Glover - AT&TCloudStack and GitOps at Enterprise Scale - Alex Dometrius, Rene Glover - AT&T
CloudStack and GitOps at Enterprise Scale - Alex Dometrius, Rene Glover - AT&TShapeBlue
81 views34 slides
Webinar : Desperately Seeking Transformation - Part 2: Insights from leading... by
Webinar : Desperately Seeking Transformation - Part 2:  Insights from leading...Webinar : Desperately Seeking Transformation - Part 2:  Insights from leading...
Webinar : Desperately Seeking Transformation - Part 2: Insights from leading...The Digital Insurer
40 views52 slides
Igniting Next Level Productivity with AI-Infused Data Integration Workflows by
Igniting Next Level Productivity with AI-Infused Data Integration Workflows Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration Workflows Safe Software
373 views86 slides
20231123_Camunda Meetup Vienna.pdf by
20231123_Camunda Meetup Vienna.pdf20231123_Camunda Meetup Vienna.pdf
20231123_Camunda Meetup Vienna.pdfPhactum Softwareentwicklung GmbH
49 views73 slides
Microsoft Power Platform.pptx by
Microsoft Power Platform.pptxMicrosoft Power Platform.pptx
Microsoft Power Platform.pptxUni Systems S.M.S.A.
74 views38 slides

Recently uploaded(20)

Digital Personal Data Protection (DPDP) Practical Approach For CISOs by Priyanka Aash
Digital Personal Data Protection (DPDP) Practical Approach For CISOsDigital Personal Data Protection (DPDP) Practical Approach For CISOs
Digital Personal Data Protection (DPDP) Practical Approach For CISOs
Priyanka Aash103 views
CloudStack and GitOps at Enterprise Scale - Alex Dometrius, Rene Glover - AT&T by ShapeBlue
CloudStack and GitOps at Enterprise Scale - Alex Dometrius, Rene Glover - AT&TCloudStack and GitOps at Enterprise Scale - Alex Dometrius, Rene Glover - AT&T
CloudStack and GitOps at Enterprise Scale - Alex Dometrius, Rene Glover - AT&T
ShapeBlue81 views
Webinar : Desperately Seeking Transformation - Part 2: Insights from leading... by The Digital Insurer
Webinar : Desperately Seeking Transformation - Part 2:  Insights from leading...Webinar : Desperately Seeking Transformation - Part 2:  Insights from leading...
Webinar : Desperately Seeking Transformation - Part 2: Insights from leading...
Igniting Next Level Productivity with AI-Infused Data Integration Workflows by Safe Software
Igniting Next Level Productivity with AI-Infused Data Integration Workflows Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Safe Software373 views
Hypervisor Agnostic DRS in CloudStack - Brief overview & demo - Vishesh Jinda... by ShapeBlue
Hypervisor Agnostic DRS in CloudStack - Brief overview & demo - Vishesh Jinda...Hypervisor Agnostic DRS in CloudStack - Brief overview & demo - Vishesh Jinda...
Hypervisor Agnostic DRS in CloudStack - Brief overview & demo - Vishesh Jinda...
ShapeBlue93 views
Confidence in CloudStack - Aron Wagner, Nathan Gleason - Americ by ShapeBlue
Confidence in CloudStack - Aron Wagner, Nathan Gleason - AmericConfidence in CloudStack - Aron Wagner, Nathan Gleason - Americ
Confidence in CloudStack - Aron Wagner, Nathan Gleason - Americ
ShapeBlue58 views
Setting Up Your First CloudStack Environment with Beginners Challenges - MD R... by ShapeBlue
Setting Up Your First CloudStack Environment with Beginners Challenges - MD R...Setting Up Your First CloudStack Environment with Beginners Challenges - MD R...
Setting Up Your First CloudStack Environment with Beginners Challenges - MD R...
ShapeBlue105 views
Extending KVM Host HA for Non-NFS Storage - Alex Ivanov - StorPool by ShapeBlue
Extending KVM Host HA for Non-NFS Storage -  Alex Ivanov - StorPoolExtending KVM Host HA for Non-NFS Storage -  Alex Ivanov - StorPool
Extending KVM Host HA for Non-NFS Storage - Alex Ivanov - StorPool
ShapeBlue56 views
DRaaS using Snapshot copy and destination selection (DRaaS) - Alexandre Matti... by ShapeBlue
DRaaS using Snapshot copy and destination selection (DRaaS) - Alexandre Matti...DRaaS using Snapshot copy and destination selection (DRaaS) - Alexandre Matti...
DRaaS using Snapshot copy and destination selection (DRaaS) - Alexandre Matti...
ShapeBlue69 views
Zero to Cloud Hero: Crafting a Private Cloud from Scratch with XCP-ng, Xen Or... by ShapeBlue
Zero to Cloud Hero: Crafting a Private Cloud from Scratch with XCP-ng, Xen Or...Zero to Cloud Hero: Crafting a Private Cloud from Scratch with XCP-ng, Xen Or...
Zero to Cloud Hero: Crafting a Private Cloud from Scratch with XCP-ng, Xen Or...
ShapeBlue128 views
Developments to CloudStack’s SDN ecosystem: Integration with VMWare NSX 4 - P... by ShapeBlue
Developments to CloudStack’s SDN ecosystem: Integration with VMWare NSX 4 - P...Developments to CloudStack’s SDN ecosystem: Integration with VMWare NSX 4 - P...
Developments to CloudStack’s SDN ecosystem: Integration with VMWare NSX 4 - P...
ShapeBlue120 views
Backup and Disaster Recovery with CloudStack and StorPool - Workshop - Venko ... by ShapeBlue
Backup and Disaster Recovery with CloudStack and StorPool - Workshop - Venko ...Backup and Disaster Recovery with CloudStack and StorPool - Workshop - Venko ...
Backup and Disaster Recovery with CloudStack and StorPool - Workshop - Venko ...
ShapeBlue114 views
TrustArc Webinar - Managing Online Tracking Technology Vendors_ A Checklist f... by TrustArc
TrustArc Webinar - Managing Online Tracking Technology Vendors_ A Checklist f...TrustArc Webinar - Managing Online Tracking Technology Vendors_ A Checklist f...
TrustArc Webinar - Managing Online Tracking Technology Vendors_ A Checklist f...
TrustArc130 views
Automating a World-Class Technology Conference; Behind the Scenes of CiscoLive by Network Automation Forum
Automating a World-Class Technology Conference; Behind the Scenes of CiscoLiveAutomating a World-Class Technology Conference; Behind the Scenes of CiscoLive
Automating a World-Class Technology Conference; Behind the Scenes of CiscoLive
2FA and OAuth2 in CloudStack - Andrija Panić - ShapeBlue by ShapeBlue
2FA and OAuth2 in CloudStack - Andrija Panić - ShapeBlue2FA and OAuth2 in CloudStack - Andrija Panić - ShapeBlue
2FA and OAuth2 in CloudStack - Andrija Panić - ShapeBlue
ShapeBlue75 views
State of the Union - Rohit Yadav - Apache CloudStack by ShapeBlue
State of the Union - Rohit Yadav - Apache CloudStackState of the Union - Rohit Yadav - Apache CloudStack
State of the Union - Rohit Yadav - Apache CloudStack
ShapeBlue218 views
VNF Integration and Support in CloudStack - Wei Zhou - ShapeBlue by ShapeBlue
VNF Integration and Support in CloudStack - Wei Zhou - ShapeBlueVNF Integration and Support in CloudStack - Wei Zhou - ShapeBlue
VNF Integration and Support in CloudStack - Wei Zhou - ShapeBlue
ShapeBlue134 views
Declarative Kubernetes Cluster Deployment with Cloudstack and Cluster API - O... by ShapeBlue
Declarative Kubernetes Cluster Deployment with Cloudstack and Cluster API - O...Declarative Kubernetes Cluster Deployment with Cloudstack and Cluster API - O...
Declarative Kubernetes Cluster Deployment with Cloudstack and Cluster API - O...
ShapeBlue59 views

Observability with HAProxy

  • 2. WWW.HAPROXY.COM ● Introduction to “observability” ● Deep dive in HAProxy ● Tips (to get more from HAProxy) ● RoX (Return On eXperience) ● Conclusion (because we need one) Agenda
  • 4. WWW.HAPROXY.COM Observability: "control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs" (wikipedia) In short for us: WTF is going on ? Definition
  • 5. WWW.HAPROXY.COM ● Monitoring tells you how well something works (or not) ● Observability helps you detect what is not working and why => You monitor an observable system. Observability vs monitoring
  • 6. WWW.HAPROXY.COM ● can’t rely on impatient users’ complaints anymore => detect and fix trouble not reported yet ● stop adding duct tape, address root causes! ● improve users experience where it matters Observability is crucial
  • 7. WWW.HAPROXY.COM ● central place ● in distributed systems, one LB per layer, even better when sidecar ● sees multiple targets, eases comparisons => draw references ● Trusted low level component & excellent transparency ● many LB decisions actually depend on metrics related to performance and observability ● again, logs provide long-term references The LB as an observation tower
  • 8. WWW.HAPROXY.COM ● Maintain two types of TCP connections ○ From client to HAPRoxy (1) ○ From HAProxy to server (2) ○ Only access to data payload (not the “packet”) LB in Reverse-proxy mode HAProxyClient Server (1) (2)
  • 9. WWW.HAPROXY.COM ● you should! Especially in distributed systems (microservices, etc)! ● but often it's too late when the first incident happens! ● with existing LB's logs, it's already possible to do a lot Note: check OpenTracing and Prometheus Why not looking from other points?
  • 10. WWW.HAPROXY.COM ● global failures (aborts, timeouts) ● abnormal delays caused by network retransmits ● connection failures and retries caused by bad tuning (eg: conntrack) ● connection slowdowns caused by inefficient firewall policies (#rules) ● Non exhaustive list ... What does the LB see?
  • 11. WWW.HAPROXY.COM ● client-side issues (BW limitations) ● per-URL processing time (application issues, svc partners) ● per-node vs per-cluster variations => narrow down to individual node or shared resource ● deployment issues : new occasional error on a specific page, can be addressed before going full-scale What does the LB see?
  • 13. WWW.HAPROXY.COM haproxy[14389]: 10.0.1.2:33317 [06/Feb/2018:12:14:14.655] http-in static/srv1 10/0/30/69/109 200 2750 - -SDNN 1/1/1/1/0 0/0 {haproxy.org} {} "GET /index.html HTTP/1.1" I need help !!!
  • 14. WWW.HAPROXY.COM ● Logs : ○ Halog, ELK, Splunk, datadog… ○ Provides unique-id for tracing/event correlation ● Statistics : ○ Stats page, CLI, hatop ○ Prometheus, statsd, ... ● Stick-tables (per arbitrary key like IP, URL, cookie, ...) : ○ Byte count, cumulated/concurrent conns, errors, rates… Accessing metrics in HAProxy
  • 30. WWW.HAPROXY.COM ● HAProxy now supports heavier per-request workloads (Lua, device identification, …) ● Processing times over 200 µs can become noticeable ● Actions: ○ log per-request total CPU time spent in analysers ○ log per-request total CPU time spent in TLS handshake ○ log per-request total latency added by other tasks ○ Ability to kill offending tasks ○ Ability to alert on high latencies => make HAProxy as observable as other components And more to come in HAProxy 1.9 (Nov. 2018)
  • 31. WWW.HAPROXY.COM ● Timers are averaged in the stats ● Each timer appears in the logs ● halog -rt/-RT/-pct for quick analysis ● Each timer crossing a limit triggers a timeout ● Each abort at a specific step causes a hard error => termination codes haproxy[14389]: 10.0.1.2:33317 [06/Feb/2018:12:14:14.655] http-in static/srv1 10/0/30/69/109 200 2750 - - SDNN 1/1/1/1/0 0/0 {haproxy.org} {} "GET /index.html HTTP/1.1" Event timing reports Timers Term code Cookie code
  • 32. WWW.HAPROXY.COM ● Who, when, why and how the session has been terminated ● Distinguish between timeout and abort ● Indicate who (client, server, haproxy, kill, ...) ● Indicate when (req, queue, connect, response, data...) ● Completed by persistence cookie indications ● Filtered and sorted by halog : halog -tcn|-TCN ... # for filtering halog -tc # for sorting => Graphing the number of errors reported by second helps detecting an issue is in progress somewhere Termination codes
  • 33. WWW.HAPROXY.COM ● Stats page: distribution per frontend/backend/server ● Filter by ranges: halog -hs/-HS ● Sorted output: halog -st => graph the distribution and watch for variations between application deployments HTTP status code distribution
  • 34. WWW.HAPROXY.COM ● Uses server maxconn ● Grows exponentially with slowdowns : easy to detect! ● Tells you how many extra servers you need ● Reported by halog -Q/-QS ● Shown in real time on the stats page per backend/srv => If you watch only one metric, watch this one! Other relevant metrics: queues
  • 35. WWW.HAPROXY.COM LB algorithm implies fairness between servers : ● Equal request count with roundrobin => Higher than average concurrency indicates abnormally slow server ● Equal load with leastconn => Low req count indicates abnormally slow server => graph relevant values within the farm Other relevant metrics: LB fairness
  • 36. WWW.HAPROXY.COM ● Global: halog -e ● Per server: halog -srv ● Per client IP: halog -e -ic (detect bad CDN nodes) ● Per URL: halog -ue ● Stats page: per frontend/backend/server ● Stick-tables: per arbitrary key using http_err_rate() => no threshold, watch for variations Other relevant metrics: Error rate
  • 37. WWW.HAPROXY.COM ● Default httplog format is quite rich ● Can be improved using the log-format directive ● Hint: log stick-table stats for similar keys haproxy[14389]: 10.0.1.2:33317 [06/Feb/2018:12:14:14.655] http-in static/srv1 10/0/30/69/109 200 2750 - - SDNN 1/1/1/1/0 0/0 {haproxy.org} {} "GET /index.html HTTP/1.1" Useful entries in log-format Timers HTTP status Byte count Term Code Cookie Code Conn Count Queue Length
  • 39. WWW.HAPROXY.COM "I can't enable logs, I have too much traffic!" ● an average syslog server can store 20k events/s without sweating ● that's 1.7B events/day or 350GB of uncompressed haproxy logs/day ● compresses to 1TB/month ● for $100 you can store 4 months with no loss ● have more traffic / not interested in this level of detail ? # log only 5% of requests http-request set-log-level silent unless { rand(100) -lt 5 } Sampling: why / When Tips !!!!
  • 40. WWW.HAPROXY.COM ● you only want to catch suspicious events ● enable logs per url, source IP, etc... ● disable logging unless Tc/Tq/Tr/Tw/... is above a certain threshold ● on the fly for selected keys from the CLI + stick-table ● also see "option dontlognormal" WARNING: you'll lose any valid reference Selective logging: why / When Tips !!!!
  • 41. WWW.HAPROXY.COM ● Poorly documented, use halog --help ● response time per url: halog -uat ● errors per server: halog -srv ● Percentiles on req/queue/conn/resp times: halog -pct ● detect stolen CPU / swap : halog -ac … -ad … ● very fast (1-2 GB of log file per second) => Use it in production to figure the relevant metrics Other halog goodies Tips !!!!
  • 43. WWW.HAPROXY.COM ● Many ops and dev do this every day all around the world ● Users are complaining that the application is slow ● Devops team check HAProxy logs (using halog or ELK) in order to isolate potential sources of issue: ○ Check server Tc ○ Sort servers by response time ○ Sort URLs by response time ● HAProxy won’t fix the problem (in most cases), but will drastically reduce the troubleshooting time by isolating the potential issues Web application is slow!!!!!!!! Return On eXperience
  • 44. WWW.HAPROXY.COM ● Retail customer, with shops everywhere in France ● Cash register reports the following error message “Unable to reach central system” when searching people in the loyalty card database, after waiting up to 10s. ● Argument between customer and “central system” software provider: ○ Customer: your software is slow ○ Provider: your network triggers this error ● After installing HAProxy, its logs reported the following: ○ Time to get the query from the client: 300ms (from North of France to Paris) ○ Server response time: 50s ● Conclusion: provider turned off debug mode in the “central system” software Unavailable central system ?!? Return On eXperience
  • 45. WWW.HAPROXY.COM ● Customer with big TLS traffic, mostly API based ● Each HAProxy server can’t go higher than 20K HTTPs req/s (in 2015), even if there seem to be a lot of unused resources on the box ● At first, this was qualified as an “HAProxy” issue (including the HW and OS itself) ● Hard tuning the box did not improve anything ● halog Tc percentile reported higher values when the load increased (up to 30ms, on the LAN!!!) ● After asking customer to double check the Spanning Tree, it seemed a small 24 ports ToR switch became root bridge…. ● Fixing spanning tree also fixed HAProxy’s TLS performance HAProxy TLS performance issue Return On eXperience
  • 46. WWW.HAPROXY.COM ● Tc from HA1 to srv 1,2,3,5 always low, srv 4,6 high at 99 pct ● Tc from HA2 to srv 1,2,4,6 always low, srv 3,5 high at 99 pct => both haproxy and servers out of cause ● issue rate stable at various traffic levels => not congestion ● inter-switch link apparently at cause but not for all flows ● inter-switch link made of two fibers balanced on MAC tuple ● thanks to long-term logs, origin could even be identified Spotting a broken fiber channel between 2 core switches Return On eXperience
  • 47. WWW.HAPROXY.COM ● Tc abnormally high with lots of random values to several seconds, and only for TLS ● Tc timer also covers TLS handshake => not a network, hardware or performance issue, only server config. => system was regularly running out of entropy due to mistakenly using /dev/random as a random source for SSL TIP: use /dev/urandom ;) Wrong web server configuration using /dev/random Return On eXperience
  • 49. WWW.HAPROXY.COM ● exploit your stats ● Choose the right LB ● enable logs on LBs, no excuse for not doing it! ● process them automatically, manually once in a while ● compare numbers between similar objects ● detect anomalies ● fix problems before they are witnessed ● Enjoy :-) Conclusion Conclusion