SlideShare a Scribd company logo
1 of 40
Download to read offline
Observability tips for HAProxy
Willy Tarreau
(willy@haproxy.org)
Dotscale 2018
2
Definition
Observability:
"control theory, observability is a measure of how well
internal states of a system can be inferred from
knowledge of its external outputs" (wikipedia)
In short for us: WTF is going on ?
3
Observability vs monitoring
● Monitoring tells you how well something works (or not)
● Observability helps you detect what is not working and why
=> You monitor an observable system.
4
Observability is important
● can’t rely on impatient users’ complaints anymore
=> detect and fix trouble not reported yet
● stop adding duct tape, address root causes!
● improve users experience where it matters
5
The LB as an observation tower
● central place
● in distributed systems, one LB per level, even better
when sidecar
● sees multiple targets, eases comparisons
=> draw references
● Trusted low level component & excellent transparency
● many LB decisions actually depend on metrics related
to performance and observability
● already logs, provides long-term references
6
Why not look from other points ?
● you should! Especially in distributed systems
(microservices, etc)!
● but often it's too late when the first incident happens!
● with existing LB's logs, it's already possible to do a lot
Note: see OpenTracing and Prometheus
7
What does the LB see ?
● global failures (aborts, timeouts)
● abnormal delays caused by network retransmits
● connection failures and retries caused by bad tuning
(eg: conntrack)
● connection slowdowns caused by inefficient firewall
policies (#rules)
...
8
What does the LB see (…) ?
● client-side issues (BW limitations)
● per-URL processing time (application issues, svc
partners)
● per-node vs per-cluster variations
=> narrow down to individual node or shared resource
● deployment issues : new occasional error on a specific
page, can be addressed before going full-scale
9
Accessing metrics in HAProxy
● Logs :
● Halog, ELK, Prometheus, …
● Provides unique-id for tracing/event correlation
● Stats :
● Stats page, CLI, hatop
● Stick-tables (per arbitrary key like IP, URL, cookie) :
● Byte count, cumulated/concurrent conns, errors, …
10
Sequence of events on HAProxy
11
Sequence of events on HAProxy
12
Sequence of events on HAProxy
13
Sequence of events on HAProxy
14
Sequence of events on HAProxy
15
Sequence of events on HAProxy
16
Sequence of events on HAProxy
17
Sequence of events on HAProxy
18
Sequence of events on HAProxy
19
Sequence of events on HAProxy
20
Sequence of events on HAProxy
21
Sequence of events on HAProxy
22
Sequence of events on HAProxy
23
Sequence of events on HAProxy
24
Sequence of events on HAProxy
25
Sequence of events on HAProxy
26
More timers to come in HAProxy 1.9
● HAProxy now supports heavier per-request workloads
(Lua, device identification, …)
● Processing times over 200 µs can become noticeable
Actions:
● log per-request total CPU time spent in analysers
● log per-request total CPU time spent in TLS handshake
● log per-request total latency added by other tasks
● Ability to kill offending tasks
● Ability to alert on high latencies
=> make HAProxy as observable as other components
27
Event timing reports
● Timers are averaged in the stats
● Each timer appears in the logs
● Halog -rt/-RT/-pct for quick analysis
● Each timer crossing a limit triggers a timeout
● Each abort at a specific step causes a hard error
=> termination codes
haproxy[14389]: 10.0.1.2:33317 [06/Feb/2018:12:14:14.655] http-in
static/srv1 10/0/30/69/109 200 2750 - - SDNN 1/1/1/1/0 0/0 {haproxy.org}
{} "GET /index.html HTTP/1.1"
Timers Term code Cookie code
28
Termination codes
● Distinguish between timeout and abort
● Indicate whom (client, server, haproxy, kill, ...)
● Indicate when (req,queue,connect,response...)
● Completed by persistence cookie indications
● Filtered and sorted by halog :
# halog -tcn|-TCN ... # for filtering
# halog -tc # for sorting
29
Other relevant metrics : HTTP status distribution
● Stats page: distribution per frontend/backend/server
● Filter by ranges: halog -hs/-HS
● Sorted output: halog -st
=> graph the distribution and watch for variations
between application deployments
30
Other relevant metrics : queue length
● Uses server maxconn
● Grows exponentially with slowdowns : easy to detect!
● Tells you how many extra servers you need
● Reported by halog -Q/-QS
● Shown in real time on the stats page per backend/srv
=> If you watch only one metric, watch this one!
31
Other relevant metrics : LB fairness
LB algorithm implies fairness between servers :
● Equal request count with roundrobin
=> Higher than average concurrency indicates
abnormally slow server
● Equal load with leastconn
=> Low req count indicates abnormally slow server
=> graph relevant values within the farm
32
Other relevant metrics : error rate
● Global: halog -e
● Per server: halog -srv
● Per client IP: halog -e -ic (detect bad CDN nodes)
● Per URL: halog -ue
● Stats page: per frontend/backend/server
● Stick-tables: per arbitrary key using http_err_rate()
=> no threshold, watch for variations
33
Useful entries in log-format
● Default httplog format is quite rich
● Can be improved using the log-format directive
● Hint: log stick-table stats for similar keys
haproxy[14389]: 10.0.1.2:33317 [06/Feb/2018:12:14:14.655] http-in
static/srv1 10/0/30/69/109 200 2750 - - SDNN 1/1/1/1/0 0/0 {haproxy.org}
{} "GET /index.html HTTP/1.1"
Timers
HTTP status Byte
count
Term
Code
Cookie
Code
Conn
Count
Queue
Length
34
Tips: sampling : why / when
"I can't enable logs, I have too much traffic!"
● an average syslog server can store 20k events/s
without sweating
● that's 1.7B events/day or 350GB of uncompressed
haproxy logs/day
● compresses to 1TB/month
● for $100 you can store 4 months with no loss
● have more traffic / not interested in this level of detail ?
# log only 5% of requests
http-request set-log-level silent unless { rand(100) -lt 5 }
35
Tips: selective logging: why / when
● you only want to catch suspicious events
● disable logging unless Tc/Tq/Tr/Tw/... is above a certain
threshold
● on the fly for selected keys from the CLI + stick-table
● also see "option dontlognormal"
● WARNING: you'll lose any valid reference
36
Tips: other halog goodies
● Poorly documented, use halog --help
● response time per url: halog -uat
● errors per server: halog -srv
● Percentiles on req/queue/conn/resp times: halog -pct
● detect stolen CPU / swap : halog -ac … -ad …
● very fast (1-2 GB per second)
=> Use it in production to figure the relevant metrics
37
Success stories
Customer spotting a broken fiber between two core switches
● Tc from HA1 to srv 1,2,3,5 always low, srv 4,6 high at 99 pct
● Tc from HA2 to srv 1,2,4,6 always low, srv 3,5 high at 99 pct
=> both haproxy and servers out of cause
● issue rate stable at various traffic levels => not congestion
● inter-switch link apparently at cause but not for all flows
● inter-switch link made of two fibers balanced on MAC tuple
● thanks to long-term logs, origin could even be identified
38
Success stories
Customer figuring a wrong web server configuration using
/dev/random
● Tc abnormally high with lots of random values to several
seconds, and only for TLS
● timer also covers TLS handshake
=> not a network, hardware or performance issue, only
server config.
=> system was regularly running out of entropy due to
mistakenly using /dev/random as a random source for SSL
39
Conclusion
● exploit your stats
● enable logs on LBs, no excuse for not doing it!
● process them automatically, manually once in a while
● compare numbers between similar objects
● detect anomalies
● fix problems before they are witnessed
● profit :-)
40
Interesting lectures
● https://medium.com/@copyconstruct/monitoring-and-observability-8417d1952e1c
● https://www.vividcortex.com/blog/monitoring-isnt-observability
● http://opentracing.io/documentation/
● https://prometheus.io/

More Related Content

What's hot

Where is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in FlinkWhere is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in FlinkFlink Forward
 
Introducing Apache Airflow and how we are using it
Introducing Apache Airflow and how we are using itIntroducing Apache Airflow and how we are using it
Introducing Apache Airflow and how we are using itBruno Faria
 
OracleStore: A Highly Performant RawStore Implementation for Hive Metastore
OracleStore: A Highly Performant RawStore Implementation for Hive MetastoreOracleStore: A Highly Performant RawStore Implementation for Hive Metastore
OracleStore: A Highly Performant RawStore Implementation for Hive MetastoreDataWorks Summit
 
Une introduction à Javascript et ECMAScript 6
Une introduction à Javascript et ECMAScript 6Une introduction à Javascript et ECMAScript 6
Une introduction à Javascript et ECMAScript 6Jean-Baptiste Vigneron
 
Intro to Airflow: Goodbye Cron, Welcome scheduled workflow management
Intro to Airflow: Goodbye Cron, Welcome scheduled workflow managementIntro to Airflow: Goodbye Cron, Welcome scheduled workflow management
Intro to Airflow: Goodbye Cron, Welcome scheduled workflow managementBurasakorn Sabyeying
 
How does PostgreSQL work with disks: a DBA's checklist in detail. PGConf.US 2015
How does PostgreSQL work with disks: a DBA's checklist in detail. PGConf.US 2015How does PostgreSQL work with disks: a DBA's checklist in detail. PGConf.US 2015
How does PostgreSQL work with disks: a DBA's checklist in detail. PGConf.US 2015PostgreSQL-Consulting
 
Hadoop Operations - Best Practices from the Field
Hadoop Operations - Best Practices from the FieldHadoop Operations - Best Practices from the Field
Hadoop Operations - Best Practices from the FieldDataWorks Summit
 
HBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseHBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseenissoz
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to RedisDvir Volk
 
Reverse proxies & Inconsistency
Reverse proxies & InconsistencyReverse proxies & Inconsistency
Reverse proxies & InconsistencyGreenD0g
 
Tuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptxTuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptxFlink Forward
 
Apache Airflow in Production
Apache Airflow in ProductionApache Airflow in Production
Apache Airflow in ProductionRobert Sanders
 
Monitoring your Python with Prometheus (Python Ireland April 2015)
Monitoring your Python with Prometheus (Python Ireland April 2015)Monitoring your Python with Prometheus (Python Ireland April 2015)
Monitoring your Python with Prometheus (Python Ireland April 2015)Brian Brazil
 
如何使用 Xhprof 分析網站效能 (真實案例2)
如何使用 Xhprof 分析網站效能 (真實案例2)如何使用 Xhprof 分析網站效能 (真實案例2)
如何使用 Xhprof 分析網站效能 (真實案例2)Cyril Wang
 
Systems Monitoring with Prometheus (Devops Ireland April 2015)
Systems Monitoring with Prometheus (Devops Ireland April 2015)Systems Monitoring with Prometheus (Devops Ireland April 2015)
Systems Monitoring with Prometheus (Devops Ireland April 2015)Brian Brazil
 
Dongwon Kim – A Comparative Performance Evaluation of Flink
Dongwon Kim – A Comparative Performance Evaluation of FlinkDongwon Kim – A Comparative Performance Evaluation of Flink
Dongwon Kim – A Comparative Performance Evaluation of FlinkFlink Forward
 
RocksDB Performance and Reliability Practices
RocksDB Performance and Reliability PracticesRocksDB Performance and Reliability Practices
RocksDB Performance and Reliability PracticesYoshinori Matsunobu
 
The Log of All Logs: Raft-based Consensus Inside Kafka | Guozhang Wang, Confl...
The Log of All Logs: Raft-based Consensus Inside Kafka | Guozhang Wang, Confl...The Log of All Logs: Raft-based Consensus Inside Kafka | Guozhang Wang, Confl...
The Log of All Logs: Raft-based Consensus Inside Kafka | Guozhang Wang, Confl...HostedbyConfluent
 

What's hot (20)

Where is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in FlinkWhere is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in Flink
 
Introducing Apache Airflow and how we are using it
Introducing Apache Airflow and how we are using itIntroducing Apache Airflow and how we are using it
Introducing Apache Airflow and how we are using it
 
OracleStore: A Highly Performant RawStore Implementation for Hive Metastore
OracleStore: A Highly Performant RawStore Implementation for Hive MetastoreOracleStore: A Highly Performant RawStore Implementation for Hive Metastore
OracleStore: A Highly Performant RawStore Implementation for Hive Metastore
 
Une introduction à Javascript et ECMAScript 6
Une introduction à Javascript et ECMAScript 6Une introduction à Javascript et ECMAScript 6
Une introduction à Javascript et ECMAScript 6
 
Intro to Airflow: Goodbye Cron, Welcome scheduled workflow management
Intro to Airflow: Goodbye Cron, Welcome scheduled workflow managementIntro to Airflow: Goodbye Cron, Welcome scheduled workflow management
Intro to Airflow: Goodbye Cron, Welcome scheduled workflow management
 
How does PostgreSQL work with disks: a DBA's checklist in detail. PGConf.US 2015
How does PostgreSQL work with disks: a DBA's checklist in detail. PGConf.US 2015How does PostgreSQL work with disks: a DBA's checklist in detail. PGConf.US 2015
How does PostgreSQL work with disks: a DBA's checklist in detail. PGConf.US 2015
 
Hadoop Operations - Best Practices from the Field
Hadoop Operations - Best Practices from the FieldHadoop Operations - Best Practices from the Field
Hadoop Operations - Best Practices from the Field
 
HBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseHBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBase
 
Apache Airflow overview
Apache Airflow overviewApache Airflow overview
Apache Airflow overview
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to Redis
 
Reverse proxies & Inconsistency
Reverse proxies & InconsistencyReverse proxies & Inconsistency
Reverse proxies & Inconsistency
 
Tuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptxTuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptx
 
Apache Airflow in Production
Apache Airflow in ProductionApache Airflow in Production
Apache Airflow in Production
 
Monitoring your Python with Prometheus (Python Ireland April 2015)
Monitoring your Python with Prometheus (Python Ireland April 2015)Monitoring your Python with Prometheus (Python Ireland April 2015)
Monitoring your Python with Prometheus (Python Ireland April 2015)
 
如何使用 Xhprof 分析網站效能 (真實案例2)
如何使用 Xhprof 分析網站效能 (真實案例2)如何使用 Xhprof 分析網站效能 (真實案例2)
如何使用 Xhprof 分析網站效能 (真實案例2)
 
Systems Monitoring with Prometheus (Devops Ireland April 2015)
Systems Monitoring with Prometheus (Devops Ireland April 2015)Systems Monitoring with Prometheus (Devops Ireland April 2015)
Systems Monitoring with Prometheus (Devops Ireland April 2015)
 
Dongwon Kim – A Comparative Performance Evaluation of Flink
Dongwon Kim – A Comparative Performance Evaluation of FlinkDongwon Kim – A Comparative Performance Evaluation of Flink
Dongwon Kim – A Comparative Performance Evaluation of Flink
 
RocksDB Performance and Reliability Practices
RocksDB Performance and Reliability PracticesRocksDB Performance and Reliability Practices
RocksDB Performance and Reliability Practices
 
The Log of All Logs: Raft-based Consensus Inside Kafka | Guozhang Wang, Confl...
The Log of All Logs: Raft-based Consensus Inside Kafka | Guozhang Wang, Confl...The Log of All Logs: Raft-based Consensus Inside Kafka | Guozhang Wang, Confl...
The Log of All Logs: Raft-based Consensus Inside Kafka | Guozhang Wang, Confl...
 
Apache airflow
Apache airflowApache airflow
Apache airflow
 

Similar to Observability tips for HAProxy

Webinar slides: An Introduction to Performance Monitoring for PostgreSQL
Webinar slides: An Introduction to Performance Monitoring for PostgreSQLWebinar slides: An Introduction to Performance Monitoring for PostgreSQL
Webinar slides: An Introduction to Performance Monitoring for PostgreSQLSeveralnines
 
Network visibility and control using industry standard sFlow telemetry
Network visibility and control using industry standard sFlow telemetryNetwork visibility and control using industry standard sFlow telemetry
Network visibility and control using industry standard sFlow telemetrypphaal
 
Apache Big Data 2016: Next Gen Big Data Analytics with Apache Apex
Apache Big Data 2016: Next Gen Big Data Analytics with Apache ApexApache Big Data 2016: Next Gen Big Data Analytics with Apache Apex
Apache Big Data 2016: Next Gen Big Data Analytics with Apache ApexApache Apex
 
Building a Dynamic Rules Engine with Kafka Streams
Building a Dynamic Rules Engine with Kafka StreamsBuilding a Dynamic Rules Engine with Kafka Streams
Building a Dynamic Rules Engine with Kafka StreamsHostedbyConfluent
 
Zero Downtime JEE Architectures
Zero Downtime JEE ArchitecturesZero Downtime JEE Architectures
Zero Downtime JEE ArchitecturesAlexander Penev
 
Monitoring in Big Data Platform - Albert Lewandowski, GetInData
Monitoring in Big Data Platform - Albert Lewandowski, GetInDataMonitoring in Big Data Platform - Albert Lewandowski, GetInData
Monitoring in Big Data Platform - Albert Lewandowski, GetInDataGetInData
 
OpenTelemetry For Architects
OpenTelemetry For ArchitectsOpenTelemetry For Architects
OpenTelemetry For ArchitectsKevin Brockhoff
 
Introduction to Ethereum
Introduction to EthereumIntroduction to Ethereum
Introduction to EthereumArnold Pham
 
TLS - 2016 Velocity Training
TLS - 2016 Velocity TrainingTLS - 2016 Velocity Training
TLS - 2016 Velocity TrainingPatrick Meenan
 
ECS19 - Ingo Gegenwarth - Running Exchange in large environment
ECS19 - Ingo Gegenwarth -  Running Exchangein large environmentECS19 - Ingo Gegenwarth -  Running Exchangein large environment
ECS19 - Ingo Gegenwarth - Running Exchange in large environmentEuropean Collaboration Summit
 
The new (is it really ) api stack
The new (is it really ) api stackThe new (is it really ) api stack
The new (is it really ) api stackLuca Mattia Ferrari
 
Bringing Learnings from Googley Microservices with gRPC - Varun Talwar, Google
Bringing Learnings from Googley Microservices with gRPC - Varun Talwar, GoogleBringing Learnings from Googley Microservices with gRPC - Varun Talwar, Google
Bringing Learnings from Googley Microservices with gRPC - Varun Talwar, GoogleAmbassador Labs
 
Introduction to ZooKeeper - TriHUG May 22, 2012
Introduction to ZooKeeper - TriHUG May 22, 2012Introduction to ZooKeeper - TriHUG May 22, 2012
Introduction to ZooKeeper - TriHUG May 22, 2012mumrah
 
Timely Year Two: Lessons Learned Building a Scalable Metrics Analytic System
Timely Year Two: Lessons Learned Building a Scalable Metrics Analytic SystemTimely Year Two: Lessons Learned Building a Scalable Metrics Analytic System
Timely Year Two: Lessons Learned Building a Scalable Metrics Analytic SystemAccumulo Summit
 
PG Day'14 Russia, PostgreSQL System Architecture, Heikki Linnakangas
PG Day'14 Russia, PostgreSQL System Architecture, Heikki LinnakangasPG Day'14 Russia, PostgreSQL System Architecture, Heikki Linnakangas
PG Day'14 Russia, PostgreSQL System Architecture, Heikki Linnakangaspgdayrussia
 
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...Hernan Costante
 
HTTP/2 Introduction
HTTP/2 IntroductionHTTP/2 Introduction
HTTP/2 IntroductionWalter Liu
 
Security Monitoring for big Infrastructures without a Million Dollar budget
Security Monitoring for big Infrastructures without a Million Dollar budgetSecurity Monitoring for big Infrastructures without a Million Dollar budget
Security Monitoring for big Infrastructures without a Million Dollar budgetJuan Berner
 
Docker Logging and analysing with Elastic Stack - Jakub Hajek
Docker Logging and analysing with Elastic Stack - Jakub Hajek Docker Logging and analysing with Elastic Stack - Jakub Hajek
Docker Logging and analysing with Elastic Stack - Jakub Hajek PROIDEA
 

Similar to Observability tips for HAProxy (20)

Log Files
Log FilesLog Files
Log Files
 
Webinar slides: An Introduction to Performance Monitoring for PostgreSQL
Webinar slides: An Introduction to Performance Monitoring for PostgreSQLWebinar slides: An Introduction to Performance Monitoring for PostgreSQL
Webinar slides: An Introduction to Performance Monitoring for PostgreSQL
 
Network visibility and control using industry standard sFlow telemetry
Network visibility and control using industry standard sFlow telemetryNetwork visibility and control using industry standard sFlow telemetry
Network visibility and control using industry standard sFlow telemetry
 
Apache Big Data 2016: Next Gen Big Data Analytics with Apache Apex
Apache Big Data 2016: Next Gen Big Data Analytics with Apache ApexApache Big Data 2016: Next Gen Big Data Analytics with Apache Apex
Apache Big Data 2016: Next Gen Big Data Analytics with Apache Apex
 
Building a Dynamic Rules Engine with Kafka Streams
Building a Dynamic Rules Engine with Kafka StreamsBuilding a Dynamic Rules Engine with Kafka Streams
Building a Dynamic Rules Engine with Kafka Streams
 
Zero Downtime JEE Architectures
Zero Downtime JEE ArchitecturesZero Downtime JEE Architectures
Zero Downtime JEE Architectures
 
Monitoring in Big Data Platform - Albert Lewandowski, GetInData
Monitoring in Big Data Platform - Albert Lewandowski, GetInDataMonitoring in Big Data Platform - Albert Lewandowski, GetInData
Monitoring in Big Data Platform - Albert Lewandowski, GetInData
 
OpenTelemetry For Architects
OpenTelemetry For ArchitectsOpenTelemetry For Architects
OpenTelemetry For Architects
 
Introduction to Ethereum
Introduction to EthereumIntroduction to Ethereum
Introduction to Ethereum
 
TLS - 2016 Velocity Training
TLS - 2016 Velocity TrainingTLS - 2016 Velocity Training
TLS - 2016 Velocity Training
 
ECS19 - Ingo Gegenwarth - Running Exchange in large environment
ECS19 - Ingo Gegenwarth -  Running Exchangein large environmentECS19 - Ingo Gegenwarth -  Running Exchangein large environment
ECS19 - Ingo Gegenwarth - Running Exchange in large environment
 
The new (is it really ) api stack
The new (is it really ) api stackThe new (is it really ) api stack
The new (is it really ) api stack
 
Bringing Learnings from Googley Microservices with gRPC - Varun Talwar, Google
Bringing Learnings from Googley Microservices with gRPC - Varun Talwar, GoogleBringing Learnings from Googley Microservices with gRPC - Varun Talwar, Google
Bringing Learnings from Googley Microservices with gRPC - Varun Talwar, Google
 
Introduction to ZooKeeper - TriHUG May 22, 2012
Introduction to ZooKeeper - TriHUG May 22, 2012Introduction to ZooKeeper - TriHUG May 22, 2012
Introduction to ZooKeeper - TriHUG May 22, 2012
 
Timely Year Two: Lessons Learned Building a Scalable Metrics Analytic System
Timely Year Two: Lessons Learned Building a Scalable Metrics Analytic SystemTimely Year Two: Lessons Learned Building a Scalable Metrics Analytic System
Timely Year Two: Lessons Learned Building a Scalable Metrics Analytic System
 
PG Day'14 Russia, PostgreSQL System Architecture, Heikki Linnakangas
PG Day'14 Russia, PostgreSQL System Architecture, Heikki LinnakangasPG Day'14 Russia, PostgreSQL System Architecture, Heikki Linnakangas
PG Day'14 Russia, PostgreSQL System Architecture, Heikki Linnakangas
 
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...
 
HTTP/2 Introduction
HTTP/2 IntroductionHTTP/2 Introduction
HTTP/2 Introduction
 
Security Monitoring for big Infrastructures without a Million Dollar budget
Security Monitoring for big Infrastructures without a Million Dollar budgetSecurity Monitoring for big Infrastructures without a Million Dollar budget
Security Monitoring for big Infrastructures without a Million Dollar budget
 
Docker Logging and analysing with Elastic Stack - Jakub Hajek
Docker Logging and analysing with Elastic Stack - Jakub Hajek Docker Logging and analysing with Elastic Stack - Jakub Hajek
Docker Logging and analysing with Elastic Stack - Jakub Hajek
 

Recently uploaded

Re-membering the Bard: Revisiting The Compleat Wrks of Wllm Shkspr (Abridged)...
Re-membering the Bard: Revisiting The Compleat Wrks of Wllm Shkspr (Abridged)...Re-membering the Bard: Revisiting The Compleat Wrks of Wllm Shkspr (Abridged)...
Re-membering the Bard: Revisiting The Compleat Wrks of Wllm Shkspr (Abridged)...Hasting Chen
 
Introduction to Prompt Engineering (Focusing on ChatGPT)
Introduction to Prompt Engineering (Focusing on ChatGPT)Introduction to Prompt Engineering (Focusing on ChatGPT)
Introduction to Prompt Engineering (Focusing on ChatGPT)Chameera Dedduwage
 
If this Giant Must Walk: A Manifesto for a New Nigeria
If this Giant Must Walk: A Manifesto for a New NigeriaIf this Giant Must Walk: A Manifesto for a New Nigeria
If this Giant Must Walk: A Manifesto for a New NigeriaKayode Fayemi
 
ANCHORING SCRIPT FOR A CULTURAL EVENT.docx
ANCHORING SCRIPT FOR A CULTURAL EVENT.docxANCHORING SCRIPT FOR A CULTURAL EVENT.docx
ANCHORING SCRIPT FOR A CULTURAL EVENT.docxNikitaBankoti2
 
BDSM⚡Call Girls in Sector 93 Noida Escorts >༒8448380779 Escort Service
BDSM⚡Call Girls in Sector 93 Noida Escorts >༒8448380779 Escort ServiceBDSM⚡Call Girls in Sector 93 Noida Escorts >༒8448380779 Escort Service
BDSM⚡Call Girls in Sector 93 Noida Escorts >༒8448380779 Escort ServiceDelhi Call girls
 
WhatsApp 📞 9892124323 ✅Call Girls In Juhu ( Mumbai )
WhatsApp 📞 9892124323 ✅Call Girls In Juhu ( Mumbai )WhatsApp 📞 9892124323 ✅Call Girls In Juhu ( Mumbai )
WhatsApp 📞 9892124323 ✅Call Girls In Juhu ( Mumbai )Pooja Nehwal
 
Thirunelveli call girls Tamil escorts 7877702510
Thirunelveli call girls Tamil escorts 7877702510Thirunelveli call girls Tamil escorts 7877702510
Thirunelveli call girls Tamil escorts 7877702510Vipesco
 
CTAC 2024 Valencia - Henrik Hanke - Reduce to the max - slideshare.pdf
CTAC 2024 Valencia - Henrik Hanke - Reduce to the max - slideshare.pdfCTAC 2024 Valencia - Henrik Hanke - Reduce to the max - slideshare.pdf
CTAC 2024 Valencia - Henrik Hanke - Reduce to the max - slideshare.pdfhenrik385807
 
SaaStr Workshop Wednesday w/ Lucas Price, Yardstick
SaaStr Workshop Wednesday w/ Lucas Price, YardstickSaaStr Workshop Wednesday w/ Lucas Price, Yardstick
SaaStr Workshop Wednesday w/ Lucas Price, Yardsticksaastr
 
Mohammad_Alnahdi_Oral_Presentation_Assignment.pptx
Mohammad_Alnahdi_Oral_Presentation_Assignment.pptxMohammad_Alnahdi_Oral_Presentation_Assignment.pptx
Mohammad_Alnahdi_Oral_Presentation_Assignment.pptxmohammadalnahdi22
 
No Advance 8868886958 Chandigarh Call Girls , Indian Call Girls For Full Nigh...
No Advance 8868886958 Chandigarh Call Girls , Indian Call Girls For Full Nigh...No Advance 8868886958 Chandigarh Call Girls , Indian Call Girls For Full Nigh...
No Advance 8868886958 Chandigarh Call Girls , Indian Call Girls For Full Nigh...Sheetaleventcompany
 
Presentation on Engagement in Book Clubs
Presentation on Engagement in Book ClubsPresentation on Engagement in Book Clubs
Presentation on Engagement in Book Clubssamaasim06
 
Chiulli_Aurora_Oman_Raffaele_Beowulf.pptx
Chiulli_Aurora_Oman_Raffaele_Beowulf.pptxChiulli_Aurora_Oman_Raffaele_Beowulf.pptx
Chiulli_Aurora_Oman_Raffaele_Beowulf.pptxraffaeleoman
 
George Lever - eCommerce Day Chile 2024
George Lever -  eCommerce Day Chile 2024George Lever -  eCommerce Day Chile 2024
George Lever - eCommerce Day Chile 2024eCommerce Institute
 
Night 7k Call Girls Noida Sector 128 Call Me: 8448380779
Night 7k Call Girls Noida Sector 128 Call Me: 8448380779Night 7k Call Girls Noida Sector 128 Call Me: 8448380779
Night 7k Call Girls Noida Sector 128 Call Me: 8448380779Delhi Call girls
 
Andrés Ramírez Gossler, Facundo Schinnea - eCommerce Day Chile 2024
Andrés Ramírez Gossler, Facundo Schinnea - eCommerce Day Chile 2024Andrés Ramírez Gossler, Facundo Schinnea - eCommerce Day Chile 2024
Andrés Ramírez Gossler, Facundo Schinnea - eCommerce Day Chile 2024eCommerce Institute
 
Governance and Nation-Building in Nigeria: Some Reflections on Options for Po...
Governance and Nation-Building in Nigeria: Some Reflections on Options for Po...Governance and Nation-Building in Nigeria: Some Reflections on Options for Po...
Governance and Nation-Building in Nigeria: Some Reflections on Options for Po...Kayode Fayemi
 
CTAC 2024 Valencia - Sven Zoelle - Most Crucial Invest to Digitalisation_slid...
CTAC 2024 Valencia - Sven Zoelle - Most Crucial Invest to Digitalisation_slid...CTAC 2024 Valencia - Sven Zoelle - Most Crucial Invest to Digitalisation_slid...
CTAC 2024 Valencia - Sven Zoelle - Most Crucial Invest to Digitalisation_slid...henrik385807
 
VVIP Call Girls Nalasopara : 9892124323, Call Girls in Nalasopara Services
VVIP Call Girls Nalasopara : 9892124323, Call Girls in Nalasopara ServicesVVIP Call Girls Nalasopara : 9892124323, Call Girls in Nalasopara Services
VVIP Call Girls Nalasopara : 9892124323, Call Girls in Nalasopara ServicesPooja Nehwal
 
Call Girl Number in Khar Mumbai📲 9892124323 💞 Full Night Enjoy
Call Girl Number in Khar Mumbai📲 9892124323 💞 Full Night EnjoyCall Girl Number in Khar Mumbai📲 9892124323 💞 Full Night Enjoy
Call Girl Number in Khar Mumbai📲 9892124323 💞 Full Night EnjoyPooja Nehwal
 

Recently uploaded (20)

Re-membering the Bard: Revisiting The Compleat Wrks of Wllm Shkspr (Abridged)...
Re-membering the Bard: Revisiting The Compleat Wrks of Wllm Shkspr (Abridged)...Re-membering the Bard: Revisiting The Compleat Wrks of Wllm Shkspr (Abridged)...
Re-membering the Bard: Revisiting The Compleat Wrks of Wllm Shkspr (Abridged)...
 
Introduction to Prompt Engineering (Focusing on ChatGPT)
Introduction to Prompt Engineering (Focusing on ChatGPT)Introduction to Prompt Engineering (Focusing on ChatGPT)
Introduction to Prompt Engineering (Focusing on ChatGPT)
 
If this Giant Must Walk: A Manifesto for a New Nigeria
If this Giant Must Walk: A Manifesto for a New NigeriaIf this Giant Must Walk: A Manifesto for a New Nigeria
If this Giant Must Walk: A Manifesto for a New Nigeria
 
ANCHORING SCRIPT FOR A CULTURAL EVENT.docx
ANCHORING SCRIPT FOR A CULTURAL EVENT.docxANCHORING SCRIPT FOR A CULTURAL EVENT.docx
ANCHORING SCRIPT FOR A CULTURAL EVENT.docx
 
BDSM⚡Call Girls in Sector 93 Noida Escorts >༒8448380779 Escort Service
BDSM⚡Call Girls in Sector 93 Noida Escorts >༒8448380779 Escort ServiceBDSM⚡Call Girls in Sector 93 Noida Escorts >༒8448380779 Escort Service
BDSM⚡Call Girls in Sector 93 Noida Escorts >༒8448380779 Escort Service
 
WhatsApp 📞 9892124323 ✅Call Girls In Juhu ( Mumbai )
WhatsApp 📞 9892124323 ✅Call Girls In Juhu ( Mumbai )WhatsApp 📞 9892124323 ✅Call Girls In Juhu ( Mumbai )
WhatsApp 📞 9892124323 ✅Call Girls In Juhu ( Mumbai )
 
Thirunelveli call girls Tamil escorts 7877702510
Thirunelveli call girls Tamil escorts 7877702510Thirunelveli call girls Tamil escorts 7877702510
Thirunelveli call girls Tamil escorts 7877702510
 
CTAC 2024 Valencia - Henrik Hanke - Reduce to the max - slideshare.pdf
CTAC 2024 Valencia - Henrik Hanke - Reduce to the max - slideshare.pdfCTAC 2024 Valencia - Henrik Hanke - Reduce to the max - slideshare.pdf
CTAC 2024 Valencia - Henrik Hanke - Reduce to the max - slideshare.pdf
 
SaaStr Workshop Wednesday w/ Lucas Price, Yardstick
SaaStr Workshop Wednesday w/ Lucas Price, YardstickSaaStr Workshop Wednesday w/ Lucas Price, Yardstick
SaaStr Workshop Wednesday w/ Lucas Price, Yardstick
 
Mohammad_Alnahdi_Oral_Presentation_Assignment.pptx
Mohammad_Alnahdi_Oral_Presentation_Assignment.pptxMohammad_Alnahdi_Oral_Presentation_Assignment.pptx
Mohammad_Alnahdi_Oral_Presentation_Assignment.pptx
 
No Advance 8868886958 Chandigarh Call Girls , Indian Call Girls For Full Nigh...
No Advance 8868886958 Chandigarh Call Girls , Indian Call Girls For Full Nigh...No Advance 8868886958 Chandigarh Call Girls , Indian Call Girls For Full Nigh...
No Advance 8868886958 Chandigarh Call Girls , Indian Call Girls For Full Nigh...
 
Presentation on Engagement in Book Clubs
Presentation on Engagement in Book ClubsPresentation on Engagement in Book Clubs
Presentation on Engagement in Book Clubs
 
Chiulli_Aurora_Oman_Raffaele_Beowulf.pptx
Chiulli_Aurora_Oman_Raffaele_Beowulf.pptxChiulli_Aurora_Oman_Raffaele_Beowulf.pptx
Chiulli_Aurora_Oman_Raffaele_Beowulf.pptx
 
George Lever - eCommerce Day Chile 2024
George Lever -  eCommerce Day Chile 2024George Lever -  eCommerce Day Chile 2024
George Lever - eCommerce Day Chile 2024
 
Night 7k Call Girls Noida Sector 128 Call Me: 8448380779
Night 7k Call Girls Noida Sector 128 Call Me: 8448380779Night 7k Call Girls Noida Sector 128 Call Me: 8448380779
Night 7k Call Girls Noida Sector 128 Call Me: 8448380779
 
Andrés Ramírez Gossler, Facundo Schinnea - eCommerce Day Chile 2024
Andrés Ramírez Gossler, Facundo Schinnea - eCommerce Day Chile 2024Andrés Ramírez Gossler, Facundo Schinnea - eCommerce Day Chile 2024
Andrés Ramírez Gossler, Facundo Schinnea - eCommerce Day Chile 2024
 
Governance and Nation-Building in Nigeria: Some Reflections on Options for Po...
Governance and Nation-Building in Nigeria: Some Reflections on Options for Po...Governance and Nation-Building in Nigeria: Some Reflections on Options for Po...
Governance and Nation-Building in Nigeria: Some Reflections on Options for Po...
 
CTAC 2024 Valencia - Sven Zoelle - Most Crucial Invest to Digitalisation_slid...
CTAC 2024 Valencia - Sven Zoelle - Most Crucial Invest to Digitalisation_slid...CTAC 2024 Valencia - Sven Zoelle - Most Crucial Invest to Digitalisation_slid...
CTAC 2024 Valencia - Sven Zoelle - Most Crucial Invest to Digitalisation_slid...
 
VVIP Call Girls Nalasopara : 9892124323, Call Girls in Nalasopara Services
VVIP Call Girls Nalasopara : 9892124323, Call Girls in Nalasopara ServicesVVIP Call Girls Nalasopara : 9892124323, Call Girls in Nalasopara Services
VVIP Call Girls Nalasopara : 9892124323, Call Girls in Nalasopara Services
 
Call Girl Number in Khar Mumbai📲 9892124323 💞 Full Night Enjoy
Call Girl Number in Khar Mumbai📲 9892124323 💞 Full Night EnjoyCall Girl Number in Khar Mumbai📲 9892124323 💞 Full Night Enjoy
Call Girl Number in Khar Mumbai📲 9892124323 💞 Full Night Enjoy
 

Observability tips for HAProxy

  • 1. Observability tips for HAProxy Willy Tarreau (willy@haproxy.org) Dotscale 2018
  • 2. 2 Definition Observability: "control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs" (wikipedia) In short for us: WTF is going on ?
  • 3. 3 Observability vs monitoring ● Monitoring tells you how well something works (or not) ● Observability helps you detect what is not working and why => You monitor an observable system.
  • 4. 4 Observability is important ● can’t rely on impatient users’ complaints anymore => detect and fix trouble not reported yet ● stop adding duct tape, address root causes! ● improve users experience where it matters
  • 5. 5 The LB as an observation tower ● central place ● in distributed systems, one LB per level, even better when sidecar ● sees multiple targets, eases comparisons => draw references ● Trusted low level component & excellent transparency ● many LB decisions actually depend on metrics related to performance and observability ● already logs, provides long-term references
  • 6. 6 Why not look from other points ? ● you should! Especially in distributed systems (microservices, etc)! ● but often it's too late when the first incident happens! ● with existing LB's logs, it's already possible to do a lot Note: see OpenTracing and Prometheus
  • 7. 7 What does the LB see ? ● global failures (aborts, timeouts) ● abnormal delays caused by network retransmits ● connection failures and retries caused by bad tuning (eg: conntrack) ● connection slowdowns caused by inefficient firewall policies (#rules) ...
  • 8. 8 What does the LB see (…) ? ● client-side issues (BW limitations) ● per-URL processing time (application issues, svc partners) ● per-node vs per-cluster variations => narrow down to individual node or shared resource ● deployment issues : new occasional error on a specific page, can be addressed before going full-scale
  • 9. 9 Accessing metrics in HAProxy ● Logs : ● Halog, ELK, Prometheus, … ● Provides unique-id for tracing/event correlation ● Stats : ● Stats page, CLI, hatop ● Stick-tables (per arbitrary key like IP, URL, cookie) : ● Byte count, cumulated/concurrent conns, errors, …
  • 10. 10 Sequence of events on HAProxy
  • 11. 11 Sequence of events on HAProxy
  • 12. 12 Sequence of events on HAProxy
  • 13. 13 Sequence of events on HAProxy
  • 14. 14 Sequence of events on HAProxy
  • 15. 15 Sequence of events on HAProxy
  • 16. 16 Sequence of events on HAProxy
  • 17. 17 Sequence of events on HAProxy
  • 18. 18 Sequence of events on HAProxy
  • 19. 19 Sequence of events on HAProxy
  • 20. 20 Sequence of events on HAProxy
  • 21. 21 Sequence of events on HAProxy
  • 22. 22 Sequence of events on HAProxy
  • 23. 23 Sequence of events on HAProxy
  • 24. 24 Sequence of events on HAProxy
  • 25. 25 Sequence of events on HAProxy
  • 26. 26 More timers to come in HAProxy 1.9 ● HAProxy now supports heavier per-request workloads (Lua, device identification, …) ● Processing times over 200 µs can become noticeable Actions: ● log per-request total CPU time spent in analysers ● log per-request total CPU time spent in TLS handshake ● log per-request total latency added by other tasks ● Ability to kill offending tasks ● Ability to alert on high latencies => make HAProxy as observable as other components
  • 27. 27 Event timing reports ● Timers are averaged in the stats ● Each timer appears in the logs ● Halog -rt/-RT/-pct for quick analysis ● Each timer crossing a limit triggers a timeout ● Each abort at a specific step causes a hard error => termination codes haproxy[14389]: 10.0.1.2:33317 [06/Feb/2018:12:14:14.655] http-in static/srv1 10/0/30/69/109 200 2750 - - SDNN 1/1/1/1/0 0/0 {haproxy.org} {} "GET /index.html HTTP/1.1" Timers Term code Cookie code
  • 28. 28 Termination codes ● Distinguish between timeout and abort ● Indicate whom (client, server, haproxy, kill, ...) ● Indicate when (req,queue,connect,response...) ● Completed by persistence cookie indications ● Filtered and sorted by halog : # halog -tcn|-TCN ... # for filtering # halog -tc # for sorting
  • 29. 29 Other relevant metrics : HTTP status distribution ● Stats page: distribution per frontend/backend/server ● Filter by ranges: halog -hs/-HS ● Sorted output: halog -st => graph the distribution and watch for variations between application deployments
  • 30. 30 Other relevant metrics : queue length ● Uses server maxconn ● Grows exponentially with slowdowns : easy to detect! ● Tells you how many extra servers you need ● Reported by halog -Q/-QS ● Shown in real time on the stats page per backend/srv => If you watch only one metric, watch this one!
  • 31. 31 Other relevant metrics : LB fairness LB algorithm implies fairness between servers : ● Equal request count with roundrobin => Higher than average concurrency indicates abnormally slow server ● Equal load with leastconn => Low req count indicates abnormally slow server => graph relevant values within the farm
  • 32. 32 Other relevant metrics : error rate ● Global: halog -e ● Per server: halog -srv ● Per client IP: halog -e -ic (detect bad CDN nodes) ● Per URL: halog -ue ● Stats page: per frontend/backend/server ● Stick-tables: per arbitrary key using http_err_rate() => no threshold, watch for variations
  • 33. 33 Useful entries in log-format ● Default httplog format is quite rich ● Can be improved using the log-format directive ● Hint: log stick-table stats for similar keys haproxy[14389]: 10.0.1.2:33317 [06/Feb/2018:12:14:14.655] http-in static/srv1 10/0/30/69/109 200 2750 - - SDNN 1/1/1/1/0 0/0 {haproxy.org} {} "GET /index.html HTTP/1.1" Timers HTTP status Byte count Term Code Cookie Code Conn Count Queue Length
  • 34. 34 Tips: sampling : why / when "I can't enable logs, I have too much traffic!" ● an average syslog server can store 20k events/s without sweating ● that's 1.7B events/day or 350GB of uncompressed haproxy logs/day ● compresses to 1TB/month ● for $100 you can store 4 months with no loss ● have more traffic / not interested in this level of detail ? # log only 5% of requests http-request set-log-level silent unless { rand(100) -lt 5 }
  • 35. 35 Tips: selective logging: why / when ● you only want to catch suspicious events ● disable logging unless Tc/Tq/Tr/Tw/... is above a certain threshold ● on the fly for selected keys from the CLI + stick-table ● also see "option dontlognormal" ● WARNING: you'll lose any valid reference
  • 36. 36 Tips: other halog goodies ● Poorly documented, use halog --help ● response time per url: halog -uat ● errors per server: halog -srv ● Percentiles on req/queue/conn/resp times: halog -pct ● detect stolen CPU / swap : halog -ac … -ad … ● very fast (1-2 GB per second) => Use it in production to figure the relevant metrics
  • 37. 37 Success stories Customer spotting a broken fiber between two core switches ● Tc from HA1 to srv 1,2,3,5 always low, srv 4,6 high at 99 pct ● Tc from HA2 to srv 1,2,4,6 always low, srv 3,5 high at 99 pct => both haproxy and servers out of cause ● issue rate stable at various traffic levels => not congestion ● inter-switch link apparently at cause but not for all flows ● inter-switch link made of two fibers balanced on MAC tuple ● thanks to long-term logs, origin could even be identified
  • 38. 38 Success stories Customer figuring a wrong web server configuration using /dev/random ● Tc abnormally high with lots of random values to several seconds, and only for TLS ● timer also covers TLS handshake => not a network, hardware or performance issue, only server config. => system was regularly running out of entropy due to mistakenly using /dev/random as a random source for SSL
  • 39. 39 Conclusion ● exploit your stats ● enable logs on LBs, no excuse for not doing it! ● process them automatically, manually once in a while ● compare numbers between similar objects ● detect anomalies ● fix problems before they are witnessed ● profit :-)
  • 40. 40 Interesting lectures ● https://medium.com/@copyconstruct/monitoring-and-observability-8417d1952e1c ● https://www.vividcortex.com/blog/monitoring-isnt-observability ● http://opentracing.io/documentation/ ● https://prometheus.io/