Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Observability with HAProxy

73 views

Published on

Observability with HAProxy

Published in: Technology
  • Be the first to comment

Observability with HAProxy

  1. 1. WWW.HAPROXY.COM Observability with HAProxy
  2. 2. WWW.HAPROXY.COM ● Introduction to “observability” ● Deep dive in HAProxy ● Tips (to get more from HAProxy) ● RoX (Return On eXperience) ● Conclusion (because we need one) Agenda
  3. 3. WWW.HAPROXY.COMWWW.HAPROXY.COM Introduction
  4. 4. WWW.HAPROXY.COM Observability: "control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs" (wikipedia) In short for us: WTF is going on ? Definition
  5. 5. WWW.HAPROXY.COM ● Monitoring tells you how well something works (or not) ● Observability helps you detect what is not working and why => You monitor an observable system. Observability vs monitoring
  6. 6. WWW.HAPROXY.COM ● can’t rely on impatient users’ complaints anymore => detect and fix trouble not reported yet ● stop adding duct tape, address root causes! ● improve users experience where it matters Observability is crucial
  7. 7. WWW.HAPROXY.COM ● central place ● in distributed systems, one LB per layer, even better when sidecar ● sees multiple targets, eases comparisons => draw references ● Trusted low level component & excellent transparency ● many LB decisions actually depend on metrics related to performance and observability ● again, logs provide long-term references The LB as an observation tower
  8. 8. WWW.HAPROXY.COM ● Maintain two types of TCP connections ○ From client to HAPRoxy (1) ○ From HAProxy to server (2) ○ Only access to data payload (not the “packet”) LB in Reverse-proxy mode HAProxyClient Server (1) (2)
  9. 9. WWW.HAPROXY.COM ● you should! Especially in distributed systems (microservices, etc)! ● but often it's too late when the first incident happens! ● with existing LB's logs, it's already possible to do a lot Note: check OpenTracing and Prometheus Why not looking from other points?
  10. 10. WWW.HAPROXY.COM ● global failures (aborts, timeouts) ● abnormal delays caused by network retransmits ● connection failures and retries caused by bad tuning (eg: conntrack) ● connection slowdowns caused by inefficient firewall policies (#rules) ● Non exhaustive list ... What does the LB see?
  11. 11. WWW.HAPROXY.COM ● client-side issues (BW limitations) ● per-URL processing time (application issues, svc partners) ● per-node vs per-cluster variations => narrow down to individual node or shared resource ● deployment issues : new occasional error on a specific page, can be addressed before going full-scale What does the LB see?
  12. 12. WWW.HAPROXY.COMWWW.HAPROXY.COM Deep dive in HAProxy
  13. 13. WWW.HAPROXY.COM haproxy[14389]: 10.0.1.2:33317 [06/Feb/2018:12:14:14.655] http-in static/srv1 10/0/30/69/109 200 2750 - -SDNN 1/1/1/1/0 0/0 {haproxy.org} {} "GET /index.html HTTP/1.1" I need help !!!
  14. 14. WWW.HAPROXY.COM ● Logs : ○ Halog, ELK, Splunk, datadog… ○ Provides unique-id for tracing/event correlation ● Statistics : ○ Stats page, CLI, hatop ○ Prometheus, statsd, ... ● Stick-tables (per arbitrary key like IP, URL, cookie, ...) : ○ Byte count, cumulated/concurrent conns, errors, rates… Accessing metrics in HAProxy
  15. 15. WWW.HAPROXY.COM Sequence of events in HAProxy
  16. 16. WWW.HAPROXY.COM Sequence of events in HAProxy
  17. 17. WWW.HAPROXY.COM Sequence of events in HAProxy
  18. 18. WWW.HAPROXY.COM Sequence of events in HAProxy
  19. 19. WWW.HAPROXY.COM Sequence of events in HAProxy
  20. 20. WWW.HAPROXY.COM Sequence of events in HAProxy
  21. 21. WWW.HAPROXY.COM Sequence of events in HAProxy
  22. 22. WWW.HAPROXY.COM Sequence of events in HAProxy
  23. 23. WWW.HAPROXY.COM Sequence of events in HAProxy
  24. 24. WWW.HAPROXY.COM Sequence of events in HAProxy
  25. 25. WWW.HAPROXY.COM Sequence of events in HAProxy
  26. 26. WWW.HAPROXY.COM Sequence of events in HAProxy
  27. 27. WWW.HAPROXY.COM Sequence of events in HAProxy
  28. 28. WWW.HAPROXY.COM Sequence of events in HAProxy
  29. 29. WWW.HAPROXY.COM Sequence of events in HAProxy
  30. 30. WWW.HAPROXY.COM ● HAProxy now supports heavier per-request workloads (Lua, device identification, …) ● Processing times over 200 µs can become noticeable ● Actions: ○ log per-request total CPU time spent in analysers ○ log per-request total CPU time spent in TLS handshake ○ log per-request total latency added by other tasks ○ Ability to kill offending tasks ○ Ability to alert on high latencies => make HAProxy as observable as other components And more to come in HAProxy 1.9 (Nov. 2018)
  31. 31. WWW.HAPROXY.COM ● Timers are averaged in the stats ● Each timer appears in the logs ● halog -rt/-RT/-pct for quick analysis ● Each timer crossing a limit triggers a timeout ● Each abort at a specific step causes a hard error => termination codes haproxy[14389]: 10.0.1.2:33317 [06/Feb/2018:12:14:14.655] http-in static/srv1 10/0/30/69/109 200 2750 - - SDNN 1/1/1/1/0 0/0 {haproxy.org} {} "GET /index.html HTTP/1.1" Event timing reports Timers Term code Cookie code
  32. 32. WWW.HAPROXY.COM ● Who, when, why and how the session has been terminated ● Distinguish between timeout and abort ● Indicate who (client, server, haproxy, kill, ...) ● Indicate when (req, queue, connect, response, data...) ● Completed by persistence cookie indications ● Filtered and sorted by halog : halog -tcn|-TCN ... # for filtering halog -tc # for sorting => Graphing the number of errors reported by second helps detecting an issue is in progress somewhere Termination codes
  33. 33. WWW.HAPROXY.COM ● Stats page: distribution per frontend/backend/server ● Filter by ranges: halog -hs/-HS ● Sorted output: halog -st => graph the distribution and watch for variations between application deployments HTTP status code distribution
  34. 34. WWW.HAPROXY.COM ● Uses server maxconn ● Grows exponentially with slowdowns : easy to detect! ● Tells you how many extra servers you need ● Reported by halog -Q/-QS ● Shown in real time on the stats page per backend/srv => If you watch only one metric, watch this one! Other relevant metrics: queues
  35. 35. WWW.HAPROXY.COM LB algorithm implies fairness between servers : ● Equal request count with roundrobin => Higher than average concurrency indicates abnormally slow server ● Equal load with leastconn => Low req count indicates abnormally slow server => graph relevant values within the farm Other relevant metrics: LB fairness
  36. 36. WWW.HAPROXY.COM ● Global: halog -e ● Per server: halog -srv ● Per client IP: halog -e -ic (detect bad CDN nodes) ● Per URL: halog -ue ● Stats page: per frontend/backend/server ● Stick-tables: per arbitrary key using http_err_rate() => no threshold, watch for variations Other relevant metrics: Error rate
  37. 37. WWW.HAPROXY.COM ● Default httplog format is quite rich ● Can be improved using the log-format directive ● Hint: log stick-table stats for similar keys haproxy[14389]: 10.0.1.2:33317 [06/Feb/2018:12:14:14.655] http-in static/srv1 10/0/30/69/109 200 2750 - - SDNN 1/1/1/1/0 0/0 {haproxy.org} {} "GET /index.html HTTP/1.1" Useful entries in log-format Timers HTTP status Byte count Term Code Cookie Code Conn Count Queue Length
  38. 38. WWW.HAPROXY.COMWWW.HAPROXY.COM Tips !!!!
  39. 39. WWW.HAPROXY.COM "I can't enable logs, I have too much traffic!" ● an average syslog server can store 20k events/s without sweating ● that's 1.7B events/day or 350GB of uncompressed haproxy logs/day ● compresses to 1TB/month ● for $100 you can store 4 months with no loss ● have more traffic / not interested in this level of detail ? # log only 5% of requests http-request set-log-level silent unless { rand(100) -lt 5 } Sampling: why / When Tips !!!!
  40. 40. WWW.HAPROXY.COM ● you only want to catch suspicious events ● enable logs per url, source IP, etc... ● disable logging unless Tc/Tq/Tr/Tw/... is above a certain threshold ● on the fly for selected keys from the CLI + stick-table ● also see "option dontlognormal" WARNING: you'll lose any valid reference Selective logging: why / When Tips !!!!
  41. 41. WWW.HAPROXY.COM ● Poorly documented, use halog --help ● response time per url: halog -uat ● errors per server: halog -srv ● Percentiles on req/queue/conn/resp times: halog -pct ● detect stolen CPU / swap : halog -ac … -ad … ● very fast (1-2 GB of log file per second) => Use it in production to figure the relevant metrics Other halog goodies Tips !!!!
  42. 42. WWW.HAPROXY.COMWWW.HAPROXY.COM ROX (Return On eXperience)
  43. 43. WWW.HAPROXY.COM ● Many ops and dev do this every day all around the world ● Users are complaining that the application is slow ● Devops team check HAProxy logs (using halog or ELK) in order to isolate potential sources of issue: ○ Check server Tc ○ Sort servers by response time ○ Sort URLs by response time ● HAProxy won’t fix the problem (in most cases), but will drastically reduce the troubleshooting time by isolating the potential issues Web application is slow!!!!!!!! Return On eXperience
  44. 44. WWW.HAPROXY.COM ● Retail customer, with shops everywhere in France ● Cash register reports the following error message “Unable to reach central system” when searching people in the loyalty card database, after waiting up to 10s. ● Argument between customer and “central system” software provider: ○ Customer: your software is slow ○ Provider: your network triggers this error ● After installing HAProxy, its logs reported the following: ○ Time to get the query from the client: 300ms (from North of France to Paris) ○ Server response time: 50s ● Conclusion: provider turned off debug mode in the “central system” software Unavailable central system ?!? Return On eXperience
  45. 45. WWW.HAPROXY.COM ● Customer with big TLS traffic, mostly API based ● Each HAProxy server can’t go higher than 20K HTTPs req/s (in 2015), even if there seem to be a lot of unused resources on the box ● At first, this was qualified as an “HAProxy” issue (including the HW and OS itself) ● Hard tuning the box did not improve anything ● halog Tc percentile reported higher values when the load increased (up to 30ms, on the LAN!!!) ● After asking customer to double check the Spanning Tree, it seemed a small 24 ports ToR switch became root bridge…. ● Fixing spanning tree also fixed HAProxy’s TLS performance HAProxy TLS performance issue Return On eXperience
  46. 46. WWW.HAPROXY.COM ● Tc from HA1 to srv 1,2,3,5 always low, srv 4,6 high at 99 pct ● Tc from HA2 to srv 1,2,4,6 always low, srv 3,5 high at 99 pct => both haproxy and servers out of cause ● issue rate stable at various traffic levels => not congestion ● inter-switch link apparently at cause but not for all flows ● inter-switch link made of two fibers balanced on MAC tuple ● thanks to long-term logs, origin could even be identified Spotting a broken fiber channel between 2 core switches Return On eXperience
  47. 47. WWW.HAPROXY.COM ● Tc abnormally high with lots of random values to several seconds, and only for TLS ● Tc timer also covers TLS handshake => not a network, hardware or performance issue, only server config. => system was regularly running out of entropy due to mistakenly using /dev/random as a random source for SSL TIP: use /dev/urandom ;) Wrong web server configuration using /dev/random Return On eXperience
  48. 48. WWW.HAPROXY.COMWWW.HAPROXY.COM Conclusion
  49. 49. WWW.HAPROXY.COM ● exploit your stats ● Choose the right LB ● enable logs on LBs, no excuse for not doing it! ● process them automatically, manually once in a while ● compare numbers between similar objects ● detect anomalies ● fix problems before they are witnessed ● Enjoy :-) Conclusion Conclusion
  50. 50. WWW.HAPROXY.COM QUESTION & ANSWER

×