Hidden Linux Metrics with Prometheus
eBPF Exporter
Building blocks for managing a fleet at scale
Alexander Huynh
R&D @ Cloudflare
Ivan Babrou
Performance @ Cloudflare
What does Cloudflare do
CDN
Moving content physically closer to visitors with our CDN.
Intelligent caching
Unlimited DDoS mitigation
Unlimited bandwidth at flat pricing with free plans
Edge access control
IPFS gateway
Onion service
Website Optimization
Making the web fast and up to date for everyone.
TLS 1.3 (with 0-RTT)
HTTP/2 + QUIC
Server push
AMP
Origin load-balancing
Smart routing
Serverless / Edge Workers
Post-quantum crypto
DNS
Cloudflare is the fastest managed DNS provider in the world.
1.1.1.1
2606:4700:4700::1111
DNS over TLS
160+ data centers globally
4.5M+ DNS requests/s across authoritative, recursive and internal
10% of Internet requests every day
10M+ HTTP requests/second
10M+ websites, apps & APIs in 150 countries
20 Tbps network capacity on Cloudflare’s anycast network
Link to slides with speaker notes
Slideshare doesn’t allow links on the first 3 slides
eBPF overview
● Extended Berkeley Packet Filter
● In-kernel virtual machine with specific instruction set
● General tracepoints
● Further reading
○ http://www.brendangregg.com/ebpf.html
○ https://cilium.readthedocs.io/en/v1.1/bpf/
eBPF overview
● Builds on classical BPF
● Limited to 4096 instructions, 512 bytes stack, no loops
● Both kernel-level and process-level tracing
● Further reading
○ http://www.brendangregg.com/ebpf.html
○ https://cilium.readthedocs.io/en/v1.1/bpf/
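As an aside, the speaker notes mention that typing a tcpdump filter already compiles classic BPF; tcpdump’s -d flag dumps the generated classic BPF program, for example:
ubuntu:~$ sudo tcpdump -d 'tcp port 443'
# prints the compiled classic BPF instructions for this filter and exits without capturing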
eBPF usage, current
● Many examples at bcc tools
○ Print function parameters
○ Histogram of I/O operation times
○ Tap network connections
BCC tools’ trace
ubuntu:/usr/share/bcc/tools$ sudo ./trace 'sys_execve "%s", arg1'
PID TID COMM FUNC -
17366 17366 tmux: server sys_execve /bin/bash
17369 17369 bash sys_execve /usr/bin/locale-check
17371 17371 bash sys_execve /usr/bin/lesspipe
17372 17372 lesspipe sys_execve /usr/bin/basename
17374 17374 lesspipe sys_execve /usr/bin/dirname
17376 17376 bash sys_execve /usr/bin/dircolors
^C
BCC tools’ biolatency
ubuntu:/usr/share/bcc/tools$ sudo ./biolatency 10 1
Tracing block device I/O... Hit Ctrl-C to end.
usecs : count distribution
...
32 -> 63 : 0 | |
64 -> 127 : 2 |* |
128 -> 255 : 24 |*********************** |
256 -> 511 : 41 |****************************************|
512 -> 1023 : 7 |****** |
1024 -> 2047 : 1 | |
BCC tools’ tcpconnect
ubuntu:/usr/share/bcc/tools$ sudo ./tcpconnect 2>/dev/null
PID COMM IP SADDR DADDR DPORT
6918 wget 4 10.0.2.15 1.1.1.1 443
ubuntu@ubuntu:~$ wget -qO /dev/null https://1.1.1.1/
Many more tools available
eBPF usage, current
● Limited to one system
● Human-driven
● Called as needed
160+ data centers globally
4.5M+ DNS requests/s across authoritative, recursive and internal
10% of Internet requests every day
10M+ HTTP requests/second
10M+ websites, apps & APIs in 150 countries
20 Tbps network capacity on Cloudflare’s anycast network
eBPF usage, future
● Passive collection
● Exposed for metrics consumption
● Stick to basic metrics to preserve performance
Two main options to collect system metrics
node_exporter
Gauges and counters for system metrics with lots of plugins:
cpu, diskstats, edac, filesystem, loadavg, meminfo, netdev, etc
cAdvisor
Gauges and counters for container level metrics:
cpu, memory, io, net, delay accounting, etc.
Check out this issue about Prometheus friendliness.
Example graphs from node_exporter
Example graphs from node_exporter
Example graphs from cAdvisor
Counters are easy, but lack detail: e.g. IO
What’s the distribution?
● Many fast IOs?
● Few slow IOs?
● Some kind of mix?
● Read vs write speed?
Histograms to the rescue
● Counter:
node_disk_io_time_ms{instance="foo", device="sdc"} 39251489
● Histogram:
bio_latency_seconds_bucket{instance="foo", device="sdc", le="+Inf"} 53516704
bio_latency_seconds_bucket{instance="foo", device="sdc", le="67.108864"} 53516704
...
bio_latency_seconds_bucket{instance="foo", device="sdc", le="0.001024"} 51574285
bio_latency_seconds_bucket{instance="foo", device="sdc", le="0.000512"} 46825073
bio_latency_seconds_bucket{instance="foo", device="sdc", le="0.000256"} 33208881
bio_latency_seconds_bucket{instance="foo", device="sdc", le="0.000128"} 9037907
bio_latency_seconds_bucket{instance="foo", device="sdc", le="6.4e-05"} 239629
bio_latency_seconds_bucket{instance="foo", device="sdc", le="3.2e-05"} 132
bio_latency_seconds_bucket{instance="foo", device="sdc", le="1.6e-05"} 42
bio_latency_seconds_bucket{instance="foo", device="sdc", le="8e-06"} 29
bio_latency_seconds_bucket{instance="foo", device="sdc", le="4e-06"} 2
bio_latency_seconds_bucket{instance="foo", device="sdc", le="2e-06"} 0
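As an aside, once buckets like these are scraped, a latency percentile can be estimated with PromQL’s histogram_quantile. A minimal sketch, assuming a Prometheus server at localhost:9090 scrapes this metric:
ubuntu:~$ curl -s 'http://localhost:9090/api/v1/query' \
    --data-urlencode 'query=histogram_quantile(0.99, sum by (le, device) (rate(bio_latency_seconds_bucket[5m])))'
# estimated p99 block IO latency per device over the last 5 minutes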
Can be nicely visualized with new Grafana
Disk upgrade in production
Larger view to see in detail
So much for holding up to spec
Autodesk research: Datasaurus (animated)
Fun fact: US states have official dinosaurs
Autodesk research: Datasaurus (animated)
You need individual events for histograms
● Solution has to be low overhead (no blktrace)
● Solution has to be universal (not just IO tracing)
● Solution has to be supported out of the box (no modules or patches)
● Solution has to be safe (no kernel crashes or loops)
Monitoring, current
# counter-based, field 10 of https://www.kernel.org/doc/Documentation/iostats.txt
node_disk_io_time_ms{device="sdc"} 39251489
# counter-based, field 11
node_disk_io_time_weighted{device="sdc"} 110260706
Monitoring, future
# histogram-based
bio_latency_seconds_bucket{device="sdc", le="+Inf"} 53516704
bio_latency_seconds_bucket{device="sdc", le="67.108864"} 53516704
...
bio_latency_seconds_bucket{device="sdc", le="0.000512"} 46825073
bio_latency_seconds_bucket{device="sdc", le="0.000256"} 33208881
bio_latency_seconds_bucket{device="sdc", le="0.000128"} 9037907
bio_latency_seconds_bucket{device="sdc", le="6.4e-05"} 239629
bio_latency_seconds_bucket{device="sdc", le="3.2e-05"} 132
bio_latency_seconds_bucket{device="sdc", le="1.6e-05"} 42
bio_latency_seconds_bucket{device="sdc", le="8e-06"} 29
bio_latency_seconds_bucket{device="sdc", le="4e-06"} 2
bio_latency_seconds_bucket{device="sdc", le="2e-06"} 0
The future is now
● ebpf_exporter
● Uses iovisor/gobpf and prometheus/client_golang
● Configured via YAML
● https://github.com/cloudflare/ebpf_exporter
Monitoring pipeline
(diagram: ebpf_exporter instances → local prometheus servers → global prometheus)
Monitoring, future
# histogram-based
bio_latency_seconds_bucket{device="sdc", le="+Inf"} 53516704
bio_latency_seconds_bucket{device="sdc", le="67.108864"} 53516704
...
bio_latency_seconds_bucket{device="sdc", le="0.000512"} 46825073
bio_latency_seconds_bucket{device="sdc", le="0.000256"} 33208881
bio_latency_seconds_bucket{device="sdc", le="0.000128"} 9037907
bio_latency_seconds_bucket{device="sdc", le="6.4e-05"} 239629
bio_latency_seconds_bucket{device="sdc", le="3.2e-05"} 132
bio_latency_seconds_bucket{device="sdc", le="1.6e-05"} 42
bio_latency_seconds_bucket{device="sdc", le="8e-06"} 29
bio_latency_seconds_bucket{device="sdc", le="4e-06"} 2
bio_latency_seconds_bucket{device="sdc", le="2e-06"} 0
Graphs!
Monitoring pipeline
(diagram: ebpf_exporter instances → local prometheus servers → global prometheus)
Monitoring pipeline
(diagram: ebpf_exporter instances → local prometheus servers → global prometheus)
Platform setup
ubuntu:~$ # bionic
ubuntu:~$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=18.04
DISTRIB_CODENAME=bionic
DISTRIB_DESCRIPTION="Ubuntu 18.04.1 LTS"
Kernel configuration
ubuntu:~$ grep -E 'CONFIG[^=]*_BPF' /boot/config-$(uname -r)
CONFIG_BPF=y
CONFIG_BPF_SYSCALL=y
CONFIG_NET_CLS_BPF=m
CONFIG_NET_ACT_BPF=m
CONFIG_BPF_JIT=y
CONFIG_BPF_EVENTS=y
Dependencies
# bcc-tools, for comparison
ubuntu:~$ sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 4052245BD4284CDD
ubuntu:~$ echo "deb https://repo.iovisor.org/apt/bionic bionic main" \
    | sudo tee /etc/apt/sources.list.d/iovisor.list
ubuntu:~$ sudo apt-get update
ubuntu:~$ sudo apt-get install bcc-tools libbcc-examples linux-headers-$(uname -r)
# nice to have utilities, like `perf`
ubuntu:~$ sudo apt-get install -y linux-tools-$(uname -r)
# ebpf_exporter dependencies
ubuntu:~$ sudo apt-get install -y git golang
Installing ebpf_exporter
ubuntu:~$ mkdir /tmp/ebpf_exporter
ubuntu:~$ cd /tmp/ebpf_exporter
ubuntu:/tmp/ebpf_exporter$ GOPATH=$(pwd) go get github.com/cloudflare/ebpf_exporter/...
Build on BCC tools’ biolatency
ubuntu:~$ sudo /usr/share/bcc/tools/biolatency 10 1
Tracing block device I/O... Hit Ctrl-C to end.
usecs : count distribution
...
32 -> 63 : 0 | |
64 -> 127 : 2 |* |
128 -> 255 : 24 |*********************** |
256 -> 511 : 41 |****************************************|
512 -> 1023 : 7 |****** |
1024 -> 2047 : 1 | |
Generate exportable metrics
ubuntu:/tmp/ebpf_exporter$ sudo ./bin/ebpf_exporter \
    --config.file=src/github.com/cloudflare/ebpf_exporter/examples/bio-tracepoints.yaml
2018/10/19 23:38:13 Starting with 1 programs found in the config
2018/10/19 23:38:13 Listening on :9435
ubuntu@ubuntu:~$ wget -qO - http://localhost:9435/metrics \
    | grep '^ebpf_exporter_bio_latency_seconds_bucket' | head -n 10 | tail -n -5
ebpf_exporter_bio_latency_seconds_bucket{device="sda",operation="write",le="3.2e-05"} 0
ebpf_exporter_bio_latency_seconds_bucket{device="sda",operation="write",le="6.4e-05"} 7
ebpf_exporter_bio_latency_seconds_bucket{device="sda",operation="write",le="0.000128"} 41
ebpf_exporter_bio_latency_seconds_bucket{device="sda",operation="write",le="0.000256"} 106
ebpf_exporter_bio_latency_seconds_bucket{device="sda",operation="write",le="0.000512"} 141
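As an aside, the speaker notes mention alerting on the number of events above a latency threshold. A minimal sketch, assuming Prometheus scrapes this exporter and that a 0.001024s bucket exists as in the earlier listing (buckets are cumulative, so subtract that bucket from the +Inf total):
ubuntu:~$ curl -s 'http://localhost:9090/api/v1/query' \
    --data-urlencode 'query=sum by (device) (rate(ebpf_exporter_bio_latency_seconds_bucket{le="+Inf"}[5m])) - sum by (device) (rate(ebpf_exporter_bio_latency_seconds_bucket{le="0.001024"}[5m]))'
# per-device rate of IOs slower than ~1ms, usable for alerts and SLOs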
Example 1: timers
ubuntu@ubuntu:/tmp/ebpf_exporter$ cat src/github.com/cloudflare/ebpf_exporter/examples/timers.yaml
programs:
  - name: timers
    metrics:
      counters:
        - name: timer_start_total
          help: Timers fired in the kernel
          table: counts
          labels:
            - name: function
              size: 8
              decoders:
                - name: ksym
    tracepoints:
      timer:timer_start: tracepoint__timer__timer_start
    code: |
      BPF_HASH(counts, u64);
      // Generates function tracepoint__timer__timer_start
      TRACEPOINT_PROBE(timer, timer_start) {
          counts.increment((u64) args->function);
          return 0;
      }
Example 1: timers
ubuntu@ubuntu:/tmp/ebpf_exporter$ wget -qO - http://localhost:9436/metrics \
    | grep '^ebpf_exporter_timer_start_total' | head -n 10
ebpf_exporter_timer_start_total{function="__prandom_timer"} 42
ebpf_exporter_timer_start_total{function="blk_rq_timed_out_timer"} 513
ebpf_exporter_timer_start_total{function="clocksource_watchdog"} 4924
ebpf_exporter_timer_start_total{function="commit_timeout"} 319
ebpf_exporter_timer_start_total{function="delayed_work_timer_fn"} 5268
ebpf_exporter_timer_start_total{function="idle_worker_timeout"} 7
ebpf_exporter_timer_start_total{function="io_watchdog_func"} 8550
ebpf_exporter_timer_start_total{function="mce_timer_fn"} 8
ebpf_exporter_timer_start_total{function="neigh_timer_handler"} 303
ebpf_exporter_timer_start_total{function="pool_mayday_timeout"} 7
Other bundled examples: IPC
Why can instructions per cycle be useful?
See Brendan Gregg’s blog post: “CPU Utilization is Wrong”.
TL;DR: same CPU% may mean different throughput in terms of CPU work done. IPC helps to understand the workload better.
Other bundled examples: LLC (L3 cache)
Why can LLC hit rate be useful?
You can answer questions like:
● Do I need to pay more for a CPU with a bigger L3 cache?
● How does having more cores affect my workload?
LLC hit rate usually follows IPC patterns as well.
Other bundled examples: run queue delay
Why can run queue latency be useful?
See “perf sched for Linux CPU scheduler analysis” by Brendan Gregg.
You can see how contended your system is, how effective the scheduler is, and how changing sysctls can affect that.
It’s surprising how high the delay is by default.
From Scylla: “Reducing latency spikes by tuning the CPU scheduler”.
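As an aside, the scheduler tracepoints that this kind of measurement typically attaches to can be listed with perf:
ubuntu:~$ sudo perf list 'sched:*'
# includes sched:sched_wakeup, sched:sched_wakeup_new and sched:sched_switch,
# the usual attachment points for run queue latency tracing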
How many things can you measure?
Numbers are from a production Cloudflare machine running Linux 4.14:
● 501 hardware events and counters:
○ sudo perf list hw cache pmu | grep '^ [a-z]' | wc -l
● 1853 tracepoints:
○ sudo perf list tracepoint | grep '^ [a-z]' | wc -l
● Any non-inlined kernel function (there’s like a bazillion of them)
● No support for USDT or uprobes yet
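As an aside, the kernel exposes the same tracepoint list through tracefs, which gives a comparable count:
ubuntu:~$ sudo wc -l /sys/kernel/debug/tracing/available_events
# each line is a subsystem:tracepoint pair that BCC’s TRACEPOINT_PROBE can attach to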
Run through runqlat example
eBPF overhead numbers for kprobes
You should always measure the overhead yourself (system CPU time is the metric to watch).
Here’s what we’ve measured for the getpid() syscall:
Where should you run ebpf_exporter
Anywhere the overhead is worth it.
● Simple programs can run anywhere
● Complex programs (e.g. run queue latency) can be gated to:
○ Canary machines where you test upgrades
○ A time window around updates
At Cloudflare we do exactly this, except we use canary datacenters.
Thank you
● https://github.com/cloudflare/ebpf_exporter
● Link to this deck, which supplements
○ Debugging Linux issues with eBPF
● Always looking for interested people

Editor's Notes

  1. Hello. Today I’ll be going through how to build on current eBPF offerings to enhance large-scale monitoring. Traditionally, eBPF is known more as a trace utility, but this presentation, and the tools used within it, hope to change that mindset to one of scalable and federated metrics.
  2. My name is Alex, and I work in R&D at Cloudflare.
  3. I’m Ivan, and I focus on performance and efficiency of our products.
  4. What does Cloudflare do? We’re a web performance and security company. In addition to our community services, like contributing to TLS 1.3 and our open DNS resolver, 1.1.1.1, we deliver content that our customers want shown to consumers faster. There are also a few additions available when using us: for example, we offer web-based ACLs to your assets via OAuth providers, like Google or Okta.
  5. Our anycast network handles over 10 million HTTP and HTTPS requests per second, and over 4.5 million DNS requests per second. We’re constantly pushing the limits of what our network can do, putting our 160+ datacenters to good work.
  6. Back to eBPF. eBPF stands for extended Berkeley Packet Filter, which builds on the classical Berkeley Packet Filter. Many of you are already using BPF, even if you don’t know it. If you’ve typed in a tcpdump filter, then you’ve asked tcpdump to compile BPF code. We use BPF extensively at Cloudflare for our DDoS mitigation pipeline. eBPF allows us to take these packet-oriented tracepoints and generalize them for operating system use.
  7. On Linux, eBPF exists as a virtual machine inside the kernel. Since it runs so low level, a few limits are currently imposed to ensure that it doesn’t cause system instability. Today, eBPF is limited to 4096 instructions, with 512 bytes of stack. Additionally, no loops are allowed, though this can be worked around via unrolling in your preferred templating language. Some architectures include a JIT to transform eBPF into native code, for additional speedup.
  8. There exists a large collection of eBPF utilities made available by the IO Visor Project, called the BPF Compiler Collection, or BCC for short. It’s a good starting point for any eBPF project.
  9. Inside the BCC tools folder, there are many goodies you can run to inspect a machine.
  10. For example, `trace` takes a command-line list of functions to inspect, optionally with a match condition, and prints out the specified arguments of the matches.
  11. `biolatency` summarizes I/O access times of block devices into a nice histogram. You can see that there are 41 accesses that take between 256 and 511 microseconds, within the 10 second capture window requested. If only our production machines were as fast as this idle VM. Feel free to try this command on a loaded production machine, to see how yours differs. More on this example later.
  12. `tcpconnect` taps into the network stack to trace `connect()` syscalls. Here you can see us running wget to Cloudflare’s resolver, with the process, IP, and port information captured.
  13. You can explore the full list of available tools on BCC’s GitHub repo. The extensive list of scripts demonstrates the power of eBPF, making it a strong contender as a unified tracing framework.
  14. However, this isn’t really a talk about BCC. While the current offering of eBPF is amazing for machine debugging, …
  15. …It doesn’t scale to Cloudflare’s size. We can’t have SREs going to individual machines to run standalone scripts, hoping to catch slow I/O or a crashing program in the act. What we really need is…
  16. …always-on collection, constantly exporting low-level metrics normally unavailable to existing monitoring services.
  17. There are two main exporters one can use to get some insight into Linux system performance. The first one is node_exporter, which gives you information about basics like CPU usage breakdown by type, memory usage, disk IO stats, filesystem and network usage. The second one is cAdvisor, which gives similar metrics but drills down to the container level. Instead of seeing total CPU usage you can see which containers (and systemd units are also containers for cAdvisor) use how much of the global resources. This is the absolute minimum of what you should know about your systems. If you don’t have these two, you should start there. Let’s look at the graphs you can get out of this.
  18. I should mention that every screenshot in this talk is from a real production machine doing something useful, but we have different generations, so don’t try to draw any conclusions. Here you can see the basics you get from node_exporter for CPU and memory. You can see the utilization and how much slack resources you have.
  19. Some more metrics from node_exporter, this time for disk IO. There are similar panels for network as well. At the basic level you can do some correlation to explain why CPU went up if you see higher network and disk activity.
  20. With cAdvisor this gets more interesting, since you can now see how global resources like CPU are sliced between services. If CPU or memory usage grows, you can pinpoint exact service that is responsible and you can also see how it affects other services. If global CPU numbers do not change, you can still see shifts between services. All of this information comes from the simplest counters.
  21. Counters are great, but they lack detail about individual events. Let’s take disk IO, for example. We get device IO time from node_exporter and the derivative of that cannot go beyond one second per one second of real time, so we can draw a bold red line and see what kind of utilization we get from our disks. We get one number that characterizes our workload, but that’s simply not enough to understand it. Are we doing many fast IO operations? Are we doing a few slow ones? What kind of mix of slow vs fast do we get?
  22. These questions beg for a histogram. What we have is a counter above, but what we want is a histogram below. Keep in mind that Prometheus histograms are cumulative: the “le” bucket counts all events lower than or equal to its value.
  23. Okay, so we have that histogram. To visualize the difference you get between the two metrics, here are two screenshots of the same event: an SSD replacement in production. We replaced a disk in production and this is how it affected the line and the histogram. In the new Grafana 5 you can plot histograms as heatmaps and it gives you a lot more detail than just one line. The next slide has a bigger screenshot; on this slide you should just see the shift in detail.
  24. Here you can see the buckets color-coded, and each timeslot has its own distribution in a tooltip. It could definitely be better with bar highlights and a double Y axis, but it’s a big step forward from just one line nonetheless. In addition to nice visualizations, you can also plot the number of events above X ms and have alerts and SLOs on that.
  25. And if you were looking closely at these histograms, you may have noticed values on the Y axis are kind of high. Before the replacement you can see values in the 0.5s to 1.0s bucket. Tech specs for the left disk give you 50 microsecond read/write latency and on the right you get a slight decrease to 36 microseconds. That’s not what we see on the histogram at all. Sometimes you can spot this with fio in testing, but production workloads may have patterns that are difficult to replicate and have very different characteristics. Histograms show how it really is. By now you should be convinced that histograms are clearly superior for events where you care about timings of individual events. And if you wanted to switch all your storage latency measurements to histograms, I have tough news for you: the Linux kernel only provides counters for node_exporter. You can try to mess with blktrace for this, but it doesn’t seem practical or efficient. Let’s switch gears a little and see how summary stats from counters can be deceiving.
  26. This is research from Autodesk. The creature on the left is called Datasaurus. Then there are target figures in the middle and animations on the right. The amazing part is that the shapes on the right and all intermediate animation frames for them have the same values for mean and standard deviation for both the X and Y axes. From a summary statistics point of view it’s all the same, but if you plot individual events, a different picture emerges.
  27. Apparently some states have dinosaurs assigned and you can look up your state on Wikipedia.
  28. This is from the same research, here you can clearly see how box plots (mean + stddev) are non-representative of the raw events. Histograms, on the contrary, give an accurate picture.
  29. We established that histograms are what you want. You need individual events to make those histograms. What are the requirements for a system that would handle it, assuming that we want to measure things like IO operations in the Linux kernel? * It has to be low overhead, otherwise we can’t run it in production because of cost * It has to be universal, so we are not locked into just IO tracing — there are plenty of interesting things to measure in the Linux kernel * It has to be supported out of the box, third party kernel modules and patches are not very practical * And finally it has to be safe, we don’t want to crash a large chunk of the internet for metrics
  30. Current offerings, for example cAdvisor, or Prometheus’ node_exporter, don’t offer the fine granularity needed to isolate hotspots on a massive scale. Let’s use an example metric, node_disk_io_time_ms, available from Prometheus’ node_exporter. This value derives from Linux, and is available in /proc/diskstats, field 13. The documentation states that this field is incremented as long as field 12, which counts the number of I/O operations in flight, is nonzero. What that doesn’t distinguish is whether there was one I/O operation in flight, or 100. This results in an on/off counter, making it easy to spot if a machine is busy, but hard to figure out how busy it is. Field 14, represented by node_disk_io_time_weighted, does a slightly better job by first taking the number of I/O operations in flight, using that as a Y axis of sorts, and multiplying it by the time spent doing I/O, in milliseconds. This gives an OK view of how busy a machine is, but still collapses that information down into one value. What we really want are…
  31. Histograms. By not collapsing how long I/O operations take, we can see whether we have many short I/O accesses, or a few long I/O accesses. Like flamegraphs, this gives an additional dimension to draw conclusions from.
  32. We at Cloudflare, or more accurately, Ivan at Cloudflare, have built a connector to marry the low-level debugging capabilities of eBPF with the federated metrics model of Prometheus. Naming things is hard, so we opted to call it ebpf_exporter. At the fundamental level, it inserts eBPF code, similar to BCC. But instead of reporting back to standard out, it exposes its information to a metrics port for a Prometheus scraper to hit.
  33. When combined with our full monitoring pipeline, we can turn…
  34. …Text-based histograms, to…
  35. …Graphs! In this example, you can see that the write latency on this node is clearly divided into a few groups: one for writes under 512 microseconds, one for writes over a millisecond, and one band consistent across the 256 millisecond mark. I would classify this metric as a second-level metric. Generally, your primary metrics are the ones that are user-facing, and the ones that alerts should be based off of, for example “my application runs slowly”. From there, an operator would have a detailed set of instructions as to what to do when someone complains “my application runs slowly”, including checking a dashboard similar to this. If this node is isolated for further debugging, and a dedicated engineer was to look at it, they could run experiments and use this graph as part of the feedback loop. One of the experiments we ran, and eventually rolled out to our fleet, was to basically replace slower disks like this with new ones. And it worked!
  36. Now that you have a general idea of how this connector can be extremely valuable, I’m going to…
  37. …focus more on the ebpf_exporter part. The Prometheus and Grafana setup we have won’t be explained in detail, but there should be enough documentation out there for you to build something similar.
  38. For our example, we’ll be using the latest Ubuntu LTS, Bionic. I’ll be going a little quickly, so don’t feel bad if you can’t keep up, the slides will be available online. Bionic was chosen due to BCC’s first-class support for Ubuntu. The instructions provided here should work for any Linux, provided you build BCC from source.
  39. The following kernel configuration settings are recommended. If you run Linux 4.1 or newer, released in June 2015, you probably already have these flags enabled.
  40. We’ll be running comparisons between what’s offered in BCC tools, as well as what you can build with ebpf_exporter. The first set of commands install BCC tools for Bionic, the second command installs `perf`, and the last command is needed for ebpf_exporter.
  41. Installing ebpf_exporter is as simple as running go get.
  42. Now that we have both BCC tools and ebpf_exporter installed, let’s compare one of the earlier examples, biolatency. When running with a 10 second sample size, you can see the distribution fall between 128 and 1024 microseconds.
  43. Running ebpf_exporter with the bio-tracepoints.yaml configuration also results in a similar dataset, with buckets classified using the Prometheus label “le”. This format, the Prometheus exposition format, is evolving to be part of the larger OpenMetrics project, which aims to standardize how metrics are delivered. All the configuration for ebpf_exporter is set in the yaml file, so let’s take a look.
  44. This example, timers.yaml, is one of the more basic things you can implement. The full specification is available in the GitHub repo. I’ll be highlighting some of the more important parts here. All of the colour-coded parts above must match. The yellow text defines a tracepoint that hooks into timer:timer_start. A full list of tracepoints can be obtained by running `sudo perf list`. Once the tracepoint fires, it executes the orange-red function, defined in the `code` block. The function is generated from the `TRACEPOINT_PROBE` macro, which is documented in BCC. This increments a BPF_HASH called counts, shown in blue, which stores the addresses of the functions called, along with how many times each has been called. The addresses of the functions, in purple, are run through a decoder, which looks at the kernel’s symbol table to resolve function names. The decoder here reads `/proc/kallsyms`. Finally, the map is exported under the `timer_start_total` name, with `function` as a label.
  45. The result is the Prometheus metrics you see here.
  46. Another bundled example can give you IPC, or instructions per cycle, metrics for each CPU. On this production system we can see a few interesting things: * Not all cores are equal, with two cores being outliers (green on the top is zero, yellow on the bottom is one) * Something happened that dropped the IPC of what looks like half of the cores * There’s some variation in daily load cycles that affects IPC
  47. Why can this be useful? Check out Brendan Gregg’s somewhat controversial blog post about CPU utilization. TL;DR is that CPU% does not include memory stalls that do not perform any useful compute work. IPC helps to understand that factor better.
  48. Another bundled example is LLC, or L3 CPU cache, hit rate. This is from the same machine as the IPC metrics, and you can see some major event affecting not just IPC, but also the hit rate.
  49. Why can this be useful? You can see how your CPU cache is doing and you can see how it may be affected by a bigger cache or by more cores sharing the same cache. LLC hit rate usually goes hand in hand with IPC patterns as well.
  50. Now to the fun part. We started with IO latency histograms and they are bundled with ebpf_exporter examples. This is also a histogram, but now for run queue latency. When a process is woken up in the kernel, it’s ready to run. If the CPU is idle, the process can run immediately and the scheduling delay is zero. If the CPU is busy doing some other work, the process is put on a run queue and the CPU picks it up when it’s available. That delay between being ready to run and actually running is the scheduling delay and it affects latency for interactive applications. In cAdvisor you have a counter with this delay; here you have a histogram of the actual values of that delay.
  51. Understanding how you can be delayed is important for understanding the causes of externally visible latency. Check out another blog post by Brendan Gregg to see how you can further trace and understand scheduling delays with the Linux perf tool. It also helps to have a metric that you can observe if you change any scheduling sysctls in the kernel. You can never trust the internet or even your own judgement when changing sysctls. If you can’t measure the effects, you are lying to yourself. There’s also another great post from the Scylla people about their findings. We were able to apply their sysctls and observe the results, which was nice. Things like pinning processes to CPUs also affect scheduling, so this metric is invaluable for such experiments.
  52. The examples I gave are not the only ones possible. In addition to IPC and LLC metrics there are around 500 hardware metrics you can get on a typical server. There are around 2000 tracepoints with a stable ABI you can use on many kernel subsystems. And you can always trace any non-inlined kernel function, but nobody guarantees binary compatibility between releases for those. Some are stable, others not so much. We don’t have support for user statically defined tracepoints (USDT) or uprobes, which means you cannot trace userspace applications. This is something we can reconsider in the future, but in the meantime you can always add metrics to your apps by regular means.
  53. Nothing in this life is free and even low overhead still means some overhead. You should measure this in your own environment, but this is what we’ve seen. For a fast getpid() syscall you get around 34% overhead if you count them by PID. 34% sounds like a lot, but it’s the simplest syscall and 100ns of overhead is about the cost of one memory reference, according to quick googling. For a more complex case where we mix in the command name to copy some memory and add some randomness, the number jumps to 330ns or 105% overhead. We can still do 1.55M ops/s instead of 3.16M ops/s per logical core. If you measure something that doesn’t happen as often on each core, you’re probably not going to notice it. We noticed the run queue latency histogram add around 300ms of system time on a machine with 40 logical cores and 250K scheduling events per second. Your mileage may vary.
  54. So where should you run the exporter then? Anywhere you feel comfortable. If you run simple programs, it doesn’t hurt to run anywhere. If you are concerned about overhead, you can do it only on canary instances and hope for results to be representative. We do exactly that, but instead of canary instances we have a luxury of having a couple of canary datacenters with live customers. They get updates after our offices do, so it’s not as bad as it sounds. For some upgrade events you may want to enable more extensive metrics. An example would be a major distro or kernel upgrade.