This document presents places where CPU cycles are wasted in production systems, based on insights from continuous profiling of large compute clusters. The examples fall into a few buckets: misconfiguration of the underlying OS, surprising behavior or suboptimal choice of dependencies, and excessive serialization/deserialization. Specific issues include slow gettimeofday() calls on older Xen-based AWS instances, cadvisor's recursive directory walks in Kubernetes, exception-based GPU fallback in xgboost, and high CPU usage from zlib. Continuous fleet-wide profiling is becoming an important tool for identifying such performance issues.
Where Cycles Go: Insights from Continuous Profiling at Scale
1. Where Did All These Cycles Go?
Thomas Dullien
CEO at optimyze.cloud
2. Thomas Dullien
CEO at optimyze.cloud
■ 20+ year career in low-level security tools and research:
● Patch diffing (BinDiff), binary code search (VxClass) - acquired by Google
● DRAM flaws (Rowhammer), allocator analysis
■ Came to performance work because the skillset was similar, but results are CO2-friendly.
■ Known as @halvarflake on Twitter, where I don’t self-censor enough
■ Like to be in a hammock by an ocean and read math.
3. Where have all the cycles gone?
This talk presents interesting places where CPU cycles are “lost” or “wasted” in
production systems.
The insights were obtained by deploying fleet- and system-wide continuous
profiling to large compute clusters across different companies.
All examples are from real-life deployments of nontrivial size.
4. Where have all the cycles gone?
The results can be roughly grouped into a few buckets:
■ Misconfiguration of the underlying OS vs. software assumptions
■ Surprising behavior of dependencies
■ Suboptimal choice of dependencies
■ Paying “too much” for serialization / deserialization
6. Clock sources on AWS Xen instances
Symptom: Excessive time spent … looking up the time in kernel space?
7. Understanding the vDSO fast path
[Diagram: a userspace application calls gettimeofday() via libc. The vDSO checks whether it can return the time without a kernel transition; if yes, it answers entirely in userspace, if not, it calls into the kernel and returns the result.]
8. Fast path for gettimeofday() is assumed!
Most software assumes gettimeofday() is very cheap.
This is particularly true of databases (timestamping of rows etc.)
But:
■ Older AWS instances are Xen-based by default and configured with the xen clocksource.
■ The xen clocksource disables the usermode-only gettimeofday() fast path.
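The effect is easy to measure from userspace. Below is a minimal Python sketch; time.time() calls clock_gettime(CLOCK_REALTIME), which goes through the same vDSO fast path as gettimeofday(), so a slow clocksource shows up directly in the per-call cost:

    # Minimal sketch: measure the per-call cost of reading the clock.
    # With a vDSO fast path this stays cheap; with the xen clocksource
    # every read is a real syscall and the number jumps noticeably.
    import time

    N = 1_000_000
    start = time.perf_counter()
    for _ in range(N):
        time.time()
    elapsed = time.perf_counter() - start
    print(f"{elapsed / N * 1e9:.0f} ns per clock read (incl. Python overhead)")

Python's interpreter overhead inflates the absolute numbers; the relative jump between clocksources is what matters.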
9. Fast path for gettimeofday() is assumed!
Configuring TSC-based timekeeping restores the fast path, so gettimeofday() no longer incurs a userspace-kernelspace transition on every call.
More details:
https://blog.packagecloud.io/eng/2017/03/08/system-calls-are-much-slower-on-ec2/
https://heap.io/blog/clocksource-aws-ec2-vdso
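A quick way to check (and what the linked posts describe fixing) is the Linux sysfs clocksource interface; this sketch only reads it, since switching to tsc requires root:

    # Minimal sketch: inspect the kernel clocksource via sysfs (Linux).
    # To switch: echo tsc > .../current_clocksource (as root), or set it
    # permanently via the clocksource= kernel boot parameter.
    BASE = "/sys/devices/system/clocksource/clocksource0"

    with open(f"{BASE}/current_clocksource") as f:
        current = f.read().strip()
    with open(f"{BASE}/available_clocksource") as f:
        available = f.read().split()

    print("current:  ", current)
    print("available:", available)
    if current == "xen" and "tsc" in available:
        print("warning: xen clocksource active; gettimeofday() will syscall")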
10. No longer an issue on Nitro instances
“Modern” instances (m5, m6i etc.) are no longer Xen-based -- the issue goes away.
Important: i3 instances (instances with fast local SSDs) are still on Xen!
Many distributed databases are largely run on i3 ...
15. The underlying issue: cadvisor
k8s.io/kubernetes/vendor/github.com/google/cadvisor/fs.GetDirUsage()
In order to track per-container usage of ephemeral storage, cadvisor does a
recursive directory walk of the ephemeral layer of every running container every
minute.
Lesson: Whatever you do, do not let a container create an excessive number of files in ephemeral storage: this costs CPU and can even exhaust the IOPS budget of EBS volumes.
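For intuition, a GetDirUsage-style walk amounts to roughly the following sketch (the path below is hypothetical): one stat() per file, once a minute, for every container, so the cost scales linearly with the file count.

    # Minimal sketch of what a recursive usage walk costs: one lstat()
    # per file. With millions of files in ephemeral storage, repeating
    # this every minute burns CPU and I/O.
    import os

    def dir_usage(path: str) -> int:
        """Sum apparent file sizes under path, like a recursive du."""
        total = 0
        for root, _dirs, files in os.walk(path):
            for name in files:
                try:
                    total += os.lstat(os.path.join(root, name)).st_size
                except OSError:
                    continue  # file vanished mid-walk
        return total

    print(dir_usage("/var/lib/containers"))  # hypothetical ephemeral-layer path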
17. Throwing exceptions
Setup:
■ Large K8s cluster.
■ Machine learning workload using Python bindings to xgboost 0.81, doing
linear regression.
■ No GPUs on the instances.
Symptom: The single heaviest function by CPU time was… libc-2.28.so:determine_info
19. Throwing exceptions
Root cause:
■ xgboost versions below 1.0 always try to talk to the GPU first
■ If no GPU is found, an exception is thrown
■ Only then does the code fall back to the CPU
■ determine_info was called while building the backtrace for logging the exception
Solution:
■ Recompile xgboost without CUDA support (or upgrade)
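The xgboost code in question is native C++, but the anti-pattern reproduces in a Python sketch (function names hypothetical): probing for a GPU by raising an exception on every call, instead of probing once and caching the result.

    # Minimal sketch of the anti-pattern: an exception-based GPU probe in
    # the hot path vs. a one-time probe. Constructing the exception and
    # its backtrace is what dominated the profile.
    import timeit

    def gpu_available() -> bool:
        return False  # stand-in for "no CUDA device found"

    def predict_throwing(x: float) -> float:
        try:
            raise RuntimeError("no GPU found")  # per-call probe
        except RuntimeError:
            return x * 2.0                      # CPU fallback

    HAS_GPU = gpu_available()                   # probe once, cache result

    def predict_cached(x: float) -> float:
        if HAS_GPU:
            ...  # GPU path would go here
        return x * 2.0

    print("throwing:", timeit.timeit(lambda: predict_throwing(1.0), number=1_000_000))
    print("cached:  ", timeit.timeit(lambda: predict_cached(1.0), number=1_000_000))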
21. Zlib vs. zlib-ng vs. zlib-cloudflare
■ Most deployments spend the majority of their time in dependencies
■ In any given large deployment, your own code is likely the “minority” of CPU
consumption
■ ⇒ Careful choice of dependencies can make a big difference
■ Compression is often a surprisingly heavy component of any workload.
■ We’ve seen 10%-20% of cycles in 2000-core clusters go into zlib
22. Drop-in replacement vs. other algorithm?
■ Proper choice of compression means picking an algorithm on the Pareto frontier
■ Zstd explicitly aims to outperform zlib everywhere, and usually succeeds
■ Results depend on the data distribution etc.
■ If changing the algorithm is not possible, switching to a faster zlib fork will still help!
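Whether zstd wins on your data is cheap to verify. A minimal sketch using the stdlib zlib and the third-party zstandard package (pip install zstandard); the synthetic payload is a stand-in, so feed it a representative sample of your own data:

    # Minimal sketch: compare zlib (stdlib) against zstd on sample data.
    import time
    import zlib
    import zstandard

    data = b"GET /api/v1/items HTTP/1.1\r\nHost: example.com\r\n" * 50_000

    def bench(name, compress):
        start = time.perf_counter()
        out = compress(data)
        elapsed = time.perf_counter() - start
        print(f"{name:8s} ratio={len(data) / len(out):6.2f}  time={elapsed * 1e3:7.1f} ms")

    bench("zlib-6", lambda d: zlib.compress(d, 6))
    bench("zstd-3", zstandard.ZstdCompressor(level=3).compress)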
26. Serialization / deserialization code is often 20-30%+
Examples:
■ A large PySpark computational physics job from CERN: 20% Java-to-Python de/serialization, 10% Python-to-Java de/serialization
■ Customer deployment: Code falling back to pure-Python msgpack
implementation. Replacing with ormsgpack (Rust-implemented Python
msgpack extension) reduced whole-cluster CPU load by 40%.
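The gap is easy to demonstrate. A minimal sketch comparing msgpack's pure-Python fallback implementation with ormsgpack (both third-party packages; the payload here is made up, and real ratios depend on the data shape):

    # Minimal sketch: pure-Python msgpack encoder vs. the Rust-backed
    # ormsgpack. pip install msgpack ormsgpack. msgpack.fallback is the
    # implementation you silently get when the C extension fails to load.
    import timeit
    import ormsgpack
    from msgpack import fallback  # pure-Python implementation

    payload = {"user": 42, "tags": ["a", "b", "c"], "scores": list(range(100))}

    packer = fallback.Packer()
    t_pure = timeit.timeit(lambda: packer.pack(payload), number=10_000)
    t_rust = timeit.timeit(lambda: ormsgpack.packb(payload), number=10_000)

    print(f"pure-Python msgpack: {t_pure:.3f}s")
    print(f"ormsgpack:           {t_rust:.3f}s  ({t_pure / t_rust:.1f}x faster)")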
28. Where have all the cycles gone?
■ Fleet-wide continuous profiling finds many surprising “sinks” for CPU cycles
■ Fleet-wide continuous profiling is becoming common outside hyperscalers
■ Prediction: By 2026, the majority of large infrastructures will be doing
continuous fleet-wide profiling
29. List of whole-system continuous profilers
■ Pixie Labs (link): C++, Go, Rust -- requires symbols for C++ and Rust to unwind
■ Pyroscope (link): Likely possible to do whole-system, requires symbols on
machines
■ Prodfiler (link): Our own continuous profiler,
C/C++/Rust/Go/JVM/Python/PHP/Perl/Kernel, no symbols on-prod, no
recompile for framepointers.