This document presents places where CPU cycles are wasted in production systems, based on insights from continuous profiling of large compute clusters. The examples fall into a few buckets: misconfiguration of the underlying OS, surprising behavior or suboptimal choice of dependencies, and excessive serialization/deserialization. Specific issues include slow gettimeofday() calls on older Xen-based AWS instances, cadvisor's recursive directory walks in Kubernetes, exception-based GPU fallback in xgboost, and high CPU usage from zlib. Continuous fleet-wide profiling is becoming an important tool for identifying such performance issues.
Where Cycles Go: Insights from Continuous Profiling at Scale
1. Where Did All These Cycles Go?
Thomas Dullien
CEO at optimyze.cloud
2. Thomas Dullien
CEO at optimyze.cloud
■ 20+ year career in low-level security tools and research:
● Patch diffing (BinDiff), binary code search (VxClass) - acquired by Google
● DRAM flaws (Rowhammer), allocator analysis
■ Came to performance work because the skillset was similar, but results are CO2-friendly.
■ Known as @halvarflake on Twitter, where I don’t self-censor enough
■ Like to be in a hammock by an ocean and read math.
3. Where have all the cycles gone?
This talk presents interesting places where CPU cycles are “lost” or “wasted” in
production systems.
The insights were obtained by deploying fleet- and system-wide continuous
profiling to large compute clusters across different companies.
All examples are from real-life deployments of nontrivial size.
4. Where have all the cycles gone?
The results can be roughly grouped into a few buckets:
■ Misconfiguration of the underlying OS vs. software assumptions
■ Surprising behavior of dependencies
■ Suboptimal choice of dependencies
■ Paying “too much” for serialization / deserialization
6. Clock sources on AWS Xen instances
Symptom: Excessive time spent … looking up the time in kernel space?
7. Understanding the vDSO fast path
[Diagram: a userspace application calls gettimeofday() via libc. The vDSO checks whether it can return the time without a kernel transition; if yes, it answers entirely in userspace, if not, it calls into the kernel and returns the result.]
8. Fast path for gettimeofday() is assumed!
Most software assumes gettimeofday() is very cheap.
This is particularly true of databases (timestamping of rows etc.)
But:
■ Older AWS instances are Xen-based by default and configured with the xen clocksource.
■ The xen clocksource disables the usermode-only gettimeofday() fast path.
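The effect is easy to measure from userspace. Below is a minimal Python sketch; time.time() calls clock_gettime(CLOCK_REALTIME), which goes through the same vDSO fast path as gettimeofday(), so a slow clocksource shows up directly in the per-call cost:

    # Minimal sketch: measure the per-call cost of reading the clock.
    # With a vDSO fast path this stays cheap; with the xen clocksource
    # every read is a real syscall and the number jumps noticeably.
    import time

    N = 1_000_000
    start = time.perf_counter()
    for _ in range(N):
        time.time()
    elapsed = time.perf_counter() - start
    print(f"{elapsed / N * 1e9:.0f} ns per clock read (incl. Python overhead)")

Python's interpreter overhead inflates the absolute numbers; the relative jump between clocksources is what matters.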
9. Fast path for gettimeofday() is assumed!
Configuring TSC-based timekeeping restores the fast path, so gettimeofday() no longer incurs a userspace-kernelspace transition on every call.
More details:
https://blog.packagecloud.io/eng/2017/03/08/system-calls-are-much-slower-on-ec2/
https://heap.io/blog/clocksource-aws-ec2-vdso
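A quick way to check (and what the linked posts describe fixing) is the Linux sysfs clocksource interface; this sketch only reads it, since switching to tsc requires root:

    # Minimal sketch: inspect the kernel clocksource via sysfs (Linux).
    # To switch: echo tsc > .../current_clocksource (as root), or set it
    # permanently via the clocksource= kernel boot parameter.
    BASE = "/sys/devices/system/clocksource/clocksource0"

    with open(f"{BASE}/current_clocksource") as f:
        current = f.read().strip()
    with open(f"{BASE}/available_clocksource") as f:
        available = f.read().split()

    print("current:  ", current)
    print("available:", available)
    if current == "xen" and "tsc" in available:
        print("warning: xen clocksource active; gettimeofday() will syscall")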
10. No longer an issue on Nitro instances
“Modern” instances (m5, m6i etc.) are no longer Xen-based -- the issue goes away.
Important: i3 instances (instances with fast local SSDs) are still on Xen!
Many distributed databases are largely run on i3 ...
15. The underlying issue: cadvisor
k8s.io/kubernetes/vendor/github.com/google/cadvisor/fs.GetDirUsage()
In order to track per-container usage of ephemeral storage, cadvisor does a
recursive directory walk of the ephemeral layer of every running container every
minute.
Lesson: Whatever you do, do not let a container create an excessive number of files in ephemeral storage: this costs CPU and can even exhaust the IOPS budget of EBS volumes.
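For intuition, a GetDirUsage-style walk amounts to roughly the following sketch (the path below is hypothetical): one stat() per file, once a minute, for every container, so the cost scales linearly with the file count.

    # Minimal sketch of what a recursive usage walk costs: one lstat()
    # per file. With millions of files in ephemeral storage, repeating
    # this every minute burns CPU and I/O.
    import os

    def dir_usage(path: str) -> int:
        """Sum apparent file sizes under path, like a recursive du."""
        total = 0
        for root, _dirs, files in os.walk(path):
            for name in files:
                try:
                    total += os.lstat(os.path.join(root, name)).st_size
                except OSError:
                    continue  # file vanished mid-walk
        return total

    print(dir_usage("/var/lib/containers"))  # hypothetical ephemeral-layer path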
17. Throwing exceptions
Setup:
■ Large K8s cluster.
■ Machine learning workload using Python bindings to xgboost 0.81, doing
linear regression.
■ No GPUs on the instances.
Symptom: The single heaviest function by CPU time was… libc-2.28.so:determine_info
19. Throwing exceptions
Root cause:
■ xgboost versions below 1.0 always try to talk to the GPU first
■ If no GPU is found, an exception is thrown
■ Only then does the code fall back to the CPU
■ determine_info was called while building the backtrace for logging the exception
Solution:
■ Recompile xgboost without CUDA support (or upgrade)
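The xgboost code in question is native C++, but the anti-pattern reproduces in a Python sketch (function names hypothetical): probing for a GPU by raising an exception on every call, instead of probing once and caching the result.

    # Minimal sketch of the anti-pattern: an exception-based GPU probe in
    # the hot path vs. a one-time probe. Constructing the exception and
    # its backtrace is what dominated the profile.
    import timeit

    def gpu_available() -> bool:
        return False  # stand-in for "no CUDA device found"

    def predict_throwing(x: float) -> float:
        try:
            raise RuntimeError("no GPU found")  # per-call probe
        except RuntimeError:
            return x * 2.0                      # CPU fallback

    HAS_GPU = gpu_available()                   # probe once, cache result

    def predict_cached(x: float) -> float:
        if HAS_GPU:
            ...  # GPU path would go here
        return x * 2.0

    print("throwing:", timeit.timeit(lambda: predict_throwing(1.0), number=1_000_000))
    print("cached:  ", timeit.timeit(lambda: predict_cached(1.0), number=1_000_000))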
21. Zlib vs. zlib-ng vs. zlib-cloudflare
■ Most deployments spend the majority of their time in dependencies
■ In any given large deployment, your own code is likely the “minority” of CPU
consumption
■ ⇒ Careful choice of dependencies can make a big difference
■ Compression is often a surprisingly heavy component of any workload.
■ We’ve seen 10%-20% of cycles in 2000-core clusters go into zlib
22. Drop-in replacement vs. other algorithm?
■ Proper choice of compression means picking an algorithm on the Pareto frontier
■ Zstd explicitly aims to outperform zlib everywhere, and usually succeeds
■ Results depend on the data distribution etc.
■ If changing the algorithm is not possible, switching to a faster zlib fork will still help!
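Whether zstd wins on your data is cheap to verify. A minimal sketch using the stdlib zlib and the third-party zstandard package (pip install zstandard); the synthetic payload is a stand-in, so feed it a representative sample of your own data:

    # Minimal sketch: compare zlib (stdlib) against zstd on sample data.
    import time
    import zlib
    import zstandard

    data = b"GET /api/v1/items HTTP/1.1\r\nHost: example.com\r\n" * 50_000

    def bench(name, compress):
        start = time.perf_counter()
        out = compress(data)
        elapsed = time.perf_counter() - start
        print(f"{name:8s} ratio={len(data) / len(out):6.2f}  time={elapsed * 1e3:7.1f} ms")

    bench("zlib-6", lambda d: zlib.compress(d, 6))
    bench("zstd-3", zstandard.ZstdCompressor(level=3).compress)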
26. Serialization / deserialization code is often 20-30%+
Examples:
■ A large PySpark computational physics job from CERN: 20% Java-to-Python de/serialization, 10% Python-to-Java de/serialization
■ Customer deployment: Code falling back to pure-Python msgpack
implementation. Replacing with ormsgpack (Rust-implemented Python
msgpack extension) reduced whole-cluster CPU load by 40%.
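The gap is easy to demonstrate. A minimal sketch comparing msgpack's pure-Python fallback implementation with ormsgpack (both third-party packages; the payload here is made up, and real ratios depend on the data shape):

    # Minimal sketch: pure-Python msgpack encoder vs. the Rust-backed
    # ormsgpack. pip install msgpack ormsgpack. msgpack.fallback is the
    # implementation you silently get when the C extension fails to load.
    import timeit
    import ormsgpack
    from msgpack import fallback  # pure-Python implementation

    payload = {"user": 42, "tags": ["a", "b", "c"], "scores": list(range(100))}

    packer = fallback.Packer()
    t_pure = timeit.timeit(lambda: packer.pack(payload), number=10_000)
    t_rust = timeit.timeit(lambda: ormsgpack.packb(payload), number=10_000)

    print(f"pure-Python msgpack: {t_pure:.3f}s")
    print(f"ormsgpack:           {t_rust:.3f}s  ({t_pure / t_rust:.1f}x faster)")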
28. Where have all the cycles gone?
■ Fleet-wide continuous profiling finds many surprising “sinks” for CPU cycles
■ Fleet-wide continuous profiling is becoming common outside hyperscalers
■ Prediction: By 2026, the majority of large infrastructures will be doing
continuous fleet-wide profiling
29. List of whole-system continuous profilers
■ Pixie Labs (link): C++, Go, Rust -- requires symbols for C++ and Rust to unwind
■ Pyroscope (link): Likely possible to do whole-system, requires symbols on
machines
■ Prodfiler (link): Our own continuous profiler,
C/C++/Rust/Go/JVM/Python/PHP/Perl/Kernel, no symbols on-prod, no
recompile for framepointers.