
Where Did All These Cycles Go?

Modern systems are large and complicated, and it is often difficult to account precisely for where CPU cycles are spent in production.

Once you begin measuring, you will find all sorts of surprises - like unearthing strange objects while cleaning out an attic that has accumulated stuff for decades.

This talk discusses surprising places where we found CPU waste in real-world production environments: from the Kubelet consuming several percent of whole-cluster CPU, via popular machine-learning libraries spending their time juggling exceptions instead of classifying, to EC2 time sources being much slower than necessary. CPU cycles are lost in surprising places, and often not in your own code.

  1. Brought to you by: Where Did All These Cycles Go? Thomas Dullien, CEO at optimyze.cloud
  2. Thomas Dullien, CEO at optimyze.cloud
     ■ 20+ year career in low-level security tools and research:
       ● Patch diffing (BinDiff), binary code search (VxClass) - acquired by Google
       ● DRAM flaws (Rowhammer), allocator analysis
     ■ Came to performance work because the skillset was similar, but the results are CO2-friendly.
     ■ Known as @halvarflake on Twitter, where I don't self-censor enough.
     ■ Like to be in a hammock by an ocean and read math.
  3. Where have all the cycles gone? This talk presents interesting places where CPU cycles are "lost" or "wasted" in production systems. The insights were obtained by deploying fleet- and system-wide continuous profiling to large compute clusters across different companies. All examples are from real-life deployments of nontrivial size.
  4. Where have all the cycles gone? The results can be roughly grouped into a few buckets:
     ■ Misconfiguration of the underlying OS vs. software assumptions
     ■ Surprising behavior of dependencies
     ■ Suboptimal choice of dependencies
     ■ Paying "too much" for serialization / deserialization
  5. The "classic" issue: Clock sources on AWS Xen instances (m4, i3)
  6. Clock sources on AWS Xen instances. Symptom: Excessive time spent … looking up the time in kernel space?
  7. Understanding the vDSO fast path. [Diagram: the userspace application calls gettimeofday() in libc, which calls into the vDSO; the vDSO asks "Can I return the time without a kernel transition?" If yes, it does so; if not, it calls into the kernel to get the time and returns the result.]
  8. The fast path for gettimeofday() is assumed! Most software assumes gettimeofday() is very cheap. This is particularly true of databases (timestamping of rows etc.). But:
     ■ Old AWS instances are Xen by default, and configured with the xen clock source.
     ■ The xen clock source disables the userspace-only gettimeofday() fast path.
  9. The fast path for gettimeofday() is assumed! Configuring TSC-based timekeeping prevents the CPU from doing excessive userspace-kernelspace transitions when doing gettimeofday(). More details:
     https://blog.packagecloud.io/eng/2017/03/08/system-calls-are-much-slower-on-ec2/
     https://heap.io/blog/clocksource-aws-ec2-vdso
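To see whether an instance is affected, a minimal sketch along these lines works (assuming Linux and Python 3; the sysfs path and the timing loop are standard Linux/Python, not taken from the slides): it reads the active clock source and measures the per-call cost of a wall-clock lookup.

# Minimal sketch (assumes Linux + Python 3): read the active clock source
# and measure how expensive wall-clock lookups are from userspace.
import time

CLOCKSOURCE = "/sys/devices/system/clocksource/clocksource0/current_clocksource"

def current_clocksource() -> str:
    # On Xen-based EC2 instances this typically reads "xen";
    # with the vDSO fast path available it is usually "tsc".
    with open(CLOCKSOURCE) as f:
        return f.read().strip()

def time_lookup_cost(iterations: int = 1_000_000) -> float:
    # time.time() ends up in clock_gettime(): with the "xen" clock source every
    # call is a real syscall, with "tsc" it can stay in userspace via the vDSO.
    start = time.perf_counter()
    for _ in range(iterations):
        time.time()
    return (time.perf_counter() - start) / iterations

if __name__ == "__main__":
    print("clock source:", current_clocksource())
    print("per-call cost: %.0f ns" % (time_lookup_cost() * 1e9))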
  10. No longer an issue on Nitro instances. "Modern" instances (m5, m6i etc.) are no longer Xen-based -- the issue goes away. Important: i3 instances (instances with fast local SSDs) are still on Xen! Many distributed databases are largely run on i3 ...
  11. Kubelet walking directories
  12. Kubelet walking directories. Symptom: the Kubelet eating significant quantities of CPU time.
  13. Flamegraphs to the rescue
  14. [Clear slide for diagram with caption]
  15. The underlying issue: cadvisor
     k8s.io/kubernetes/vendor/github.com/google/cadvisor/fs.GetDirUsage()
     In order to track per-container usage of ephemeral storage, cadvisor does a recursive directory walk of the ephemeral layer of every running container, every minute.
     Lesson: Whatever you do, do not have a container create an excessive number of files in ephemeral storage: it can cost CPU and even exhaust IOPS on EBS volumes.
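What that walk amounts to can be shown with a small stand-in in Python (cadvisor itself is Go, and the path below is hypothetical): stat every file under a directory tree and sum the sizes. With millions of small files in ephemeral storage, repeating this every minute for every container adds up.

# Illustrative stand-in (not cadvisor's code) for a recursive per-directory
# usage scan: walk the tree and stat every file.
import os

def dir_usage_bytes(root: str) -> int:
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                total += os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                # Files can disappear while the container is running.
                continue
    return total

if __name__ == "__main__":
    # Hypothetical path; point it at a directory with many small files to see the cost.
    print(dir_usage_bytes("/var/lib/docker"))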
  16. Throwing and catching exceptions
  17. Throwing exceptions. Setup:
     ■ Large K8s cluster.
     ■ Machine-learning workload using the Python bindings to xgboost 0.81, doing linear regression.
     ■ No GPUs on the instances.
     Symptom: the function where most CPU time was spent was… libc-2.28.so:determine_info
  18. Following the stack trace:
     3) dmlc::StackTrace()
     4) dmlc::LogMessageFatal::~LogMessageFatal()
     5) dh::ThrowOnCudaError(cudaError, char const*, int)
     6) xgboost::AllVisibleImpl::AllVisible()
     7) xgboost::obj::RegLossObj<xgboost::obj::LogisticClassification>::Configure(std::vector<std::pair<std::string, std::string>, std::allocator<std::pair<std::string, std::string> > > const&)
     8) void xgboost::ObjFunction::Configure<std::_Rb_tree_iterator<std::pair<std::string const, std::string> >>(std::_Rb_tree_iterator<std::pair<std::string const, std::string> >, std::_Rb_tree_iterator<std::pair<std::string const, std::string> >)
     9) xgboost::LearnerImpl::Load(dmlc::Stream*)
     10) XGBoosterLoadModelFromBuffer
  19. Throwing exceptions. Root cause:
     ■ xgboost versions below 1.0 always try to talk to the GPU first.
     ■ If no GPU is found, an exception is thrown.
     ■ Only after this is done does the code fall back to the CPU.
     ■ determine_info was called during the creation of the backtrace for logging.
     Solution:
     ■ Recompile xgboost without CUDA support (or upgrade).
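The lesson generalizes beyond xgboost: raising an exception and building a backtrace on every call is far more expensive than a plain capability check. A generic Python illustration (not the xgboost code path, which is C++) makes the gap visible:

# Generic illustration: exception-driven fallback (including backtrace capture
# for logging) vs. a cheap explicit check.
import timeit
import traceback

def check_gpu_flag(has_gpu: bool = False) -> bool:
    # Cheap explicit capability check.
    return has_gpu

def check_gpu_exception(has_gpu: bool = False) -> bool:
    # Exception-driven fallback, similar in spirit to what
    # ThrowOnCudaError + dmlc::StackTrace() did on every model load.
    try:
        if not has_gpu:
            raise RuntimeError("no CUDA device found")
        return True
    except RuntimeError:
        traceback.format_stack()  # cost of building the backtrace
        return False

if __name__ == "__main__":
    n = 100_000
    print("flag check:     %.3fs" % timeit.timeit(check_gpu_flag, number=n))
    print("exception path: %.3fs" % timeit.timeit(check_gpu_exception, number=n))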
  20. zlib vs. zlib-ng
  21. zlib vs. zlib-ng vs. zlib-cloudflare
     ■ Most deployments spend the majority of their time in dependencies.
     ■ In any given large deployment, your own code is likely the "minority" of CPU consumption.
     ■ ⇒ Careful choice of dependencies can make a big difference.
     ■ Compression is often a surprisingly heavy component of any workload.
     ■ We've seen 10%-20% of cycles in 2000-core clusters go into zlib.
  22. Drop-in replacement vs. another algorithm?
     ■ A proper choice of compression means picking an algorithm on the Pareto frontier.
     ■ Zstd explicitly aims to outperform zlib everywhere, and usually succeeds.
     ■ Dependent on data distribution etc.
     ■ If changing the algorithm is not possible, changing the zlib fork will help!
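Whether zstd (or a faster zlib fork) wins on your data is easy to measure. A minimal sketch, assuming the third-party zstandard package is installed and using a hypothetical sample file; zlib-ng and zlib-cloudflare are drop-in C-level replacements that keep the zlib API, so they do not show up as a separate Python module.

# Minimal sketch: compare stdlib zlib against zstd on your own data.
# Assumes "pip install zstandard"; the sample input path is hypothetical.
import time
import zlib
import zstandard

def bench(name, compress, data, rounds=20):
    start = time.perf_counter()
    for _ in range(rounds):
        out = compress(data)
    elapsed = time.perf_counter() - start
    print(f"{name:12s} ratio={len(data)/len(out):5.2f}  {rounds*len(data)/elapsed/1e6:7.1f} MB/s")

if __name__ == "__main__":
    data = open("/var/log/syslog", "rb").read()  # hypothetical sample; use your own data
    bench("zlib -6", lambda d: zlib.compress(d, 6), data)
    cctx = zstandard.ZstdCompressor(level=3)
    bench("zstd -3", cctx.compress, data)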
  23. Serialization and Deserialization
  24. Serialization / deserialization code is often 20-30%+. Examples:
     ■ A large PySpark computational-physics job from CERN: 20% Java-to-Python de/serialization, 10% Python-to-Java de/serialization.
     ■ Customer deployment: code falling back to the pure-Python msgpack implementation. Replacing it with ormsgpack (a Rust-implemented Python msgpack extension) reduced whole-cluster CPU load by 40%.
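The msgpack case is easy to reproduce in miniature. A minimal sketch, assuming the msgpack and ormsgpack packages are installed; msgpack.fallback is the pure-Python implementation that deployments silently end up on when the C extension is missing.

# Minimal sketch: pure-Python msgpack fallback vs. the Rust-based ormsgpack.
import timeit
import msgpack.fallback
import ormsgpack

record = {"user_id": 12345, "tags": ["a", "b", "c"], "scores": list(range(100))}

def pure_python_pack():
    # What you get when the msgpack C extension is not available.
    return msgpack.fallback.Packer().pack(record)

def rust_pack():
    return ormsgpack.packb(record)

if __name__ == "__main__":
    n = 10_000
    print("pure-Python msgpack: %.3fs" % timeit.timeit(pure_python_pack, number=n))
    print("ormsgpack:           %.3fs" % timeit.timeit(rust_pack, number=n))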
  25. Summary
  26. Where have all the cycles gone?
     ■ Fleet-wide continuous profiling finds many surprising "sinks" for CPU cycles.
     ■ Fleet-wide continuous profiling is becoming common outside the hyperscalers.
     ■ Prediction: by 2026, the majority of large infrastructures will be doing continuous fleet-wide profiling.
  27. List of whole-system continuous profilers
     ■ Pixie Labs (link): C++, Go, Rust -- requires symbols for C++ and Rust to unwind.
     ■ Pyroscope (link): likely possible to do whole-system profiling; requires symbols on the machines.
     ■ Prodfiler (link): our own continuous profiler; C/C++/Rust/Go/JVM/Python/PHP/Perl/kernel, no symbols in production, no recompile for frame pointers.
  28. Questions?
  29. Brought to you by Thomas Dullien / Halvar Flake, thomasdullien@optimyze.cloud, @halvarflake on Twitter
