Developers love Linux containers, which neatly package up an application and its dependencies and are easy to create and share. However, this unbeatable developer experience hides some deployment challenges for real applications: how do you wire together pieces of a multi-container application? Where do you store your persistent data if your containers are ephemeral? Do containers really contain and isolate your application, or are they merely hiding potential security vulnerabilities? Are your containers scheduled across your compute resources efficiently, or are they trampling on one another?
Container application platforms like Kubernetes answer some of these questions. We’ll draw on expertise in Linux security, distributed scheduling, and the Java Virtual Machine to examine the performance and security implications of running in containers, with a focus on tuning and orchestrating containerized Spark applications. You’ll leave this talk with an understanding of the relevant issues, best practices for containerizing data-processing workloads, and tips for taking advantage of the latest features and fixes in Linux containers, the JDK, and Kubernetes. You’ll also be ready to deploy high-performance Spark applications without giving up the security you need or the developer-friendly workflow you want.
8. What is a container?
…a lightweight VM?
…a way to totally isolate applications?
…a packaging format for a container runtime or orchestration platform?
20. What is a container?
…a lightweight VM?
…a way to totally isolate applications?
…a packaging format for a container runtime or orchestration platform?
…a lightweight means to address some of the same use cases as VMs.
…a way to provide reasonable, not exhaustive application isolation.
…yes, but really just any Linux process with some special settings!
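The "special settings" are largely kernel namespaces and control groups, both of which any process can inspect through /proc. The following is a minimal sketch, assuming a Linux host with /proc mounted; the function names parse_cgroup and namespaces are hypothetical helpers introduced for illustration and are not part of any container runtime.

```python
import os
from pathlib import Path


def parse_cgroup(text):
    """Parse /proc/<pid>/cgroup content into (hierarchy_id, controllers, path) tuples."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Each line is "hierarchy-ID:controller-list:cgroup-path".
        # The path may itself contain colons, so split at most twice.
        hierarchy_id, controllers, path = line.split(":", 2)
        names = controllers.split(",") if controllers else []
        entries.append((int(hierarchy_id), names, path))
    return entries


def namespaces(pid="self"):
    """Map each namespace type to its identifier, e.g. {'pid': 'pid:[4026531836]'}."""
    ns_dir = Path("/proc") / str(pid) / "ns"
    if not ns_dir.is_dir():
        return {}  # not Linux, or /proc is not mounted
    result = {}
    for entry in ns_dir.iterdir():
        try:
            # Each entry is a symlink whose target names the namespace instance.
            result[entry.name] = os.readlink(entry)
        except OSError:
            continue  # entry vanished or is not readable
    return result
```

Running namespaces() inside and outside a container and comparing the identifiers shows which namespaces the process does not share with the host; feeding the contents of /proc/self/cgroup to parse_cgroup shows which control groups constrain it.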
64. Potential performance pitfalls
Virtualized networking likely has minimal impact on overall application performance!
…but measure the performance of your I/O configuration!
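One rough way to act on "measure" is to time sequential writes and reads against the volume a container actually uses. The sketch below is an assumption-laden starting point, not a benchmark from the talk: measure_io is a hypothetical helper, the sizes are arbitrary, and the read figure will often reflect the page cache rather than the device.

```python
import os
import tempfile
import time


def measure_io(directory, size_mb=16, block_kb=1024):
    """Return (write_mb_per_s, read_mb_per_s) for a sequential file in `directory`."""
    block = os.urandom(block_kb * 1024)
    blocks = (size_mb * 1024) // block_kb
    with tempfile.NamedTemporaryFile(dir=directory, delete=False) as f:
        path = f.name
        start = time.perf_counter()
        for _ in range(blocks):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())  # include the cost of reaching the device
        write_s = time.perf_counter() - start
    try:
        start = time.perf_counter()
        with open(path, "rb") as f:
            while f.read(block_kb * 1024):
                pass
        read_s = time.perf_counter() - start
    finally:
        os.remove(path)  # always clean up the scratch file
    megabytes = blocks * block_kb / 1024
    return megabytes / write_s, megabytes / read_s
```

Calling measure_io on a container-local directory and again on a mounted volume gives a first-order comparison; a dedicated tool is a better choice for anything beyond a sanity check.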
70. Architectural takeaways
Spark executors are already microservices.
Consider using a single Spark cluster per application for flexible scheduling and easy deployments.
Persistent storage lives outside of containers and is probably best accessed via service interfaces rather than through filesystem interfaces.
71. Security takeaways
It isn’t safe to run arbitrary code just because you put it in a container.
Use SELinux to minimize your exposure to error and malice.
Don’t run as root unless you absolutely have to (and you probably don’t).
Ad hoc mechanisms for configuring secrets are likely to leak information and are almost always a bad idea.
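Two of these points can be checked from inside the application itself. The sketch below assumes a POSIX container and assumes secrets are mounted as files by the platform; running_as_root, read_secret, and the /var/run/secrets/app directory are hypothetical names chosen for illustration. Reading from a mounted file avoids command-line arguments and environment variables, which are visible through process listings and /proc/<pid>/environ.

```python
import os
from pathlib import Path


def running_as_root():
    """True when the effective user ID is 0 (always False where geteuid is missing)."""
    return hasattr(os, "geteuid") and os.geteuid() == 0


def read_secret(name, secrets_dir="/var/run/secrets/app"):
    """Read a secret from a file mounted into the container.

    `secrets_dir` is a hypothetical mount point; use whatever path your
    platform's secret mechanism actually provides.
    """
    path = Path(secrets_dir) / name
    # Strip the trailing newline that most tools add when writing the file.
    return path.read_text().strip()
```

An entry point might refuse to start when running_as_root() is true, then call read_secret("db-password") instead of reading a password from its arguments.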
72. Performance takeaways
Avoid hypervisor overhead by isolating workloads with containers rather than full virtual machines.
Measure everything, but virtualized networking likely has a minimal performance impact on real applications.
Artificially throttled performance can be a real problem. Experiment with JVM settings, including serial GC, to reduce your chance of getting limited.
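Throttling typically happens when a JVM sizes its thread pools for the host's CPU count while the container's CPU quota allows far less. The sketch below derives the CPU limit from cgroup quota values and suggests JVM flags to experiment with. The helper names and the "two CPUs or fewer means serial GC" threshold are assumptions made for illustration, not recommendations from the talk; -XX:ActiveProcessorCount requires a JDK that supports it.

```python
import math


def cpu_limit(quota_us, period_us):
    """CPUs allowed by a CFS quota, or None when there is no limit."""
    if quota_us is None or quota_us <= 0 or period_us <= 0:
        return None  # cgroup v1 reports -1 for "unlimited"
    return quota_us / period_us


def parse_cpu_max(text):
    """Parse cgroup v2 cpu.max content, e.g. "max 100000" or "200000 100000"."""
    quota, period = text.split()
    return cpu_limit(None if quota == "max" else int(quota), int(period))


def jvm_options(limit):
    """Suggest JVM flags for a container limited to `limit` CPUs."""
    if limit is None:
        return []  # no quota: leave the JVM defaults alone
    cpus = max(1, math.ceil(limit))
    opts = [f"-XX:ActiveProcessorCount={cpus}"]
    if cpus <= 2:
        # Assumed threshold: with very few CPUs, a single-threaded
        # collector avoids spawning GC threads that only get throttled.
        opts.append("-XX:+UseSerialGC")
    else:
        opts.append(f"-XX:ParallelGCThreads={cpus}")
    return opts
```

The resulting flags could be joined with spaces and passed through spark.executor.extraJavaOptions; measure with and without them, since the right choice depends on the workload.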
73. Configuration takeaways
If you consume logs from standard output and error, consider using an alternate stack trace formatter to get exceptions in a single log record.
If you use ephemeral user IDs, set SPARK_USER or use nss_wrapper so Hadoop file libraries won’t get confused.
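Both ideas can be sketched on the Python side of a Spark application. This is an analogue under stated assumptions, not the JVM formatter the slide refers to: SingleLineFormatter escapes newlines so a traceback stays in one record, and ensure_spark_user (a hypothetical helper, with "spark" as an arbitrary fallback name) sets SPARK_USER only when the current UID has no passwd entry. It assumes a POSIX system and must run before the JVM is launched.

```python
import logging
import os
import pwd  # POSIX only


class SingleLineFormatter(logging.Formatter):
    """Keep each record, including any traceback, on one physical line."""

    def format(self, record):
        text = super().format(record)
        # Escape backslashes first so the escaped newlines stay unambiguous.
        return text.replace("\\", "\\\\").replace("\n", "\\n")


def ensure_spark_user(env=None, uid=None, lookup=pwd.getpwuid, fallback="spark"):
    """Set SPARK_USER when the current UID has no passwd entry.

    Returns the SPARK_USER value in effect, or None when the UID
    resolves normally and nothing needed to be set.
    """
    env = os.environ if env is None else env
    uid = os.getuid() if uid is None else uid
    if "SPARK_USER" in env:
        return env["SPARK_USER"]
    try:
        lookup(uid)  # raises KeyError for an ephemeral UID
    except KeyError:
        env["SPARK_USER"] = fallback
        return fallback
    return None
```

To use the formatter, attach it to a logging.StreamHandler on standard error; a log collector that splits on newlines then sees each exception as a single record.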