Kubernetes is an open source system for deploying, scaling, and managing containerized applications anywhere. It builds on 15 years of running Google's containerized workloads and on the valuable contributions of the open source community. To shepherd Kubernetes' evolution with that community, Google helped form the Cloud Native Computing Foundation (CNCF) and donated Kubernetes as its founding project. Starting with Spark 2.3.0, Spark has an experimental option to run on clusters managed by Kubernetes, using the native Kubernetes scheduler that has been added to Spark. In this talk, we will provide a baseline understanding of what Kubernetes is, why it is relevant to the Spark community, and how it compares to YARN. We will then look under the hood of Spark managed by Kubernetes to better understand how it works. Finally, we provide an early evaluation of this feature, as well as our thoughts on the future of running Spark on Kubernetes.
4. Google has 20+ years experience solving Data Problems
[Timeline, 2002–2016: Google Research papers — GFS, MapReduce, BigTable, Dremel, FlumeJava, Millwheel, TensorFlow — alongside open source work since 2005, including Apache Beam, and Google Cloud products: BigQuery, Pub/Sub, Dataflow, Bigtable, ML, Dataproc]
5. GCP Open Data Ecosystem
- Cloud Composer (Apache Airflow)
- Cloud Dataflow (Apache Beam)
- Cloud Dataproc (Apache Spark/Hadoop)
6. What is Cloud Dataproc?
Google Cloud Platform’s fully-managed Apache Spark and Apache Hadoop service
- Rapid cluster creation
- Familiar open source tools
- Ephemeral clusters on-demand
- Customizable machines
- Tightly integrated with other Google Cloud Platform services
7. Cloud Dataproc in 2018…
More than 30 features launched
Fast
- Clusters from YAML
- Cloud Storage connector optimizations
- Weekly updates
- OSS performance tuning
Easy
- Custom images
- Stackdriver monitoring
- Workflow templates
- Workflow parameters
- Optional components
Cost-effective
- Autoscaling
- Granular IAM
- CMEK support for multiple products
- Graceful decommissioning
- Global expansion to 6 new regions
8. Cloud Dataproc Internals
[Architecture diagram: a frontend (CLI, GUI, API) reaches Dataproc through GFEs (dataproc.googleapis.com and dataproc.control.googleapis.com); the control plane — JobService, TaskService, Job Dispatcher, and Task Dispatcher, backed by Spanner — runs on Google Borg; a Dataproc Agent on the cluster drives Apache Hadoop YARN in user land, with job artifacts in GCS ($ hadoop jar ...)]
21. ● Integrates with BigQuery, Google’s serverless data warehouse
● Provides Google Cloud Storage as a replacement for HDFS
● Ships logs to Stackdriver Monitoring
  ○ via a Prometheus server with the Stackdriver sidecar
● Contains sparkctl, a command line tool that simplifies handling client-local application dependencies in a Kubernetes environment.
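The operator's features above center on a SparkApplication custom resource. A minimal sketch of one, assuming the operator's v1beta2 CRD; the application name, image tag, and jar path are illustrative, not from the slides:

```yaml
# Sketch: a SparkApplication for the Kubernetes Operator for Apache Spark.
# Names, image, and jar path below are assumptions for illustration.
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi          # hypothetical application name
  namespace: default
spec:
  type: Scala
  mode: cluster
  image: gcr.io/spark-operator/spark:v2.4.0    # assumed operator image
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar
  sparkVersion: "2.4.0"
  driver:
    cores: 1
    memory: 512m
  executor:
    cores: 1
    instances: 2
    memory: 512m
```

With a manifest like this, sparkctl can submit the application (e.g. `sparkctl create spark-pi.yaml`), handling any client-local dependencies along the way.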
Intro to my presentation:
Story about my boss coming to me and asking for our K8s plan. What is K8s? Has anyone else had that happen yet?
We've got a good thing going with YARN, so why would I rip all that out? Where does this fit?
I'm going to walk through my findings to save you time.
Most of you have probably seen the Google Search box. Behind Google Search is a tremendous amount of infrastructure – and machine learning and analytics foundation – that make those wonderfully simple experiences possible.
20 years of experience in building secure and trusted infrastructure for processing massive volumes of data! The universe of data! Google has also been a leader in the field of machine learning and AI, to better organize the world’s information and make it available to everyone.
From EBC deck: https://docs.google.com/presentation/d/1Am33uS23Hkdew-OsFLAmzB1XjPIQSo3bq4CAtvNaQps/edit#slide=id.g404c2253c6_0_928
Enterprises, planet-scale internet companies, disruptive start-ups are innovating and transforming their businesses with Google Cloud data analytics solutions.
Speaker Notes: Google has a long history of innovating in the data space. Papers on GFS and MapReduce are widely credited with inspiring the first versions of Hadoop. Later papers have resulted in projects such as HBase, Crunch, Drill, Beam, and TensorFlow. Google has been democratizing these internal technologies through GCP products since ~2011, starting with BigQuery.
Dataproc as a processing engine gives customers a managed cloud experience but without having to re-architect applications and code. It also provides deep integrations with the rest of GCP making it easy to mix open source solutions alongside native GCP services.
https://kubernetes.io/
Kubernetes is Greek for "helmsman" or "pilot". The project started in 2014, based on experience with Google's internal container management system.
Most users only really care that you provide them with an API.
And most operators only really care that they have a running container cluster.
Kubernetes is an open source project (available on kubernetes.io) that can run on many different environments, from laptops to high-availability multi-node clusters, from public clouds to on-premise deployments, from virtual machines to bare metal.
At the highest level, it is a set of APIs that you can use to deploy containers on a set of nodes.
The system is divided into a set of master components that run as the control plane and a set of nodes that run containers.
Users access your API via a command-line interface, HTTP, or a user interface.
A Container Cluster is a Google abstraction that relates Kubernetes to the GCE infrastructure: a collection of GCE VM instances, consisting of the Kubernetes master endpoint and one or more node instances.
Now that you understand the physical relationship between GCE and Kubernetes, we need to focus on Kubernetes abstractions and understand how Kubernetes works.
Then we will return to the discussion of containers and see how those Kubernetes abstractions interact with nodes.
One purpose of GKE is to enable you to manage applications, not machines.
To accomplish this, you need to understand the GKE abstractions for applications.
Any data access mounted to a pod, called a Volume, is available to all containers in the pod.
Containers that are part of the same pod are guaranteed to be scheduled together on the same VM and can share state via local volumes.
Persistent Volumes, using persistent disks in GCE, survive instance and container restarts.
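The shared-volume behavior can be sketched as a pod manifest; all names here are illustrative, and emptyDir stands in for the GCE persistent disks mentioned above:

```yaml
# Sketch: two containers in one pod sharing a volume. Names are illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: shared-state-demo
spec:
  containers:
  - name: writer
    image: busybox
    command: ["sh", "-c", "echo hello > /data/msg && sleep 3600"]
    volumeMounts:
    - name: scratch
      mountPath: /data      # both containers see the same files here
  - name: reader
    image: busybox
    command: ["sh", "-c", "sleep 3600"]
    volumeMounts:
    - name: scratch
      mountPath: /data
  volumes:
  - name: scratch
    emptyDir: {}            # node-local scratch; a gcePersistentDisk or
                            # PersistentVolumeClaim would survive restarts
```

Because both containers are scheduled onto the same VM, the writer's output under /data is immediately visible to the reader.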
Deployments handle the scheduling of the pods onto the machines, which are called Nodes.
So now that you understand Pods and Deployments, we will return to the relationship with GCE.
Here's a complete overview of a cluster with its key components.
You have a set of master servers and worker nodes. The masters provide the control plane for the cluster. Worker nodes run pods with containers in them.
Cluster administrators configure the cluster by sending requests to apiservers on masters using a command-line tool called kubectl. Kubectl can be installed and run anywhere.
From there, the apiserver communicates with the cluster in two primary ways:
To the kubelet process that runs on each node
To any node, pod, or service through the apiserver's proxy functionality (not shown).
Then pods are started on various nodes. In this example, there are two types of pods running (shown in yellow and green).
There is also a process on each node called kube-proxy (not shown) that sets up networking rules and connection forwarding for services and pods on the host.
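The forwarding rules kube-proxy programs are driven by Service objects. A minimal sketch, with hypothetical names and ports:

```yaml
# Sketch: a Service whose traffic kube-proxy forwards to matching pods.
# Name, label, and ports are illustrative.
apiVersion: v1
kind: Service
metadata:
  name: hello-svc
spec:
  selector:
    app: hello         # traffic is forwarded to pods carrying this label
  ports:
  - port: 80           # port clients use to reach the Service
    targetPort: 8080   # container port on the selected pods
```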
Although networking and data storage services are shown outside nodes, most functionality resides on nodes.
You can also access the apiserver using a web interface called the dashboard via kubectl proxy (not shown).
Image from external Google Slide deck: https://docs.google.com/presentation/d/1lJ2F7e-nYHU1eZq3M9H61rRsIE75s-eQQ_mkoY5a7Ro/edit#slide=id.g2865abe94e_0_1069
You can think of Operators as the runtime that manages this type of application on Kubernetes
CoreOS was bought by Red Hat, which was in turn bought by IBM.
Behind the scenes, a deployment relies on a ReplicaSet to manage and run a given number of pods at a given time.
In this example, there is a Deployment named hello.
When you create that deployment, it's going to create a ReplicaSet of size 3.
You add the label selector of app: hello.
Inside of the pod, you have a single image called hello1.
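The Deployment just described can be sketched as a manifest; the name hello, replica count 3, label app: hello, and image hello1 come from the example, while the registry path is an assumption:

```yaml
# Sketch of the "hello" Deployment from the example above.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello
spec:
  replicas: 3                    # the ReplicaSet keeps 3 pods running
  selector:
    matchLabels:
      app: hello                 # the label selector app: hello
  template:
    metadata:
      labels:
        app: hello
    spec:
      containers:
      - name: hello1
        image: gcr.io/example/hello1   # hypothetical image path
```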
My very first impression is that this is great if your primary business need is calculating Pi and you don’t mind that the driver node sometimes fails to start without any error messages.
But completely overhauling a cluster scheduler is a lot - we can expect it to improve. When it does, looking ahead, here are some of the tradeoffs as I see them.