Kubernetes is an open source system for deploying, scaling, and managing containerized applications anywhere. It builds on 15 years of running Google's containerized workloads and on the valuable contributions of the open source community. To shepherd Kubernetes' evolution with that community, Google helped form the Cloud Native Computing Foundation (CNCF) and donated Kubernetes as its founding project. Starting with Spark 2.3.0, Spark has an experimental option to run on clusters managed by Kubernetes, using the native Kubernetes scheduler that has been added to Spark. In this talk, we will provide a baseline understanding of what Kubernetes is, why it is relevant to the Spark community, and how it compares to YARN. We will then look under the hood of Spark managed by Kubernetes to better understand how it works. Finally, we provide an early evaluation of this feature, as well as our thoughts on the future of running Spark on Kubernetes.
4. Google has 20+ years experience solving Data Problems
[Timeline, 2002–2016: Google Research papers — GFS, MapReduce, BigTable, Dremel, FlumeJava, Millwheel, TensorFlow — alongside open source work since 2005, including Apache Beam, and Google Cloud products: BigQuery, Pub/Sub, Dataflow, Bigtable, ML, Dataproc]
5. GCP Open Data Ecosystem
- Cloud Composer (Apache Airflow)
- Cloud Dataflow (Apache Beam)
- Cloud Dataproc (Apache Spark/Hadoop)
6. What is Cloud Dataproc?
Google Cloud Platform’s fully-managed Apache Spark and Apache Hadoop service
- Rapid cluster creation
- Familiar open source tools
- Ephemeral clusters on-demand
- Customizable machines
- Tightly integrated with other Google Cloud Platform services
7. Cloud Dataproc in 2018…
More than 30 features launched
Fast
- Clusters from YAML
- Cloud Storage connector optimizations
- Weekly updates
- OSS performance tuning
Easy
- Custom images
- Stackdriver monitoring
- Workflow templates
- Workflow parameters
- Optional components
Cost-effective
- Autoscaling
- Granular IAM
- CMEK support for multiple products
- Graceful decommissioning
- Global expansion to 6 new regions
8. Cloud Dataproc Internals
[Architecture diagram: a frontend (CLI, GUI, API) reaches Dataproc through GFEs (dataproc.googleapis.com and dataproc.control.googleapis.com); the control plane — JobService, TaskService, Job Dispatcher, and Task Dispatcher, backed by Spanner — runs on Google Borg; a Dataproc Agent on the cluster drives Apache Hadoop YARN in user land, with job artifacts in GCS ($ hadoop jar ...)]
21. ● Integrates with BigQuery, Google’s serverless data warehouse
● Provides Google Cloud Storage as a replacement for HDFS
● Ships logs to Stackdriver Monitoring
  ○ via a Prometheus server with the Stackdriver sidecar
● Contains sparkctl, a command line tool that simplifies handling client-local application dependencies in a Kubernetes environment.
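The operator's features above center on a SparkApplication custom resource. A minimal sketch of one, assuming the operator's v1beta2 CRD; the application name, image tag, and jar path are illustrative, not from the slides:

```yaml
# Sketch: a SparkApplication for the Kubernetes Operator for Apache Spark.
# Names, image, and jar path below are assumptions for illustration.
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi          # hypothetical application name
  namespace: default
spec:
  type: Scala
  mode: cluster
  image: gcr.io/spark-operator/spark:v2.4.0    # assumed operator image
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar
  sparkVersion: "2.4.0"
  driver:
    cores: 1
    memory: 512m
  executor:
    cores: 1
    instances: 2
    memory: 512m
```

With a manifest like this, sparkctl can submit the application (e.g. `sparkctl create spark-pi.yaml`), handling any client-local dependencies along the way.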
Intro to my presentation:
Story about my boss coming to me and asking for our K8s plan. What is K8s? Has anyone else had that happen yet?
We've got a good thing going with YARN, so why would I rip all that out? Where does this fit?
I'm going to walk through my findings to save you time.
Most of you have probably seen the Google Search box. Behind Google Search is a tremendous amount of infrastructure – and machine learning and analytics foundation – that make those wonderfully simple experiences possible.
20 years of experience in building secure and trusted infrastructure for processing massive volumes of data! The universe of data! Google has also been a leader in the field of machine learning and AI, to better organize the world’s information and make it available to everyone.
From EBC deck: https://docs.google.com/presentation/d/1Am33uS23Hkdew-OsFLAmzB1XjPIQSo3bq4CAtvNaQps/edit#slide=id.g404c2253c6_0_928
Enterprises, planet-scale internet companies, disruptive start-ups are innovating and transforming their businesses with Google Cloud data analytics solutions.
Speaker Notes: Google has a long history of innovating in the data space. Papers on GFS and MapReduce are widely credited with inspiring the first versions of Hadoop. Later papers have resulted in projects such as HBase, Crunch, Drill, Beam, and TensorFlow. Google has been democratizing these internal technologies through GCP products since ~2011, starting with BigQuery.
Dataproc as a processing engine gives customers a managed cloud experience but without having to re-architect applications and code. It also provides deep integrations with the rest of GCP making it easy to mix open source solutions alongside native GCP services.
https://kubernetes.io/
Kubernetes is Greek for "helmsman" or "pilot". The project started in 2014, based on experience with Google's internal container management system.
Most users only really care that you provide them with an API.
And most operators only really care that they have a running container cluster.
Kubernetes is an open source project (available on kubernetes.io) that can run on many different environments, from laptops to high-availability multi-node clusters, from public clouds to on-premise deployments, from virtual machines to bare metal.
At the highest level, it is a set of APIs that you can use to deploy containers on a set of nodes.
The system is divided into a set of master components that run as the control plane and a set of nodes that run containers.
Users access your API via a command-line interface, HTTP, or a user interface.
A Container Cluster is a Google abstraction that relates Kubernetes to the GCE infrastructure: a collection of GCE VM instances, consisting of the Kubernetes master endpoint and one or more node instances.
Now that you understand the physical relationship between GCE and Kubernetes, we need to focus on Kubernetes abstractions and understand how Kubernetes works.
Then we will return to the discussion of containers and see how those Kubernetes abstractions interact with nodes.
One purpose of GKE is to enable you to manage applications, not machines.
To accomplish this, you need to understand the GKE abstractions for applications.
Any data access mounted to a pod, called a Volume, is available to all containers in the pod.
Containers that are part of the same pod are guaranteed to be scheduled together on the same VM and can share state via local volumes.
Persistent Volumes, using persistent disks in GCE, survive instance and container restarts.
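The shared-volume behavior can be sketched as a pod manifest; all names here are illustrative, and emptyDir stands in for the GCE persistent disks mentioned above:

```yaml
# Sketch: two containers in one pod sharing a volume. Names are illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: shared-state-demo
spec:
  containers:
  - name: writer
    image: busybox
    command: ["sh", "-c", "echo hello > /data/msg && sleep 3600"]
    volumeMounts:
    - name: scratch
      mountPath: /data      # both containers see the same files here
  - name: reader
    image: busybox
    command: ["sh", "-c", "sleep 3600"]
    volumeMounts:
    - name: scratch
      mountPath: /data
  volumes:
  - name: scratch
    emptyDir: {}            # node-local scratch; a gcePersistentDisk or
                            # PersistentVolumeClaim would survive restarts
```

Because both containers are scheduled onto the same VM, the writer's output under /data is immediately visible to the reader.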
Deployments handle the scheduling of the pods onto the machines, which are called Nodes.
So now that you understand Pods and Deployments, we will return to the relationship with GCE.
Here's a complete overview of a cluster with its key components.
You have a set of master servers and worker nodes. The masters provide the control plane for the cluster. Worker nodes run pods with containers in them.
Cluster administrators configure the cluster by sending requests to apiservers on masters using a command-line tool called kubectl. Kubectl can be installed and run anywhere.
From there, the apiserver communicates with the cluster in two primary ways:
To the kubelet process that runs on each node
To any node, pod, or service through the apiserver's proxy functionality (not shown).
Then pods are started on various nodes. In this example, there are two types of pods running (shown in yellow and green).
There is also a process on each node called kube-proxy (not shown) that sets up networking rules and connection forwarding for services and pods on the host.
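The forwarding rules kube-proxy programs are driven by Service objects. A minimal sketch, with hypothetical names and ports:

```yaml
# Sketch: a Service whose traffic kube-proxy forwards to matching pods.
# Name, label, and ports are illustrative.
apiVersion: v1
kind: Service
metadata:
  name: hello-svc
spec:
  selector:
    app: hello         # traffic is forwarded to pods carrying this label
  ports:
  - port: 80           # port clients use to reach the Service
    targetPort: 8080   # container port on the selected pods
```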
Although networking and data storage services are shown outside nodes, most functionality resides on nodes.
You can also access the apiserver using a web interface called the dashboard via kubectl proxy (not shown).
Image from external Google Slide deck: https://docs.google.com/presentation/d/1lJ2F7e-nYHU1eZq3M9H61rRsIE75s-eQQ_mkoY5a7Ro/edit#slide=id.g2865abe94e_0_1069
You can think of Operators as the runtime that manages this type of application on Kubernetes
CoreOS was bought by Red Hat, which was in turn bought by IBM.
Behind the scenes, a deployment relies on a ReplicaSet to manage and run a given number of pods at a given time.
In this example, there is a Deployment named hello.
When you create that deployment, it's going to create a ReplicaSet of size 3.
You add the label selector of app: hello.
Inside of the pod, you have a single image called hello1.
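The Deployment just described can be sketched as a manifest; the name hello, replica count 3, label app: hello, and image hello1 come from the example, while the registry path is an assumption:

```yaml
# Sketch of the "hello" Deployment from the example above.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello
spec:
  replicas: 3                    # the ReplicaSet keeps 3 pods running
  selector:
    matchLabels:
      app: hello                 # the label selector app: hello
  template:
    metadata:
      labels:
        app: hello
    spec:
      containers:
      - name: hello1
        image: gcr.io/example/hello1   # hypothetical image path
```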
My very first impression is that this is great if your primary business need is calculating Pi and you don’t mind that the driver node sometimes fails to start without any error messages.
But completely overhauling a cluster scheduler is a lot - we can expect it to improve. When it does, looking ahead, here are some of the tradeoffs as I see them.