WIFI SSID:Spark+AISummit | Password: UnifiedDataAnalytics
Will Manning + Matt Cheah, Palantir Technologies
Reliable Performance at Scale
with Spark on Kubernetes
#UnifiedDataAnalytics #SparkAISummit
Will Manning — About us (timeline: 2013, 2015, 2016)
– Joined Palantir
– First production adopter of YARN & Parquet
– Helped form our Data Flows & Analytics product groups
– Responsible for Engineering and Architecture for Compute (including Spark)
Matt Cheah — About us (timeline: 2014, 2018)
– Joined Palantir
– Migrated Spark cluster management from standalone to YARN, and later from YARN to Kubernetes
– Spark committer / open source Spark developer
Agenda
1. A (Very) Quick Primer on Palantir
2. Why We Moved from YARN
3. Spark on Kubernetes
4. Key Production Challenges
   – Kubernetes Scheduling
   – Shuffle Resiliency
A (Very) Quick Primer on Palantir
… and how Spark on Kubernetes helps power Palantir Foundry
Who are we?
Headquartered: Palo Alto, CA
Presence: Global
Employees: ~2500 / Mostly Engineers
Founded: 2004
Software: Data Integration
Supporting Counterterrorism
From intelligence operations to mission planning in the field.
Energy
Institutions must evolve or die. Technology and data are driving this evolution.
Manufacturing
Ferrari uses Palantir Foundry to increase performance + reliability.
Aviation Safety
Palantir and Airbus founded Skywise to help make air travel safer and more economical.
Cancer Research
Syntropy brings together the greatest minds and institutions to advance research toward the common goal of improving human lives.
Products Built for a Purpose
Integrate, manage, secure, and analyze all of your enterprise data.
Amplify and extend the power of data integration.
Enabling analytics in Palantir Foundry
Executing untrusted code on behalf of trusted users in a multitenant environment
– Users can author code (e.g., using Spark SQL or PySpark) to define complex data transformations or to perform analysis (an illustrative example follows below)
– Our users want to write code once and have it keep working the same way indefinitely
– Foundry is responsible for executing arbitrary code on users' behalf securely
– Even though the customer might trust the user, Palantir infrastructure can't
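To make the first bullet concrete, here is a minimal, illustrative sketch of the kind of transformation code a user might author; the dataset paths, column names, and job name are hypothetical, not a real Foundry pipeline.

```scala
// Illustrative user-authored transformation (hypothetical dataset and columns).
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object FlightDelaysByAirline {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("flight-delays").getOrCreate()
    val flights = spark.read.parquet("/data/flights")            // hypothetical input dataset
    val summary = flights
      .groupBy(col("airline"))
      .agg(avg(col("arrival_delay_minutes")).as("avg_delay"))    // average delay per airline
    summary.write.mode("overwrite").parquet("/data/flight_delay_summary")
    spark.stop()
  }
}
```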
Enabling collaboration across organizations
Using Spark on Kubernetes to enable multitenant compute
(Diagram: engineers from Airline A, Airbus employees, and engineers from Airline B sharing the same multitenant compute platform.)
Enabling analytics in Palantir Foundry (recap)
The requirements above boil down to two things: repeatable performance and security.
Why We Moved from YARN
Hadoop/YARN Security
Two modes
1. Kerberos
2. None (only mode until Hadoop 2.x)
NB: I recommend reading Steve Loughran’s “Madness Beyond the Gate” to learn more
Hadoop/YARN Security
Containerization support only as of Hadoop 3.1.1 (late 2018)
Performance in YARN
YARN’s scheduler attempts to maximize utilization
Spark on YARN with dynamic allocation is great at:
– extracting maximum performance from static resources
– providing bursts of resources for one-off batch work
– running “just one more thing”
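For reference, a minimal sketch of the standard Spark-on-YARN dynamic allocation settings that the list above relies on; the values are illustrative, not Palantir's production configuration.

```scala
// Typical Spark-on-YARN dynamic allocation settings (illustrative values).
import org.apache.spark.SparkConf

object YarnDynamicAllocationConf {
  val conf: SparkConf = new SparkConf()
    .set("spark.dynamicAllocation.enabled", "true")
    .set("spark.shuffle.service.enabled", "true")        // external shuffle service, required on YARN
    .set("spark.dynamicAllocation.minExecutors", "1")    // shrink when idle
    .set("spark.dynamicAllocation.maxExecutors", "100")  // burst for one-off batch work
}
```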
YARN: Clown Car Scheduling
(Image Credit: 20th Century Fox Television)
Performance in YARN
YARN’s scheduler attempts to maximize utilization
Spark on YARN with dynamic allocation is terrible at:
– providing consistency from run to run
– isolating performance between different users/tenants
(i.e., if you kick off a big job, then my job is likely to run slower)
So… Kubernetes?
✓ Native containerization
✓ Extreme extensibility (e.g., scheduler, networking/firewalls)
✓ Active community with a fast-moving code base
✓ Single platform for microservices and compute*
*Spoiler alert: the Kubernetes scheduler is excellent for web services, not optimized for batch
Spark on Kubernetes
Timeline (milestones at Sep '16, Nov '16, Mar '17, Oct '17, Jan '18, and Jun '18)
– Begin initial prototype of Spark on K8s
– Minimal integration complete, begin experimental deployment
– Begin first migration from YARN to K8s
– Establish working group with community
– First PR merged into upstream master
– Complete first migration from YARN to K8s in production
Spark on K8s Architecture
Client runs spark-submit with arguments
spark-submit converts arguments into a PodSpec for the driver
A K8s-specific implementation of the SchedulerBackend interface requests executor pods in batches
Spark on K8s Architecture
spark-submit creates the driver pod.

spark-submit input:
…
--master k8s://example.com:8443
--conf spark.kubernetes.image=example.com/appImage
--conf …
com.palantir.spark.app.main.Main

spark-submit output (driver PodSpec):
…
spec:
  containers:
  - name: example
    image: example.com/appImage
    command: ["/opt/spark/entrypoint.sh"]
    args: ["--driver", "--class", "com.palantir.spark.app.main.Main"]
…
Spark on K8s Architecture
Executor allocation walkthrough with spark.executor.instances = 4 and spark.kubernetes.allocation.batch.size = 2:
– The driver pod creates the first batch of 2 executor pods.
– Those executors register with the driver, which then creates 2 more executor pods.
– Once all 4 executors have registered, the driver runs tasks on them.
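A conceptual sketch of the batching behavior in the walkthrough above; it is not Spark's or Palantir's actual scheduler backend, and the types and names are illustrative.

```scala
// Conceptual batched executor allocation: request pods in batches and wait for
// each batch to register before requesting the next one.
import scala.math.min

final case class PodSpec(name: String)                        // stand-in for a real Kubernetes PodSpec
trait KubernetesClientLike { def createPod(spec: PodSpec): Unit }

class BatchingExecutorAllocator(
    client: KubernetesClientLike,
    totalExecutors: Int,   // spark.executor.instances
    batchSize: Int         // spark.kubernetes.allocation.batch.size
) {
  private var requested = 0

  /** Called periodically by the driver's allocation loop. */
  def allocateNextBatch(registeredExecutors: Int): Unit =
    if (registeredExecutors == requested && requested < totalExecutors) {
      val toCreate = min(batchSize, totalExecutors - requested)
      (1 to toCreate).foreach(i => client.createPod(PodSpec(s"executor-${requested + i}")))
      requested += toCreate
    }
}
```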
Early Challenge: Disk Performance
(Diagram, YARN: two executors with heap E each run under a YARN Node Manager with memory M and share one OS file system cache of up to M − 2E.)
(Diagram, Kubernetes: each executor runs in its own pod with memory limit C and heap E ≈ C, so the per-container OS file system cache is roughly C − E ≈ 0.)
Early Challenge: Disk Performance
OS filesystem cache is now container-local
– i.e., dramatically smaller
– disks must be fast even when reads miss the FS cache
– Solution: use NVMe drives for temp storage
Docker disk interface is slower than direct disk access
– Solution: use emptyDir volumes for temp storage
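A minimal sketch of the related configuration, assuming upstream Spark-on-K8s behavior where executor scratch directories are backed by emptyDir volumes by default; the path below is illustrative.

```scala
// Point Spark's scratch space at a directory backed by an emptyDir volume on a
// node with fast NVMe disks (path is illustrative; upstream Spark-on-K8s backs
// local dirs with emptyDir volumes by default).
import org.apache.spark.SparkConf

object LocalDirConf {
  val conf: SparkConf = new SparkConf()
    .set("spark.local.dir", "/tmp/spark-scratch")
}
```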
Key Production Challenges
Challenge I: Kubernetes Scheduling
– The built-in Kubernetes scheduler isn't really designed for distributed batch workloads (e.g., MapReduce)
– Historically, optimized for microservice instances or single-pod, one-off jobs (what k8s natively supports)
Challenge II: Shuffle Resiliency
– External Shuffle Service is unavailable in Kubernetes
– Jobs must be written more carefully to avoid executor failure (e.g., OOM) and the subsequent need to recompute lost blocks
Kubernetes Scheduling
Reliable, Repeatable Runtimes
Recall: a key goal is to make runtimes of the same workload consistent from run to run.
Corollary: when a driver is deployed and wants to do work, it should receive the same resources from run to run.
Problem: using the vanilla k8s scheduler led to partial starvation as the cluster became saturated.
Kubernetes Scheduling
(Diagram, default scheduler rounds: the scheduler takes pods from the queue one at a time; in round 1, P1 and P3 fit and start running, but P2 cannot be placed and goes back to the end of the line. In round 2, P2 is still waiting.)
Naïve Spark-on-K8s Scheduling
(Diagram: in round 1 the driver pod D is scheduled and starts running; only then does it request its executors, so E1 and E2 enter the queue and are scheduled in round 2.)
The "1000 Drivers" Problem
(Diagram: with drivers D1 … D1000 in the queue, the drivers are scheduled ahead of any of their executors. The cluster fills up with driver pods, the executors E1 … EN queue up behind the remaining drivers, and there is no longer room to place them, so no application can make progress. Uh oh!)
So… Kubernetes? (recap)
Kubernetes gives us native containerization, extensibility, an active community, and a single platform for microservices and compute, but its scheduler is excellent for web services and not optimized for batch.
K8s Spark Scheduler
Idea: use the Kubernetes scheduler extender API to add
– Gang scheduling of drivers & executors
– FIFO (within instance groups)
K8s Spark Scheduler
Goal: build entirely with native K8s extension points
– Scheduler extender API: fn(pod, [node]) -> Option[node]
– Custom resource definition: ResourceReservation
– Driver annotations for resource requests (executors)
K8s Spark Scheduler
if driver:
    get cluster usage (running pods + reservations)
    bin-pack pending resource requests in FIFO order
    reserve resources (with the ResourceReservation CRD)
if executor:
    find an unbound reservation
    bind the reservation
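A conceptual sketch of the extender-style decision above (fn(pod, [node]) -> Option[node]); the real k8s-spark-scheduler is written in Go, so every type and name here is illustrative rather than the actual implementation.

```scala
// Conceptual scheduler-extender decision: drivers are bin-packed and reserve
// capacity; executors bind to an existing unbound reservation.
final case class Pod(role: String, app: String, cpuMillis: Long, memMiB: Long)
final case class Node(name: String, freeCpuMillis: Long, freeMemMiB: Long)
final case class Reservation(app: String, node: String, bound: Boolean)

object ExtenderSketch {
  def pick(pod: Pod, nodes: Seq[Node], reservations: Seq[Reservation]): Option[Node] =
    pod.role match {
      case "driver" =>
        // Simplified stand-in for "bin pack pending requests FIFO and reserve resources":
        // place the driver on the first node with room (a real implementation would also
        // create reservations for all of the driver's executors before admitting it).
        nodes.find(n => n.freeCpuMillis >= pod.cpuMillis && n.freeMemMiB >= pod.memMiB)
      case "executor" =>
        // Find an unbound reservation for this application and send the executor to its node.
        reservations.find(r => r.app == pod.app && !r.bound)
          .flatMap(r => nodes.find(_.name == r.node))
      case _ => None
    }
}
```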
K8s Spark Scheduler
(Diagram, the 1000-drivers scenario with the extender in place: D1 is admitted together with reservations R1 and R2 for its executors; E1 and E2 then bind to those reservations and run, while D2 … D1000 wait their turn in FIFO order.)
Spark-Aware Autoscaling
Idea: use the resource request annotations for unscheduled drivers to project desired cluster size
– Again, we use a CRD to represent this unsatisfied Demand
– Sum resources for pods + reservations + demand (sketched below)
– In the interest of time, out of scope for today
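A minimal sketch of the projection mentioned above; the types and field names are illustrative, not the actual CRD schema.

```scala
// Desired cluster capacity = resources of running pods + outstanding reservations
// + unsatisfied Demand objects (all types illustrative).
final case class Resources(cpuMillis: Long, memMiB: Long) {
  def +(other: Resources): Resources = Resources(cpuMillis + other.cpuMillis, memMiB + other.memMiB)
}

object DesiredClusterSize {
  def project(pods: Seq[Resources], reservations: Seq[Resources], demand: Seq[Resources]): Resources =
    (pods ++ reservations ++ demand).foldLeft(Resources(0, 0))(_ + _)
}
```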
(Charts: Spark pods per day; pod processing time.)
"Soft" dynamic allocation
Problem: static allocation wastes resources.
Idea: voluntarily give up executors that aren't needed, with no preemption (same from run to run).
Problem: there is no external shuffle service, so executors store shuffle files.
Idea: the driver already tracks shuffle file locations, so it can determine which executors are safe to give up.
"Soft" dynamic allocation
– Saves $$$$; ~no runtime variance if behavior is consistent from run to run
– Inspired by a 2018 Databricks blog post [1]
– Merged into our fork in ~January 2019
– Recently adapted by @vanzin (thanks!) and merged upstream [2]
[1] https://databricks.com/blog/2018/05/02/introducing-databricks-optimized-auto-scaling.html
[2] https://github.com/apache/spark/pull/24817
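For context, a minimal sketch of the upstream settings that correspond to this behavior in Spark 3.0+ (the mechanism added by [2], as we understand it): dynamic allocation driven by shuffle tracking rather than an external shuffle service.

```scala
// Dynamic allocation without an external shuffle service: executors that still
// hold live shuffle data are not released.
import org.apache.spark.SparkConf

object SoftDynamicAllocationConf {
  val conf: SparkConf = new SparkConf()
    .set("spark.dynamicAllocation.enabled", "true")
    .set("spark.dynamicAllocation.shuffleTracking.enabled", "true") // track shuffle files per executor
    .set("spark.shuffle.service.enabled", "false")                  // no external shuffle service on K8s
}
```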
K8s Spark Scheduler
See our engineering blog on Medium
– https://medium.com/palantir/spark-scheduling-in-kubernetes-4976333235f3
Or, check out the source for yourself! (Apache v2 license)
– https://github.com/palantir/k8s-spark-scheduler
Shuffle Resiliency
Shuffle Resiliency
Shuffles have a map side and a reduce side
— Mapper executors write temporary data to local disk
— Reducer executors contact mapper executors to retrieve written data
(Diagram: a mapper executor and a reducer executor, each running under a YARN Node Manager with its own local disk.)
Shuffle Resiliency
If an executor crashes, all data written by that executor's map tasks is lost
— Spark must re-schedule the map tasks on other executors
— Cannot remove executors to save resources because we would lose shuffle files
Shuffle Resiliency
Spark's external shuffle service preserves shuffle data in cases of executor loss
— Required to make preemption non-pathological
— Prevents loss of work when executors crash
(Diagram: on each YARN node, a shuffle service runs alongside the executor and serves shuffle data from the node's local disk.)
Shuffle Resiliency
Problem: for security, containers have isolated storage
(Diagram: on YARN, the node's shuffle service and the mapper executor share the Node Manager's local disk; on Kubernetes, the mapper executor pod and a shuffle service pod would each have their own isolated local disk.)
Shuffle Resiliency
Idea: asynchronously back up shuffle files to a distributed storage system
(Diagram: inside an executor pod, the map task thread writes shuffle files to local disk while a backup thread uploads them to remote storage; a sketch follows below.)
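A conceptual sketch of that idea; the RemoteStore interface and all names here are hypothetical, not Spark's shuffle API or Palantir's implementation.

```scala
// Map output is written to local disk as usual, then handed to a background
// thread that uploads a copy to remote storage without blocking map tasks.
import java.nio.file.Path
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}

trait RemoteStore { def upload(shuffleId: Int, mapId: Long, file: Path): Unit }

class ShuffleBackup(store: RemoteStore) {
  // Single background thread so backups never block map tasks.
  private implicit val backupEc: ExecutionContext =
    ExecutionContext.fromExecutor(Executors.newSingleThreadExecutor())

  /** Called after a map task finishes writing its shuffle output locally. */
  def backupAsync(shuffleId: Int, mapId: Long, file: Path): Future[Unit] =
    Future(store.upload(shuffleId, mapId, file))
}
```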
Shuffle Resiliency
1. Reducers first try to fetch from other executors
2. Download from remote storage if the mapper is unreachable
(Diagram: one reducer pod fetches directly from a live mapper pod; another reducer, whose mapper pod has died, downloads the backed-up data from remote storage instead.)
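A conceptual sketch of that two-step fetch; the interfaces are hypothetical, not Spark's actual block-fetching API.

```scala
// Try the mapper executor first; fall back to the remote backup only on failure.
import scala.util.{Failure, Success, Try}

trait BlockFetcher { def fetchFromExecutor(execId: String, blockId: String): Try[Array[Byte]] }
trait RemoteReader { def download(blockId: String): Array[Byte] }

class ResilientShuffleReader(fetcher: BlockFetcher, remote: RemoteReader) {
  def read(execId: String, blockId: String): Array[Byte] =
    fetcher.fetchFromExecutor(execId, blockId) match {
      case Success(bytes) => bytes                     // 1. mapper executor is alive
      case Failure(_)     => remote.download(blockId)  // 2. mapper is gone; use the backup
    }
}
```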
Shuffle Resiliency
– Wanted to generalize the framework for storing shuffle data in arbitrary storage systems
– API in progress: https://issues.apache.org/jira/browse/SPARK-25299
– Goal: open source the asynchronous backup strategy by end of 2019
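As a rough illustration of what such a pluggable storage abstraction could look like; the API being designed under SPARK-25299 differs, so treat every name here as an assumption rather than the actual Spark interface.

```scala
// Hypothetical pluggable shuffle-storage abstraction (illustrative only).
import java.io.{InputStream, OutputStream}

trait ShuffleStorage {
  def writePartition(shuffleId: Int, mapId: Long, partitionId: Int): OutputStream
  def readPartition(shuffleId: Int, mapId: Long, partitionId: Int): InputStream
  def removeShuffle(shuffleId: Int): Unit
}
```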
Thanks to the team!
Q&A
DON’T FORGET TO RATE
AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT
