Speaker: Bo Yang
Summary: More and more people are running Apache Spark on Kubernetes due to the popularity of Kubernetes. Because Spark was not originally designed for Kubernetes, there are many challenges, for example: easily submitting and managing applications, accessing the Spark UI, and allocating resource queues based on CPU/memory. This talk presents how to address these challenges and provide Spark as a Service at large scale.
Self Introduction
● Bo Yang
● Worked in Big Data for 10+ Years (Uber, Apple, …)
● Open Source Projects
○ JVM Profiler for Spark
○ Remote Shuffle Service for Spark
○ DataPunch - One Click to Deploy Spark Service
● Now Working in ZettaBlock.com - Data Infra for Web3 (Easy Access to On-Chain / Off-Chain Data)
Introduction
● Apache Spark: powerful tool for data processing and machine learning
● Kubernetes: extensible platform to run containers
● Apache Spark + Kubernetes: 1 + 1 > 2
● It works: after solving various challenges
Why Run Spark on Kubernetes
● Industry Trend
● Low Operation Cost
○ No Need to Maintain Hadoop/YARN Stacks
● Unified Compute Platform
○ Online Service
○ Offline Data Processing
[Screenshot from a Google search results page]
Dynamic Allocation
● Benefits
○ Dynamically allocate and terminate workers (executors)
○ Increase cluster utilization and reduce cost
● Challenges
○ Issue on Kubernetes: shuffle data prevents terminating executors
○ No External Shuffle Service on Kubernetes (the existing implementation depends on YARN)
● Solutions
○ Decouple shuffle data from executor
○ Uber Remote Shuffle Service
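A minimal sketch of enabling dynamic allocation for Spark on Kubernetes. Since there is no external shuffle service, Spark 3.0+ offers shuffle tracking as a built-in workaround (executors holding shuffle data are kept alive); a remote shuffle service such as Uber RSS decouples shuffle data entirely. The API server address, image name, and executor counts below are placeholders.

```
spark-submit \
  --master k8s://https://<api-server>:6443 \
  --deploy-mode cluster \
  --conf spark.kubernetes.container.image=<spark-image> \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.shuffleTracking.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=1 \
  --conf spark.dynamicAllocation.maxExecutors=20 \
  --conf spark.dynamicAllocation.executorIdleTimeout=60s \
  local:///opt/spark/examples/jars/spark-examples.jar
```

With shuffle tracking, idle executors that still hold shuffle data are not removed, which limits scale-down; moving shuffle data off the executor (the remote shuffle service approach) removes that limitation.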
Kubernetes Default Scheduler
● Originally designed to orchestrate long-running services
● Problems
○ Missing Job Queue
○ No Dynamic Resource Limit
○ Driver Deadlock (drivers from concurrent jobs can occupy all capacity, leaving no room to schedule their executors)
Batch Friendly Scheduler
● Kubernetes Scheduling Framework
○ Support plugins to enhance behavior
○ Able to totally replace the default scheduler
● Options
○ Volcano: a batch scheduler designed for machine learning and other batch workloads
○ Apache YuniKorn: inspired by the YARN scheduler from Hadoop
○ Scheduler Plugins: Kubernetes SIG project with plugins that add gang scheduling and FIFO queueing
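As an illustration of the YuniKorn option above, a Spark driver/executor pod can opt into the YuniKorn scheduler via `schedulerName` plus labels for application grouping and queue placement. The application ID and queue name below are placeholders.

```yaml
# Pod template fragment for scheduling via Apache YuniKorn (sketch).
metadata:
  labels:
    applicationId: "spark-app-001"   # groups driver + executors into one app
    queue: "root.sandbox"            # target resource queue (placeholder name)
spec:
  schedulerName: yunikorn            # replace the default scheduler for this pod
```

The queue label is what gives Spark jobs the per-team resource limits that the default scheduler lacks.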
Auto Scaling
● Automatically scale up/out
● Cluster Autoscaler: horizontally scale the cluster
● Usage Example
○ Create different node groups for different teams in the same cluster
○ Enable Cluster Autoscaler for those node groups
○ Adjust different min and max sizes for different node groups
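The node-group setup above can be sketched with an eksctl cluster config; names, instance types, and sizes are illustrative placeholders, assuming Cluster Autoscaler is installed separately in the cluster.

```yaml
# eksctl ClusterConfig sketch: per-team node groups with autoscaling bounds.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: spark-cluster
  region: us-west-2
nodeGroups:
  - name: team-a            # node group dedicated to team A's Spark jobs
    instanceType: m5.xlarge
    minSize: 1
    maxSize: 20
    iam:
      withAddonPolicies:
        autoScaler: true    # IAM permissions for Cluster Autoscaler
  - name: team-b            # team B gets memory-optimized nodes, scale-to-zero
    instanceType: r5.2xlarge
    minSize: 0
    maxSize: 10
    iam:
      withAddonPolicies:
        autoScaler: true
```

Different min/max sizes per node group let each team's capacity grow and shrink independently within one cluster.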
Make it User Friendly: Spark As a Service
[Architecture diagram: Spark application submission. A user in the user environment submits via CLI or curl to an API Gateway in the service environment; the gateway performs dynamic cluster routing to a Spark Operator in one of multiple Kubernetes clusters, where each Spark application runs as a driver pod plus executor pods.]
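A sketch of what submission through such a REST API gateway could look like. The endpoint path and JSON fields here are hypothetical (modeled loosely on the Spark Operator's SparkApplication spec, which the gateway would create on the user's behalf); they are for illustration only.

```
curl -X POST https://spark-gateway.example.com/apps \
  -H "Content-Type: application/json" \
  -d '{
    "name": "word-count",
    "mainApplicationFile": "s3a://my-bucket/jobs/word-count.jar",
    "mainClass": "com.example.WordCount",
    "sparkVersion": "3.2.0",
    "driver":   {"cores": 1, "memory": "2g"},
    "executor": {"cores": 2, "memory": "4g", "instances": 5}
  }'
```

Because users only talk to the gateway, clusters behind it can be added, drained, or replaced without changing how anyone submits jobs.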
Summary
● Spark on Kubernetes: Low Operation Cost
● A Lot of Work to Get Started!
● Need Easy Deployment
○ Automated
○ Repeatable
○ Zero Time to Market
DataPunch Project - One Click Deploy
https://github.com/datapunchorg/punch
One Click To Deploy:
○ Create EKS/Kubernetes Cluster
○ Set up IAM Policy
○ Create Service Account
○ Set up Auto Scaling
○ Install Spark Operator
○ Install Spark API Gateway
Then: Curl / CLI to Submit Application
DataPunch Benefits
● Learning Curve: Days -> Minutes
● Time to Deploy: Hours -> Minutes
● Operation: Manual -> Automated and Repeatable
Take-Away
● Decouple End User from Kubernetes Resources
○ Reduce Learning Curve
● REST API Gateway to Submit Application
○ Dynamically Add/Remove Kubernetes Clusters without user impact
● Repeatable/Easy Deployment - DataPunch Project
○ Simplify Operation
○ https://github.com/datapunchorg/punch
● References
○ Challenges of Running Spark on Kubernetes - Part 1
○ Challenges of Running Spark on Kubernetes - Part 2