Speaker: Bo Yang
Summary: More and more people are running Apache Spark on Kubernetes due to the popularity of Kubernetes. Because Spark was not originally designed for Kubernetes, there are many challenges, for example: easily submitting and managing applications, accessing the Spark UI, and allocating resource queues based on CPU/memory. This talk presents how to address these challenges and provide Spark as a Service at large scale.
Self Introduction
● Bo Yang
● Worked in Big Data for 10+ Years (Uber, Apple, …)
● Open Source Projects
○ JVM Profiler for Spark
○ Remote Shuffle Service for Spark
○ DataPunch - One Click to Deploy Spark Service
● Now Working in ZettaBlock.com - Data Infra for Web3 (Easy Access to On-Chain / Off-Chain Data)
Introduction
● Apache Spark: powerful tool for data processing and machine learning
● Kubernetes: extensible platform to run containers
● Apache Spark + Kubernetes: 1 + 1 > 2
● It works: after solving various challenges
Why Run Spark on Kubernetes
● Industry Trend
● Low Operation Cost
○ No Need to Maintain Hadoop/YARN Stacks
● Unified Compute Platform
○ Online Service
○ Offline Data Processing
[Screenshot from a Google search results page]
Dynamic Allocation
● Benefits
○ Dynamically allocate and terminate workers (executors)
○ Increase cluster utilization and reduce cost
● Challenges
○ Issue on Kubernetes: shuffle data prevents terminating executors
○ No External Shuffle Service on Kubernetes (the existing implementation depends on YARN)
● Solutions
○ Decouple shuffle data from executor
○ Uber Remote Shuffle Service
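A minimal sketch of enabling dynamic allocation for Spark on Kubernetes. Since there is no external shuffle service, Spark 3.0+ offers shuffle tracking as a built-in workaround (executors holding shuffle data are kept alive); a remote shuffle service such as Uber RSS decouples shuffle data entirely. The API server address, image name, and executor counts below are placeholders.

```
spark-submit \
  --master k8s://https://<api-server>:6443 \
  --deploy-mode cluster \
  --conf spark.kubernetes.container.image=<spark-image> \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.shuffleTracking.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=1 \
  --conf spark.dynamicAllocation.maxExecutors=20 \
  --conf spark.dynamicAllocation.executorIdleTimeout=60s \
  local:///opt/spark/examples/jars/spark-examples.jar
```

With shuffle tracking, idle executors that still hold shuffle data are not removed, which limits scale-down; moving shuffle data off the executor (the remote shuffle service approach) removes that limitation.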
Kubernetes Default Scheduler
● Originally designed to orchestrate long-running services
● Problems
○ Missing Job Queue
○ No Dynamic Resource Limit
○ Driver Deadlock (drivers from concurrent jobs can occupy all capacity, leaving no room to schedule their executors)
Batch Friendly Scheduler
● Kubernetes Scheduling Framework
○ Support plugins to enhance behavior
○ Able to totally replace the default scheduler
● Options
○ Volcano: a batch scheduler designed for machine learning and other batch workloads
○ Apache YuniKorn: inspired by the YARN scheduler from Hadoop
○ Scheduler Plugins: Kubernetes SIG project with plugins that add gang scheduling and FIFO queueing
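As an illustration of the YuniKorn option above, a Spark driver/executor pod can opt into the YuniKorn scheduler via `schedulerName` plus labels for application grouping and queue placement. The application ID and queue name below are placeholders.

```yaml
# Pod template fragment for scheduling via Apache YuniKorn (sketch).
metadata:
  labels:
    applicationId: "spark-app-001"   # groups driver + executors into one app
    queue: "root.sandbox"            # target resource queue (placeholder name)
spec:
  schedulerName: yunikorn            # replace the default scheduler for this pod
```

The queue label is what gives Spark jobs the per-team resource limits that the default scheduler lacks.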
Auto Scaling
● Automatically scale up/out
● Cluster Autoscaler: horizontally scale the cluster
● Usage Example
○ Create different node groups for different teams in the same cluster
○ Enable Cluster Autoscaler for those node groups
○ Adjust different min and max sizes for different node groups
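The node-group setup above can be sketched with an eksctl cluster config; names, instance types, and sizes are illustrative placeholders, assuming Cluster Autoscaler is installed separately in the cluster.

```yaml
# eksctl ClusterConfig sketch: per-team node groups with autoscaling bounds.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: spark-cluster
  region: us-west-2
nodeGroups:
  - name: team-a            # node group dedicated to team A's Spark jobs
    instanceType: m5.xlarge
    minSize: 1
    maxSize: 20
    iam:
      withAddonPolicies:
        autoScaler: true    # IAM permissions for Cluster Autoscaler
  - name: team-b            # team B gets memory-optimized nodes, scale-to-zero
    instanceType: r5.2xlarge
    minSize: 0
    maxSize: 10
    iam:
      withAddonPolicies:
        autoScaler: true
```

Different min/max sizes per node group let each team's capacity grow and shrink independently within one cluster.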
Make it User Friendly: Spark As a Service
[Architecture diagram: Spark application submission. A user in the user environment submits via CLI or curl to an API Gateway in the service environment; the gateway performs dynamic cluster routing to a Spark Operator in one of multiple Kubernetes clusters, where each Spark application runs as a driver pod plus executor pods.]
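A sketch of what submission through such a REST API gateway could look like. The endpoint path and JSON fields here are hypothetical (modeled loosely on the Spark Operator's SparkApplication spec, which the gateway would create on the user's behalf); they are for illustration only.

```
curl -X POST https://spark-gateway.example.com/apps \
  -H "Content-Type: application/json" \
  -d '{
    "name": "word-count",
    "mainApplicationFile": "s3a://my-bucket/jobs/word-count.jar",
    "mainClass": "com.example.WordCount",
    "sparkVersion": "3.2.0",
    "driver":   {"cores": 1, "memory": "2g"},
    "executor": {"cores": 2, "memory": "4g", "instances": 5}
  }'
```

Because users only talk to the gateway, clusters behind it can be added, drained, or replaced without changing how anyone submits jobs.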
Summary
● Spark on Kubernetes: Low Operation Cost
● A Lot of Work to Get Started!
● Need Easy Deployment
○ Automated
○ Repeatable
○ Zero Time to Market
DataPunch Project - One Click Deploy
https://github.com/datapunchorg/punch
One Click To Deploy:
○ Create EKS/Kubernetes Cluster
○ Set up IAM Policy
○ Create Service Account
○ Set up Auto Scaling
○ Install Spark Operator
○ Install Spark API Gateway
Then: Curl / CLI to Submit Application
DataPunch Benefits
● Learning Curve: Days -> Minutes
● Time to Deploy: Hours -> Minutes
● Operation: Manual -> Automated and Repeatable
Take-Away
● Decouple End User from Kubernetes Resources
○ Reduce Learning Curve
● REST API Gateway to Submit Application
○ Dynamically Add/Remove Kubernetes Clusters without user impact
● Repeatable/Easy Deployment - DataPunch Project
○ Simplify Operation
○ https://github.com/datapunchorg/punch
● References
○ Challenges of Running Spark on Kubernetes - Part 1
○ Challenges of Running Spark on Kubernetes - Part 2