Building a Data Platform with Apache Spark on Kubernetes

Many platforms and frameworks are starting to support Kubernetes as a first-class target, and Apache Spark, an analytics engine for large-scale data processing, is one of them. Since Spark 2.3, Spark can run on clusters managed by Kubernetes. PUBG Corporation, which operates an online video game serving tens of millions of users, decided to migrate its on-demand data analytics platform to Spark on Kubernetes. In these slides, Jihwan Chun and Gyutak Kim describe the challenges and solutions involved in building a brand-new data platform powered by Spark on Kubernetes.

  1. Building a Data Platform with Spark on Kubernetes. Jihwan Chun / Gyutak Kim, WeAreDeveloper Congress Vienna 2019
  2. Introduction. Gyutak Kim: Gyutak is a Data Engineer at PUBG Corporation who builds the data platform and ETL pipelines for the service. His current goal is to migrate the existing systems to container-based infrastructure across all data services. Jihwan Chun: Jihwan is a Software Engineer at PUBG Corporation who is enthusiastic about cloud infrastructure, containers, and Kubernetes. His recent primary focus is building a resilient and scalable platform for large-scale services.
  3. Agenda ● Data engineering and data platforms ● Problems we encountered operating the platform ● Motivations to migrate to Kubernetes ● Data platform with Spark on Kubernetes ● Challenges & remaining work
  4. Data Engineering. (Monica Rogati, The AI Hierarchy of Needs.) Data engineering covers data flow, ETL pipelines, storage, logging infrastructure, ...
  5. Building a Data Platform: What is our goal? ● Serve all data produced from the service ● Provide platforms to utilize data (diagram: Game Microservices, Kinesis Logstream, S3 Log Buckets, Data Mart, Data Analysts, Data Engineers, Business Intelligence; icons from www.flaticon.com)
  6. Building a Data Platform: Data Platform ● Provide easier access to data as a workplace for data science tasks (diagram: Game Microservices, Kinesis Logstream, S3 Log Buckets, Data Mart, Data Analysts, Data Engineers, Business Intelligence, Data Platform; icons from www.flaticon.com)
  7. What is Apache Spark? An open-source distributed cluster-computing framework with an in-memory data processing engine (diagram: Driver with Spark Session (SparkContext) on the master node, Cluster Manager, Executors on worker nodes). Ref: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-overview.html
  8. Building a Data Platform ● Notebook platform with Jupyter + Spark ● Batch system with Airflow + Spark
  9. Notebook Platform (diagram: User, Platform Server, Analysis Environment with Spark / Jupyter on EC2 instances; icons from www.flaticon.com)
  10. Batch System (diagram: Job Scheduler submits batch scripts to computing resources; icons from www.flaticon.com)
  11. Problems We Encountered: A lot of repetition for every user, every day (diagram: User, Platform Server, Analysis Environment. Repeat every day! For 20+ analysts, 500+ instances; icons from www.flaticon.com)
  12. Problems We Encountered: Hard to optimize scheduling for batch jobs of various sizes (diagram: Job Scheduler, Spark Clusters; jobs require ~2 GB, ~500 GB, or ~2 TB of data; icons from www.flaticon.com)
  13. Problems We Encountered ● Takes a long time to launch Spark clusters ● Error-prone provisioning steps ● Absence of resource scheduling for various sizes of workloads ⇒ Solution: Kubernetes
  14. What is Kubernetes? As a container orchestration platform, it... ● Abstracts away infrastructure and provides a declarative CRUD interface ● Provides a runtime for containers ● Schedules/manages containers and keeps them up and running: health checks, scaling, load balancing...
  15. Why Kubernetes? 1. Abstracted configuration (‘manifest’) (diagram: Kubernetes manifest describing Pods, a Service, and a Volume; Chart.yaml plus config values)
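The ‘abstracted configuration’ point is that the desired state lives in declarative manifests rather than provisioning scripts. A minimal sketch of such a manifest; the names and labels here are illustrative, not taken from the deck's actual charts:

```yaml
# Hypothetical manifest sketch: a Service exposing a Spark driver pod.
# Kubernetes reconciles the cluster toward this declared state.
apiVersion: v1
kind: Service
metadata:
  name: spark-driver-svc        # illustrative name
spec:
  selector:
    app: spark-driver           # matches pods carrying this label
  ports:
    - port: 4040                # Spark web UI port
      targetPort: 4040
```

With Helm, values like names and resource sizes are templated out of such manifests, which is what makes per-user Spark deployments repeatable.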
  16. Why Kubernetes? 2. Resource allocation/management ● Efficient resource allocation ● Easy to scale out with the cluster autoscaler (diagram: Docker containers packed onto nodes)
  17. Spark on Kubernetes (Spark 2.3+). Ref: https://spark.apache.org/docs/latest/running-on-kubernetes.html
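Per the referenced documentation, submitting a Spark application to a Kubernetes cluster looks roughly like the following; the API server address, image name, and jar path are placeholders, not values from the deck:

```shell
spark-submit \
  --master k8s://https://<api-server-host>:6443 \
  --deploy-mode cluster \
  --name my-spark-app \
  --conf spark.executor.instances=3 \
  --conf spark.kubernetes.container.image=<your-spark-image> \
  local:///path/to/app.jar
```

In cluster mode the driver itself runs as a pod, and Spark then creates the executor pods through the Kubernetes API.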
  18. Notebook Platform on Kubernetes (diagram: users reach the platform through a client, ingress, and load balancer; an RPC server manages Spark deployments via Tiller; Spark master and executor pods run on Kubernetes nodes; users deploy Spark and Jupyter, and run Spark workloads via Jupyter Notebook)
  19. Notebook Platform on Kubernetes
  20. Notebook Platform on Kubernetes
  21. Notebook Platform on Kubernetes
  22. Batch on Kubernetes: Reasons to run batch on Kubernetes ● The complexity of YARN is too high to be managed by smaller teams ● Spark on EC2 or EMR takes longer to launch ● Easier Spark launches on Kubernetes lead to better parallelism
  23. Batch on Kubernetes (diagram: the Airflow scheduler/web server and its executors deploy Spark clusters for batch workloads via Tiller; an RPC server manages Spark deployments; Spark master and executor pods run on Kubernetes nodes; users reach the platform through a client, ingress, and load balancer)
  24. Running Clusters at Scale: Two main problems, both related to Kubernetes scheduling behaviour: ● Scaling down the cluster force-drains pods → wait for executor pods to finish before terminating a node ● Sparse pod distribution leads to under-utilization → implement a custom scheduler optimized for Spark clusters
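The first rule, waiting for executor pods to finish before terminating a node, can be sketched in plain Python. The pod model and function names below are hypothetical simplifications, not PUBG's actual code; the `spark-role=executor` label, however, is the one Spark on Kubernetes attaches to executor pods:

```python
# Sketch: only let the autoscaler terminate a node once no Spark
# executor pod on it is still running. The Pod model is hypothetical.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Pod:
    name: str
    labels: Dict[str, str] = field(default_factory=dict)
    phase: str = "Running"  # "Running", "Succeeded", "Failed", ...

def is_spark_executor(pod: Pod) -> bool:
    # Spark on Kubernetes labels executor pods with spark-role=executor.
    return pod.labels.get("spark-role") == "executor"

def node_safe_to_terminate(pods_on_node: List[Pod]) -> bool:
    # The node is drainable only when every executor pod has finished;
    # non-Spark pods are assumed evictable and are ignored here.
    return all(
        pod.phase in ("Succeeded", "Failed")
        for pod in pods_on_node
        if is_spark_executor(pod)
    )
```

In a real setup this check would drive a node annotation or a scale-down hook so the cluster autoscaler skips nodes with live executors.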
  25. Default Scheduler Behaviour: It ‘spreads pods’ to balance the load ● Spreads pods across the hosts ● Favors nodes with fewer requested resources ● Favors nodes with balanced resource usage (diagram: Kubernetes nodes created 5 minutes ago, 2 hours ago, and a day ago)
  28. Default Scheduler Behaviour (continued): When deployments are removed...
  29. ...we can ‘evict’ some pods to ‘drain’ a node (before scaling down the cluster)
  30. But what if the pods should not be evicted?
  31. Default Scheduler Behaviour. Solution: custom scheduling ● To minimize the scenarios where pods must be evicted ● To maximize cluster utilization
  32. Custom Scheduling for Spark: The Kubernetes scheduler is extensible ● A ‘scheduler extender’ implements an HTTP API that serves requests from the Kubernetes scheduler ● It filters / prioritizes which node to schedule on. Custom scheduling for Spark: ● Prefer fresher nodes (lets the autoscaler gracefully stop older nodes) ● Schedule pods on the smallest number of nodes (to utilize the local network) ● ... other scheduling ideas ...
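The two preferences above (fresher nodes, fewer nodes) can be sketched as the scoring step of an extender's ‘prioritize’ call. Everything here is an illustrative simplification under assumed node attributes, not the actual extender described in the deck:

```python
# Sketch of a scheduler-extender-style prioritize step: score each
# candidate node, higher is better. The Node model is hypothetical.
from dataclasses import dataclass
from typing import List

@dataclass
class Node:
    name: str
    age_hours: float   # time since the node was created
    spark_pods: int    # Spark pods already running on this node

def score(node: Node, max_age_hours: float = 24.0) -> int:
    # Prefer fresher nodes, so older ones empty out and the
    # cluster autoscaler can stop them gracefully.
    freshness = max(0.0, 1.0 - node.age_hours / max_age_hours)
    # Prefer nodes already hosting Spark pods, packing a cluster onto
    # as few nodes as possible (better local-network utilization).
    packing = min(node.spark_pods, 10) / 10.0
    # kube-scheduler priorities are conventionally 0..10 integers.
    return round(10 * (0.5 * freshness + 0.5 * packing))

def prioritize(nodes: List[Node]) -> List[Node]:
    # Order candidates best-first, like an extender's response would.
    return sorted(nodes, key=score, reverse=True)
```

A real extender would expose this behind the HTTP endpoints configured in the scheduler's extender policy, returning per-node scores for the scheduler to combine with its own.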
  33. Improvements Achieved: 1. Cluster management gets easier (diagram: Spark master and executor pods)
  34. Improvements Achieved: 2. Isolated workloads while sharing resources: dedicated resources per Spark workload; Spark workloads isolated via Kubernetes scheduling
  35. Improvements Achieved: 3. Orchestration and monitoring made easier
  36. Improvements Achieved: 3. Orchestration and monitoring made easier (diagram: nodes reporting to a metrics server and a monitoring system)
  37. Future Work ● Dynamic resource allocation (SPARK-27963; estimated for Spark 3.0) ● Spark streaming on Kubernetes ● More optimized scheduling algorithms ● Fine-grained cost analysis ● Support for GPU resources
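On the first item: SPARK-27963 added shuffle tracking in Spark 3.0, which lets dynamic allocation work on Kubernetes without an external shuffle service. A hedged configuration sketch (executor counts are arbitrary example values):

```
spark.dynamicAllocation.enabled                  true
spark.dynamicAllocation.shuffleTracking.enabled  true
spark.dynamicAllocation.minExecutors             1
spark.dynamicAllocation.maxExecutors             20
```

With shuffle tracking, Spark only releases executors whose shuffle data is no longer needed, trading some reclamation speed for correctness.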
  38. Key Takeaways ● Apache Spark works as a core component of the data platform ○ Helped us manage tons of data with a smaller team ● Running Spark on Kubernetes helps manage and schedule the workloads ○ Declarative method for deployments ○ Optimized resource management ○ Makes use of the ‘orchestration’ done by Kubernetes ● Challenges may arise when scaling the clusters ○ Spark on Kubernetes is still at the experimental stage
  39. End of Presentation. Gyutak Kim (gyutak-kim), Jihwan Chun (jihwanchun)
