Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
Anirudh Ramananathan (foxish@google.com)
Software Engineer (Kubernetes)
Timothy Chen (tim@hyperpilot.io)
Co-founder & CTO ...
Agenda
• Kubernetes & Containers
• Motivation
• Design
• Demo
• Deep Dive
• Roadmap
What is Kubernetes?
Kubernetes
Kubernetes is an open-source system
Kubernetes
Kubernetes is an open-source system for automating
deployment, scaling, and management
Kubernetes
Kubernetes is an open-source system for automating
deployment, scaling, and management of containerized
applica...
‘Containerized’
Containers
libs
app
kernel
libs
app
libs
app
libs
app
• Repeatable Builds and Workflows
• Application Portability
• High D...
• Large OSS Community - 1200+ contributors and 45k+ commits
• Ecosystem and Partners - 100+ organizations involved
• One o...
Overview
At a Glance
kubelet
UI
kubeletCLI
API
users master nodes
etcd
kubelet
scheduler
controllers
apiserver
Nodes and Pods
Pod
Volume
Containers
Pod
Containers
8080 8080 8080
Volume
Node
• A pod is a set of co-located containers
•...
Motivation
Why Spark on Kubernetes?
• Docker and the Container Ecosystem
• Kubernetes
– Lots of addon services: third-party logging, ...
Design
Spark, meet Kubernetes!
Spark Core
Kubernetes Standalone YARN Mesos
GraphX SparkSQL MLib Streaming
Spark, meet Kubernetes!
Spark Core Kubernetes Scheduler Backend
Kubernetes Clusternew executors
remove executors
configura...
Kubernetes, meet Spark!
Kubernetes Cluster
File Staging Server
• Staging server: component to
stage local files
• Spark Sh...
Kubernetes
Integration
Dependencies
Container images with dependencies
baked in
Files from GCS/S3/HDFS/HTTP
File Staging S...
Administration
Namespaces
Resource
Accounting
Logging
Monitoring
Resource
Quota
Pluggable
Authorization
Admission
Control
...
Focus Areas
Wordcloud of the command-line options we added to spark-submit
on Kubernetes
Demo
Deep Dive
Deep Dive
spark-subm
it
kubernetes cluster
apiserver
scheduler
• Spark Submit submits job to K8s
• Spark Submit submits job to K8s
• K8s schedules the driver for job
Deep Dive
kubernetes cluster
apiserver
scheduler sche...
• Spark Submit submits job to K8s
• K8s schedules the driver for job
Deep Dive
• Spark Submit submits job to K8s
• K8s sch...
• Spark Submit submits job to K8s
• K8s schedules the driver for job
Deep Dive
• Spark Submit submits job to K8s
• K8s sch...
• Spark Submit submits job to K8s
• K8s schedules the driver for job
Deep Dive
• Spark Submit submits job to K8s
• K8s sch...
• Spark Submit submits job to K8s
• K8s schedules the driver for job
Deep Dive
• Spark Submit submits job to K8s
• K8s sch...
Roadmap
Spark Roadmap
Spark Shell
Client Mode
Python/R support
Cluster Mode
Java/Scala Support
Dynamic Allocation Local File Stagi...
We’re just getting started...
• Kubernetes CustomResources
• Priorities and Preemption for Pods
• Batch Scheduling and Res...
Contributors
Organizations Alphabetically:
• Google
• Haiwen
• Hyperpilot
• Intel
• Palantir
• Pepperdata
• Red Hat
Links:...
Thank You.
HDFS on Kubernetes - Lessons Learned
June 7 at 11:00 AM in Room 2003
Join us Wednesdays at 10am PT at the SIG B...
Upcoming SlideShare
Loading in …5
×

Apache Spark on Kubernetes Anirudh Ramanathan and Tim Chen

3,412 views

Published on

Kubernetes is a fast growing open-source platform which provides container-centric infrastructure. Conceived by Google in 2014, and leveraging over a decade of experience running containers at scale internally, it is one of the fastest moving projects on GitHub with 1000+ contributors and 40,000+ commits. Kubernetes has first class support on Google Cloud Platform, Amazon Web Services, and Microsoft Azure.

Unlike YARN, Kubernetes started as a general purpose orchestration framework with a focus on serving jobs. Support for long-running, data intensive batch workloads required some careful design decisions. Engineers across several organizations have been working on Kubernetes support as a cluster scheduler backend within Spark. During this process, we encountered several challenges in translating Spark considerations into idiomatic Kubernetes constructs. In this talk, we describe the challenges and the ways in which we solved them. This talk will be technical and is aimed at people who are looking to run Spark effectively on their clusters. The talk assumes basic familiarity with cluster orchestration and containers.

Published in: Data & Analytics
  • Hello! High Quality And Affordable Essays For You. Starting at $4.99 per page - Check our website! https://vk.cc/82gJD2
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here

Apache Spark on Kubernetes Anirudh Ramanathan and Tim Chen

  1. 1. Anirudh Ramananathan (foxish@google.com) Software Engineer (Kubernetes) Timothy Chen (tim@hyperpilot.io) Co-founder & CTO (HyperPilot) Apache Spark on Kubernetes
  2. 2. Agenda • Kubernetes & Containers • Motivation • Design • Demo • Deep Dive • Roadmap
  3. 3. What is Kubernetes?
  4. 4. Kubernetes Kubernetes is an open-source system
  5. 5. Kubernetes Kubernetes is an open-source system for automating deployment, scaling, and management
  6. 6. Kubernetes Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications.
  7. 7. ‘Containerized’
  8. 8. Containers libs app kernel libs app libs app libs app • Repeatable Builds and Workflows • Application Portability • High Degree of Control over Software • Faster Development Cycle • Reduced dev-ops load • Improved Infrastructure Utilization
  9. 9. • Large OSS Community - 1200+ contributors and 45k+ commits • Ecosystem and Partners - 100+ organizations involved • One of the top 100 projects overall on GitHub - 23k+ stars • Large production deployments on-prem and on various cloud providers • Built with multi-tenant and multi-cloud deployments in mind Kubernetes
  10. 10. Overview
  11. 11. At a Glance kubelet UI kubeletCLI API users master nodes etcd kubelet scheduler controllers apiserver
  12. 12. Nodes and Pods Pod Volume Containers Pod Containers 8080 8080 8080 Volume Node • A pod is a set of co-located containers • Created by a declarative specification supplied to the master • Each pod has its own IP address • Volumes can be local or network-attached
  13. 13. Motivation
  14. 14. Why Spark on Kubernetes? • Docker and the Container Ecosystem • Kubernetes – Lots of addon services: third-party logging, monitoring, and security tools – For example, the Istio project, announced May 24, by IBM, Google and Lyft, provides a service mesh for authenticating, authorizing, tracing, and timing, and rate-limiting container-to-container communication, and more. • Resource sharing between batch, serving and stateful workloads – Streamlined developer experience – Reduced operational costs – Improved infrastructure utilization
  15. 15. Design
  16. 16. Spark, meet Kubernetes! Spark Core Kubernetes Standalone YARN Mesos GraphX SparkSQL MLib Streaming
  17. 17. Spark, meet Kubernetes! Spark Core Kubernetes Scheduler Backend Kubernetes Clusternew executors remove executors configuration • Resource Requests • Authnz • Communication with K8s • Runs Spark Drivers/Executors • Runs Shuffle Service • Runs Additional Components for Spark jobs
  18. 18. Kubernetes, meet Spark! Kubernetes Cluster File Staging Server • Staging server: component to stage local files • Spark Shuffle service: component to store shuffle data for dynamic allocation • ThirdParty/CustomResources: extend Kubernetes API with Spark Knowledge Shuffle Service SparkJob API Endpoint
  19. 19. Kubernetes Integration Dependencies Container images with dependencies baked in Files from GCS/S3/HDFS/HTTP File Staging Server Staged files and JARs Several ways of running Spark Jobs along with their dependencies on Kubernetes
  20. 20. Administration Namespaces Resource Accounting Logging Monitoring Resource Quota Pluggable Authorization Admission Control RBAC • Launch Spark Jobs as a particular user into a specific namespace • RBAC and Namespace-level resource quotas • Audit logging for clusters • Several monitoring solutions to see node, cluster and pod-level statistics
  21. 21. Focus Areas Wordcloud of the command-line options we added to spark-submit on Kubernetes
  22. 22. Demo
  23. 23. Deep Dive
  24. 24. Deep Dive spark-subm it kubernetes cluster apiserver scheduler • Spark Submit submits job to K8s
  25. 25. • Spark Submit submits job to K8s • K8s schedules the driver for job Deep Dive kubernetes cluster apiserver scheduler schedule driver pod spark driver
  26. 26. • Spark Submit submits job to K8s • K8s schedules the driver for job Deep Dive • Spark Submit submits job to K8s • K8s schedules the driver for job • Driver requests executors as needed kubernetes cluster apiserver scheduler spark driver create executor pods
  27. 27. • Spark Submit submits job to K8s • K8s schedules the driver for job Deep Dive • Spark Submit submits job to K8s • K8s schedules the driver for job • Driver requests executors as needed • Executors scheduled and created kubernetes cluster apiserver scheduler spark driver schedule executorpods executors
  28. 28. • Spark Submit submits job to K8s • K8s schedules the driver for job Deep Dive • Spark Submit submits job to K8s • K8s schedules the driver for job • Driver requests executors as needed • Executors scheduled and created • Executors run tasks kubernetes cluster apiserver scheduler spark driver executors
  29. 29. • Spark Submit submits job to K8s • K8s schedules the driver for job Deep Dive • Spark Submit submits job to K8s • K8s schedules the driver for job • Driver requests executors as needed • Executors scheduled and created • Executors run tasks • Driver “completes” job and persists logs kubernetes cluster apiserver scheduler spark driver
  30. 30. Roadmap
  31. 31. Spark Roadmap Spark Shell Client Mode Python/R support Cluster Mode Java/Scala Support Dynamic Allocation Local File Staging Spark Streaming High Availability Spark SQL GraphX MLib Dec 2016 Development Began Mar 2017 Alpha Release June 2017 Beta Release Nov 2016 Design = supported but untested = not yet supported
  32. 32. We’re just getting started... • Kubernetes CustomResources • Priorities and Preemption for Pods • Batch Scheduling and Resource Sharing • Cluster Federation and Multi-cloud deployments • Ecosystem: Kafka, Cassandra, HDFS, etc
  33. 33. Contributors Organizations Alphabetically: • Google • Haiwen • Hyperpilot • Intel • Palantir • Pepperdata • Red Hat Links: • Spark 2.2.0 Documentation • https://issues.apache.org/jira/bro wse/SPARK-18278 • https://github.com/apache-spark- on-k8s/spark • https://github.com/kubernetes/ku bernetes/issues/34377
  34. 34. Thank You. HDFS on Kubernetes - Lessons Learned June 7 at 11:00 AM in Room 2003 Join us Wednesdays at 10am PT at the SIG BigData meeting https://github.com/kubernetes/community/

×