WORKSHOP
Spark on Kubernetes
Create and set up your Spark cluster on Kubernetes
Leah Kolben, CTO
@leah4kosh
leah@cnvrg.io
whoami
• Developer/Data scientist => CTO
• cnvrg.io – built by data scientists, for data scientists, to help teams:
• Get from data to models to production as efficiently and quickly as possible
• Bridge science and engineering
agenda
• Introduction
• What is Spark?
• Spark deployment modes
• What is Kubernetes?
• Spark on Kubernetes – pros and cons
• Live Workshop
• Summary
What is Spark?
• Unified analytics engine for large-scale data processing
• In-memory cluster computing gives much faster processing (up to 100x over disk-based MapReduce)
• Supports different workloads – batch, iterative, streaming, interactive SQL, etc.
• Supports multiple languages (Scala, Java, Python, R) and different environments
Spark deployment modes
• Hadoop YARN
• Apache Mesos
• Kubernetes
• Standalone
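Each deployment mode corresponds to a different `--master` URL passed to spark-submit. A minimal sketch, assuming placeholder host names, ports, and an `app.py` application that are not part of the workshop:

```shell
# The cluster manager is selected by the --master URL; hosts/ports are placeholders.

# Hadoop YARN (cluster location is read from HADOOP_CONF_DIR)
spark-submit --master yarn --deploy-mode cluster app.py

# Apache Mesos
spark-submit --master mesos://mesos-master:5050 app.py

# Kubernetes (note the k8s:// prefix before the API server URL)
spark-submit --master k8s://https://k8s-apiserver:6443 --deploy-mode cluster app.py

# Standalone
spark-submit --master spark://spark-master:7077 app.py
```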
Kubernetes - recap
• Provides a runtime environment for Docker containers
• Provides an abstraction layer for containers to run on
• All services are natively load balanced
• Can scale up and down dynamically
• Monitors the health of containers
• Schedules jobs and cron jobs
• Uses the same API across EVERY cloud provider and bare metal!
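Most of these capabilities map to one-line kubectl commands. A hedged sketch, where the deployment name, images, and schedule are illustrative and not taken from the workshop:

```shell
kubectl create deployment web --image=nginx:alpine   # run containers
kubectl scale deployment web --replicas=5            # scale up and down
kubectl autoscale deployment web --min=2 --max=10    # dynamic scaling
kubectl expose deployment web --port=80              # load-balanced service
kubectl get pods --watch                             # monitor container health
kubectl create cronjob nightly --image=busybox \
  --schedule="0 2 * * *" -- echo "scheduled run"     # cron jobs
```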
Spark Architecture on Kubernetes
• spark-submit submits the Spark application directly to the Kubernetes API
server (the resulting pods can be inspected with kubectl):
• Spark creates the Spark driver as a pod
• The driver creates executor pods and runs the application code
• When the job is done, executor pods are terminated and resources are
cleaned up
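The flow above can be sketched as a single spark-submit call. Everything in angle brackets (API server address, registry, driver pod name) is a placeholder for your own cluster, not a value from the workshop, and the namespace and service account shown are illustrative:

```shell
spark-submit \
  --master k8s://https://<k8s-apiserver>:6443 \
  --deploy-mode cluster \
  --name spark-pi \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.namespace=default \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.container.image=<registry>/spark-py:latest \
  local:///opt/spark/examples/src/main/python/pi.py

# Driver and executor pods appear while the job runs:
kubectl get pods -w

# After completion the executors are gone, but the driver pod remains
# in Completed state, so its logs are still available:
kubectl logs <driver-pod-name>
```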
Spark on Kubernetes
• Kubernetes can manage unified containerized pipelines
• Optimized resource sharing across workloads
• Leverage Kubernetes resources: persistent volumes (PV), service mesh, scheduling
• Kubernetes support in Spark is still in beta
Running our first workload
• Use GKE for the Kubernetes cluster (with autoscaling enabled)
• Build a Spark Docker image for Kubernetes to use
• Run pi.py on the cluster
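The three steps above can be sketched roughly as follows. The cluster name, registry, and node counts are illustrative placeholders, and exact flags may differ by gcloud and Spark version:

```shell
# 1. Create a GKE cluster with autoscaling enabled
gcloud container clusters create spark-workshop \
  --num-nodes 3 --enable-autoscaling --min-nodes 1 --max-nodes 5

# 2. Build and push a Spark image with Python support, using the
#    docker-image-tool.sh shipped in the Spark distribution
#    (run from the unpacked Spark directory)
./bin/docker-image-tool.sh -r <your-registry> -t v1 \
  -p kubernetes/dockerfiles/spark/bindings/python/Dockerfile build
./bin/docker-image-tool.sh -r <your-registry> -t v1 push

# 3. Run pi.py on the cluster (API server address taken from kubeconfig)
spark-submit \
  --master k8s://$(kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}') \
  --deploy-mode cluster \
  --conf spark.kubernetes.container.image=<your-registry>/spark-py:v1 \
  local:///opt/spark/examples/src/main/python/pi.py
```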
Let’s do it!
Why use cnvrg to run your Spark workloads?
• Combines Spark and Kubernetes into one unified system
• Reproducible jobs: artifacts are linked to workloads
• Monitor your Spark workload health
• One unified dashboard for all your projects and workloads
• Simple & fast
• Clarity
DEMO
Summary
• Spark is a unified analytics engine for large-scale data processing
• Kubernetes is a platform for container orchestration
• Overview of Spark's different deployment modes
• Overview of Spark-on-Kubernetes architecture
• Overview of Spark on Kubernetes: pros vs. cons
• Submit a Spark job directly to a Kubernetes cluster
• Manage, monitor, and automate your workloads using cnvrg
Thanks!
https://cnvrg.io
info@cnvrg.io
+972-506-660186
