WORKSHOP
Spark on Kubernetes
Create and set up your Spark cluster on Kubernetes
Leah Kolben, CTO
@leah4kosh
leah@cnvrg.io
whoami
• Developer/Data scientist => CTO
• cnvrg.io – built by data scientists, for data scientists, to help teams:
• Get from data to models to production as efficiently and quickly as possible
• Bridge science and engineering
agenda
• Introduction
• What is Spark?
• Spark deployment modes
• What is Kubernetes?
• Spark on Kubernetes – pros and cons
• Live Workshop
• Summary
What is Spark?
• Unified analytics engine for large-scale data processing
• In-memory cluster computing gives much faster processing (up to 100x over disk-based MapReduce)
• Supports different workloads – batch, iterative, streaming, interactive SQL, etc.
• Supports multiple languages (Scala, Java, Python, R) and different environments
Spark deployment modes
• Hadoop YARN
• Apache Mesos
• Kubernetes
• Standalone
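Each deployment mode corresponds to a different `--master` URL passed to spark-submit. A minimal sketch, assuming placeholder host names, ports, and an `app.py` application that are not part of the workshop:

```shell
# The cluster manager is selected by the --master URL; hosts/ports are placeholders.

# Hadoop YARN (cluster location is read from HADOOP_CONF_DIR)
spark-submit --master yarn --deploy-mode cluster app.py

# Apache Mesos
spark-submit --master mesos://mesos-master:5050 app.py

# Kubernetes (note the k8s:// prefix before the API server URL)
spark-submit --master k8s://https://k8s-apiserver:6443 --deploy-mode cluster app.py

# Standalone
spark-submit --master spark://spark-master:7077 app.py
```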
Kubernetes - recap
• Provides a runtime environment for Docker containers
• Provides an abstraction layer for containers to run on
• All services are natively load balanced
• Can scale up and down dynamically
• Monitors the health of containers
• Schedules jobs and cron jobs
• Uses the same API across EVERY cloud provider and bare metal!
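Most of these capabilities map to one-line kubectl commands. A hedged sketch, where the deployment name, images, and schedule are illustrative and not taken from the workshop:

```shell
kubectl create deployment web --image=nginx:alpine   # run containers
kubectl scale deployment web --replicas=5            # scale up and down
kubectl autoscale deployment web --min=2 --max=10    # dynamic scaling
kubectl expose deployment web --port=80              # load-balanced service
kubectl get pods --watch                             # monitor container health
kubectl create cronjob nightly --image=busybox \
  --schedule="0 2 * * *" -- echo "scheduled run"     # cron jobs
```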
Spark Architecture on Kubernetes
• spark-submit submits the Spark application directly to the Kubernetes API
server (the resulting pods can be inspected with kubectl):
• Spark creates the Spark driver as a pod
• The driver creates executor pods and runs the application code
• When the job is done, executor pods are terminated and resources are
cleaned up
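The flow above can be sketched as a single spark-submit call. Everything in angle brackets (API server address, registry, driver pod name) is a placeholder for your own cluster, not a value from the workshop, and the namespace and service account shown are illustrative:

```shell
spark-submit \
  --master k8s://https://<k8s-apiserver>:6443 \
  --deploy-mode cluster \
  --name spark-pi \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.namespace=default \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.container.image=<registry>/spark-py:latest \
  local:///opt/spark/examples/src/main/python/pi.py

# Driver and executor pods appear while the job runs:
kubectl get pods -w

# After completion the executors are gone, but the driver pod remains
# in Completed state, so its logs are still available:
kubectl logs <driver-pod-name>
```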
Spark on Kubernetes
• Kubernetes can manage unified containerized pipelines
• Optimized resource sharing across workloads
• Leverage Kubernetes resources: persistent volumes (PV), service mesh, scheduling
• Kubernetes support in Spark is still in beta
Running our first workload
• Use GKE for the Kubernetes cluster (with autoscaling enabled)
• Build a Spark Docker image for Kubernetes to use
• Run pi.py on the cluster
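The three steps above can be sketched roughly as follows. The cluster name, registry, and node counts are illustrative placeholders, and exact flags may differ by gcloud and Spark version:

```shell
# 1. Create a GKE cluster with autoscaling enabled
gcloud container clusters create spark-workshop \
  --num-nodes 3 --enable-autoscaling --min-nodes 1 --max-nodes 5

# 2. Build and push a Spark image with Python support, using the
#    docker-image-tool.sh shipped in the Spark distribution
#    (run from the unpacked Spark directory)
./bin/docker-image-tool.sh -r <your-registry> -t v1 \
  -p kubernetes/dockerfiles/spark/bindings/python/Dockerfile build
./bin/docker-image-tool.sh -r <your-registry> -t v1 push

# 3. Run pi.py on the cluster (API server address taken from kubeconfig)
spark-submit \
  --master k8s://$(kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}') \
  --deploy-mode cluster \
  --conf spark.kubernetes.container.image=<your-registry>/spark-py:v1 \
  local:///opt/spark/examples/src/main/python/pi.py
```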
Let’s do it!
Why use cnvrg to run your Spark workloads?
• Combines Spark and Kubernetes into one unified system
• Reproducible jobs: artifacts are linked to workloads
• Monitor your Spark workload health
• One unified dashboard for all your projects and workloads
• Simple & fast
• Clarity
DEMO
Summary
• Spark is a unified analytics engine for large-scale data processing
• Kubernetes is a platform for container orchestration
• Overview of Spark's different deployment modes
• Overview of Spark-on-Kubernetes architecture
• Overview of Spark on Kubernetes: pros vs. cons
• Submit a Spark job directly to a Kubernetes cluster
• Manage, monitor, and automate your workloads using cnvrg
Thanks!
https://cnvrg.io
info@cnvrg.io
+972-506-660186
