Building and deploying an analytics service in the cloud is a challenge; maintaining it is a bigger one. In a world where users increasingly provision cluster instances on the fly, use them for analytics or other workloads, and shut them down when the jobs are done, containers and container orchestration are more relevant than ever.
Container orchestrators like Kubernetes can be used to deploy and distribute modules quickly, easily, and reliably. The intent of this talk is to share our experience of building such a service and deploying it on a Kubernetes cluster. We will discuss the requirements that an enterprise-grade Hadoop/Spark cluster running on containers places on a container orchestrator.
This talk will cover in detail how Kubernetes can meet our needs for resource management, scheduling, networking and network isolation, volume management, and more. We will discuss how we replaced our home-grown container orchestrator, which managed the container lifecycle and resources according to our requirements, with Kubernetes. We will also present the orchestrator features that are helping us deploy and patch thousands of containers, along with a list of areas that we believe can be improved or enhanced in a container orchestrator.
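The core job the talk describes, replacing a home-grown orchestrator that keeps containers matching a desired state, can be sketched as a reconciliation loop. This is a toy illustration only, not the Kubernetes API; all names are hypothetical.

```python
# Toy sketch of the reconciliation idea behind a container orchestrator:
# compare desired replica counts against running containers and converge.
def reconcile(desired, running):
    """Return (to_start, to_stop) lists of container names.

    desired: {component: replica count wanted}
    running: {component: replica count currently running}
    """
    to_start, to_stop = [], []
    for name, want in desired.items():
        have = running.get(name, 0)
        if want > have:
            to_start += [f"{name}-{i}" for i in range(have, want)]
        elif want < have:
            to_stop += [f"{name}-{i}" for i in range(want, have)]
    return to_start, to_stop

# Two datanode containers are missing, so the loop would start them.
start, stop = reconcile({"datanode": 3}, {"datanode": 1})
```

A real orchestrator layers scheduling, networking, and volume management on top of this loop, which is exactly the feature list the talk evaluates Kubernetes against.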
Speaker
Rachit Arora, SSE, IBM
Providing truly interactive and scalable BI on Hadoop has proven to be one of the biggest challenges preventing legacy EDW OLAP systems from completing their transition to Hadoop. While we have all seen benchmarks that run consecutive queries and claim success, thousands of concurrent business users sending complex generated queries from their dashboards over billions of records, at interactive speed, is yet to be seen.
In this session we will discuss how an architecture that replaces the full-scan, brute-force approach with adaptive indexing and auto-generated cubes can dramatically reduce the resources and effort per query, resulting in interactive performance for high-concurrency workloads, and explain how this is achieved with minimal data engineering effort. We will also discuss how this architecture can be seamlessly integrated with Hive to provide a complete OLAP-on-Hadoop solution.
The session will include a live demo of complex business dashboards connected to Hive and accessing billions of rows at interactive speed.
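The intuition behind auto-generated cubes can be shown in a few lines: aggregate once at ingest time so the dashboard query becomes a lookup instead of a scan. This is a toy illustration, not JethroData's implementation.

```python
from collections import defaultdict

# Toy illustration of why a pre-aggregated cube avoids full scans.
# Source rows: (region, month, revenue).
rows = [("US", "2017-01", 100), ("US", "2017-02", 150), ("EU", "2017-01", 80)]

# Build a cube keyed on (region, month) once, at ingest time.
cube = defaultdict(int)
for region, month, revenue in rows:
    cube[(region, month)] += revenue

# The dashboard query "revenue for US in 2017-01" is now a dictionary
# lookup; its cost no longer grows with the number of raw rows.
us_jan = cube[("US", "2017-01")]
```

With billions of rows, the per-query saving compounds across thousands of concurrent users, which is the concurrency story the session focuses on.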
Speaker
Boaz Raufman, CTO and Co-Founder, JethroData
Realizing the promise of portable data processing with Apache Beam (DataWorks Summit)
The world of big data involves an ever-changing field of players. Much as SQL stands as a lingua franca for declarative data analysis, Apache Beam aims to provide a portable standard for expressing robust, out-of-order data processing pipelines in a variety of languages across a variety of platforms. In a way, Apache Beam is a glue that can connect the Big Data ecosystem together; it enables users to "run-anything-anywhere".
This talk will briefly cover the capabilities of the Beam model for data processing, as well as the current state of the Beam ecosystem. We'll discuss Beam architecture and dive into the portability layer. We'll offer a technical analysis of Beam's powerful primitive operations that enable true and reliable portability across diverse environments. Finally, we'll demonstrate a complex pipeline running on multiple runners in multiple deployment scenarios (e.g., Apache Spark on Amazon Web Services, Apache Flink on Google Cloud, Apache Apex on-premises), and give a glimpse at some of the challenges Beam aims to address in the future.
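Beam's "run anywhere" idea boils down to separating what a pipeline computes from which engine executes it. The hypothetical mini-runner below is not Beam's API; it only illustrates that separation.

```python
# A pipeline described as data: a list of (kind, function) transforms.
# Each "runner" interprets the same description its own way.
pipeline = [("map", lambda x: x * 2), ("filter", lambda x: x > 2)]

def direct_runner(pipeline, data):
    """Naive in-process interpreter, standing in for Beam's DirectRunner."""
    for kind, fn in pipeline:
        if kind == "map":
            data = [fn(x) for x in data]
        elif kind == "filter":
            data = [x for x in data if fn(x)]
    return data

# The identical pipeline object could be handed to a Spark- or
# Flink-backed runner; only the interpretation changes, not user code.
result = direct_runner(pipeline, [1, 2, 3])
```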
It’s 2017, and big data challenges are as real as they get. Our customers have petabytes of data living in elastic and scalable commodity storage systems such as Azure Data Lake Store and Azure Blob storage.
One of the central questions today is finding insights from data in these storage systems in an interactive manner, at a fraction of the cost.
Interactive Query, leveraging Hive on LLAP from Apache Hive 2.1, brings interactivity to your complex data-warehouse-style queries on large datasets stored in commodity cloud storage.
In this session, you will learn how technologies such as Low Latency Analytical Processing (LLAP) and Hive 2.x make it possible to analyze petabytes of data with sub-second latency in common file formats such as CSV and JSON, without converting to columnar formats like ORC or Parquet. We will go deep into LLAP's performance and architecture benefits and how it compares with Spark and Presto in Azure HDInsight. We will also look at how business analysts can use familiar tools such as Microsoft Excel and Power BI to query their data lake interactively without moving data out of it.
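Much of LLAP's sub-second latency on cloud storage comes from keeping hot column data in memory and evicting cold data. The sketch below shows the general caching pattern with simple LRU eviction; it is illustrative only, not LLAP's actual cache policy or code.

```python
from collections import OrderedDict

# Toy column-chunk cache: hits are served from memory, misses fall
# through to (slow) cloud storage, and the least recently used chunk
# is evicted when the cache is full.
class ColumnCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.chunks = OrderedDict()

    def get(self, key, load):
        if key in self.chunks:
            self.chunks.move_to_end(key)    # mark as recently used
            return self.chunks[key]
        value = load(key)                    # cache miss: read remote storage
        self.chunks[key] = value
        if len(self.chunks) > self.capacity:
            self.chunks.popitem(last=False)  # evict least recently used
        return value

cache = ColumnCache(capacity=2)
cache.get("a", lambda k: "chunk-a")
cache.get("b", lambda k: "chunk-b")
cache.get("a", lambda k: "chunk-a")  # refreshes "a"
cache.get("c", lambda k: "chunk-c")  # evicts "b"
```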
Speaker
Ashish Thapliyal, Principal Program Manager, Microsoft Corp
An elastic batch- and stream-processing stack with Pravega and Apache Flink
Stream processing is a popular paradigm that is becoming more relevant as many applications provide low-latency response time and new application domains emerge that naturally demand data to be processed in motion. One particularly attractive characteristic of the stream-processing paradigm is that it conceptually unifies batch processing (bounded/static historic data) and continuous near-real-time data processing (unbounded streaming event data).
Implementing a unified batch and streaming data architecture is in practice not seamless—near-real-time event data and bulk historic data use different storage systems (message queues or logs vs. file systems or object stores). Consequently, running the same analysis now and at some arbitrary time in the future (e.g., months, possibly years ahead) means dealing with different data sources and APIs. Few systems are capable of handling both near-real-time streaming workloads and large batch workloads at the same time. And streaming workloads tend to be inherently dynamic, requiring both storage and compute to adjust continuously for maximum resource efficiency.
In this talk, we present an open source streaming data stack consisting of Pravega (stream storage) and Apache Flink (computation on streams). The combination of these two systems offers an unprecedented way of handling “everything as a stream,” while dynamically accommodating workload variations in a novel way. Pravega enables the ingestion capacity of a stream to grow and shrink according to workload and sends signals downstream to enable Flink to scale accordingly.
Pravega offers permanent streaming storage, exposing an API that enables applications to access data either in near-real time or at any arbitrary time in the future, in a uniform fashion. Apache Flink's SQL and streaming APIs provide a common interface for processing continuous near-real-time data, sets of historic data, or combinations of both. A deep integration between these two systems gives end-to-end exactly-once semantics for pipelines of streams and stream processing and lets both systems jointly scale and adjust automatically to changing data rates.
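The "everything as a stream" idea above, one append-only log serving both historical (batch) reads and tail (streaming) reads through one API, can be sketched in a few lines. This is purely illustrative and not Pravega's API.

```python
# Toy unified log: the same read() call serves batch-style history
# and stream-style tail reads, so "now" and "months later" use one API.
class StreamLog:
    def __init__(self):
        self.events = []

    def append(self, event):
        self.events.append(event)
        return len(self.events) - 1   # position (offset) of the event

    def read(self, start=0, end=None):
        """start=0 reads all history; start=head reads only new events."""
        return self.events[start:end]

log = StreamLog()
for e in ["order", "ship", "deliver"]:
    log.append(e)

history = log.read(0)   # batch-style: everything written so far
tail = log.read(2)      # stream-style: only the newest events
```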
Speakers:
Stephan Ewen, Co-Founder/ CTO, Data Artisans
Flavio Junqueira, Engineering Lead, Pravega by DellEMC
Accelerating TensorFlow with RDMA for high-performance deep learning
Google’s TensorFlow is one of the most popular deep learning (DL) frameworks. In distributed TensorFlow, gradient updates are a critical step governing the total model training time. These updates incur a massive volume of data transfer over the network.
In this talk, we first present a thorough analysis of the communication patterns in distributed TensorFlow. Then we propose a unified way of achieving high performance through enhancing the gRPC runtime with Remote Direct Memory Access (RDMA) technology on InfiniBand and RoCE. Through our proposed RDMA-gRPC design, TensorFlow only needs to run over the gRPC channel and gets the optimal performance. Our design includes advanced features such as message pipelining, message coalescing, zero-copy transmission, etc. The performance evaluations show that our proposed design can significantly speed up gRPC throughput by up to 1.5x compared to the default gRPC design. By integrating our RDMA-gRPC with TensorFlow, we are able to achieve up to 35% performance improvement for TensorFlow training with CNN models.
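One of the features listed above, message coalescing, has a simple core idea: pack many small gradient updates into fewer, larger sends to amortize per-message overhead. The sketch below is illustrative only; the real work happens inside the RDMA-enhanced gRPC channel.

```python
# Toy message coalescing: batch small payloads up to a byte budget.
def coalesce(messages, max_batch_bytes):
    """Group byte strings into batches of at most max_batch_bytes each."""
    batches, current, size = [], [], 0
    for msg in messages:
        if current and size + len(msg) > max_batch_bytes:
            batches.append(b"".join(current))   # flush the full batch
            current, size = [], 0
        current.append(msg)
        size += len(msg)
    if current:
        batches.append(b"".join(current))
    return batches

# Four 2-byte updates become two 4-byte sends: half as many
# network round trips for the same data volume.
sends = coalesce([b"aa", b"bb", b"cc", b"dd"], max_batch_bytes=4)
```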
Speakers
Dhabaleswar K (DK) Panda, Professor and University Distinguished Scholar, The Ohio State University
Xiaoyi Lu, Research Scientist, The Ohio State University
Stream processing has become the de facto standard for building real-time ETL and streaming analytics applications. We see batch workloads moving into stream processing to act on data and derive insights faster. With the explosion of data carrying "perishable insights," such as IoT and machine-generated data, stream processing combined with predictive analytics is driving tremendous business value. This is evidenced by the proliferation of stream-processing frameworks: the proven and evolving Apache Storm, and newer frameworks such as Apache Flink, Apache Apex, and Spark Streaming.
Today, users have to choose among these frameworks and understand the benefits of each; on top of that, they have to learn new APIs and operationalize their applications. To create value faster, we are introducing a new open source tool: Streamline. It is a self-service framework that eases building streaming applications and deploying them across whichever frameworks/engines users prefer, in a snap. It simplifies integration with machine learning models for scoring and classification of data for predictive analytics, and it provides an elegant way to build analytics dashboards that derive business insights from streaming data and make them easy for business users to consume.
In this talk, we will outline the fundamentals of real-time stream processing and demonstrate Streamline capabilities to show how it simplifies building real-time streaming analytics applications.
Speaker:
Priyank Shah, Staff Software Engineer, Hortonworks
Running secured Spark job in Kubernetes compute cluster and integrating with ...
This presentation will provide technical design and development insights for running a secured Spark job in a Kubernetes compute cluster that accesses job data from a Kerberized HDFS cluster. Joy will show how to run a long-running machine learning or ETL Spark job in Kubernetes and access data from HDFS using a Kerberos principal and delegation token.
The first part of this presentation will present the design and best practices for deploying and running Spark in Kubernetes integrated with HDFS: creating an on-demand multi-node Spark cluster at job submission, installing and resolving software dependencies (packages), executing and monitoring the workload, and finally disposing of the resources on job completion. The second part covers the design and development details of setting up a Spark-on-Kubernetes cluster that supports long-running jobs accessing data from secured HDFS storage by seamlessly creating and renewing Kerberos delegation tokens from the end user's Kerberos principal.
All the techniques covered in this presentation are essential for setting up a Spark-on-Kubernetes compute cluster that accesses data securely from a distributed storage cluster such as HDFS in a corporate environment. No prior knowledge of any of these technologies is required to attend this presentation.
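The token-renewal requirement for long-running jobs follows a simple pattern: renew well before expiry so executors never hold a stale token. The sketch below is hypothetical (names, intervals, and the renewal call are illustrative, not Hadoop or Kerberos APIs).

```python
import time

# Toy delegation-token renewer: a long-running job checks periodically
# and renews once the token passes a fraction of its lifetime.
class TokenRenewer:
    def __init__(self, lifetime, renew_fraction=0.75):
        self.lifetime = lifetime            # token lifetime in seconds
        self.renew_fraction = renew_fraction
        self.issued_at = time.monotonic()

    def needs_renewal(self, now=None):
        now = time.monotonic() if now is None else now
        return now - self.issued_at >= self.lifetime * self.renew_fraction

    def renew(self, now=None):
        # In a real cluster this would contact the NameNode / KDC
        # with the job's Kerberos credentials.
        self.issued_at = time.monotonic() if now is None else now

r = TokenRenewer(lifetime=100.0)
t0 = r.issued_at
early = r.needs_renewal(now=t0 + 10)   # token still fresh
late = r.needs_renewal(now=t0 + 80)    # past 75% of lifetime: renew
```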
Speaker
Joy Chakraborty, Data Architect
Present and future of unified, portable, and efficient data processing with A...
The world of big data involves an ever-changing field of players. Much as SQL stands as a lingua franca for declarative data analysis, Apache Beam aims to provide a portable standard for expressing robust, out-of-order data processing pipelines in a variety of languages across a variety of platforms. In a way, Apache Beam is a glue that can connect the big data ecosystem together; it enables users to "run any data processing pipeline anywhere."
This talk will briefly cover the capabilities of the Beam model for data processing and discuss its architecture, including the portability model. We'll focus on the present state of the community and the current status of the Beam ecosystem. We'll cover the state of the art in data processing and discuss where Beam is going next, including completion of the portability framework and Streaming SQL. Finally, we'll discuss areas of improvement and how anybody can join us on the path of creating the glue that interconnects the big data ecosystem.
Speaker
Davor Bonaci, Apache Software Foundation; Simbly, V.P. of Apache Beam; Founder/CEO at Operiant
Stream processing is emerging as a popular paradigm for data processing architectures because it handles the continuous nature of most data and computation, removing artificial boundaries and delays. In this talk, we are going to look at some of the most common misconceptions about stream processing and debunk them.
- Myth 1: Streaming is approximate and exactly-once is not possible.
- Myth 2: Streaming is for real-time only.
- Myth 3: You need to choose between latency and throughput.
- Myth 4: Streaming is harder to learn than batch processing.
We will look at these and other myths and debunk them using Apache Flink as an example. We will discuss Apache Flink's approach to high-performance stream processing with state, strong consistency, low latency, and sophisticated handling of time. With such building blocks, Apache Flink can handle classes of problems previously considered out of reach for stream processing. We will also take a sneak preview of the next steps for Flink.
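Myth 1 above is worth a tiny counter-example: with replayable input and a sink that deduplicates by offset, a failure-and-replay still yields exactly-once *results*. This sketch is illustrative only, not Flink's checkpointing code.

```python
# Toy exactly-once sink: duplicate deliveries caused by a replay are
# detected by offset, so the aggregated result stays correct.
class DedupSink:
    def __init__(self):
        self.seen = set()
        self.total = 0

    def write(self, offset, value):
        if offset in self.seen:     # duplicate delivery after a replay
            return
        self.seen.add(offset)
        self.total += value

sink = DedupSink()
events = [(0, 5), (1, 7), (2, 3)]
for off, v in events:
    sink.write(off, v)
for off, v in events[1:]:           # simulate a replay after a failure
    sink.write(off, v)
# total reflects each event exactly once despite at-least-once delivery
```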
This workshop will provide a hands-on introduction to Apache Spark and Apache Zeppelin in the cloud.
Format: a short introductory lecture on Apache Spark covering the core modules (SQL, Streaming, MLlib, GraphX), followed by a demo, lab exercises, and a Q&A session. The lecture will be followed by lab time to work through the exercises and ask questions.
Objective: to provide a quick, hands-on introduction to Apache Spark. This lab will use the following Spark and Apache Hadoop components: Spark, Spark SQL, Apache Hadoop HDFS, Apache Hadoop YARN, Apache ORC, and Apache Zeppelin. You will learn how to move data into HDFS using Spark APIs, create Apache Hive tables, explore the data with Spark and Spark SQL, transform the data, and then issue some SQL queries.
Lab prerequisites: registrants must bring a laptop with a Chrome or Firefox web browser installed (with proxies disabled). Alternatively, they may download and install an HDP Sandbox as long as they have at least 16 GB of RAM available (note that the sandbox is over 10 GB in size, so we recommend downloading it before the crash course).
Speaker: Robert Hryniewicz
Data Ingest Self Service and Management using NiFi and Kafka
We’re feeling the growing pains of maintaining a large data platform. Last year we went from 50 to 150 unique data feeds, adding them all by hand. In this talk we will share the best practices we developed to handle our 300% increase in feeds through self-service. Self-service capabilities will increase your team's velocity and decrease your time to value and insight.
* Self-service data feed design and ingest
* Configuration management
* Automatic debugging
* Lightweight data governance
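The self-service idea in the list above is essentially configuration over construction: a new feed becomes a registry entry and the ingest pipeline is generated from a template. The sketch below is hypothetical; the names are illustrative, not NiFi's or Kafka's APIs.

```python
# Toy feed registry: one template, many feeds. Adding a feed is a
# config entry instead of a hand-built pipeline.
TEMPLATE = {"source": None, "topic": None, "schema_check": True}

def register_feed(name, source):
    """Generate an ingest config for a new feed from the template."""
    config = dict(TEMPLATE)
    config["source"] = source
    config["topic"] = f"ingest.{name}"   # convention-based Kafka topic name
    return config

# A data owner registers a feed without touching pipeline internals.
feed = register_feed("orders", source="sftp://partner/orders")
```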
We discuss the current state of LLAP (Live Long and Process), the concurrent, sub-second execution engine for analytical queries in Hive 2.0. LLAP is a hybrid execution model that enables performance improvements within and across queries, such as caching of columnar data with cache coherence and intelligent eviction for disaggregated storage models (like S3, Isilon, and Azure), JIT-friendly operator pipelines, asynchronous I/O, data prefetching, and multi-threaded processing. LLAP features robust tolerance of machine and service failures, achieved by building on time-tested fault-tolerant subsystems, as well as a concurrency-directed design that achieves high utilization with low latency via resource sharing, reducing overheads for multiple queries and enabling the system to preempt lower-priority tasks without failing any query in flight. The talk also covers the novel deployment model required for hybrid execution: the elasticity demands of the system are served by a long-lived YARN service interacting with on-demand elastic containers, serving as a tightly integrated DAG-based framework for query execution. We discuss the current state of the project, performance numbers, deployment and usage strategy, and future work, including how LLAP fits into a unified, secure DataFrame access layer.
Using Spark Streaming and NiFi for the next generation of ETL in the enterprise
On paper, combining Apache NiFi, Kafka, and Spark Streaming provides a compelling architectural option for building your next-generation ETL data pipeline in near real time. But what does it take to deploy and operationalize this in an enterprise production environment?
The newer Spark Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing with elegant code samples, but is that the whole story? This session will cover the Royal Bank of Canada's (RBC) journey of moving away from traditional ETL batch processing with Teradata toward using the Hadoop ecosystem for ingesting data. One of the first systems to leverage this new approach was the Event Standardization Service (ESS). This service provides a centralized "client event" ingestion point for the bank's internal systems through either a web service or a daily batch text-file feed. ESS allows downstream reporting applications and end users to query these centralized events.
We discuss the drivers and expected benefits of changing the existing event processing. In presenting the integrated solution, we will explore the key components of using NiFi, Kafka, and Spark, then share the good, the bad, and the ugly of trying to adopt these technologies in the enterprise. This session is targeted toward architects and other senior IT staff looking to continue their adoption of open source technology and modernize ingest/ETL processing. Attendees will take away lessons learned and experience in deploying these technologies to make their journey easier.
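The "standardization" step an ESS-style service performs can be sketched as mapping heterogeneous source events onto one canonical schema before they reach Kafka. The field names below are hypothetical, not RBC's actual schema.

```python
# Toy event standardizer: rename source fields to a canonical schema
# and flag events with missing required fields for a dead-letter path.
CANONICAL_FIELDS = ("client_id", "event_type", "timestamp")

def standardize(raw, field_map):
    """field_map: {canonical field name: source field name}."""
    event = {canon: raw.get(src) for canon, src in field_map.items()}
    missing = [f for f in CANONICAL_FIELDS if event.get(f) is None]
    return event, missing   # events with gaps go to a dead-letter queue

event, missing = standardize(
    {"cust": "C42", "type": "login", "ts": "2017-06-01T12:00:00"},
    {"client_id": "cust", "event_type": "type", "timestamp": "ts"},
)
```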
Speakers
Darryl Sutton, T4G, Principal Consultant
Kenneth Poon, RBC, Director, Data Engineering
At ING we needed a way to take data science models from exploration into production. I will give this talk from my experience as a senior ops engineer on the exploration and production Hadoop environments. For this we use OpenShift to run Docker containers that connect to the big data Hadoop environment.
During this talk I will explain why we need this and how it is done at ING, including how to set up a Docker container running a data science model using Hive, Python, and Spark. I'll explain how to use Dockerfiles to build Docker images, add all the needed components inside the image, and run different versions of software in different containers.
At the end I will also give a demo of how it runs and how it is automated using Git with a webhook connecting to Jenkins, which starts the Docker service that connects to the big data Hadoop environment.
This is going to be a great technical talk for engineers and data scientists.
Speaker
Lennard Cornelis, Ops Engineer, ING
Omid: scalable and highly available transaction processing for Apache Phoenix
Apache Phoenix is an OLTP and operational analytics engine for Hadoop. To ensure the correctness of operations, Phoenix requires a transaction processor that guarantees all data accesses satisfy the ACID properties. Traditionally, Apache Phoenix has used the Apache Tephra transaction processing technology. Recently, we introduced into Phoenix support for Apache Omid—an open source transaction processor for HBase that is used at Yahoo at large scale.
A single Omid instance sustains hundreds of thousands of transactions per second and provides high availability at zero cost for mainstream processing. Omid and Tephra are now configurable choices for the Phoenix transaction processing backend, enabled by the newly introduced Transaction Abstraction Layer (TAL) API. The integration required introducing many new features and operations to Omid and will become generally available in early 2018.
In this talk, we walk through the challenges of the project, focusing on the new use cases introduced by Phoenix and how we address them in Omid.
Speaker
Ohad Shacham, Senior Research Scientist, Yahoo Research, Oath
Bringing complex event processing to Spark Streaming
Complex event processing (CEP) is about identifying business opportunities and threats in real time by detecting patterns in data and taking appropriate automated action. Example business use cases for CEP include location-based marketing, smart inventories, targeted ads, Wi-Fi offloading, fraud detection, churn prediction, fleet management, predictive maintenance, security incident event management, and many more. While Spark Streaming provides a distributed, resilient framework for ingesting events in real time, effort is still needed to build CEP applications. This is because CEP use cases require correlation of events, which in turn requires us to treat every incoming event as a discrete occurrence in time, whereas Spark Streaming treats the entire batch of events as a single occurrence. Many CEP use cases also require alerts to be fired even when there is no incoming event. An example of such a use case is firing an alert when an order-shipped event is NOT received within the SLA window following an order-received event. At Oracle we have adopted a few neat techniques, such as running continuous query engines as long-running tasks and using empty batches as triggers, to bring complex event processing to Spark Streaming.
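To make the SLA pattern above concrete, here is a simplified, hypothetical sketch of the timeout logic; it is not Oracle's implementation. In a real deployment this state would live inside a continuous query engine running as a long-lived Spark Streaming task, with empty micro-batches acting as clock ticks so the alert can fire even when no new event arrives.

```python
from dataclasses import dataclass

@dataclass
class Event:
    order_id: str
    kind: str      # "received" or "shipped"
    ts: float      # event time in seconds

class SlaMonitor:
    """Fires an alert when order-shipped does not follow order-received in time."""

    def __init__(self, sla_seconds):
        self.sla = sla_seconds
        self.pending = {}   # order_id -> timestamp of the order-received event

    def on_event(self, ev):
        if ev.kind == "received":
            self.pending[ev.order_id] = ev.ts
        elif ev.kind == "shipped":
            self.pending.pop(ev.order_id, None)

    def on_tick(self, now):
        """Called on every micro-batch, even an empty one (the clock tick)."""
        late = [oid for oid, t in self.pending.items() if now - t > self.sla]
        for oid in late:
            del self.pending[oid]
        return late   # order ids that breached the SLA

monitor = SlaMonitor(sla_seconds=60)
monitor.on_event(Event("A", "received", 0.0))
monitor.on_event(Event("B", "received", 10.0))
monitor.on_event(Event("B", "shipped", 30.0))
print(monitor.on_tick(now=90.0))   # order "A" breached the 60s SLA
```

The key point is the `on_tick` path: the alert is driven by the passage of time, not by an incoming event, which is exactly what plain batch-at-a-time processing does not give you for free.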
Join us to learn more about CEP for Spark, the fastest-growing data processing platform in the world.
Speakers
Prabhu Thukkaram, Senior Director, Product Development, Oracle
Hoyong Park, Architect, Oracle
Real-Time Machine Learning with Redis, Apache Spark, Tensor Flow, and more wi...
Predictive intelligence from machine learning has the potential to change everything in our day to day experiences, from education to entertainment, from travel to healthcare, from business to leisure and everything in between. Modern ML frameworks are batch by nature and cannot pivot on the fly to changing user data or situations. Many simple ML applications such as those that enhance the user experience, can benefit from real-time robust predictive models that adapt on the fly.
Join this session to learn how common practices in machine learning such as running a trained model in production can be substantially accelerated and radically simplified by using Redis modules that natively store and execute common models generated by Spark ML and Tensorflow algorithms. We will also discuss the implementation of simple, real-time feed-forward neural networks with Neural Redis and scenarios that can benefit from such efficient, accelerated artificial intelligence.
Real-life implementations of these new techniques at a large consumer credit company for fraud analytics, at an online e-commerce provider for user recommendations and at a large media company for targeting content will also be discussed.
LinkedIn leverages the Apache Hadoop ecosystem for its big data analytics. Steady growth of the member base at LinkedIn along with their social activities results in exponential growth of the analytics infrastructure. Innovations in analytics tooling lead to heavier workloads on the clusters, which generate more data, which in turn encourage innovations in tooling and more workloads. Thus, the infrastructure remains under constant growth pressure. Heterogeneous environments embodied via a variety of hardware and diverse workloads make the task even more challenging.
This talk will tell the story of how we doubled our Hadoop infrastructure twice in the past two years.
• We will outline our main use cases and historical rates of cluster growth in multiple dimensions.
• We will focus on optimizations, configuration improvements, performance monitoring and architectural decisions we undertook to allow the infrastructure to keep pace with business needs.
• The topics include improvements in HDFS NameNode performance, and fine tuning of block report processing, the block balancer, and the namespace checkpointer.
• We will reveal a study on the optimal storage device for HDFS persistent journals (SATA vs. SAS vs. SSD vs. RAID).
• We will also describe the Satellite Cluster project, which allowed us to double the objects stored on one logical cluster by splitting an HDFS cluster into two partitions without the use of federation and with practically no code changes.
• Finally, we will take a peek at our future goals, requirements, and growth perspectives.
Speakers
Konstantin Shvachko, Sr Staff Software Engineer, LinkedIn
Erik Krogen, Senior Software Engineer, LinkedIn
Druid and Hive Together: Use Cases and Best Practices
Two popular open source technologies, Druid and Apache Hive, are often mentioned as viable solutions for large-scale analytics. Hive works well for storing large volumes of data, although it is not optimized for ingesting streaming data and making it available for queries in real time. On the other hand, Druid excels at low-latency, interactive queries over streaming data and making data available for queries in real time. Although the high-level messaging presented by both projects may lead you to believe they are competing for the same use case, the technologies are in fact extremely complementary.
By combining the rich query capabilities of Hive with the powerful real-time streaming and indexing capabilities of Druid, we can build more powerful, flexible, and extremely low-latency real-time streaming analytics solutions. In this talk we will discuss the motivation for combining Hive and Druid, along with the benefits, use cases, best practices, and benchmark numbers.
The agenda of the talk:
1. Motivation behind integrating Druid with Hive
2. Druid and Hive together - benefits
3. Use Cases with Demos and architecture discussion
4. Best Practices - Do's and Don'ts
5. Performance vs Cost Tradeoffs
6. SSB Benchmark Numbers
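As a small illustration of the integration discussed above, the sketch below builds a Hive DDL statement that stores a table through Druid's storage handler, making Druid-indexed data queryable from Hive. The table and column names are invented for illustration, and exact table properties vary by Hive/Druid version.

```python
# Illustrative only: a Hive table backed by Druid via the Druid storage
# handler. In practice this DDL would be submitted through beeline or a
# Hive JDBC client, not assembled in Python.
druid_backed_ddl = """
CREATE TABLE page_views (
  `__time` TIMESTAMP,   -- Druid requires a time column named __time
  page     STRING,
  user_id  STRING,
  views    BIGINT
)
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ("druid.segment.granularity" = "DAY");
"""

print(druid_backed_ddl.strip().splitlines()[0])
```

Once such a table exists, Hive queries against it are pushed down to Druid where possible, which is what delivers the low-latency interactive behavior described above.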
Apache Ambari is an extensible framework that simplifies provisioning, managing, and monitoring Hadoop clusters. Apache Ambari was built on a standardized stack-based operations model. Stacks wrap services of all shapes and sizes with a consistent definition and lifecycle-control layer, thereby providing a consistent approach for managing and monitoring the services. This also provides a natural extension point for operators and the community to bring in their own add-on services and plug the new services into the stack.
However, one of the fundamental limitations of the current Apache Ambari architecture has been the strong one-to-one coupling between entities. For instance, a cluster is tied to a single stack, so a Hadoop operator can only deploy services defined in that stack; a cluster can have only a single instance of a service; and a host can have only a single instance of a component. Considering the various use cases that cannot be enabled due to these limitations, there is a growing need to revamp the Ambari architecture.
In this talk, we propose a revamped Apache Ambari architecture that will open up the floodgates for a wide range of scenarios that haven't been possible thus far. We will focus the discussion on a new mpack-based operations model that will replace the stack-based operations model. A management package (mpack) is a self-contained deployment artifact that includes all the details for deploying, managing, and upgrading a set of services bundled in the package. A third-party provider can also build their own management package containing their custom services. This eliminates the need to plug their services into a stack and lets them define their own upgrade story for these custom services. A Hadoop operator will be able to deploy a Hadoop cluster with a mix of services across multiple packages instead of being limited to a single stack. For example, it would be possible to deploy a cluster with HDFS from HDP and NiFi from HDF.
Further, we will also discuss the architectural changes needed to enable a multi-instance architecture in future Ambari releases: supporting multiple instances of a service in a cluster, multiple instances of a component on a host, and future-proofing the Ambari architecture to leverage some of the advancements happening in the Hadoop community, such as YARN services (YARN-4692). We will wrap up with a brief overview of other improvements planned for future releases of Ambari.
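To make the mpack model above concrete, here is a purely hypothetical sketch of what a management-package descriptor might contain. The field names are invented for illustration and are not the actual Ambari mpack schema; the point is that the package carries its own services and its own upgrade story, independent of any stack.

```python
import json

# Hypothetical descriptor: all field names are illustrative, not the real
# Ambari mpack format.
mpack = {
    "name": "custom-streaming-mpack",
    "version": "1.0.0",
    "services": [
        {"name": "NIFI", "version": "1.2.0"},
        {"name": "KAFKA", "version": "0.10.1"},
    ],
    # The mpack defines its own upgrade story, decoupled from any stack.
    "upgrade": {"from_versions": ["0.9.*"], "orchestration": "rolling"},
}

print(json.dumps(mpack, indent=2))
```

Under such a model, the operator composes a cluster from several packages rather than picking every service from one monolithic stack.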
Data Highway Rainbow - Petabyte Scale Event Collection, Transport & Delivery ...
This paper will present the architecture and features of Data Highway Rainbow, Yahoo’s hosted multi-tenant infrastructure which offers event collection, transport, and aggregated delivery as a service. Data Highway supports collection from multiple data centers and aggregated delivery into primary Yahoo data centers which provide a big data computing cluster. From a delivery perspective, Data Highway supports endpoints/sinks such as HDFS, Storm, and Kafka, with the Storm and Kafka endpoints tailored towards low-latency consumers.
We will also look into the evolution of the service in terms of prominent features added since its initial launch and the motivation behind them: some were customer asks, while others were driven by optimizing the efficiency and footprint of the deployed infrastructure. Some of the features we will touch upon are:
* Delivery Completeness Audit WebService
* Publisher Daemon & Client API Robustness
* Aggregated HDFS File Delivery
* Filters for Low Latency Delivery
* Schema Registry
* Adaptive Rate Limiting
* Various Load Balancing Techniques
* Event Deduplication
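To illustrate one of the features above, here is a minimal token-bucket sketch of the kind of rate limiting a publisher daemon might apply. This is not Yahoo's actual implementation; the "adaptive" variant described in the talk would additionally adjust the refill rate based on downstream backpressure.

```python
import time

class TokenBucket:
    """Admit events at a sustained rate, allowing short bursts."""

    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec      # sustained refill rate
        self.capacity = burst         # maximum burst size
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self, cost=1):
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False            # caller should retry, buffer, or shed load

bucket = TokenBucket(rate_per_sec=100, burst=10)
accepted = sum(bucket.allow() for _ in range(50))
print(accepted)   # roughly the burst size, since the loop runs far faster than refill
```

An adaptive limiter would replace the fixed `rate_per_sec` with a feedback signal, e.g. shrinking it when delivery latency to a sink climbs.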
Aggregated Daily Metrics
* Events Ingested: 250 Billion
* Bytes Ingested (Uncompressed): 700 Terabytes
* Bytes Delivered (Batch + Near Real Time): 1.5 Petabytes
* Near Real Time Delivery (Storm & Kafka) Latency: 95th percentile 500 ms - 1 second
* Batch Delivery Latency (Aggregated into 1-minute files): 95th percentile within 3 minutes
* Production H/W Footprint: 651
* Total Active Event Schema Types: ~200
Underlying Technology Stack: ZeroMQ, Apache Avro, libevent, Apache HttpComponents
The paper will conclude with the next steps we’re considering as a logical evolution for Data Highway in light of considerable developments in similar open source projects such as Apache Kafka.
Many organizations today process many types of data in many formats. Most often this data is free-form, and as the number of consumers of this data grows, it is imperative that this free-flowing data adhere to a schema. A schema helps data consumers know what type of data to expect, and shields them from immediate impact if an upstream source changes its format. Having a uniform schema representation also gives the data pipeline an easy way to integrate with and support various systems that use different data formats.
Schema Registry is a central repository for storing and evolving schemas. It provides an API and tooling to help developers and users register a schema and consume it without being impacted when the schema changes. Users can tag different schemas and versions, register for notifications of schema changes with versions, and more.
In this talk, we will go through the need for a schema registry and schema evolution, and showcase the integration with Apache NiFi, Apache Kafka, and Apache Storm.
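The core registry behavior described above can be sketched in a few lines: schema versions are registered under a subject and resolved by version, so a consumer pinned to version 1 keeps working after version 2 appears. This toy model is illustrative only; a real registry also enforces compatibility rules (e.g. Avro backward compatibility) and emits change notifications.

```python
class SchemaRegistry:
    """Toy in-memory registry: subjects map to ordered schema versions."""

    def __init__(self):
        self.subjects = {}   # subject -> list of schemas, index = version - 1

    def register(self, subject, schema):
        versions = self.subjects.setdefault(subject, [])
        versions.append(schema)
        return len(versions)            # 1-based version number

    def get(self, subject, version=None):
        versions = self.subjects[subject]
        return versions[-1] if version is None else versions[version - 1]

reg = SchemaRegistry()
v1 = reg.register("clickstream", {"fields": ["ts", "url"]})
v2 = reg.register("clickstream", {"fields": ["ts", "url", "user_id"]})
print(v1, v2)                       # 1 2
print(reg.get("clickstream", 1))    # old consumers still resolve version 1
```

The upstream producer moves to version 2 without breaking consumers, which is exactly the decoupling the abstract argues for.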
Building and deploying an analytic service on cloud is a challenge; maintaining the service is a bigger one. In a world where users are gravitating towards a model where cluster instances are provisioned on the fly, used for analytics or other purposes, and then shut down when the jobs are done, the relevance of containers and container orchestration is more important than ever. In short, customers are looking for serverless Spark clusters. The intent of this presentation is to share what serverless Spark is and the benefits of running Spark in a serverless manner.
Storage Requirements and Options for Running Spark on Kubernetes
In a world of serverless computing, users tend to be frugal when it comes to expenditure on compute, storage, and other resources, and paying for them when they aren’t in use becomes a significant factor. Offering Spark as a service on cloud presents unique challenges, and running Spark on Kubernetes presents many of them, especially around storage and persistence. Spark workloads have very specific storage requirements for intermediate data, long-term persistence, and shared file systems, and these requirements become even more stringent when the same must be offered as a service to enterprises that need to meet GDPR and other compliance requirements such as ISO 27001 and HIPAA certifications.
This talk covers the challenges involved in providing serverless Spark clusters and shares the specific issues one can encounter when running large Kubernetes clusters in production, especially scenarios related to persistence.
This talk will help people running Kubernetes or the Docker runtime in production understand the various storage options available, which are most suitable for running Spark workloads on Kubernetes, and what more can be done.
Present and future of unified, portable, and efficient data processing with A...
The world of big data involves an ever-changing field of players. Much as SQL stands as a lingua franca for declarative data analysis, Apache Beam aims to provide a portable standard for expressing robust, out-of-order data processing pipelines in a variety of languages across a variety of platforms. In a way, Apache Beam is a glue that can connect the big data ecosystem together; it enables users to "run any data processing pipeline anywhere."
This talk will briefly cover the capabilities of the Beam model for data processing and discuss its architecture, including the portability model. We’ll focus on the present state of the community and the current status of the Beam ecosystem. We’ll cover the state of the art in data processing and discuss where Beam is going next, including completion of the portability framework and the Streaming SQL. Finally, we’ll discuss areas of improvement and how anybody can join us on the path of creating the glue that interconnects the big data ecosystem.
Speaker
Davor Bonaci, Apache Software Foundation; Simbly, V.P. of Apache Beam; Founder/CEO at Operiant
Stream Processing is emerging as a popular paradigm for data processing architectures, because it handles the continuous nature of most data and computation and gets rid of artificial boundaries and delays. In this talk, we are going to look at some of the most common misconceptions about stream processing and debunk them.
- Myth 1: Streaming is approximate and exactly-once is not possible.
- Myth 2: Streaming is for real-time only.
- Myth 3: You need to choose between latency and throughput.
- Myth 4: Streaming is harder to learn than Batch Processing.
We will look at these and other myths and debunk them at the example of Apache Flink. We will discuss Apache Flink's approach to high performance stream processing with state, strong consistency, low latency, and sophisticated handling of time. With such building blocks, Apache Flink can handle classes of problems previously considered out of reach for stream processing. We also take a sneak preview at the next steps for Flink.
This workshop will provide a hands-on introduction to Apache Spark and Apache Zeppelin in the cloud.
Format: A short introductory lecture on Apache Spark covering core modules (SQL, Streaming, MLlib, GraphX) followed by a demo. The lecture will be followed by lab time to work through the lab exercises and ask questions.
Objective: To provide a quick hands-on introduction to Apache Spark. This lab will use the following Spark and Apache Hadoop components: Spark, Spark SQL, Apache Hadoop HDFS, Apache Hadoop YARN, Apache ORC, Apache Ambari, and Apache Zeppelin. You will learn how to move data into HDFS using Spark APIs, create Apache Hive tables, explore the data with Spark and Spark SQL, transform the data, and then issue some SQL queries.
Lab pre-requisites: Registrants must bring a laptop with a Chrome or Firefox web browser installed (with proxies disabled). Alternatively, they may download and install an HDP Sandbox as long as they have at least 16GB of RAM available (Note that the sandbox is over 10GB in size so we recommend downloading it before the crash course).
Speaker: Robert Hryniewicz
Data Ingest Self Service and Management using NiFi and Kafka
We’re feeling the growing pains of maintaining a large data platform. Last year we went from 50 to 150 unique data feeds, adding them all by hand. In this talk we will share the best practices we developed to handle this threefold increase in feeds through self service. Self-service capabilities will increase your team's velocity and decrease your time to value and insight.
* Self-service data feed design and ingest
* Configuration management
* Automatic debugging
* Lightweight data governance
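A hypothetical sketch of what self-service feed onboarding can look like: instead of wiring each feed by hand, the platform accepts a small declarative spec, validates it, and tooling turns it into a NiFi flow plus a Kafka topic. The field names below are invented for illustration and are not from the talk.

```python
# Required fields a feed spec must declare before the platform accepts it.
REQUIRED = {"name", "source", "format", "kafka_topic"}

def validate_feed(spec):
    """Reject incomplete specs up front, before any flow is generated."""
    missing = REQUIRED - spec.keys()
    if missing:
        raise ValueError(f"feed spec missing fields: {sorted(missing)}")
    return spec

feed = validate_feed({
    "name": "web_clicks",
    "source": "sftp://landing/web/",   # illustrative source location
    "format": "json",
    "kafka_topic": "raw.web_clicks",
    "retention_days": 7,               # optional governance metadata
})
print(feed["name"])
```

Pushing validation into a spec like this is what makes automatic debugging and lightweight governance tractable: every feed is described the same way, so tooling can reason about all of them.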
We discuss the current state of LLAP (Live Long and Process) – the concurrent sub-second execution of analytical queries engine for Hive 2.0. LLAP is a hybrid execution model that enables performance improvement in and across queries, such as caching of columnar data with cache coherence and intelligent eviction for disaggregated storage models (like S3, Isilon, Azure), JIT-friendly operator pipelines, asynchronous I/O, data pre-fetching and multi-threaded processing. LLAP features robust machine and service failure tolerance achieved by building on top of the time-tested fault tolerant subsystems, as well as a concurrency-directed design that achieves high utilization with low latency via resource sharing, reducing overheads for multiple queries, and enabling the system to preempt tasks of lower priority without failing any query in-flight. The talk also aims to cover the novel deployment model required for hybrid execution. The elasticity demands of the system are served by a long-lived YARN service interacting with on-demand elastic containers serving as a tightly integrated DAG-based framework for query execution. We discuss the current state of the project, performance numbers, deployment and usage strategy, as well as future work, including how LLAP fits into a unified secure DataFrame access layer.
Using Spark Streaming and NiFi for the next generation of ETL in the enterprise
On paper, combining Apache NiFi, Kafka, and Spark Streaming provides a compelling architecture option for building your next-generation ETL data pipeline in near real time. But what does it take to deploy and operationalize this in an enterprise production environment?
The newer Spark Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing with elegant code samples, but is that the whole story? This session will cover the Royal Bank of Canada’s (RBC) journey of moving away from traditional ETL batch processing with Teradata towards using the Hadoop ecosystem for ingesting data. One of the first systems to leverage this new approach was the Event Standardization Service (ESS). This service provides a centralized “client event” ingestion point for the bank’s internal systems through either a web service or a daily text-file batch feed. ESS allows downstream reporting applications and end users to query these centralized events.
We discuss the drivers and expected benefits of changing the existing event processing. In presenting the integrated solution, we will explore the key components of using NiFi, Kafka, and Spark, then share the good, the bad, and the ugly when trying to adopt these technologies into the enterprise. This session is targeted toward architects and other senior IT staff looking to continue their adoption of open source technology and modernize ingest/ETL processing. Attendees will take away lessons learned and experience in deploying these technologies to make their journey easier.
Speakers
Darryl Sutton, T4G, Principal Consultant
Kenneth Poon, RBC, Director, Data Engineering
At ING we needed a way to implement Data science models from exploration into production. I will do this talk from my experience on the exploration and production Hadoop environment as a senior Ops engineer. For this we are using OpenShift to run Docker containers that connect to the big data Hadoop environment.
During this talk I will explain why we need this and how this is done at ING. Also how to set up a docker container running a data science model using Hive, Python, and Spark. I’ll explain how to use Docker files to build Docker images, add all the needed components inside the Docker image, and how to run different versions of software in different containers.
In the end I will also give a demo of how it runs and is automated using Git with webhook connecting to Jenkins and start the docker service that will connect to a big data Hadoop environment.
This is going to be a great technical talk for engineers and data scientist.
Speaker
Lennard Cornelis, Ops Engineer, ING
Omid: scalable and highly available transaction processing for Apache PhoenixDataWorks Summit
Apache Phoenix is an OLTP and operational analytics for Hadoop. To ensure operations correctness, Phoenix requires that a transaction processor guarantees that all data accesses satisfy the ACID properties. Traditionally, Apache Phoenix has been using the Apache Tephra transaction processing technology. Recently, we introduced into Phoenix the support for Apache Omid—an open source transaction processor for HBase that is used at Yahoo at a large scale.
A single Omid instance sustains hundreds of thousands of transactions per second and provides high availability at zero cost for mainstream processing. Omid, as well as Tephra, are now configurable choices for the Phoenix transaction processing backend, being enabled by the newly introduced Transaction Abstraction Layer (TAL) API. The integration requires introducing many new features and operations to Omid and will become generally available early 2018.
In this talk, we walk through the challenges of the project, focusing on the new use cases introduced by Phoenix and how we address them in Omid.
Speaker
Ohad Shacham, Senior Research Scientist, Yahoo Research, Oath
Bringing complex event processing to Spark streamingDataWorks Summit
Complex event processing (CEP) is about identifying business opportunities and threats in real time by detecting patterns in data and taking appropriate automated action. Example business use cases for CEP include location-based marketing, smart inventories, targeted ads, Wi-Fi offloading, fraud detection, churn prediction, fleet management, predictive maintenance, security incident event management, and many more. While Spark Streaming provides a distributed resilient framework for ingesting events in real time, effort is still needed to build CEP applications. This is because CEP use cases require correlation of events, which in turn requires us to treat every incoming event as a discrete occurrence in time. Spark Streaming treats the entire batch of events as single occurrence. Many CEP use cases also require alerts to be fired even when there is no incoming event. An example of such use case is to fire an alert when an order-shipped event is NOT received within the SLA times following an order-received event. At Oracle we have adopted a few neat techniques like running continuous query engines as long running tasks, using empty batches as triggers, etc. to bring complex event processing to Spark Streaming.
Join us to learn more on CEP for Spark, the fastest growing data processing platform in the world.
Speakers
Prabhu Thukkaram, Senior Director, Product Development, Oracle
Hoyong Park, Architect, Oracle
Real-Time Machine Learning with Redis, Apache Spark, Tensor Flow, and more wi...Databricks
Predictive intelligence from machine learning has the potential to change everything in our day to day experiences, from education to entertainment, from travel to healthcare, from business to leisure and everything in between. Modern ML frameworks are batch by nature and cannot pivot on the fly to changing user data or situations. Many simple ML applications such as those that enhance the user experience, can benefit from real-time robust predictive models that adapt on the fly.
Join this session to learn how common practices in machine learning such as running a trained model in production can be substantially accelerated and radically simplified by using Redis modules that natively store and execute common models generated by Spark ML and Tensorflow algorithms. We will also discuss the implementation of simple, real-time feed-forward neural networks with Neural Redis and scenarios that can benefit from such efficient, accelerated artificial intelligence.
Real-life implementations of these new techniques at a large consumer credit company for fraud analytics, at an online e-commerce provider for user recommendations and at a large media company for targeting content will also be discussed.
LinkedIn leverages the Apache Hadoop ecosystem for its big data analytics. Steady growth of the member base at LinkedIn along with their social activities results in exponential growth of the analytics infrastructure. Innovations in analytics tooling lead to heavier workloads on the clusters, which generate more data, which in turn encourage innovations in tooling and more workloads. Thus, the infrastructure remains under constant growth pressure. Heterogeneous environments embodied via a variety of hardware and diverse workloads make the task even more challenging.
This talk will tell the story of how we doubled our Hadoop infrastructure twice in the past two years.
• We will outline our main use cases and historical rates of cluster growth in multiple dimensions.
• We will focus on optimizations, configuration improvements, performance monitoring and architectural decisions we undertook to allow the infrastructure to keep pace with business needs.
• The topics include improvements in HDFS NameNode performance, and fine tuning of block report processing, the block balancer, and the namespace checkpointer.
• We will reveal a study on the optimal storage device for HDFS persistent journals (SATA vs. SAS vs. SSD vs. RAID).
• We will also describe Satellite Cluster project which allowed us to double the objects stored on one logical cluster by splitting an HDFS cluster into two partitions without the use of federation and practically no code changes.
• Finally, we will take a peek at our future goals, requirements, and growth perspectives.
SPEAKERS
Konstantin Shvachko, Sr Staff Software Engineer, LinkedIn
Erik Krogen, Senior Software Engineer, LinkedIn
Druid and Hive Together : Use Cases and Best PracticesDataWorks Summit
Two popular open source technologies, Druid and Apache Hive, are often mentioned as viable solutions for large-scale analytics. Hive works well for storing large volumes of data, although not optimized for ingesting streaming data and making it available for queries in realtime. On the other hand, Druid excels at low-latency, interactive queries over streaming data and making data available in realtime for queries. Although the high level messaging presented by both projects may lead you to believe they are competing for same use case, the technologies are in fact extremely complementary solutions.
By combining the rich query capabilities of Hive with the powerful realtime streaming and indexing capabilities of Druid, we can build more powerful, flexible, and extremely low latency realtime streaming analytics solutions. In this talk we will discuss the motivation to combine Hive and Druid together alongwith the benefits, use cases, best practices and benchmark numbers.
The Agenda of the talk will be -
1. Motivation behind integrating Druid with Hive
2. Druid and Hive together - benefits
3. Use Cases with Demos and architecture discussion
4. Best Practices - Do's and Don'ts
5. Performance vs Cost Tradeoffs
6. SSB Benchmark Numbers
Apache Ambari is an extensible framework that simplifies provisioning, managing and monitoring Hadoop clusters. Apache Ambari was built on a standardized stack-based operations model. Stacks wrap services of all shapes and sizes with a consistent definition and lifecycle-control layer; thereby providing a consistent approach for managing and monitoring the services. This also provided a natural extension point for operators and the community to bring in their own add-on services and “plug-in” the new services into the stack.
However, one of the fundamental limitations of the current Apache Ambari architecture has been that there is a strong one-on-one coupling between entities. For instance, a cluster is tied to a single stack and a Hadoop operator can only deploy services defined in that stack, a cluster can have only a single instance of a service and a host can have only a single instance of a component. Taking into consideration various use case scenarios that cannot be enabled due to these limitations there is a growing need to revamp the Ambari architecture.
In this talk, we propose a revamped Apache Ambari architecture that will open up the floodgates for a wide range of scenarios that wouldn’t have been possible thus far. We will focus the discussion on a new mpack-based operations model that will replace the stack-based operations model. A management package is a self-contained deployment artifact that includes all the details for deploying, managing and upgrading a set of services bundled in the package. A third-party provider can also build their own management package containing their custom services. This eliminates the need to plug-in their services into a stack and also can define their own upgrade story for these custom services. A Hadoop operator will be able to deploy a Hadoop cluster with a mix of services across multiple packages instead of being limited to a single stack. For example, it would be possible to deploy a cluster with HDFS from HDP and NIFI from HDF.
Further, we will also discuss about the architectural changes needed to enable a multi instance architecture in future Ambari releases to support deploying multiple instances of a service in a cluster, deploying multiple instances of a component on a host as well as future proofing the Ambari architecture to leverage some of the advancements happening in the Hadoop community like YARN services (YARN-4692). We will wrap up the conversation with a brief overview of other improvements planned for future releases of Ambari.
Data Highway Rainbow - Petabyte Scale Event Collection, Transport & Delivery ... – DataWorks Summit
This paper will present the architecture and features of Data Highway Rainbow, Yahoo's hosted multi-tenant infrastructure which offers event collection, transport, and aggregated delivery as a service. Data Highway supports collection from multiple data centers and aggregated delivery into the primary Yahoo data centers which host a big data computing cluster. From a delivery perspective, Data Highway supports endpoints/sinks such as HDFS, Storm, and Kafka, with the Storm and Kafka endpoints tailored towards low-latency consumers.
We will also look into the evolution of the service since its initial launch in terms of prominent features added and the motivation behind these features; some were customer asks, while others were driven by optimizing the efficiency and footprint of the deployed infrastructure. Some of the features we will touch upon are:
* Delivery Completeness Audit WebService
* Publisher Daemon & Client API Robustness
* Aggregated HDFS File Delivery
* Filters for Low Latency Delivery
* Schema Registry
* Adaptive Rate Limiting
* Various Load Balancing techniques
* Event Deduplication
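The adaptive rate limiting mentioned above can be illustrated with a token bucket whose refill rate is tunable at runtime (a sketch of the general technique only; Data Highway's actual implementation is not described here, and all names below are ours):

```python
import time

class TokenBucket:
    """Token-bucket rate limiter; an adaptive limiter would tune
    refill_rate at runtime based on observed load."""
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity        # maximum burst size
        self.refill_rate = refill_rate  # tokens added per second
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self, now=None):
        if now is None:
            now = time.monotonic()
        # refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

bucket = TokenBucket(capacity=3, refill_rate=1.0)
t0 = time.monotonic()
results = [bucket.allow(now=t0) for _ in range(5)]  # burst of 5: only 3 admitted
later = bucket.allow(now=t0 + 2.0)                  # 2 s later: tokens refilled
```

Passing an explicit clock value keeps the demonstration deterministic; in production the limiter would read the clock itself.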
Aggregated Daily Metrics:
* Events Ingested: 250 Billion
* Bytes Ingested (Uncompressed): 700 Terabytes
* Bytes Delivered (Batch + Near Real Time): 1.5 Petabytes
* Near Real Time Delivery (Storm & Kafka) Latency: 95th percentile 500 ms - 1 second
* Batch Delivery Latency (Aggregated into 1-minute files): 95th percentile within 3 minutes
* Production H/W Footprint: 651
* Total Active Event Schema Types: ~200
Underlying Technology Stack: ZeroMQ, Apache Avro, libevent, Apache HttpComponents
The paper will conclude with the next steps we’re considering as a logical evolution for Data Highway in light of considerable developments in similar open source projects such as Apache Kafka.
Many organizations currently process various types of data in different formats, and most often this data is free-form. As the consumers of this data grow, it is imperative that this free-flowing data adhere to a schema. A schema helps data consumers have an expectation about the type of data they are getting, and it shields them from immediate impact if the upstream source changes its format. A uniform schema representation also gives the data pipeline an easy way to integrate and support various systems that use different data formats.
Schema Registry is a central repository for storing and evolving schemas. It provides an API and tooling that help developers and users register a schema and consume it without being impacted when the schema changes. Users can tag different schemas and versions, register for notifications of schema changes by version, and more.
In this talk, we will go through the need for a schema registry and schema evolution, and showcase the integration with Apache NiFi, Apache Kafka, and Apache Storm.
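To make the core idea concrete, here is a minimal in-memory sketch of schema registration with versioning and a naive backward-compatibility check (illustrative only; this is not the actual Schema Registry API, and the subject and field names are hypothetical):

```python
class SchemaRegistry:
    """In-memory sketch: each subject holds a list of schema versions,
    and a new version must keep every field of the previous one
    (a deliberately naive backward-compatibility rule)."""
    def __init__(self):
        self.schemas = {}  # subject -> list of schema dicts (index = version - 1)

    def register(self, subject, schema):
        versions = self.schemas.setdefault(subject, [])
        if versions:
            missing = set(versions[-1]) - set(schema)
            if missing:
                raise ValueError(f"incompatible: drops fields {sorted(missing)}")
        versions.append(schema)
        return len(versions)  # the new version number

    def get(self, subject, version=None):
        versions = self.schemas[subject]
        return versions[-1] if version is None else versions[version - 1]

registry = SchemaRegistry()
v1 = registry.register("page_view", {"user_id": "long", "url": "string"})
v2 = registry.register("page_view",
                       {"user_id": "long", "url": "string", "referrer": "string"})
latest = registry.get("page_view")  # consumers always see the newest version
```

A real registry adds pluggable compatibility modes (backward, forward, full) and persists versions durably, but the register/lookup contract is the same.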
Building and deploying an analytic service on the cloud is a challenge; a bigger challenge is maintaining the service. In a world where users are gravitating towards a model where cluster instances are provisioned on the fly, used for analytics or other purposes, and then shut down when the jobs are done, the relevance of containers and container orchestration is greater than ever. In short, customers are looking for serverless Spark clusters. The intent of this presentation is to share what serverless Spark is and what the benefits of running Spark in a serverless manner are.
Storage Requirements and Options for Running Spark on Kubernetes – DataWorks Summit
In a world of serverless computing, users tend to be frugal with expenditure on compute, storage, and other resources, and paying for these when they aren't in use becomes a significant factor. Offering Spark as a service on the cloud presents very unique challenges, and running Spark on Kubernetes raises many of them, especially around storage and persistence. Spark workloads have unique storage requirements for intermediate data, long-term persistence, and shared file systems, and these requirements become even stricter when the same platform must be offered as a service to enterprises that need to manage GDPR and other compliance regimes such as ISO 27001 and HIPAA certifications.
This talk covers the challenges involved in providing serverless Spark clusters and shares the specific issues one can encounter when running large Kubernetes clusters in production, especially scenarios related to persistence.
This talk will help people using Kubernetes or the Docker runtime in production understand the various storage options available, which of them are most suitable for running Spark workloads on Kubernetes, and what more can be done.
Secure Your Containers: What Network Admins Should Know When Moving Into Prod... – Cynthia Thomas
This session offers techniques for securing Docker containers and hosts using open source network virtualization technologies to implement microsegmentation. Come learn real tips and tricks that you can apply to keep your production environment secure.
stackconf 2020 | Replace your Docker based Containers with Cri-o Kata Contain... – NETWAYS
Kata Containers provide the workload isolation and security advantages of VMs while maintaining the deployment speed and usability of containers. With Kata Containers, instead of relying on namespaces, small virtual machines are created on the kernel and are strongly isolated. Kata Containers is based on the KVM hypervisor, so its level of isolation is equivalent to that of typical hypervisors. This session will focus on a live production phase when choosing Kata instead of Docker, and why it is preferable.
Although containers provide software-level isolation of resources, the kernel must be shared, so their isolation level in terms of security is not as high as that of hypervisors. This session teaches how to shift from Docker as the de facto standard to Kata Containers and how to obtain a higher level of security.
Oscon 2017: Build your own container-based system with the Moby project – Patrick Chanezon
Docker Community Edition—an open source product that lets you build, ship, and run containers—is an assembly of modular components built from an upstream open source project called Moby. Moby provides a “Lego set” of dozens of components, the framework for assembling them into specialized container-based systems, and a place for all container enthusiasts to experiment and exchange ideas.
Patrick Chanezon and Mindy Preston explain how you can leverage the Moby project to assemble your own specialized container-based system, whether for IoT, cloud, or bare-metal scenarios. Patrick and Mindy explore Moby’s framework, components, and tooling, focusing on two components: LinuxKit, a toolkit to build container-based Linux subsystems that are secure, lean, and portable, and InfraKit, a toolkit for creating and managing declarative, self-healing infrastructure. Along the way, they demo how to use Moby, LinuxKit, InfraKit, and other components to quickly assemble full-blown container-based systems for several use cases and deploy them on various infrastructures.
Best Practices for Running Kafka on Docker Containers – BlueData, Inc.
Docker containers provide an ideal foundation for running Kafka-as-a-Service on-premises or in the public cloud. However, using Docker containers in production environments for Big Data workloads using Kafka poses some challenges – including container management, scheduling, network configuration and security, and performance.
In this session at Kafka Summit in August 2017, Nanda Vijyaydev of BlueData shared lessons learned from implementing Kafka-as-a-Service with Docker containers.
https://kafka-summit.org/sessions/kafka-service-docker-containers
Dockerized containers are the current wave promising to revolutionize IT. Everybody is talking about containers, but a lot of people remain confused about how they work and why they are different from, or better than, virtual machines. In this session, Black Duck container and virtualization expert Tim Mackey will demystify containers, explain their core concepts, and compare and contrast them with the virtual machine architectures that have been the staple of IT for the last decade.
Cloud orchestration major tools comparison – Ravi Kiran
Cloud Orchestration major tools comparison (including history, installation, market share, and integration with other public cloud systems for each tool). For any clarification contact kiran79@techgeek.co.in
Introduction: This workshop will provide a hands-on introduction to Machine Learning (ML) with an overview of Deep Learning (DL).
Format: An introductory lecture on several supervised and unsupervised ML techniques, followed by a light introduction to DL and a short discussion of the current state of the art. Several Python code samples using the scikit-learn library will be introduced that users will be able to run in the Cloudera Data Science Workbench (CDSW).
Objective: To provide a quick and short hands-on introduction to ML with python’s scikit-learn library. The environment in CDSW is interactive and the step-by-step guide will walk you through setting up your environment, to exploring datasets, training and evaluating models on popular datasets. By the end of the crash course, attendees will have a high-level understanding of popular ML algorithms and the current state of DL, what problems they can solve, and walk away with basic hands-on experience training and evaluating ML models.
Prerequisites: For the hands-on portion, registrants must bring a laptop with a Chrome or Firefox web browser. These labs will be done in the cloud, no installation needed. Everyone will be able to register and start using CDSW after the introductory lecture concludes (about 1hr in). Basic knowledge of python highly recommended.
Floating on a RAFT: HBase Durability with Apache Ratis – DataWorks Summit
In a world with a myriad of distributed storage systems to choose from, the majority of Apache HBase clusters still rely on Apache HDFS. Theoretically, any distributed file system could be used by HBase. One major reason HDFS is predominantly used is the specific durability requirement of HBase's write-ahead log (WAL), which HDFS provides correctly. However, HBase's use of HDFS for WALs can be replaced with sufficient effort.
This talk will cover the design of a "Log Service" which can be embedded inside of HBase that provides a sufficient level of durability that HBase requires for WALs. Apache Ratis (incubating) is a library-implementation of the RAFT consensus protocol in Java and is used to build this Log Service. We will cover the design choices of the Ratis Log Service, comparing and contrasting it to other log-based systems that exist today. Next, we'll cover how the Log Service "fits" into HBase and the necessary changes to HBase which enable this. Finally, we'll discuss how the Log Service can simplify the operational burden of HBase.
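The WAL durability contract the Log Service must honor, append is acknowledged only after the bytes are safely on disk, and replay stops at a corrupt or truncated tail, can be sketched in a few lines (our illustration of the general contract, not Ratis code):

```python
import os
import struct
import tempfile
import zlib

class WriteAheadLog:
    """Length-prefixed, checksummed log records, fsync'd on every append."""
    def __init__(self, path):
        self.f = open(path, "ab")

    def append(self, payload: bytes):
        header = struct.pack(">II", len(payload), zlib.crc32(payload))
        self.f.write(header + payload)
        self.f.flush()
        os.fsync(self.f.fileno())  # durability point: record survives a crash

    def close(self):
        self.f.close()

def replay(path):
    """Recover records in order, stopping at a truncated or corrupt tail."""
    out = []
    with open(path, "rb") as f:
        while True:
            header = f.read(8)
            if len(header) < 8:
                break
            length, crc = struct.unpack(">II", header)
            payload = f.read(length)
            if len(payload) < length or zlib.crc32(payload) != crc:
                break  # partial write from a crash: discard the tail
            out.append(payload)
    return out

path = os.path.join(tempfile.mkdtemp(), "wal.log")
wal = WriteAheadLog(path)
wal.append(b"put row1")
wal.append(b"put row2")
wal.close()
recovered = replay(path)
```

A replicated log service like Ratis makes the same guarantee across machines via RAFT quorum acknowledgement rather than a single local fsync.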
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi – DataWorks Summit
Utilizing Apache NiFi, we read various open data REST APIs and camera feeds to ingest crime and related data in real time, streaming it into HBase and Phoenix tables. HBase makes an excellent storage option for our real-time time series data sources. We can immediately query our data utilizing Apache Zeppelin against Phoenix tables, as well as Hive external tables mapped to HBase.
Apache Phoenix tables also make a great option since we can easily put microservices on top of them for application usage. I have an example Spring Boot application that reads from our Philadelphia crime table for front-end web applications as well as RESTful APIs.
Apache NiFi makes it easy to push records with schemas to HBase and insert into Phoenix SQL tables.
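As a sketch of what "insert into Phoenix SQL tables" amounts to, here is a helper that builds a parameterized Phoenix UPSERT from a record dict (the table and column names are hypothetical, and this constructs the statement only, without a JDBC connection):

```python
def phoenix_upsert(table, record):
    """Build a parameterized Phoenix UPSERT statement from a record dict.
    Columns are sorted so the statement text is deterministic."""
    cols = sorted(record)
    placeholders = ", ".join("?" for _ in cols)
    sql = f"UPSERT INTO {table} ({', '.join(cols)}) VALUES ({placeholders})"
    return sql, [record[c] for c in cols]

# Hypothetical crime record, loosely modeled on the Philadelphia dataset
sql, params = phoenix_upsert(
    "PHILLY_CRIME",
    {"DC_KEY": "201901-123",
     "DISPATCH_DATE": "2019-01-05",
     "TEXT_GENERAL_CODE": "Thefts"},
)
```

Using `?` placeholders rather than string interpolation is what keeps the pipeline safe against injection when the values come from an external feed.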
Resources:
https://community.hortonworks.com/articles/54947/reading-opendata-json-and-storing-into-phoenix-tab.html
https://community.hortonworks.com/articles/56642/creating-a-spring-boot-java-8-microservice-to-read.html
https://community.hortonworks.com/articles/64122/incrementally-streaming-rdbms-data-to-your-hadoop.html
HBase Tales From the Trenches - Short stories about most common HBase operati... – DataWorks Summit
While HBase is the most logical answer for use cases requiring random, real-time read/write access to big data, it may not be trivial to design applications that make the most of it, nor the simplest to operate. Because it depends on and integrates with other components from the Hadoop ecosystem (ZooKeeper, HDFS, Spark, Hive, etc.) and external systems (Kerberos, LDAP), and its distributed nature requires a "Swiss clockwork" infrastructure, many variables must be considered when observing anomalies or even outages. Adding to the equation, HBase is still an evolving product, with different release versions currently in use, some of which carry genuine software bugs. In this presentation, we'll go through the most common HBase issues faced by different organisations, describing the identified cause and resolution action from my last five years supporting HBase for our heterogeneous customer base.
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac... – DataWorks Summit
LocationTech GeoMesa enables spatial and spatiotemporal indexing and queries for HBase and Accumulo. In this talk, after an overview of GeoMesa’s capabilities in the Cloudera ecosystem, we will dive into how GeoMesa leverages Accumulo’s Iterator interface and HBase’s Filter and Coprocessor interfaces. The goal will be to discuss both what spatial operations can be pushed down into the distributed database and also how the GeoMesa codebase is organized to allow for consistent use across the two database systems.
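One way GeoMesa-style systems linearize two-dimensional coordinates into sortable row keys is a Z-order (Morton) curve, which interleaves the bits of the two dimensions so that nearby points tend to get nearby keys. A simplified sketch (our illustration; GeoMesa's actual key layout is more elaborate):

```python
def interleave32(x: int, y: int) -> int:
    """Interleave the low 16 bits of x and y into a 32-bit Z-order value."""
    z = 0
    for i in range(16):
        z |= ((x >> i) & 1) << (2 * i)       # x occupies the even bits
        z |= ((y >> i) & 1) << (2 * i + 1)   # y occupies the odd bits
    return z

def z_index(lon: float, lat: float, bits: int = 16) -> int:
    """Quantize lon/lat to a grid, then Z-order the grid coordinates."""
    scale = (1 << bits) - 1
    x = int((lon + 180.0) / 360.0 * scale)
    y = int((lat + 90.0) / 180.0 * scale)
    return interleave32(x, y)

a = z_index(-77.0, 38.9)   # point near Washington, DC
b = z_index(-77.1, 38.8)   # nearby point: index differs only in low bits
c = z_index(139.7, 35.7)   # Tokyo: far away, very different index
```

Because nearby points share key prefixes, a spatial query becomes a small set of row-key range scans, which is exactly the kind of operation HBase and Accumulo execute efficiently.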
OCLC has been using HBase since 2012 to enable single-search-box access to over a billion items from your library and the world's library collection. This talk will provide an overview of how HBase is structured to provide this information, some of the challenges they have encountered scaling to support the world catalog, and how they have overcome them.
Many individuals and organizations have a desire to utilize NoSQL technology but often lack an understanding of how the underlying functional bits can be utilized to enable their use case. This situation can result in a drastically increased desire to put the SQL back in NoSQL.
Since the initial commit, Apache Accumulo has provided a number of examples to help jumpstart comprehension of how some of these bits function as well as potentially help tease out an understanding of how they might be applied to a NoSQL friendly use case. One very relatable example demonstrates how Accumulo could be used to emulate a filesystem (dirlist).
In this session we will walk through the dirlist implementation. Attendees should come away with an understanding of the supporting table designs, a simple text search supporting a single wildcard (on file/directory names), and how the dirlist elements work together to accomplish its feature set. Attendees should (hopefully) also come away with a justification for sometimes keeping the SQL out of NoSQL.
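The depth-prefixed row-key idea behind the dirlist example can be sketched with a sorted key list standing in for an Accumulo table (a simplification; the real example also stores file metadata and a separate index for wildcard search):

```python
import bisect

def row_key(path: str) -> str:
    """Depth-prefixed row key, in the spirit of Accumulo's dirlist example:
    sorting by key groups all entries of one directory together."""
    return f"{path.count('/'):03d}{path}"

class DirTable:
    def __init__(self):
        self.keys = []  # sorted row keys, standing in for a NoSQL table

    def put(self, path: str):
        bisect.insort(self.keys, row_key(path))

    def list_dir(self, d: str):
        """List children of d with a single contiguous range scan."""
        prefix = f"{d.count('/') + 1:03d}{d.rstrip('/')}/"
        lo = bisect.bisect_left(self.keys, prefix)
        out = []
        for k in self.keys[lo:]:
            if not k.startswith(prefix):
                break  # left the directory's key range: stop the scan
            out.append(k[3:])  # strip the 3-digit depth prefix
        return out

t = DirTable()
for p in ["/home", "/home/alice", "/home/alice/notes.txt", "/home/bob", "/var"]:
    t.put(p)
children = t.list_dir("/home")
```

The depth prefix is what keeps `/home/alice/notes.txt` out of a listing of `/home`: grandchildren sort under a different prefix, so the scan never touches them.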
HBase Global Indexing to support large-scale data ingestion at Uber – DataWorks Summit
Data serves as the platform for decision-making at Uber. To facilitate data driven decisions, many datasets at Uber are ingested in a Hadoop Data Lake and exposed to querying via Hive. Analytical queries joining various datasets are run to better understand business data at Uber.
Data ingestion, at its most basic form, is about organizing data to balance efficient reading and writing of newer data. Data organization for efficient reading involves factoring in query patterns to partition data to ensure read amplification is low. Data organization for efficient writing involves factoring the nature of input data - whether it is append only or updatable.
At Uber we ingest terabytes across many critical tables, such as trips, that are updatable. These tables are a fundamental part of Uber's data-driven solutions and act as the source of truth for all analytical use cases across the entire company. Datasets such as trips constantly receive updates in addition to inserts. To ingest such datasets we need a critical component that is responsible for bookkeeping the data layout and annotates each incoming change with the location in HDFS where the data should be written. This component is called Global Indexing. Without it, all records would be treated as inserts and re-written to HDFS instead of being updated, leading to duplication of data and breaking data correctness and user queries. This component is key to scaling our jobs, which now handle more than 500 billion writes a day in our current ingestion systems, and it needs strong consistency and high throughput for index writes and reads.
At Uber, we have chosen HBase as the backing store for the Global Indexing component, and it is critical in allowing us to scale our ingestion jobs. In this talk, we will discuss data@Uber and expound on why we built the global index using Apache HBase and how it helps scale out our cluster usage. We'll give details on why we chose HBase over other storage systems; how and why we came up with a creative solution to automatically load HFiles directly into the backend, circumventing the normal write path when bootstrapping our ingestion tables to avoid QPS constraints; as well as other learnings we had bringing this system up in production at the scale of data that Uber encounters daily.
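The bookkeeping role of the Global Indexing component can be sketched as a key-to-file mapping (HBase in Uber's design; a plain dict here, and the record keys and HDFS paths below are hypothetical):

```python
class GlobalIndex:
    """Sketch of ingestion bookkeeping: the index remembers which data
    file holds each record key, so an incoming change can be routed as
    an update to its existing file instead of a duplicate insert."""
    def __init__(self):
        self.key_to_file = {}

    def tag(self, record_key, target_file_if_new):
        if record_key in self.key_to_file:
            # known key: route the change to the file that already holds it
            return ("update", self.key_to_file[record_key])
        # unseen key: record where the new row will land
        self.key_to_file[record_key] = target_file_if_new
        return ("insert", target_file_if_new)

index = GlobalIndex()
first = index.tag("trip-42", "hdfs://data/trips/part-0001")
second = index.tag("trip-42", "hdfs://data/trips/part-0002")
```

The second call is tagged as an update to part-0001 even though the ingestion job proposed a new file; without that lookup the record would be written twice, which is exactly the duplication problem the abstract describes.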
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix – DataWorks Summit
Recently, Apache Phoenix has been integrated with the Apache Omid (incubating) transaction processing service to provide ultra-high system throughput with ultra-low latency overhead. Phoenix has been shown to scale beyond 0.5M transactions per second with sub-5ms latency for short transactions on industry-standard hardware. On the other hand, Omid has been extended to support secondary indexes, multi-snapshot SQL queries, and massive-write transactions.
These innovative features make Phoenix an excellent choice for translytics applications, which allow converged transaction processing and analytics. We share the story of building the next-gen data tier for advertising platforms at Verizon Media that exploits Phoenix and Omid to support multi-feed real-time ingestion and AI pipelines in one place, and discuss the lessons learned.
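The snapshot-isolation read rule at the heart of an Omid-style design, a reader sees only versions committed before its start timestamp, can be sketched as follows (illustrative only; Omid's actual protocol also handles conflict detection, commit tables, and fault tolerance):

```python
import itertools

class TimestampOracle:
    """Monotonic timestamp source, the centerpiece of Omid-style designs."""
    def __init__(self):
        self._counter = itertools.count(1)

    def next(self):
        return next(self._counter)

class MVCCStore:
    """Multi-versioned store: each key holds (commit_ts, value) versions."""
    def __init__(self, oracle):
        self.oracle = oracle
        self.versions = {}

    def write(self, key, value):
        commit_ts = self.oracle.next()
        self.versions.setdefault(key, []).append((commit_ts, value))
        return commit_ts

    def snapshot_read(self, key, start_ts):
        """Return the newest value committed strictly before start_ts."""
        visible = [v for ts, v in self.versions.get(key, []) if ts < start_ts]
        return visible[-1] if visible else None

oracle = TimestampOracle()
store = MVCCStore(oracle)
store.write("balance", 100)   # committed at ts = 1
reader_ts = oracle.next()     # reader's snapshot taken at ts = 2
store.write("balance", 80)    # committed at ts = 3, after the snapshot
seen = store.snapshot_read("balance", reader_ts)
```

The reader observes 100 even though a newer write exists, which is what lets analytics run over a consistent snapshot while transactions keep committing.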
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi – DataWorks Summit
Cybersecurity requires an organization to collect data, analyze it, and alert on cyber anomalies in near real time. This is a challenging endeavor considering the variety of data sources that need to be collected and analyzed: everything from application logs, network events, authentication systems, IoT devices, business events, and cloud service logs needs to be taken into consideration. In addition, multiple data formats need to be transformed and conformed to be understood by both humans and ML/AI algorithms.
To solve this problem, the Aetna Global Security team developed the Unified Data Platform based on Apache NiFi, which allows them to remain agile and adapt to new security threats and the onboarding of new technologies in the Aetna environment. The platform currently has over 60 different data flows with 95% doing real-time ETL and handles over 20 billion events per day. In this session learn from Aetna’s experience building an edge to AI high-speed data pipeline with Apache NiFi.
In the healthcare sector, data security, governance, and quality are crucial for maintaining patient privacy and ensuring the highest standards of care. At Florida Blue, the leading health insurer of Florida serving over five million members, there is a multifaceted network of care providers, business users, sales agents, and other divisions relying on the same datasets to derive critical information for multiple applications across the enterprise. However, maintaining consistent data governance and security for protected health information and other extended data attributes has always been a complex challenge that did not easily accommodate the wide range of needs for Florida Blue’s many business units. Using Apache Ranger, we developed a federated Identity & Access Management (IAM) approach that allows each tenant to have their own IAM mechanism. All user groups and roles are propagated across the federation in order to determine users’ data entitlement and access authorization; this applies to all stages of the system, from the broadest tenant levels down to specific data rows and columns. We also enabled audit attributes to ensure data quality by documenting data sources, reasons for data collection, date and time of data collection, and more. In this discussion, we will outline our implementation approach, review the results, and highlight our “lessons learned.”
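The row- and column-level entitlement check described above can be sketched as a policy-driven filter (the roles, columns, and predicates here are hypothetical illustrations, not Florida Blue's actual policies or Ranger's API):

```python
def apply_policy(rows, role, policies):
    """Return only the rows and columns a role is entitled to see.
    A policy names allowed columns and an optional row predicate."""
    policy = policies[role]
    allowed = policy["columns"]
    keep = policy.get("row_filter", lambda r: True)
    return [{c: r[c] for c in allowed if c in r} for r in rows if keep(r)]

# Hypothetical tenant policies: one role is region-restricted,
# the other sees different columns with no row restriction.
policies = {
    "care_provider": {"columns": {"member_id", "diagnosis"},
                      "row_filter": lambda r: r["region"] == "FL"},
    "sales_agent":   {"columns": {"member_id", "plan"}},
}

rows = [
    {"member_id": 1, "diagnosis": "A10", "plan": "gold", "region": "FL"},
    {"member_id": 2, "diagnosis": "B20", "plan": "silver", "region": "GA"},
]
care_view = apply_policy(rows, "care_provider", policies)
sales_view = apply_policy(rows, "sales_agent", policies)
```

In the federated setup the abstract describes, the role passed in would come from each tenant's own IAM system, while the policy evaluation stays centralized.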
Presto: Optimizing Performance of SQL-on-Anything Engine – DataWorks Summit
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Airbnb, Bloomberg, Comcast, Facebook, FINRA, LinkedIn, Lyft, Netflix, Twitter, and Uber, in the last few years Presto experienced an unprecedented growth in popularity in both on-premises and cloud deployments over Object Stores, HDFS, NoSQL and RDBMS data stores.
With the ever-growing list of connectors to new data sources such as Azure Blob Storage, Elasticsearch, Netflix Iceberg, Apache Kudu, and Apache Pulsar, the recently introduced Cost-Based Optimizer in Presto must account for heterogeneous inputs with differing and often incomplete data statistics. This talk will explore this topic in detail as well as discuss best use cases for Presto across several industries. In addition, we will present recent Presto advancements such as geospatial analytics at scale and the project roadmap going forward.
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl... – DataWorks Summit
Specialized tools for machine learning development and model governance are becoming essential. MLflow is an open source platform for managing the machine learning lifecycle. Just by adding a few lines of code to the function or script that trains their model, data scientists can log parameters, metrics, artifacts (plots, miscellaneous files, etc.) and a deployable packaging of the ML model. Every time that function or script is run, the results are logged automatically as a byproduct of those lines of code, even if the party doing the training run makes no special effort to record the results. MLflow application programming interfaces (APIs) are available for the Python, R, and Java programming languages, and MLflow sports a language-agnostic REST API as well. Over a relatively short time period, MLflow has garnered more than 3,300 stars on GitHub, almost 500,000 monthly downloads, and 80 contributors from more than 40 companies. Most significantly, more than 200 companies are now using MLflow. We will demo the MLflow Tracking, Project, and Model components with Azure Machine Learning (AML) Services and show you how easy it is to get started with MLflow on-prem or in the cloud.
Extending Twitter's Data Platform to Google Cloud – DataWorks Summit
Twitter's Data Platform is built using multiple complex open source and in-house projects to support data analytics on hundreds of petabytes of data. Our platform supports storage, compute, data ingestion, discovery, and management, along with various tools and libraries that help users with both batch and real-time analytics. Our Data Platform operates on multiple clusters across different data centers to help thousands of users discover valuable insights. As we scaled our Data Platform to multiple clusters, we also evaluated various cloud vendors to support use cases outside of our data centers. In this talk we share our architecture and how we extend our data platform to use the cloud as another data center. We walk through our evaluation process and the challenges we faced supporting data analytics at Twitter scale on the cloud, and present our current solution. Extending Twitter's data platform to the cloud was a complex task, which we deep dive into in this presentation.
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi – DataWorks Summit
At Comcast, our team has been architecting a customer experience platform which is able to react to near-real-time events and interactions and deliver appropriate and timely communications to customers. By combining the low latency capabilities of Apache Flink and the dataflow capabilities of Apache NiFi we are able to process events at high volume to trigger, enrich, filter, and act/communicate to enhance customer experiences. Apache Flink and Apache NiFi complement each other with their strengths in event streaming and correlation, state management, command-and-control, parallelism, development methodology, and interoperability with surrounding technologies. We will trace our journey from starting with Apache NiFi over three years ago and our more recent introduction of Apache Flink into our platform stack to handle more complex scenarios. In this presentation we will compare and contrast which business and technical use cases are best suited to which platform and explore different ways to integrate the two platforms into a single solution.
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger – DataWorks Summit
Companies are increasingly moving to the cloud to store and process data. One of the challenges companies have is in securing data across hybrid environments with easy way to centrally manage policies. In this session, we will talk through how companies can use Apache Ranger to protect access to data both in on-premise as well as in cloud environments. We will go into details into the challenges of hybrid environment and how Ranger can solve it. We will also talk through how companies can further enhance the security by leveraging Ranger to anonymize or tokenize data while moving into the cloud and de-anonymize dynamically using Apache Hive, Apache Spark or when accessing data from cloud storage systems. We will also deep dive into the Ranger’s integration with AWS S3, AWS Redshift and other cloud native systems. We will wrap it up with an end to end demo showing how policies can be created in Ranger and used to manage access to data in different systems, anonymize or de-anonymize data and track where data is flowing.
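The tokenization step described above can be sketched with a keyed hash: deterministic, so joins and group-bys still work on tokenized data, but not reversible without the key (a sketch of the general technique, not Ranger's actual mechanism; the key below is a placeholder):

```python
import hashlib
import hmac

SECRET_KEY = b"demo-key-rotate-me"  # placeholder; in practice fetched from a KMS

def tokenize(value: str) -> str:
    """Deterministic keyed token: the same input always yields the same
    token, so equality joins survive anonymization, but the original
    value cannot be recovered without the key."""
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()[:16]

t1 = tokenize("alice@example.com")
t2 = tokenize("alice@example.com")  # identical token: joins still match
t3 = tokenize("bob@example.com")    # different input, different token
```

De-anonymization in a scheme like this requires keeping a key-protected mapping from token back to value; the HMAC itself is one-way by design.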
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory... – DataWorks Summit
Advanced big data processing frameworks have been proposed to harness the fast data transmission capability of Remote Direct Memory Access (RDMA) over high-speed networks such as InfiniBand, RoCEv1, RoCEv2, iWARP, and OmniPath. However, with the introduction of Non-Volatile Memory (NVM) and NVM Express (NVMe) based SSDs, these designs along with the default big data processing models need to be re-assessed to discover the possibilities of further enhanced performance. In this talk, we will present NRCIO, a high-performance communication runtime for non-volatile memory over modern network interconnects that can be leveraged by existing big data processing middleware. We will show the performance of non-volatile memory-aware RDMA communication protocols using our proposed runtime and demonstrate its benefits by incorporating it into a high-performance in-memory key-value store, Apache Hadoop, Tez, Spark, and TensorFlow. Evaluation results illustrate that NRCIO can achieve up to 3.65x performance improvement for representative big data processing workloads on modern data centers.
Background: Some early applications of Computer Vision in Retail arose from e-commerce use cases - but increasingly, it is being used in physical stores in a variety of new and exciting ways, such as:
● Optimizing merchandising execution, in-stocks and sell-thru
● Enhancing operational efficiencies, enabling real-time customer engagement
● Enhancing loss prevention capabilities, response time
● Creating frictionless experiences for shoppers
Abstract: This talk will cover the use of Computer Vision in Retail, the implications to the broader Consumer Goods industry and share business drivers, use cases and benefits that are unfolding as an integral component in the remaking of an age-old industry.
We will also take a ‘peek under the hood’ of Computer Vision and Deep Learning, sharing technology design principles and skill set profiles to consider before starting your CV journey.
Deep learning has matured considerably in the past few years to produce human or superhuman abilities in a variety of computer vision paradigms. We will discuss ways to recognize these paradigms in retail settings, collect and organize data to create actionable outcomes with the new insights and applications that deep learning enables.
We will cover the basics of object detection, then move into advanced image processing, describing possible ways that a retail store of the near future could operate: identifying various storefront situations with a deep learning system attached to a camera stream, such as item stock levels on shelves, a shelf in need of organization, or a wandering customer in need of assistance.
We will also cover how to use a computer vision system to automatically track customer purchases to enable a streamlined checkout process, and how deep learning can power plausible wardrobe suggestions based on what a customer is currently wearing or purchasing.
Finally, we will cover the various technologies that are powering these applications today: deep learning tools for research and development, production tools to distribute that intelligence to an entire inventory of cameras situated around a retail location, and tools for exploring and understanding the new data streams produced by the computer vision systems.
By the end of this talk, attendees should understand the impact Computer Vision and Deep Learning are having in the Consumer Goods industry, key use cases, techniques and key considerations leaders are exploring and implementing today.
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark – DataWorks Summit
Whole genome shotgun based next-generation transcriptomics and metagenomics studies often generate 100 to 1000 gigabytes (GB) of sequence data derived from tens of thousands of different genes or microbial species. De novo assembling these data requires a solution that both scales with data size and optimizes for individual genes or genomes. Here we developed an Apache Spark-based scalable sequence clustering application, SparkReadClust (SpaRC), that partitions reads based on their molecule of origin to enable downstream assembly optimization. SpaRC produces high clustering performance on transcriptomics and metagenomics test datasets from both short-read and long-read sequencing technologies. It achieved near-linear scalability with respect to input data size and number of compute nodes, and it can run on different cloud computing environments without modification while delivering similar performance. In summary, our results suggest SpaRC provides a scalable solution for clustering billions of reads from next-generation sequencing experiments, and Apache Spark represents a cost-effective solution with rapid development/deployment cycles for similar big data genomics problems.
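The read-partitioning idea, grouping reads that share k-mers so each cluster approximates one molecule of origin, can be sketched with union-find on a toy dataset (our illustration of the underlying principle; SpaRC's distributed Spark implementation differs):

```python
def kmers(seq, k=5):
    """All length-k substrings of a read."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def cluster_reads(reads, k=5):
    """Union-find over reads: any two reads sharing a k-mer end up in
    the same cluster (a toy version of SpaRC's overlap grouping)."""
    parent = list(range(len(reads)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    owner = {}  # k-mer -> first read index seen carrying it
    for idx, read in enumerate(reads):
        for km in kmers(read, k):
            if km in owner:
                parent[find(idx)] = find(owner[km])  # merge clusters
            else:
                owner[km] = idx
    return [find(i) for i in range(len(reads))]

reads = ["ACGTACGTAC", "CGTACGTACG",  # overlapping reads: one cluster
         "TTTTTGGGGG"]                # shares no k-mers: its own cluster
labels = cluster_reads(reads)
```

Each resulting cluster can then be handed to an assembler independently, which is what makes the downstream per-gene or per-genome optimization possible.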
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Solutions Apricot) (Jeffrey Haguewood)
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Securing your Kubernetes cluster: a step-by-step guide to success! (KatiaHIMEUR1)
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
UiPath Test Automation using UiPath Test Suite series, part 4 (DianaGray10)
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques.
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality (Inflectra)
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
DevOps and Testing slides at DASA Connect (Kari Kakkonen)
Slides by me and Rik Marselis at the DASA Connect conference on 30.5.2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps looks like. We also held a lovely workshop with the participants, exploring different ways to think about quality and testing in different parts of the DevOps infinity loop.
The Art of the Pitch: WordPress Relationships and Sales (Laura Byrne)
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers, without pulling teeth or pulling your hair out. Practical tips and strategies for successful relationship building that leads to closing the deal.
Essentials of Automations: Optimizing FME Workflows with Parameters (Safe Software)
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
UiPath Test Automation using UiPath Test Suite series, part 3 (DianaGray10)
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction,
UI automation Sample
Desktop automation flow
Speakers:
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Key Trends Shaping the Future of Infrastructure (Cheryl Hung)
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
This keynote explores the key trends across hardware, cloud, and open source: how these areas are likely to mature and develop over the short and long term, and how organisations can position themselves to adapt and thrive.
JMeter webinar - integration with InfluxDB and Grafana (RTTS)
Watch this recorded webinar about real-time monitoring of application performance. See how to integrate Apache JMeter, the open-source leader in performance testing, with InfluxDB, the open-source time-series database, and Grafana, the open-source analytics and visualization application.
In this webinar, we will review the benefits of leveraging InfluxDB and Grafana when executing load tests and demonstrate how these tools are used to visualize performance metrics.
Length: 30 minutes
Session Overview
-------------------------------------------
During this webinar, we will cover the following topics while demonstrating the integrations of JMeter, InfluxDB and Grafana:
- What out-of-the-box solutions are available for real-time monitoring JMeter tests?
- What are the benefits of integrating InfluxDB and Grafana into the load testing stack?
- Which features are provided by Grafana?
- Demonstration of InfluxDB and Grafana using a practice web application
To view the webinar recording, go to:
https://www.rttsweb.com/jmeter-integration-webinar
Why Kubernetes as a container orchestrator is the right choice for running Spark clusters on the cloud
1. Why Kubernetes as a container orchestrator is the right choice for running Spark clusters on the cloud
Rachit Arora
rachitar@in.ibm.com
IBM, India Software Labs
2. Spark
Unified, open source, parallel, data processing framework for Big Data Analytics
• Spark Core Engine
• Cluster managers: YARN, Mesos, Standalone Scheduler, Kubernetes
• Libraries on top of the core:
• Spark SQL: interactive queries
• Spark Streaming: stream processing
• Spark MLlib: machine learning
• GraphX: graph computation
4. Let's look into the role of a Data Scientist
• I want to run my analytics jobs
• Social media analytics
• Text analytics (structured and unstructured)
• I want to run queries on demand
• I want to run R scripts
• I want to submit Spark jobs
• I want to view the History Server logs of my application
• I want to view daemon logs
• I want to write notebooks
5. Evolution of Spark Analytics
On-Prem Install
• Acquire hardware
• Prepare machines
• Install Spark
• Retry
• Apply patches
• Security
• Upgrades
• Scale
• High availability
Virtualization
• Prepare VM imaging solution
• Network management
• High availability
• Patches
• Scale
Managed
• Configure cluster
• Customize
• Scale
• Pay even if idle
Serverless
• Run analytics
6. Spark Serverless Characteristics
• No servers to provision
• Scale with usage
• Availability and fault tolerance
• Never pay for idle
[Diagram: the Data Scientist's browser talks to a Notebook Server, Kernel, and History Server; a Data Engineer feeds data in via COS/ingestion]
7. Typical Hadoop/Spark Cluster - Setup
• Get the suitable hardware
• Prepare host machine
• Setup various networks
• Private
• Public
• Management
• Fetch the binaries for the install
• Prepare the blueprint/config file for the install
• Start the install
• Installs often fail; debug and retry again.
8. Earlier Experiments
Option | OS Provisioning | Config | Cluster Management / Updates
1 | Bare metal | Chef | Chef
2 | xCAT – Stateful (create your own VMs) | PostScripts | xCAT updateNode
3 | xCAT – Sysclone (image from current system) | Not needed | xCAT updateNode
4 | Bare metal | PostScripts | xCAT updateNode
5 | Cloud-provider-specific images | Not needed | Manual/scripts
6 | Standard ISO image | Anaconda post-scripts | Manual/scripts
9. How do I build Serverless Spark
• Option 1: Vanilla Containers – If I need to build with Kubernetes
• Repeatable
• Application Portability
• Faster Development Cycle
• Reduced dev-ops load
• Improved Infrastructure Utilization
10. Guiding Principles
• Virtualization helps repeatability, fewer failures, and speed
• Maintenance
• Performance (equivalent to bare metal)
• Use open source from an active community
• Cloud-agnostic
13. Docker in Hadoop Cluster on Cloud
• Each cluster node is a virtual node powered by Docker: each node of the cluster is a Docker container
• Docker containers run on a pool of bare metal hosts (Docker hosts)
• Each Hadoop cluster has multiple nodes/Docker containers spanning multiple hosts
• Docker
• Container management - Custom
• Multi host networking – Overlay Network
• Registry – Private
• Local Storage
14. Typical Clusters
[Diagram: three clusters, each inside its own network boundary. Each cluster has a Master node plus Data nodes (Data 1 through Data 5 in the largest), and per-cluster LDAP, KMS, MySQL, and SSH services.]
15. Docker Images
• Master node
• Data node
• Edge node
• Auxiliary service images
• LDAP
• MySQL
• Ambari server
• KMS
16. Multi-host Docker networking
• Weave-based overlay network among nodes
• One /26 private subnet per cluster (172.x.x.x)
• Master node has a public IP, with port forwarding
• Portable public IPs
• Network speed (shared with other masters)
• Edge node is accessible using a public IP
• Users can SSH in and run Hive, HBase, Hadoop, and Spark shells
• Private network
• High speed
• Secure
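The "one /26 private subnet per cluster" sizing can be sanity-checked with Python's standard ipaddress module. A minimal sketch; 172.30.0.0/26 is only an illustrative subnet, not the deck's actual allocation:

```python
import ipaddress

# One /26 private subnet per cluster, carved from 172.x.x.x space.
# 172.30.0.0/26 is an illustrative example value.
cluster_subnet = ipaddress.ip_network("172.30.0.0/26")

print(cluster_subnet.num_addresses)          # 64 addresses in a /26
print(len(list(cluster_subnet.hosts())))     # 62 usable hosts (minus network/broadcast)
print(cluster_subnet.is_private)             # True: inside RFC 1918 172.16.0.0/12
```

So each cluster can hold roughly 60 container nodes before it needs a larger subnet, which is plenty for the cluster sizes shown on the "Typical Clusters" slide.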
17. Network Architecture
[Diagram: three Docker hosts joined by a weave overlay network running over the 10 Gbps SoftLayer private network. Each host has bonded interfaces (bond0/bond1) to the 10 Gbps SoftLayer public and private networks, a docker0 bridge, an eth-weave interface, and docker port forwarding. Containers shown: master nodes, an edge node, and data nodes.]
* docker ICC=false (no inter-container communication over the docker0 network)
* All inter-container communication is through the weave network
* One weave private subnet per cluster (no communication across subnets)
19. Provisioning Infrastructure
• Cluster Manager that provides a REST API to create clusters
• API Gateway application
• Deployment agent
• Home-grown container orchestrator: deployer scripts that actually do all the work
• Prepare directory structure
• Prepare network
• Start containers with the right options for
• Volumes
• Ports
• IP
• Hostname
• Network
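The "start containers with the right options" step amounts to assembling a docker run invocation per node. A hypothetical sketch of how such a deployer script might build the argument list; the container name, paths, ports, and network name below are made-up placeholders, not the team's actual values:

```python
def build_docker_run(name, hostname, ip, network, volumes, ports):
    """Assemble a `docker run` argv list from the per-node options the
    deployer computes (volumes, ports, IP, hostname, network)."""
    cmd = ["docker", "run", "-d", "--name", name, "--hostname", hostname]
    for host_dir, container_dir in volumes:
        cmd += ["-v", f"{host_dir}:{container_dir}"]
    for host_port, container_port in ports:
        cmd += ["-p", f"{host_port}:{container_port}"]
    # --ip requires a user-defined network (here, a per-cluster overlay)
    cmd += ["--net", network, "--ip", ip]
    return cmd

# Hypothetical master-node container on a per-cluster overlay network:
cmd = build_docker_run(
    name="cluster42-master", hostname="master.cluster42",
    ip="172.30.0.2", network="weave-cluster42",
    volumes=[("/data/cluster42/master", "/var/lib/hadoop")],
    ports=[(8443, 8443)],
)
print(" ".join(cmd))
```

Replacing this home-grown assembly with Kubernetes means the same options become fields of a pod spec instead of hand-built CLI flags.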
21. How is a cluster created?
[Sequence: API Gateway, Cluster Manager (backed by a DB), and Deployer Agents]
1. Create Cluster (request arrives at the API Gateway)
2. Manage resources (Cluster Manager, recorded in the DB)
3. Create Cluster (Cluster Manager dispatches to the Deployer Agents)
4. PrepareNode
5. Get node details
6. Get blueprint
7. Install IOP
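The create-cluster sequence can be sketched as a small driver that walks the same seven steps. Component names mirror the slide; the step bodies are illustrative stand-ins, not the real Cluster Manager or Deployer Agent APIs:

```python
def create_cluster(num_nodes):
    """Walk the seven-step provisioning flow, returning a log of steps.
    Each step would be a REST call in the real system; here they are stubs."""
    log = []
    log.append("1. API Gateway: Create Cluster request received")
    log.append("2. Cluster Manager: manage resources (reserve hosts, record in DB)")
    log.append("3. Cluster Manager: Create Cluster dispatched to deployer agents")
    for node in range(1, num_nodes + 1):
        log.append(f"4. Deployer Agent: PrepareNode data{node}")
    log.append("5. Deployer Agents: get node details")
    log.append("6. Deployer Agents: get blueprint/config for the install")
    log.append("7. Deployer Agents: install IOP (Hadoop/Spark stack)")
    return log

for line in create_cluster(num_nodes=3):
    print(line)
```

Note there is no master deployer: any agent can run the per-node steps, which is why step 4 fans out across nodes.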
22. How do I build Serverless Spark
• Option 2: Function as a Service
• Single-node cluster, or no cluster at all
• Spark local mode
• All-in-one image
• Resource limitations
• Design limitations
23. How do I build Serverless Spark
• Option 3 : Kubernetes
* Slide from Kubernetes Scheduler Design & Discussion
24. What does Kubernetes bring in?
• Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications.
• It manages containers for me
• It manages high availability
• It gives me the flexibility to choose the resources I want and the persistence I want
• Kubernetes has lots of add-on services: third-party logging, monitoring, and security tools
• Reduced operational costs
• Improved infrastructure utilization
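Since Spark 2.3, spark-submit can target Kubernetes directly as the cluster manager. A minimal sketch assembling the documented flags; the API server URL, namespace-free image name, and jar path are placeholders, not a real deployment:

```python
# spark-submit flags for Spark's native Kubernetes support (Spark 2.3+).
# The API server URL, image, and jar path below are placeholder values.
spark_submit = [
    "spark-submit",
    "--master", "k8s://https://k8s-apiserver.example.com:6443",
    "--deploy-mode", "cluster",
    "--name", "spark-pi",
    "--class", "org.apache.spark.examples.SparkPi",
    "--conf", "spark.executor.instances=5",
    "--conf", "spark.kubernetes.container.image=example/spark:2.3.0",
    "local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar",
]
print(" ".join(spark_submit))
```

The driver runs as a pod and asks the Kubernetes scheduler for executor pods, so the resource management, scheduling, and lifecycle work of the home-grown orchestrator is delegated to Kubernetes.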
26. Conclusion
• Spark Serverless: a need for Data Scientists
• Kubernetes enables
• Spark clusters in the background, with kernels up and running in seconds
• High availability
• Auto-scaling with Spark monitoring and Kubernetes deployment features
• No extra cost for idle time
27. References
• IBM Watson Studio
https://datascience.ibm.com
• IBM Watson
https://www.ibm.com/analytics/us/en/watson-data-platform/tutorial/
• Analytics Engine
https://www.ibm.com/cloud/analytics-engine
• Apache Spark
• Kubernetes Scheduler
Design & Discussion
• Kubernetes Clusters on IBM Cloud
Rachit Arora
rachitar@in.ibm.com
@rachit1arora
Spark is an open source, scalable, massively parallel, in-memory execution engine for analytics applications.
Think of it as an in-memory layer that sits above multiple data stores, where data can be loaded into memory and analyzed in parallel across a cluster. Spark Core is the foundation of Spark, providing libraries for scheduling and basic I/O; Spark offers hundreds of high-level operators that make it easy to build parallel apps.
Spark also includes prebuilt machine-learning algorithms and graph analysis algorithms that are specifically written to execute in parallel and in memory. It also supports interactive SQL processing of queries and real-time streaming analytics. You can write analytics applications in programming languages such as Java, Python, R, and Scala. You can run Spark using its standalone cluster mode, on cloud, on Hadoop YARN, on Apache Mesos, or on Kubernetes, and access data in HDFS, Cassandra, HBase, Hive, Object Store, and any Hadoop data source.
Prepare: even though you have the right data, it may not be in the right format or structure for analysis. That's where data preparation comes in. Data engineers need to bring raw data into one interface from wherever it lives (on premises, in the cloud, or on your desktop), where it can then be shaped, transformed, explored, and prepared for analysis.
Data scientist: primarily responsible for building predictive analytic models and insights. He will analyze data that's been cataloged and prepared by the data engineer, using machine learning tools like Watson Machine Learning, and build applications using Jupyter Notebooks and RStudio.
After the data scientist shares his analytical outputs, an application developer can build apps like a cognitive chatbot. As the chatbot engages with customers, it will continuously improve its knowledge and help uncover new insights.
As a data scientist, what I was required to do.
On-prem to virtualization: as demand in my organization for the service increased, I decided to move to virtualized VMs to handle many requests on demand, but it was still painful.
Then I decided to try services offered on the cloud, like EMR, IBM Analytics Engine, or Microsoft HDInsight, but there I needed to order clusters and configure them to suit my workloads, and keep them running even when I did not want to use them.
Cover what it takes to install a Hadoop/Spark cluster.
With Spark and Hadoop, serverless is a whole new game: "Function as a Service".
- History, logs, performance, as if I am doing stuff on prem
Let's take the case of a Data Scientist.
Notebook: kernel, logs, History Server expectations from Spark serverless
Data Engineers and Scientists sending requests to Serverless Spark
Setup is hard
6.8 to 6.7 example
Optimal settings for workloads
We do have such an offering; it is deployed in Bluemix.
xCAT is Extreme Cluster/Cloud Administration Toolkit, xCAT offers complete management for clusters, Grids, Clouds, Datacenters, and many other things. It is agile, extensible, and based on years of system administration best practices and experience. It enables you to:
Provision Operating Systems on physical or virtual machines: RHEL, CentOS, Fedora, SLES, Ubuntu, AIX, Windows, VMWare, KVM, PowerVM, PowerKVM, zVM.
Provision using scripted install, stateless, statelite, iSCSI, or cloning
What problems we faced:
- Managing hardware as a service is complex
Build solutions for container orchestration
Container security and OS patching
PROBLEM: INSTALLATION AND CUSTOMIZATION
Data persistence
- First we tried an imaging solution, and then I did a docker run command. That was just magic.
"How do I back up a container?"
How do I handle a restart of the host machine?
"What's my patch-management strategy for my running containers?"
How do I network them?
Where is my data going to be?
How do I manage them?
500 bare metal hosts
We spin up containers
We are not using an orchestrator as of now; will talk about why
Our own registry
Local storage
A monolithic application being broken down; we plan to break it down further into services
Stress the benefits of separating things out, e.g. I can change LDAP to IPA, MySQL to another DB, or the KMS version
Separate metrics for these containers
Stress on security
Stress on extensibility to adopt newer networking solutions
Mention that it is not recommended to have SSH
Talk about IP forwarding, the multiple networks we have, and the need for those networks in the cluster
Explain here the CPU allocation and scheduling strategies, how we wanted to schedule them, and the reasoning behind taking advantage of Hadoop's built-in redundancy and replication to lower the chances of failure
Talk here about orchestration needs, CPUSET, and local disks, and how many of the current orchestrators do not fit. Talk about the monitoring and logging aspects of the containers.
Resource manager, layout maker
The deployer can be any of the machines; there is no master deployer here
Wait for the nodes to be prepared
Once up, start the install, which is config driven: yum install hadoop? Oh, I have it. Set up the DB? Oh, I have it.
AWS Lambda, Apache OpenWhisk (IBM), Microsoft Azure Functions: no cluster, just executors
Not the experience I am used to. History Server logs? Monitoring tools?
Inability to communicate directly: Spark, using a DAG execution framework, spawns jobs with multiple stages. For inter-stage communication, Spark requires data transfer across executors. Many Function-as-a-Service platforms do not allow communication between two function invocations. This poses a challenge for running executors in this environment.
Extremely limited runtime resources: many function invocations are currently limited to a maximum execution duration of 5 minutes, 1536 MB of memory, and 512 MB of disk space. Spark loves memory, can have a large disk footprint, and can spawn long-running tasks. This makes functions a difficult environment to run Spark on.
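A quick back-of-the-envelope check of those limits against a modest Spark workload. The 1536 MB / 512 MB / 5-minute figures come from the text above; the executor-side numbers are illustrative assumptions, not fixed Spark requirements:

```python
# FaaS limits quoted above vs. a modest, assumed Spark executor profile.
faas_limit_mb = 1536           # max memory per function invocation
faas_disk_mb = 512             # max scratch disk per invocation
faas_max_seconds = 5 * 60      # max execution duration

executor_memory_mb = 4096      # a common, modest spark.executor.memory setting
shuffle_spill_mb = 2048        # shuffle-heavy stages can spill this much to disk
long_task_seconds = 3600       # batch stages routinely run for an hour

print(executor_memory_mb > faas_limit_mb)    # executor doesn't fit the memory cap
print(shuffle_spill_mb > faas_disk_mb)       # spill exceeds the scratch disk
print(long_task_seconds > faas_max_seconds)  # tasks outlive the invocation window
```

All three comparisons come out in Spark's disfavor, which is the design limitation the slide is pointing at.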
Introduction to Kubernetes
Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications.
It manages containers for me
It manages high availability
It gives me the flexibility to choose the resources I want and the persistence I want
Kubernetes has lots of add-on services: third-party logging, monitoring, and security tools
Reduced operational costs
Improved infrastructure utilization
Little to no latency
IBM Watson brings together data management, data policies, data preparation, and analysis capabilities into a common framework.
You can index, discover, control, and share data with Watson Knowledge Catalog, refine and prepare the data with Data Refinery, then organize resources to analyze the same data with Watson Studio.
The IBM Watson apps are fully integrated to use the same user interface and framework. You can pick whichever apps and tools you need for your organization.
Watson Studio provides you with the environment and tools to solve your business problems by collaboratively analyzing data.
What is Analytics Engine?
You can use Analytics Engine to build and deploy clusters within minutes, with a simplified user experience, scalability, and reliability. You can custom-configure the environment and scale on demand.