Serverless spark

Serverless Spark
Rachit Arora, Lead Software Architect, IBM Cloud

Spark
Unified, open source, parallel, data processing framework for Big Data Analytics
Spark Core Engine
Yarn Mesos
Standalon
e
Scheduler
Kubernete
s
Spark SQL
Interactive
Queries
Spark
Streaming
Stream
processing
Spark
MLlib
Machine
Learning
GraphX
Graph
Computation

Typical Bigdata Application
Secure
Catalog and Search
Ingest &
Store
Prepare Analyze Visualize
Date Engineer Date Scientist
Application
Developer

Let look into
role of Data
Scientist
• I want to run my analytics jobs
• Social media analytics
• Text analytics (Structure and Unstructured)
• I want to run queries on demand
• I want to run R scripts
• I want to submit Spark jobs
• I want to view History Server Logs of my
application
• I want to View Daemon logs
• I want to write Notebooks

Evolution of Spark Analytics
On Prem Install
• Acquire
Hardware
• Prepare
Machine
• Install Spark
• Retry
• Apply patches
• security
• Upgrades
• Scale
• High
availability
Virtualization
• Prepare Vm
Imaging
Solution
• Network
Management
• High
Avilability
• Patches
• Scale
Managed
• Configure
Cluster
• Customize
• Scale
• Pay even if
idle
Serverless
• Run analytics

IBM Watson Studio
Spark Environments

What
Kubernetes
Bring in?
Kubernetes is an open-source
system for automating
deployment, scaling, and
management of containerized
applications.
•It Manages Containers for me
•It Manages High availability
•It Provides me flexibility to choose
resource I WANT and Persistence I
want
Kubernetes – Lots of addon
services: third-party logging,
monitoring, and security tools
Reduced operational costs
Improved infrastructure
utilization

Why Run Spark on
Kubernetes
• Are you using data
analytical pipeline which is
containerized?
• Resource sharing is better
optimized
• Leveraging Kubernetes
ecosystem
• Kubernetes community
support

Serverless Spark
Option 1: Multitenant Spark
Cluster
•Performance is not consistent
•Wrong Library in class path impact
•Secure and Complaint
•ISO
•HIPAA
•Stability issue
•Single point of failure
•Maintenance and Upgrades

Serverless Spark
Option 2: Function as a service
• Single Node Cluster – Or No
Cluster at all
• Spark local mode
• all in one Image
• Resource Limitations
• Design Limitations

Serverless Spark
Option 3: Vanilla Containers
• Repeatable
• Application Portability
• Faster Development Cycle
• Reduced dev-ops load
• Improved Infrastructure
Utilization

Serverless Spark
Option 4:- Kubernetes with
Standalone cluster manager

Serverless Spark
Kubernetes cluster manager

Serverless Spark
Kubernetes cluster manager +
Spark Operator

References
• IBM Watson Studio
https://datascience.ibm.com
• Spark Environments
• IBM Watson
https://www.ibm.com/analytics/us/en/watson-
data-platform/tutorial/
• Analytics Engine
https://www.ibm.com/cloud/analytics-engine
• Apache Spark
• Apache Arrow
• Alluxio
• Kubernetes Scheduler
Design & Discussion
• Kuberenetes Clusters on IBM Cloud

Thank you
Rachit Arora
rachitar@in.ibm.com
Twitter @rachit1arora

Serverless spark

More Related Content

What's hot

Similar to Serverless spark

Recently uploaded

Serverless spark

Editor's Notes