Apache Spark is a powerful engine for distributed data processing, able to handle petabytes of data across many servers. Its capabilities and performance unseated other technologies in the Hadoop world, but that power comes with a high maintenance cost, which is why we now see innovations aimed at simplifying Spark infrastructure.
Running Apache Spark Jobs Using Kubernetes
1.
2. Running Apache Spark Jobs Using Kubernetes
Yaron Haviv, CTO, Iguazio
Marcelo Litovsky, Solution Architect, Iguazio
3. 85% of AI Projects Never Make It to Production
Research environment: built from scratch with a large team; manual extraction; in-memory analysis; small-scale training; manual evaluation.
Production pipeline: real-time ingestion; preparation at scale; training with many parameters and large data; real-time events and data features; ETL, streaming, APIs, sync.
4. Spark Helps Us Scale the ML Pipeline
Pipeline stages: Ingest (ETL, streaming, logs, scrapers, ...), Prepare (join, aggregate, split, ...), Train (with hyper-params and multiple algorithms), Validate (selected model with test data), and Deploy (test, deploy, and monitor the model and API servers).
Underneath: serverless ML and analytics functions, plus fast, secure, versioned features/data flowing between stages (base features, train and test datasets, model, reports, metrics, real-time features).
5. Why Spark on Kubernetes?
▪ Unified management —Getting away from two cluster management
interfaces if your organization already is using Kubernetes elsewhere.
▪ Ability to isolate jobs —You can move models and ETL pipelines from
dev to production without the headaches of dependency
management.
▪ Resilient infrastructure —You don’t worry about sizing and building the
cluster, manipulating Docker files or networking configurations.
▪ Vibrant community constantly evolving
5
6. Goodbye Hadoop, Hello Cloud-Native
Eliminate complexity and inefficiency, gain cloud agility.
Hadoop stack: YARN, HDFS, HBase, MapReduce, Pig, Hive, and other data-orchestration middleware wrapped around your business logic.
Cloud-native stack: your business logic runs as any containerized microservice; you consume and innovate on managed storage and databases.
7. Spark on Kubernetes
Diagram and Bullet point Credit: https://spark.apache.org/docs/latest/running-on-kubernetes.html#prerequisites
• Spark creates a Spark driver running within
a Kubernetes pod.
• The driver creates executors which are also
running within Kubernetes pods and connects
to them, and executes application code.
• When the application completes, the executor
pods terminate and are cleaned up, but the
driver pod persists logs and remains in
“completed” state in the Kubernetes API until
it’s eventually garbage collected or manually
cleaned up.
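This driver/executor lifecycle is triggered by an ordinary spark-submit call pointed at the Kubernetes API server. A minimal cluster-mode sketch, in which the API server address, container image, and jar path are placeholders to replace with your own:

```shell
# Run the bundled SparkPi example in cluster mode: the driver and the
# executors all run as Kubernetes pods. Placeholders are marked with <...>.
spark-submit \
  --master k8s://https://<k8s-apiserver-host>:<port> \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.container.image=<your-spark-image> \
  local:///path/to/spark-examples.jar
```

After submission, `kubectl get pods` shows the driver pod, and `kubectl logs` on it streams the job output; the completed driver pod remains visible until it is cleaned up, as described above.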
8. How to Run Your Spark Job in Kubernetes?
Three submission modes:
▪ Client mode: spark-submit runs on the client, and the Spark driver stays with the client while connecting to Spark executors running in the cluster.
▪ Cluster mode: spark-submit launches the Spark driver inside a Kubernetes pod, and the driver in turn launches the executors.
▪ K8s Operator: kubectl submits a resource to the Kubernetes API, and the Spark Operator creates the driver and executor pods.
9. Comparing Modes

Execution environment:
- Client mode: driver runs in the job-scheduling environment.
- Cluster mode: driver runs in a Kubernetes pod.
- K8s Operator: driver runs in a Kubernetes pod.

Driver pod communication:
- Client mode: user needs to define communication between driver and executors.
- Cluster mode: Kubernetes networking needs to be properly configured for driver and executor pods to communicate.
- K8s Operator: the operator enables proper communication between driver and executors.

Role-based access controls:
- Client mode: user needs direct access to Kubernetes with proper RBAC.
- Cluster mode: user needs direct access to Kubernetes with proper RBAC.
- K8s Operator: the operator handles deployments; more flexibility in configuring RBAC.

Execution:
- Client mode: driver can be located in a separate host/container.
- Cluster mode: driver runs in the same Kubernetes cluster as the executors.
- K8s Operator: driver runs in the same Kubernetes cluster as the executors.
10. Demo I
See repo: http://github.com/marcelonyc/igz_sparkk8s
• Instructions to deploy Spark Operator on Docker Desktop
• Configuration commands and files
• Examples
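Beyond the Docker Desktop setup in the repo, the operator is commonly installed with Helm. A sketch using the chart as published by the GoogleCloudPlatform/spark-on-k8s-operator project; verify the repo URL and chart name against that project's current README:

```shell
# Install the Kubernetes Operator for Apache Spark via its Helm chart,
# then confirm the operator pod is up.
helm repo add spark-operator https://googlecloudplatform.github.io/spark-on-k8s-operator
helm repo update
helm install spark-operator spark-operator/spark-operator \
  --namespace spark-operator --create-namespace
kubectl get pods --namespace spark-operator
```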
11. DevOps Challenges Remain
▪ Per-job custom resources and configuration
▪ Specific runtime requirements and package dependencies
▪ Elastic scaling, resource limits & guarantees, ...
▪ ML pipeline integration
▪ Coexistence/integration with other frameworks
▪ Resource and job monitoring
▪ ...
12. Serverless: Resource Elasticity (to Zero), Automated Deployment and Operations
Serverless today vs. data prep and training jobs:
▪ Task lifespan: milliseconds to minutes vs. seconds to hours
▪ Scaling: load balancer vs. partition, shuffle, reduce, hyper-params, RDDs
▪ State: stateless vs. stateful
▪ Input: event vs. params and datasets
Why not make Spark serverless? It's time we extended serverless to data science!
13. ML & Analytics Functions Architecture
An ML function wraps user code or an ML service in a runtime/SaaS layer (e.g. Spark, Dask, Horovod, Nuclio, ...) and, as part of an ML pipeline, connects its inputs and outputs to data/feature stores, secrets, artifacts and models, and operations.
14. Serverless Spark ML Function Example
https://github.com/mlrun/mlrun/blob/master/examples/mlrun_sparkk8s.ipynb
15. Automating the Development & Tracking Workflow
1. Write and test locally.
2. Specify the runtime configuration (image, dependencies, CPU/GPU/memory, data, volumes, ...).
3. Run/scale on the cluster.
4. Build (if needed).
5. Document & publish.
6. Run in a pipeline, using the published functions.
Throughout: track experiments/runs, functions, and data.
16. Kubeflow + Serverless: Automated ML Pipelines
What is Kubeflow?
▪ Operators for ML frameworks (lifecycle management, scale-out, ...)
▪ Managed notebooks
▪ ML pipeline automation
▪ With serverless, we automate the deployment, execution, scaling, and monitoring of our code.
17. Fraud Prevention Case Study: Payoneer
Payoneer wants to move from fraud detection to prevention and cut time to production:
• 4M global customers
• 200 countries and territories, streaming global commerce
• Understanding illicit patterns of behavior in real time based on 90 different parameters
• Proactively preventing money laundering before it occurs
18. Traditional Fraud-Detection Architecture (Hadoop)
Flow: SQL Server operational database → ETL to the data warehouse every 30 minutes → mirror table → offline processing (SQL) → feature vector → batch prediction using R Server.
Result: 40 minutes to identify a suspicious money-laundering account. Those are 40 precious minutes, since fraud is detected only after the fact, and the process to production is long and complex.
19. Moving to Real-Time Fraud Prevention
Flow: SQL Server operational database → CDC (real-time) → real-time ingestion → online + offline feature store → model training (sklearn) and model inferencing (Nuclio) → analysis queue → block the account!
Result: 12 seconds to detect and prevent fraud, with automated dev-to-production using a serverless approach.