Lessons Learnt from Running Thousands of On-demand Spark Applications

Ada Sharoni
Software Engineering Architect
@Hunters

Ada Sharoni
• Software Engineering Architect @Hunters (~3 years)
• ML & Big Data
• Fun Fact: I started out as a Hardware Engineer
https://twitter.com/AdaSharoni
https://www.linkedin.com/in/ada-sharoni-47ba26b8/

Hunters
Security Operations Platform
• Help security teams understand the full attack story
• Correlate existing telemetry and sources across surfaces
Attack stages: Infection, Download, Persist, Command & Control, Lateral Movement, Data Exfiltration
• Network
• Cloud
• SaaS
• Endpoint
• Email
• etc.

Integrations can be easily added …

Streaming Security Data in Real-Time
Flexible Ingestion
● multiple formats
● multiple sources
● streaming in real-time
[Diagram components: Data Sources; ingestion; Data Lake (S3); Detection Layer; Auto Investigation; Knowledge Graph]

ETL Logic as Configuration
[Diagram components: Kafka, S3, Azure Blob, …; Fruit Ninja Ingestion Apps; Data Lake; Detection Layer; Auto Investigation; Knowledge Graph]

ETL Logic as Configuration
[Diagram components: Kafka, S3, Azure Blob; Fruit Ninja Ingestion Apps (per dataflow); App Config; Decoder Config; Transformer Config; …; Schema; Data Lake; Detection Layer; Auto Investigation; Knowledge Graph]

Why ETL Logic as Configuration?
1. Security logs come in all shapes and sizes -> takes time to research!
2. There's a lot of domain expertise and knowledge involved
3. Engineering teams cannot be a bottleneck
4. Business logic should be easily developed & deployed by the masses
5. SLA to production should be FAST to allow for rapid iteration

Example: Ingestion Decoder Logic
yaml
name: nested-json-csv
params:
  - name: mainParsingColumns
    type: string
  ...
file_format: text
steps:
  - in: raw_input
    out: json_decoded_df
    df_step: JsonDecoder
    params: ...
  - in: raw_input
    out: decoded_df
    df_step: CsvDecoder
    params:
      column: '@jsonMessageColumn@'
      schema: '@mainParsingColumns@'
      delimiter: '@csvDelimiter@'
Similar solution in open source: Metorikku
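The decoder definition above refers to parameters through '@name@' placeholders. A minimal sketch of how such placeholders could be resolved before a step runs is shown below. The function name `resolve_params`, the regular expression, and the sample parameter values are assumptions made for illustration; the slides do not show how the real ingestion app performs the substitution.

```python
import re

# Matches placeholders of the form @paramName@ (format taken from the slide).
PLACEHOLDER = re.compile(r"@(\w+)@")


def resolve_params(node, params):
    """Recursively replace '@name@' placeholders in strings with values from params."""
    if isinstance(node, dict):
        return {key: resolve_params(value, params) for key, value in node.items()}
    if isinstance(node, list):
        return [resolve_params(value, params) for value in node]
    if isinstance(node, str):
        def substitute(match):
            name = match.group(1)
            if name not in params:
                raise KeyError(f"missing decoder param: {name}")
            return str(params[name])
        return PLACEHOLDER.sub(substitute, node)
    return node  # numbers, booleans and None pass through unchanged


# The CsvDecoder step from the slide, expressed as a Python dict.
decoder_step = {
    "in": "raw_input",
    "out": "decoded_df",
    "df_step": "CsvDecoder",
    "params": {
        "column": "@jsonMessageColumn@",
        "schema": "@mainParsingColumns@",
        "delimiter": "@csvDelimiter@",
    },
}

# Hypothetical parameter values, as an app config might supply them.
resolved = resolve_params(decoder_step, {
    "jsonMessageColumn": "message",
    "mainParsingColumns": "ts,src,dst",
    "csvDelimiter": ",",
})
print(resolved["params"])
```

Failing loudly on a missing parameter, rather than leaving the placeholder in place, is one possible design choice: a misconfigured dataflow then fails at startup instead of producing wrongly parsed data.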

Example: Ingestion Transformer Logic
yaml
name: aws-cloudtrail
params: ...
mainParsingColumns: '{
    "eventTime": {"type": "timestamp"},
    "awsRegion": {"type": "string"},
    "eventID": {"type": "string"}
    ...
  }'
steps:
  - in: decoded_df
    out: fully_flattened_df
    df_step: FlattenObjects
    params: ...
  - in: fully_flattened_df
    out: final_output
    sql: SELECT get_json_object(raw, '$.userIdentity') as user_identity,
      ...
Similar solution in open source: Metorikku
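Each step in the transformer above reads a named frame (`in`) and writes a named frame (`out`). The sketch below illustrates that chaining in plain Python, using lists of dicts in place of Spark DataFrames. The `run_steps` function, the registry, and the `Select` step (standing in for the slide's `sql:` step) are hypothetical; only the `in` / `out` / `df_step` keys and the `FlattenObjects` name come from the slides.

```python
def run_steps(steps, frames, registry):
    """Run steps in order; each reads frames[step['in']] and writes frames[step['out']]."""
    for step in steps:
        source = frames[step["in"]]
        step_fn = registry[step["df_step"]]
        frames[step["out"]] = step_fn(source, **step.get("params", {}))
    return frames


def flatten_objects(rows, sep="_"):
    """Stand-in for FlattenObjects: flatten nested dicts into prefixed columns."""
    def flatten(row, prefix=""):
        flat = {}
        for key, value in row.items():
            name = f"{prefix}{sep}{key}" if prefix else key
            if isinstance(value, dict):
                flat.update(flatten(value, name))
            else:
                flat[name] = value
        return flat
    return [flatten(row) for row in rows]


def select_columns(rows, columns):
    """Stand-in for a SQL SELECT: keep only the requested columns."""
    return [{column: row.get(column) for column in columns} for row in rows]


registry = {"FlattenObjects": flatten_objects, "Select": select_columns}

steps = [
    {"in": "decoded_df", "out": "fully_flattened_df", "df_step": "FlattenObjects"},
    {"in": "fully_flattened_df", "out": "final_output", "df_step": "Select",
     "params": {"columns": ["eventID", "userIdentity_type"]}},
]

frames = run_steps(steps, {
    "decoded_df": [
        {"eventID": "e1", "awsRegion": "us-east-1", "userIdentity": {"type": "IAMUser"}},
    ],
}, registry)
print(frames["final_output"])
```

Because intermediate frames stay in the `frames` mapping under their `out` names, a later step can read any earlier output, not just the previous one.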

Cool… But how does logic get to production quickly?

How do transformers get to production?
[Diagram components: analysts and customers; Portal; Transformer .yaml Config; Transformer CICD; Ingestion App Layer]
name: aws-cloudtrail
params: ...
mainParsingColumns: '{
    "eventTime": {"type": "timestamp"},
    "awsRegion": {"type": "string"},
    "eventID": {"type": "string"}
    ...
  }'
steps:
  - in: decoded_df
    out: fully_flattened_df
    df_step: FlattenObjects
    params: ...
  - in: fully_flattened_df
    out: final_output
    sql: SELECT ...

Tip: extract all functionality to config

Example App Config
yaml
dataflow_id: 3c839ebe-424a-d5f5-efd8-649c54229cd0
data_type: aws-vpc-flow-logs
logic:
  transformer_path: https://xxx/transformers/name/aws-vpc-flow-logs/version/2.2/definition.yaml
  transformer_params: {}
  decoder_path: https://xxx/decoders/name/syslog-with-csv-no-header/version/1.0/definition.yaml
  decoder_params: {"delimiter": "t"}
inputs:
  data_frame_name: raw_input
  source_type: s3-list
  options:
    maxFileAge: 10d
outputs:
  ...
# Notice these:
spark_binary_version: latest
t_shirt_size: XS
spark_conf_overrides: ...

Architecture Solution
[Diagram components: new integration; analysts and customers; Portal; ETL Logic CICD; Decoder Config, Transformer Config, …; Schema; App Config; Controller Service (deploy); Kafka, S3, Azure Blob; Fruit Ninja Ingestion Apps (per dataflow); Data Lake; Detection Layer; Auto Investigation; Knowledge Graph]
Controller Service (or ArgoCD):
1. Track & redeploy upon app config changes
2. Redeploy upon ETL logic (decoder/transformer) version bump
3. App can run on specific branch
4. App can run with specific Spark config overrides
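The redeploy triggers listed for the Controller Service (app config changes and ETL logic version bumps) can be illustrated by comparing a fingerprint of what is deployed with a fingerprint of what is desired. This is only a sketch of one possible approach: the function names and the idea of hashing the config are assumptions, not a description of the actual Controller Service.

```python
import hashlib
import json


def deployment_fingerprint(app_config, logic_versions):
    """Hash the app config together with the decoder/transformer versions.

    sort_keys=True makes the hash independent of key order, so only real
    changes in content produce a different fingerprint.
    """
    payload = json.dumps({"app": app_config, "logic": logic_versions}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def needs_redeploy(deployed_fingerprint, app_config, logic_versions):
    """True when the desired state differs from what is currently deployed."""
    return deployed_fingerprint != deployment_fingerprint(app_config, logic_versions)


# Field names follow the example app config; the values are illustrative.
app_config = {"data_type": "aws-vpc-flow-logs", "t_shirt_size": "XS"}
logic_versions = {"transformer": "2.2", "decoder": "1.0"}
deployed = deployment_fingerprint(app_config, logic_versions)

unchanged = needs_redeploy(deployed, dict(reversed(list(app_config.items()))), logic_versions)
version_bump = needs_redeploy(deployed, app_config, {"transformer": "2.3", "decoder": "1.0"})
resized = needs_redeploy(deployed, {**app_config, "t_shirt_size": "S"}, logic_versions)
print(unchanged, version_bump, resized)
```

A GitOps tool such as ArgoCD, mentioned on the slide as an alternative, achieves a similar effect by diffing the desired manifests against the live cluster state.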

This is probably how you envision a Spark cluster …

Master
Worker Worker Worker Worker
Spark Cluster

This is more like it on Kubernetes

-> This is how it can look on Kubernetes
[Diagram: four nodes (Node A to Node D) with Master and Worker pods spread across them]

-> and with auto scaling …
[Diagram: the same four nodes, plus a new Node E … hosting additional Worker pods]

Why Kubernetes?
1. Cost $$$ (vs. operation)
2. Easy to manage infra (after initial effort)
3. Fast Autoscaling (1 kB/hr → 1 TB/hr in minutes)
4. Easy to achieve Isolation (low overhead per app)
a. app per customer
b. app per data flow …

K8S Challenges
1. Learn the k8s ecosystem
2. Initial setup
a. Create k8s cluster
b. Create node group
c. Setup auto scaling
d. Setup IAM policy
e. Install Spark Operator

[Diagram components: API Server; Scheduler; Cluster Autoscaler; Spark Operator; Spark Application (custom resource); Master Pod; Worker Pods]

Setup Auto-Scaling Group
Terraform
# Resource label "spark_spots" added so the block parses (assumption; the slide omits it).
resource "aws_autoscaling_group" "spark_spots" {
  name = "spark-spots"
  override_instance_types = [
    "r6a.2xlarge", "r6id.2xlarge", "r6in.2xlarge", "r6idn.2xlarge", "r5.2xlarge",
    "r5a.2xlarge", "r5n.2xlarge", "r5d.2xlarge", "r5ad.2xlarge", "r5dn.2xlarge"
  ]
  spot_instance_pools = 8
  # Let the autoscaler do the job automatically
  min_size         = 0
  max_size         = 1000
  desired_capacity = 0
  tags = {
    "role" = "<role_name>"
    "k8s.io/cluster-autoscaler/node-template/label/role" = "<role_name>"
    # Taint: repel unrelated pods
    "k8s.io/cluster-autoscaler/node-template/taint/role=<role_name>" = "NoSchedule"
    # Remove redundant nodes after scaling down ($$$)
    "k8s.io/cluster-autoscaler/node-template/autoscaling-options/scaledownunneededtime" = "30m"
  }
}

Setup Spark Operator
helm / kubectl apply / argoCD
Spark-operator:
  sparkJobNamespace: <namespace>
  serviceAccounts:
    spark:
      name: spark
  webhook:
    enable: true
    # ⚠ Limit the mutating admission webhook to the specific <namespace>
    # so it can set up pod tolerations correctly
    namespaceSelector: "kubernetes.io/metadata.name=<namespace>"

Apply SparkApplication yamls
helm / kubectl apply
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: my-spark-app          # Kubernetes object names cannot contain underscores
  namespace: <namespace>      # namespace supervised by the Spark Operator
  labels:
    my_very_useful_label: "filter_by_me"
spec:
  dynamicAllocation:
    enabled: true
    initialExecutors: 2
  driver:
    # Toleration: which tainted nodes the pod may be scheduled on
    tolerations:
      - key: "role"
        operator: "Equal"
        value: "<role_name>"
        effect: "NoSchedule"
  executor:
    tolerations:
      - key: "role"
        operator: "Equal"
        value: "<role_name>"
        effect: "NoSchedule"
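When thousands of apps share this shape, the manifest is usually generated rather than hand-written. Below is a minimal sketch that builds the fields shown above as a Python dict and prints it as JSON, which `kubectl apply -f` also accepts. The helper names are hypothetical, and the manifest is deliberately partial: a real SparkApplication also needs fields such as the image and main application file, which the slide does not show.

```python
import json


def toleration(role):
    """Toleration matching the 'role' taint placed on the Spark node group."""
    return {"key": "role", "operator": "Equal", "value": role, "effect": "NoSchedule"}


def spark_application(name, namespace, role, labels=None, initial_executors=2):
    """Build only the SparkApplication fields shown on the slide."""
    return {
        "apiVersion": "sparkoperator.k8s.io/v1beta2",
        "kind": "SparkApplication",
        "metadata": {
            "name": name,
            "namespace": namespace,
            "labels": dict(labels or {}),
        },
        "spec": {
            "dynamicAllocation": {"enabled": True, "initialExecutors": initial_executors},
            # Driver and executor pods both need the toleration, otherwise the
            # taint on the Spark nodes repels them.
            "driver": {"tolerations": [toleration(role)]},
            "executor": {"tolerations": [toleration(role)]},
        },
    }


# "spark-jobs" and "spark" are placeholder values for <namespace> and <role_name>.
manifest = spark_application(
    name="my-spark-app",
    namespace="spark-jobs",
    role="spark",
    labels={"my_very_useful_label": "filter_by_me"},
)
print(json.dumps(manifest, indent=2))
```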

How can we reduce operational overhead? (when running thousands of apps)

Resilience to sudden shutdowns
1. If you have custom source connectors:
a. Save state in (checkpointed) Metadata Log
b. Graceful shutdown
[Diagram components: Spark Custom Source Connector with fetch(), checkpoint and shutdown(); Metadata Log; Source cache]

2. Graceful executor decommissioning (Spark >3.1)
spark.decommission.enabled : "true"
spark.storage.decommission.enabled : "true"
spark.storage.decommission.rddBlocks.enabled : "true"
spark.storage.decommission.shuffleBlocks.enabled : "true"
spark.kubernetes.executor.decommissionLabel : "true"
spark.kubernetes.executor.decommissionLabelValue : "decomissioned"

3. Reliable data write:
a. Staging committer
b. Magic committer
spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a : "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory"
spark.hadoop.fs.s3a.committer.name : "partitioned"
spark.hadoop.fs.s3a.committer.staging.conflict-mode : "append"
spark.hadoop.fs.s3a.committer.staging.unique-filenames : "true"
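For item 1 above (custom source connectors), the idea of saving state in a checkpointed metadata log and shutting down gracefully can be sketched as follows. This is plain Python with a local JSON file standing in for the metadata log; the class names, the "seen files" state, and the file layout are assumptions for illustration, not the actual connector.

```python
import json
import os
import tempfile


class MetadataLog:
    """Persists connector state with an atomic write-then-rename."""

    def __init__(self, path):
        self.path = path

    def load(self):
        if not os.path.exists(self.path):
            return {}
        with open(self.path, "r", encoding="utf-8") as handle:
            return json.load(handle)

    def commit(self, state):
        directory = os.path.dirname(self.path) or "."
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(state, handle)
            handle.flush()
            os.fsync(handle.fileno())
        # os.replace is atomic, so a sudden kill never leaves a half-written log.
        os.replace(tmp_path, self.path)


class SourceConnector:
    """Tracks which items were already fetched so a restart does not re-read them."""

    def __init__(self, log):
        self.log = log
        self.state = log.load()
        self.stopping = False

    def fetch(self, batch):
        if self.stopping:
            return []
        seen = set(self.state.get("seen", []))
        fresh = [item for item in batch if item not in seen]
        self.state["seen"] = sorted(seen | set(fresh))
        return fresh

    def checkpoint(self):
        self.log.commit(self.state)

    def shutdown(self):
        """Graceful shutdown: stop fetching, then persist the latest state."""
        self.stopping = True
        self.checkpoint()


workdir = tempfile.mkdtemp()
log = MetadataLog(os.path.join(workdir, "metadata.json"))

first = SourceConnector(log)
batch1 = first.fetch(["a.log", "b.log"])
first.shutdown()  # e.g. triggered when a spot node is reclaimed

second = SourceConnector(log)  # the restarted app resumes from the log
batch2 = second.fetch(["b.log", "c.log"])
print(batch1, batch2)
```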


T-Shirt Sizing
Scale-up definitions
Property                    XS      S       M       L       XL
spark.memory.offHeap.size   2g      4g      8g      10g     12g
driver.coreRequest          600m    1600m   1800m   4600m   5600m
driver.cores                1       2       2       5       6
driver.memory               1g      2g      4g      6g      8g
executor.coreRequest        1800m   3600m   3600m   5800m   5800m
executor.cores              2       4       4       6       6
executor.memory             2048m   4096m   5120m   6144m   7168m
executor.memoryOverhead     307m    735m    768m    921m    1075m
Bonus: app gets auto deployed with new T-shirt size in just one click!
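The sizing table lends itself to a simple lookup keyed by the `t_shirt_size` field of the app config, with per-app overrides applied on top. The sketch below copies the values from the table; the function names, the merge order (overrides win), and the "next size up" helper are assumptions about how such a lookup might work.

```python
SIZE_ORDER = ["XS", "S", "M", "L", "XL"]

# Rows of the table above: property -> values in SIZE_ORDER.
TABLE = {
    "spark.memory.offHeap.size": ["2g", "4g", "8g", "10g", "12g"],
    "driver.coreRequest": ["600m", "1600m", "1800m", "4600m", "5600m"],
    "driver.cores": [1, 2, 2, 5, 6],
    "driver.memory": ["1g", "2g", "4g", "6g", "8g"],
    "executor.coreRequest": ["1800m", "3600m", "3600m", "5800m", "5800m"],
    "executor.cores": [2, 4, 4, 6, 6],
    "executor.memory": ["2048m", "4096m", "5120m", "6144m", "7168m"],
    "executor.memoryOverhead": ["307m", "735m", "768m", "921m", "1075m"],
}


def resources_for(size, overrides=None):
    """Return the resource properties for a T-shirt size, with overrides applied last."""
    if size not in SIZE_ORDER:
        raise ValueError(f"unknown T-shirt size: {size}")
    index = SIZE_ORDER.index(size)
    resolved = {prop: values[index] for prop, values in TABLE.items()}
    resolved.update(overrides or {})
    return resolved


def next_size_up(size):
    """The size a 'one click' scale-up would move to; XL stays XL."""
    index = SIZE_ORDER.index(size)
    return SIZE_ORDER[min(index + 1, len(SIZE_ORDER) - 1)]


xs = resources_for("XS")
bumped = resources_for(next_size_up("XS"))
custom = resources_for("M", overrides={"executor.memory": "6g"})
print(xs["executor.memory"], bumped["executor.memory"], custom["executor.memory"])
```

Keeping the sizes in one table means a resize is a one-field change in the app config, which fits the redeploy-on-config-change flow described earlier.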

Summary
1. Isolation - break down applications per functionality, per tenant
2. “What you see is what you get” - get everything out to config (inc. logic!)
a. Easier management
b. Easier debug & testing in Production
c. Easier deployment
3. Kubernetes is awesome (and cheap!)

What's next
1. Cluster Autoscaler -> Karpenter
a. Reduce network traffic (cross availability zones)
b. Combine on-demand with spot instances
c. Complex cost saving logic
2. Running on Graviton nodes ($$$)
