Lessons Learnt from Running
thousands of On-demand Spark
Applications
Ada Sharoni
Software Engineering Architect
@Hunters
• Software Engineering Architect @Hunters for ~3 years
• ML & Big Data
• Fun Fact: I started out as a Hardware Engineer
https://twitter.com/AdaSharoni
https://www.linkedin.com/in/ada-sharoni-47ba26b8/
Hunters
Security Operations Platform
• Help security teams understand the full attack story
• Correlate existing telemetry and sources across surfaces
Attack stages: Infection, Download, Persist, Command & Control, Lateral Movement, Data Exfiltration
• Network
• Cloud
• SaaS
• Endpoint
• Email
• etc
Integrations can be easily added …
Streaming Security Data in Real-Time
Data Lake
● multiple formats
● multiple sources
● Streaming in real-time
Flexible Ingestion
[Diagram components: Data Sources, ingestion, S3, Detection Layer, Auto Investigation, Knowledge Graph]
ETL Logic as Configuration
[Diagram components: sources (Kafka, S3, Azure Blob, …); Ingestion Apps ("Fruit Ninja", one per dataflow); Data Lake; Detection Layer; Auto Investigation; Knowledge Graph. Each ingestion app is configured by an App Config, a Decoder Config, a Transformer Config, … and a Schema.]
Why ETL Logic as Configuration ?
1. Security logs come in all shapes and sizes -> takes time to research!
2. There’s a lot of domain knowledge and expertise involved
3. Engineering teams cannot be a bottleneck
4. Business logic should be easily developed & deployed by the masses
5. SLA to production should be FAST to allow for rapid iteration
Example: Ingestion Decoder Logic
yaml
name: nested-json-csv
params:
  - name: mainParsingColumns
    type: string
  ...
file_format: text
steps:
  - in: raw_input
    out: json_decoded_df
    df_step: JsonDecoder
    params: ...
  - in: raw_input
    out: decoded_df
    df_step: CsvDecoder
    params:
      column: '@jsonMessageColumn@'
      schema: '@mainParsingColumns@'
      delimiter: '@csvDelimiter@'
Similar solution in open source: Metorikku
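The decoder config above refers to parameters with `@name@` tokens. The slides do not show how these are resolved, so the following is only a minimal Python sketch of one way it could work. The function name `resolve_placeholders` and the sample parameter values are assumptions made for illustration, not part of the Hunters code base.

```python
import re

# Matches tokens such as @csvDelimiter@ (token syntax taken from the decoder config).
PLACEHOLDER = re.compile(r"@(\w+)@")


def resolve_placeholders(value, params):
    """Recursively replace @name@ tokens in strings, lists and dicts.

    Hypothetical helper: raises KeyError when a token has no matching param.
    """
    if isinstance(value, str):
        def lookup(match):
            name = match.group(1)
            if name not in params:
                raise KeyError(f"missing decoder param: {name}")
            return str(params[name])
        return PLACEHOLDER.sub(lookup, value)
    if isinstance(value, dict):
        return {key: resolve_placeholders(val, params) for key, val in value.items()}
    if isinstance(value, list):
        return [resolve_placeholders(item, params) for item in value]
    return value


# The CsvDecoder step from the slide, written as a Python dict.
csv_step = {
    "in": "raw_input",
    "out": "decoded_df",
    "df_step": "CsvDecoder",
    "params": {
        "column": "@jsonMessageColumn@",
        "schema": "@mainParsingColumns@",
        "delimiter": "@csvDelimiter@",
    },
}

# Sample values are made up for the example.
resolved = resolve_placeholders(
    csv_step,
    {"jsonMessageColumn": "message", "mainParsingColumns": "ts,src,dst", "csvDelimiter": ","},
)
print(resolved["params"])
```

Failing loudly on a missing parameter is a design choice of this sketch: a config that reaches production with an unresolved token is harder to debug than one rejected at deploy time.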
Example: Ingestion Transformer Logic
yaml
name: aws-cloudtrail
params: ...
mainParsingColumns: '{
  "eventTime": {"type": "timestamp"},
  "awsRegion": {"type": "string"},
  "eventID": {"type": "string"}
  ...
}'
steps:
  - in: decoded_df
    out: fully_flattened_df
    df_step: FlattenObjects
    params: ...
  - in: fully_flattened_df
    out: final_output
    sql: SELECT get_json_object(raw, '$.userIdentity') AS user_identity,
      ...
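Each step in these configs reads a named dataframe (`in`) and registers its result under a new name (`out`). The sketch below illustrates that chaining in plain Python, with lists of dicts standing in for Spark DataFrames. The `run_steps` function, the step registry and the toy `flatten_objects` implementation are assumptions for illustration. The real FlattenObjects step and the `sql` step type are not shown in the slides and are not reproduced here.

```python
def flatten_objects(rows, sep="_"):
    """Toy stand-in for a FlattenObjects step: flattens nested dicts in each row."""
    def flatten(obj, prefix=""):
        flat = {}
        for key, val in obj.items():
            name = f"{prefix}{sep}{key}" if prefix else key
            if isinstance(val, dict):
                flat.update(flatten(val, name))
            else:
                flat[name] = val
        return flat
    return [flatten(row) for row in rows]


# Hypothetical registry mapping df_step names to Python callables.
STEP_REGISTRY = {"FlattenObjects": flatten_objects}


def run_steps(steps, frames, registry=STEP_REGISTRY):
    """Run each step on the frame named by `in` and store the result under `out`."""
    for step in steps:
        if "df_step" not in step:
            # e.g. `sql:` steps, which this sketch does not cover
            raise NotImplementedError("only df_step steps are sketched here")
        source = frames[step["in"]]
        func = registry[step["df_step"]]
        frames[step["out"]] = func(source, **step.get("params", {}))
    return frames


frames = run_steps(
    [{"in": "decoded_df", "out": "fully_flattened_df", "df_step": "FlattenObjects", "params": {}}],
    {"decoded_df": [{"eventID": "1", "userIdentity": {"type": "Root"}}]},
)
print(frames["fully_flattened_df"])
```

Keeping intermediate frames addressable by name is what lets a later step, such as the SQL step in the config, refer to `fully_flattened_df` without knowing how it was produced.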
Cool… But how does logic get to production quickly?
How do transformers get to production?
[Diagram components: analysts, customers, Portal, Transformer .yaml Config, Transformer CICD, Ingestion App Layer]
Tip: extract all functionality to config
Example App Config
yaml
dataflow_id: 3c839ebe-424a-d5f5-efd8-649c54229cd0
data_type: aws-vpc-flow-logs
logic:
  transformer_path: https://xxx/transformers/name/aws-vpc-flow-logs/version/2.2/definition.yaml
  transformer_params: {}
  decoder_path: https://xxx/decoders/name/syslog-with-csv-no-header/version/1.0/definition.yaml
  decoder_params: {"delimiter": "\t"}
inputs:
  data_frame_name: raw_input
  source_type: s3-list
  options:
    maxFileAge: 10d
outputs:
  ...
spark_binary_version: latest
t_shirt_size: XS
spark_conf_overrides: ...
Notice these
Architecture Solution
[Diagram components: analysts and customers (new integration); Portal; ETL Logic CICD (Decoder Config, Transformer Config, …); App Config; Controller Service (or ArgoCD), which deploys the Ingestion Apps ("Fruit Ninja", per dataflow); sources (Kafka, S3, Azure Blob); Schema; Data Lake; Detection Layer; Auto Investigation; Knowledge Graph]
Controller Service (or ArgoCD):
1. Track & redeploy upon app config changes
2. Redeploy upon ETL logic (decoder/transformer) version bump
3. App can run on a specific branch
4. App can run with specific Spark config overrides
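The first responsibility of the controller, redeploying when an app config changes, can be sketched as a comparison of config fingerprints. This is a hedged illustration in Python, not the actual Controller Service or ArgoCD logic. The function names and the idea of storing one fingerprint per `dataflow_id` are assumptions.

```python
import hashlib
import json


def fingerprint(app_config):
    """Stable hash of an app config; key order does not affect the result."""
    canonical = json.dumps(app_config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def plan_redeploys(desired_configs, deployed_fingerprints):
    """Return the dataflow ids whose desired config differs from what is deployed."""
    changed = []
    for dataflow_id, config in desired_configs.items():
        if deployed_fingerprints.get(dataflow_id) != fingerprint(config):
            changed.append(dataflow_id)
    return sorted(changed)


# Two illustrative app configs, using keys that appear in the example app config.
desired = {
    "flow-a": {"data_type": "aws-vpc-flow-logs", "t_shirt_size": "XS"},
    "flow-b": {"data_type": "aws-cloudtrail", "t_shirt_size": "S"},
}
deployed = {flow_id: fingerprint(cfg) for flow_id, cfg in desired.items()}

# Nothing changed yet, so nothing needs a redeploy.
unchanged_plan = plan_redeploys(desired, deployed)

# Bumping the T-shirt size of one flow marks only that flow for redeploy.
desired["flow-b"]["t_shirt_size"] = "M"
changed_plan = plan_redeploys(desired, deployed)
print(unchanged_plan, changed_plan)
```

Because decoder and transformer versions are part of the app config (via their paths), the same comparison would also catch an ETL logic version bump, under the assumption that the version is written into the config before the controller evaluates it.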
Why K8S
This is probably how you envision a Spark cluster …
[Diagram: Spark Cluster with one Master and four Workers]
This is more like how it can look on Kubernetes:
[Diagram: Nodes A to D hosting several Master pods and many Worker pods]
-> and with auto scaling …
[Diagram: the same layout with an additional Node E … hosting extra Workers]
Why Kubernetes?
1. Cost $$$ (vs. operation)
2. Easy to manage infra (after initial effort)
3. Fast Autoscaling (1 kB/hr → 1 TB/hr in minutes)
4. Easy to achieve Isolation (low overhead per app)
a. app per customer
b. app per data flow …
K8S Challenges
1. Learn k8s ecosystem
2. Initial setups
a. Create k8s cluster
b. Create node group
c. Setup auto scaling
d. Setup IAM policy
e. Install Spark Operator
How do I get started?
[Diagram components: API Server, Scheduler, Cluster Autoscaler, Spark Operator, Spark Application (custom resource), Master Pod, Worker Pods]
Setup Auto-Scaling Group
Terraform
resource "aws_autoscaling_group" {
  name = "spark-spots"
  override_instance_types = [
    "r6a.2xlarge", "r6id.2xlarge", "r6in.2xlarge", "r6idn.2xlarge", "r5.2xlarge",
    "r5a.2xlarge", "r5n.2xlarge", "r5d.2xlarge", "r5ad.2xlarge", "r5dn.2xlarge"
  ]
  spot_instance_pools = 8
  min_size            = 0
  max_size            = 1000
  desired_capacity    = 0
  tags = {
    "role" = "<role_name>"
    "k8s.io/cluster-autoscaler/node-template/label/role" = "<role_name>"
    "k8s.io/cluster-autoscaler/node-template/taint/role=<role_name>" = "NoSchedule"
    "k8s.io/cluster-autoscaler/node-template/autoscaling-options/scaledownunneededtime" = "30m"
  }
}
Notes:
• Let the autoscaler do the job automatically
• Taint: repel unrelated pods
• scaledownunneededtime: remove redundant nodes after scaling down ($$$)
Setup Spark Operator
helm / kubectl apply / argoCD
spark-operator:
  sparkJobNamespace: <namespace>
  serviceAccounts:
    spark:
      name: spark
  webhook:
    enable: true
    namespaceSelector: "kubernetes.io/metadata.name=<namespace>"
⚠ Limit the mutating admission webhook to a specific <namespace> so it can set up pod tolerations correctly
Apply SparkApplication yamls
helm / kubectl apply
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: my-spark-app
  namespace: <namespace>   # namespace supervised by the Spark Operator
  labels:
    my_very_useful_label: "filter_by_me"
spec:
  dynamicAllocation:
    enabled: true
    initialExecutors: 2
  driver:
    tolerations:   # toleration: which tainted nodes the pod may be scheduled on
      - key: "role"
        operator: "Equal"
        value: "<role_name>"
        effect: "NoSchedule"
  executor:
    tolerations:
      - key: "role"
        operator: "Equal"
        value: "<role_name>"
        effect: "NoSchedule"
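The taint on the node group and the tolerations on the driver and executor pods work as a pair: the taint repels pods that do not tolerate it. The Python sketch below models the standard Kubernetes matching rule for the `Equal` and `Exists` operators, to show why the Spark pods above can land on the tainted nodes while other pods cannot. It is a simplified model written for this explanation (it treats only `NoSchedule` and `NoExecute` as blocking), not scheduler code, and the role value `spark` is a made-up stand-in for `<role_name>`.

```python
def tolerates(toleration, taint):
    """Return True if a single toleration matches a single taint."""
    effect = toleration.get("effect")
    if effect and effect != taint["effect"]:
        return False  # an empty effect on the toleration matches any effect
    key = toleration.get("key")
    operator = toleration.get("operator", "Equal")  # Equal is the default operator
    if operator == "Exists":
        # With Exists, an empty key tolerates every taint.
        return key is None or key == taint["key"]
    if operator == "Equal":
        return key == taint["key"] and toleration.get("value", "") == taint.get("value", "")
    return False


def can_schedule(pod_tolerations, node_taints):
    """A pod fits a node if every blocking taint is matched by some toleration."""
    blocking = [t for t in node_taints if t["effect"] in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, taint) for tol in pod_tolerations) for taint in blocking)


# Taint as set through the autoscaling group tags, with "spark" as an example role.
spark_node_taints = [{"key": "role", "value": "spark", "effect": "NoSchedule"}]

# Toleration as declared on the driver and executor in the SparkApplication.
spark_pod_tolerations = [
    {"key": "role", "operator": "Equal", "value": "spark", "effect": "NoSchedule"}
]

print(can_schedule(spark_pod_tolerations, spark_node_taints))  # Spark pod on Spark node
print(can_schedule([], spark_node_taints))                     # unrelated pod on Spark node
```

Note that a toleration only permits scheduling on the tainted nodes. Keeping Spark pods off other nodes would need an additional mechanism such as a node selector on the `role` label.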
How can we reduce operational overhead (when running thousands of apps)?
1. Resiliency to shutdowns
Resilience to sudden shutdowns
1. If you have custom source connectors:
a. Save state in (checkpointed) Metadata Log
b. Graceful shutdown
[Diagram components: Spark Custom Source Connector with fetch(), checkpoint and shutdown(); Metadata Log; Source; cache]
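The two points for custom source connectors, keeping state in a checkpointed metadata log and shutting down gracefully, can be illustrated with a small plain-Python sketch. It does not use Spark's data source API. The class `CheckpointedSource`, the JSON file used as a metadata log and the offset-based state are all assumptions chosen to keep the example short.

```python
import json
import os
import signal


class CheckpointedSource:
    """Toy source connector that resumes from the last committed offset."""

    def __init__(self, checkpoint_path):
        self.checkpoint_path = checkpoint_path
        self.offset = self._load_offset()
        self.stopping = False

    def _load_offset(self):
        # Resume from the metadata log if a previous run left one behind.
        if os.path.exists(self.checkpoint_path):
            with open(self.checkpoint_path) as handle:
                return json.load(handle)["offset"]
        return 0

    def fetch(self, records, batch_size):
        """Return the next batch, or nothing once shutdown was requested."""
        if self.stopping:
            return []
        return records[self.offset:self.offset + batch_size]

    def commit(self, count):
        """Persist progress only after the batch was processed downstream."""
        self.offset += count
        tmp_path = self.checkpoint_path + ".tmp"
        with open(tmp_path, "w") as handle:
            json.dump({"offset": self.offset}, handle)
        os.replace(tmp_path, self.checkpoint_path)  # atomic rename, no partial log

    def shutdown(self, *_signal_args):
        """Stop handing out new batches; committed state stays intact."""
        self.stopping = True


def install_sigterm_handler(source):
    """Request a graceful stop when the pod receives SIGTERM (call from the main thread)."""
    signal.signal(signal.SIGTERM, source.shutdown)
```

If the process dies between `fetch` and `commit`, the uncommitted batch is fetched again on restart. That gives at-least-once delivery, so the downstream write has to tolerate duplicates.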
2. Graceful executor decommissioning (Spark 3.1+)
spark.decommission.enabled : "true"
spark.storage.decommission.enabled : "true"
spark.storage.decommission.rddBlocks.enabled : "true"
spark.storage.decommission.shuffleBlocks.enabled : "true"
spark.kubernetes.executor.decommissionLabel : "true"
spark.kubernetes.executor.decommissionLabelValue : "decomissioned"
3. Reliable data write:
a. Staging committer
b. Magic committer
spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a : "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory"
spark.hadoop.fs.s3a.committer.name : "partitioned"
spark.hadoop.fs.s3a.committer.staging.conflict-mode : "append"
spark.hadoop.fs.s3a.committer.staging.unique-filenames : "true"
2. T Shirt Sizing
[Diagram: Kubernetes Nodes A to E … hosting Master and Worker pods]
T Shirt Sizing
Scale-up definitions
Property                    XS      S       M       L       XL
spark.memory.offHeap.size   2g      4g      8g      10g     12g
driver.coreRequest          600m    1600m   1800m   4600m   5600m
driver.cores                1       2       2       5       6
driver.memory               1g      2g      4g      6g      8g
executor.coreRequest        1800m   3600m   3600m   5800m   5800m
executor.cores              2       4       4       6       6
executor.memory             2048m   4096m   5120m   6144m   7168m
executor.memoryOverhead     307m    735m    768m    921m    1075m
Bonus: app gets auto deployed with new T shirt size in just one click!
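A T-shirt size is a named bundle of resource settings, so an app config only needs to carry `t_shirt_size` plus optional overrides. The sketch below encodes the table above in Python and resolves a size into concrete properties. The function name `resources_for` and the override-merging behaviour are assumptions. The slides only show that `t_shirt_size` and `spark_conf_overrides` exist in the app config.

```python
SIZES = ["XS", "S", "M", "L", "XL"]

# Values copied from the scale-up definitions table, one list entry per size.
SIZE_TABLE = {
    "spark.memory.offHeap.size": ["2g", "4g", "8g", "10g", "12g"],
    "driver.coreRequest": ["600m", "1600m", "1800m", "4600m", "5600m"],
    "driver.cores": [1, 2, 2, 5, 6],
    "driver.memory": ["1g", "2g", "4g", "6g", "8g"],
    "executor.coreRequest": ["1800m", "3600m", "3600m", "5800m", "5800m"],
    "executor.cores": [2, 4, 4, 6, 6],
    "executor.memory": ["2048m", "4096m", "5120m", "6144m", "7168m"],
    "executor.memoryOverhead": ["307m", "735m", "768m", "921m", "1075m"],
}


def resources_for(size, overrides=None):
    """Resolve a T-shirt size into properties; per-app overrides win over the table."""
    if size not in SIZES:
        raise ValueError(f"unknown t_shirt_size: {size}")
    index = SIZES.index(size)
    resolved = {prop: values[index] for prop, values in SIZE_TABLE.items()}
    resolved.update(overrides or {})
    return resolved


xs = resources_for("XS")
medium_tuned = resources_for("M", {"executor.memory": "6g"})  # hypothetical override
print(xs["executor.memory"], medium_tuned["executor.memory"])
```

Keeping the sizes in one table means a change of size is a one-field edit in the app config, which fits the one-click redeploy described above.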
Summary
1. Isolation - break down applications per functionality, per tenant
2. “What you see is what you get” - get everything out to config (incl. logic!)
a. Easier management
b. Easier debug & testing in Production
c. Easier deployment
3. Kubernetes is awesome (and cheap!)
What’s next
1. Cluster Autoscaler -> Karpenter
a. Reduce network traffic (cross availability zones)
b. Combine on-demand with spot instances
c. Complex cost saving logic
2. Running on Graviton nodes ($$$)
Questions ?

Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...Suhani Kapoor
 

Recently uploaded (20)

꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAmazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
vip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Book
vip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Bookvip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Book
vip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Book
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
Data Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxData Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptx
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 

Lessons Learnt from Running Thousands of On-demand Spark Applications

  • 1. Lessons Learnt from Running thousands of On-demand Spark Applications Ada Sharoni Software Engineering Architect @Hunters
  • 2. Ada Sharoni • Software Engineering Architect @Hunters for ~3 years • ML & Big Data • Fun fact: I started out as a Hardware Engineer https://twitter.com/AdaSharoni https://www.linkedin.com/in/ada-sharoni-47ba26b8/
  • 3. Hunters Security Operations Platform • Help security teams understand the full attack story • Correlate existing telemetry and sources across surfaces Infection Download Persist Command & Control Lateral Movement Data Exfiltration • Network • Cloud • SaaS • Endpoint • Email • etc
  • 4. Integrations can be easily added …
  • 5. Streaming Security Data in Real-Time Data Lake ● multiple formats ● multiple sources ● Streaming in real-time Flexible Ingestion Data Sources ingestion Detection Layer Auto Investigation Knowledge Graph S3
  • 7. ETL Logic as Configuration Fruit Ninja Fruit Ninja Ingestion Apps Data Lake Detection Layer Auto Investigation Knowledge Graph Kafka S3 Azure Blob …
  • 8. ETL Logic as Configuration Fruit Ninja Fruit Ninja Ingestion Apps (per dataflow) Data Lake Detection Layer Auto Investigation Knowledge Graph Kafka S3 Azure Blob App Config Decoder Config Transformer Config … Schema
  • 9. Why ETL Logic as Configuration?
      1. Security logs come in all shapes and sizes -> takes time to research!
      2. There’s a lot of expertise and domain knowledge involved
      3. Engineering teams cannot be a bottleneck
      4. Business logic should be easily developed & deployed by the masses
      5. SLA to production should be FAST to allow for rapid iteration
  • 10. Example: Ingestion Decoder Logic (yaml)
      name: nested-json-csv
      params:
        - name: mainParsingColumns
          type: string
        ...
      file_format: text
      steps:
        - in: raw_input
          out: json_decoded_df
          df_step: JsonDecoder
          params: ...
        - in: raw_input
          out: decoded_df
          df_step: CsvDecoder
          params:
            column: '@jsonMessageColumn@'
            schema: '@mainParsingColumns@'
            delimiter: '@csvDelimiter@'
      Similar solution in open source: Metorikku
  • 11. Example: Ingestion Transformer Logic (yaml)
      name: aws-cloudtrail
      params:
        ...
        mainParsingColumns: '{
          "eventTime": {"type": "timestamp"},
          "awsRegion": {"type": "string"},
          "eventID": {"type": "string"}
          ...
        }'
      steps:
        - in: decoded_df
          out: fully_flattened_df
          df_step: FlattenObjects
          params: ...
        - in: fully_flattened_df
          out: final_output
          sql: SELECT get_json_object(raw, '$.userIdentity') as user_identity, ...
      Similar solution in open source: Metorikku
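
The decoder and transformer definitions above use `@name@` placeholders (for example `'@csvDelimiter@'`) that are filled from parameters. The slides do not show how the substitution is implemented, so the following is only a minimal Python sketch of one way it could work. The function name `render`, the error behaviour on a missing parameter, and the sample parameter values are all assumptions made for illustration.

```python
import re

# Matches placeholders of the form @name@ as they appear in the slide configs.
PLACEHOLDER = re.compile(r"@(\w+)@")


def render(value, params):
    """Recursively replace @name@ placeholders in strings, lists and dicts.

    Hypothetical helper: raises KeyError when a placeholder has no value,
    so a broken config fails early instead of reaching production.
    """
    if isinstance(value, str):
        def lookup(match):
            name = match.group(1)
            if name not in params:
                raise KeyError(f"missing parameter: {name}")
            return str(params[name])
        return PLACEHOLDER.sub(lookup, value)
    if isinstance(value, list):
        return [render(item, params) for item in value]
    if isinstance(value, dict):
        return {key: render(item, params) for key, item in value.items()}
    return value


# The CsvDecoder step from the decoder slide, as a plain dict.
step = {
    "in": "raw_input",
    "out": "decoded_df",
    "df_step": "CsvDecoder",
    "params": {"column": "@jsonMessageColumn@", "delimiter": "@csvDelimiter@"},
}

# Sample values, invented for this example.
rendered = render(step, {"jsonMessageColumn": "message", "csvDelimiter": ","})
print(rendered["params"])
```

Failing on an unknown placeholder is a design choice in this sketch; a real system might instead fall back to defaults declared in the `params` section.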
  • 12. Cool… But how does logic get quickly to Production?
  • 13. How do transformers get to production?
      [Diagram labels: Transformer .yaml, Config Portal, Transformer CICD, analysts, customers, Ingestion App Layer]
      name: aws-cloudtrail
      params:
        ...
        mainParsingColumns: '{
          "eventTime": {"type": "timestamp"},
          "awsRegion": {"type": "string"},
          "eventID": {"type": "string"}
          ...
        }'
      steps:
        - in: decoded_df
          out: fully_flattened_df
          df_step: FlattenObjects
          params: ...
        - in: fully_flattened_df
          out: final_output
          sql: SELECT ...
  • 15. Integrations can be easily added …
  • 16. Example App Config (yaml)
      dataflow_id: 3c839ebe-424a-d5f5-efd8-649c54229cd0
      data_type: aws-vpc-flow-logs
      logic:
        transformer_path: https://xxx/transformers/name/aws-vpc-flow-logs/version/2.2/definition.yaml
        transformer_params: {}
        decoder_path: https://xxx/decoders/name/syslog-with-csv-no-header/version/1.0/definition.yaml
        decoder_params: {"delimiter": "t"}
      inputs:
        data_frame_name: raw_input
        source_type: s3-list
        options:
          maxFileAge: 10d
      outputs: ...
      spark_binary_version: latest
      t_shirt_size: XS
      spark_conf_overrides: ...
      Notice these
  • 17. ETL Logic as Configuration Fruit Ninja Fruit Ninja Ingestion Apps (per dataflow) Data Lake Detection Layer Auto Investigation Knowledge Graph Kafka S3 Azure Blob App Config Decoder Config Transformer Config … Schema
  • 18. Architecture Solution
      [Diagram labels: Controller Service (or ArgoCD), App Config, deploy, Fruit Ninja Ingestion Apps (per dataflow), Data Lake, new integration, analysts, customers, Detection Layer, Auto Investigation, Knowledge Graph, Kafka, S3, Azure Blob, Portal, ETL Logic CICD, Decoder Config, Transformer Config, Schema]
      Controller Service:
      1. Track & redeploy upon app config changes
      2. Redeploy upon ETL logic (decoder/transformer) version bump
      3. App can run on a specific branch
      4. App can run with specific Spark config overrides
  • 20. This is probably how you envision a Spark cluster …
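
The first two controller duties (redeploy on app config changes, redeploy on a decoder/transformer version bump) can both be reduced to "the desired config no longer matches what is deployed". The talk does not describe the controller's internals, so this is a hedged Python sketch of that idea using a content fingerprint. The names `fingerprint` and `reconcile`, the `transformer_version` field, and the sample dataflow ids are hypothetical.

```python
import hashlib
import json


def fingerprint(app_config):
    """Stable hash of an app config; key order does not affect the result."""
    canonical = json.dumps(app_config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def reconcile(desired, deployed):
    """Compare desired configs against fingerprints of what is running.

    desired:  dataflow_id -> app config dict
    deployed: dataflow_id -> fingerprint of the config last deployed
    Returns (to_deploy, to_delete) as sorted lists of dataflow ids.
    """
    to_deploy = sorted(
        dataflow_id
        for dataflow_id, config in desired.items()
        if deployed.get(dataflow_id) != fingerprint(config)
    )
    to_delete = sorted(set(deployed) - set(desired))
    return to_deploy, to_delete


# Hypothetical config: the version field stands in for the versioned
# transformer path shown in the app config slide.
current = {
    "data_type": "aws-vpc-flow-logs",
    "transformer_version": "2.2",
    "t_shirt_size": "XS",
}
deployed_state = {"flow-a": fingerprint(current)}

# A version bump changes the fingerprint, so the app is scheduled for redeploy.
bumped = dict(current, transformer_version="2.3")
print(reconcile({"flow-a": bumped}, deployed_state))
```

Because the ETL logic version is part of the fingerprinted config, a version bump and a plain config edit take the same code path, which keeps the controller loop small.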
  • 21. Master Worker Worker Worker Worker Spark Cluster
  • 22. This is more like it on Kubernetes
  • 23. -> This is how it can look on Kubernetes Master Node D Node C Worker Node B Master Master Node A Worker Worker Worker Worker Worker Worker Worker Worker Worker Master Worker Worker Worker Worker Worker Worker Worker Worker
  • 24. -> and with auto scaling … Master Node D Node C Worker Node B Master Master Node A Worker Worker Worker Worker Worker Worker Worker Worker Worker Master Worker Worker Worker Worker Worker Worker Worker Worker Node E … Worker Worker
  • 25. Why Kubernetes? 1. Cost $$$ (vs. operation) 2. Easy to manage infra (after initial effort) 3. Fast Autoscaling (1 kB/hr → 1 TB/hr in minutes) 4. Easy to achieve Isolation (low overhead per app) a. app per customer b. app per data flow …
  • 26. K8S Challenges 1. Learn k8s ecosystem 2. Initial setups a. Create k8s cluster b. Create node group c. Setup auto scaling d. Setup IAM policy e. Install Spark Operator
  • 27. How do I get started?
  • 28. API Server Spark Operator Master Pod Worker Pod Spark Operator Spark Operator Spark Application Worker Pod Worker Pod Cluster Autoscaler Spark Application (custom resource) Scheduler
  • 29. Setup Auto-Scaling Group (Terraform)
      resource "aws_autoscaling_group" {
        name = "spark-spots"
        override_instance_types = [
          "r6a.2xlarge", "r6id.2xlarge", "r6in.2xlarge", "r6idn.2xlarge",
          "r5.2xlarge", "r5a.2xlarge", "r5n.2xlarge", "r5d.2xlarge",
          "r5ad.2xlarge", "r5dn.2xlarge"
        ]
        spot_instance_pools = 8
        min_size = 0
        max_size = 1000
        desired_capacity = 0
        tags = {
          "role" = "<role_name>"
          "k8s.io/cluster-autoscaler/node-template/label/role" = "<role_name>"
          "k8s.io/cluster-autoscaler/node-template/taint/role=<role_name>" = "NoSchedule"
          "k8s.io/cluster-autoscaler/node-template/autoscaling-options/scaledownunneededtime" = "30m"
        }
      }
      Slide callouts:
      - let the autoscaler do the job automatically
      - Taint: repel unrelated pods
      - remove redundant nodes after descaling ($$$)
  • 30. Setup Spark Operator (helm / kubectl apply / argoCD)
      Spark-operator:
        sparkJobNamespace: <namespace>
        serviceAccounts:
          spark:
            name: spark
        webhook:
          enable: true
          namespaceSelector: "kubernetes.io/metadata.name=<namespace>"
      ⚠ Limit the mutating admission webhook to the specific <namespace> so it can set up pod tolerations correctly
  • 31. Apply SparkApplication yamls (helm / kubectl apply)
      apiVersion: "sparkoperator.k8s.io/v1beta2"
      kind: SparkApplication
      metadata:
        name: my-spark-app
        namespace: <namespace>    # namespace supervised by the Spark Operator
        labels:
          my_very_useful_label: "filter_by_me"
      spec:
        dynamicAllocation:
          enabled: true
          initialExecutors: 2
        driver:
          tolerations:            # Toleration: which nodes the pod can run on
            - key: "role"
              operator: "Equal"
              value: "<role_name>"
              effect: "NoSchedule"
        executor:
          tolerations:
            - key: "role"
              operator: "Equal"
              value: "<role_name>"
              effect: "NoSchedule"
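
When thousands of apps are deployed, manifests like the one above are typically generated rather than hand-written. The sketch below builds only the fields shown on the slide as a Python dict and prints it as JSON (which `kubectl apply` also accepts). It is not a complete SparkApplication: a real one also needs fields such as the application type, image and main file, which the slide omits. The function names and the sample namespace and role values are assumptions.

```python
import json


def toleration(role_name):
    """Toleration matching the role=<role_name>:NoSchedule taint on the spot nodes."""
    return {
        "key": "role",
        "operator": "Equal",
        "value": role_name,
        "effect": "NoSchedule",
    }


def spark_application(name, namespace, role_name, initial_executors=2):
    """Partial SparkApplication manifest with the fields shown on the slide."""
    return {
        "apiVersion": "sparkoperator.k8s.io/v1beta2",
        "kind": "SparkApplication",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "dynamicAllocation": {
                "enabled": True,
                "initialExecutors": initial_executors,
            },
            # Driver and executor both need the toleration, otherwise their
            # pods cannot be scheduled onto the tainted node group.
            "driver": {"tolerations": [toleration(role_name)]},
            "executor": {"tolerations": [toleration(role_name)]},
        },
    }


# Sample values, invented for this example.
manifest = spark_application("my-spark-app", "spark-apps", "spark-spots")
print(json.dumps(manifest, indent=2))
```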
  • 32. How can we reduce operational overhead ? (when running thousands of apps)
  • 33. 1. Resiliency to shutdowns
  • 34. Resilience to sudden shutdowns
      1. If you have custom source connectors:
         a. Save state in a (checkpointed) Metadata Log
         b. Graceful shutdown
      [Diagram labels: Spark, Custom Source Connector, fetch(), checkpoint, Metadata Log, Source, cache, shutdown()]
  • 35. Resilience to sudden shutdowns
      2. Graceful executor decommissioning (Spark >3.1)
         spark.decommission.enabled: "true"
         spark.storage.decommission.enabled: "true"
         spark.storage.decommission.rddBlocks.enabled: "true"
         spark.storage.decommission.shuffleBlocks.enabled: "true"
         spark.kubernetes.executor.decommissionLabel: "true"
         spark.kubernetes.executor.decommissionLabelValue: "decomissioned"
  • 36. Resilience to sudden shutdowns
      3. Reliable data write:
         a. Staging committer
         b. Magic committer
         spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a: "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory"
         spark.hadoop.fs.s3a.committer.name: "partitioned"
         spark.hadoop.fs.s3a.committer.staging.conflict-mode: "append"
         spark.hadoop.fs.s3a.committer.staging.unique-filenames: "true"
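
The connector pattern on this slide (fetch, checkpoint to a metadata log, graceful shutdown) can be illustrated without Spark. The following Python sketch is not the Spark data source API and not the speaker's implementation; it only shows the state handling, assuming a single integer offset as the state and a local JSON file as the metadata log. All class and method names are hypothetical.

```python
import json
import os
import tempfile


class MetadataLog:
    """Tiny stand-in for a checkpointed metadata log: one JSON file."""

    def __init__(self, path):
        self.path = path

    def load(self):
        if not os.path.exists(self.path):
            return {"offset": 0}
        with open(self.path, "r", encoding="utf-8") as handle:
            return json.load(handle)

    def commit(self, state):
        # Write to a temp file, then rename over the old log, so a sudden
        # shutdown never leaves a half-written state file behind.
        directory = os.path.dirname(self.path) or "."
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(state, handle)
        os.replace(tmp_path, self.path)


class CheckpointedSource:
    """Source that resumes from the last committed offset after a restart."""

    def __init__(self, records, log):
        self.records = records
        self.log = log
        self.offset = log.load()["offset"]
        self.stopping = False

    def fetch(self, batch_size):
        # After shutdown() no new work is handed out.
        if self.stopping:
            return []
        return self.records[self.offset:self.offset + batch_size]

    def checkpoint(self, batch):
        # Called only after the batch was processed successfully.
        self.offset += len(batch)
        self.log.commit({"offset": self.offset})

    def shutdown(self):
        # Graceful shutdown: stop fetching and persist the current state.
        self.stopping = True
        self.log.commit({"offset": self.offset})


with tempfile.TemporaryDirectory() as workdir:
    log = MetadataLog(os.path.join(workdir, "offsets.json"))
    source = CheckpointedSource(["a", "b", "c", "d"], log)
    batch = source.fetch(2)
    source.checkpoint(batch)
    # A replacement instance (for example after a spot interruption)
    # continues from the committed offset instead of starting over.
    restarted = CheckpointedSource(["a", "b", "c", "d"], log)
    print(restarted.fetch(2))
```

A batch that was fetched but never checkpointed is fetched again after a restart, so this sketch gives at-least-once delivery; downstream writes need to tolerate replays.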
  • 37. 2. T Shirt Sizing
  • 38. Master Node D Node C Worker Node B Master Master Node A Worker Worker Worker Worker Worker Worker Worker Worker Worker Master Worker Worker Worker Worker Worker Worker Worker Worker Node E … Worker Worker Worker Worker Worker T Shirt Sizing
  • 39. T Shirt Sizing: scale-up definitions
      Property                    XS      S       M       L       XL
      spark.memory.offHeap.size   2g      4g      8g      10g     12g
      driver.coreRequest          600m    1600m   1800m   4600m   5600m
      driver.cores                1       2       2       5       6
      driver.memory               1g      2g      4g      6g      8g
      executor.coreRequest        1800m   3600m   3600m   5800m   5800m
      executor.cores              2       4       4       6       6
      executor.memory             2048m   4096m   5120m   6144m   7168m
      executor.memoryOverhead     307m    735m    768m    921m    1075m
      Bonus: the app gets auto-deployed with a new T-shirt size in just one click!
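
A T-shirt size is a named bundle of resource settings, and the app config can also carry `spark_conf_overrides`. The slides do not say how the two are combined, so this Python sketch assumes the simplest rule: start from the size's values and let per-app overrides win. Only a subset of the properties from the table is included to keep it short, and the function name is hypothetical.

```python
# Values copied from the scale-up table (subset of properties).
T_SHIRT_SIZES = {
    "XS": {"driver.cores": "1", "driver.memory": "1g",
           "executor.cores": "2", "executor.memory": "2048m",
           "executor.memoryOverhead": "307m"},
    "S": {"driver.cores": "2", "driver.memory": "2g",
          "executor.cores": "4", "executor.memory": "4096m",
          "executor.memoryOverhead": "735m"},
    "M": {"driver.cores": "2", "driver.memory": "4g",
          "executor.cores": "4", "executor.memory": "5120m",
          "executor.memoryOverhead": "768m"},
    "L": {"driver.cores": "5", "driver.memory": "6g",
          "executor.cores": "6", "executor.memory": "6144m",
          "executor.memoryOverhead": "921m"},
    "XL": {"driver.cores": "6", "driver.memory": "8g",
           "executor.cores": "6", "executor.memory": "7168m",
           "executor.memoryOverhead": "1075m"},
}


def resolve_spark_conf(size, overrides=None):
    """Return the settings for a size, with per-app overrides applied last.

    The precedence (overrides win over the size defaults) is an assumption
    of this sketch, not something stated in the talk.
    """
    if size not in T_SHIRT_SIZES:
        raise ValueError(f"unknown T-shirt size: {size!r}")
    conf = dict(T_SHIRT_SIZES[size])
    conf.update(overrides or {})
    return conf


# Scaling an app up is then a one-field change in its app config.
print(resolve_spark_conf("XS"))
print(resolve_spark_conf("M", {"executor.memory": "6g"}))
```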
  • 40. Example App Config (yaml)
      dataflow_id: 3c839ebe-424a-d5f5-efd8-649c54229cd0
      data_type: aws-vpc-flow-logs
      logic:
        transformer_path: https://xxx/transformers/name/aws-vpc-flow-logs/version/2.2/definition.yaml
        transformer_params: {}
        decoder_path: https://xxx/decoders/name/syslog-with-csv-no-header/version/1.0/definition.yaml
        decoder_params: {"delimiter": "t"}
      inputs:
        data_frame_name: raw_input
        source_type: s3-list
        options:
          maxFileAge: 10d
      outputs: ...
      spark_binary_version: latest
      t_shirt_size: XS
      spark_conf_overrides: ...
      Notice these
  • 43. Summary
      1. Isolation: break down applications per functionality, per tenant
      2. “What you see is what you get”: get everything out to config (incl. logic!)
         a. Easier management
         b. Easier debugging & testing in Production
         c. Easier deployment
      3. Kubernetes is awesome (and cheap!)
  • 45. What's next
      1. Cluster Autoscaler -> Karpenter
         a. Reduce network traffic (cross availability zones)
         b. Combine on-demand with spot instances
         c. Complex cost saving logic
      2. Running on Graviton nodes ($$$)