Scaling Spark
on Kubernetes
Li Gao (Lyft)
Bill Graham (Lyft)
Introduction
Li Gao
Works in the Data Platform team at Lyft, currently leading the Compute Infra
initiatives including Spark on Kubernetes.
Previously at: Salesforce, Fitbit, Groupon, and other startups.
Bill Graham
Engineer/Architect on the Data Platform team at Lyft, currently developing data
ingestion systems.
Previously at Twitter, CBS Interactive, CNET Networks
● Introduction to the Data Landscape at Lyft
● The challenges we face
● How Apache Spark on Kubernetes can help
● Remaining work
Agenda
Data Landscape
● Batch data Ingestion and ETL
● Data Streaming
● ML platforms
● Notebooks and BI tools
● Query and Visualization
● Operational Analytics
● Data Discovery & Lineage
● Workflow orchestration
● Cloud Platforms
The Evolving Batch Compute Architecture
2016-2017: Vendor-based Hadoop
Early 2018: Hive on MR, Vendor Presto
Mid 2018: Hive on Tez + Spark ad hoc
Late 2018: Spark on Vendor GA
Early 2019: Spark on K8s Alpha
Future: Spark on K8s Beta
What batch compute is used for
[Diagram: Events, Ext Data, RDB/KV, and Sys Events feed Ingest Pipelines, which land data in AWS S3; Batch Compute Clusters (backed by the Hive Metastore, HMS) read from and write to S3; Presto, Hive, and BI Tools serve Analysts, Engineers, Scientists, and Services.]
Initial Architecture
Batch Compute Challenges
● 3rd Party vendor dependency issues
● Data ETL expressed solely in SQL
● Complex logic expressed in Python that is hard to express in SQL
● Different dependencies and versions
● Resource load balancing for heterogeneous workloads
3rd Party Vendor Dependencies
● Proprietary patches
● Inconsistent bootstrap
● Release schedule
● Homogeneous environments
● HIPAA Compliance
Is SQL the complete solution?
What about Python functions?
“I want to express my processing logic in Python functions with external geo libraries (e.g., GeoMesa) and interact with Hive tables” --- Lyft data engineer
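As a hedged illustration of that ask, here is a minimal PySpark sketch. Shapely stands in for a JVM-based geo library like GeoMesa, and the Hive table and column names (default.rides, lng, lat) are invented for the example; it assumes a Hive metastore is configured for the Spark session.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType
from shapely.geometry import Point, Polygon  # stand-in for a geo library

spark = (SparkSession.builder
         .appName("geo-enrichment")
         .enableHiveSupport()          # read/write Hive tables via the metastore
         .getOrCreate())

# Illustrative bounding polygon (lng/lat pairs), not real geofence data.
DOWNTOWN = Polygon([(-122.42, 37.77), (-122.39, 37.77),
                    (-122.39, 37.80), (-122.42, 37.80)])

@udf(returnType=BooleanType())
def in_downtown(lng, lat):
    # Python logic that would be awkward to express in pure SQL.
    if lng is None or lat is None:
        return False
    return DOWNTOWN.contains(Point(lng, lat))

rides = spark.table("default.rides")   # existing Hive table (hypothetical)
(rides.withColumn("downtown", in_downtown("lng", "lat"))
      .write.mode("overwrite").saveAsTable("default.rides_enriched"))
```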
How does Apache Spark help?
[Diagram: Spark spans Applications, APIs, Environments, and Data Sources and Data Sinks (e.g., RDB/KV stores).]
What challenges remain?
● Per job custom dependencies
● Handling version requirements (Py3 vs. Py2)
● Still need to run on shared clusters for cost efficiency
How about Dependencies?
[Diagram: per-job dependencies such as RTree libraries, spatial libraries, and data codecs.]
How about different Spark or Hive versions?
● Legacy jobs that require Spark 2.2
● Newer jobs require Spark 2.3 or Spark 2.4
● Hive 2.1 SQL alongside Hive 2.3 (sketch below shows per-job version pinning)
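Containerization addresses both the per-job dependencies and the version pinning above. A hedged sketch: spark.kubernetes.container.image is a stock Spark-on-Kubernetes setting (Spark 2.3+), but the API server URL, registry, image tags, and job names here are placeholders, not Lyft's actual values.

```python
import subprocess

def submit(job_name: str, image: str, main_py: str, namespace: str):
    """Submit one Spark job to K8s with its own container image/dependencies."""
    subprocess.run([
        "spark-submit",
        "--master", "k8s://https://k8s-api.example.com:443",  # placeholder API server
        "--deploy-mode", "cluster",
        "--name", job_name,
        "--conf", f"spark.kubernetes.namespace={namespace}",
        "--conf", f"spark.kubernetes.container.image={image}",
        main_py,
    ], check=True)

# Each job pins its own Spark and Python version via its image.
submit("legacy-etl", "registry.example.com/spark:2.3-py2", "legacy_etl.py", "legacy")
submit("new-etl", "registry.example.com/spark:2.4-py3", "new_etl.py", "batch")
```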
How can Kubernetes help?
● CRD Operators & Controllers
● Pods
● Namespaces
● Ingress & CNI Services
● Declarative Resources
● Deployment & Replicas
● Community
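To ground the CRD/operator item above: a hedged sketch of submitting a job as a SparkApplication custom resource (the CRD from the open-source spark-on-k8s-operator), created with the official Kubernetes Python client. The image, namespace, resource sizes, and file path are illustrative.

```python
from kubernetes import client, config

config.load_kube_config()

spark_app = {
    "apiVersion": "sparkoperator.k8s.io/v1beta1",
    "kind": "SparkApplication",
    "metadata": {"name": "daily-etl", "namespace": "batch"},
    "spec": {
        "type": "Python",
        "mode": "cluster",
        "image": "registry.example.com/spark:2.4-py3",   # placeholder image
        "mainApplicationFile": "local:///opt/jobs/daily_etl.py",
        "driver": {"cores": 1, "memory": "2g"},
        "executor": {"cores": 2, "instances": 10, "memory": "4g"},
    },
}

# The operator watches this resource and launches driver/executor pods.
client.CustomObjectsApi().create_namespaced_custom_object(
    group="sparkoperator.k8s.io",
    version="v1beta1",
    namespace="batch",
    plural="sparkapplications",
    body=spark_app,
)
```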
What are the challenges of running Spark on K8s?
● Spark on k8s is still in its infancy
● Single cluster scaling limit
● CRD and control plane update challenges
● Pod churn and IP address allocations
● ECR container image reliability
Current scale of batch jobs
● PB-scale data lake
● O(1000s) of batch jobs running daily
● ~1000s of EC2 nodes spanning multiple clusters
● ~1000s of workflows running daily
How Lyft scales Spark on K8s
● # of Clusters
● # of Namespaces
● # of Nodes
● # of Pods
● Pod Size
● Pod Churn Rate
● Job:Pod Ratio
● IP Alloc Rate Limit
● ECR Rate Limit
The Evolving Architecture
One vs Many Kubernetes Clusters
Cluster Pool HA Support
[Diagram: Cluster Pool A containing Clusters 1-4.]
● Cluster rotation within a cluster pool
● Automated provisioning of a new cluster, which is (manually) added into rotation
● Throttle at the lower bound while a rotation is in progress (see the sketch below)
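A hedged sketch of the rotation and routing logic described above; the class names, load metric, and lower-bound threshold are hypothetical, not Lyft's implementation.

```python
from dataclasses import dataclass

@dataclass
class Cluster:
    name: str
    active_pods: int = 0
    in_rotation: bool = True       # rotated out for upgrades/replacement

@dataclass
class ClusterPool:
    clusters: list
    lower_bound: int = 2           # throttle if too few clusters remain

    def pick(self) -> Cluster:
        live = [c for c in self.clusters if c.in_rotation]
        if len(live) <= self.lower_bound:
            raise RuntimeError("rotation in progress: throttling new submissions")
        # Route the job to the least-loaded cluster still in rotation.
        return min(live, key=lambda c: c.active_pods)

pool = ClusterPool([
    Cluster("cluster-1", active_pods=1200),
    Cluster("cluster-2", active_pods=800),
    Cluster("cluster-3", in_rotation=False),   # currently being rotated
    Cluster("cluster-4", active_pods=2000),
])
target = pool.pick()   # -> cluster-2, the least-loaded in-rotation cluster
```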
One vs Many Kubernetes Namespaces
[Diagram: Namespaces 1-3, each holding pods across Nodes A-D, mapped to IAM roles (Role 1, Role 2) and maximum pod sizes.]
● In practice, ~3-5K active pods per namespace observed
● Less preemption required when namespaces are isolated by quota
● Different namespaces can map to different IAM roles and sidecar configurations (quota sketch below)
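A hedged sketch of the per-namespace quota isolation above, using the official Kubernetes Python client; the namespace name and quota numbers are illustrative.

```python
from kubernetes import client, config

config.load_kube_config()

quota = client.V1ResourceQuota(
    metadata=client.V1ObjectMeta(name="spark-quota", namespace="batch-ns-1"),
    spec=client.V1ResourceQuotaSpec(hard={
        "pods": "4000",            # stay under the ~3-5K practical ceiling
        "requests.cpu": "8000",
        "requests.memory": "32Ti",
    }),
)
# Jobs in this namespace are capped by the quota, reducing cross-tenant preemption.
client.CoreV1Api().create_namespaced_resource_quota("batch-ns-1", quota)
```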
Shared vs Dedicated Kubernetes Pods
[Diagram: a Job Controller launches a dedicated Spark driver pod and executor pods per job (Jobs 2 and 3 shown with their own driver and executor pods), each with its own dependencies and AWS S3 access; contrasted with shared pods multiplexing Jobs 1-4.]
What about Pod Churn?
Separating DDL from DML to reduce churn
Separating DDL from DML Commands
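A hedged sketch of the DDL/DML split: cheap DDL statements go to a shared, long-lived Hive/metastore path that spins up no new pods, while heavyweight DML launches a dedicated Spark job. The statement classifier and both handlers are hypothetical stubs, not Lyft's router.

```python
DDL_PREFIXES = ("CREATE", "ALTER", "DROP", "TRUNCATE", "MSCK")

def is_ddl(sql: str) -> bool:
    # Naive prefix check; a real router would parse the statement.
    return sql.lstrip().upper().startswith(DDL_PREFIXES)

def run_on_shared_hive(sql: str):
    print(f"[shared hive] {sql}")          # stub: no pod churn

def submit_dedicated_spark(sql: str):
    print(f"[dedicated spark job] {sql}")  # stub: driver + executor pods

def run_statement(sql: str):
    (run_on_shared_hive if is_ddl(sql) else submit_dedicated_spark)(sql)

run_statement("CREATE TABLE IF NOT EXISTS rides_enriched (id BIGINT)")
run_statement("INSERT OVERWRITE TABLE rides_enriched SELECT id FROM rides")
```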
Pod Priority and Preemptions (WIP)
● Priority-based preemption
● Driver pods have higher priority than executor pods
[Diagram: before a new pod request (E5), the cluster runs drivers D1-D2 and executors E1-E4; after scheduling, the lower-priority executor E1 is evicted to make room for E5, while the drivers keep running.]
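A hedged sketch of that priority scheme, creating two PriorityClass objects (scheduling.k8s.io/v1) with the Kubernetes Python client; the class names and values are illustrative, not Lyft's settings.

```python
from kubernetes import client, config

config.load_kube_config()
api = client.SchedulingV1Api()

for name, value, desc in [
    ("spark-driver", 1000, "Spark driver pods: preempted last"),
    ("spark-executor", 100, "Spark executor pods: preemptible"),
]:
    # Higher value = higher priority; the scheduler evicts executors
    # before drivers when room is needed for a new pod.
    api.create_priority_class(client.V1PriorityClass(
        metadata=client.V1ObjectMeta(name=name),
        value=value,
        description=desc,
    ))
```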
What about ECR reliability?
[Diagram: Nodes 1-3, each running pods plus a DaemonSet with Docker-in-Docker that caches container images locally on the node.]
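One hedged way to read the diagram above: a DaemonSet that pre-pulls the Spark image onto every node, so pod startup does not depend on ECR at submit time. The namespace, labels, and image name are illustrative; this is a sketch of the idea, not Lyft's actual DinD setup.

```python
from kubernetes import client, config

config.load_kube_config()

daemonset = client.V1DaemonSet(
    metadata=client.V1ObjectMeta(name="spark-image-prepull",
                                 namespace="kube-system"),
    spec=client.V1DaemonSetSpec(
        selector=client.V1LabelSelector(
            match_labels={"app": "spark-image-prepull"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(
                labels={"app": "spark-image-prepull"}),
            spec=client.V1PodSpec(containers=[client.V1Container(
                name="prepull",
                image="registry.example.com/spark:2.4-py3",  # placeholder
                command=["sleep", "infinity"],  # keeps image in node cache
            )]),
        ),
    ),
)
client.AppsV1Api().create_namespaced_daemon_set("kube-system", daemonset)
```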
Spark Job Config Overlays
Config layers, merged in order with later layers overlaid on earlier ones (see the merge sketch below):
● Cluster Pool Defaults
● Cluster Defaults
● Spark Job User Specified Config
● Cluster and Namespace Overrides
● → Final Spark Job Config
Overlays are applied by the Job Controller and Event Watcher, and by the Spark Operator.
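A minimal merge sketch, assuming a simple flat key space where later layers win; applying user config before cluster/namespace overrides means platform policies cannot be unset by a job. The specific Spark keys and values are illustrative.

```python
def overlay(*layers: dict) -> dict:
    """Merge config layers; later layers take precedence."""
    merged = {}
    for layer in layers:
        merged.update(layer)
    return merged

final_conf = overlay(
    {"spark.executor.memory": "4g"},                      # cluster pool defaults
    {"spark.kubernetes.namespace": "batch"},              # cluster defaults
    {"spark.executor.memory": "8g",                       # user-specified config
     "spark.executor.instances": "50"},
    {"spark.kubernetes.container.image":                  # cluster/ns overrides
     "registry.example.com/spark:2.4-py3"},
)
# final_conf holds the fully resolved Spark job config.
```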
X-Rays of the Architecture - Job Controller
X-Rays of the Architecture - Spark Operator
Monitoring & Logging Toolbox
● Heka
● JMX
Monitoring Example: OOM kill in a namespace
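A hedged sketch of detecting OOM kills like the example above, by scanning pod container statuses in a namespace with the Kubernetes Python client; in practice this would feed a metric or alert rather than print.

```python
from kubernetes import client, config

config.load_kube_config()

def oom_killed_pods(namespace: str):
    """Yield names of pods whose last container exit was an OOM kill."""
    pods = client.CoreV1Api().list_namespaced_pod(namespace)
    for pod in pods.items:
        for status in (pod.status.container_statuses or []):
            term = status.last_state.terminated
            if term and term.reason == "OOMKilled":
                yield pod.metadata.name

for name in oom_killed_pods("batch"):
    print(f"OOMKilled: {name}")   # emit a metric/alert in practice
```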
Automation Toolbox
● Kustomize templates for K8s deploys
● Sidecar injectors
● Secrets injectors
● DaemonSets
● KIAM
Remaining Work
● More intelligent job routing and parameter setting
● Granular cost attribution
● Improved Docker image distribution
● Spark 3.0!
Key Takeaways
● Apache Spark can help unify different batch data compute use cases
● Kubernetes can help solve the dependency and multi-version requirements
using its containerized approach
● Spark on Kubernetes can scale significantly by using a multi-cluster approach
with proper resource isolation and scheduling techniques
● Challenges remain when running Spark on Kubernetes at scale
Community
This effort would not be possible
without the help from the open
source and wider communities:
Thank you
Strata SF 2019
Li Gao, in/ligao101 @ligao
Bill Graham, @billgraham
Please rate this session!
Questions?
We’re Hiring! Apply at www.lyft.com/careers
or email data-recruiting@lyft.com
● Data Engineering: Engineering Manager (San Francisco); Software Engineer (San Francisco, Seattle, & New York City)
● Data Infrastructure: Engineering Manager (San Francisco); Software Engineer (San Francisco & Seattle)
● Experimentation: Software Engineer (San Francisco)
● Streaming: Software Engineer (San Francisco)
● Observability: Software Engineer (San Francisco)
Strata SF 2019
Rate this session via the session page on the conference website or the O'Reilly Events App