SlideShare a Scribd company logo
Prasanth Kothuri, CERN
Piotr Mrowczynski, CERN
Experience of Running Spark on
Kubernetes on OpenStack for High
Energy Physics Workloads
#SAISEco11
#SAISEco11 !2
CERN
• CERN - European Laboratory for Particle Physics
• The place where the Web was born
• Home of the Large Hadron Collider and 4 big
Experiments:
– ATLAS ~ CMS ~ LHCb ~ ALICE
• 22 member states + 6 associate members + world-
wide collaborations
~3000 Members of Personnel
~12,000 users
~1000 MCHF yearly budget
Founded in 1954
Mission: fundamental research in Physics
Answering questions…
18 months of production experience !3
The Large Hadron Collider (LHC)
Largest machine in the world
27km, 6000+ superconducting magnets
Emptiest place in the solar system
High vacuum inside the magnets
Hottest spot in the galaxy
During Lead ion collisions create temperatures 100 000x hotter than the heart of the sun;
Fastest racetrack on Earth
Protons circulate 11245 times/s (99.9999991% the speed of light)
#SAISEco11
CERN Data Centre
!4
• Physics data are aggregated in the CERN Data Centre, where initial
data reconstruction of physics events is performed
• A remote extension of the CERN data centre is hosted in Budapest,
Hungary. It provides the extra computing power required to cover
CERN’s needs.
Meyrin Data Centre Wigner Extension TOTAL
Servers 11 500 3 500 15 000
Processor cores 174 300 56 000 230 300
Disks 61 900 29 700 91 600 (280 PB
capacity)
Tape Cartridges 32 200 (~ 400 PB
capacity)
~ 12.5 PB per
month
~ 2 PB accessed
every day
~ 4 megawatt of
electricity
#SAISEco11
Data at Scale @ CERN
!5
• Physics data – Today we use WLCG to handle it
• Optimised for physics analysis and concurrent access
• ROOT framework - custom software and data format
• Early stage experimental work ongoing to use Spark for
physics analysis
• Infrastructure data
• Accelerators and detector controllers
• Experiments Data catalogues (collisions, files etc.)
• Monitoring of the WLCG and CERN data centres
• Systems logs
#SAISEco11
Apache Spark @ CERN
!6
• Current state-of-the-art
• Spark running on top of YARN/HDFS. Typically
processing happens on the same cluster of machines
as storage (data locality)
• In total ~1850 physical cores and 15 PB capacity
Cluster Name Configuration Software Version
Accelerator logging 20 nodes (Cores 480, Mem - 8 TB, Storage – 5 PB, 96GB in SSD) Spark 2.2.0 – 2.3.1
General Purpose 48 nodes (Cores – 892,Mem – 7.5TB,Storage – 6 PB) Spark 2.2.0 – 2.3.1
Development
cluster
14 nodes (Cores – 196,Mem – 768GB,Storage – 2.15 PB) Spark 2.2.0 – 2.3.1
ATLAS Event Index 18 nodes (Cores – 288,Mem – 912GB,Storage – 1.29 PB) Spark 2.2.0 – 2.3.1
#SAISEco11
Apache Spark @ CERN
!7
• Current state-of-the-art
• Stable and predictable production workloads from our user
communities
• Physical machines allocated – means no resource elasticity,
no isolation of workloads, compute coupled with data storage
• Ad-hoc workloads from physics users interested analyzing
data at scale using external storages (EOS). Physics analysis
limited to the capacity and resources of shared Spark/Hadoop
clusters.
#SAISEco11 !8
New Programming / Analysis Model for Physics data
- Data and Operational Challenges
#SAISEco11
Data and Operational Challenges
!9
• Data volumes expected to grow dramatically with HL-LHC
• Annual growth at 20-30%
• Decrease time-to-physics
• On-demand generation of physics n-tuples
• Resource Elasticity
• Isolation of workloads
• Physics data stored externally in EOS – CERN custom built
data storage
• Usage of industry big-data solutions
#SAISEco11
CMS Data Reduction Facility
!10
On-demand reduction of large datasets based on
complicated user criteria
Input dataset ~ petabytes and requires 1000’s cores
Reproducibility is key in the physics research world!
#SAISEco11
Interactive physics data analysis using notebooks
!11
• To propose a time-efficient
way to perform analysis of
extensive amounts of data
in CERN
• To investigate impact and
usefulness of external
solutions for HEP computing
needs
• To prepare a ready model
for future analyses
performed in (TOTEM)
experiments
#SAISEco11 !12
Addressing the Challenges
#SAISEco11
Literature Study
!13
„Cloud-native is an approach to building and running applications that
exploits the advantages of the cloud computing model” [1]
Companies attempted cloud-native Spark deployments using
Kubernetes Native resource scheduler implementation had experimental
release in Apache Spark in March 2018.
Research from Google defends the hypothesis, that in cluster computing
disk-locality is becoming irrelevant nowadays [2]
Databricks and Accenture Labs benchmarked Spark with data in
external storages (S3/GCS) compared to HDFS. [3][4]
[1] „Kubernetes becomes the first project to graduate from the cloud native computing foundation.”
[2] „Disk-Locality in Datacenter Computing Considered Irrelevant”
[3] Databricks - „Top 5 Reasons for Choosing S3 over HDFS”
[4] Accenture Technology Labs - „Cloud-based Hadoop Deployments: Benefits and Considerations”
#SAISEco11
Separating Compute and Storage for Elastic Big Data
!14
Spark
HDFS
Spark
External Storage
(EOS, S3, GCS
+ Kafka, HDFS)
Data Locality Data External
(network)
#SAISEco11
Separating Compute and Storage for Elastic Big Data
!15
Spark
HDFS
Spark
External Storage
(EOS, S3, GCS
+ Kafka, HDFS)
Data Locality Data External
(network)
• assumes disk bandwidth is higher
than storage throughput over
network
• assumes mass storage throughput
higher than cluster disk bandwidth
#SAISEco11
Separating Compute and Storage for Elastic Big Data
!16
Spark
HDFS
Spark
External Storage
(EOS, S3, GCS
+ Kafka, HDFS)
Data Locality Data External
(network)
• assumes disk bandwidth is higher than
storage throughput over network
• limited elasticity (both for computation and
storage)
• assumes mass storage throughput higher
than cluster disk bandwidth
• allows compute not being bound to
storage (with cloud elasticity)
#SAISEco11
Separating Compute and Storage for Elastic Big Data
!17
Spark
HDFS
Spark
External Storage
(EOS, S3, GCS
+ Kafka, HDFS)
Data Locality Data External
(network)
• assumes disk bandwidth is higher than
storage throughput over network
• limited elasticity (both for computation and
storage)
• avoids high network bandwidth for analysis on
large datasets
• assumes mass storage throughput higher
than cluster disk bandwidth
• allows compute not being bound to storage
(with cloud elasticity)
• assumes network delivers data to CPU faster
than local disks, might generate high traffic
#SAISEco11
Possible solution for storage elasticity
!18
• mass storage services are easier/cheaper to scale
• data stored on disk can be large, and compute nodes can be adjusted to compute needs
#SAISEco11
Possible solution for storage and compute elasticity and
reproducibility
!19
• Openstack CCE (Cloud Container Engine) with Kubernetes
provides compute elasticity and authentication mechanism
(certificates).
#SAISEco11
Possible solution for storage and compute elasticity and
reproducibility
!20
[1] https://github.com/GoogleCloudPlatform/spark-on-k8s-operator
• Openstack CCE (Cloud Container Engine) with Kubernetes
provides compute elasticity and authentication mechanism
(certificates).
• Separation of monolithic services simplifies operational effort
• Just Spark-Submit with Kubernetes only allows to create drivers
and executors (not so friendly to manage).
• Spark K8S Operator bring feature parity known from YARN
(managing Spark Applications) 

• Spark Operator gives idiomatic submission, application restart
policies, cron support, custom volume/config/secret mounts,
pod affinity/anti-affinity
#SAISEco11
Container as a service to automate deployment
!21
#SAISEco11 !22
Evaluating
Spark on Kubernetes
properties and demo
#SAISEco11 !23
Compute Elasticity
Cloud Container Engine
Spark Operator
Deployment 

(long-running operator
for batch)
Cloud Infrastructure
Kubernetes
Resource Manager
openstack coe cluster

Create/Resize/Delete/GetConfiguration
for

Kubernetes clusters
sparkctl 

Submit batch/
stream jobs

to K8S via Operator

(cluster-mode)
Helm/Kubectl 

Create/Delete/Update

Spark Operator to control 

batch/stream jobs
Spark Drivers/
Executors Pods
(containers)
Container
Registry 

Fetch containers
with
required software
spark-submit 

Create driver on behalf

of user (customised by
operator)
#SAISEco11 !24
Workflow Reproducibility
• Kubernetes and use of containers allows Spark developers to build their isolated and reproducible
environments
• Encapsulate software packages, libraries, configurations
Dockerfiles with required software packages,
configurations etc
Cloud Infrastructure
Kubernetes
Resource Manager
Baremetal Infrastructure
Kubernetes
Resource Manager
Docker containers
#SAISEco11 !25
Management of Spark Applications
• Operator watches for create/delete/
update events of SparkApplication
• Spark Operator executes spark-
submit with required configurations
• Data is uploaded/read from
external storage service, and
monitoring is realised with internal/
external monitoring tools
Kubernetes Operator for Apache Spark
https://github.com/GoogleCloudPlatform/spark-on-k8s-operator
#SAISEco11 !26
• Spark Application Controller
handles restarts and
resubmissions.
• Submission Runner executes
spark-submit
• Spark Pod Monitor reports
updates of pods to controller
• Mutating Admission WebHook
handles customisation of Docker
containers and their affinities
Management of Spark Applications
Kubernetes Operator for Apache Spark
https://github.com/GoogleCloudPlatform/spark-on-k8s-operator
#SAISEco11 !27
Persistent Storage for cloud-native
Data over network - available solutions with CERN Openstack Cloud
Persistent Storage
Feature
Network Volume
Mounts
Filesystem
Connectors
Object Storage
Connectors
Monitoring Storage
Example
CephFS [1],
CVMFS [2], other [3]
EOS/XRootD [4],
HDFS
external, no data locality
Ceph S3, GCS [5]
InfluxDB/Grafana,
Stackdriver,
Prometheus
Use-case
Software Packages,
Checkpoints, Events
(History Server)
Data processing,
Events, Checkpoints
Data processing,
Events (History
Server), Checkpoints
Statistics
Spark Transactional
Writes
- Yes
Requires committer
(Directory Committer
Hadoop 3.1) [6]
-
[2] https://clouddocs.web.cern.ch/clouddocs/containers/tutorials/cvmfs.html
[3] https://kubernetes.io/docs/concepts/storage/storage-classes
[1] https://clouddocs.web.cern.ch/clouddocs/containers/tutorials/cephfs.html [4] https://github.com/cern-eos/eos, https://github.com/cerndb/hadoop-xrootd
[5] https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage
[6] http://hadoop.apache.org/docs/r3.1.1/hadoop-aws/tools/hadoop-aws/committers.html
#SAISEco11 !28
Memory management in Kubernetes
Spark and Kubernetes Nodes
• Kubelet monitors memory and disk
available to the Node.
• Detecting MemoryPressure or System
Out Of Memory and reclaiming
resources from running containers
(OOMKilled errors)
#SAISEco11 !29
Memory management in Kubernetes
Spark and Kubernetes Nodes
• Kubelet monitors memory and disk
available to the Node.
• Detecting MemoryPressure or System
Out Of Memory and reclaiming
resources from running containers
(OOMKilled errors)
• MemoryOverheadFactor - memory for
off-heap memory, non-JVM
processes (e.g. in case of Python
higher limit) and processes required for
operation of container.
#SAISEco11 !30
Scaling Spark on Kubernetes
Data Reduction and Dimuon Mass Calculation on 20TB (target is 1 PB dataset)
- 1 VM is 2CPU
and 12GB RAM
- Scaling Test with
60 to 180 VMs
- Load Test with
500 VMs (200
hypervisors)
#SAISEco11 !31
Scaling Spark on Kubernetes
Throughput trend
Job execution
duration trend
Data Reduction and Dimuon Mass Calculation on 20TB (target is 1 PB dataset)
#SAISEco11 !32
Spark Operator Demo
Create cluster
https://youtu.be/vuSLS7-JqQI
#SAISEco11 !33
Conclusions and observations
- Openstack provisions Kubernetes
cluster of 10s of nodes in automated
fashion (Cloud Container Engine, Cloud
Horizontal Autoscalers optional).
- Kubernetes allows isolated and
reproducible Spark, limited operational
effort with Docker containers.
- Spark K8S Operator provides
management of Spark Applications similar
to YARN ecosystem
#SAISEco11 !34
Conclusions and observations
- Data is external, this has advantages but also
drawbacks - do you need data locality and why?
- Compute cluster is managed service, storage can
scale separately from compute or in the cloud.
- Storage interoperability (CephFS, EOS, S3 and GCS
can serve more use-cases than just Hadoop)
- Openstack provisions Kubernetes
cluster of 10s of nodes in automated
fashion (Cloud Container Engine, Cloud
Horizontal Autoscalers optional).
- Kubernetes allows isolated and
reproducible Spark, limited operational
effort with Docker containers.
- Spark K8S Operator provides
management of Spark Applications similar
to YARN ecosystem
#SAISEco11 !35
Conclusions and observations
- Without data locality, network can be a
serious problem/bottleneck (specifically in
case of over-tuning or bugs). Monitoring at
hand.
- Large shuffle writes problematic, assumed
compute VMs and large storage space is
not available. On other hand SSD can be
used cheaply.
- Multi-tenancy still a question.
#SAISEco11 !36
Conclusions and observations
- No particular overhead between K8S VMs and bare-
metal YARN with TPCDS Benchmark on similar
hardware. Production YARN had more spikes than
Cloud K8S due to shared environment.
- Spark/YARN had ~40 nodes fixed, Spark/K8S 500
VMs (~200 nodes) provisioned for short time period
- Without data locality, network can be a
serious problem/bottleneck (specifically in
case of over-tuning or bugs). Monitoring at
hand.
- Large shuffle writes problematic, assumed
compute VMs and large storage space is
not available. On other hand SSD can be
used cheaply.
- Multi-tenancy still a question.
#SAISEco11 !37
Thank you
Questions?

More Related Content

What's hot

Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark Summit
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & Iceberg
Flink Forward
 
Apache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationApache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper Optimization
Databricks
 
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Databricks
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta Lake
Databricks
 
Apache Arrow Flight Overview
Apache Arrow Flight OverviewApache Arrow Flight Overview
Apache Arrow Flight Overview
Jacques Nadeau
 
Apache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
Apache Spark on Kubernetes Anirudh Ramanathan and Tim ChenApache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
Apache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
Databricks
 
Introduction to DataFusion An Embeddable Query Engine Written in Rust
Introduction to DataFusion  An Embeddable Query Engine Written in RustIntroduction to DataFusion  An Embeddable Query Engine Written in Rust
Introduction to DataFusion An Embeddable Query Engine Written in Rust
Andrew Lamb
 
Iceberg: a fast table format for S3
Iceberg: a fast table format for S3Iceberg: a fast table format for S3
Iceberg: a fast table format for S3
DataWorks Summit
 
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Flink Forward
 
Spark Operator—Deploy, Manage and Monitor Spark clusters on Kubernetes
 Spark Operator—Deploy, Manage and Monitor Spark clusters on Kubernetes Spark Operator—Deploy, Manage and Monitor Spark clusters on Kubernetes
Spark Operator—Deploy, Manage and Monitor Spark clusters on Kubernetes
Databricks
 
Using the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production DeploymentUsing the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production Deployment
Flink Forward
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Apache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationApache Spark Core – Practical Optimization
Apache Spark Core – Practical Optimization
Databricks
 
Reliable Performance at Scale with Apache Spark on Kubernetes
Reliable Performance at Scale with Apache Spark on KubernetesReliable Performance at Scale with Apache Spark on Kubernetes
Reliable Performance at Scale with Apache Spark on Kubernetes
Databricks
 
Delta from a Data Engineer's Perspective
Delta from a Data Engineer's PerspectiveDelta from a Data Engineer's Perspective
Delta from a Data Engineer's Perspective
Databricks
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)
Ryan Blue
 
Intro to databricks delta lake
 Intro to databricks delta lake Intro to databricks delta lake
Intro to databricks delta lake
Mykola Zerniuk
 
Why your Spark job is failing
Why your Spark job is failingWhy your Spark job is failing
Why your Spark job is failing
Sandy Ryza
 
Building a Virtual Data Lake with Apache Arrow
Building a Virtual Data Lake with Apache ArrowBuilding a Virtual Data Lake with Apache Arrow
Building a Virtual Data Lake with Apache Arrow
Dremio Corporation
 

What's hot (20)

Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & Iceberg
 
Apache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationApache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper Optimization
 
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta Lake
 
Apache Arrow Flight Overview
Apache Arrow Flight OverviewApache Arrow Flight Overview
Apache Arrow Flight Overview
 
Apache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
Apache Spark on Kubernetes Anirudh Ramanathan and Tim ChenApache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
Apache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
 
Introduction to DataFusion An Embeddable Query Engine Written in Rust
Introduction to DataFusion  An Embeddable Query Engine Written in RustIntroduction to DataFusion  An Embeddable Query Engine Written in Rust
Introduction to DataFusion An Embeddable Query Engine Written in Rust
 
Iceberg: a fast table format for S3
Iceberg: a fast table format for S3Iceberg: a fast table format for S3
Iceberg: a fast table format for S3
 
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
 
Spark Operator—Deploy, Manage and Monitor Spark clusters on Kubernetes
 Spark Operator—Deploy, Manage and Monitor Spark clusters on Kubernetes Spark Operator—Deploy, Manage and Monitor Spark clusters on Kubernetes
Spark Operator—Deploy, Manage and Monitor Spark clusters on Kubernetes
 
Using the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production DeploymentUsing the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production Deployment
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Apache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationApache Spark Core – Practical Optimization
Apache Spark Core – Practical Optimization
 
Reliable Performance at Scale with Apache Spark on Kubernetes
Reliable Performance at Scale with Apache Spark on KubernetesReliable Performance at Scale with Apache Spark on Kubernetes
Reliable Performance at Scale with Apache Spark on Kubernetes
 
Delta from a Data Engineer's Perspective
Delta from a Data Engineer's PerspectiveDelta from a Data Engineer's Perspective
Delta from a Data Engineer's Perspective
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)
 
Intro to databricks delta lake
 Intro to databricks delta lake Intro to databricks delta lake
Intro to databricks delta lake
 
Why your Spark job is failing
Why your Spark job is failingWhy your Spark job is failing
Why your Spark job is failing
 
Building a Virtual Data Lake with Apache Arrow
Building a Virtual Data Lake with Apache ArrowBuilding a Virtual Data Lake with Apache Arrow
Building a Virtual Data Lake with Apache Arrow
 

Similar to Experience of Running Spark on Kubernetes on OpenStack for High Energy Physics Workloads with Prasanth Kothuri Piotr Mrowczynski

Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Tech...
Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Tech...Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Tech...
Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Tech...
Databricks
 
Using Containers and HPC to Solve the Mysteries of the Universe by Deborah Bard
Using Containers and HPC to Solve the Mysteries of the Universe by Deborah BardUsing Containers and HPC to Solve the Mysteries of the Universe by Deborah Bard
Using Containers and HPC to Solve the Mysteries of the Universe by Deborah Bard
Docker, Inc.
 
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...
Databricks
 
How HPC and large-scale data analytics are transforming experimental science
How HPC and large-scale data analytics are transforming experimental scienceHow HPC and large-scale data analytics are transforming experimental science
How HPC and large-scale data analytics are transforming experimental science
inside-BigData.com
 
The OpenStack Cloud at CERN - OpenStack Nordic
The OpenStack Cloud at CERN - OpenStack NordicThe OpenStack Cloud at CERN - OpenStack Nordic
The OpenStack Cloud at CERN - OpenStack Nordic
Tim Bell
 
Cisco: Cassandra adoption on Cisco UCS & OpenStack
Cisco: Cassandra adoption on Cisco UCS & OpenStackCisco: Cassandra adoption on Cisco UCS & OpenStack
Cisco: Cassandra adoption on Cisco UCS & OpenStack
DataStax Academy
 
Teaching Apache Spark Clusters to Manage Their Workers Elastically: Spark Sum...
Teaching Apache Spark Clusters to Manage Their Workers Elastically: Spark Sum...Teaching Apache Spark Clusters to Manage Their Workers Elastically: Spark Sum...
Teaching Apache Spark Clusters to Manage Their Workers Elastically: Spark Sum...
Spark Summit
 
Apache Spark AI Use Case in Telco: Network Quality Analysis and Prediction wi...
Apache Spark AI Use Case in Telco: Network Quality Analysis and Prediction wi...Apache Spark AI Use Case in Telco: Network Quality Analysis and Prediction wi...
Apache Spark AI Use Case in Telco: Network Quality Analysis and Prediction wi...
Databricks
 
European Exascale System Interconnect & Storage
European Exascale System Interconnect & StorageEuropean Exascale System Interconnect & Storage
European Exascale System Interconnect & Storage
inside-BigData.com
 
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov... Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
Databricks
 
Very Large Data Files, Object Stores, and Deep Learning—Lessons Learned While...
Very Large Data Files, Object Stores, and Deep Learning—Lessons Learned While...Very Large Data Files, Object Stores, and Deep Learning—Lessons Learned While...
Very Large Data Files, Object Stores, and Deep Learning—Lessons Learned While...
Databricks
 
Cloud Foundry and OpenStack – Marriage Made in Heaven !
Cloud Foundry and OpenStack – Marriage Made in Heaven !Cloud Foundry and OpenStack – Marriage Made in Heaven !
Cloud Foundry and OpenStack – Marriage Made in Heaven !
Animesh Singh
 
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Databricks
 
AWS re:Invent 2016: Large-Scale, Cloud-Based Analysis of Cancer Genomes: Less...
AWS re:Invent 2016: Large-Scale, Cloud-Based Analysis of Cancer Genomes: Less...AWS re:Invent 2016: Large-Scale, Cloud-Based Analysis of Cancer Genomes: Less...
AWS re:Invent 2016: Large-Scale, Cloud-Based Analysis of Cancer Genomes: Less...
Amazon Web Services
 
NSCC Training Introductory Class
NSCC Training Introductory Class NSCC Training Introductory Class
NSCC Training Introductory Class
National Supercomputing Centre Singapore
 
Cloud Foundry and OpenStack - A Marriage Made in Heaven! (Cloud Foundry Summi...
Cloud Foundry and OpenStack - A Marriage Made in Heaven! (Cloud Foundry Summi...Cloud Foundry and OpenStack - A Marriage Made in Heaven! (Cloud Foundry Summi...
Cloud Foundry and OpenStack - A Marriage Made in Heaven! (Cloud Foundry Summi...
VMware Tanzu
 
Cloud Foundry and OpenStack - A Marriage Made in Heaven! (Cloud Foundry Summi...
Cloud Foundry and OpenStack - A Marriage Made in Heaven! (Cloud Foundry Summi...Cloud Foundry and OpenStack - A Marriage Made in Heaven! (Cloud Foundry Summi...
Cloud Foundry and OpenStack - A Marriage Made in Heaven! (Cloud Foundry Summi...
VMware Tanzu
 
OpenStack at CERN : A 5 year perspective
OpenStack at CERN : A 5 year perspectiveOpenStack at CERN : A 5 year perspective
OpenStack at CERN : A 5 year perspective
Tim Bell
 
MayaData Datastax webinar - Operating Cassandra on Kubernetes with the help ...
MayaData  Datastax webinar - Operating Cassandra on Kubernetes with the help ...MayaData  Datastax webinar - Operating Cassandra on Kubernetes with the help ...
MayaData Datastax webinar - Operating Cassandra on Kubernetes with the help ...
MayaData Inc
 
Enabling Insight to Support World-Class Supercomputing (Stefan Ceballos, Oak ...
Enabling Insight to Support World-Class Supercomputing (Stefan Ceballos, Oak ...Enabling Insight to Support World-Class Supercomputing (Stefan Ceballos, Oak ...
Enabling Insight to Support World-Class Supercomputing (Stefan Ceballos, Oak ...
confluent
 

Similar to Experience of Running Spark on Kubernetes on OpenStack for High Energy Physics Workloads with Prasanth Kothuri Piotr Mrowczynski (20)

Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Tech...
Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Tech...Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Tech...
Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Tech...
 
Using Containers and HPC to Solve the Mysteries of the Universe by Deborah Bard
Using Containers and HPC to Solve the Mysteries of the Universe by Deborah BardUsing Containers and HPC to Solve the Mysteries of the Universe by Deborah Bard
Using Containers and HPC to Solve the Mysteries of the Universe by Deborah Bard
 
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...
 
How HPC and large-scale data analytics are transforming experimental science
How HPC and large-scale data analytics are transforming experimental scienceHow HPC and large-scale data analytics are transforming experimental science
How HPC and large-scale data analytics are transforming experimental science
 
The OpenStack Cloud at CERN - OpenStack Nordic
The OpenStack Cloud at CERN - OpenStack NordicThe OpenStack Cloud at CERN - OpenStack Nordic
The OpenStack Cloud at CERN - OpenStack Nordic
 
Cisco: Cassandra adoption on Cisco UCS & OpenStack
Cisco: Cassandra adoption on Cisco UCS & OpenStackCisco: Cassandra adoption on Cisco UCS & OpenStack
Cisco: Cassandra adoption on Cisco UCS & OpenStack
 
Teaching Apache Spark Clusters to Manage Their Workers Elastically: Spark Sum...
Teaching Apache Spark Clusters to Manage Their Workers Elastically: Spark Sum...Teaching Apache Spark Clusters to Manage Their Workers Elastically: Spark Sum...
Teaching Apache Spark Clusters to Manage Their Workers Elastically: Spark Sum...
 
Apache Spark AI Use Case in Telco: Network Quality Analysis and Prediction wi...
Apache Spark AI Use Case in Telco: Network Quality Analysis and Prediction wi...Apache Spark AI Use Case in Telco: Network Quality Analysis and Prediction wi...
Apache Spark AI Use Case in Telco: Network Quality Analysis and Prediction wi...
 
European Exascale System Interconnect & Storage
European Exascale System Interconnect & StorageEuropean Exascale System Interconnect & Storage
European Exascale System Interconnect & Storage
 
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov... Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 
Very Large Data Files, Object Stores, and Deep Learning—Lessons Learned While...
Very Large Data Files, Object Stores, and Deep Learning—Lessons Learned While...Very Large Data Files, Object Stores, and Deep Learning—Lessons Learned While...
Very Large Data Files, Object Stores, and Deep Learning—Lessons Learned While...
 
Cloud Foundry and OpenStack – Marriage Made in Heaven !
Cloud Foundry and OpenStack – Marriage Made in Heaven !Cloud Foundry and OpenStack – Marriage Made in Heaven !
Cloud Foundry and OpenStack – Marriage Made in Heaven !
 
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
 
AWS re:Invent 2016: Large-Scale, Cloud-Based Analysis of Cancer Genomes: Less...
AWS re:Invent 2016: Large-Scale, Cloud-Based Analysis of Cancer Genomes: Less...AWS re:Invent 2016: Large-Scale, Cloud-Based Analysis of Cancer Genomes: Less...
AWS re:Invent 2016: Large-Scale, Cloud-Based Analysis of Cancer Genomes: Less...
 
NSCC Training Introductory Class
NSCC Training Introductory Class NSCC Training Introductory Class
NSCC Training Introductory Class
 
Cloud Foundry and OpenStack - A Marriage Made in Heaven! (Cloud Foundry Summi...
Cloud Foundry and OpenStack - A Marriage Made in Heaven! (Cloud Foundry Summi...Cloud Foundry and OpenStack - A Marriage Made in Heaven! (Cloud Foundry Summi...
Cloud Foundry and OpenStack - A Marriage Made in Heaven! (Cloud Foundry Summi...
 
Cloud Foundry and OpenStack - A Marriage Made in Heaven! (Cloud Foundry Summi...
Cloud Foundry and OpenStack - A Marriage Made in Heaven! (Cloud Foundry Summi...Cloud Foundry and OpenStack - A Marriage Made in Heaven! (Cloud Foundry Summi...
Cloud Foundry and OpenStack - A Marriage Made in Heaven! (Cloud Foundry Summi...
 
OpenStack at CERN : A 5 year perspective
OpenStack at CERN : A 5 year perspectiveOpenStack at CERN : A 5 year perspective
OpenStack at CERN : A 5 year perspective
 
MayaData Datastax webinar - Operating Cassandra on Kubernetes with the help ...
MayaData  Datastax webinar - Operating Cassandra on Kubernetes with the help ...MayaData  Datastax webinar - Operating Cassandra on Kubernetes with the help ...
MayaData Datastax webinar - Operating Cassandra on Kubernetes with the help ...
 
Enabling Insight to Support World-Class Supercomputing (Stefan Ceballos, Oak ...
Enabling Insight to Support World-Class Supercomputing (Stefan Ceballos, Oak ...Enabling Insight to Support World-Class Supercomputing (Stefan Ceballos, Oak ...
Enabling Insight to Support World-Class Supercomputing (Stefan Ceballos, Oak ...
 

More from Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack Detection
Databricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack Detection
 

Recently uploaded

Challenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more importantChallenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more important
Sm321
 
Analysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performanceAnalysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performance
roli9797
 
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
nuttdpt
 
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
Timothy Spann
 
DSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelinesDSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelines
Timothy Spann
 
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
Social Samosa
 
一比一原版(harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(harvard毕业证书)哈佛大学毕业证如何办理一比一原版(harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(harvard毕业证书)哈佛大学毕业证如何办理
taqyea
 
Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......
Sachin Paul
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
Walaa Eldin Moustafa
 
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
ihavuls
 
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Kaxil Naik
 
End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024
Lars Albertsson
 
Experts live - Improving user adoption with AI
Experts live - Improving user adoption with AIExperts live - Improving user adoption with AI
Experts live - Improving user adoption with AI
jitskeb
 
Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024
ElizabethGarrettChri
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
manishkhaire30
 
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
wyddcwye1
 
A presentation that explain the Power BI Licensing
A presentation that explain the Power BI LicensingA presentation that explain the Power BI Licensing
A presentation that explain the Power BI Licensing
AlessioFois2
 
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdfUdemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Fernanda Palhano
 
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
hyfjgavov
 
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
sameer shah
 

Recently uploaded (20)

Challenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more importantChallenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more important
 
Analysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performanceAnalysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performance
 
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
 
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
 
DSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelinesDSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelines
 
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
 
一比一原版(harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(harvard毕业证书)哈佛大学毕业证如何办理一比一原版(harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(harvard毕业证书)哈佛大学毕业证如何办理
 
Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
 
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
 
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
 
End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024
 
Experts live - Improving user adoption with AI
Experts live - Improving user adoption with AIExperts live - Improving user adoption with AI
Experts live - Improving user adoption with AI
 
Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
 
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
 
A presentation that explain the Power BI Licensing
A presentation that explain the Power BI LicensingA presentation that explain the Power BI Licensing
A presentation that explain the Power BI Licensing
 
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdfUdemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
 
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
 
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
 

Experience of Running Spark on Kubernetes on OpenStack for High Energy Physics Workloads with Prasanth Kothuri Piotr Mrowczynski

  • 1. Prasanth Kothuri, CERN Piotr Mrowczynski, CERN Experience of Running Spark on Kubernetes on OpenStack for High Energy Physics Workloads #SAISEco11
  • 2. #SAISEco11 !2 CERN • CERN - European Laboratory for Particle Physics • The place where the Web was born • Home of the Large Hadron Collider and 4 big Experiments: – ATLAS ~ CMS ~ LHCb ~ ALICE • 22 member states + 6 associate members + world- wide collaborations ~3000 Members of Personnel ~12,000 users ~1000 MCHF yearly budget Founded in 1954 Mission: fundamental research in Physics Answering questions…
  • 3. 18 months of production experience !3 The Large Hadron Collider (LHC) Largest machine in the world 27km, 6000+ superconducting magnets Emptiest place in the solar system High vacuum inside the magnets Hottest spot in the galaxy During Lead ion collisions create temperatures 100 000x hotter than the heart of the sun; Fastest racetrack on Earth Protons circulate 11245 times/s (99.9999991% the speed of light)
  • 4. #SAISEco11 CERN Data Centre !4 • Physics data are aggregated in the CERN Data Centre, where initial data reconstruction of physics events is performed • A remote extension of the CERN data centre is hosted in Budapest, Hungary. It provides the extra computing power required to cover CERN’s needs. Meyrin Data Centre Wigner Extension TOTAL Servers 11 500 3 500 15 000 Processor cores 174 300 56 000 230 300 Disks 61 900 29 700 91 600 (280 PB capacity) Tape Cartridges 32 200 (~ 400 PB capacity) ~ 12.5 PB per month ~ 2 PB accessed every day ~ 4 megawatt of electricity
  • 5. #SAISEco11 Data at Scale @ CERN !5 • Physics data – Today we use WLCG to handle it • Optimised for physics analysis and concurrent access • ROOT framework - custom software and data format • Early stage experimental work ongoing to use Spark for physics analysis • Infrastructure data • Accelerators and detector controllers • Experiments Data catalogues (collisions, files etc.) • Monitoring of the WLCG and CERN data centres • Systems logs
  • 6. #SAISEco11 Apache Spark @ CERN !6 • Current state-of-the-art • Spark running on top of YARN/HDFS. Typically processing happens on the same cluster of machines as storage (data locality) • In total ~1850 physical cores and 15 PB capacity Cluster Name Configuration Software Version Accelerator logging 20 nodes (Cores 480, Mem - 8 TB, Storage – 5 PB, 96GB in SSD) Spark 2.2.0 – 2.3.1 General Purpose 48 nodes (Cores – 892,Mem – 7.5TB,Storage – 6 PB) Spark 2.2.0 – 2.3.1 Development cluster 14 nodes (Cores – 196,Mem – 768GB,Storage – 2.15 PB) Spark 2.2.0 – 2.3.1 ATLAS Event Index 18 nodes (Cores – 288,Mem – 912GB,Storage – 1.29 PB) Spark 2.2.0 – 2.3.1
  • 7. #SAISEco11 Apache Spark @ CERN !7 • Current state-of-the-art • Stable and predictable production workloads from our user communities • Physical machines allocated – means no resource elasticity, no isolation of workloads, compute coupled with data storage • Ad-hoc workloads from physics users interested analyzing data at scale using external storages (EOS). Physics analysis limited to the capacity and resources of shared Spark/Hadoop clusters.
  • 8. #SAISEco11 !8 New Programming / Analysis Model for Physics data - Data and Operational Challenges
  • 9. #SAISEco11 Data and Operational Challenges !9 • Data volumes expected to grow dramatically with HL-LHC • Annual growth at 20-30% • Decrease time-to-physics • On-demand generation of physics n-tuples • Resource Elasticity • Isolation of workloads • Physics data stored externally in EOS – CERN custom built data storage • Usage of industry big-data solutions
  • 10. #SAISEco11 CMS Data Reduction Facility !10 On-demand reduction of large datasets based on complicated user criteria Input dataset ~ petabytes and requires 1000’s cores Reproducibility is key in the physics research world!
  • 11. #SAISEco11 Interactive physics data analysis using notebooks !11 • To propose a time-efficient way to perform analysis of extensive amounts of data in CERN • To investigate impact and usefulness of external solutions for HEP computing needs • To prepare a ready model for future analyses performed in (TOTEM) experiments
  • 13. #SAISEco11 Literature Study !13 „Cloud-native is an approach to building and running applications that exploits the advantages of the cloud computing model” [1] Companies attempted cloud-native Spark deployments using Kubernetes Native resource scheduler implementation had experimental release in Apache Spark in March 2018. Research from Google defends the hypothesis, that in cluster computing disk-locality is becoming irrelevant nowadays [2] Databricks and Accenture Labs benchmarked Spark with data in external storages (S3/GCS) compared to HDFS. [3][4] [1] „Kubernetes becomes the first project to graduate from the cloud native computing foundation.” [2] „Disk-Locality in Datacenter Computing Considered Irrelevant” [3] Databricks - „Top 5 Reasons for Choosing S3 over HDFS” [4] Accenture Technology Labs - „Cloud-based Hadoop Deployments: Benefits and Considerations”
  • 14. #SAISEco11 Separating Compute and Storage for Elastic Big Data !14 Spark HDFS Spark External Storage (EOS, S3, GCS + Kafka, HDFS) Data Locality Data External (network)
  • 15. #SAISEco11 Separating Compute and Storage for Elastic Big Data !15 Spark HDFS Spark External Storage (EOS, S3, GCS + Kafka, HDFS) Data Locality Data External (network) • assumes disk bandwidth is higher than storage throughput over network • assumes mass storage throughput higher than cluster disk bandwidth
  • 16. #SAISEco11 Separating Compute and Storage for Elastic Big Data !16 Spark HDFS Spark External Storage (EOS, S3, GCS + Kafka, HDFS) Data Locality Data External (network) • assumes disk bandwidth is higher than storage throughput over network • limited elasticity (both for computation and storage) • assumes mass storage throughput higher than cluster disk bandwidth • allows compute not being bound to storage (with cloud elasticity)
  • 17. #SAISEco11 Separating Compute and Storage for Elastic Big Data !17 Spark HDFS Spark External Storage (EOS, S3, GCS + Kafka, HDFS) Data Locality Data External (network) • assumes disk bandwidth is higher than storage throughput over network • limited elasticity (both for computation and storage) • avoids high network bandwidth for analysis on large datasets • assumes mass storage throughput higher than cluster disk bandwidth • allows compute not being bound to storage (with cloud elasticity) • assumes network delivers data to CPU faster than local disks, might generate high traffic
  • 18. #SAISEco11 Possible solution for storage elasticity !18 • mass storage services are easier/cheaper to scale • data stored on disk can be large, and compute nodes can be adjusted to compute needs
  • 19. #SAISEco11 Possible solution for storage and compute elasticity and reproducibility !19 • Openstack CCE (Cloud Container Engine) with Kubernetes provides compute elasticity and authentication mechanism (certificates).
  • 20. #SAISEco11 Possible solution for storage and compute elasticity and reproducibility !20 [1] https://github.com/GoogleCloudPlatform/spark-on-k8s-operator • Openstack CCE (Cloud Container Engine) with Kubernetes provides compute elasticity and authentication mechanism (certificates). • Separation of monolithic services simplifies operational effort • Just Spark-Submit with Kubernetes only allows to create drivers and executors (not so friendly to manage). • Spark K8S Operator bring feature parity known from YARN (managing Spark Applications) 
 • Spark Operator gives idiomatic submission, application restart policies, cron support, custom volume/config/secret mounts, pod affinity/anti-affinity
  • 21. #SAISEco11 Container as a service to automate deployment !21
  • 22. #SAISEco11 !22 Evaluating Spark on Kubernetes properties and demo
  • 23. #SAISEco11 !23 Compute Elasticity Cloud Container Engine Spark Operator Deployment 
 (long-running operator for batch) Cloud Infrastructure Kubernetes Resource Manager openstack coe cluster
 Create/Resize/Delete/GetConfiguration for
 Kubernetes clusters sparkctl 
 Submit batch/ stream jobs
 to K8S via Operator
 (cluster-mode) Helm/Kubectl 
 Create/Delete/Update
 Spark Operator to control 
 batch/stream jobs Spark Drivers/ Executors Pods (containers) Container Registry 
 Fetch containers with required software spark-submit 
 Create driver on behalf
 of user (customised by operator)
  • 24. #SAISEco11 !24 Workflow Reproducibility • Kubernetes and use of containers allows Spark developers to build their isolated and reproducible environments • Encapsulate software packages, libraries, configurations Dockerfiles with required software packages, configurations etc Cloud Infrastructure Kubernetes Resource Manager Baremetal Infrastructure Kubernetes Resource Manager Docker containers
  • 25. #SAISEco11 !25 Management of Spark Applications • Operator watches for create/delete/ update events of SparkApplication • Spark Operator executes spark- submit with required configurations • Data is uploaded/read from external storage service, and monitoring is realised with internal/ external monitoring tools Kubernetes Operator for Apache Spark https://github.com/GoogleCloudPlatform/spark-on-k8s-operator
  • 26. #SAISEco11 !26 • Spark Application Controller handles restarts and resubmissions. • Submission Runner executes spark-submit • Spark Pod Monitor reports updates of pods to controller • Mutating Admission WebHook handles customisation of Docker containers and their affinities Management of Spark Applications Kubernetes Operator for Apache Spark https://github.com/GoogleCloudPlatform/spark-on-k8s-operator
  • 27. #SAISEco11 !27 Persistent Storage for cloud-native Data over network - available solutions with CERN Openstack Cloud Persistent Storage Feature Network Volume Mounts Filesystem Connectors Object Storage Connectors Monitoring Storage Example CephFS [1], CVMFS [2], other [3] EOS/XRootD [4], HDFS external, no data locality Ceph S3, GCS [5] InfluxDB/Grafana, Stackdriver, Prometheus Use-case Software Packages, Checkpoints, Events (History Server) Data processing, Events, Checkpoints Data processing, Events (History Server), Checkpoints Statistics Spark Transactional Writes - Yes Requires committer (Directory Committer Hadoop 3.1) [6] - [2] https://clouddocs.web.cern.ch/clouddocs/containers/tutorials/cvmfs.html [3] https://kubernetes.io/docs/concepts/storage/storage-classes [1] https://clouddocs.web.cern.ch/clouddocs/containers/tutorials/cephfs.html [4] https://github.com/cern-eos/eos, https://github.com/cerndb/hadoop-xrootd [5] https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage [6] http://hadoop.apache.org/docs/r3.1.1/hadoop-aws/tools/hadoop-aws/committers.html
  • 28. #SAISEco11 !28 Memory management in Kubernetes Spark and Kubernetes Nodes • Kubelet monitors memory and disk available to the Node. • Detecting MemoryPressure or System Out Of Memory and reclaiming resources from running containers (OOMKilled errors)
  • 29. #SAISEco11 !29 Memory management in Kubernetes Spark and Kubernetes Nodes • Kubelet monitors memory and disk available to the Node. • Detecting MemoryPressure or System Out Of Memory and reclaiming resources from running containers (OOMKilled errors) • MemoryOverheadFactor - memory for off-heap memory, non-JVM processes (e.g. in case of Python higher limit) and processes required for operation of container.
  • 30. #SAISEco11 !30 Scaling Spark on Kubernetes Data Reduction and Dimuon Mass Calculation on 20TB (target is 1 PB dataset) - 1 VM is 2CPU and 12GB RAM - Scaling Test with 60 to 180 VMs - Load Test with 500 VMs (200 hypervisors)
  • 31. #SAISEco11 !31 Scaling Spark on Kubernetes Throughput trend Job execution duration trend Data Reduction and Dimuon Mass Calculation on 20TB (target is 1 PB dataset)
  • 32. #SAISEco11 !32 Spark Operator Demo Create cluster https://youtu.be/vuSLS7-JqQI
  • 33. #SAISEco11 !33 Conclusions and observations - Openstack provisions Kubernetes cluster of 10s of nodes in automated fashion (Cloud Container Engine, Cloud Horizontal Autoscalers optional). - Kubernetes allows isolated and reproducible Spark, limited operational effort with Docker containers. - Spark K8S Operator provides management of Spark Applications similar to YARN ecosystem
  • 34. #SAISEco11 !34 Conclusions and observations - Data is external, this has advantages but also drawbacks - do you need data locality and why? - Compute cluster is managed service, storage can scale separately from compute or in the cloud. - Storage interoperability (CephFS, EOS, S3 and GCS can serve more use-cases than just Hadoop) - Openstack provisions Kubernetes cluster of 10s of nodes in automated fashion (Cloud Container Engine, Cloud Horizontal Autoscalers optional). - Kubernetes allows isolated and reproducible Spark, limited operational effort with Docker containers. - Spark K8S Operator provides management of Spark Applications similar to YARN ecosystem
  • 35. #SAISEco11 !35 Conclusions and observations - Without data locality, network can be a serious problem/bottleneck (specifically in case of over-tuning or bugs). Monitoring at hand. - Large shuffle writes problematic, assumed compute VMs and large storage space is not available. On other hand SSD can be used cheaply. - Multi-tenancy still a question.
  • 36. #SAISEco11 !36 Conclusions and observations - No particular overhead between K8S VMs and bare- metal YARN with TPCDS Benchmark on similar hardware. Production YARN had more spikes than Cloud K8S due to shared environment. - Spark/YARN had ~40 nodes fixed, Spark/K8S 500 VMs (~200 nodes) provisioned for short time period - Without data locality, network can be a serious problem/bottleneck (specifically in case of over-tuning or bugs). Monitoring at hand. - Large shuffle writes problematic, assumed compute VMs and large storage space is not available. On other hand SSD can be used cheaply. - Multi-tenancy still a question.