In a world of serverless computing, users tend to be frugal about expenditure on compute, storage and other resources; paying for them when they are not in use becomes a significant factor. Offering Spark as a service on the cloud presents unique challenges, and running Spark on Kubernetes raises many of them, especially around storage and persistence. Spark workloads have specific storage needs: scratch space for intermediate data, long-term persistence, and a shared file system. Those requirements become much stricter when the same stack must be offered as an enterprise service that meets GDPR and other compliance obligations such as ISO 27001 and HIPAA certifications.
This talk covers the challenges involved in providing serverless Spark clusters and shares the specific issues one can encounter when running large Kubernetes clusters in production, especially scenarios related to persistence.
It will help people running Kubernetes or a Docker runtime in production understand the various storage options available, which ones are most suitable for Spark workloads on Kubernetes, and what more can be done.
2. About Me
• Advisory Software Engineer @ IBM India Software Labs
• General Purpose Developer
• Love Containers & Kubernetes
• Conference traveler
• Upcoming book on Hadoop and Its Ecosystem
• Cricket fan, Foodie
3. Spark
Unified, open source, parallel data processing framework for big data analytics
Spark Core Engine
Schedulers: Standalone, YARN, Mesos, Kubernetes
Libraries: Spark SQL (interactive queries), Spark Streaming (stream processing), Spark MLlib (machine learning), GraphX (graph computation)
5. Evolution of Spark Analytics
On-Prem Install
• Acquire hardware
• Prepare machines
• Install Spark
• Retry
• Apply patches
• Security
• Upgrades
• Scale
• High availability
Virtualization
• Prepare VM imaging solution
• Network management
• High availability
• Patches
• Scale
Managed
• Configure cluster
• Customize
• Scale
• Pay even if idle
Serverless
• Run analytics
6. What Does Kubernetes Bring In?
• Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications.
• It manages containers for me
• It manages high availability
• It gives me the flexibility to choose the resources I want and the persistence I want
• Lots of add-on services: third-party logging, monitoring, and security tools
• Reduced operational costs
• Improved infrastructure utilization
8. Storage Requirements
• Distributed File System
• Local Scratch Space
• Fast disk writes – DO NOT write to the container filesystem!
• User Library
• Logs
• History Server Events
• Configs
• Secrets (see the conf sketch below)
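A minimal sketch of how some of these requirements map onto Spark-on-Kubernetes configuration. The property names (spark.eventLog.*, spark.kubernetes.*.secrets.*, spark.local.dir) exist in current Spark releases; the secret name, mount paths and the shared volume are illustrative assumptions, not part of the talk.

```python
# Hypothetical Spark conf sketch: history-server events on shared storage,
# a pre-created Kubernetes Secret mounted into driver and executors, and
# scratch space kept off the container filesystem. Names/paths are assumptions.
conf = {
    "spark.eventLog.enabled": "true",
    "spark.eventLog.dir": "file:///shared/spark-events",        # shared NFS/PVC mount
    "spark.kubernetes.driver.secrets.spark-creds": "/etc/spark-secrets",
    "spark.kubernetes.executor.secrets.spark-creds": "/etc/spark-secrets",
    "spark.local.dir": "/scratch",                               # fast local scratch volume
}
```

In cluster mode these entries would typically be passed to spark-submit as --conf key=value pairs.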
9. What can we leverage
• Distributed
• NFS
• PV to PVC (1 to 1 Mapping in most of the Cloud Providers)
• Big NFS – multiple PVs – quota
• HDFS – no direct support, but can be configured to work; no data locality
• DBFS – the S3-based Databricks File System (DBFS) is a distributed file system
• S3/Object Storage – performance concerns (see the conf sketch below)
• Portworx – under exploration
• GlusterFS
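A sketch of what two of the options above look like when wired into Spark on Kubernetes: an NFS-backed PVC mounted on the executors and object storage accessed over s3a. The spark.kubernetes.*.volumes.* properties are available in recent Spark releases (2.4+); the claim name, endpoint, bucket and credentials are placeholder assumptions, and s3a needs the hadoop-aws libraries on the classpath.

```python
# Hypothetical conf for spark-submit: shared NFS-backed PVC + object storage via s3a.
conf = {
    # NFS-backed PVC mounted at the same path on every executor
    "spark.kubernetes.executor.volumes.persistentVolumeClaim.shared.mount.path": "/shared",
    "spark.kubernetes.executor.volumes.persistentVolumeClaim.shared.options.claimName": "spark-shared-nfs",
    # Object storage over s3a (credentials inline only for illustration)
    "spark.hadoop.fs.s3a.endpoint": "https://s3.example.cloud",
    "spark.hadoop.fs.s3a.access.key": "ACCESS_KEY",
    "spark.hadoop.fs.s3a.secret.key": "SECRET_KEY",
    "spark.hadoop.fs.s3a.path.style.access": "true",
}

# Flatten into spark-submit arguments
args = [a for k, v in conf.items() for a in ("--conf", f"{k}={v}")]
print(" ".join(["spark-submit", "--deploy-mode", "cluster", *args, "my_job.py"]))
```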
10. What can we leverage
• Local temp dir / scratch space
• emptyDir
• Clean delete? Need to return machines
• hostPath
• You manage the delete
• Logs
• emptyDir vs NFS
• Push to object store using fluentd (sidecar containers) – see the pod sketch below
• Roll over
• Do not write to containers
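A sketch of the pattern above using the official Kubernetes Python client: an executor-style pod with an emptyDir scratch volume (removed when the pod goes away) and a fluentd sidecar reading logs from a second emptyDir, so nothing is written into the container's own filesystem. The images, paths and namespace are illustrative assumptions.

```python
# Hypothetical pod with emptyDir scratch space and a fluentd log-shipping sidecar.
from kubernetes import client, config

config.load_kube_config()

pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "spark-exec-example", "labels": {"app": "spark"}},
    "spec": {
        "restartPolicy": "Never",
        "volumes": [
            {"name": "scratch", "emptyDir": {}},   # cleaned up with the pod
            {"name": "logs", "emptyDir": {}},
        ],
        "containers": [
            {
                "name": "spark-executor",
                "image": "example/spark:2.4",       # illustrative image
                "volumeMounts": [
                    {"name": "scratch", "mountPath": "/scratch"},     # spark.local.dir
                    {"name": "logs", "mountPath": "/var/log/spark"},
                ],
            },
            {
                "name": "fluentd",                   # sidecar shipping logs to object store
                "image": "fluent/fluentd:v1.16-1",   # illustrative tag
                "volumeMounts": [
                    {"name": "logs", "mountPath": "/var/log/spark"},
                ],
            },
        ],
    },
}

client.CoreV1Api().create_namespaced_pod(namespace="spark", body=pod)
```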
11. What are we looking for?
• Image as Volume
• https://github.com/kubernetes/kubernetes/issues/831
• Flex Volume Plugin
• CSI
• Encrypted PVC options – Portworx (see the PVC sketch below)
• PV to PVC 1-to-many mapping with isolation
• ConfigMap: better support for updates
• Local
• Clean Delete for HIPAA
• Distributed
• Clean Delete for HIPAA
• PVC transfer across Namespaces
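As a rough illustration of the encrypted-PVC item on this wish list, here is what an encrypted claim can look like today with a Portworx-backed StorageClass, created through the Kubernetes Python client. The secure "true" parameter, class/claim names and sizes are assumptions drawn from Portworx documentation, not something prescribed by the talk.

```python
# Hypothetical encrypted PVC via a Portworx StorageClass.
from kubernetes import client, config

config.load_kube_config()

storage_class = {
    "apiVersion": "storage.k8s.io/v1",
    "kind": "StorageClass",
    "metadata": {"name": "px-secure"},
    "provisioner": "kubernetes.io/portworx-volume",
    "parameters": {"repl": "2", "secure": "true"},   # encryption at the volume layer
}
client.StorageV1Api().create_storage_class(body=storage_class)

pvc = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "spark-encrypted-scratch"},
    "spec": {
        "accessModes": ["ReadWriteOnce"],
        "storageClassName": "px-secure",
        "resources": {"requests": {"storage": "50Gi"}},
    },
}
client.CoreV1Api().create_namespaced_persistent_volume_claim(namespace="spark", body=pvc)
```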
12. References
• IBM Watson Studio
https://datascience.ibm.com
• IBM Watson
https://www.ibm.com/analytics/us/en/watson-data-platform/tutorial/
• Analytics Engine
https://www.ibm.com/cloud/analytics-engine
• Apache Spark
• Kubernetes Scheduler Design & Discussion
• Kubernetes Clusters on IBM Cloud
Rachit Arora
rachitar@in.ibm.com
@rachit1arora
Spark is an open source, scalable, massively parallel, in-memory execution engine for analytics applications.
Think of it as an in-memory layer that sits above multiple data stores, where data can be loaded into memory and analyzed in parallel across a cluster. Spark Core, the foundation of Spark, provides the libraries for scheduling and basic I/O, and Spark offers hundreds of high-level operators that make it easy to build parallel apps.
Spark also includes prebuilt machine-learning and graph-analysis algorithms written specifically to execute in parallel and in memory. It also supports interactive SQL processing of queries and real-time streaming analytics. You can write analytics applications in programming languages such as Java, Python, R and Scala. You can run Spark in its standalone cluster mode, on the cloud, on Hadoop YARN, on Apache Mesos, or on Kubernetes, and access data in HDFS, Cassandra, HBase, Hive, Object Store, and any Hadoop data source.
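A minimal PySpark sketch of that point: the application code is the same whether the cluster runs standalone, on YARN, Mesos or Kubernetes, because the master and deploy mode are supplied at submit time. The bucket path and column name are illustrative assumptions.

```python
# Minimal PySpark example: same API regardless of cluster manager or data store.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("spark-overview-example")
         .getOrCreate())                  # master/deploy mode come from spark-submit

# Load data from an external store and analyze it in parallel, in memory
df = spark.read.parquet("s3a://example-bucket/events/")   # or hdfs://, file://, ...
df.groupBy("country").count().show()

spark.stop()
```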
Prepare: even though you have the right data, it may not be in the right format or structure for analysis. That's where data preparation comes in. Data engineers need to bring raw data into one interface from wherever it lives – on premises, in the cloud or on your desktop – where it can then be shaped, transformed, explored, and prepared for analysis.
The data scientist is primarily responsible for building predictive analytic models and insights. They analyze data that has been cataloged and prepared by the data engineer, using machine learning tools like Watson Machine Learning, and build applications using Jupyter Notebooks and RStudio.
After the data scientist shares these analytical outputs, an application developer can build apps such as a cognitive chatbot. As the chatbot engages with customers, it continuously improves its knowledge and helps uncover new insights.
As a data scientist, this is what I was required to do.
On prem to virtualization: as demand for the service increased in my organization, I decided to move to virtualized VMs to handle many requests on demand, but there was still a lot of pain.
Then I decided to try services offered on the cloud, like EMR, IBM Analytics Engine or Microsoft HDInsight, but there I needed to order clusters and configure them to suit my workloads, and keep them running even when I did not want to use them.
Cover what it takes to install a Hadoop/Spark cluster.
IBM Watson brings together data management, data policies, data preparation, and analysis capabilities into a common framework.
You can index, discover, control, and share data with Watson Knowledge Catalog, refine and prepare the data with Data Refinery, then organize resources to analyze the same data with Watson Studio.
The IBM Watson apps are fully integrated to use the same user interface and framework. You can pick whichever apps and tools you need for your organization.
Watson Studio provides you with the environment and tools to solve your business problems by collaboratively analyzing data.
What is Analytics Engine?
You can use Analytics Engine to build and deploy clusters within minutes, with a simplified user experience, scalability, and reliability. You can custom-configure the environment and scale on demand.