Dynamic Resource Allocation Spark on YARN


Spark on YARN lets Spark jobs run on YARN clusters. It supports two modes: yarn-client mode, where the driver runs on the client machine, and yarn-cluster mode, where the driver runs in a YARN container. Dynamic resource allocation lets Spark request and release containers based on workload, launching and killing executors as needed. This improves resource utilization by avoiding the case where containers sit idle after their tasks complete. Enabling it requires configuring the external shuffle service, which keeps shuffle output (intermediate files) on the NodeManager rather than inside executors, so that killing an executor does not lose that data.

Dynamic Resource Allocation for Spark on YARN
Tsuyoshi Ozawa
ozawa@apache.org

What's YARN
• A resource manager implementation for computer clusters

Hadoop Stack
[Diagram: HDFS, YARN, and the frameworks MapReduce, Spark, and Tez]

YARN overview
• All resources are managed by ResourceManager
• All tasks are launched on NodeManager
• Clients submit jobs via the ResourceManager
[Diagram: a client, the ResourceManager, and two NodeManagers]

Spark on YARN
• 2 modes
• yarn-cluster
• yarn-client

yarn-cluster mode
• Launching the Spark driver in a YARN container
• Working well with spark-submit
[Diagram: client, ResourceManager, and NodeManagers hosting container1, container2, and the Spark AppMaster with the Spark driver. Steps: 1. submit; 2. launching master; 3. launching executors]

yarn-client mode
• Launching the Spark driver on the client side
• Working well with spark-shell
[Diagram: client with the Spark driver, ResourceManager, and NodeManagers hosting container1, container2, and the Spark AppMaster. Steps: 1. submit; 2. launching master; 3. launching executors; 4. send commands]
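The two modes differ only in where the driver runs, and that choice is made at submit time. The sketch below builds (without running) a spark-submit command line for each mode. The function name, the class `com.example.MyApp`, and the jar `my-app.jar` are hypothetical placeholders. The `yarn-cluster` and `yarn-client` master values match the names used on the slides; later Spark releases express the same choice as `--master yarn` plus `--deploy-mode`.

```python
def spark_submit_argv(mode, main_class, jar, legacy_master=True):
    """Build (but do not run) a spark-submit command line for a YARN mode."""
    if mode not in ("cluster", "client"):
        raise ValueError("mode must be 'cluster' or 'client'")
    if legacy_master:
        # Spark 1.x style, as named on the slides: yarn-cluster / yarn-client
        master_args = ["--master", "yarn-" + mode]
    else:
        # Later style: the master is just "yarn", the mode is a separate flag
        master_args = ["--master", "yarn", "--deploy-mode", mode]
    return ["spark-submit"] + master_args + ["--class", main_class, jar]


if __name__ == "__main__":
    # Hypothetical application class and jar, for illustration only
    print(" ".join(spark_submit_argv("cluster", "com.example.MyApp", "my-app.jar")))
    print(" ".join(spark_submit_argv("client", "com.example.MyApp", "my-app.jar",
                                     legacy_master=False)))
```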

Spark on YARN
• yarn-cluster mode
[Diagram: Node1, Node2, and Node3; labels: container 1, container 2, AppMaster, container 2]

Problem
• Inefficient resource management
• Containers cannot exit until the job exits
[Diagram: four containers across Node1 and Node2. Utilization in stage1: 100%, 100%, 100%, 100%. Utilization in stage2: 100%, 0%, 0%, 0%]
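The cost of holding idle containers can be worked out from the percentages on this slide. The sketch below is a rough illustration only: it assumes both stages last the same amount of time, and the function names are made up for this example.

```python
# Per-container utilization in each stage, read off the slide's diagram.
STAGES = {
    "stage1": [1.0, 1.0, 1.0, 1.0],
    "stage2": [1.0, 0.0, 0.0, 0.0],
}


def static_utilization(stages):
    """Average utilization when every container is held for every stage."""
    held = sum(len(usage) for usage in stages.values())
    used = sum(sum(usage) for usage in stages.values())
    return used / held


def idle_container_slots(stages):
    """Count container-stage slots that are held but do no work."""
    return sum(1 for usage in stages.values() for u in usage if u == 0.0)


if __name__ == "__main__":
    print(static_utilization(STAGES))    # 0.625
    print(idle_container_slots(STAGES))  # 3
```

Under these assumptions, 3 of the 8 container-stage slots are held without doing any work, which is the waste dynamic allocation aims to remove.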

Dynamic resource allocation (since v1.2)
• Allocating containers more dynamically
• The number of executors is decided by the workload
[Diagram: client, ResourceManager, and NodeManagers hosting container1, container2, and the Spark AppMaster with the Spark driver. Steps: 1. submit; 2. launching master; 3. launching executors / kill executors]
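"Decided by workload" can be pictured as two small rules: size the executor pool to the number of outstanding tasks, and release executors that have been idle for too long. The following is a simplified sketch of such a policy, not Spark's actual implementation; every function and parameter name here is hypothetical.

```python
import math


def target_executors(pending_tasks, running_tasks, tasks_per_executor,
                     min_executors, max_executors):
    """Executors needed to run all outstanding tasks, clamped to the bounds."""
    needed = math.ceil((pending_tasks + running_tasks) / tasks_per_executor)
    return max(min_executors, min(max_executors, needed))


def idle_executors(idle_seconds_by_executor, idle_timeout_s):
    """Executor ids that have been idle for at least the timeout."""
    return sorted(
        executor_id
        for executor_id, idle_s in idle_seconds_by_executor.items()
        if idle_s >= idle_timeout_s
    )


if __name__ == "__main__":
    # 10 pending + 6 running tasks, 4 task slots per executor
    print(target_executors(10, 6, 4, min_executors=1, max_executors=8))  # 4
    # Executor "e1" has been idle longer than the 60 s timeout
    print(idle_executors({"e1": 90, "e2": 5}, idle_timeout_s=60))  # ['e1']
```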

Yak shaving
• Where should we hold the state of Spark RDDs?
• If executors are killed, it'll be lost…
[Diagram: a NodeManager hosting two executors, each holding an RDD]

external shuffle
• Saving Spark RDDs to the NodeManager
• NodeManager has an interface, the external shuffle plugin
• Now executors are stateless!
[Diagram: a NodeManager hosting two executors and the external shuffle plugin, which holds the RDDs (intermediate files)]

How to install (with Apache Hadoop)
• Copy the shuffle plugin to the NodeManager's classpath
• Edit yarn-site.xml
• Edit spark-defaults.conf

Copy shuffle jar to NodeManager's classpath
$ cp lib/spark-*-yarn-shuffle.jar /home/ubuntu/hadoop/share/hadoop/yarn/

Edit yarn-site.xml
• Adding shuffle plugin
• Note that the documentation for 1.2 includes a typo - I PRed :-)
• See the documentation for 1.4
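The slide does not show the XML itself. As a sketch, the snippet below renders the two properties that Spark's "Running on YARN" documentation describes for registering the shuffle service as a NodeManager auxiliary service. Treat the property names and class name as things to check against the documentation for your Spark version. Listing `mapreduce_shuffle` alongside `spark_shuffle` is an assumption that the cluster also runs MapReduce, and the function name is made up for this example.

```python
import xml.etree.ElementTree as ET

# Properties to merge into yarn-site.xml (verify against your Spark docs).
SHUFFLE_PROPERTIES = {
    # Assumes MapReduce's shuffle service is also in use on this cluster
    "yarn.nodemanager.aux-services": "mapreduce_shuffle,spark_shuffle",
    "yarn.nodemanager.aux-services.spark_shuffle.class":
        "org.apache.spark.network.yarn.YarnShuffleService",
}


def render_yarn_site(properties):
    """Render name/value pairs as a Hadoop-style <configuration> document."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(render_yarn_site(SHUFFLE_PROPERTIES))
```

NodeManagers read yarn-site.xml at startup, so they need a restart before the new auxiliary service is loaded.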

We're ready!!
• num-executors is defined automatically
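The spark-defaults.conf edit listed in the install steps is not shown on the slides. A minimal sketch of what it might contain follows, rendered from Python so the values are easy to adjust. `spark.dynamicAllocation.enabled` and `spark.shuffle.service.enabled` are the switches Spark's configuration documentation describes for this feature. The min and max executor bounds are illustrative values chosen for this example, not recommendations, and the function name is hypothetical.

```python
DYNAMIC_ALLOCATION = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.shuffle.service.enabled": "true",
    # Illustrative bounds only; pick values that suit your cluster
    "spark.dynamicAllocation.minExecutors": "1",
    "spark.dynamicAllocation.maxExecutors": "8",
}


def render_spark_defaults(props):
    """Render properties as 'key  value' lines, the spark-defaults.conf layout."""
    width = max(len(key) for key in props)
    lines = [f"{key.ljust(width)}  {value}" for key, value in props.items()]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_spark_defaults(DYNAMIC_ALLOCATION), end="")
```

With these set, the executor count no longer has to be passed at submit time; Spark adjusts it between the configured bounds.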

Summary
• Spark on YARN
• yarn-client mode
• yarn-cluster mode
• Spark can launch jobs efficiently on YARN with dynamic allocation

The document discusses 5 common mistakes people make when writing Spark applications: 1) Not properly sizing executors for memory and cores. 2) Having shuffle blocks larger than 2GB which can cause jobs to fail. 3) Not addressing data skew which can cause joins and shuffles to be very slow. 4) Not properly managing the DAG to minimize shuffles and stages. 5) Classpath conflicts from conflicting dependencies like Guava which can cause errors.

Bucketing 2.0: Improve Spark SQL Performance by Removing Shuffle

Databricks

Optimizing Apache Spark SQL Joins

Databricks

Join operations in Apache Spark is often the biggest source of performance problems and even full-blown exceptions in Spark. After this talk, you will understand the two most basic methods Spark employs for joining DataFrames – to the level of detail of how Spark distributes the data within the cluster. You’ll also find out how to work out common errors and even handle the trickiest corner cases we’ve encountered! After this talk, you should be able to write performance joins in Spark SQL that scale and are zippy fast! This session will cover different ways of joining tables in Apache Spark. Speaker: Vida Ha This talk was originally presented at Spark Summit East 2017.

Keep me in the Loop: INotify in HDFS

DataWorks Summit

This document describes how the HDFS client uses the NameNode's audit log and DFSInotifyEventInputStream to monitor file system changes since its last poll. The client caches the highest event ID it has seen and periodically polls the NameNode for any events with higher IDs. The NameNode assigns monotonically increasing IDs to each event in its audit log, which also gets replayed as events through the DFSInotifyEventInputStream that clients can read to learn about changes like renames, deletes, metadata updates etc.

Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...

Databricks

Hoodie - DataEngConf 2017

Vinoth Chandar

An Open Source Incremental Processing Framework called Hoodie is summarized. Key points: - Hoodie provides upsert and incremental processing capabilities on top of a Hadoop data lake to enable near real-time queries while avoiding costly full scans. - It introduces primitives like upsert and incremental pull to apply mutations and consume only changed data. - Hoodie stores data on HDFS and provides different views like read optimized, real-time, and log views to balance query performance and data latency for analytical workloads. - The framework is open source and built on Spark, providing horizontal scalability and leveraging existing Hadoop SQL query engines like Hive and Presto.

Memory Management in Apache Spark

Databricks

Memory management is at the heart of any data-intensive system. Spark, in particular, must arbitrate memory allocation between two main use cases: buffering intermediate data for processing (execution) and caching user data (storage). This talk will take a deep dive through the memory management designs adopted in Spark since its inception and discuss their performance and usability implications for the end user.

202201 AWS Black Belt Online Seminar Apache Spark Performnace Tuning for AWS ...

Amazon Web Services Japan

The Delta Architecture pattern has made the lives of data engineers much simpler, but what about improving query performance for data analysts? What are some common places to look at for tuning query performance? In this session we will cover some common techniques to apply to our delta tables to make them perform better for data analysts queries. We will look at a few examples of how you can analyze a query, and determine what to focus on to deliver better performance results.

Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...

Databricks

Spark SQL is a highly scalable and efficient relational processing engine with ease-to-use APIs and mid-query fault tolerance. It is a core module of Apache Spark. Spark SQL can process, integrate and analyze the data from diverse data sources (e.g., Hive, Cassandra, Kafka and Oracle) and file formats (e.g., Parquet, ORC, CSV, and JSON). This talk will dive into the technical details of SparkSQL spanning the entire lifecycle of a query execution. The audience will get a deeper understanding of Spark SQL and understand how to tune Spark SQL performance.

Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...

Databricks

Uber has real needs to provide faster, fresher data to data consumers & products, running hundreds of thousands of analytical queries everyday. Uber engineers will share the design, architecture & use-cases of the second generation of ‘Hudi’, a self contained Apache Spark library to build large scale analytical datasets designed to serve such needs and beyond. Hudi (formerly Hoodie) is created to effectively manage petabytes of analytical data on distributed storage, while supporting fast ingestion & queries. In this talk, we will discuss how we leveraged Spark as a general purpose distributed execution engine to build Hudi, detailing tradeoffs & operational experience. We will also show to ingest data into Hudi using Spark Datasource/Streaming APIs and build Notebooks/Dashboards on top using Spark SQL.

Deep Dive into the New Features of Apache Spark 3.0

Databricks

Apache Iceberg: An Architectural Look Under the Covers

ScyllaDB

Data Lakes have been built with a desire to democratize data - to allow more and more people, tools, and applications to make use of data. A key capability needed to achieve it is hiding the complexity of underlying data structures and physical data storage from users. The de-facto standard has been the Hive table format addresses some of these problems but falls short at data, user, and application scale. So what is the answer? Apache Iceberg. Apache Iceberg table format is now in use and contributed to by many leading tech companies like Netflix, Apple, Airbnb, LinkedIn, Dremio, Expedia, and AWS. Watch Alex Merced, Developer Advocate at Dremio, as he describes the open architecture and performance-oriented capabilities of Apache Iceberg. You will learn: • The issues that arise when using the Hive table format at scale, and why we need a new table format • How a straightforward, elegant change in table format structure has enormous positive effects • The underlying architecture of an Apache Iceberg table, how a query against an Iceberg table works, and how the table’s underlying structure changes as CRUD operations are done on it • The resulting benefits of this architectural design

Apache Arrow Flight: A New Gold Standard for Data Transport

Wes McKinney

This document discusses how structured data is often moved inefficiently between systems, causing waste. It introduces Apache Arrow, an open standard for in-memory data, and how Arrow can help make data movement more efficient. Systems like Snowflake and BigQuery are now using Arrow to help speed up query result fetching by enabling zero-copy data transfers and sharing file formats between query processing and storage.

Running Spark in Production

DataWorks Summit/Hadoop Summit

This document discusses best practices for running Spark in production. It begins with introductions from the presenters and an overview of Spark deployment modes on YARN. The main topics covered are Spark security using Kerberos authentication and authorization, communication channels and encryption in YARN cluster mode, common issues, and performance tuning. For performance, it recommends choosing executor and task sizes to balance efficiency and overhead, and increasing task parallelism to mitigate data skew problems. The goal is to understand workload patterns and monitor behavior to effectively tune Spark for different situations.

Getting Started with Apache Spark on Kubernetes

Databricks

Community adoption of Kubernetes (instead of YARN) as a scheduler for Apache Spark has been accelerating since the major improvements from Spark 3.0 release. Companies choose to run Spark on Kubernetes to use a single cloud-agnostic technology across their entire stack, and to benefit from improved isolation and resource sharing for concurrent workloads. In this talk, the founders of Data Mechanics, a serverless Spark platform powered by Kubernetes, will show how to easily get started with Spark on Kubernetes.

Cassandra at eBay - Cassandra Summit 2012

Jay Patel

[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic

DataScienceConferenc1

This document provides an overview of the Databricks platform. It discusses how Databricks combines features of data warehouses and data lakes to create a "data lakehouse" that supports both business intelligence/reporting and data science/machine learning use cases. Key components of the Databricks platform include Apache Spark, Delta Lake, MLFlow, Jupyter notebooks, and Delta Live Tables. The platform aims to unify data engineering, data warehousing, streaming, and data science tasks on a single open-source platform.

Data Distribution and Ordering for Efficient Data Source V2

Databricks

This presentation discusses data distribution and ordering in Apache Iceberg's Data Source V2. It explains that proper distribution and ordering of data is important for performance when writing and reading large datasets. The new version introduces an API for connectors to specify their required distribution and ordering, addressing issues in V1 where connectors could apply arbitrary transformations. Supported distribution options include ordered, clustered, and unspecified, and the API supports batch and streaming writes. Future work includes supporting distribution and ordering in table creation and improving partition handling. Proper data distribution and ordering is key to scaling performance in Iceberg.

Optimizing S3 Write-heavy Spark workloads

datamantra

This document discusses optimizing Spark write-heavy workloads to S3 object storage. It describes problems with eventual consistency, renames, and failures when writing to S3. It then presents several solutions implemented at Qubole to improve the performance of Spark writes to Hive tables and directly writing to the Hive warehouse location. These optimizations include parallelizing renames, writing directly to the warehouse, and making recover partitions faster by using more efficient S3 listing. Performance improvements of up to 7x were achieved.

Flexible and Real-Time Stream Processing with Apache Flink

DataWorks Summit

This document provides an overview of stream processing with Apache Flink. It discusses the rise of stream processing and how it enables low-latency applications and real-time analysis. It then describes Flink's stream processing capabilities, including pipelining of data, fault tolerance through checkpointing and recovery, and integration with batch processing. The document also summarizes Flink's programming model, state management, and roadmap for further development.

How We Optimize Spark SQL Jobs With parallel and sync IO

Databricks

Although NVMe has been more and more popular these years, a large amount of HDD are still widely used in super-large scale big data clusters. In a EB-level data platform, IO(including decompression and decode) cost contributes a large proportion of Spark jobs’ cost. In another word, IO operation is worth optimizing. In ByteDancen, we do a series of IO optimization to improve performance, including parallel read and asynchronized shuffle. Firstly we implement file level parallel read to improve performance when there are a lot of small files. Secondly, we design row group level parallel read to accelerate queries for big-file scenario. Thirdly, implement asynchronized spill to improve job peformance. Besides, we design parquet column family, which will split a table into a few column families and different column family will be in different Parquets files. Different column family can be read in parallel, so the read performance is much higher than the existing approach. In our practice, the end to end performance is improved by 5% to 30% In this talk, I will illustrate how we implement these features and how they accelerate Apache Spark jobs.

Webinar: Detecting row patterns with Flink SQL - Dawid Wysakowicz

Ververica

This document discusses using Apache Flink's SQL capabilities to analyze streaming data. It provides an example of detecting "rush hour" periods using taxi ride data streamed from New York City. SQL queries with MATCH_RECOGNIZE are shown to identify periods with increasing and decreasing ride counts over 30 minute windows, indicating morning and evening rush hours. The document also demonstrates finding taxi rides with mid-stops and detecting driver fatigue based on total ride durations per day.

Snowflake Architecture.pptx

chennakesava44

The document discusses Snowflake, a cloud data platform. It covers Snowflake's data landscape and benefits over legacy systems. It also describes how Snowflake can be deployed on AWS, Azure and GCP. Pricing is noted to vary by region but not cloud platform. The document outlines Snowflake's editions, architecture using a shared-nothing model, support for structured data, storage compression, and virtual warehouses that can autoscale. Security features like MFA and encryption are highlighted.

Make your PySpark Data Fly with Arrow!

Databricks

In the big data world, it's not always easy for Python users to move huge amounts of data around. Apache Arrow defines a common format for data interchange, while Arrow Flight introduced in version 0.11.0, provides a means to move that data efficiently between systems. Arrow Flight is a framework for Arrow-based messaging built with gRPC. It enables data microservices where clients can produce and consume streams of Arrow data to share it over the wire. In this session, I'll give a brief overview of Arrow Flight from a Python perspective, and show that it's easy to build high performance connections when systems can talk Arrow. I'll also cover some ongoing work in using Arrow Flight to connect PySpark with TensorFlow - two systems with great Python APIs but very different underlying internal data.

Light-weighted HDFS disaster recovery

DataWorks Summit

HDFS is well designed to operate efficiently at scale for normal hardware failures within a datacenter, but it is not designed to handle significant negative events, such as datacenter failures. To overcome this defect, a common practice of HDFS disaster recovery (DR) is replicating data from one location to another through DistCp, which provides a robust and reliable backup capability for HDFS data through batch operations. However, DistCp also has several drawbacks: (1) Taking HDFS Snapshots is time and space consuming on large HDFS cluster. (2) Applying file changes though MapReduce may introduce additional execution overhead and potential issues. (3) DistCp requires administrator intervene to trigger, perform, and verify DistCp jobs, which is not user-friendly in practice. In this presentation, we will share our experience in HDFS DR and introduce our light-weighted HDFS disaster recovery system that addresses afore-mentioned problems. Different from DistCp, our light-weighted DR system is designed based on HDFS logs (e.g. edit log and Inotify), light-weighted producer/consumer framework, and FileSystem API. During synchronization, it fetches limited subsets of namespace and incremental file changes from NameNode, then our executors apply these changes incrementally to remote clusters through FileSystem API. Furthermore, it also provides a powerful user interface with trigger conditions, path filters and jobs scheduler, etc. Compared to DistCp, it is more straightforward, light-weighted, reliable, efficient, and user-friendly. Speaker Qiyuan Gong, Big Data Software Engineer, Intel

KafkaConsumer - Decoupling Consumption and Processing for Better Resource Uti...

confluent

When working with KafkaConsumer, we usually employ single thread both for reading and processing of messages. KafkaConsumer is not thread-safe, so using single thread fits in well. Downside of this approach is that you are limited to single thread for processing messages. By decoupling consumption and processing, we can achieve processing parallelization with single consumer and get the most out of multi-core CPU architectures available today. While this can be very useful in certain use-case scenarios, it's not trivial to implement. How do we use multiple threads with KafkaConsumer which is not thread safe? How do we react to consumer group rebalancing? Can we get desired processing and ordering guarantees? In this talk we 'll try to answer these questions and explore challenges we face on our path.

Streams and Tables: Two Sides of the Same Coin (BIRTE 2018)

confluent

Understanding Memory Management In Spark For Fun And Profit

Spark Summit

1) The document discusses memory management in Spark applications and summarizes different approaches tried by developers to address out of memory errors in Spark executors. 2) It analyzes the root causes of memory issues like executor overheads and data sizes, and evaluates fixes like increasing memory overhead, reducing cores, frequent garbage collection. 3) The document dives into Spark and JVM level configuration options for memory like storage pool sizes, caching formats, and garbage collection settings to improve reliability, efficiency and performance of Spark jobs.

Scaling Spark Workloads on YARN - Boulder/Denver July 2015

Mac Moore

What's hot

Common Strategies for Improving Performance on Your Delta Lakehouse

Databricks

Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...

Databricks

Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...

Databricks

Deep Dive into the New Features of Apache Spark 3.0

Databricks

Apache Iceberg: An Architectural Look Under the Covers

ScyllaDB

Apache Arrow Flight: A New Gold Standard for Data Transport

Wes McKinney

Running Spark in Production

DataWorks Summit/Hadoop Summit

Getting Started with Apache Spark on Kubernetes

Databricks

Cassandra at eBay - Cassandra Summit 2012

Jay Patel

[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic

DataScienceConferenc1

Data Distribution and Ordering for Efficient Data Source V2

Databricks

Optimizing S3 Write-heavy Spark workloads

datamantra

Flexible and Real-Time Stream Processing with Apache Flink

DataWorks Summit

How We Optimize Spark SQL Jobs With parallel and sync IO

Databricks

Webinar: Detecting row patterns with Flink SQL - Dawid Wysakowicz

Ververica

Snowflake Architecture.pptx

chennakesava44

Make your PySpark Data Fly with Arrow!

Databricks

Light-weighted HDFS disaster recovery

DataWorks Summit

KafkaConsumer - Decoupling Consumption and Processing for Better Resource Uti...

confluent

Streams and Tables: Two Sides of the Same Coin (BIRTE 2018)

confluent

What's hot (20)

Common Strategies for Improving Performance on Your Delta Lakehouse

Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...

Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...

Deep Dive into the New Features of Apache Spark 3.0

Apache Iceberg: An Architectural Look Under the Covers

Apache Arrow Flight: A New Gold Standard for Data Transport

Running Spark in Production

Getting Started with Apache Spark on Kubernetes

Cassandra at eBay - Cassandra Summit 2012

[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic

Data Distribution and Ordering for Efficient Data Source V2

Optimizing S3 Write-heavy Spark workloads

Flexible and Real-Time Stream Processing with Apache Flink

How We Optimize Spark SQL Jobs With parallel and sync IO

Webinar: Detecting row patterns with Flink SQL - Dawid Wysakowicz

Snowflake Architecture.pptx

Make your PySpark Data Fly with Arrow!

Light-weighted HDFS disaster recovery

KafkaConsumer - Decoupling Consumption and Processing for Better Resource Uti...

Streams and Tables: Two Sides of the Same Coin (BIRTE 2018)

Viewers also liked

Understanding Memory Management In Spark For Fun And Profit

Spark Summit

Scaling Spark Workloads on YARN - Boulder/Denver July 2015

Mac Moore

Dynamically Allocate Cluster Resources to your Spark Application

DataWorks Summit

RDD

Tien-Yang (Aiden) Wu

Scheduling Policies in YARN

DataWorks Summit/Hadoop Summit

This document summarizes a presentation about scheduling policies in YARN. It discusses existing scheduling in YARN, adding new resource types and resource profiles, resource scheduling for services like affinity and anti-affinity, and a proposed new GUTS API to provide a unified approach for specifying resource requests and constraints. The new API aims to simplify expressing complex scheduling requirements and relationships between placements in a single request.

Apache Spark & Hadoop

MapR Technologies

http://bit.ly/1BTaXZP – Hadoop has been a huge success in the data world. It’s disrupted decades of data management practices and technologies by introducing a massively parallel processing framework. The community and the development of all the Open Source components pushed Hadoop to where it is now. That's why the Hadoop community is excited about Apache Spark. The Spark software stack includes a core data-processing engine, an interface for interactive querying, Sparkstreaming for streaming data analysis, and growing libraries for machine-learning and graph analysis. Spark is quickly establishing itself as a leading environment for doing fast, iterative in-memory and streaming analysis. This talk will give an introduction the Spark stack, explain how Spark has lighting fast results, and how it complements Apache Hadoop. Keys Botzum - Senior Principal Technologist with MapR Technologies Keys is Senior Principal Technologist with MapR Technologies, where he wears many hats. His primary responsibility is interacting with customers in the field, but he also teaches classes, contributes to documentation, and works with engineering teams. He has over 15 years of experience in large scale distributed system design. Previously, he was a Senior Technical Staff Member with IBM, and a respected author of many articles on the WebSphere Application Server as well as a book.

Spark-on-YARN: Empower Spark Applications on Hadoop Cluster

DataWorks Summit

This document discusses Apache Spark-on-YARN, which allows Spark applications to leverage existing Hadoop clusters. Spark improves efficiency over Hadoop via in-memory computing and supports rich APIs. Spark-on-YARN provides access to HDFS data and resources on Hadoop clusters without extra deployment costs. It supports running Spark jobs in YARN cluster and client modes. The document describes Yahoo's use of Spark-on-YARN for machine learning applications on large datasets.

Spark on YARN

Adarsh Pannu

Spark supports four cluster managers: Local, Standalone, YARN, and Mesos. YARN is highly recommended for production use. When running Spark on YARN, careful tuning of configuration settings like the number of executors, executor memory and cores, and dynamic allocation is important to optimize performance and resource utilization. Configuring queues also allows separating different applications by priority and resource needs.

Spark on Yarn

Qubole

Spark on Yarn allows for dynamic provisioning of resources by allowing the Spark application master to request additional executors from Yarn as needed and release idle executors. This helps optimize resource utilization in the Yarn cluster. Qubole provides interfaces like the command UI, REST APIs, and SDKs to easily submit Spark jobs to Yarn clusters managed in Qubole, and integrates Spark with Hive by configuring Spark programs to access the Hive metastore. Key challenges include ensuring low overhead from Yarn, handling cached data, and network performance between clusters and shared services.

Apache Spark RDDs

Dean Chen

Spark is a general engine for large-scale data processing. It introduces Resilient Distributed Datasets (RDDs) which allow in-memory caching for fault tolerance and act like familiar Scala collections for distributed computation across clusters. RDDs provide a programming model with transformations like map and reduce and actions to compute results. Spark also supports streaming, SQL, machine learning, and graph processing workloads.

Spark on yarn

datamantra

Blazing Performance with Flame Graphs

Brendan Gregg

Delivered as plenary at USENIX LISA 2013. video here: https://www.youtube.com/watch?v=nZfNehCzGdw and https://www.usenix.org/conference/lisa13/technical-sessions/plenary/gregg . "How did we ever analyze performance before Flame Graphs?" This new visualization invented by Brendan can help you quickly understand application and kernel performance, especially CPU usage, where stacks (call graphs) can be sampled and then visualized as an interactive flame graph. Flame Graphs are now used for a growing variety of targets: for applications and kernels on Linux, SmartOS, Mac OS X, and Windows; for languages including C, C++, node.js, ruby, and Lua; and in WebKit Web Inspector. This talk will explain them and provide use cases and new visualizations for other event types, including I/O, memory usage, and latency.

Viewers also liked (12)

Understanding Memory Management In Spark For Fun And Profit

Scaling Spark Workloads on YARN - Boulder/Denver July 2015

Dynamically Allocate Cluster Resources to your Spark Application

RDD

Scheduling Policies in YARN

Apache Spark & Hadoop

Spark-on-YARN: Empower Spark Applications on Hadoop Cluster

Spark on YARN

Spark on Yarn

Apache Spark RDDs

Spark on yarn

Blazing Performance with Flame Graphs

Similar to Dynamic Resource Allocation Spark on YARN

Introduction to YARN Apps

Cloudera, Inc.

Yarn

Yu Xia

YARN (Yet Another Resource Negotiator) is a resource management framework for Hadoop clusters that improves on the scalability limitations of the original MapReduce framework. YARN separates resource management from job scheduling to allow multiple data processing engines like MapReduce, Spark, and Storm to share common cluster resources. It introduces a new architecture with a ResourceManager to allocate resources among applications and per-application ApplicationMasters to manage containers and scheduling within an application. This provides improved scalability, utilization, and multi-tenancy for a variety of workloads compared to the original Hadoop architecture.

[db tech showcase Tokyo 2014] C32: Hadoop最前線 - 開発の現場から by NTT 小沢健史

Insight Technology, Inc.

次期リリースとなるApache Hadoop 2.6 は,2系リリース後の最大のアップデートと言えるほど新しい機能が目白押しです。本講演では、Hadoop開発者の視点からHadoop 2系の中心となる YARN に関する基本的な説明と、Apache Hadoop 2.6 でリリース予定の最新機能の紹介を行います。特に、当方が開発に関わっている YARN のマスタ高可用化の仕組みや、Hadoop 2系を運用する上で必須なYARNのリソース管理の方法について詳細に解説します。

Hadoop bangalore-meetup-dec-2011-hadoop nextgen

InMobi

The speaker outlines the limitations of the current Hadoop MapReduce framework and proposes a next generation architecture. The new architecture splits resource management and application lifecycle management into separate components for improved scalability. It also aims to improve availability, allow for wire compatibility, and provide support for additional programming paradigms beyond MapReduce through the use of application masters. The next generation version is currently in alpha testing on small clusters and moving to beta in 2012.

Hadoop fault-tolerance

Ravindra Bandara

The document discusses fault tolerance in Apache Hadoop. It describes how Hadoop handles failures at different layers through replication and rapid recovery mechanisms. In HDFS, data nodes regularly heartbeat to the name node, and blocks are replicated across racks. The name node tracks block locations and initiates replication if a data node fails. HDFS also supports name node high availability. In MapReduce v1, task and task tracker failures cause re-execution of tasks. YARN improved fault tolerance by removing the job tracker single point of failure.

Homologous Apache Spark Clusters Using Nomad with Alex Dadgar

Databricks

- Nomad is a cluster scheduler that makes deploying Spark clusters easy for developers and operationally simple. It allows Spark jobs to be deployed across multiple datacenters and regions. - Currently, Nomad allows running Spark in production environments without compromising functionality. It enables shared clusters for batch and streaming workloads with higher efficiency. It also integrates with Vault for secure secrets management. - Future enhancements may include preempting lower priority Spark executors, implementing quotas and chargebacks, enabling GPU acceleration, and allowing over-subscription of resources to improve cluster utilization. Nomad aims to make deploying and running Spark easier and more cost effective at scale.

Apache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLab

CloudxLab

Big Data with Hadoop & Spark Training: http://bit.ly/2IUsWca. This CloudxLab Running in a Cluster tutorial helps you to understand running Spark in the cluster in detail. Below are the topics covered in this tutorial: 1) Spark Runtime Architecture 2) Driver Node 3) Scheduling Tasks on Executors 4) Understanding the Architecture 5) Cluster Managers 6) Executors 7) Launching a Program using spark-submit 8) Local Mode & Cluster-Mode 9) Installing Standalone Cluster 10) Cluster Mode - YARN 11) Launching a Program on YARN 12) Cluster Mode - Mesos and AWS EC2 13) Deployment Modes - Client and Cluster 14) Which Cluster Manager to Use? 15) Common flags for spark-submit

Taming YARN @ Hadoop Conference Japan 2014

Tsuyoshi OZAWA

Anatomy of Hadoop YARN

Rajesh Ananda Kumar

YARN is a framework for job scheduling and cluster resource management. It improves on classic MapReduce by separating resource management from job scheduling and tracking. In YARN, a resource manager allocates containers for tasks from applications and monitors containers. An application master negotiates container resources and coordinates tasks within the application. Tasks execute in containers managed by node managers. The application's progress and completion are tracked and reported by the application master.

ApacheCon North America 2014 - Apache Hadoop YARN: The Next-generation Distri...

Zhijie Shen

For diverse organizations, Apache Hadoop has become the de facto place where data and computational resources are shared. This broad usage has stretched its design beyond its intended target. To address this, the Apache Hadoop community has come up with the next generation of Hadoop's compute platform: YARN. YARN in a nutshell is the distributed operating system of the big-data world. In this talk, we will introduce YARN, covering how the new architecture decouples the programming model from resource management, its scheduling functions, the platform's fault tolerance and high availability, and tools for application tracing and analysis. We will then discuss the exciting ecosystem of Apache Software Foundation projects forming around YARN. We will conclude with coverage of the applications and services being built around the YARN platform, which lets users choose the programming model of their choice, all on the same data.

Faster Batch Processing with Cloudera 5.7: Hive-on-Spark is ready for production

Cloudera, Inc.

It's no secret that Apache Spark is becoming the successor to MapReduce for data processing in Hadoop. With its easy development, flexible API, and performance benefits, Spark is a powerful data processing engine that has quickly gained popularity within the community. On the other hand, Hive continues to be the most widely used data warehouse/ETL engine, with large-scale adoption across enterprises. Therefore, it's imperative to enable Spark as the underlying execution engine for Hive so that existing and future Hive workloads can leverage the advantages of Spark. With the recent release of Cloudera 5.7, we have delivered on this goal by adding support for Hive-on-Spark. Data engineers and ETL developers can now transition from MR to Spark for their Hive workloads seamlessly, thereby benefiting from the advantages of Spark without any disruption on their end. Join Santosh Kumar, Senior Product Manager at Cloudera, and Rui Li, Apache Hive committer and engineer at Intel, as we discuss:
- An introduction to Spark and its advantages over MR
- An introduction to Hive-on-Spark: goals and design principles
- Migrating to HoS and a live demo
- Configuring and tuning for batch workloads
- What's next for both tools

Orchestrating Linux Containers while tolerating failures

Docker, Inc.

Although containers bring a refreshing flexibility when deploying services in production, managing those containers in such an environment still requires special care to keep the application up and running. In this regard, orchestration platforms like Docker, Kubernetes and Nomad have been trying to alleviate this responsibility, facilitating the task of deploying and maintaining the entire application stack in its desired state. This ensures that a service will always be running, tolerating machine failures, erratic network behavior, or software updates and downtime. The purpose of this talk is to explain the mechanisms and architecture of the Docker Engine orchestration platform (using a framework called swarmkit) to tolerate failures of services and machines, from cluster state replication and leader election to container re-scheduling logic when a host goes down.

Apachecon Hadoop YARN - Under The Hood (at ApacheCon Europe)

Sharad Agarwal

YARN is a new resource management architecture in Hadoop that provides improved scaling for large applications and high cluster utilization. It introduces the concept of separating resource management from job scheduling and tracking. This allows it to scale to larger clusters and support a wider variety of applications beyond just MapReduce. Key aspects of YARN include the use of an event-driven architecture for asynchronous processing of heartbeats, declarative state management for improved debuggability, and application master recovery for fault tolerance.

Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...

Spark Summit

This document summarizes Uber's use of Spark as a data platform to support multi-tenancy and various data applications. Key points include:
- Uber uses Spark on YARN for resource management and isolation between teams/jobs. Parquet is used as the columnar file format for performance and schema support.
- Challenges include sharing infrastructure between many teams with different backgrounds and use cases. Spark provides a common platform.
- An Uber Development Kit (UDK) is used to help users get Spark jobs running quickly on Uber's infrastructure, with templates, defaults, and APIs for common tasks.

Taming YARN @ Hadoop conference Japan 2014

Tsuyoshi OZAWA

Swarm migration

Janakiram MSV

Introduction to yarn

Bhupesh Chawda

Bhupesh Chawda introduces YARN, the next generation architecture in Hadoop that provides better resource management and the ability to run multiple distributed applications beyond just MapReduce. YARN separates resource management from job scheduling and tracking, addressing limitations of the original Hadoop architecture. It introduces the ResourceManager for cluster management and scheduling, NodeManagers to manage containers on each node, and ApplicationMasters to manage applications. This allows different distributed computing frameworks like Spark, Giraph, and Apex to operate on the same Hadoop cluster managed by YARN.

Apache Spark Core

Girish Khanzode

Productionizing Spark and the Spark Job Server

Evan Chan

You won't find this in many places - an overview of deploying, configuring, and running Apache Spark, including Mesos vs YARN vs Standalone clustering modes, useful config tuning parameters, and other tips from years of using Spark in production. Also, learn about the Spark Job Server and how it can help your organization deploy Spark as a RESTful service, track Spark jobs, and enable fast queries (including SQL!) of cached RDDs.

Productionizing Spark and the REST Job Server- Evan Chan

Spark Summit

The document discusses productionizing Apache Spark and using the Spark REST Job Server. It provides an overview of Spark deployment options like YARN, Mesos, and Spark Standalone mode. It also covers Spark configuration topics like jars management, classpath configuration, and tuning garbage collection. The document then discusses running Spark applications in a cluster using tools like spark-submit and the Spark Job Server. It highlights features of the Spark Job Server like enabling low-latency Spark queries and sharing cached RDDs across jobs. Finally, it provides examples of using the Spark Job Server in production environments.

More from Tsuyoshi OZAWA

YARN: a resource manager for analytic platform

Tsuyoshi OZAWA

The document discusses YARN, a resource manager for Apache Hadoop. It provides an overview of YARN and its key features: (1) managing resources in a cluster, (2) managing application history logs, and (3) a service registry mechanism. It then discusses how distributed processing frameworks like Tez and Spark work on YARN, focusing on their directed acyclic graph (DAG) models and techniques for improving performance on YARN like container reuse.

Spark shark

Tsuyoshi OZAWA

Fluent logger-scala

Tsuyoshi OZAWA

This document introduces fluent-logger-scala, a simple logger for Scala apps that sends logs to fluentd servers. It allows Scala objects to log to fluentd with just 3 lines of code added to build.sbt. The logger currently supports Scala 2.9.x and sbt 0.12.x, with a roadmap to support Scala 2.10 by using an alternative JSON serialization library instead of msgpack-scala. A demo is shown of how to start casually collecting logs from Scala apps.

Multilevel aggregation for Hadoop/MapReduce

Tsuyoshi OZAWA

The document proposes a multi-level aggregation approach for Hadoop MapReduce to reduce shuffle costs by combining map outputs at the node and rack level. A prototype showed a job was 1.7 times faster and restricted shuffle costs to 50% by having mappers call a combiner before outputs are shuffled. Future work includes adding fault tolerance and supporting frameworks like Pig and Hive. Feedback is welcomed on the approach.

Memcached as a Service for CloudFoundry

Tsuyoshi OZAWA

The document discusses implementing Memcached as a service (MaaS) for Cloud Foundry. NTT Communications developed a MaaS based on Redis that is available on GitHub. It supports basic resource restrictions and multiple instances. A pull request was submitted to integrate MaaS into Cloud Foundry but there has been no response from CloudFoundry teams. Future work includes SASL support and more configurable parameters.

First step for dynticks in FreeBSD

Tsuyoshi OZAWA

This document discusses implementing dynamic ticks in the FreeBSD kernel. Currently, the kernel handles timer interrupts periodically at a fixed frequency (HZ), which is expensive when the CPU is idle. Dynamic ticks would generate timer interrupts using a one-shot timer based on when the next timer event is scheduled to occur, reducing overhead when idle. The author has started implementing this by adding code to scan the callout queue and determine when the next timer needs to fire. When an idle process detects there is no work to do, it could trigger a mode transition from periodic to dynamic ticks until the next scheduled event.

Memory Virtualization

Tsuyoshi OZAWA

The document discusses virtualization techniques used in KVM. It describes how KVM uses shadow page tables to virtualize memory management. The shadow page tables allow virtual addresses used by a guest OS to be translated to physical addresses on the host machine. Different techniques for implementing shadow page tables are described, including pre-validation of guest page tables and using a virtual translation lookaside buffer to cache translations.

2nd BitVisor Reading Group, First Half: About Intel VT (第二回Bitvisor読書会前半 Intel-VT について)

Tsuyoshi OZAWA

This document discusses virtualization techniques such as Intel VT and VMX. It explains the ring protection model of x86 CPUs and how virtualization works by having a hypervisor sit at the highest ring/privilege level. Key virtualization concepts covered include VMX root/non-root operation, VMCS data structures, VM exits/entries, and instructions for accessing and modifying VMCS like VMPTRLD, VMPTRST, VMWRITE, VMREAD, VMCLEAR. Memory mapped and port IO virtualization techniques are also summarized.

2nd KVM Reading Group (第二回KVM読書会)

Tsuyoshi OZAWA

Let's Trace Through the Linux KVM Code (Linux KVM のコードを追いかけてみよう)

Tsuyoshi OZAWA

The document discusses Linux KVM (Kernel-based Virtual Machine) and how it enables full virtualization on x86 hardware. KVM uses Intel VT-x and AMD-V virtualization extensions to allow a Linux kernel to function as a hypervisor. Guest virtual machines see a bare metal interface while the host kernel manages scheduling and resource allocation. Qemu is used as a processor emulator to add missing guest architectures.

Dynamic Resource Allocation Spark on YARN

1. Dynamic Resource Allocation for Spark on YARN ozawa@apache.org Tsuyoshi Ozawa

2. What's YARN • A resource manager implementation for computer clusters

3. Hadoop Stack [Diagram layers: HDFS; YARN; MapReduce, Spark, Tez]

4. YARN overview • All resources are managed by the ResourceManager • All tasks are launched on NodeManagers • Clients submit jobs via the ResourceManager [Diagram: client, ResourceManager, two NodeManagers]

5. Spark on YARN • Two modes • yarn-cluster • yarn-client

6. yarn-cluster mode • Launches the Spark driver in a YARN container • Works well with spark-submit [Diagram: client, ResourceManager, and NodeManagers hosting container1, container2, and the Spark AppMaster with the Spark driver; steps: 1 submit, 2 launch master, 3 launch executors]

7. yarn-client mode • Launches the Spark driver on the client side • Works well with spark-shell [Diagram: client with the Spark driver, ResourceManager, and NodeManagers hosting container1, container2, and the Spark AppMaster; steps: 1 submit, 2 launch master, 3 launch executors, 4 send commands]

8. Spark on YARN • yarn-cluster mode [Diagram: Node1, Node2, and Node3 hosting the executor containers and the AppMaster]

9. Problem • Inefficient resource management • Containers cannot exit until the job exits [Diagram: four containers on Node1 and Node2; utilization in stage1: 100%, 100%, 100%, 100%; in stage2: 100%, 0%, 0%, 0%]

10. Dynamic resource allocation (since v1.2) • Allocates containers more dynamically • The number of executors is decided by the workload [Diagram: client, ResourceManager, and NodeManagers hosting container1, container2, and the Spark AppMaster with the Spark driver; steps: 1 submit, 2 launch master, 3 launch/kill executors]
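To make "decided by the workload" concrete, here is a minimal sketch of a workload-driven sizing rule. It is an illustration only, not Spark's actual implementation: the function name `target_executors` and all of its parameters are hypothetical, and the rule simply sizes the executor pool to the number of outstanding tasks, clamped to a configured range.

```python
import math


def target_executors(pending_tasks, running_tasks, tasks_per_executor,
                     min_executors, max_executors):
    """Hypothetical sizing rule: enough executors for all outstanding tasks,
    clamped to [min_executors, max_executors]."""
    outstanding = pending_tasks + running_tasks
    needed = math.ceil(outstanding / tasks_per_executor)
    return max(min_executors, min(max_executors, needed))


# Stage 1 from the "Problem" slide: four busy containers.
stage1 = target_executors(pending_tasks=0, running_tasks=4,
                          tasks_per_executor=1,
                          min_executors=0, max_executors=4)

# Stage 2: only one container still has work, so three can be released.
stage2 = target_executors(pending_tasks=0, running_tasks=1,
                          tasks_per_executor=1,
                          min_executors=0, max_executors=4)

print(stage1, stage2)
```

With static allocation, all four containers would stay allocated through stage 2 even though three of them sit at 0% utilization.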

11. Yak shaving • Where should we hold the state of Spark RDDs? • If executors are killed, it'll be lost… [Diagram: a NodeManager with two executors, each holding an RDD]

12. External shuffle • Saves Spark RDD state to the NodeManager • The NodeManager has an interface, the external shuffle plugin • Now executors are stateless! [Diagram: a NodeManager with two executors and the external shuffle plugin holding RDD (IntermediateFile) data]

13. How to install (with Apache Hadoop) • Copy the shuffle plugin to the NodeManager's classpath • Edit yarn-site.xml • Edit spark-defaults.conf

14. Copy the shuffle jar to the NodeManager's classpath $ cp lib/spark-*-yarn-shuffle.jar /home/ubuntu/hadoop/share/hadoop/yarn/

15. Edit yarn-site.xml • Add the shuffle plugin • Note that the documentation for 1.2 includes a typo - I PRed :-) • See the documentation for 1.4
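The slide's own snippet did not survive in the transcript. As a sketch based on the Spark on YARN documentation, registering the shuffle plugin as a NodeManager auxiliary service typically looks like the fragment below. The `mapreduce_shuffle` entry is an assumption about what the cluster already lists; keep whatever services are already configured and append `spark_shuffle`.

```xml
<!-- Sketch only: goes inside the <configuration> element of yarn-site.xml -->
<property>
  <name>yarn.nodemanager.aux-services</name>
  <!-- mapreduce_shuffle is assumed to be present already -->
  <value>mapreduce_shuffle,spark_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
  <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>
```

The NodeManagers need a restart to pick up the new auxiliary service.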

16. Edit spark-defaults.conf
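The slide's own snippet did not survive in the transcript. A minimal sketch of the relevant settings, using property names from the Spark configuration documentation, might look like this. The executor bounds are illustrative values chosen for this example, not the author's.

```
# Sketch only: enable dynamic allocation and the external shuffle service
spark.dynamicAllocation.enabled        true
spark.shuffle.service.enabled          true
# Illustrative bounds (assumed values); tune for the cluster
spark.dynamicAllocation.minExecutors   1
spark.dynamicAllocation.maxExecutors   10
```

Dynamic allocation depends on the external shuffle service, so the two `enabled` flags go together.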

17. We're ready!! • num-executors is defined automatically

18. Demo

19. Summary • Spark on YARN • yarn-client mode • yarn-cluster mode • Spark can launch jobs efficiently on YARN with dynamic allocation
