In this talk, we look at the YARN scheduler choices available today for Apache Hadoop 2 and discuss their pros and cons. We dive deeper into the Capacity Scheduler with a comprehensive overview of its settings, using examples from real large-scale Hadoop clusters to promote a broader understanding of the schedulers’ current state and the best practices in place today for queue nomenclature, planning, allocations, and ongoing management. We present detailed cluster, queue, and job behaviors from several different capacity management philosophies.
We then propose practical solutions, requiring no change to the scheduler or core Hadoop, that allow managing queue creation and capacity allocations while optimizing for cluster utilization and maintaining SLA guarantees. A unified queue nomenclature, together with admission and capacity re-allocation policies across BUs, applications, and clusters, makes service automation possible. Transparency in resources consumed allows for defining realistic SLA expectations. Finally, consistent application tagging completes the feedback loop with SLAs observed through application-level reporting.
This document discusses Yahoo's use of the Capacity Scheduler in Hadoop YARN to manage job scheduling and service level agreements (SLAs). It provides an overview of how Capacity Scheduler works, including how it tracks resources, configures queues with guaranteed minimum capacities, and uses parameters like minimum user limits, capacity, and maximum capacity to allocate resources fairly while meeting SLAs. The document is presented by Sumeet Singh and Nathan Roberts of Yahoo to provide insight into how Capacity Scheduler is used at Yahoo to manage their large Hadoop clusters processing over a million jobs per day.
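The three queue parameters named in this summary (capacity, maximum capacity, and minimum user limit) can be illustrated with a small Python sketch. It is a deliberately simplified model written for this page, not the Capacity Scheduler's actual code: the `Queue` class, the function names, and the per-user formula are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Queue:
    """Hypothetical, simplified view of one Capacity Scheduler queue."""
    name: str
    capacity_pct: float        # guaranteed share of the cluster
    max_capacity_pct: float    # elastic ceiling the queue may grow to
    min_user_limit_pct: float  # floor on each active user's share of the queue


def guaranteed_mb(queue: Queue, cluster_mb: float) -> float:
    """Memory the queue is guaranteed, regardless of other queues."""
    return cluster_mb * queue.capacity_pct / 100.0


def ceiling_mb(queue: Queue, cluster_mb: float) -> float:
    """Most memory the queue may borrow up to when the cluster is idle."""
    return cluster_mb * queue.max_capacity_pct / 100.0


def per_user_share_mb(queue: Queue, cluster_mb: float, active_users: int) -> float:
    """Assumed model: users split the guarantee evenly, but no user's
    share drops below the minimum user limit."""
    if active_users <= 0:
        return 0.0
    guarantee = guaranteed_mb(queue, cluster_mb)
    even_split = guarantee / active_users
    floor = guarantee * queue.min_user_limit_pct / 100.0
    return max(even_split, floor)


prod = Queue("prod", capacity_pct=40, max_capacity_pct=60, min_user_limit_pct=25)
```

With a 1,000 MB cluster, `prod` is guaranteed 400 MB and may grow to 600 MB. Two active users would each see 200 MB under this model, while eight active users would each be held at the 100 MB floor, which means only four of them can be served at once.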
Reservations Based Scheduling: if you’re late don’t blame us! DataWorks Summit
This document proposes techniques for teaching a resource manager about time-based constraints and deadlines. It summarizes an approach called reservation-based scheduling that allows jobs to declare their resource needs and deadlines. Key aspects include a Reservation Definition Language to specify jobs and their requirements, greedy allocation algorithms, a PlanFollower to dynamically enact the resource plan, and adapting to changing conditions over time. The approach aims to meet production job SLAs while improving latency for best-effort jobs and increasing cluster utilization.
This document provides best practices for YARN administrators and application developers. For administrators, it discusses YARN configuration, enabling ResourceManager high availability, configuring schedulers like Capacity Scheduler and Fair Scheduler, sizing containers, configuring NodeManagers, log aggregation, and metrics. For application developers, it discusses whether to use an existing framework or develop a native application, understanding YARN components, writing the client, and writing the ApplicationMaster.
This document provides an introduction and overview of YARN (Yet Another Resource Negotiator), a framework for job scheduling and cluster resource management in Apache Hadoop. It discusses limitations of the "classical" MapReduce framework and how YARN addresses these through its separation of scheduling and application execution responsibilities across a ResourceManager and per-application ApplicationMasters. Key aspects of YARN's architecture like NodeManagers and containers are also introduced.
Scale 12 x Efficient Multi-tenant Hadoop 2 Workloads with Yarn David Kaiser
Hadoop is about so much more than batch processing. With the recent release of Hadoop 2, there have been significant changes to how a Hadoop cluster uses resources. YARN, the new resource management component, allows for a more efficient mix of workloads across hardware resources, and enables new applications and new processing paradigms such as stream-processing. This talk will discuss the new design and components of Hadoop 2, and examples of Modern Data Architectures that leverage Hadoop for maximum business efficiency.
Operating multi-tenant clusters requires careful capacity planning so that big data projects and applications launch on time, within the expected budget, and with appropriate SLA guarantees. Making such guarantees with a set of standard hardware configurations is key to operating big data platforms as a hosted service for your organization.
This talk highlights the tools, techniques and methodology applied on a per-project or user basis across three primary multi-tenant deployments in the Apache Hadoop ecosystem, namely MapReduce/YARN and HDFS, HBase, and Storm due to the significance of capital investments with increasing scale in data nodes, region servers, and supervisor nodes respectively. We will demo the estimation tools developed for these deployments that can be used for capital planning and forecasting, and cluster resource and SLA management, including making latency and throughput guarantees to individual users and projects.
As we discuss the tools, we will share considerations that got incorporated to come up with the most appropriate calculation across these three primary deployments. We will discuss the data sources for calculations, resource drivers for different use cases, and how to plan for optimum capacity allocation per project with respect to given standard hardware configurations.
At the StampedeCon 2015 Big Data Conference: YARN enables Hadoop to move beyond pure batch processing. With that, multiple workloads and tenants must now be able to share a single infrastructure for data processing. Features of the Capacity Scheduler enable resource sharing among multiple tenants in a fair manner, with elastic queues to maximize utilization. This talk will focus on the features of the Capacity Scheduler that enable multi-tenancy and on how resource sharing can be rebalanced using features like preemption.
Vinod Kumar Vavilapalli and Jian He presented on Apache Hadoop YARN, the next generation architecture for Hadoop. They discussed YARN's role as a data operating system and resource management platform. They outlined YARN's current capabilities and highlighted several features in development, including resource manager high availability, the YARN timeline server, and improved scheduling. They also discussed how YARN enables new applications beyond MapReduce and the growing ecosystem of projects supported by YARN.
The document discusses enabling diverse workload scheduling in YARN. It covers several topics including node labeling, resource preemption, reservation systems, pluggable scheduler behavior, and Docker container support in YARN. The presenters are Wangda Tan and Craig Welch from Hortonworks who have experience with big data systems like Hadoop, YARN, and OpenMPI. They aim to discuss how these features can help different types of workloads like batch, interactive, and real-time jobs run together more happily in YARN.
Hadoop was originally designed for running large batch jobs, but users wanted to share clusters for better utilization and lower costs. Sharing requires a scheduler that provides guaranteed capacity for production jobs while also giving interactive jobs good response times. The Fair Scheduler was developed to address this by assigning jobs to pools that each get a minimum share of resources, with excess allocated fairly between pools. However, strictly following queues can hurt data locality. Delay Scheduling improves locality by relaxing the queues for a short time to allow more data-local scheduling opportunities.
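The delay scheduling idea in this summary can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Fair Scheduler's implementation: the job and task dictionaries, the `pick_task` function, and the `max_skips` threshold are hypothetical names chosen for the example.

```python
def pick_task(job, node, max_skips):
    """Decide what a job should run on a node that just became free.

    Returns (task, is_local), or None if the job should keep waiting
    for a node that holds its data.
    """
    if not job["pending"]:
        return None

    # Prefer a task whose input data lives on this node.
    for task in job["pending"]:
        if node in task["preferred_nodes"]:
            job["pending"].remove(task)
            job["skips"] = 0  # locality achieved, so reset the wait counter
            return task, True

    # No local task: give up on locality only after waiting long enough.
    if job["skips"] >= max_skips:
        return job["pending"].pop(0), False

    # Otherwise skip this offer and let another job use the node.
    job["skips"] += 1
    return None


job = {"pending": [{"id": "t1", "preferred_nodes": {"n1"}}], "skips": 0}
first = pick_task(job, "n2", max_skips=2)   # skipped: data is on n1
second = pick_task(job, "n2", max_skips=2)  # skipped again
third = pick_task(job, "n2", max_skips=2)   # waited long enough, runs non-locally
```

The trade-off sits in `max_skips`: a larger value buys more chances at a data-local slot at the cost of a longer wait before the job starts.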
Hadoop YARN is the next generation computing platform in Apache Hadoop with support for programming paradigms besides MapReduce. In the world of Big Data, one cannot solve all problems using the MapReduce programming model alone. Typical installations run separate programming models like MR, MPI, and graph-processing frameworks on individual clusters. Running fewer, larger clusters is cheaper than running more small clusters. Therefore, leveraging YARN to allow both MR and non-MR applications to run on top of a common cluster becomes more important from an economic and operational point of view. This talk will cover the different APIs and RPC protocols that are available for developers to implement new application frameworks on top of YARN. We will also go through a simple application which demonstrates how one can implement their own ApplicationMaster, schedule requests to the YARN ResourceManager, and then use the allocated resources to run user code on the NodeManagers.
This document discusses the integration of Apache Pig with Apache Tez. Pig provides a procedural scripting language for data processing workflows, while Tez is a framework for executing directed acyclic graphs (DAGs) of tasks. Migrating Pig to use Tez as its execution engine provides benefits like reduced resource usage, improved performance, and container reuse compared to Pig's default MapReduce execution. The document outlines the design changes needed to compile Pig scripts to Tez DAGs and provides examples and performance results. It also discusses ongoing work to achieve full feature parity with MapReduce and further optimize performance.
YARN (Yet Another Resource Negotiator) is a resource management framework for Hadoop clusters that improves on the scalability limitations of the original MapReduce framework. YARN separates resource management from job scheduling to allow multiple data processing engines like MapReduce, Spark, and Storm to share common cluster resources. It introduces a new architecture with a ResourceManager to allocate resources among applications and per-application ApplicationMasters to manage containers and scheduling within an application. This provides improved scalability, utilization, and multi-tenancy for a variety of workloads compared to the original Hadoop architecture.
As part of the recent release of Hadoop 2 by the Apache Software Foundation, YARN and MapReduce 2 deliver significant upgrades to scheduling, resource management, and execution in Hadoop.
At their core, YARN and MapReduce 2’s improvements separate cluster resource management capabilities from MapReduce-specific logic. YARN enables Hadoop to share resources dynamically between multiple parallel processing frameworks such as Cloudera Impala, allows more sensible and finer-grained resource configuration for better cluster utilization, and scales Hadoop to accommodate more and larger jobs.
YARN - Next Generation Compute Platform for Hadoop Hortonworks
YARN was developed as part of Hadoop 2.0 to address limitations in the original Hadoop 1.0 architecture. YARN introduces a centralized resource management framework to allow multiple data processing engines like MapReduce, interactive queries, graph processing, and stream processing to efficiently share common Hadoop cluster resources. It also improves cluster utilization and scalability, and supports multiple paradigms beyond just batch processing. Major companies like Yahoo have realized significant performance and resource utilization gains with YARN in production environments.
- The document discusses Apache Hadoop YARN, including its past, present, and future.
- In the past, YARN started as a sub-project of Hadoop and had several alpha and beta releases before the first stable release in 2013.
- Currently, YARN enables rolling upgrades, long running services, node labels, and improved cluster management features like preemption scheduling and fine-grained resource isolation.
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduce Mahantesh Angadi
The document summarizes a technical seminar presentation on scheduling methods in the Hadoop MapReduce framework. The presentation covers the motivation for Hadoop and MapReduce, provides an introduction to big data and Hadoop, and describes HDFS and the MapReduce programming model. It then discusses challenges in MapReduce scheduling and surveys the literature on existing scheduling methods. The presentation surveys five papers on proposed MapReduce scheduling methods, summarizing the key points of each. It concludes that improving data locality can enhance performance and that future work could consider scheduling algorithms for heterogeneous clusters.
Apache Hadoop: design and implementation. Lecture in the Big data computing course (http://twiki.di.uniroma1.it/twiki/view/BDC/WebHome), Department of Computer Science, Sapienza University of Rome.
The job throughput and Apache Hadoop cluster utilization benefits of YARN and MapReduce v2 are widely known. Who wouldn’t want job throughput increased by 2x? Most likely you’ve heard (repeatedly) about the key benefits that could be gained from migrating your Hadoop cluster from MapReduce v1 to YARN: namely around improved job throughput and cluster utilization, as well as around permitting different computational frameworks to run on Hadoop. What you probably haven’t heard about are the configuration tweaks needed to ensure your existing MR v1 jobs can run on your YARN cluster as well as YARN specific configuration settings. In this session we’ll start with a list of recommended YARN configurations, and then step through the most common use-cases we’ve seen in the field. Production migrations can quickly go awry without proper guidance. Learn from others’ misconfigurations to get your YARN cluster configured right the first time.
This document provides an overview of YARN (Yet Another Resource Negotiator), the resource management system for Hadoop. It describes the key components of YARN including the Resource Manager, Node Manager, and Application Master. The Resource Manager tracks cluster resources and schedules applications, while Node Managers monitor nodes and containers. Application Masters communicate with the Resource Manager to manage applications. YARN allows Hadoop to run multiple applications like Spark and HBase, improves on MapReduce scheduling, and transforms Hadoop into a distributed operating system for big data processing.
Resource Aware Scheduling for Hadoop [Final Presentation] Lu Wei
The document describes a resource-aware scheduler for Hadoop that aims to improve task scheduling by considering both job resource demands and node resource availability. It captures job and node profiles, estimates task execution times, and applies scheduling policies like shortest job first. Evaluation on word count and Pi estimation workloads showed the estimated task times closely matched the actual times, demonstrating the accuracy of the scheduler's resource modeling and estimations.
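As a rough illustration of combining job and node profiles with a shortest-job-first policy, consider the Python sketch below. The profile fields (`cpu_work`, `io_bytes`, `cpu_rate`, `io_rate`) and the estimation formula are assumptions made for this example; the summary does not state how the presented scheduler actually models execution time.

```python
def estimate_seconds(job, node):
    """Hypothetical estimate: the job takes as long as its slower
    dimension (CPU-bound or I/O-bound) on the given node."""
    cpu_time = job["cpu_work"] / node["cpu_rate"]
    io_time = job["io_bytes"] / node["io_rate"]
    return max(cpu_time, io_time)


def shortest_job_first(jobs, node):
    """Order jobs so the one with the smallest estimate runs first."""
    return sorted(jobs, key=lambda job: estimate_seconds(job, node))


node = {"cpu_rate": 2.0, "io_rate": 100.0}
jobs = [
    {"name": "a", "cpu_work": 10.0, "io_bytes": 100.0},  # CPU-bound: 5 s
    {"name": "b", "cpu_work": 2.0, "io_bytes": 800.0},   # I/O-bound: 8 s
    {"name": "c", "cpu_work": 4.0, "io_bytes": 100.0},   # CPU-bound: 2 s
]
order = [job["name"] for job in shortest_job_first(jobs, node)]
```

On this node the jobs would be ordered `c`, `a`, `b`. A different node profile can change the order, which is the point of making the scheduler aware of both sides.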
YARN - Presented At Dallas Hadoop User Group Rommel Garcia
This document provides an overview of YARN (Yet Another Resource Negotiator) in Hadoop 2.0. It discusses:
1) How YARN improves on Hadoop 1.X by allowing multiple applications to share cluster resources and enabling new types of applications beyond just MapReduce. YARN serves as the cluster resource manager.
2) Key YARN concepts like applications, containers, the resource manager, node manager, and application master. Containers are the basic unit of allocation that replace static map and reduce slots.
3) How MapReduce runs on YARN by using an application master and negotiating containers from the resource manager, rather than being tied to static slots. This improves efficiency.
Vinod Kumar Vavilapalli presented on Apache Hadoop YARN: Present and Future. He discussed how YARN improved on Hadoop 1 by separating resource management from processing, allowing multiple types of applications on the same platform. He summarized recent Hadoop releases including YARN enhancements like high availability and preemption. Future plans include improved isolation, multi-dimensional scheduling, and supporting long-running services. YARN aims to be a general resource management platform powering a growing ecosystem of applications beyond just MapReduce.
This document provides an overview of a Hadoop cluster deployment and configuration best practices from Rohith Sharma, Naganarasimha, and Sunil. It discusses:
1. Examples of YARN resource configurations for high-end node managers with 64GB RAM, 8-16 CPU cores, and 100TB of disk space.
2. Common YARN and MapReduce configuration parameters to tune resources like memory, CPU, and I/O.
3. Anti-patterns related to container memory allocation, long shuffle phases in MapReduce, and RM restarts impacting performance.
4. Best practices for queue configuration, capacity planning, user limits, and application priorities to improve cluster utilization.
Anti patterns in Hadoop Cluster deployment Sunil Govindan
Rohith Sharma, Naganarasimha, and Sunil presented on Hadoop cluster configurations and anti-patterns. They discussed sample node manager configurations with high resources, related YARN and MapReduce resource tuning settings, and anti-patterns like not configuring container heap size properly leading to out of memory errors. They also covered YARN capacity scheduler queue planning best practices like queue mapping, preemption, user limits, and application priority to improve cluster utilization.
The document discusses enabling diverse workload scheduling in YARN. It covers several topics including node labeling, resource preemption, reservation systems, pluggable scheduler behavior, and Docker container support in YARN. The presenters are Wangda Tan and Craig Welch from Hortonworks who have experience with big data systems like Hadoop, YARN, and OpenMPI. They aim to discuss how these features can help different types of workloads like batch, interactive, and real-time jobs run together more happily in YARN.
Hadoop was originally designed for running large batch jobs, but users wanted to share clusters for better utilization and lower costs. Sharing requires a scheduler that provides guaranteed capacity for production jobs while also giving interactive jobs good response times. The Fair Scheduler was developed to address this by assigning jobs to pools that each get a minimum share of resources, with excess allocated fairly between pools. However, strictly following queues can hurt data locality. Delay Scheduling improves locality by relaxing the queues for a short time to allow more data-local scheduling opportunities.
Hadoop YARN is the next generation computing platform in Apache Hadoop with support for programming paradigms besides MapReduce. In the world of Big Data, one cannot solve all the problems wholly using the Map Reduce programming model. Typical installations run separate programming models like MR, MPI, graph-processing frameworks on individual clusters. Running fewer larger clusters is cheaper than running more small clusters. Therefore,_leveraging YARN to allow both MR and non-MR applications to run on top of a common cluster becomes more important from an economical and operational point of view. This talk will cover the different APIs and RPC protocols that are available for developers to implement new application frameworks on top of YARN. We will also go through a simple application which demonstrates how one can implement their own Application Master, schedule requests to the YARN resource-manager and then subsequently use the allocated resources to run user code on the NodeManagers.
This document discusses the integration of Apache Pig with Apache Tez. Pig provides a procedural scripting language for data processing workflows, while Tez is a framework for executing directed acyclic graphs (DAGs) of tasks. Migrating Pig to use Tez as its execution engine provides benefits like reduced resource usage, improved performance, and container reuse compared to Pig's default MapReduce execution. The document outlines the design changes needed to compile Pig scripts to Tez DAGs and provides examples and performance results. It also discusses ongoing work to achieve full feature parity with MapReduce and further optimize performance.
YARN (Yet Another Resource Negotiator) is a resource management framework for Hadoop clusters that improves on the scalability limitations of the original MapReduce framework. YARN separates resource management from job scheduling to allow multiple data processing engines like MapReduce, Spark, and Storm to share common cluster resources. It introduces a new architecture with a ResourceManager to allocate resources among applications and per-application ApplicationMasters to manage containers and scheduling within an application. This provides improved scalability, utilization, and multi-tenancy for a variety of workloads compared to the original Hadoop architecture.
As part of the recent release of Hadoop 2 by the Apache Software Foundation, YARN and MapReduce 2 deliver significant upgrades to scheduling, resource management, and execution in Hadoop.
At their core, YARN and MapReduce 2’s improvements separate cluster resource management capabilities from MapReduce-specific logic. YARN enables Hadoop to share resources dynamically between multiple parallel processing frameworks such as Cloudera Impala, allows more sensible and finer-grained resource configuration for better cluster utilization, and scales Hadoop to accommodate more and larger jobs.
YARN - Next Generation Compute Platform fo HadoopHortonworks
YARN was developed as part of Hadoop 2.0 to address limitations in the original Hadoop 1.0 architecture. YARN introduces a centralized resource management framework to allow multiple data processing engines like MapReduce, interactive queries, graph processing, and stream processing to efficiently share common Hadoop cluster resources. It also improves cluster utilization, scalability, and supports multiple paradigms beyond just batch processing. Major companies like Yahoo have realized significant performance and resource utilization gains with YARN in production environments.
- The document discusses Apache Hadoop YARN, including its past, present, and future.
- In the past, YARN started as a sub-project of Hadoop and had several alpha and beta releases before the first stable release in 2013.
- Currently, YARN enables rolling upgrades, long running services, node labels, and improved cluster management features like preemption scheduling and fine-grained resource isolation.
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduceMahantesh Angadi
The document summarizes a technical seminar presentation on scheduling methods in the Hadoop MapReduce framework. The presentation covers the motivation for Hadoop and MapReduce, provides an introduction to big data and Hadoop, and describes HDFS and the MapReduce programming model. It then discusses challenges in MapReduce scheduling and surveys the literature on existing scheduling methods. The presentation surveys five papers on proposed MapReduce scheduling methods, summarizing the key points of each. It concludes that improving data locality can enhance performance and that future work could consider scheduling algorithms for heterogeneous clusters.
Apache Hadoop: design and implementation. Lecture in the Big data computing course (http://twiki.di.uniroma1.it/twiki/view/BDC/WebHome), Department of Computer Science, Sapienza University of Rome.
The job throughput and Apache Hadoop cluster utilization benefits of YARN and MapReduce v2 are widely known. Who wouldn’t want job throughput increased by 2x? Most likely you’ve heard (repeatedly) about the key benefits that could be gained from migrating your Hadoop cluster from MapReduce v1 to YARN: namely around improved job throughput and cluster utilization, as well as around permitting different computational frameworks to run on Hadoop. What you probably haven’t heard about are the configuration tweaks needed to ensure your existing MR v1 jobs can run on your YARN cluster as well as YARN specific configuration settings. In this session we’ll start with a list of recommended YARN configurations, and then step through the most common use-cases we’ve seen in the field. Production migrations can quickly go awry without proper guidance. Learn from others’ misconfigurations to get your YARN cluster configured right the first time.
This document provides an overview of YARN (Yet Another Resource Negotiator), the resource management system for Hadoop. It describes the key components of YARN including the Resource Manager, Node Manager, and Application Master. The Resource Manager tracks cluster resources and schedules applications, while Node Managers monitor nodes and containers. Application Masters communicate with the Resource Manager to manage applications. YARN allows Hadoop to run multiple applications like Spark and HBase, improves on MapReduce scheduling, and transforms Hadoop into a distributed operating system for big data processing.
Resource Aware Scheduling for Hadoop [Final Presentation]Lu Wei
The document describes a resource-aware scheduler for Hadoop that aims to improve task scheduling by considering both job resource demands and node resource availability. It captures job and node profiles, estimates task execution times, and applies scheduling policies like shortest job first. Evaluation on word count and Pi estimation workloads showed the estimated task times closely matched the actual times, demonstrating the accuracy of the scheduler's resource modeling and estimations.
YARN - Presented At Dallas Hadoop User GroupRommel Garcia
This document provides an overview of YARN (Yet Another Resource Negotiator) in Hadoop 2.0. It discusses:
1) How YARN improves on Hadoop 1.X by allowing multiple applications to share cluster resources and enabling new types of applications beyond just MapReduce. YARN serves as the cluster resource manager.
2) Key YARN concepts like applications, containers, the resource manager, node manager, and application master. Containers are the basic unit of allocation that replace static map and reduce slots.
3) How MapReduce runs on YARN by using an application master and negotiating containers from the resource manager, rather than being tied to static slots. This improves efficiency.
Vinod Kumar Vavilapalli presented on Apache Hadoop YARN: Present and Future. He discussed how YARN improved on Hadoop 1 by separating resource management from processing, allowing multiple types of applications on the same platform. He summarized recent Hadoop releases including YARN enhancements like high availability and preemption. Future plans include improved isolation, multi-dimensional scheduling, and supporting long-running services. YARN aims to be a general resource management platform powering a growing ecosystem of applications beyond just MapReduce.
This document provides an overview of a Hadoop cluster deployment and configuration best practices from Rohith Sharma, Naganarasimha, and Sunil. It discusses:
1. Examples of YARN resource configurations for high-end node managers with 64GB RAM, 8-16 CPU cores, and 100TB of disk space.
2. Common YARN and MapReduce configuration parameters to tune resources like memory, CPU, and I/O.
3. Anti-patterns related to container memory allocation, long shuffle phases in MapReduce, and RM restarts impacting performance.
4. Best practices for queue configuration, capacity planning, user limits, and application priorities to improve cluster utilization.
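To make the node sizing discussion concrete, here is a minimal sketch of how many fixed-size containers might fit on one node manager. It assumes both memory and vcores are enforced, and that some memory is held back for the operating system and daemons. The reserved amount and the container size are illustrative assumptions, not values from the talk. The real settings involved are `yarn.nodemanager.resource.memory-mb` and `yarn.nodemanager.resource.cpu-vcores`.

```python
def containers_per_node(
    node_mem_mb: int,
    node_vcores: int,
    container_mem_mb: int,
    container_vcores: int,
    reserved_mem_mb: int = 8192,  # assumption: memory held back for OS and daemons
) -> int:
    """Estimate how many identical containers fit on a single node."""
    usable_mem_mb = node_mem_mb - reserved_mem_mb
    if usable_mem_mb <= 0:
        return 0
    by_memory = usable_mem_mb // container_mem_mb
    by_cpu = node_vcores // container_vcores
    # The scarcer of the two resources limits the container count.
    return min(by_memory, by_cpu)


# A 64 GB, 16-core node running hypothetical 4 GB / 1-vcore containers.
fit = containers_per_node(64 * 1024, 16, 4096, 1)
print(fit)
```

In this example memory is the limiting resource, so two of the sixteen cores would sit idle unless the container size is adjusted.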
Anti patterns in Hadoop Cluster deployment Sunil Govindan
Rohith Sharma, Naganarasimha, and Sunil presented on Hadoop cluster configurations and anti-patterns. They discussed sample node manager configurations with high resources, related YARN and MapReduce resource tuning settings, and anti-patterns like not configuring container heap size properly leading to out of memory errors. They also covered YARN capacity scheduler queue planning best practices like queue mapping, preemption, user limits, and application priority to improve cluster utilization.
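The heap-size anti-pattern can be illustrated with a small sketch that derives a JVM `-Xmx` value from the container size, leaving headroom for non-heap memory. The 0.8 fraction is a commonly used rule of thumb and is an assumption here, not a figure from the presentation. The helper name is hypothetical. `mapreduce.map.memory.mb` and `mapreduce.map.java.opts` are the MapReduce properties the two values would normally be set in.

```python
def java_opts_for_container(container_mb: int, heap_fraction: float = 0.8) -> str:
    """Return a -Xmx flag sized as a fraction of the container memory.

    heap_fraction is an assumed rule of thumb; the remainder is headroom
    for JVM overhead such as thread stacks and direct buffers.
    """
    if not 0 < heap_fraction < 1:
        raise ValueError("heap_fraction must be between 0 and 1")
    heap_mb = int(container_mb * heap_fraction)
    return f"-Xmx{heap_mb}m"


map_memory_mb = 4096  # would go into mapreduce.map.memory.mb
map_java_opts = java_opts_for_container(map_memory_mb)  # mapreduce.map.java.opts
print(map_memory_mb, map_java_opts)
```

If the heap were left at a default that exceeds the container size, the node manager could kill the container for exceeding its memory limit. A heap that is far too small for the task leads to out-of-memory errors instead.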
A sdn based application aware and network provisioning Stanley Wang
The document discusses application aware SDN network provisioning. It begins with an overview of YARN architecture in Hadoop, including its benefits over earlier Hadoop architectures like improved scalability and utilization. It then discusses how SDN can be integrated with big data and cloud computing workloads by optimizing network topology and routing based on traffic patterns. Two approaches are proposed - reactive, where the SDN controller learns patterns from job logs/endpoints and modifies paths, and proactive where applications directly inform the network of intent. Finally, it proposes a service profile based SDN platform that uses network profiles and APIs to declaratively define logical topologies and provide network services and abstractions to applications.
IRJET-Framework for Dynamic Resource Allocation and Efficient Scheduling Stra... IRJET Journal
This document discusses a framework for dynamic resource allocation and efficient scheduling strategies in cloud computing platforms for high-performance computing (HPC). It proposes using a parallel genetic algorithm to find optimal allocation of virtual machines to physical resources in order to maximize resource utilization. The algorithm represents the resource allocation problem as an unbalanced job scheduling problem. It uses genetic operators like mutation and crossover to efficiently allocate requests for resources to idle nodes. Compared to a traditional genetic algorithm, the parallel genetic algorithm improves the speed of finding the best allocation and increases resource utilization. Future work could explore implementing dynamic load balancing and using big data concepts on the cloud.
Resource Aware Scheduling in Storm (Hadoop Summit 2016) Boyang Jerry Peng
This presentation discusses resource-aware scheduling in Apache Storm. It introduces the Resource Aware Scheduler (RAS), which aims to improve cluster resource utilization and topology performance in Storm. RAS allows fine-grained control of resource requirements for Storm components. It includes pluggable scheduling strategies that consider resource availability and topology priorities when scheduling work. Preliminary results from Yahoo Storm clusters show RAS improved throughput by 47-50% compared to the default scheduler. Future work includes improved scheduling strategies and real-time resource monitoring.
Dache - a data aware cache system for big-data applications using the MapReduce framework.
Dache aims to extend the MapReduce framework and provision a cache layer for efficiently identifying and accessing cache items in a MapReduce job.
Cache mechanism to avoid dulpication of same thing in hadoop system to speed ... eSAT Journals
This document proposes mechanisms to improve the efficiency of the Hadoop distributed file system and MapReduce framework. It suggests using locality-sensitive hashing to colocate related files on the same data nodes, which would improve data locality. It also proposes implementing a cache to store the results of MapReduce tasks, so that duplicate computations can be avoided when the same task is run again on the same data. Implementing these mechanisms could help speed up execution times in Hadoop by reducing unnecessary data transmission and repetitive task executions.
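The result-caching idea can be sketched as a lookup keyed by the task name and a digest of its input, so that a repeated task on identical data returns the stored result instead of recomputing it. The `TaskResultCache` class and the `count_words` task are hypothetical illustrations. This is an in-memory sketch and does not reflect how the paper integrates the cache with Hadoop.

```python
import hashlib
from typing import Callable, Dict, Tuple


class TaskResultCache:
    """In-memory cache of task results keyed by (task name, input digest)."""

    def __init__(self) -> None:
        self._store: Dict[Tuple[str, str], object] = {}
        self.hits = 0
        self.misses = 0

    def run(self, task_name: str, data: bytes, task: Callable[[bytes], object]) -> object:
        # Identical input bytes produce the same digest and therefore the same key.
        key = (task_name, hashlib.sha256(data).hexdigest())
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = task(data)
        self._store[key] = result
        return result


def count_words(data: bytes) -> int:
    """Toy stand-in for a map/reduce computation."""
    return len(data.split())


cache = TaskResultCache()
first = cache.run("count_words", b"a b c a", count_words)   # computed
second = cache.run("count_words", b"a b c a", count_words)  # served from cache
print(first, second, cache.hits, cache.misses)
```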
Guide to Application Performance: Planning to Continued Optimization MuleSoft
Supporting everything from mobile apps with thousands of concurrent users to global deployments processing millions of requests daily, Anypoint Platform has been put to test. In this session, MuleSoft experts will talk through case studies from our most demanding deployments and provide a best practice approach to designing and tuning applications for optimal performance.
This document provides an overview of resource aware scheduling in Apache Storm. It discusses the challenges of scheduling Storm topologies at Yahoo scale, including increasing heterogeneous clusters, low cluster utilization, and unbalanced resource usage. It then introduces the Resource Aware Scheduler (RAS) built for Storm, which allows fine-grained resource control and isolation for topologies through APIs and cgroups. Key features of RAS include pluggable scheduling strategies, per user resource guarantees, and topology priorities. Experimental results from Yahoo Storm clusters show significant improvements to throughput and resource utilization with RAS. The talk concludes with future work on improved scheduling strategies and real-time resource monitoring.
Hadoop is a software framework that allows for distributed processing of large data sets across clusters of computers. It uses MapReduce as a programming model and HDFS for storage. MapReduce divides applications into parallelizable map and reduce tasks that process key-value pairs across large datasets in a reliable and fault-tolerant manner. HDFS stores multiple replicas of data blocks for reliability and allows processing of data in parallel on nodes where the data is located. Hadoop can reliably store and process petabytes of data on thousands of low-cost commodity hardware nodes.
Hadoop is a software framework that allows for distributed processing of large data sets across clusters of computers. It uses MapReduce and HDFS to parallelize tasks, distribute data storage, and provide fault tolerance. Applications of Hadoop include log analysis, data mining, and machine learning using large datasets at companies like Yahoo!, Facebook, and The New York Times.
In recent times, the YARN Capacity Scheduler has improved considerably through new critical features and refactoring. Here is a quick look at some of the recent changes in the scheduler:
Global Scheduling Support
General placement support
Better preemption model to handle resource anomalies across and within queues.
Absolute resource configuration support
Priority support between Queues and Applications
In this talk, we will deep dive into each of these new features to give a better picture of their usage and performance comparison. We will also provide a brief overview of the ongoing efforts and how they can help to solve some of the core issues we face today.
Speakers:
Sunil Govind (Hortonworks), Jian He (Hortonworks)
Cloud-Native Apache Spark Scheduling with YuniKorn Scheduler Databricks
Kubernetes is the most popular container orchestration system natively designed for the cloud. At Lyft and Cloudera, we have both built next-generation, cloud-native infrastructure based on Kubernetes that supports various distributed workloads.
This document provides an overview of Hadoop and its ecosystem. It describes the key components of Hadoop including HDFS, MapReduce, YARN and various schedulers. It explains the architecture and functions of HDFS, MapReduce and YARN. It also summarizes the different schedulers in Hadoop including FIFO, Fair and Capacity schedulers.
This document contains a resume and profile for Harish Poojary, an Oracle database administrator. It outlines his objective to work for a progressive organization, provides details of his Oracle database administration experience over 6+ years including with Oracle 9i, 10g, 11g and Exadata. It also lists his technical skills, projects handled, and education and certification details.
This document provides an overview of cloud computing research being conducted at UC Berkeley. It discusses the goals of the Reliable Adaptive Distributed Systems (RAD) Lab to enable one person to develop, deploy, and operate next-generation internet applications using statistical machine learning. The document outlines the timeline and topics to be covered in a course on cloud computing, including the history, modern approaches, and infrastructure of cloud computing. It also summarizes research on Nexus, a common substrate that allows multiple cluster computing frameworks like Hadoop and MPI to efficiently share resources.
Nexus is a system that allows multiple distributed computing frameworks like Hadoop and MPI to efficiently share cluster resources. It uses a fine-grained resource sharing approach, offering individual tasks to frameworks rather than entire machines. This approach maximizes cluster utilization. Nexus also aims to allocate resources fairly between frameworks using a dominant resource fairness policy. Experiments show Nexus imposes low overhead while enabling dynamic sharing of resources between multiple Hadoop deployments and efficient elastic web serving.
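The dominant resource fairness policy mentioned above can be sketched as a loop that repeatedly offers the next task to the framework with the smallest dominant share, where a framework's dominant share is its largest fractional use of any single resource. This is a simplified illustration rather than the Nexus implementation. The framework names, demands, and cluster capacity are illustrative. The loop stops as soon as the selected framework's next task no longer fits.

```python
from fractions import Fraction
from typing import Dict, Tuple

Resources = Tuple[int, int]  # (cpus, memory_gb)


def drf_allocate(capacity: Resources, demands: Dict[str, Resources]) -> Dict[str, int]:
    """Allocate tasks with dominant resource fairness.

    Assumes every capacity entry is positive and every per-task demand
    requests a positive amount of at least one resource.
    """
    used = [0, 0]
    tasks = {name: 0 for name in demands}

    def dominant_share(name: str) -> Fraction:
        # Largest fraction of any one resource currently held by this framework.
        return max(
            Fraction(tasks[name] * demand, total)
            for demand, total in zip(demands[name], capacity)
        )

    while True:
        name = min(demands, key=dominant_share)
        demand = demands[name]
        if any(u + d > c for u, d, c in zip(used, demand, capacity)):
            return tasks  # the next task does not fit; treat the cluster as full
        for i, d in enumerate(demand):
            used[i] += d
        tasks[name] += 1


# Framework A is memory-heavy, framework B is CPU-heavy.
allocation = drf_allocate((9, 18), {"A": (1, 4), "B": (3, 1)})
print(allocation)
```

With these numbers both frameworks end up with a dominant share of two thirds: A holds 12 of 18 GB and B holds 6 of 9 CPUs.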
Bikas saha:the next generation of hadoop– hadoop 2 and yarn hdhappy001
The document discusses Apache YARN, which is the next-generation resource management platform for Apache Hadoop. YARN was designed to address limitations of the original Hadoop 1 architecture by supporting multiple data processing models (e.g. batch, interactive, streaming) and improving cluster utilization. YARN achieves this by separating resource management from application execution, allowing various data processing engines like MapReduce, HBase and Storm to run natively on Hadoop frames. This provides a flexible, efficient and shared platform for distributed applications.
YARN - Hadoop Next Generation Compute Platform Bikas Saha
The presentation emphasizes the new mental model of YARN being the cluster OS where one can write and run different applications in Hadoop in a cooperative multi-tenant cluster.
Similar to Hadoop Summit San Jose 2015: Towards SLA-based Scheduling on YARN Clusters (20)
This document discusses Hadoop at Yahoo, including:
- Yahoo has built a large multi-tenant Apache Hadoop deployment that powers many of its businesses and use cases.
- Over the years, Yahoo has scaled its Hadoop infrastructure significantly, now consisting of over 50,000 servers and 50PB of storage.
- Yahoo uses Hadoop for a wide range of use cases across advertising, search, personalization, anti-spam, and more, processing data at massive scales of billions of records daily.
Keynote Hadoop Summit San Jose 2017 : Shaping Data Platform To Create Lasting... Sumeet Singh
With a long history of open innovation with Hadoop, Yahoo continues to invest in and expand the platform capabilities by pushing the boundaries of what the platform can accomplish for the entire organization. In the last 11 years (yes, it is that old!), the Hadoop platform has shown no signs of giving up or giving in. In this talk, we explore what makes the shared multi-tenant Hadoop platform so special at Yahoo.
Hadoop Summit Dublin 2016: Hadoop Platform at Yahoo - A Year in Review Sumeet Singh
Over the past year, a lot of progress has been made in advancing the Apache Hadoop platform at Yahoo. We underwent a massive infrastructure consolidation to lower the platform TCO. CaffeOnSpark was open-sourced for distributed deep learning on existing infrastructure with a combination of CPU and GPU-based computing. Traditional compute on MapReduce continues to shift to Apache Tez and Apache Spark for lower processing time. Our internal security, multi-tenancy, and scale changes to Apache Storm got pushed to the community in Storm 0.10. Omid was open-sourced for managing transactions reliably on Apache HBase. Multi-tenancy with region groups, splittable META, ZooKeeper-less assignment manager, favored nodes with HDFS block placement, and support for humongous tables have taken Apache HBase scale to new heights. Dependency management in Apache Oozie for combinatorial, conditional, and optional processing gives increased flexibility to our data pipelines teams in maintaining SLAs. Focus on ease of use and onboarding improvements have brought in a whole new class of use cases and users to the platform. In this talk, we will provide a comprehensive overview of the platform technology stack, recent developments, metrics, and share thoughts on where things are headed when it comes to big data at Yahoo.
With a long history of open innovation with Hadoop, Yahoo continues to invest in and expand the platform capabilities by pushing the boundaries of what the platform can accomplish for the entire organization. In this talk, Sumeet Singh will present some of the recent innovations, open source contributions, and where things are headed when it comes to Hadoop at Yahoo.
Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...Sumeet Singh
This document discusses lessons learned from building a scalable, self-serve, real-time, multi-tenant monitoring service at Yahoo. It describes transitioning from a classical architecture to one based on real-time big data technologies like Storm and Kafka. Key lessons include properly handling producer-consumer problems at scale, challenges of debugging skewed data, strategically managing multi-tenancy and resources, issues optimizing asynchronous systems, and not neglecting assumptions outside the application.
HUG Meetup 2013: HCatalog / Hive Data Out Sumeet Singh
The Yahoo! Hadoop grid makes use of a managed service to get data pulled into the clusters. However, when it comes to getting data out of the clusters, the choices are limited to proxies such as HDFSProxy and HTTPProxy. With the introduction of HCatalog services, customers of the grid now have their data represented in a central metadata repository. HCatalog abstracts out file locations and the underlying storage format of data for the users, along with several other advantages such as sharing of data among MapReduce, Pig, and Hive. In this talk, we will focus on how the ODBC/JDBC interface of HiveServer2 addresses the use case of getting data out of the clusters when HCatalog is in use and users no longer want to worry about the files, partitions and their location. We will also demo the data-out capabilities, and go through other nice properties of the data-out feature.
Presenter(s):
Sumeet Singh, Senior Director, Product Management, Yahoo!
Chris Drome, Technical Yahoo!
Hadoop Summit San Jose 2014: Costing Your Big Data Operations Sumeet Singh
As organizations begin to make use of large data sets, approaches to understand and manage true costs of big data will become an important facet with increasing scale of operations.
Whether an on-premise or cloud-based platform is used for storing, processing and analyzing data, our approach explains how to calculate the total cost of ownership (TCO), develop a deeper understanding of compute and storage resources, and run the big data operations with its own P&L, full transparency in costs, and with metering and billing provisions. While our approach is generic, we will illustrate the methodology with three primary deployments in the Apache Hadoop ecosystem, namely MapReduce and HDFS, HBase, and Storm due to the significance of capital investments with increasing scale in data nodes, region servers, and supervisor nodes respectively.
As we discuss our approach, we will share insights gathered from the exercise conducted on one of the largest data infrastructures in the world. We will illustrate how to organize cluster resources, compile data required and typical sources, develop TCO models tailored for individual situations, derive unit costs of usage, measure resources consumed, optimize for higher utilization and ROI, and benchmark the cost.
Hadoop Summit San Jose 2014: Data Discovery on Hadoop Sumeet Singh
In the last eight years, the Hadoop grid infrastructure has allowed us to move towards a unified source of truth for all data at Yahoo that now accounts for over 450 petabytes of raw HDFS and 1.1 billion data files. Managing data location, schema knowledge and evolution, fine-grained business rules based access control, and audit and compliance needs have become critical with the increasing scale of operations.
In this talk, we will share our approach in tackling the above challenges with Apache HCatalog, a table and storage management layer for Hadoop. We will explain how to register existing HDFS files into HCatalog, provide broader but controlled access to data through a data discovery tool, and leverage existing Hadoop ecosystem components like Pig, Hive, HBase and Oozie to seamlessly share data across applications. Integration with data movement tools automates the availability of new data into HCatalog. In addition, the approach allows ever improving Hive performance to open up easy adhoc access to analyze and visualize data through SQL on Hadoop and popular BI tools.
As we discuss our approach, we will also highlight how it minimizes data duplication, eliminates wasteful data retention, and solves for data provenance, lineage and integrity.
Strata Conference + Hadoop World San Jose 2015: Data Discovery on Hadoop Sumeet Singh
Hadoop has allowed us to move towards a unified source of truth for all of an organization’s data. Managing data location, schema knowledge and evolution, fine-grained business rules based access control, and audit and compliance needs will become critical with increasing scale of operations.
In this talk, we will share an approach in tackling the above challenges. We will explain how to register existing HDFS files, provide broader but controlled access to data through a data discovery tool with schema browse and search functionality, and leverage existing Hadoop ecosystem components like Pig, Hive, HBase and Oozie to seamlessly share data across applications. Integration with data movement tools automates the availability of new data. In addition, the approach allows us to open up easy adhoc access to analyze and visualize data through SQL on Hadoop and popular BI tools. As we discuss our approach, we will also highlight how our approach minimizes data duplication, eliminates wasteful data retention, and solves for data provenance, lineage and integrity.
URL: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/38768
Hadoop Summit Brussels 2015: Architecting a Scalable Hadoop Platform - Top 10...Sumeet Singh
Since 2006, Hadoop and its ecosystem components have evolved into a platform that Yahoo has begun to trust for running its businesses globally. Hadoop’s scalability, efficiency, built-in reliability, and cost effectiveness have made it an enterprise-wide platform that web-scale cloud operations run on. In this talk, we will take a broad look at some of the top software, hardware, and services considerations that have gone in to make the platform indispensable for nearly 1,000 active developers on a daily basis, including the challenges that come from scale, security and multi-tenancy we have dealt with in the last several years of operating one of the largest Hadoop footprints in the world. We will cover the current technology stack that Yahoo has built or assembled, infrastructure elements such as configurations, deployment models, and network, and what it takes to offer hosted Hadoop services to a large customer base at Yahoo. Throughout the talk, we will highlight relevant use cases from Yahoo’s Mobile, Search, Advertising, Personalization, Media, and Communications businesses that may make these considerations more pertinent to your situation.
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...Sumeet Singh
Since 2006, Hadoop and its ecosystem components have evolved into a platform that Yahoo has begun to trust for running its businesses globally. In this talk, we will take a broad look at some of the top software, hardware, and services considerations that have gone in to make the platform indispensable for nearly 1,000 active developers, including the challenges that come from scale, security and multi-tenancy. We will cover the current technology stack that we have built or assembled, infrastructure elements such as configurations, deployment models, and network, and what it takes to offer hosted Hadoop services to a large customer base.
Hadoop Summit San Jose 2013: Compression Options in Hadoop - A Tale of Tradeo...Sumeet Singh
Yahoo! is one of the most-visited web sites in the world. It runs one of the largest private cloud infrastructures, one that operates on petabytes of data every day. Being able to store and manage that data well is essential to the efficient functioning of Yahoo's Hadoop clusters. A key component that enables this efficient operation is data compression.
With regard to compression algorithms, there is an underlying tension between compression ratio and compression performance. Consequently, Hadoop provides support for several compression algorithms, including gzip, bzip2, Snappy, LZ4 and others. This plethora of options can make it difficult for users to select appropriate codecs for their MapReduce jobs. This paper attempts to provide guidance in that regard. Performance results with Gridmix and with several corpuses of data are presented.
The paper also describes enhancements we have made to the bzip2 codec that improve its performance. This will be of particular interest to the increasing number of users operating on "Big Data" who require the best possible ratios. The impact of using the Intel IPP libraries is also investigated; these have the potential to improve performance significantly. Finally, a few proposals for future enhancements to Hadoop in this area are outlined.
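The tension between compression ratio and speed can be observed with a small measurement harness. The sketch below uses only codecs from the Python standard library (`zlib`, which implements the DEFLATE algorithm that gzip uses, and `bz2`). It does not use Hadoop's codec classes, and Snappy and LZ4 are left out because they are not part of the standard library. The sample data is a synthetic, highly repetitive log line, so the numbers it prints are not representative of the paper's Gridmix results.

```python
import bz2
import time
import zlib
from typing import Callable, Dict, Union


def measure(name: str, compress: Callable[[bytes], bytes], data: bytes) -> Dict[str, Union[str, float]]:
    """Compress data once and report the compression ratio and elapsed time."""
    start = time.perf_counter()
    packed = compress(data)
    elapsed = time.perf_counter() - start
    return {"codec": name, "ratio": len(data) / len(packed), "seconds": elapsed}


# Synthetic sample: one hypothetical log line repeated many times.
sample = b"2015-06-09 INFO scheduler queue=default user=alice containers=12\n" * 20000

results = [
    measure("zlib (DEFLATE)", zlib.compress, sample),
    measure("bz2", bz2.compress, sample),
]
for row in results:
    print(f"{row['codec']:<16} ratio={row['ratio']:.1f} seconds={row['seconds']:.4f}")
```

Running the same harness on data that resembles the actual workload is what makes such a comparison meaningful, since both ratio and speed depend heavily on the input.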
SAP Technology Services Conference 2013: Big Data and The Cloud at Yahoo! Sumeet Singh
The Hadoop project is an integral part of Yahoo!'s cloud infrastructure and is at the heart of many of Yahoo!'s important business processes. Sumeet Singh, the Head of Products for Cloud Services and Hadoop at Yahoo!, explains how Yahoo! leverages Hadoop and Cloud Platforms to process and serve Internet-scale data.
Yahoo! operates one of the world's largest private cloud infrastructures. Learn how technologies scale out for building enterprise-wide trusted platforms with tight SLAs.
URL: http://www.saptechnologyservice.com/track1.html
Strata Conference + Hadoop World NY 2013: Running On-premise Hadoop as a Busi...Sumeet Singh
Cloud-based architectures of Hadoop have made it attractive for public cloud service providers to offer hosted Hadoop services and charge customers on a pay-for-what-you-use basis. For enterprises that have already adopted Hadoop, the data infrastructure has long been seen as a cost element in their budgets. As a result, enterprises thinking of adopting Hadoop are increasingly debating between on-premise and cloud-based models for their data processing needs.
We lay out a set of criteria and methodical approaches to help enterprises that have not yet adopted Hadoop evaluate their options, and discuss the pros and cons of both models. For enterprises that have already made significant investments or have plans to build a Hadoop-based infrastructure, we present an approach to manage Hadoop as a Service with a P&L, transparency in costs, and metering & billing provisions.
As we discuss these approaches, we will share insights gathered from the exercise conducted on one of the largest Hadoop footprints in the world. We will illustrate how to organize cluster resources, compile data required and typical sources, develop TCO models tailored for individual situations, derive unit costs for usage, measure the resource usage for services, optimize for higher utilization, and benchmark costs.
URL: http://strataconf.com/stratany2013/public/schedule/detail/30824
HBaseCon 2013: Multi-tenant Apache HBase at Yahoo! Sumeet Singh
Yahoo! has been using HBase for a long time in isolated instances, most notably for the personalization platform powering its homepage experiences. The introduction of multi-tenancy has lowered the barriers for all Hadoop users to use HBase. We will cover traditional use cases for HBase at Yahoo!, and new use cases as a result in content management, advertising, log processing, analytics and reporting, recommendation graphs, and dimension data stores.
We will then talk about the deployment strategy and enhancements made that facilitate multi-tenancy. Region Server groups provide a coarse level of isolation among tenants by designating a subset of region servers to serve designated tables, and Namespaces provide logical grouping of resources (region servers, tables) and privileges (quota, ACLs).
We'll also share our experiences in operating HBase with security enabled and contributions made in this area, and results from performance runs conducted to validate customer expectations in a multi-tenant environment.
URL: http://www.cloudera.com/content/cloudera/en/resources/library/hbasecon/hbasecon-2013--multi-tenant-apache-hbase-at-yahoo-video.html
Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024 Sinan KOZAK
Sinan from the Delivery Hero mobile infrastructure engineering team shares a deep dive into performance acceleration with Gradle build cache optimizations. Sinan shares their journey into solving complex build-cache problems that affect Gradle builds. By understanding the challenges and solutions found in our journey, we aim to demonstrate the possibilities for faster builds. The case study reveals how overlapping outputs and cache misconfigurations led to significant increases in build times, especially as the project scaled up with numerous modules using Paparazzi tests. The journey from diagnosing to defeating cache issues offers invaluable lessons on maintaining cache integrity without sacrificing functionality.
Digital Twins Computer Networking Paper Presentation.pptx aryanpankaj78
A Digital Twin in computer networking is a virtual representation of a physical network, used to simulate, analyze, and optimize network performance and reliability. It leverages real-time data to enhance network management, predict issues, and improve decision-making processes.
Supermarket Management System Project Report.pdf Kamal Acharya
Supermarket management is a stand-alone J2EE program built with Eclipse Juno.
This project contains all the information required to maintain
the supermarket billing system.
The core idea of this project is to minimize paperwork and centralize the
data. All communication takes place in a secure manner: in this
application the information is stored on the client itself. For further security, the
database is stored in a back-end Oracle database so that intruders cannot access it.
Null Bangalore | Pentesters Approach to AWS IAM Divyanshu
#Abstract:
- Learn more about the real-world methods for auditing AWS IAM (Identity and Access Management) as a pentester. So let us proceed with a brief discussion of IAM as well as some typical misconfigurations and their potential exploits in order to reinforce the understanding of IAM security best practices.
- Gain actionable insights into AWS IAM policies and roles, using hands on approach.
#Prerequisites:
- Basic understanding of AWS services and architecture
- Familiarity with cloud security concepts
- Experience using the AWS Management Console or AWS CLI.
- For hands on lab create account on [killercoda.com](https://killercoda.com/cloudsecurity-scenario/)
# Scenario Covered:
- Basics of IAM in AWS
- Implementing IAM Policies with Least Privilege to Manage S3 Bucket
- Objective: Create an S3 bucket with least privilege IAM policy and validate access.
- Steps:
- Create S3 bucket.
- Attach least privilege policy to IAM user.
- Validate access.
- Exploiting IAM PassRole Misconfiguration
- Allows a user to pass a specific IAM role to an AWS service (EC2), typically used for service access delegation. Then exploit the PassRole misconfiguration to gain unauthorized access to sensitive resources.
- Objective: Demonstrate how a PassRole misconfiguration can grant unauthorized access.
- Steps:
- Allow user to pass IAM role to EC2.
- Exploit misconfiguration for unauthorized access.
- Access sensitive resources.
- Exploiting IAM AssumeRole Misconfiguration with Overly Permissive Role
- An overly permissive IAM role configuration can lead to privilege escalation by creating a role with administrative privileges and allow a user to assume this role.
- Objective: Show how overly permissive IAM roles can lead to privilege escalation.
- Steps:
- Create role with administrative privileges.
- Allow user to assume the role.
- Perform administrative actions.
- Differentiation between PassRole vs AssumeRole
Try at [killercoda.com](https://killercoda.com/cloudsecurity-scenario/)
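As an illustration of the least-privilege S3 scenario listed above, the sketch below builds an IAM policy document that allows listing one bucket and reading and writing its objects, and nothing else. The bucket name and the `least_privilege_s3_policy` helper are hypothetical. The policy elements (`Version`, `Statement`, `Sid`, `Effect`, `Action`, `Resource`) and the S3 actions are standard, but which actions count as least privilege depends on what the user actually needs.

```python
import json


def least_privilege_s3_policy(bucket_name: str) -> dict:
    """Build a policy scoped to a single bucket with no wildcard actions."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is a bucket-level action, so it targets the bucket ARN.
                "Sid": "ListBucketOnly",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": bucket_arn,
            },
            {
                # Object-level actions target the objects inside the bucket.
                "Sid": "ReadWriteObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": f"{bucket_arn}/*",
            },
        ],
    }


policy = least_privilege_s3_policy("example-lab-bucket")  # hypothetical bucket name
print(json.dumps(policy, indent=2))
```

The printed JSON could then be attached to an IAM user as a policy, after which access is validated by confirming that actions outside the listed set are denied.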
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODEL ijaia
As digital technology becomes more deeply embedded in power systems, protecting the communication
networks of Smart Grids (SG) has emerged as a critical concern. Distributed Network Protocol 3 (DNP3)
represents a multi-tiered application layer protocol extensively utilized in Supervisory Control and Data
Acquisition (SCADA)-based smart grids to facilitate real-time data gathering and control functionalities.
Robust Intrusion Detection Systems (IDS) are necessary for early threat detection and mitigation because
of the interconnection of these networks, which makes them vulnerable to a variety of cyberattacks. To
solve this issue, this paper develops a hybrid Deep Learning (DL) model specifically designed for intrusion
detection in smart grids. The proposed approach is a combination of the Convolutional Neural Network
(CNN) and the Long-Short-Term Memory algorithms (LSTM). We employed a recent intrusion detection
dataset (DNP3), which focuses on unauthorized commands and Denial of Service (DoS) cyberattacks, to
train and test our model. The results of our experiments show that our CNN-LSTM method is much better
at finding smart grid intrusions than other deep learning algorithms used for classification. In addition,
our proposed approach improves accuracy, precision, recall, and F1 score, achieving a high detection
accuracy rate of 99.50%.
Home security is of paramount importance in today's world, where we rely increasingly on technology.
Using technology to make homes safer and easier to control from anywhere is
important for the occupants' safety. In this paper, we propose a low-cost,
AI-based home security system. The system has a user-friendly interface, allowing users to start
model training and face detection with simple keyboard commands. Our goal is to introduce an innovative
home security system using facial recognition technology. Unlike traditional systems, this system trains
and saves images of friends and family members. The system scans this folder to recognize familiar faces
and provides real-time monitoring. If an unfamiliar face is detected, it promptly sends an email alert,
ensuring a proactive response to potential security threats.
Software Engineering and Project Management - Software Testing + Agile Method...Prakhyath Rai
Software Testing: A Strategic Approach to Software Testing, Strategic Issues, Test Strategies for Conventional Software, Test Strategies for Object -Oriented Software, Validation Testing, System Testing, The Art of Debugging.
Agile Methodology: Before Agile – Waterfall, Agile Development.
Discover the latest insights on Data Driven Maintenance with our comprehensive webinar presentation. Learn about traditional maintenance challenges, the right approach to utilizing data, and the benefits of adopting a Data Driven Maintenance strategy. Explore real-world examples, industry best practices, and innovative solutions like FMECA and the D3M model. This presentation, led by expert Jules Oudmans, is essential for asset owners looking to optimize their maintenance processes and leverage digital technologies for improved efficiency and performance. Download now to stay ahead in the evolving maintenance landscape.
Prediction of Electrical Energy Efficiency Using Information on Consumer's Ac...PriyankaKilaniya
Energy efficiency has been important since the latter part of the last century. The main object of this survey is to determine the energy efficiency knowledge among consumers. Two separate districts in Bangladesh are selected to conduct the survey on households and showrooms about the energy and seller also. The survey uses the data to find some regression equations from which it is easy to predict energy efficiency knowledge. The data is analyzed and calculated based on five important criteria. The initial target was to find some factors that help predict a person's energy efficiency knowledge. From the survey, it is found that the energy efficiency awareness among the people of our country is very low. Relationships between household energy use behaviors are estimated using a unique dataset of about 40 households and 20 showrooms in Bangladesh's Chapainawabganj and Bagerhat districts. Knowledge of energy consumption and energy efficiency technology options is found to be associated with household use of energy conservation practices. Household characteristics also influence household energy use behavior. Younger household cohorts are more likely to adopt energy-efficient technologies and energy conservation practices and place primary importance on energy saving for environmental reasons. Education also influences attitudes toward energy conservation in Bangladesh. Low-education households indicate they primarily save electricity for the environment while high-education households indicate they are motivated by environmental concerns.
Hadoop Summit San Jose 2015: Towards SLA-based Scheduling on YARN Clusters
1. Towards SLA-based Scheduling on YARN Clusters
PRESENTED BY Sumeet Singh, Nathan Roberts ⎪ June 9, 2015
Hadoop Summit 2015, San Jose
2. Introduction
Sumeet Singh
Sr. Director, Product Management
Cloud Storage and Big Data Platforms
701 First Avenue, Sunnyvale, CA 94089 USA
@sumeetksingh
§ Manages Cloud Storage and Big Data products team at Yahoo
§ Responsible for Product Management, Strategy and Customer Engagements
§ Managed Cloud Engineering products teams and headed Strategy functions for the Cloud Platform Group at Yahoo
§ MBA from UCLA and MS from RPI
Nathan Roberts
Sr. Principal Architect
Core Hadoop
701 First Avenue, Sunnyvale, CA 94089 USA
§ Software Architect with the Hadoop Core team
§ With Yahoo since 2007 focused on high performance storage solutions, Linux kernel, and Hadoop
§ Previously with Motorola for 17 years as a Distinguished Member of Technical Staff
§ BS in Computer Science from the University of Illinois at Urbana-Champaign
3. Agenda
1 Job Scheduling in Hadoop
2 Capacity Scheduler at Yahoo
3 Capacity Scheduler Queue Management
4 Managing for SLAs
5 Q&A
4. Hadoop Grid Jobs at Yahoo – A Million a Day and Growing
HDFS (File System and Storage)
Pig (Scripting)
Hive (SQL)
Java MR APIs
YARN (Resource Management and Scheduling)
Tez (Execution Engine for Pig and Hive)
Spark (Alternate Exec Engine)
MapReduce (Legacy)
Data Processing
ML
Custom App on Slider
Oozie
Data Management
6. Job Scheduling with YARN
[Diagram: Clients talk to the RM (Scheduler, AMService, AppClientProtocol); NMs (ContainerManager) on Data Node 1, Data Node 2, and Data Node 3 host AM and Task containers]
Container
§ Unit of allocation and control for YARN
§ AM and individual tasks run in their own container
RM (Resource Manager)
§ Single central daemon
§ Schedules containers for apps
§ Monitors nodes and apps
NM (Node Manager)
§ Daemon running on each worker node
§ Launches, monitors, controls containers
AM (Application Master)
§ Schedules, monitors, and controls an app instance
§ RM launches an AM for each app submitted
§ AM requests containers via RM, launches containers via NM
7. Pluggable RM Scheduler – Current Choices
Default FIFO Scheduler
§ Single queue for all jobs and the cluster
§ Oldest jobs picked first from the head of the queue
§ No concept of priority or size of the jobs
§ Not suited for production, OK for testing or development
Fair Scheduler
§ Jobs are assigned to pools with guaranteed min resources
§ Jobs with highest time deficit picked up for freed up resource
§ Free resources can be allocated to other pools, excess pool capacity is shared among jobs
§ Preemption supports fairness among pools, priority supports importance within a pool
Capacity Scheduler
§ Jobs are submitted to queues with guaranteed min resources
§ Queues are ordered according to current_used / guaranteed_capacity. The most underserved queue is offered the resources first
§ Excess queue capacity is shared among cluster tenants
§ Preemption and reservations support returning guaranteed capacity back to the queues
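The Capacity Scheduler ordering rule on this slide (offer resources to the queue with the lowest used-to-guaranteed ratio) can be sketched in a few lines. This is an illustration only, not the scheduler's actual code; the `Queue` class, the queue names, and the numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Queue:
    name: str
    used: float        # resources currently in use (any consistent unit)
    guaranteed: float  # guaranteed capacity, in the same unit


def offer_order(queues):
    """Sort queues so the most underserved one (lowest used/guaranteed) is first."""
    return sorted(queues, key=lambda q: q.used / q.guaranteed)


# Hypothetical queues for illustration.
queues = [
    Queue("queue_a", used=30.0, guaranteed=40.0),  # ratio 0.75
    Queue("queue_b", used=5.0, guaranteed=20.0),   # ratio 0.25
    Queue("queue_c", used=30.0, guaranteed=20.0),  # ratio 1.50, above its guarantee
]
order = [q.name for q in offer_order(queues)]
print(order)
```

A queue running above its guarantee, like `queue_c` here, sorts last and only receives freed resources once the others have been offered them.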
8. Related Scheduler Proposals
Resource Aware
§ Memory and CPU already tracked and available as a resource in scheduling decisions
§ Disk IO and network are the other potential resources to manage explicitly
Delay [1]
§ Addresses the conflict between locality and fairness in Fair Scheduler to increase throughput
§ When the job to be scheduled next according to fairness cannot launch a local task, it waits for a small time, letting other jobs launch tasks instead
Dynamic Priority [2]
§ Users control allocated capacity by adjusting spending over time
§ Gives users the tool to optimize and customize their allocations to fit the importance and requirements of their jobs by scaling back when the cost is high
Deadline Constrained [3]
§ Schedules jobs based on user-specified deadline constraints
§ Uses a job execution cost model that considers several parameters such as runtime, input data size, etc.
[1] http://www.cs.berkeley.edu/~matei/papers/2010/eurosys_delay_scheduling.pdf
[2] http://www.cs.huji.ac.il/~feit/parsched/jsspp10/p7-sandholm.pdf
[3] http://www4.ncsu.edu/~kkc/papers/rev2.pdf
9. So, Fair Scheduler or Capacity Scheduler?
§ Both are very capable schedulers to handle user demands from a Hadoop Cluster
§ Similar in capabilities, difference perhaps just in their roots and goals when first
developed at Facebook and Yahoo respectively
§ Fairshare started with the concept of fairly allocating resources among jobs, pools
and users, while the Capacity scheduler grew from the need to guarantee certain
amounts of capacity to queues and users
§ Label-based Scheduling (YARN-796) and Resource Reservation (YARN-1051) are available on Capacity Scheduler today
§ Policy-driven Scheduling (YARN-3306) unifies much of the functionality. Scheduling policies (capacity, fairshare, etc.) are configurable per queue (you do not have to run a single policy for the entire cluster). Ordering of apps (considered for resources) is prescribed by the queue’s application ordering policy
10. Capacity Scheduler at Yahoo
§ Designed for running applications in
a shared secure multi-tenant
environment
§ Meets individual application needs
with capacity guarantees
§ Maximizes cluster utilization by
providing elasticity through access to
excess cluster capacity
§ Safeguards against misbehaving
applications and users through limits
§ Capacity abstractions through
queues and hierarchical queues for
predictable sharing
§ Queue ACLs control who can submit
applications
[Screenshot callouts: cluster-level metrics show total resources available and used; configured queues and sub-queues for the cluster; recently scheduled jobs]
11. Resources Tracked with Capacity Scheduler
Memory, CPU, Servers
Resource Allocation
§ Scheduler today considers both memory and CPU as resources
§ Dominant Resource First Calculator (uses Dominant Resource Fairness) for resource allocation
§ Utilization can suffer if not careful
Container Resources in MapReduce
§ Specifying resources for containers is framework-specific
§ mapreduce.[map|reduce].cpu.vcores
§ mapreduce.[map|reduce].memory.mb
§ MAX(Physical_Memory_Bytes) → memory.mb
§ MAX(CPU_time_spent / task_time) → cpu.vcores
§ vCores is tricky, but also more forgiving
§ Default as 1.5 / 2 GB and 10 vCores
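The two sizing rules above (peak physical memory drives memory.mb, peak CPU-time-to-task-time ratio drives cpu.vcores) can be sketched as follows. The record layout, the field names, and the 10% headroom are assumptions made for illustration; they are not taken from the slides.

```python
import math


def suggest_container_size(tasks, headroom=1.1):
    """Suggest (memory_mb, vcores) from per-task counters of a past run.

    tasks: list of dicts with hypothetical keys
      physical_memory_bytes, cpu_ms (CPU time spent), wall_ms (task time).
    headroom: assumed safety margin applied to peak memory.
    """
    peak_mem_bytes = max(t["physical_memory_bytes"] for t in tasks)
    peak_cpu_ratio = max(t["cpu_ms"] / t["wall_ms"] for t in tasks)
    memory_mb = math.ceil(peak_mem_bytes * headroom / (1024 * 1024))
    vcores = max(1, math.ceil(peak_cpu_ratio))
    return memory_mb, vcores


# Hypothetical counters from two task attempts.
tasks = [
    {"physical_memory_bytes": 900_000_000, "cpu_ms": 45_000, "wall_ms": 60_000},
    {"physical_memory_bytes": 1_200_000_000, "cpu_ms": 90_000, "wall_ms": 60_000},
]
size = suggest_container_size(tasks)
print(size)
```

The result would then be set through mapreduce.[map|reduce].memory.mb and mapreduce.[map|reduce].cpu.vcores for the job.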
12. Additional Available Optimizations (1 / 2)
Speculative Execution (through MR / Tez AM)
[Diagram: task 1 runs as attempt 0 and attempt 1 on different nodes over time t; the output of the faster attempt is picked. Node labels shown: Node X, Node Y, Node A, Node B]
mapreduce.map.speculative = true
mapreduce.reduce.speculative = true
Speculative execution helps with “slow” nodes, although it can be too late for tighter SLAs
Preemptive Execution
[Diagram: Queue 1, 40% (pre-emptable); Queue 2, 20%; Queue 3, 20%; Queue 4, 20%; jobs J1 to J6 running or waiting; J6 claims resources from J4]
yarn.resourcemanager.scheduler.monitor.enable = true
yarn.resourcemanager.scheduler.monitor.policies = ProportionalCapacityPreemptionPolicy
Preemption helps SLAs, but be careful with queues that have long-running tasks and a high “max capacity”, which can lock down a large part of the cluster
13. Additional Available Optimizations (2 / 2)
Node Labels
[Diagram: Hadoop cluster with nodes labeled x and y; Queue 1, 40%, label x; Queue 2, 40%, labels x, y; Queue 3, 20%; jobs J1 to J4]
yarn.scheduler.capacity.root.<queue name>.accessible-node-labels = <label name>
yarn.scheduler.capacity.root.<queue name>.default-node-label-expression sets the default label asked for by the queue
15. Configuration Capacity Scheduler Queues (1 / 2)
(1) Queue State: RUNNING or STOPPED, primarily used for stopping and draining a queue
(2) Used Capacity: Percentage of absolute capacity of queue in use, up to its absolute max capacity
(3) Absolute Used Capacity: Percentage of cluster capacity the queue is using
(4) Absolute Capacity: Percentage of cluster’s total capacity allocated to the queue
(5) Absolute Max Capacity: Percentage of cluster capacity the queue is allowed to take
(6) Used Resources: Memory and CPU consumed by jobs submitted to the queue
(7) Num Schedulable Apps: Applications that the scheduler is actively considering for resource requests
(8) Num Non-Schedulable Apps: Applications pending to be scheduled on the cluster
(9) Num Containers: Number of YARN containers in use by the running apps submitted to the queue
(10) Max Apps: Max applications, active and pending, in the queue
16. Configuration Capacity Scheduler Queues (2 / 2)
(11) Max Apps Per User: Max applications in the queue that can be concurrently active for a given user
(12) Max Schedulable Apps: Maximum applications that can be active / running on the cluster from the queue
(13) Max Sched. Apps Per User: Maximum applications that can be active / running per user for the given queue
(14) Configured Capacity: Percentage of parent's queue capacity this queue will use
(15) Configured Max Capacity: Percentage of the parent's max capacity this queue will use at the maximum
(16) Config. Min User Limit %: Lower bound & guarantee on resources to a single user when there is demand
(17) Config. User Limit Factor: Multiplier to the user limit when a single user is in the queue
(18) Active Users: All users currently running apps in the queue
(19) Accessible Node Labels: Node labels the queue is allowed to access
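Since Configured Capacity is a percentage of the parent queue while Absolute Capacity is a percentage of the cluster, the two are related by multiplying down the queue path. A minimal sketch of that arithmetic follows; the queue paths and percentages are hypothetical.

```python
def absolute_capacity(queue_path, configured):
    """Multiply configured capacities (percent of parent) along a queue path.

    queue_path: dotted path starting at "root", e.g. "root.bu1.initiative1".
    configured: dict mapping each queue path to its configured capacity in percent.
    Returns the queue's share of the whole cluster, in percent.
    """
    fraction = 1.0
    parts = queue_path.split(".")
    for depth in range(2, len(parts) + 1):  # skip "root", which is 100%
        fraction *= configured[".".join(parts[:depth])] / 100.0
    return fraction * 100.0


# Hypothetical hierarchy: 40% of root, then 50% of that, then 80% of that.
configured = {
    "root.bu1": 40.0,
    "root.bu1.initiative1": 50.0,
    "root.bu1.initiative1.scheduled": 80.0,
}
abs_cap = absolute_capacity("root.bu1.initiative1.scheduled", configured)
print(round(abs_cap, 2))
```

Here a leaf queue configured at 80% of its parent ends up with roughly 16% of the cluster.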
17. Capacity Scheduler Parameters – The Important Four
[Diagram: a queue's resource bar marked, in order, at Min User Limit % (25%), Capacity, User Limit Factor (150%), and Max Capacity]
§ “Capacity” is what the scheduler tries to guarantee for each queue
§ “Max Capacity” is a HARD limit for the queue
§ “User Limit Factor” is a HARD limit for individual users: no user goes over 150% of capacity
§ “Min User Limit %” is how much the scheduler will give to an app before evenly distributing
§ Once a user is above “Min User Limit %”, the scheduler will try to evenly distribute resources to applications requesting more resources
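How the two hard limits interact can be shown with a small calculation. This is a simplified sketch, not the scheduler's full user-limit computation: it assumes a single user's ceiling is the queue capacity times the user limit factor, bounded by the queue's max capacity. The function name and the sample percentages are illustrative.

```python
def max_user_share(capacity_pct, user_limit_factor, max_capacity_pct):
    """Upper bound, in percent of the cluster, that one user can reach in a queue.

    Simplified: capacity * user limit factor, never above the queue's max capacity.
    """
    return min(capacity_pct * user_limit_factor, max_capacity_pct)


# User limit factor binds: 13% capacity * 1.5 = 19.5%, below the 24% max capacity.
factor_bound = max_user_share(13.0, 1.5, 24.0)
# Max capacity binds: 20% capacity * 1.5 = 30%, capped at 25%.
max_bound = max_user_share(20.0, 1.5, 25.0)
print(factor_bound, max_bound)
```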
18. Understanding Minimum User Limit Percent
[Diagram: the Scheduler serving App 1 (User A), App 2 (User B), and App 3 (User C), all in the Requesting state]
§ Minimum User Limit Percent = 25% (3 containers)
§ All applications initially requesting resources
§ FIFO until Minimum User Limit
§ Evenly distribute after Min User Limit
§ Evenly among requestors
§ User A becomes more favored when it starts requesting resources again
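The two phases on this slide (FIFO until each user reaches the minimum user limit, then even distribution among requestors) can be simulated with a toy model. It is a deliberate simplification: every user is assumed to keep requesting, containers are uniform, and the 12-container queue size is inferred from "25% (3 containers)".

```python
def distribute(total_containers, users, min_user_limit_pct):
    """Toy model: FIFO up to the minimum user limit, then round-robin.

    users: non-empty list of user names in arrival (FIFO) order.
    Assumes every user keeps requesting more containers.
    """
    min_share = total_containers * min_user_limit_pct // 100
    allocation = {user: 0 for user in users}
    free = total_containers

    # Phase 1: serve users in FIFO order until each holds the minimum share.
    for user in users:
        grant = min(min_share, free)
        allocation[user] += grant
        free -= grant

    # Phase 2: hand out what is left one container at a time, evenly.
    turn = 0
    while free > 0:
        allocation[users[turn % len(users)]] += 1
        free -= 1
        turn += 1
    return allocation


result = distribute(12, ["User A", "User B", "User C"], 25)
print(result)
```

With three requesting users and a 25% limit, each reaches 3 containers in arrival order and the remaining 3 are split evenly.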
19. Common Queue Setup and Nomenclature
BU-based Allocations (little to no use of hierarchical queues)
root
  BU1
  BU2
  BU3
  Unfunded
  Hadoop Dev
  Hadoop Ops
Initiatives-based Allocations (some use of hierarchical queues)
root
  Initiative 1
  Initiative 2
  Initiative 3
  Unfunded
  Hadoop Dev
  Hadoop Ops
Sub-queues shown: Proj 1, Proj 2
Hybrid Allocations (some use of hierarchical queues)
root
  BU1
  BU2
  Unfunded
  Hadoop Dev
  Hadoop Ops
Sub-queues shown: Initiative 1, with Proj 1 and Proj 2 beneath it
20. Decomposing Production Queues for Seasonality
[Chart: queue usage over time t, decomposed into Observed, Seasonal, and Random components]
Most production queues exhibit a high degree of randomness
21. Recommended Approach to Queue Setup
[Diagram: recommended queue hierarchy, applied to Cluster 1, 2, …, n]
root
  BU1
  BU2
  BU3
    Initiative 1
      Initiative 1 - scheduled
      Initiative 1 - adhoc
    Initiative 2
  default
  Hadoop Dev
  Hadoop Ops
§ Ubiquitous queues
§ “default” does not require apps to specify a queue name; typically for adhoc pre-emptable jobs open to all, helpful for managing spare capacity or headroom
§ BU-based allocations for capex and metering, potential automated onboarding
§ BU manages given capacity among initiatives
§ Initiatives / major projects as sub-queues
§ Separation of scheduled production and adhoc jobs
§ Space start times, space out peaks
§ Low “absolute” and high “absolute max” on adhoc, potentially pre-emptable
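A consistent nomenclature makes queue definitions easy to generate. The sketch below turns a nested queue tree into Capacity Scheduler property names of the form yarn.scheduler.capacity.<queue-path>.queues and yarn.scheduler.capacity.<queue-path>.capacity. The queue names and capacity values are hypothetical and cover only part of the hierarchy above.

```python
def queue_properties(tree, capacities, path="root"):
    """Return (property, value) pairs for a nested dict of queue names.

    tree: dict mapping a child queue name to its own sub-tree (empty dict for a leaf).
    capacities: dict mapping full queue paths to capacity in percent of the parent.
    """
    props = []
    if tree:
        props.append((f"yarn.scheduler.capacity.{path}.queues", ",".join(tree)))
    for name, subtree in tree.items():
        child = f"{path}.{name}"
        props.append((f"yarn.scheduler.capacity.{child}.capacity", str(capacities[child])))
        props.extend(queue_properties(subtree, capacities, child))
    return props


# Hypothetical subset of the recommended hierarchy.
tree = {
    "bu3": {"initiative1": {"scheduled": {}, "adhoc": {}}, "initiative2": {}},
    "default": {},
}
capacities = {
    "root.bu3": 80, "root.default": 20,
    "root.bu3.initiative1": 60, "root.bu3.initiative2": 40,
    "root.bu3.initiative1.scheduled": 70, "root.bu3.initiative1.adhoc": 30,
}
props = dict(queue_properties(tree, capacities))
for key, value in props.items():
    print(f"{key} = {value}")
```

The capacities of the children of each parent sum to 100 in this example, which is what the scheduler expects at every level.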
22. Compute Capacity Allocation – Provisioned vs. Observed
[Chart: # Mappers Provisioned / Used (Monthly Eqv.) for each project on-boarded; series: Mappers Provisioned, Mappers Observed]
Accurately estimating compute needs in advance is hard
23. Notes on Compute Capacity Estimation
Step 1: Sample Run (with a tenth of data on a sandbox cluster)
Stages | # Map | Map Size | Map Time | # Reduce | Reduce Size | Reduce Time | Shuffle Time
Stage 1 | 100 | 1.5 GB | 15 Min | 50 | 2 GB | 10 Min | 3 Min
Stage 2 - L | 150 | 1.5 GB | 10 Min | 50 | 2 GB | 10 Min | 4 Min
Stage 2 - R | 100 | 1.5 GB | 5 Min | 25 | 2 GB | 5 Min | 1 Min
Stage 3 | 200 | 1.5 GB | 10 Min | 75 | 2 GB | 5 Min | 2 Min
Notes:
§ SLOT_MILLIS_MAPS and SLOT_MILLIS_REDUCES give the time spent
§ TOTAL_LAUNCHED_MAPS and TOTAL_LAUNCHED_REDUCES give # Map and # Reduce
§ Shuffle Time is Data per Reducer / est. 4 MB/s (bandwidth for data transfer from Map to Reduce)
§ Reduce time includes the Sort time. Add 10% for speculative execution (failed / killed task attempts)
Step 2: Mappers and Reducers
Number of mappers | 278 | [ (Max of Stage 1, 2 & 3) x 10 ] / (SLA of 6 Hrs. / 35)
Number of reducers | 84 | [ (Max of Stage 1, 2 & 3) x 10 ] / (SLA of 6 Hrs. / 25)
Memory required for mappers and reducers | 278 x 1.5 + 84 x 2 = 585 GB
Number of servers | 585 / 44 = 14 servers
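The memory and server arithmetic in Step 2, along with the shuffle-time rule from the notes, can be checked with a short script. The mapper and reducer counts are taken from the table as given. Reading "44" as usable GB of memory per server is an assumption, and the 720 MB per reducer in the shuffle example is hypothetical.

```python
import math

mappers, reducers = 278, 84      # from Step 2 of the slide
map_gb, reduce_gb = 1.5, 2.0     # container sizes from the sample run
server_gb = 44                   # assumed: usable memory per server, in GB

# Memory needed if all mappers and reducers hold their containers at once.
memory_gb = mappers * map_gb + reducers * reduce_gb
servers = math.ceil(memory_gb / server_gb)


def shuffle_minutes(data_per_reducer_mb, bandwidth_mb_s=4.0):
    """Shuffle time = data per reducer / estimated map-to-reduce bandwidth."""
    return data_per_reducer_mb / bandwidth_mb_s / 60


print(memory_gb, servers, shuffle_minutes(720))
```

Rounding up matters here: 585 / 44 is a little over 13, so the estimate is 14 servers.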
24. Observe Queue Utilization
[Charts: Cluster Utilization and per-queue utilization over time]
Queue Utilization – Project 1 / Queue 1
Absolute Capacity: 13.0%
Absolute Max Capacity: 24.0%
Configured Minimum User Limit Percent: 100%
Configured User Limit Factor: 1.5
Queue Utilization – Project 1 / Queue 2
Absolute Capacity: 7.0%
Absolute Max Capacity: 12.0%
Configured Minimum User Limit Percent: 100%
Configured User Limit Factor: 1.5
§ Cluster load shows no pattern
§ Queues here are almost always above “absolute capacity”
§ Prevent SLA queues from running over capacity
25. Factors Impacting SLAs
§ New queues created for new projects
§ New projects or users added to an existing
queue
§ Existing projects and users move to a
different queue
§ Existing projects in a queue grow
§ Adhoc / rogue users
§ Cluster downtime
§ Pipeline catch-ups
Plan, Measure and Monitor
Rolling upgrades and HA
Know what to suspend and how
to move capacity from one queue
to the other
26. Measuring Compute Consumption
Measure Compute
For a queue, user, cluster over time (GB-Hr / vCore-Hr)
sum(map_slot_seconds + reduce_slots_seconds) * yarn.scheduler.minimum-allocation-mb / 1024 / 60 / 60
OR,
sum(memoryseconds) / 1024 / 60 / 60, sum(vcoreseconds) / 60 / 60 from rmappsummary by apptype;
Monitor
[Charts: three panels comparing MR and Tez compute consumption on a 0 to 400,000 scale; periods shown: April 1-13, 2015 and May 16-31, 2015]
While chargeback models work, monitoring is critical in preserving SLAs while maximizing cluster utilization.
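The second formula (memory-seconds and vcore-seconds summed per application type) translates directly into a small aggregation. The sketch assumes memory-seconds are reported in MB-seconds, as the division by 1024 implies; the dictionary keys and the sample records are hypothetical stand-ins for rows of the RM application summary.

```python
from collections import defaultdict


def compute_usage(app_summaries):
    """Aggregate GB-hours and vCore-hours per application type.

    app_summaries: iterable of dicts with hypothetical keys
      app_type, memory_seconds (MB-seconds), vcore_seconds.
    """
    usage = defaultdict(lambda: {"gb_hours": 0.0, "vcore_hours": 0.0})
    for app in app_summaries:
        bucket = usage[app["app_type"]]
        bucket["gb_hours"] += app["memory_seconds"] / 1024 / 60 / 60
        bucket["vcore_hours"] += app["vcore_seconds"] / 60 / 60
    return dict(usage)


# Hypothetical application summary rows.
apps = [
    {"app_type": "MAPREDUCE", "memory_seconds": 7_372_800, "vcore_seconds": 7_200},
    {"app_type": "MAPREDUCE", "memory_seconds": 3_686_400, "vcore_seconds": 3_600},
    {"app_type": "TEZ", "memory_seconds": 3_686_400, "vcore_seconds": 3_600},
]
usage = compute_usage(apps)
print(usage)
```

The same function can be keyed by queue or user instead of application type to produce the other breakdowns mentioned above.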
27. Measuring and Reporting SLAs
Dominant user (of 7 total users) of a sub-queue
Absolute Capacity: 8.8%
Absolute Max Capacity: 32%
User Limit Factor: 2
Min User Limit %: 100%
[Charts: Memory (MB) Seconds, Runtime (seconds), and # Jobs by the User per day, 5/25/15 to 5/31/15, for AD-SUPPLY-SUMMARY-15M (96 jobs total in a day)]
28. Measuring and Reporting SLAs ( cont’d)
Stage 1 (SLA = x mins) → Stage 2 (SLA = y mins) → Stage 3 (SLA = z mins) → Stage N…
End-to-End Pipeline SLA “s” minutes
Name Application to Enable Reporting
PigLatin:AD-SUPPLY-SUMMARY-15M-201505242145
PigLatin:AD-SUPPLY-SUMMARY-15M-201505242200
PigLatin:AD-SUPPLY-SUMMARY-15M-201505242215
PigLatin:AD-SUPPLY-SUMMARY-15M-201505242230
PigLatin:AD-SUPPLY-SUMMARY-15M-201505242245
PigLatin:AD-SUPPLY-SUMMARY-15M-201505242330
PigLatin:AD-SUPPLY-SUMMARY-15M-201505242315
PigLatin:AD-SUPPLY-SUMMARY-15M-201505242345
PigLatin:AD-SUPPLY-SUMMARY-15M-201505242300
Tag Jobs with IDs to Enable Reporting
§ Four unique identifiers can do the job: Pipeline ID, Instance ID, Start, End
§ MR, Pig, Hive and Oozie can all take arbitrary tags as job parameters
§ Job logs reconstruct the pipeline's execution, or sections of it, arranged by timestamp
§ Scheduled reports show SLAs met or missed
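A report built on this naming convention needs little more than a parser and a comparison. The sketch below assumes the trailing 12 digits of each application name are the instance timestamp in yyyyMMddHHmm form, which matches the names listed above but is not stated in the slides. The 15-minute SLA and the end time are hypothetical.

```python
import re
from datetime import datetime

# framework ":" pipeline-id "-" 12-digit instance timestamp (assumed format)
NAME_PATTERN = re.compile(r"^(?P<framework>[^:]+):(?P<pipeline>.+)-(?P<instance>\d{12})$")


def parse_job_name(name):
    """Split an application name into (framework, pipeline ID, instance time)."""
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized application name: {name}")
    instance = datetime.strptime(match["instance"], "%Y%m%d%H%M")
    return match["framework"], match["pipeline"], instance


def sla_met(start, end, sla_minutes):
    """True when the job finished within sla_minutes of its start."""
    return (end - start).total_seconds() <= sla_minutes * 60


parsed = parse_job_name("PigLatin:AD-SUPPLY-SUMMARY-15M-201505242145")
# Hypothetical end time and SLA for this instance.
met = sla_met(parsed[2], datetime(2015, 5, 24, 21, 57), sla_minutes=15)
print(parsed, met)
```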
29. Measuring and Reporting SLAs ( cont’d)
§ Oozie can actively track SLAs on Jobs
§ Start-time, End-time, Duration (Met or Miss)
§ At any time, the SLA processing stage will reflect:
§ Not_Started → Job not yet begun
§ In_Process → Job started and is running, and SLAs are being tracked
§ Met → caused by an END_MET
§ Miss → caused by an END_MISS
§ Access/Filter SLA info via
§ Web-console dashboard
§ REST API
§ JMS Messages
§ Email alerts
<workflow-app xmlns="uri:oozie:workflow:0.5" xmlns:sla="uri:oozie:sla:0.2" name="sla-wf">
  ...
  <end name="end"/>
  <sla:info>
    <sla:nominal-time>${nominalTime}</sla:nominal-time>
    <sla:should-start>${shouldStart}</sla:should-start>
    <sla:should-end>${shouldEnd}</sla:should-end>
    <sla:max-duration>${duration}</sla:max-duration>
    <sla:alert-events>start_miss,end_miss</sla:alert-events>
    <sla:alert-contact>joe@yahoo</sla:alert-contact>
  </sla:info>
</workflow-app>
31. Going Forward
YARN-624
§ Gang Scheduling – Stalled?
§ Scheduler capable of running a set of tasks all at the same time
YARN-1051
§ Reservation Based Scheduling in Hadoop 2.6+
§ Jobs / users can negotiate with the RM at admission time for time-bounded, guaranteed
allocation of cluster resources
§ RM has an understanding of future resource demand (e.g., a job submitted now with
time before its deadline might run after a job showing up later but in a rush)
§ Lots of potential, needs evaluation on our end
YARN-1963
§ In-queue priorities – Implementation phase
§ Allows dynamic adjustment of what’s important in a queue
YARN-2915
§ Resource Manager Federation – Design phase
§ Scale YARN to manage 10s of thousands of nodes
YARN-3306
§ Per-queue policy-driven scheduling – Implementation phase
32. Related Talks at the Summit
Day 1 (2:35 PM) Apache Hadoop YARN: Past, Present and Future
Day 2 (12:05 PM) Reservation-based Scheduling: If You’re Late Don’t Blame Us!
Day 2 (1:45 PM) Enabling diverse workload scheduling in YARN
Day 3 (11:00 AM) Node Labels in YARN