YARN (Yet Another Resource Negotiator) is a resource management framework for Hadoop clusters that addresses the scalability limitations of the original MapReduce framework. YARN separates resource management from job scheduling, allowing multiple data processing engines such as MapReduce, Spark, and Storm to share common cluster resources. It introduces a new architecture with a ResourceManager that allocates resources among applications and per-application ApplicationMasters that manage containers and scheduling within each application. Compared to the original Hadoop architecture, this provides improved scalability, utilization, and multi-tenancy for a variety of workloads.
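The division of labor between the ResourceManager and the ApplicationMasters can be sketched as a toy Python model (illustrative only; the class names mirror YARN's components, but this is not the real YARN API):

```python
# Toy model of YARN's split between cluster-wide resource allocation
# (ResourceManager) and per-application scheduling (ApplicationMaster).
# Illustrative sketch only -- real YARN is a distributed Java system.

class ResourceManager:
    def __init__(self, total_containers):
        self.available = total_containers

    def allocate(self, n):
        """Grant up to n containers from the cluster-wide pool."""
        granted = min(n, self.available)
        self.available -= granted
        return granted

class ApplicationMaster:
    def __init__(self, app_id, rm):
        self.app_id = app_id
        self.rm = rm
        self.containers = 0

    def request(self, n):
        # Each application negotiates with the shared ResourceManager,
        # then schedules its own tasks inside the granted containers.
        self.containers += self.rm.allocate(n)
        return self.containers

rm = ResourceManager(total_containers=10)
mapreduce_am = ApplicationMaster("mapreduce-job", rm)
spark_am = ApplicationMaster("spark-job", rm)

print(mapreduce_am.request(6))  # 6 containers granted
print(spark_am.request(6))      # only 4 remain in the shared pool
```

The point of the sketch is the multi-tenancy: two different engines draw from one shared pool, with per-application scheduling kept out of the cluster-wide allocator.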
This presentation on Hadoop YARN will help you understand Hadoop 1.0 and Hadoop 2.0, the limitations of Hadoop 1.0, the need for YARN, what YARN is, the workloads that run on YARN, YARN components, and YARN architecture, and it closes with a demo on YARN. YARN is the cluster resource management layer of the Apache Hadoop ecosystem; it schedules jobs and assigns resources. Hadoop 1.0 was designed to run MapReduce jobs only and had issues with scalability and resource utilization, whereas YARN solved those issues and let users work with multiple processing models. Now let us get started and learn YARN in detail.
Below topics are explained in this Hadoop YARN presentation:
1. Hadoop 1.0 (MapReduce 1)
2. Limitations of Hadoop 1.0 (MapReduce 1)
3. Need for YARN
4. What is YARN
5. Workloads running on YARN
6. YARN components
7. YARN architecture
8. Demo on YARN
What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark developer course has been designed to impart in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
This course will enable you to:
1. Understand the different components of the Hadoop ecosystem such as Hadoop 2.7, YARN, MapReduce, Pig, Hive, Impala, HBase, Sqoop, Flume, and Apache Spark
2. Understand Hadoop Distributed File System (HDFS) and YARN as well as their architecture, and learn how to work with them for storage and resource management
3. Understand MapReduce and its characteristics, and assimilate some advanced MapReduce concepts
4. Get an overview of Sqoop and Flume and describe how to ingest data using them
5. Create database and tables in Hive and Impala, understand HBase, and use Hive and Impala for partitioning
6. Understand different types of file formats, Avro Schema, using Avro with Hive and Sqoop, and schema evolution
7. Understand Flume, Flume architecture, sources, flume sinks, channels, and flume configurations
8. Understand HBase, its architecture, data storage, and working with HBase. You will also understand the difference between HBase and RDBMS
9. Gain a working knowledge of Pig and its components
10. Do functional programming in Spark
11. Understand resilient distributed datasets (RDDs) in detail
12. Implement and build Spark applications
13. Gain an in-depth understanding of parallel processing in Spark and Spark RDD optimization techniques
14. Understand the common use-cases of Spark and the various interactive algorithms
15. Learn Spark SQL, creating, transforming, and querying Data frames
Learn more at https://www.simplilearn.com/big-data-and-analytics/big-data-and-hadoop-training
As part of the recent release of Hadoop 2 by the Apache Software Foundation, YARN and MapReduce 2 deliver significant upgrades to scheduling, resource management, and execution in Hadoop.
At their core, YARN and MapReduce 2’s improvements separate cluster resource management capabilities from MapReduce-specific logic. YARN enables Hadoop to share resources dynamically between multiple parallel processing frameworks such as Cloudera Impala, allows more sensible and finer-grained resource configuration for better cluster utilization, and scales Hadoop to accommodate more and larger jobs.
ORC files were originally introduced in Hive, but have now migrated to an independent Apache project. This has sped up the development of ORC and simplified integrating ORC into other projects, such as Hadoop, Spark, Presto, and NiFi. There are also many new tools built on top of ORC, such as Hive’s ACID transactions and LLAP, which provides incredibly fast reads for your hot data. LLAP also provides strong security guarantees that allow each user to see only the rows and columns they have permission for.
This talk will discuss the details of the ORC and Parquet formats and the relevant tradeoffs. In particular, it will cover how to format your data and which options to use to maximize read performance, including when and how to use ORC’s schema evolution, bloom filters, and predicate pushdown. It will also show how to use the tools to translate ORC files into human-readable formats, such as JSON, and how to display the rich metadata from the file, including the types in the file and the min, max, and count for each column.
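The predicate pushdown mentioned above relies on per-stripe (ORC) or per-row-group (Parquet) min/max column statistics. A pure-Python sketch of the idea, with made-up data (this is an illustration of the mechanism, not either file format's actual API):

```python
# Sketch of the idea behind predicate pushdown in ORC/Parquet:
# each stripe/row group stores min/max stats per column, so a reader
# can skip whole chunks without decoding any rows. Illustrative only.

def stripe_stats(rows, column):
    values = [r[column] for r in rows]
    return {"min": min(values), "max": max(values), "count": len(values)}

def scan_with_pushdown(stripes, column, lo, hi):
    """Decode only stripes whose [min, max] range can overlap [lo, hi]."""
    hits, stripes_read = [], 0
    for rows in stripes:
        stats = stripe_stats(rows, column)  # stored in the file footer in practice
        if stats["max"] < lo or stats["min"] > hi:
            continue  # whole stripe skipped -- no row decoding needed
        stripes_read += 1
        hits.extend(r for r in rows if lo <= r[column] <= hi)
    return hits, stripes_read

stripes = [
    [{"id": 1, "price": 5}, {"id": 2, "price": 9}],    # max price 9: skipped
    [{"id": 3, "price": 50}, {"id": 4, "price": 70}],  # overlaps [40, 60]: read
    [{"id": 5, "price": 200}],                         # min price 200: skipped
]
rows, read = scan_with_pushdown(stripes, "price", 40, 60)
print(read)   # only 1 of 3 stripes decoded
print(rows)   # [{'id': 3, 'price': 50}]
```

Bloom filters extend the same skipping idea to point lookups, where min/max ranges are too coarse to rule a stripe out.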
Everything you wanted to know about Apache Tez:
-- Distributed execution framework targeted towards data-processing applications.
-- Based on expressing a computation as a dataflow graph.
-- Highly customizable to meet a broad spectrum of use cases.
-- Built on top of YARN – the resource management framework for Hadoop.
-- Open source Apache incubator project and Apache licensed.
Presented at #H2OWorld 2017 in Mountain View, CA.
Enjoy the video: https://youtu.be/42Oo8TOl85I.
Learn more about H2O.ai: https://www.h2o.ai/.
Follow @h2oai: https://twitter.com/h2oai.
- - -
Abstract:
In recent years, the demand for machine learning experts has outpaced the supply, despite the surge of people entering the field. To address this gap, there have been big strides in the development of user-friendly machine learning software that can be used by non-experts. Although H2O has made it easier for practitioners to train and deploy machine learning models at scale, there is still a fair bit of knowledge and background in data science that is required to produce high-performing machine learning models. Deep Neural Networks in particular, are notoriously difficult for a non-expert to tune properly. In this presentation, we provide an overview of the field of "Automatic Machine Learning" and introduce the new AutoML functionality in H2O. H2O's AutoML provides an easy-to-use interface which automates the process of training a large, comprehensive selection of candidate models and a stacked ensemble model which, in most cases, will be the top performing model in the AutoML Leaderboard. H2O AutoML is available in all the H2O interfaces including the h2o R package, Python module and the Flow web GUI. We will also provide simple code examples to get you started using AutoML.
Erin's Bio:
Erin is a Statistician and Machine Learning Scientist at H2O.ai. She is the main author of H2O Ensemble. Before joining H2O, she was the Principal Data Scientist at Wise.io and Marvin Mobile Security (acquired by Veracode in 2012) and the founder of DataScientific, Inc. Erin received her Ph.D. in Biostatistics with a Designated Emphasis in Computational Science and Engineering from University of California, Berkeley. Her research focuses on ensemble machine learning, learning from imbalanced binary-outcome data, influence curve based variance estimation and statistical computing. She also holds a B.S. and M.A. in Mathematics.
Technological Geeks Video 13 :-
Video Link :- https://youtu.be/mfLxxD4vjV0
FB page Link :- https://www.facebook.com/bitwsandeep/
Contents :-
Hive Architecture
Hive Components
Limitations of Hive
Hive data model
Difference with traditional RDBMS
Type system in Hive
Introduction to Apache Flink - Fast and reliable big data processing (Till Rohrmann)
This presentation introduces Apache Flink, a massively parallel data processing engine currently undergoing incubation at the Apache Software Foundation. Flink's programming primitives are presented, and it is shown how easily a distributed PageRank algorithm can be implemented with Flink. Intriguing features such as dedicated memory management, Hadoop compatibility, streaming, and automatic optimisation make it a unique system in the world of Big Data processing.
This Hadoop tutorial on MapReduce examples (MapReduce Tutorial Blog Series: https://goo.gl/w0on2G ) will help you understand how to write a MapReduce program in Java. You will also get to see multiple MapReduce examples on analytics and testing.
Check our complete Hadoop playlist here: https://goo.gl/ExJdZs
Below are the topics covered in this tutorial:
1) MapReduce Way
2) Classes and Packages in MapReduce
3) Explanation of a Complete MapReduce Program
4) MapReduce Examples on Analytics
5) MapReduce Example on Testing - MRUnit
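The tutorial's program is in Java, but the map/shuffle/reduce flow it explains can be sketched in a few lines of pure Python (an illustration of the model, not Hadoop's API; the classic deer/bear/car word-count input is assumed):

```python
# Minimal pure-Python sketch of the MapReduce flow: map emits
# (key, value) pairs, the shuffle groups values by key, and reduce
# aggregates each group. Illustrative only -- not Hadoop's API.
from collections import defaultdict

def map_phase(line):
    # Emit (word, 1) for every word, like a Java Mapper's map() method.
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Group values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Sum the counts for one word, like a Java Reducer's reduce() method.
    return key, sum(values)

lines = ["Deer Bear River", "Car Car River", "Deer Car Bear"]
pairs = [kv for line in lines for kv in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts["car"])   # 3
print(counts["deer"])  # 2
```

In Hadoop the same three roles are distributed across the cluster, with the shuffle moving data between mapper and reducer nodes.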
Best Practices for Enabling Speculative Execution on Large Scale Platforms (Databricks)
Apache Spark has a ‘speculative execution’ feature to handle tasks that run slowly in a stage due to environment issues like a slow network or disk. If one task is running slowly in a stage, the Spark driver can launch a speculative copy of it on a different host. Between the regular task and its speculative copy, Spark takes the result from the first to complete successfully and kills the slower one.
When we first enabled the speculation feature by default for all Spark applications on a large cluster of 10K+ nodes at LinkedIn, we observed that the default values of Spark’s speculation configuration parameters did not work well for LinkedIn’s batch jobs. For example, the system launched too many fruitless speculation tasks (i.e., tasks that were killed later), and the speculation tasks did not help shorten the shuffle stages. To reduce the number of fruitless speculation tasks, we tracked down the root cause, enhanced the Spark engine, and tuned the speculation parameters carefully. We analyzed the number of speculation tasks launched, the number of fruitful versus fruitless speculation tasks, and their corresponding CPU-memory resource consumption in gigabyte-hours. We were able to reduce average job response times by 13%, decrease the standard deviation of job elapsed times by 40%, and lower total resource consumption by 24% in a heavily utilized multi-tenant environment on a large cluster. In this talk, we will share our experience enabling speculative execution to achieve a good reduction in job elapsed time while keeping overhead minimal.
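The core speculation decision can be sketched as a toy model: compare each running task's elapsed time against the median runtime of already-finished tasks, and duplicate only clear stragglers (the 1.5x multiplier and the numbers below are illustrative, not Spark's defaults):

```python
# Toy model of the speculative-execution decision: if a task has run
# much longer than the median runtime of finished tasks in its stage,
# launch a duplicate and keep whichever copy finishes first.
# The multiplier and runtimes are illustrative, not Spark's defaults.
import statistics

def pick_speculation_candidates(finished_runtimes, running, multiplier=1.5):
    """Return ids of running tasks whose elapsed time exceeds
    multiplier * median(finished task runtimes)."""
    median = statistics.median(finished_runtimes)
    return [task_id for task_id, elapsed in running.items()
            if elapsed > multiplier * median]

finished = [10, 11, 12, 12, 13]          # seconds, completed tasks in the stage
running = {"task-7": 40, "task-8": 14}   # elapsed time of still-running tasks

candidates = pick_speculation_candidates(finished, running)
print(candidates)  # only the clear straggler gets a speculative copy
```

Tuning, as the talk describes, amounts to choosing the multiplier and related thresholds so that stragglers are duplicated early without flooding the cluster with fruitless copies.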
Choosing an HDFS data storage format - Avro vs. Parquet and more - StampedeCon... (StampedeCon)
At the StampedeCon 2015 Big Data Conference: Picking your distribution and platform is just the first decision of many you need to make in order to create a successful data ecosystem. In addition to things like replication factor and node configuration, the choice of file format can have a profound impact on cluster performance. Each of the data formats have different strengths and weaknesses, depending on how you want to store and retrieve your data. For instance, we have observed performance differences on the order of 25x between Parquet and Plain Text files for certain workloads. However, it isn’t the case that one is always better than the others.
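One reason format choice matters so much for scans can be sketched in a few lines: in a row-oriented layout a single-column query touches every field, while a columnar layout touches only the queried column (a toy illustration with made-up data, not a benchmark of any real format):

```python
# Sketch of why columnar formats (Parquet, ORC) can beat row-oriented
# ones (plain text, Avro) for analytic scans: a query over one column
# touches only that column's values. Toy data; counts are illustrative.

rows = [{"id": i, "name": f"user{i}", "price": i % 100} for i in range(1000)]

# Row layout: every field of every row is read to answer the query.
values_read_row = sum(len(r) for r in rows)

# Columnar layout: the same data stored column by column.
columns = {key: [r[key] for r in rows] for key in rows[0]}
values_read_col = len(columns["price"])  # only the queried column is scanned

print(values_read_row)  # 3000 values touched
print(values_read_col)  # 1000 values touched
```

Real columnar readers add compression and encoding per column on top of this, which is where order-of-magnitude gaps like the 25x figure can come from for the right workloads.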
Storing State Forever: Why It Can Be Good For Your Analytics (Yaroslav Tkachenko)
State is an essential part of modern streaming pipelines: it enables a variety of foundational capabilities like windowing, aggregation, and enrichment. But usually, state is either transient, kept only until the window is closed, or fairly small and slow-growing. What if we treat state differently? The keyed state in Flink can be scaled vertically and horizontally, and it's reliable and fault-tolerant... so is scaling a stateful Flink application that different from scaling any data store like Kafka or MySQL?
At Shopify, we've worked on a massive analytical data pipeline that needed to support complex streaming joins and correctly handle arbitrarily late-arriving data. We came up with the idea of never clearing state and supporting joins this way. We built a successful proof of concept, ingested all historical transactional Shopify data, and ended up storing more than 10 TB of Flink state. In the end, it allowed us to achieve 100% data correctness.
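The never-clear-state join can be sketched in pure Python: buffer every event per key on both sides indefinitely, so an arbitrarily late arrival still finds its match (an illustration of the idea, not Flink's keyed-state API; the names and events are made up):

```python
# Sketch of a "never clear state" streaming join: buffer every event
# per key on both sides forever, so an arbitrarily late-arriving event
# still finds its match. Pure-Python illustration of the idea, not
# Flink's keyed-state API.
from collections import defaultdict

class ForeverJoin:
    def __init__(self):
        self.left = defaultdict(list)   # per-key state, never cleared
        self.right = defaultdict(list)

    def on_left(self, key, event):
        self.left[key].append(event)
        # Emit joins with everything already buffered on the other side.
        return [(event, r) for r in self.right[key]]

    def on_right(self, key, event):
        self.right[key].append(event)
        return [(l, event) for l in self.left[key]]

join = ForeverJoin()
join.on_left("order-1", {"status": "created"})
late = join.on_right("order-1", {"payment": "ok"})   # arrives much later
print(late)  # [({'status': 'created'}, {'payment': 'ok'})]
```

The trade-off is exactly the one the abstract names: state grows without bound (10 TB at Shopify), so the state backend has to be treated as a scalable data store in its own right.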
In KDD2011, Vijay Narayanan (Yahoo!) and Milind Bhandarkar (Greenplum Labs, EMC) conducted a tutorial on "Modeling with Hadoop". This is the first half of the tutorial.
In the session, we discussed the end-to-end working of Apache Airflow, focusing mainly on the "why, what, and how". It covers DAG creation and implementation, architecture, and pros and cons. It also covers how a DAG is created for scheduling a job and the steps required to create a DAG using a Python script, and it finishes with a working demo.
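What an Airflow DAG fundamentally encodes, tasks plus dependencies executed in an order where every upstream task runs first, can be sketched with the standard library's topological sorter (an illustration of the scheduling idea, not the Airflow API; the task names are made up):

```python
# Sketch of what an Airflow DAG encodes: tasks plus dependencies,
# executed in an order where every upstream task runs before its
# downstreams. Standard-library topological sort, not the Airflow API.
from graphlib import TopologicalSorter

# extract >> transform >> load, with a separate validate step after extract
dag = {
    "transform": {"extract"},   # each task maps to its upstream tasks
    "validate": {"extract"},
    "load": {"transform", "validate"},
}

order = list(TopologicalSorter(dag).static_order())
print(order)  # e.g. ['extract', 'transform', 'validate', 'load']
```

In Airflow itself the same structure is declared with operators and the `>>` dependency syntax inside a Python script, and the scheduler runs each task once its upstreams have succeeded.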
At Salesforce, we have deployed many thousands of HBase/HDFS servers, and learned a lot about tuning during this process. This talk will walk you through the many relevant HBase, HDFS, Apache ZooKeeper, Java/GC, and Operating System configuration options and provides guidelines about which options to use in what situation, and how they relate to each other.
Dache - a data-aware cache system for big-data applications using the MapReduce framework.
Dache aims to extend the MapReduce framework and provision a cache layer for efficiently identifying and accessing cache items in a MapReduce job.
Apache Hadoop YARN: Understanding the Data Operating System of Hadoop (Hortonworks)
This deck covers concepts and motivations behind Apache Hadoop YARN, the key technology in Hadoop 2 to deliver a Data Operating System for the enterprise.
Apache Hadoop has made giant strides since the last Hadoop Summit: the community has released hadoop-1.0 after nearly 6 years and is now on the cusp of Hadoop.next (think of it as hadoop-2.0). With the next generation of MR out in 0.23.0 and 0.23.1, a new set of features has been requested in the community. In this talk we will cover the next set of features, such as preemption, web services, and near-real-time analysis, and how we are working to tackle them in the near future. We will also cover the roadmap and timelines for Next Gen MapReduce, along with the release schedule for Apache Hadoop.
Hadoop World 2011: Next Generation Apache Hadoop MapReduce - Mahadev Konar, H... (Cloudera, Inc.)
The Apache Hadoop MapReduce framework has hit a scalability limit around 4,000 machines. In this session, we will present the architecture and design of the next generation of MapReduce and delve into the details of the architecture that make it much easier to innovate. The architecture will have built-in HA, security, and multi-tenancy to support many users on larger clusters. It will also increase innovation, agility, and hardware utilization. We will also present large-scale and small-scale comparisons against MRv1 on some benchmarks.
Yarn Resource Management Using Machine Learning (ojavajava)
HadoopCon 2016 in Taiwan - Maximizing the utilization of Hadoop computing power is the biggest challenge for Hadoop administrators. In this talk I will explain how we use machine learning to build a prediction model for computing power requirements and set the MapReduce scheduler parameters dynamically, to fully utilize our Hadoop cluster's computing power.
Scaling Spark Workloads on YARN - Boulder/Denver July 2015 (Mac Moore)
Hortonworks Presentation at The Boulder/Denver BigData Meetup on July 22nd, 2015. Topic: Scaling Spark Workloads on YARN. Spark as a workload in a multi-tenant Hadoop infrastructure, scaling, cloud deployment, tuning.
Scale 12 x Efficient Multi-tenant Hadoop 2 Workloads with YARN (David Kaiser)
Hadoop is about so much more than batch processing. With the recent release of Hadoop 2, there have been significant changes to how a Hadoop cluster uses resources. YARN, the new resource management component, allows for a more efficient mix of workloads across hardware resources, and enables new applications and new processing paradigms such as stream-processing. This talk will discuss the new design and components of Hadoop 2, and examples of Modern Data Architectures that leverage Hadoop for maximum business efficiency.
GraphRAG is All You Need? LLM & Knowledge Graph (Guy Korland)
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
State of ICS and IoT Cyber Threat Landscape Report 2024 preview (Prayukth K V)
The IoT and OT threat landscape report was prepared by the Threat Research Team at Sectrio using data from Sectrio's cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors and newer malware, including new variants and latent threats at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf (Peter Spielvogel)
Building better applications for business users with SAP Fiori.
• What is SAP Fiori and why it matters to you
• How a better user experience drives measurable business benefits
• How to get started with SAP Fiori today
• How SAP Fiori elements accelerates application development
• How SAP Build Code includes SAP Fiori tools and other generative artificial intelligence capabilities
• How SAP Fiori paves the way for using AI in SAP apps
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo... (James Anderson)
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. A constant focus on speed to market, combined with traditionally slow and manual security checks, has created gaps in continuous security, an important piece of the software supply chain. Today, organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their application supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Removing Uninteresting Bytes in Software FuzzingAftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speedup fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing xml documents, and Binutil's readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format). Our preliminary results show that AFL+DIAR does not only discover new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
The Metaverse and AI: how can decision-makers harness the Metaverse for their...Jen Stirrup
The Metaverse is popularized in science fiction, and now it is becoming closer to being a part of our daily lives through the use of social media and shopping companies. How can businesses survive in a world where Artificial Intelligence is becoming the present as well as the future of technology, and how does the Metaverse fit into business strategy when futurist ideas are developing into reality at accelerated rates? How do we do this when our data isn't up to scratch? How can we move towards success with our data so we are set up for the Metaverse when it arrives?
How can you help your company evolve, adapt, and succeed using Artificial Intelligence and the Metaverse to stay ahead of the competition? What are the potential issues, complications, and benefits that these technologies could bring to us and our organizations? In this session, Jen Stirrup will explain how to start thinking about these technologies as an organisation.
A tale of scale & speed: How the US Navy is enabling software delivery from l...sonjaschweigert1
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATO’s (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
Elevating Tactical DDD Patterns Through Object CalisthenicsDorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
Enhancing Performance with Globus and the Science DMZGlobus
ESnet has led the way in helping national facilities—and many other institutions in the research community—configure Science DMZs and troubleshoot network issues to maximize data transfer performance. In this talk we will present a summary of approaches and tips for getting the most out of your network infrastructure using Globus Connect Server.
Securing your Kubernetes cluster_ a step-by-step guide to success !KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
2. Been there, done that. (Photo: CC BY 2.0 / Richard Bumgardner)
3. Agenda
• Why YARN?
• YARN Architecture and Concepts
• Resources & Scheduling
– Capacity Scheduler
– Fair Scheduler
• Configuring the Fair Scheduler
• Managing Running Jobs
5. The 1st Generation of Hadoop: Batch
HADOOP 1.0: Built for Web-Scale Batch Apps
(Diagram: separate single-app clusters, each running one workload, BATCH, INTERACTIVE, or ONLINE, on its own HDFS)
• All other usage patterns must leverage that same infrastructure
• Forces the creation of silos for managing mixed workloads
7. MapReduce Classic: Limitations
• Scalability
– Maximum cluster size: 4,000 nodes
– Maximum concurrent tasks: 40,000
– Coarse synchronization in the JobTracker
• Availability
– Failure kills all queued and running jobs
• Hard partition of resources into map and reduce slots
– Low resource utilization
• Lacks support for alternate paradigms and services
– Iterative applications implemented using MapReduce are 10x slower
8. Our Vision: Hadoop as Next-Gen Platform
• HADOOP 1.0: single-use system for batch apps
– MapReduce (cluster resource management & data processing)
– HDFS (redundant, reliable storage)
• HADOOP 2.0: multi-purpose platform for batch, interactive, online, streaming, …
– MapReduce and others (data processing)
– YARN (cluster resource management)
– HDFS2 (redundant, reliable storage)
9. YARN: Taking Hadoop Beyond Batch
• YARN (cluster resource management) on HDFS2 (redundant, reliable storage) supports many application types:
– BATCH (MapReduce), INTERACTIVE (Tez), STREAMING (Storm, S4, …), GRAPH (Giraph), IN-MEMORY (Spark), HPC MPI (OpenMPI), ONLINE (HBase), OTHER (Search, Weave, …)
• Store ALL DATA in one place…
• Interact with that data in MULTIPLE WAYS, with predictable performance and quality of service
• Applications run natively IN Hadoop
10. Why YARN / MR2?
• Scalability
– The JobTracker kept track of individual tasks and wouldn’t scale
• Utilization
– All slots are equal even if the work is not equal
• Multi-tenancy
– Every framework shouldn’t need to write its own execution engine
– All frameworks should share the resources on a cluster
11. Multiple levels of scheduling
• YARN
– Which application (framework) to give resources to?
• Application (framework, e.g. MR)
– Which task within the application should use these resources?
13. YARN Concepts
• Application
– An application is a job submitted to the framework
– Example: a MapReduce job
• Container
– Basic unit of allocation
– Fine-grained resource allocation across multiple resource types (memory, cpu, disk, network, gpu, etc.)
• container_0 = 2GB, 1 CPU
• container_1 = 1GB, 6 CPU
– Replaces the fixed map/reduce slots
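A container request is just a resource vector. A minimal sketch of the idea (the `Resource` class and field names are illustrative, not YARN's actual API):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Resource:
    """A resource vector: the basic currency of allocation."""
    memory_mb: int
    vcores: int

# Containers no longer come in fixed map/reduce slot sizes;
# each one is sized independently, as in the slide's example:
container_0 = Resource(memory_mb=2048, vcores=1)  # 2GB, 1 CPU
container_1 = Resource(memory_mb=1024, vcores=6)  # 1GB, 6 CPU
```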
15. Architecture
• ResourceManager
– Global resource scheduler
– Hierarchical queues
• NodeManager
– Per-machine agent
– Manages the life-cycle of containers
– Container resource monitoring
• ApplicationMaster
– Per-application
– Manages application scheduling and task execution
– E.g. the MapReduce ApplicationMaster
16. Design Centre
• Split up the two major functions of the JobTracker
– Cluster resource management
– Application life-cycle management
• MapReduce becomes a user-land library
23. Container Types
• DefaultContainerExecutor
– Unix process-based executor, using ulimit
• LinuxContainerExecutor
– Linux container-based executor, using cgroups
• Choose one based on the isolation level you need
26. Resource Model and Capacities
• Resource vectors
– e.g. 1024 MB, 2 vcores, …
– No more task slots!
• Nodes specify the amount of resources they have
– yarn.nodemanager.resource.memory-mb
– yarn.nodemanager.resource.cpu-vcores
• vcores relate to physical cores; they are not really “virtual”
27. Resources and Scheduling
• What you request is what you get
– No more fixed-size slots
– The framework/application requests resources for a task
• The MR ApplicationMaster requests resources for map and reduce tasks; these requests can potentially be for different amounts of resources
28. YARN Scheduling
(Diagram: a ResourceManager, ApplicationMasters 1 and 2, and Nodes 1–3)
• AM1: “I want 2 containers with 1024 MB and 1 core each.” RM: “Noted.”
• RM: “I’ll reserve some space on Node 1 for AM1.”
• AM1 heartbeats: “I’m still here. Got anything for me?”
• RM: “Here’s a security token to let you launch a container on Node 1.”
• AM1 to Node 1: “Hey, launch my container with this shell command.” A container is launched.
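The exchange on this slide can be sketched as a toy request/heartbeat loop. Everything here (class names, the token format, node capacities) is invented for illustration; real YARN uses RPC protocols between the ApplicationMaster, ResourceManager, and NodeManagers:

```python
class ResourceManager:
    """Toy RM: queues resource asks, answers heartbeats with grants."""

    def __init__(self, nodes):
        self.free = {n: list(cap) for n, cap in nodes.items()}  # node -> [mem_mb, cores]
        self.asks = []                                          # outstanding requests
        self.next_token = 0

    def request(self, num, mem_mb, cores):
        # AM: "I want `num` containers with mem_mb and cores each" -> "Noted"
        self.asks.extend([(mem_mb, cores)] * num)

    def heartbeat(self):
        # AM: "I'm still here -- got anything for me?"  The RM replies with
        # (node, security token) pairs the AM can use to launch containers.
        grants, remaining = [], []
        for mem, cores in self.asks:
            for node, cap in self.free.items():
                if cap[0] >= mem and cap[1] >= cores:
                    cap[0] -= mem
                    cap[1] -= cores
                    grants.append((node, f"token-{self.next_token}"))
                    self.next_token += 1
                    break
            else:
                remaining.append((mem, cores))   # keep asking next heartbeat
        self.asks = remaining
        return grants

rm = ResourceManager({"node1": (2048, 2), "node2": (1024, 1)})
rm.request(num=2, mem_mb=1024, cores=1)    # the ask from the slide
grants = rm.heartbeat()                    # both requests fit, so two grants come back
```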
29. YARN Schedulers
• Same as MR1
• FIFO Scheduler
– Processes jobs in order
• Fair Scheduler
– Fair to all users; supports dominant resource fairness
• Capacity Scheduler
– Queue shares as a percentage of the cluster
– FIFO scheduling within each queue
– Supports preemption
• Default is the Capacity Scheduler
31. YARN Capacity Scheduler
• Configuration in capacity-scheduler.xml
• Take some time to set up your queues!
• Queues have per-queue ACLs to restrict queue access
– Access can be dynamically changed
• Elasticity can be limited on a per-queue basis
– Use yarn.scheduler.capacity.<queue-path>.maximum-capacity
• Use yarn.scheduler.capacity.<queue-path>.state to drain queues
– “Decommissioning” a queue
• Use yarn rmadmin -refreshQueues to make runtime changes
32. YARN Fair Scheduler
• The Fair Scheduler is the default YARN scheduler in CDH5
• The only YARN scheduler that Cloudera recommends for production clusters
• Provides fine-grained resource allocation for multiple resource types
– Memory (by default)
– CPU (optional)
33. Goals of the Fair Scheduler
• Should allow short interactive jobs to coexist with long production jobs
• Should allow resources to be controlled proportionally
• Should ensure that the cluster is efficiently utilized
34. The Fair Scheduler
• The Fair Scheduler promotes fairness between schedulable entities
• The Fair Scheduler awards resources to pools that are most underserved
– Gives a container to the pool that has the fewest resources allocated
35. Fair Scheduler Pools
• Each job is assigned to a pool
– Also known as a queue in YARN terminology
• All pools in YARN descend from the root pool
• Physical resources are not bound to any specific pool
• Pools can be predefined or defined dynamically by specifying a pool name when you submit a job
• Pools and subpools are defined in the fair-scheduler.xml file
(Diagram: a 30GB total split evenly, 15GB to pool Alice and 15GB to pool Bob)
36. In Which Pool Will a Job Run?
• The default pool for a job is root.username
– For example, root.Alice and root.Bob
– You can drop root when referring to a pool
• For example, you can refer to root.Alice simply as Alice
• Jobs can be assigned to arbitrarily-named pools
– To specify the pool name when submitting a MapReduce job, use
• -D mapreduce.job.queuename
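The default placement rule above can be sketched in a few lines. This is a toy version of the behaviour the slide describes (the function name and signature are invented for illustration):

```python
def assigned_pool(user, requested=None):
    """Toy pool placement: a job lands in root.<username> unless a pool
    name was given at submission; the 'root.' prefix may be dropped when
    referring to a pool, so normalise it back on."""
    name = requested if requested else user
    return name if name.startswith("root.") else f"root.{name}"

assigned_pool("Alice")                  # -> "root.Alice"
assigned_pool("Bob", requested="prod")  # -> "root.prod"
```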
37. When Will a Job Run Within a Pool?
• The Fair Scheduler grants resources to a pool, but which job’s task will get resources?
• The policies for assigning resources to jobs within a pool are defined in fair-scheduler.xml
• The Fair Scheduler uses three techniques for prioritizing jobs within pools:
– Single resource fairness
– Dominant resource fairness
– FIFO
• You can also configure the Fair Scheduler to delay assignment of resources when a preferred rack or node is not available
38. Single Resource Fairness
• Single resource fairness
– Is the default Fair Scheduler policy
– Schedules jobs using memory
• Example
– Two pools: Alice has 15GB allocated, and Bob has 5GB
– Both pools request a 10GB container of memory
– Bob has fewer resources and will be granted the next 10GB that becomes available
(Diagram: of a 30GB total, Alice holds 15GB and Bob holds 5GB, with a 10GB request pending)
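The policy above reduces to a one-liner: give the next container to the pool with the least memory allocated. A minimal sketch (names are illustrative):

```python
def next_pool(allocated_mb):
    """Single resource fairness sketch: the next container goes to the
    pool with the least memory currently allocated."""
    return min(allocated_mb, key=allocated_mb.get)

# The slide's example: Alice holds 15GB, Bob holds 5GB.
pools = {"Alice": 15 * 1024, "Bob": 5 * 1024}
winner = next_pool(pools)   # -> "Bob"
```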
39. Adding Pools Redistributes Resources
• The user Charlie now submits a job to a new pool
– Resource allocations are adjusted
– Each pool receives a fair share of cluster resources
(Diagram: the 30GB total now splits 10GB each to Alice, Bob, and Charlie)
40. Determining the Fair Share
• The fair share of resources assigned to the pool is based on
– The total resources available across the cluster
– The number of pools competing for cluster resources
• Excess cluster capacity is spread across all pools
– The aim is to maintain the most even allocation possible so every pool receives its fair share of resources
• The fair share will never be higher than the actual demand
• Pools can use more than their fair share when other pools are not in need of resources
– This happens when there are no tasks eligible to run in other pools
41. Minimum Resources
• A pool with minimum resources defined receives priority during resource allocation
• The minimum resources, minResources, are the minimum amount of resources that must be allocated to the pool prior to fair share allocation
– Minimum resources are allocated to each pool assuming there is cluster capacity
– Pools that have minimum resources specified will receive priority in resource assignment
42. Minimum Resource Allocation Example
• First, fill up the Production pool to the 20GB minimum guarantee
• Then distribute the remaining 10GB evenly across Alice and Bob
(Diagram: 30GB total. Production demands 100GB with a 20GB minResources guarantee; Alice and Bob demand 30GB and 25GB. Result: Production 20GB, Alice 5GB, Bob 5GB.)
43. Minimum Resource Allocation Example 2: Production Pool Empty
• Production has no demand, so no resources are allocated to it
• All resources are allocated evenly between Alice and Bob
(Diagram: 30GB total. Production demands 0GB despite its 20GB minResources; Alice and Bob demand 30GB and 25GB. Result: Production 0GB, Alice 15GB, Bob 15GB.)
44. Minimum Resource Allocation Example 3: MinResources Exceed Resources
• The combined minResources of Production and Research exceed capacity
• Minimum resources are assigned proportionally based on the defined minResources until available resources are exhausted
• No memory remains for pools without minResources defined (i.e., Bob)
(Diagram: 30GB total. Production demands 100GB with minResources 50GB; Research has minResources 25GB; Research and Bob demand 30GB and 25GB. Result: Production 20GB, Research 10GB, Bob 0GB.)
45. Minimum Resource Allocation Example 4: MinResources < Fair Share
• Production is filled to minResources
• The remaining 25GB is distributed across all pools
• The Production pool receives more than its minResources, to maintain fairness
(Diagram: 30GB total. Production demands 100GB with minResources 5GB; Alice and Bob demand 30GB and 25GB. Result: 10GB each.)
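The examples above follow a common pattern that can be sketched as water-filling: raise a shared "level" and give each pool min(demand, max(minResources, level)) until the cluster is used up. This is a simplification of the real Fair Scheduler computation (names and units here are illustrative), shown only to reproduce the slide arithmetic:

```python
def fair_shares(total, pools):
    """pools maps name -> (demand, min_resources), all in GB.

    Water-filling sketch: every pool receives min(demand, max(minres, L))
    for a common level L, found by bisection so shares sum to the cluster
    total. A simplification of the Fair Scheduler's real math.
    """
    def used(level):
        return sum(min(d, max(m, level)) for d, m in pools.values())

    if used(0) >= total:
        # Combined minResources exceed capacity (Example 3):
        # scale the minimums proportionally instead.
        scale = total / sum(m for _, m in pools.values())
        return {n: m * scale for n, (_, m) in pools.items()}

    lo, hi = 0.0, float(total)
    for _ in range(50):                 # bisect on the water level
        mid = (lo + hi) / 2
        if used(mid) < total:
            lo = mid
        else:
            hi = mid
    return {n: min(d, max(m, hi)) for n, (d, m) in pools.items()}

# Example 1: Production (demand 100GB, minResources 20GB) is filled first,
# then the remaining 10GB splits evenly between Alice and Bob:
fair_shares(30, {"Prod": (100, 20), "Alice": (30, 0), "Bob": (25, 0)})
# -> roughly {'Prod': 20, 'Alice': 5, 'Bob': 5}
```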
46. Pools with Weights
• Instead of (or in addition to) setting minResources, pools can be assigned a weight
• Pools with higher weight receive more resources during allocation
• “Even water glass height” analogy:
– Think of the weight as controlling the “width” of the glass
47. Example: Pool with Double Weight
• Production is filled to minResources (5GB)
• The remaining 25GB is distributed across all pools
• The Bob pool receives twice the amount of memory during fair share allocation
(Diagram: 30GB total. Production demands 100GB with minResources 5GB; Alice and Bob demand 30GB and 25GB; Bob has weight 2. Result: Production 8GB, Alice 8GB, Bob 14GB.)
48. Dominant Resource Fairness
• The Fair Scheduler can be configured to schedule with both memory and CPU using dominant resource fairness
• Scenario #1:
– Alice has 6GB and 3 cores, and Bob has 4GB and 2 cores. Which pool receives the next resource allocation?
• Bob will receive the next container because it has less memory and fewer CPU cores allocated than Alice
(Diagram: Alice usage 6GB, 3 cores; Bob usage 4GB, 2 cores)
49. Dominant Resource Fairness Example
• Scenario #2:
– A cluster has 10GB of total memory and 20 cores
– Pool Alice has containers granted for 4GB of memory and 5 cores
– Pool Bob has containers granted for 1GB of memory and 10 cores
• Alice will receive the next container because its 40% dominant share of memory is less than the Bob pool’s 50% dominant share of CPU
(Diagram: Alice usage 4GB = 40% capacity and 5 cores = 25% capacity; Bob usage 1GB = 10% capacity and 10 cores = 50% capacity)
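The DRF comparison above can be sketched directly: a pool's dominant share is its largest fraction of any single resource, and the next container goes to the pool whose dominant share is smallest. A simplified sketch of the policy (names are illustrative):

```python
def dominant_share(usage, capacity):
    """A pool's dominant share: its largest fraction of any one resource."""
    return max(usage[r] / capacity[r] for r in capacity)

def next_pool(usages, capacity):
    """DRF sketch: grant the next container to the pool whose dominant
    share is smallest (a simplification of the Fair Scheduler's DRF)."""
    return min(usages, key=lambda p: dominant_share(usages[p], capacity))

# Scenario #2 from the slide: a 10GB / 20-core cluster.
capacity = {"mem_gb": 10, "cores": 20}
usages = {
    "Alice": {"mem_gb": 4, "cores": 5},   # dominant share: 40% (memory)
    "Bob":   {"mem_gb": 1, "cores": 10},  # dominant share: 50% (CPU)
}
next_pool(usages, capacity)   # -> "Alice"
```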
50. Achieving Fair Share: The Patient Approach
• If shares are imbalanced, pools which are over their fair share may not be assigned new tasks when their old ones complete
– Those resources then become available to pools which are operating below their fair share
• However, waiting patiently for a task in another pool to finish may not be acceptable in a production environment
– Tasks could take a long time to complete
51. Achieving Fair Share: The Brute Force Approach
• With preemption enabled, the Fair Scheduler actively kills tasks that belong to pools operating over their fair share
– Pools operating below fair share receive those reaped resources
• There are two types of preemption available
– Minimum share preemption
– Fair share preemption
• The preemption code avoids killing a task in a pool if doing so would cause that pool to begin preempting tasks in other pools
– This prevents a potentially endless cycle of pools killing one another’s tasks
52. Minimum Share Preemption
• Pools with a minResources configured are operating on an SLA (Service Level Agreement)
• Pools that are below their minimum share as defined by minResources can preempt tasks in other pools
– Set minSharePreemptionTimeout to the number of seconds the pool is under its minimum share before preemption should begin
– Default is infinite (Java’s Long.MAX_VALUE)
53. Fair Share Preemption
• Pools not receiving their fair share can preempt tasks in other pools
– Only pools that exceed their fair share are candidates for preemption
• Use fair share preemption conservatively
– Set fairSharePreemptionTimeout to the number of seconds a pool is under fair share before preemption should begin
– Default is infinite (Java’s Long.MAX_VALUE)
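Both preemption types gate on the same timeout check: a pool may start preempting only after it has been under its (minimum or fair) share for longer than the configured timeout. A toy sketch of that check (function and parameter names are invented for illustration):

```python
def should_preempt(now, below_since, timeout_s):
    """True once a pool has been under its share longer than the timeout.
    below_since is None while the pool is at or above its share. Because
    the default timeouts are effectively infinite (Long.MAX_VALUE),
    preemption stays off until a timeout is explicitly configured."""
    return below_since is not None and (now - below_since) > timeout_s

# A pool under its minResources since t=100s, minSharePreemptionTimeout=60:
should_preempt(now=200, below_since=100, timeout_s=60)   # -> True
should_preempt(now=130, below_since=100, timeout_s=60)   # -> False
```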
55. Configuring Fair Scheduler Capabilities (1)
• yarn.scheduler.fair.allow-undeclared-pools (yarn-site.xml)
– When true, new pools can be created at application submission time or by the user-as-default-queue property. When false, submitting to a pool that is not specified in the fair-scheduler.xml file causes the application to be placed in the “default” pool. Default: true. Ignored if a pool placement policy is defined in the fair-scheduler.xml file.
• yarn.scheduler.fair.preemption (yarn-site.xml)
– Enables preemption in the Fair Scheduler. Set to true if you have pools that must operate on an SLA. Default: false.
• yarn.scheduler.fair.user-as-default-queue (yarn-site.xml)
– Send jobs to pools based on users’ names instead of to the default pool, root.default. Default: true.
56. Configuring Fair Scheduler Capabilities (2)
• yarn.scheduler.fair.locality.threshold.node and yarn.scheduler.fair.locality.threshold.rack (yarn-site.xml)
– For applications that request containers on particular nodes or racks: the number of scheduling opportunities since the last container assignment to wait before accepting a placement on another node. Expressed as a float between 0 and 1, which, as a fraction of the cluster size, is the number of scheduling opportunities to pass up. Default: 1 (don’t pass up any scheduling opportunities).
• Example: yarn.scheduler.fair.locality.threshold.node = 0.02 on a 100-node cluster. At most 2 scheduling opportunities can be skipped when the preferred placement cannot be met.
57. Configuring Resource Allocation for Pools and Users (1)
• You configure Fair Scheduler pools in the /etc/hadoop/conf/fair-scheduler.xml file
• The Fair Scheduler rereads this file every 10 seconds
– A ResourceManager restart is not required when the file changes
• The fair-scheduler.xml file must contain an <allocations> element
• Use the <queue> element to configure resource allocation for a pool
• Use the <user> element to configure resource allocation for a user across multiple pools
58. Configuring Resource Allocation for Pools and Users (2)
• To specify resource allocations, use the <queue> or <user> element with any or all of the following subelements
– <minResources>
• The minimum resources to which the pool is entitled
• Format is x mb, y vcores
• Example: 10000 mb, 5 vcores
– <maxResources>
• The maximum resources to which the pool is entitled
• Format is x mb, y vcores
59. Configuring Resource Allocation for Pools and Users (3)
• Additional sub-elements of <queue> or <user> to use when specifying resource allocations
– <maxRunningApps>
• The maximum number of applications in the pool that can run concurrently
– <weight>
• Used for non-proportionate sharing with other pools
• The default is 1
– <minSharePreemptionTimeout>
• Time to wait before preempting tasks
– <schedulingPolicy>
• SRF for single resource fairness (the default)
• DRF for dominant resource fairness
• FIFO for first-in, first-out
60. fair-scheduler.xml Example (1)
• Allow users to run three jobs, but allow Bob to run six jobs

<?xml version="1.0"?>
<allocations>
  <userMaxAppsDefault>3</userMaxAppsDefault>
  <user name="bob">
    <maxRunningApps>6</maxRunningApps>
  </user>
</allocations>
61. fair-scheduler.xml Example (2)
• Add a fair share preemption timeout

<?xml version="1.0"?>
<allocations>
  <userMaxAppsDefault>3</userMaxAppsDefault>
  <user name="bob">
    <maxRunningApps>6</maxRunningApps>
  </user>
  <fairSharePreemptionTimeout>300</fairSharePreemptionTimeout>
</allocations>
62. fair-scheduler.xml Example (3)
• Define the production pool with a weight of 2 and a minimum resource allocation of 10000 MB and 1 vcore

<?xml version="1.0"?>
<allocations>
  <userMaxAppsDefault>3</userMaxAppsDefault>
  <queue name="production">
    <minResources>10000 mb, 1 vcores</minResources>
    <weight>2.0</weight>
  </queue>
</allocations>
63. fair-scheduler.xml Example (4)
• Add an SLA to the production pool

<?xml version="1.0"?>
<allocations>
  <userMaxAppsDefault>3</userMaxAppsDefault>
  <queue name="production">
    <minResources>10000 mb, 1 vcores</minResources>
    <weight>2.0</weight>
    <minSharePreemptionTimeout>60</minSharePreemptionTimeout>
  </queue>
</allocations>
64. The Fair Scheduler User Interface
• http://<resource_manager_host>:8088/cluster/scheduler
66. Displaying Jobs
• To view jobs currently running on the cluster
– yarn application -list
– Lists all running jobs, including the application ID for each
• To view all jobs on the cluster, including completed jobs
– yarn application -list -appStates ALL
• To display the status of an individual job
– yarn application -status <application_ID>
• You can also use the ResourceManager Web UI, Hue, Ambari, or Cloudera Manager to display jobs
67. Killing Jobs
• It is important to note that once a user has submitted a job, they cannot stop it just by hitting CTRL-C in their terminal
– That only stops job output from appearing on the user’s console
– The job is still running on the cluster!
• To kill a job running on the cluster
– yarn application -kill <application_ID>
• You can also kill a job from Cloudera Manager