This document provides an overview of Hadoop internals:
- Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of computers.
- It describes the key components of Hadoop: HDFS for storage, YARN for resource management, and MapReduce as the programming model.
- It explains the execution of MapReduce jobs, including the map and reduce phases, and how YARN assigns tasks to nodes with a focus on data locality.
Apache Hadoop: design and implementation. Lecture in the Big data computing course (http://twiki.di.uniroma1.it/twiki/view/BDC/WebHome), Department of Computer Science, Sapienza University of Rome.
Hadoop Internals
1. Introduction YARN MapReduce Conclusion
Hadoop Internals
Emilio Coppa
April 29, 2014
Big Data Computing
Master of Science in Computer Science
1 / 49 Emilio Coppa Hadoop Internals
2. Hadoop Facts
Open-source software framework for storage and large-scale processing of datasets on clusters of commodity hardware.
MapReduce paradigm: “Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages” (Dean & Ghemawat, Google, 2004)
First released in 2005 by Doug Cutting (Yahoo) and Mike Cafarella (U. Michigan)
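The Lisp-inspired paradigm the quote refers to can be illustrated with a minimal sketch in Python (not Hadoop code): map emits <word, 1> pairs, the pairs are grouped by key, and reduce folds each group into a count.

```python
from functools import reduce
from itertools import groupby

# Toy word count in the MapReduce style: the "map" step emits
# (word, 1) pairs, the pairs are grouped by key, and the "reduce"
# step sums the values of each group.
docs = ["hadoop stores data", "hadoop processes data"]

mapped = [(word, 1) for doc in docs for word in doc.split()]
grouped = groupby(sorted(mapped), key=lambda kv: kv[0])
counts = {word: reduce(lambda acc, kv: acc + kv[1], pairs, 0)
          for word, pairs in grouped}
# counts == {'data': 2, 'hadoop': 2, 'processes': 1, 'stores': 1}
```

Hadoop distributes exactly this pattern: the map calls run on many nodes, and the grouping step becomes the shuffle described later in the deck.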
3. Hadoop Facts (2)
2.5 million LOC: Java (47%), XML (36%)
681 years of estimated effort (COCOMO)
Organized in 4 projects: Common, HDFS, YARN, MapReduce
81 contributors
4–8. Hadoop Facts (3) – Top Contributors
Analyzing the top 10 contributors...
1. HortonWorks (“We Do Hadoop”): 6 contributors
2. Cloudera (“Ask Big Questions”): 3 contributors
3. Yahoo: 1 contributor
Doug Cutting currently works at Cloudera.
10–14. Apache Hadoop Architecture
Cluster: set of host machines (nodes). Nodes may be partitioned in racks. This is the hardware part of the infrastructure.
YARN: Yet Another Resource Negotiator – the framework responsible for providing the computational resources (e.g., CPUs, memory, etc.) needed for application execution.
HDFS: the framework responsible for providing permanent, reliable and distributed storage. This is typically used for storing inputs and outputs (but not intermediate results).
Storage: other alternative storage solutions. Amazon uses the Simple Storage Service (S3).
MapReduce: the software layer implementing the MapReduce paradigm. Notice that YARN and HDFS can easily support other frameworks (they are highly decoupled).
15. YARN Infrastructure: Yet Another Resource Negotiator
16–19. YARN Infrastructure: overview
YARN handles the computational resources (CPU, memory, etc.) of the cluster. The main actors are:
– Job Submitter: the client who submits an application
– Resource Manager: the master of the infrastructure
– Node Manager: a slave of the infrastructure
20. YARN Infrastructure: Node Manager
The Node Manager (NM) is the slave. When it starts, it announces itself to the RM. Periodically, it sends a heartbeat to the RM. Its resource capacity is the amount of memory and the number of vcores.
A container is a fraction of the NM capacity:
container := (amount of memory, # vcores)
# containers (on a NM) = yarn.nodemanager.resource.memory-mb / yarn.scheduler.minimum-allocation-mb
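The container-count formula above can be worked through with concrete numbers (the values below are hypothetical, chosen only for illustration; 1024 MB is the usual default for the minimum allocation):

```python
# Hypothetical NM configuration: 8192 MB offered to YARN and a
# 1024 MB minimum container allocation.
nodemanager_memory_mb = 8192  # yarn.nodemanager.resource.memory-mb
min_allocation_mb = 1024      # yarn.scheduler.minimum-allocation-mb

# Upper bound on how many minimum-sized containers this NM can host.
max_containers = nodemanager_memory_mb // min_allocation_mb
# max_containers == 8
```

Larger container requests simply consume several multiples of the minimum allocation, so the actual number of concurrent containers can be lower.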
21. YARN Infrastructure: Resource Manager
The Resource Manager (RM) is the master. It knows where the Node Managers are located (Rack Awareness) and how many resources (containers) they have. It runs several services; the most important is the Resource Scheduler.
22. YARN Infrastructure: Application Startup
1. a client submits an application to the RM
2. the RM allocates a container
3. the RM contacts the NM
4. the NM launches the container
5. the container executes the Application Master
23. YARN Infrastructure: Application Master
The AM is responsible for the execution of an application. It asks the Resource Scheduler (RM) for containers and executes specific programs (e.g., the main of a Java class) on the obtained containers. The AM is framework-specific.
The RM is a single point of failure in YARN. By using AMs, YARN spreads the metadata related to the running applications over the cluster.
=⇒ RM: reduced load & fast recovery
24. MapReduce Framework: Anatomy of a MR Job
25. MapReduce: Anatomy of a MR Job
Timeline of a MR Job execution:
Map Phase: several Map Tasks are executed
Reduce Phase: several Reduce Tasks are executed
The MRAppMaster is the director of the job.
26. MapReduce: what does the user give us?
A Job submitted by a user is composed of:
a configuration: if partial, global/default values are used
a JAR containing:
  a map() implementation
  a combiner implementation
  a reduce() implementation
input and output information:
  input directory: is it on HDFS? S3? How many files?
  output directory: where? HDFS? S3?
27. Map Phase: How many Map Tasks?
One Map Task for each input split (computed by the Job Submitter):

num_splits = 0
for each input file f:
    remaining = f.length
    while remaining / split_size > split_slope:
        num_splits += 1
        remaining -= split_size
    if remaining > 0:
        num_splits += 1    # one last split for the tail of the file

where:
split_slope = 1.1
split_size ≈ dfs.blocksize

mapreduce.job.maps is ignored in MRv2 (before, it was a hint)!
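The split computation above can be sketched as runnable Python. This is a simplification of what Hadoop's FileInputFormat actually does (the real code also considers min/max split sizes), but it shows the role of the 1.1 slope: a tail up to 10% larger than split_size is not split further.

```python
SPLIT_SLOPE = 1.1  # a tail up to 10% larger than split_size stays whole

def num_input_splits(file_lengths, split_size):
    """Count input splits the way the slide's pseudocode does:
    carve off split_size-sized chunks while the remainder is more
    than 1.1 splits, then emit one final split for the tail."""
    num_splits = 0
    for length in file_lengths:
        remaining = length
        while remaining / split_size > SPLIT_SLOPE:
            num_splits += 1
            remaining -= split_size
        if remaining > 0:
            num_splits += 1
    return num_splits

# A 300 MB file with 100 MB splits yields 3 splits; a 110 MB file
# yields a single (slightly oversized) split thanks to the slope.
```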
28. Map Phase: MapTask launch
The MRAppMaster immediately asks for the containers needed by all MapTasks:
=⇒ num_splits container requests
A container request for a MapTask tries to exploit data locality, preferring:
a node where the input split is stored
if not possible, a node in the same rack
if not possible, any other node
This is just a hint to the Resource Scheduler!
After a container has been assigned, the MapTask is launched.
29. Map Phase: Execution Overview
A possible execution scenario:
2 Node Managers (capacity: 2 containers each)
no other running applications
8 input splits
31. Map Phase: MapTask – Init
1. create a context (TaskAttemptContext)
2. create an instance of the user Mapper class
3. set up the input (InputFormat, InputSplit, RecordReader)
4. set up the output (NewOutputCollector)
5. create a mapper context (MapContext, Mapper.Context)
6. initialize the input, e.g.:
   create a SplitLineReader object
   create an HdfsDataInputStream object
32. Map Phase: MapTask – Execution
Mapper.Context.nextKeyValue() loads data from the input
Mapper.Context.write() writes the output to a circular buffer
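The read/map/write loop can be sketched in Python. The class and method names below are illustrative stand-ins for the Hadoop Java API, not the API itself; the in-memory list stands in for the input split and the circular buffer.

```python
class WordCountMapper:
    """Toy mapper: emits (word, 1) for every word in the value."""
    def map(self, key, value, context):
        for word in value.split():
            context.write(word, 1)

    def run(self, context):
        # The driver loop: pull records until the split is exhausted.
        while context.next_key_value():
            self.map(context.current_key, context.current_value, context)

class ListContext:
    """Toy context backed by an in-memory list instead of an input split."""
    def __init__(self, records):
        self._records = iter(enumerate(records))  # key = record offset
        self.current_key = self.current_value = None
        self.output = []  # stands in for the circular buffer

    def next_key_value(self):
        try:
            self.current_key, self.current_value = next(self._records)
            return True
        except StopIteration:
            return False

    def write(self, key, value):
        self.output.append((key, value))

ctx = ListContext(["a b", "b c"])
WordCountMapper().run(ctx)
# ctx.output == [('a', 1), ('b', 1), ('b', 1), ('c', 1)]
```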
33. Introduction YARN MapReduce Conclusion Map Phase Reduce Phase Extra
Map Phase: MapTask – Spilling
Mapper.Context.write() writes to a MapOutputBuffer of size
mapreduce.task.io.sort.mb (default: 100MB). When it is
mapreduce.map.sort.spill.percent (default: 80%) full, a parallel
spilling phase is started.
If the circular buffer is 100% full, then map() is blocked!
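The two thresholds can be sketched as a simplified model (constant and function names are mine):

```python
IO_SORT_MB = 100       # mapreduce.task.io.sort.mb (default: 100 MB)
SPILL_PERCENT = 0.80   # mapreduce.map.sort.spill.percent (default: 0.80)

def should_start_spill(used_bytes, buffer_bytes=IO_SORT_MB * 2**20,
                       threshold=SPILL_PERCENT):
    """Soft limit: at 80% occupancy a background spill starts while
    map() keeps writing into the remaining 20% of the buffer."""
    return used_bytes >= buffer_bytes * threshold

def map_is_blocked(used_bytes, buffer_bytes=IO_SORT_MB * 2**20):
    """Hard limit: a completely full buffer blocks map() until a spill frees space."""
    return used_bytes >= buffer_bytes
```

The gap between the soft and hard limit is what lets spilling and map() run in parallel.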
Map Phase: MapTask – Spilling (2)
1 create a SpillRecord & create an FSOutputStream (local fs)
2 in-memory sort the chunk of the buffer (quicksort):
sort by <partitionIdx, key>
3 divide into partitions:
1 partition for each reducer (mapreduce.job.reduces)
write the partitions into the output file
Map Phase: MapTask – Spilling (partitioning)
How do we partition the <key, value> tuples?
During a Mapper.Context.write():
partitionIdx = (key.hashCode() & Integer.MAX_VALUE) % numReducers
It is stored as metadata of the tuple in the circular buffer.
Use mapreduce.job.partitioner.class to set a custom partitioner.
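The same computation can be sketched in Python, emulating Java's String.hashCode() for string keys (helper names are mine; masking with Integer.MAX_VALUE clears the sign bit so the index is never negative):

```python
import ctypes

def java_string_hash(s):
    """Java's String.hashCode(): h = 31*h + c over 32-bit signed ints."""
    h = 0
    for c in s:
        h = ctypes.c_int32(31 * h + ord(c)).value
    return h

def partition_idx(key, num_reducers):
    """HashPartitioner logic: mask the sign bit, then modulo the reducer count."""
    return (java_string_hash(key) & 0x7FFFFFFF) % num_reducers

print(partition_idx("hadoop", 4))  # always a value in [0, 4)
```

All tuples with the same key hash to the same partition, which is what guarantees that each reducer sees all values for its keys.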
Map Phase: MapTask – Spilling (combine)
If the user specifies a combiner then, before writing the tuples to
the file, it is applied to the tuples of each partition:
1 create an instance of the user Reducer class
2 create a Reducer.Context: the output goes to a file on the local fs
3 execute Reducer.run(): see the Reduce Task slides
The combiner typically uses the same implementation as the
reduce() function and thus can be seen as a local reducer.
Map Phase: MapTask – Spilling (end of execution)
At the end of the execution of Mapper.run():
1 sort and spill the remaining unspilled tuples
2 start the shuffle phase
Map Phase: MapTask – Shuffle
Spill files need to be merged: this is done with a k-way merge, where
k is mapreduce.task.io.sort.factor (default: 100).
These are the intermediate output files of a single MapTask!
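The k-way merge of sorted spill runs can be sketched with Python's heapq.merge (a simplification: the real merger also weighs run sizes when picking which runs to merge first):

```python
import heapq

IO_SORT_FACTOR = 100  # mapreduce.task.io.sort.factor

def merge_spills(spill_files, k=IO_SORT_FACTOR):
    """Merge sorted spill runs k at a time; more than k runs means
    intermediate merge rounds, exactly as on the map side."""
    runs = [list(r) for r in spill_files]
    while len(runs) > 1:
        batch, runs = runs[:k], runs[k:]
        runs.append(list(heapq.merge(*batch)))  # one merge round
    return runs[0] if runs else []

print(merge_spills([[1, 4], [2, 5], [3, 6]], k=2))  # [1, 2, 3, 4, 5, 6]
```

With k=2 and 3 runs, two rounds are needed; with the default k=100, up to 100 spill files merge in a single round.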
Map Phase: Execution Overview
Possible execution scenario:
2 Node Managers (capacity: 2 containers each)
no other running applications
8 input splits
The Node Managers locally store the map outputs (reduce inputs).
Reduce Phase: Reduce Task Launch
The MRAppMaster waits until mapreduce.job.reduce.slowstart.completedmaps
(default: 5%) of the MapTasks are completed.
Then (periodically executed):
if all maps have a container assigned, all remaining
reducers are scheduled
otherwise it checks the percentage of completed maps:
check the cluster resources available to the app
check the resources needed by unassigned/rescheduled maps
ramp down (unschedule/kill) or ramp up (schedule) reduce
tasks
When a reduce task is scheduled, a container request is made. This
does NOT exploit data locality.
A MapTask request has higher priority than a ReduceTask request.
Reduce Phase: Execution Overview
Possible execution scenario:
2 Node Managers (capacity 2 containers each)
no other running applications
4 reducers (mapreduce.job.reduces, default: 1)
Reduce Phase: Reduce Task – Init
1 init a codec (if map outputs are compressed)
2 create an instance of the combine output collector (if needed)
3 create an instance of the shuffle plugin
(mapreduce.job.reduce.shuffle.consumer.plugin.class,
default: org.apache.hadoop.mapreduce.task.reduce.Shuffle.class)
4 create a shuffle context (ShuffleConsumerPlugin.Context)
Reduce Phase: Reduce Task – Shuffle
The shuffle has two steps:
1 fetch map outputs from Node Managers
2 merge them
Reduce Phase: Reduce Task – Shuffle (fetch)
Several parallel fetchers are started (up to
mapreduce.reduce.shuffle.parallelcopies, default: 5). Each fetcher
collects map outputs from one NM (possibly many containers):
if the output size is below 25% of the shuffle memory limit, create
an in-memory output (waiting until enough memory is available)
otherwise create a disk output
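This decision follows from the two shuffle parameters listed in the recap tables; a hedged sketch (constant and function names are mine):

```python
SHUFFLE_INPUT_BUFFER_PERCENT = 0.70  # mapreduce.reduce.shuffle.input.buffer.percent
MEMORY_LIMIT_PERCENT = 0.25          # mapreduce.reduce.shuffle.memory.limit.percent

def shuffle_to_memory(map_output_bytes, max_heap_bytes):
    """A fetched map output goes to memory only if it fits within 25% of
    the shuffle memory budget (70% of the reducer heap); anything larger
    is streamed straight to disk."""
    shuffle_budget = max_heap_bytes * SHUFFLE_INPUT_BUFFER_PERCENT
    return map_output_bytes < shuffle_budget * MEMORY_LIMIT_PERCENT
```

With a 1 GB reducer heap, for example, the single-output threshold is about 179 MB: a 100 MB map output is buffered in memory, a 200 MB one goes to disk.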
Reduce Phase: Reduce Task – Shuffle (fetch) (2)
Fetch the outputs over HTTP and add them to the related merge queue.
A ReduceTask may start before the end of the Map Phase, thus it can
fetch only from completed map tasks. The fetch process is repeated
periodically.
Reduce Phase: Reduce Task – Shuffle (in memory merge)
The in-memory merger:
1 performs a k-way merge
2 runs the combiner (if needed)
3 writes the result to an On Disk Map Output and queues it
Reduce Phase: Reduce Task – Shuffle (on disk merge)
Extract from the queue, k-way merge, and queue the result.
Stop when all files have been merged together: the final merge
provides a RawKeyValueIterator instance (the input of the reducer).
Reduce Phase: Reduce Task – Execution (init)
1 create a context (TaskAttemptContext)
2 create an instance of the user Reducer class
3 setup output (RecordWriter, TextOutputFormat)
4 create a reducer context (Reducer.Context)
Reduce Phase: Reduce Task – Execution (run)
The output is typically written to a file on HDFS.
MapReduce: Application MR Job
Possible execution timeline:
That’s it!
MapReduce: Task Progress
A MapTask has two phases:
Map (66%): progress tracks the percentage of processed input
Sort (33%): 1 subphase for each reducer
subphase progress tracks the percentage of merged bytes
A ReduceTask has three phases:
Copy (33%): progress tracks the percentage of fetched input
Sort (33%): progress tracks the processed bytes in the final merge
Reduce (33%): progress tracks the percentage of processed input
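The weighted combination can be sketched as follows (function names are mine; per-phase fractions are each in [0, 1]):

```python
def map_task_progress(map_frac, sort_frac):
    """MapTask progress: the map phase weighs 2/3, the sort phase 1/3."""
    return (2 * map_frac + sort_frac) / 3

def reduce_task_progress(copy_frac, sort_frac, reduce_frac):
    """ReduceTask progress: copy, sort and reduce each weigh 1/3."""
    return (copy_frac + sort_frac + reduce_frac) / 3
```

So a reducer that has finished copying and sorting but is halfway through reduce() reports roughly 83% overall progress.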
MapReduce: Speculation
MRAppMaster may launch speculative tasks:

est = (ts - start) / MAX(0.0001, Status.progress())
estEndTime = start + est
estReplacementEndTime = now() + TaskDurations.mean()

if estEndTime < now():
    return PROGRESS_IS_GOOD
elif estReplacementEndTime >= estEndTime:
    return TOO_LATE_TO_SPECULATE
else:
    return estEndTime - estReplacementEndTime // score

Speculate the task with the highest score.
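The scoring logic above can be made runnable (a sketch; names are mine, and both sentinel values simply sort below any real score so such tasks are never speculated):

```python
PROGRESS_IS_GOOD = float("-inf")       # never speculate: attempt is nearly done
TOO_LATE_TO_SPECULATE = float("-inf")  # never speculate: a new attempt won't help

def speculation_score(start, now, progress, mean_task_duration):
    """Score a running task attempt; a positive score is the time a
    speculative replacement attempt is expected to save."""
    est = (now - start) / max(0.0001, progress)   # estimated total duration
    est_end_time = start + est
    est_replacement_end_time = now + mean_task_duration
    if est_end_time < now:
        return PROGRESS_IS_GOOD
    if est_replacement_end_time >= est_end_time:
        return TOO_LATE_TO_SPECULATE
    return est_end_time - est_replacement_end_time

# started at t=0, 10% done at t=100 -> est. end t=1000;
# a replacement (mean duration 300) would end at t=400
print(speculation_score(0, 100, 0.10, 300))  # 600.0
```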
MapReduce: Application Status
The status of a MR job is tracked by the MRAppMaster using
several Finite State Machines:
Job: 14 states, 80 transitions, 19 events
Task: 14 states, 36 transitions, 9 events
Task Attempt: 13 states, 60 transitions, 17 events
A job is composed of several tasks. Each task may have several
task attempts. Each task attempt is executed in a container.
In turn, a Node Manager maintains the states of:
Application: 7 states, 21 transitions, 9 events
Container: 11 states, 46 transitions, 12 events
Configuration Parameters (recap)
mapreduce.framework.name
    The runtime framework for executing MapReduce jobs. Set to yarn.
mapreduce.job.reduces
    Number of reduce tasks. Default: 1
dfs.blocksize
    HDFS block size. Default: 128MB
yarn.resourcemanager.scheduler.class
    Scheduler class. Default: CapacityScheduler
yarn.nodemanager.resource.memory-mb
    Memory available on a NM for containers. Default: 8192
yarn.scheduler.minimum-allocation-mb
    Min allocation for every container request. Default: 1024
mapreduce.map.memory.mb
    Memory request for a MapTask. Default: 1024
mapreduce.reduce.memory.mb
    Memory request for a ReduceTask. Default: 1024
Configuration Parameters (recap) (2)
mapreduce.task.io.sort.mb
    Size of the circular buffer (map output). Default: 100MB
mapreduce.map.sort.spill.percent
    Circular buffer soft limit. Once reached, the spilling process starts. Default: 0.80
mapreduce.job.partitioner.class
    The Partitioner class. Default: HashPartitioner.class
map.sort.class
    The sort class for sorting keys. Default: org.apache.hadoop.util.QuickSort
mapreduce.reduce.shuffle.memory.limit.percent
    Maximum percentage of the in-memory limit that a single shuffle can consume. Default: 0.25
mapreduce.reduce.shuffle.input.buffer.percent
    The percentage of the maximum heap size allocated to storing map outputs during the shuffle. Default: 0.70
Configuration Parameters (recap) (3)
mapreduce.reduce.shuffle.merge.percent
    The usage percentage at which an in-memory merge is initiated. Default: 0.66
mapreduce.map.combine.minspills
    Apply the combiner only if there are at least this number of spill files. Default: 3
mapreduce.task.io.sort.factor
    The number of streams to merge at once while sorting files. Default: 100 (10)
mapreduce.job.reduce.slowstart.completedmaps
    Fraction of the maps in the job that should complete before reducers are scheduled. Default: 0.05
mapreduce.reduce.shuffle.parallelcopies
    Number of parallel transfers run by reduce during the shuffle (fetch) phase. Default: 5
mapreduce.reduce.memory.totalbytes
    Memory of a NM. Default: Runtime.maxMemory()
Hadoop: a bad angel
Writing a MapReduce program is relatively easy. Writing an efficient
MapReduce program, on the other hand, is hard:
many configuration parameters:
YARN: 115 parameters
MapReduce: 195 parameters
HDFS: 173 parameters
core: 145 parameters
lack of control over the execution: how to debug?
many implementation details: what is happening?
How can we help the user?
We need profilers!
My current research is focused on this goal.