Hadoop MapReduce and YARN Framework (Unit 5)
• MapReduce is a programming model that Google has used successfully in processing its "big data" sets (~20,000 petabytes per day).
• Users specify the computation in terms of a map and a reduce function.
• The underlying runtime system automatically parallelizes the computation across large-scale clusters of machines.
• The underlying system also handles machine failures, efficient communication, and performance issues.
Consider a large data collection:
{web, weed, green, sun, moon, land, part, web, green, …}
Problem: count the occurrences of the different words in the collection.
Let's design a solution for this problem:
• We will start from scratch.
• We will add and relax constraints.
• We will design incrementally, improving the solution for performance and scalability.
A minimal word-count sketch using the Hadoop MapReduce Java API is shown after this list.
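To make the model concrete, here is a minimal word-count sketch written against the standard Hadoop MapReduce Java API (org.apache.hadoop.mapreduce). The class name and the command-line input/output paths are illustrative assumptions, not part of the original slides.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: for every word in the input line, emit (word, 1).
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: sum all counts received for the same word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // optional supporting combine step
    job.setReducerClass(IntSumReducer.class);
    job.setNumReduceTasks(2);                   // the number of reduce tasks is configurable
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The framework shuffles and sorts the (word, 1) pairs emitted by the mappers so that each reducer sees all counts for a given word together; all map tasks complete before the reduce phase produces its output.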
• Very large-scale data: petabytes, exabytes.
• Write-once, read-many data: allows parallelism without mutexes.
• Map and Reduce are the main operations: simple code.
• There are other supporting operations, such as combine and partition (out of the scope of this talk).
• All map tasks should be completed before the reduce operation starts.
• Map and reduce operations are typically performed by the same physical processor.
• The number of map tasks and reduce tasks is configurable.
[Figure: levels of parallelism versus data size: pipelined (instruction level), concurrent (thread level), service (object level), indexed (file level), mega (block level), and virtual system level; data size ranges from small to large.]
• At Google, MapReduce operations are run on a special file system called the Google File System (GFS), which is highly optimized for this purpose.
• GFS is not open source.
• Doug Cutting and Yahoo! reverse-engineered GFS and called their implementation the Hadoop Distributed File System (HDFS).
• The software framework that supports HDFS, MapReduce, and other related entities is called the Hadoop project, or simply Hadoop.
• It is open source and distributed by Apache.
• Highly fault-tolerant
• High throughput
• Suitable for applications with large data sets
• Streaming access to file system data (illustrated in the sketch below)
• Can be built out of commodity hardware
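As a hedged illustration of streaming access (not from the slides), the sketch below opens a file on HDFS and copies its bytes to standard output; the path /data/words.txt is hypothetical and the cluster address is taken from the configuration on the classpath.

```java
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsCat {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS from core-site.xml on the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    InputStream in = null;
    try {
      // Streaming read: the file is consumed sequentially, block by block.
      in = fs.open(new Path("/data/words.txt"));   // hypothetical HDFS path
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }
  }
}
```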
• Hadoop is a software framework for distributed processing of large datasets across large clusters of computers.
◦ Large datasets: terabytes or petabytes of data.
◦ Large clusters: hundreds or thousands of nodes.
• Hadoop is an open-source implementation of Google's MapReduce.
• Hadoop is based on a simple programming model called MapReduce.
• Hadoop is based on a simple data model: any data will fit.
• The Hadoop framework consists of two main layers:
◦ Distributed file system (HDFS)
◦ Execution engine (MapReduce)
• Automatic parallelization & distribution
◦ Hidden from the end-user
• Fault tolerance and automatic recovery
◦ Nodes/tasks will fail and will recover automatically
• Clean and simple programming abstraction
◦ Users only provide two functions, “map” and “reduce”
• Google: inventors of the MapReduce computing paradigm
• Yahoo!: developed Hadoop, the open-source implementation of MapReduce
• IBM, Microsoft, Oracle
• Facebook, Amazon, AOL, Netflix
• Many others, plus universities and research labs
[Figure: Hadoop cluster layout: a single master node and many slave nodes, with the distributed file system (HDFS) and the execution engine (MapReduce) running across them.]
HDFS storage model:
• A centralized namenode maintains metadata about the files.
• Many datanodes (1000s) store the actual data:
◦ files are divided into blocks (64 MB each);
◦ each block is replicated N times (default N = 3).
[Figure: a file F divided into blocks 1-5 of 64 MB each, distributed across datanodes.]
The block size, replication factor, and block locations of a file can be inspected programmatically, as in the sketch below.
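A hedged sketch, assuming the standard HDFS Java client API; the file path is hypothetical and the values printed reflect whatever the cluster configuration defines.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/data/fileF.txt");    // hypothetical file
    FileStatus status = fs.getFileStatus(file);

    // Per-file block size and replication factor (cluster defaults unless overridden).
    System.out.println("Block size:  " + status.getBlockSize() + " bytes");
    System.out.println("Replication: " + status.getReplication());

    // Which datanodes hold each block of the file.
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.println("Block at offset " + block.getOffset()
          + " on hosts " + String.join(", ", block.getHosts()));
    }
    fs.close();
  }
}
```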
• Large: an HDFS instance may consist of thousands of server machines, each storing part of the file system's data.
• Replication: each data block is replicated many times (default is 3).
• Failure: failure is the norm rather than the exception.
• Fault tolerance: detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.
• The namenode constantly checks on the datanodes.
Deciding what will be the key and what will be the value is the developer's responsibility, as illustrated by the hypothetical mapper below.
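As a hypothetical example of this design choice (not from the slides), the mapper below processes web-server log lines of the assumed form "<url> <bytes>" and chooses the URL as the key and the byte count as the value, so that a summing reducer can compute bytes served per URL.

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical example: for log lines of the form "<url> <bytes>", the developer
// chooses (key, value) = (URL, bytes served); the reducer then sums bytes per URL.
public class LogBytesMapper
    extends Mapper<LongWritable, Text, Text, LongWritable> {

  private final Text url = new Text();
  private final LongWritable bytes = new LongWritable();

  @Override
  public void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String[] fields = line.toString().split("\\s+");
    if (fields.length >= 2) {                   // assumes well-formed log lines
      url.set(fields[0]);
      bytes.set(Long.parseLong(fields[1]));
      context.write(url, bytes);                // emit (URL, bytes)
    }
  }
}
```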
Distributed databases vs. Hadoop:

Computing model
- Distributed databases: notion of transactions; a transaction is the unit of work; ACID properties and concurrency control.
- Hadoop: notion of jobs; a job is the unit of work; no concurrency control.

Data model
- Distributed databases: structured data with a known schema; read/write mode.
- Hadoop: any data will fit, in any format (unstructured or semi-structured); read-only mode.

Cost model
- Distributed databases: expensive servers.
- Hadoop: cheap commodity machines.

Fault tolerance
- Distributed databases: failures are rare; recovery mechanisms.
- Hadoop: failures are common over thousands of machines; simple yet efficient fault tolerance.

Key characteristics
- Distributed databases: efficiency, optimizations, fine-tuning.
- Hadoop: scalability, flexibility, fault tolerance.
Classic Hadoop MapReduce execution engine:
• Master: JobTracker (JT)
• Worker: TaskTracker (TT)
◦ Fixed number of map slots and reduce slots per worker
[Figure: a client submits a job to the master node (JobTracker), which schedules tasks on the worker nodes (TaskTrackers).]
Limitations that motivated YARN:
• Hadoop is being used for all kinds of tasks beyond its original design.
• Tight coupling of a specific programming model with the resource-management infrastructure.
• Centralized handling of jobs' control flow.
 Centralized handling of jobs’ control flow
 Scalability
 Multi-tenancy
 Serviceability
 Locality Awareness
 High Cluster Utilization
◦ HoD does not resize the cluster between stages
◦ Users allocate more nodes than needed
 Competing for resources results in longer latency to
start a job
Requirements for the next-generation framework (YARN):
• Scalability
• Multi-tenancy
• Serviceability
• Locality awareness
• High cluster utilization
• Reliability/availability
• Secure and auditable operation
• Support for programming-model diversity
• Flexible resource model
◦ Classic Hadoop: the number of map/reduce slots is fixed; easy, but lower utilization.
YARN's approach:
• Separating resource-management functions from the programming model.
• MapReduce becomes just one of the applications.
• Dryad, etc.
• Binary compatible / source compatible with existing MapReduce applications.
The ApplicationMaster (AM):
• The head of a job.
• Runs as a container.
• Requests resources from the RM (ResourceManager):
◦ number of containers / resources per container / locality …
• Dynamically changes its resource consumption.
• Can run any user code (Dryad, MapReduce, Tez, REEF, etc.).
• Requests are "late-binding": the lease received may not match the request exactly, and the AM must accommodate the difference.
• The AM optimizes for locality among map tasks with identical resource requirements:
◦ selecting a pending task whose input data is close to the allocated container.
• The AM determines the semantics of the success or failure of the container.
A hedged sketch of an AM requesting a container with a locality preference follows.
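A minimal sketch, assuming the YARN AMRMClient Java API, of how an ApplicationMaster might register with the RM and request a container with a locality preference. The host name, memory size, and priority are assumptions, and on a real cluster this code would run inside the AM's own container.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class AmSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new YarnConfiguration();

    // Register this ApplicationMaster with the ResourceManager.
    AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
    rmClient.init(conf);
    rmClient.start();
    rmClient.registerApplicationMaster("", 0, "");   // host, RPC port, tracking URL

    // Ask for one container of 1 GB / 1 vcore, preferring a node that holds the input split.
    Resource capability = Resource.newInstance(1024, 1);
    Priority priority = Priority.newInstance(0);
    String[] preferredNodes = {"datanode01.example.com"};   // hypothetical host
    ContainerRequest request =
        new ContainerRequest(capability, preferredNodes, null /* racks */, priority);
    rmClient.addContainerRequest(request);

    // Heartbeat to the RM; allocated containers come back in the response. Because
    // requests are late-binding, the allocation may differ from what was requested.
    rmClient.allocate(0.0f);

    // ... launch work on the allocated containers via NMClient, then finish, e.g.:
    // rmClient.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "", "");
  }
}
```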
Lifecycle of a YARN application:
1. The client submits the application by passing a Container Launch Context (CLC) for the ApplicationMaster to the RM.
2. When the RM starts the AM, the AM should register with the RM and periodically advertise its liveness and requirements over the heartbeat protocol.
3. Once the RM allocates a container, the AM can construct a CLC to launch the container on the corresponding NodeManager (NM). It may also monitor the status of the running container and stop it when the resource should be reclaimed. Monitoring the progress of work done inside the container is strictly the AM's responsibility.
4. Once the AM is done with its work, it should unregister from the RM and exit cleanly.
5. Optionally, framework authors may add control flow between their own clients to report job status and expose a control plane.
A sketch of step 1 using the YarnClient API is shown below.
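A minimal sketch of step 1, assuming the YarnClient Java API. The application name, AM command line, queue, and resource sizes are illustrative assumptions, not values from the slides.

```java
import java.util.Collections;

import org.apache.hadoop.yarn.api.ApplicationConstants;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class SubmitSketch {
  public static void main(String[] args) throws Exception {
    YarnConfiguration conf = new YarnConfiguration();
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(conf);
    yarnClient.start();

    // Ask the RM for a new application id.
    YarnClientApplication app = yarnClient.createApplication();
    ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
    appContext.setApplicationName("my-yarn-app");          // illustrative name

    // The CLC describes how to launch the ApplicationMaster process.
    ContainerLaunchContext amClc = Records.newRecord(ContainerLaunchContext.class);
    amClc.setCommands(Collections.singletonList(
        "$JAVA_HOME/bin/java MyApplicationMaster"          // hypothetical AM main class
            + " 1>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stdout"
            + " 2>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stderr"));
    appContext.setAMContainerSpec(amClc);

    // Resources for the AM container itself.
    appContext.setResource(Resource.newInstance(512, 1));  // 512 MB, 1 vcore
    appContext.setQueue("default");

    // Step 1 of the lifecycle: hand the CLC for the AM to the ResourceManager.
    yarnClient.submitApplication(appContext);
  }
}
```

Steps 2-4 would then be carried out by the ApplicationMaster process launched through this CLC (see the AMRMClient sketch above).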
Fault tolerance in YARN:
• RM failure
◦ Recovers using persistent storage.
◦ Kills all containers, including the AMs'.
◦ Relaunches the AMs.
• NM failure
◦ The RM detects it, marks the containers as killed, and reports to the AMs.
• AM failure
◦ The RM kills its container and restarts it.
• Container failure
◦ The framework is responsible for recovery.
• In a 2500-node cluster, throughput improves from 77 K jobs/day to 150 K jobs/day.
Frameworks running on YARN:
• Pig, Hive, Oozie
◦ Decompose a DAG job into multiple MapReduce jobs.
• Apache Tez
◦ DAG execution framework.
• Spark
• Dryad
• Giraph
◦ Vertex-centric graph computation framework; fits naturally within the YARN model.
• Storm
◦ Distributed real-time processing engine (parallel stream processing).
• REEF
◦ Simplifies implementing an ApplicationMaster.
• Hoya
◦ HBase clusters on YARN.
Editor's Notes

  1. CC BY 3.0 (http://creativecommons.org/licenses/by/3.0/deed.en_US). Figures are copied from the original paper and are therefore owned by the ACM.
  2. Add a figure to show the system diagram.
  3. High-level frameworks compose a workflow as a DAG of MapReduce jobs. The number of nodes used in each stage may be different.
  4. The use of an AM provides scalability, programming-model flexibility, and improved upgrading/testing. Late-binding: the received lease may not be the same as the request; the AM must accommodate the difference.
  5. When the AM receives a container, it matches it against the set of pending map tasks, selecting a task with input data close to the container. If the AM decides to run a map task mi in the container, then the hosts storing replicas of mi's input data are less desirable, and the AM will update its request to diminish the weight on the other k-1 hosts.
  6. Even a simple AM can be fairly complex. Frameworks to ease development of YARN applications exist; we explore some of these in section 4.2. Client libraries (YarnClient, NMClient, AMRMClient) ship with YARN and expose higher-level APIs to avoid coding against low-level protocols.
  7. Work is in progress to add sufficient protocol support for AMs to survive RM restart.
  8. Essentially, after moving to YARN, CPU utilization almost doubled. One of the most important architectural differences that partially explains these improvements is the removal of the static split between map and reduce slots.