MapReduce:
Simplified Data Processing on Large Clusters	
2015-08-30 さとうかずま
Dean and Ghemawat (2004), MapReduce: Simplified Data Processing on Large Clusters
The reasons for the success of MapReduce	
The MapReduce programming model has been
successfully used at Google 	
Reasons
•  MapReduce is easy to use
•  Many problems are easily expressible
as MapReduce computations
•  An implementation of MapReduce can run on a large
cluster of commodity machines and is highly scalable
The reasons why MapReduce is easy to use
MapReduce hides the messy details of the following	
•  Parallelization
•  Fault-tolerance
•  Locality optimization
•  Load balancing
MapReduce computation
Computation is expressed as two functions: Map and Reduce
Input and output are sets of key/value pairs
phase     type
map       (k1, v1) → list(k2, v2)
reduce    (k2, list(v2)) → list(v2)
[Figure: a flow of an execution (counting the number of occurrences of each word in a document). The input file is split into (n1, s1) and (n2, s2), and each split is handled on its own machine. Map emits (w1, 1) and (w2, 1) for each split. Reduce receives (w1, (1, 1)) and (w2, (1, 1)) and outputs 2 for each word.]
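The flow above can be sketched as a single-process Python program. This is only an illustration under stated assumptions, not Google's implementation: the names `map_fn`, `reduce_fn`, and `run_mapreduce` are invented here, and the grouping of intermediate values by key happens in an in-memory dictionary rather than across machines.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

# All names below are hypothetical; the real library runs on a cluster.
def map_fn(name: str, contents: str) -> Iterator[Tuple[str, int]]:
    """(k1, v1) -> list(k2, v2): emit (word, 1) for every word in a split."""
    for word in contents.split():
        yield word, 1

def reduce_fn(word: str, counts: List[int]) -> List[int]:
    """(k2, list(v2)) -> list(v2): sum the counts collected for one word."""
    return [sum(counts)]

def run_mapreduce(
    splits: Iterable[Tuple[str, str]],
    mapper: Callable[[str, str], Iterator[Tuple[str, int]]],
    reducer: Callable[[str, List[int]], List[int]],
) -> Dict[str, List[int]]:
    # Map phase: every split is processed independently of the others.
    intermediate: Dict[str, List[int]] = defaultdict(list)
    for key, value in splits:
        for k2, v2 in mapper(key, value):
            # Group intermediate values by key before reducing.
            intermediate[k2].append(v2)
    # Reduce phase: one call per distinct intermediate key.
    return {k2: reducer(k2, values) for k2, values in intermediate.items()}

splits = [("n1", "w1 w2"), ("n2", "w1 w2")]
result = run_mapreduce(splits, map_fn, reduce_fn)
print(result)  # {'w1': [2], 'w2': [2]}
```

The two splits mirror the figure: each one contributes (w1, 1) and (w2, 1), and each word is reduced to a count of 2.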
Map
Map takes an input pair and produces a set of intermediate
key/value pairs	
Map example
(Counting the number of occurrences of each word in a document)
Map, written by the user, produces a set of intermediate key/value pairs
Map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");
Reduce
Reduce accepts an intermediate key and a set of values for
that key, then merges them to form a possibly smaller set of values
Reduce example
(Counting the number of occurrences of each word in a document)
Reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));
Parallelization
Map and reduce allow programmers to parallelize
computations easily	
•  They are inspired by the map and reduce primitives present in
functional languages
•  Referential transparency is one of the principles of
functional programming
•  Referential transparency encourages language-based
parallelism of computation
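Because a referentially transparent function depends only on its arguments, its calls can run in any order, or at the same time, without changing the answer. A minimal sketch, assuming a thread pool from Python's standard `concurrent.futures` module stands in for a cluster; `count_words` is a name invented for this example:

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def count_words(text: str) -> int:
    """Pure function: the result depends only on the argument."""
    return len(text.split())

documents = ["a b c", "d e", "f g h i"]

# Sequential map, as in a functional language.
sequential = list(map(count_words, documents))

# Parallel map: safe because count_words has no side effects.
with ThreadPoolExecutor(max_workers=3) as pool:
    parallel = list(pool.map(count_words, documents))

# Reduce combines the per-document results into a single value.
total = reduce(lambda a, b: a + b, parallel, 0)
print(sequential == parallel, total)  # True 9
```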
MapReduce operation
First, the MapReduce library starts up many copies of the user
program on a cluster of machines
[Figure: the user program is forked into copies of itself on a cluster of machines]
The master and workers
One of the copies is the master; it assigns map tasks and
reduce tasks to the remaining copies, the workers
[Figure: among the copies of the user program on the cluster, the master assigns map tasks and reduce tasks to the workers]
Fault-tolerance | worker
Any map task or reduce task in progress on a failed worker
becomes eligible for rescheduling	
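[Figure: task scheduling on a failure. The master assigns an idle task to machine A, and the task becomes in-progress. Machine A hits an exception and does not respond to the master's ping, so the task returns to idle. The master then assigns it to machine B, which answers the ping with a pong, and the task is in-progress again.]
The master stores the state (idle, in-progress, or completed) of each task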
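The rescheduling rule can be sketched as a toy master that tracks task states. This is a hedged illustration only: the `Master` and `Task` classes and the `ping` callback are invented for this example, and a real master would ping workers over the network on a timer.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

IDLE, IN_PROGRESS, COMPLETED = "idle", "in-progress", "completed"

@dataclass
class Task:
    task_id: int
    state: str = IDLE
    worker: Optional[str] = None

class Master:
    """Toy master: tracks task states and reschedules on worker failure."""

    def __init__(self, num_tasks: int) -> None:
        self.tasks: List[Task] = [Task(i) for i in range(num_tasks)]

    def assign(self, worker: str) -> Optional[Task]:
        # Hand the first idle task to the given worker.
        for task in self.tasks:
            if task.state == IDLE:
                task.state, task.worker = IN_PROGRESS, worker
                return task
        return None

    def check_workers(self, ping: Callable[[str], bool]) -> None:
        # A worker that does not answer a ping is treated as failed;
        # its in-progress tasks go back to idle so they can be reassigned.
        for task in self.tasks:
            if task.state == IN_PROGRESS and not ping(task.worker):
                task.state, task.worker = IDLE, None

master = Master(num_tasks=1)
master.assign("machine A")
alive = {"machine A": False, "machine B": True}  # machine A has failed
master.check_workers(lambda worker: alive[worker])
state_after_failure = master.tasks[0].state
task = master.assign("machine B")
print(state_after_failure, task.state, task.worker)  # idle in-progress machine B
```

The sketch covers only the in-progress case stated on the slide. In the original paper, map tasks already completed on a failed worker are also reset to idle, because their output lives on that worker's local disk.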
Fault-tolerance | master
If the master task dies, a new copy can be started from
the last checkpoint state	
[Figure: restart of the master task from a checkpoint. When the master task fails with an exception, a new copy is started from the checkpoint.]
The master writes periodic checkpoints of the master data structures
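Checkpoint and restart can be sketched in a few lines. The JSON file format, the file name `master.ckpt`, and the write-then-rename step are assumptions made for this example; the slides do not say how the checkpoint is stored.

```python
import json
import os
import tempfile

def write_checkpoint(path: str, task_states: dict) -> None:
    # Write to a temporary file, then rename, so a crash mid-write
    # never leaves a half-written checkpoint behind.
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(task_states, f)
    os.replace(tmp_path, path)

def restore_checkpoint(path: str) -> dict:
    with open(path) as f:
        return json.load(f)

checkpoint_dir = tempfile.mkdtemp()
checkpoint_path = os.path.join(checkpoint_dir, "master.ckpt")

# State held by the original master (task id -> state).
task_states = {"map-0": "completed", "map-1": "in-progress", "reduce-0": "idle"}
write_checkpoint(checkpoint_path, task_states)

# The master dies after the checkpoint, so this later update is lost.
task_states["map-1"] = "completed"
del task_states

# A new copy of the master starts from the last checkpointed state.
restored = restore_checkpoint(checkpoint_path)
print(restored["map-1"])  # in-progress
```

Work recorded after the last checkpoint is forgotten by the new copy, which is why `map-1` is in-progress again after the restart.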
Locality optimization
Network bandwidth is conserved by taking
advantage of GFS
•  Input data (managed by GFS) is stored on the local disks of
the machines that make up a cluster
•  GFS divides each file into 64 MB blocks, and stores several
copies of each block on different machines
•  The master attempts to schedule a map task on a machine
that contains a replica of the corresponding input data
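The scheduling preference in the last bullet can be sketched as follows. The function `pick_worker`, the block names, and the replica placement are hypothetical, and the fallback to any idle worker is a simplification made for this example.

```python
from typing import Dict, List, Optional, Set

def pick_worker(block: str,
                replicas: Dict[str, Set[str]],
                idle_workers: List[str]) -> Optional[str]:
    """Prefer an idle worker that already holds a replica of the block."""
    for worker in idle_workers:
        if worker in replicas.get(block, set()):
            return worker  # local read: no network transfer needed
    # Fall back to any idle worker; it must fetch the block remotely.
    return idle_workers[0] if idle_workers else None

# Hypothetical placement: each block is stored on three machines.
replicas = {
    "block-0": {"m1", "m2", "m3"},
    "block-1": {"m2", "m4", "m5"},
}

local = pick_worker("block-1", replicas, idle_workers=["m1", "m4"])
remote = pick_worker("block-0", replicas, idle_workers=["m4", "m5"])
print(local, remote)  # m4 m4
```

In the first call m4 is chosen over m1 because it holds a replica of block-1. In the second call no idle worker holds block-0, so the first idle worker is used and the data must cross the network.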
Load balancing
Having each worker perform many different tasks improves
dynamic load balancing	
When a worker fails, the many map tasks it has completed can be
re-executed spread out across all the other machines
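The effect of task granularity on load balancing can be illustrated with a small simulation. This is a sketch under stated assumptions, not a model of the real scheduler: the worker speeds and work sizes are made up, tasks are equal-sized, and each idle worker pulls the next task from a shared queue.

```python
import heapq
from typing import List

def makespan(total_work: float, num_tasks: int, speeds: List[float]) -> float:
    """Finish time when idle workers pull equal-sized tasks from a shared queue."""
    task_size = total_work / num_tasks
    # Heap of (time at which the worker becomes free, worker index).
    free_at = [(0.0, i) for i in range(len(speeds))]
    heapq.heapify(free_at)
    finish = 0.0
    for _ in range(num_tasks):
        t, i = heapq.heappop(free_at)  # the earliest idle worker takes the task
        done = t + task_size / speeds[i]
        finish = max(finish, done)
        heapq.heappush(free_at, (done, i))
    return finish

speeds = [1.0, 1.0, 0.5]  # hypothetical cluster: the third machine is slow
coarse = makespan(total_work=120, num_tasks=3, speeds=speeds)
fine = makespan(total_work=120, num_tasks=60, speeds=speeds)
print(coarse, fine)  # 80.0 48.0
```

With one large task per worker, the job waits for the slow machine. With many small tasks, the fast machines simply take more of them, and the job finishes sooner.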
