Map Reduce
By
Manuel Correa
Background

Large data sets need to be processed quickly and efficiently

To process a large data set in a reasonable amount of
time, the work needs to be distributed across thousands of
machines

Programmers need to focus on solving problems without
worrying about the implementation
Map Reduce is the answer.
What is Map reduce?

Programming model for processing large data sets

Hides the implementation of parallelization, fault tolerance, data
distribution and load balancing in a library

Inspired by some characteristics of functional programming

Functional operations do not modify data structures.
They always create new ones

Original data is not modified

Data flow is implicit within the application

The order of the operations does not matter
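The functional style described above can be illustrated with Python's built-in `map` and `functools.reduce`. This is only a single-machine sketch of the idea (new values are produced, the input is left alone), not the distributed library itself; the list `words` is an illustrative input.

```python
from functools import reduce

words = ["foo", "bar", "baz"]

# map builds a new list of lengths; the original list is not modified.
lengths = list(map(len, words))

# reduce folds the new list into a single value.
total = reduce(lambda acc, n: acc + n, lengths, 0)

print(words, lengths, total)
```

Because each call to `len` depends only on its own element, the map step could be evaluated in any order, which is what makes this style easy to parallelize.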
What is Map reduce?

There are two functions: Map and Reduce

Map

Input: Key/Value pairs

Output: Intermediate key/value pairs

Reduce

Input: Key, Iterator values

Output: list with results
map(k1, v1) --> list(k2, v2)
reduce(k2, values(k2)) --> list(v2)
Complicated?
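One way to read the two signatures above is as a small single-process driver: apply map to every input pair, group the intermediate values by key, then apply reduce to each group. The sketch below assumes everything fits in memory; `run_mapreduce`, `map_fn` and `reduce_fn` are hypothetical names chosen for illustration and are not part of any real library.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def run_mapreduce(
    inputs: Iterable[Tuple[object, object]],
    map_fn: Callable[[object, object], Iterable[Tuple[object, object]]],
    reduce_fn: Callable[[object, Iterator[object]], Iterable[object]],
) -> Dict[object, List[object]]:
    # Map phase: map(k1, v1) --> list(k2, v2)
    groups: Dict[object, List[object]] = defaultdict(list)
    for k1, v1 in inputs:
        for k2, v2 in map_fn(k1, v1):
            # Group intermediate values by their key k2.
            groups[k2].append(v2)

    # Reduce phase: reduce(k2, values for k2) --> list of results
    return {k2: list(reduce_fn(k2, iter(values))) for k2, values in groups.items()}
```

A real implementation would spread the map and reduce calls over many machines; the grouping step in the middle is the part the library performs between the two user functions.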
Map Reduce by example
Counting each word in a large set of documents
map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");
reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));
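The pseudocode above can be turned into a runnable single-machine sketch. The version below keeps the same shape (counts are emitted as strings and parsed in reduce) and uses the two small documents from this deck's running example; the in-memory grouping loop stands in for the work the library would do and is an assumption of this sketch.

```python
from collections import defaultdict


def map_fn(key, value):
    # key: document name, value: document contents
    for word in value.split():
        yield word, "1"


def reduce_fn(key, values):
    # key: a word, values: an iterator over counts stored as strings
    result = 0
    for v in values:
        result += int(v)
    yield str(result)


documents = {
    "Document_1": "foo bar baz foo bar test",
    "Document_2": "test foo baz bar foo",
}

# Group the intermediate pairs by word (done by the library in a real run).
intermediate = defaultdict(list)
for name, contents in documents.items():
    for word, count in map_fn(name, contents):
        intermediate[word].append(count)

counts = {word: next(reduce_fn(word, iter(values)))
          for word, values in intermediate.items()}
print(counts)
```

Unlike the slides that follow, this sketch sums all counts for a word in one reduce call rather than first producing per-document partial sums.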
Map Reduce by example
Counting each word in a large set of documents
Document_1
foo
bar
baz
foo
bar
test
Document_2
test
foo
baz
bar
foo
Expected results:
<foo, 4>,<bar, 3>,<baz,2>,<test,2>
Map Reduce by example
Counting each word in a large set of documents
map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");
Map(document_1, contents(document_1))
<foo, "1">
<bar, "1">
<baz, "1">
<foo, "1">
<bar, "1">
<test, "1">
Map(document_2, contents(document_2))
<test, "1">
<foo, "1">
<baz, "1">
<bar, "1">
<foo, "1">
Map Reduce by example
Counting each word in a large set of documents
reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));
Reduce(word, values)
<foo, "2">
<bar, "2">
<baz, "1">
<test, "1">
Reduce(word, values)
<test, "1">
<foo, "2">
<baz, "1">
<bar, "1">
Map Reduce by example
Counting each word in a large set of documents
Reduce(word, values)
<foo, "4">
<bar, "3">
<baz, "2">
<test, "2">
<foo, "2">
<bar, "2">
<baz, "1">
<test, "1">
<test, "1">
<foo, "2">
<baz, "1">
<bar, "1">
Expected results:
<foo, 4>,<bar, 3>,<baz,2>,<test,2>
Implementation
Master node

The master keeps separate data structures for map and reduce
tasks, in which the status of each task is maintained

Status: idle, in-progress or completed

The master node keeps track of the intermediate files to feed
the reduce tasks

The master node controls the interaction between the M map
tasks and R reduce tasks
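The bookkeeping described above might look roughly like the following. This is a minimal sketch under stated assumptions: the class names `Task` and `Master`, the `assign` and `complete_map` methods, and the file-location strings are all hypothetical, chosen only to show the three task states and the tracking of intermediate files.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional


class TaskState(Enum):
    IDLE = "idle"
    IN_PROGRESS = "in-progress"
    COMPLETED = "completed"


@dataclass
class Task:
    task_id: int
    state: TaskState = TaskState.IDLE
    worker: Optional[str] = None


@dataclass
class Master:
    map_tasks: List[Task]
    reduce_tasks: List[Task]
    # map task id -> locations of the intermediate files it produced
    intermediate_files: Dict[int, List[str]] = field(default_factory=dict)

    def assign(self, tasks: List[Task], worker: str) -> Optional[Task]:
        # Hand the first idle task to the given worker, if any is left.
        for task in tasks:
            if task.state is TaskState.IDLE:
                task.state = TaskState.IN_PROGRESS
                task.worker = worker
                return task
        return None

    def complete_map(self, task: Task, files: List[str]) -> None:
        # Record where the map output lives so reduce tasks can fetch it.
        task.state = TaskState.COMPLETED
        self.intermediate_files[task.task_id] = files
```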
Fault Tolerance

Master pings every worker periodically

If a worker fails, the master marks it as failed and
assigns its task to another worker

Every worker must notify the master when it has finished its task. The master
then assigns it another task

Each task is independent and can be restarted at any
moment. Map Reduce is resilient to worker failures

What if the master fails? The master periodically checkpoints its status
and data structures, so another master can start from the
last checkpoint
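The ping-and-reassign behaviour can be sketched as two small functions: one decides which workers have gone silent, the other puts their unfinished tasks back in the idle pool. The timeout value, the function names and the dictionary layout are assumptions made for illustration; a real master would also need to handle completed tasks whose output is no longer reachable.

```python
PING_TIMEOUT = 10.0  # seconds without a reply before a worker counts as failed (assumed value)


def find_failed_workers(last_seen, now, timeout=PING_TIMEOUT):
    # last_seen maps worker name -> time of its most recent ping reply.
    return {worker for worker, seen in last_seen.items() if now - seen > timeout}


def reschedule(tasks, failed_workers):
    # tasks maps task id -> {"state": ..., "worker": ...}.
    # In-progress tasks on failed workers go back to idle so that
    # the master can assign them to another worker.
    for task in tasks.values():
        if task["worker"] in failed_workers and task["state"] == "in-progress":
            task["state"] = "idle"
            task["worker"] = None
    return tasks
```

For example, with `last_seen = {"w1": 0.0, "w2": 95.0}` and `now=100.0`, only `w1` is reported as failed, and any task it was running becomes idle again.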
Task Granularity

There are M map tasks and R reduce tasks

M and R should be much larger than the number of workers

This improves dynamic load balancing across workers and makes
better use of resources

The master must make O(M+R) scheduling decisions and keep
O(M*R) state in memory, about one byte per map task/reduce task pair

According to the paper, Google often runs computations with M=200,000 and
R=5,000 using 2,000 workers
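The figures above can be checked with a few lines of arithmetic. The sketch assumes roughly one byte of master state per map task/reduce task pair, which is how the MapReduce paper describes the O(M*R) cost; the variable names are illustrative.

```python
M = 200_000      # map tasks
R = 5_000        # reduce tasks
workers = 2_000  # worker machines

scheduling_decisions = M + R           # O(M + R)
state_entries = M * R                  # O(M * R)
state_bytes = state_entries * 1        # about one byte per map/reduce task pair
tasks_per_worker = (M + R) / workers   # average number of tasks each worker runs

print(scheduling_decisions, state_entries, state_bytes, tasks_per_worker)
```

With these numbers the master makes 205,000 scheduling decisions and holds about 10^9 bytes (roughly 1 GB) of state, while each worker handles a little over 100 tasks on average, which is what gives the scheduler room to balance load.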
Refinements

Partition function: load balancing

Ordering guarantees: within a partition, intermediate keys are processed
in increasing order, which makes it easy to generate sorted output files

Combiner function: usually the same code as the Reduce function. See the
word count example

Input and output Readers: Standard input and output

Skipping bad records: Control of bad input

Local execution for debugging

Status information through an external application
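Two of the refinements above, the partition function and the combiner, are easy to sketch. The partition function below follows the common "hash of the key modulo R" scheme, using `zlib.crc32` only because it gives the same value on every run; the combiner pre-sums word counts on the map side before they are sent to the reducers. Both function names are hypothetical.

```python
import zlib
from collections import Counter


def partition(key: str, num_reduce_tasks: int) -> int:
    # Pick the reduce task for a key: hash(key) mod R.
    return zlib.crc32(key.encode("utf-8")) % num_reduce_tasks


def combine(pairs):
    # Merge <word, "1"> pairs produced by one map task into partial
    # sums, so fewer pairs need to be sent to the reduce tasks.
    partial = Counter()
    for word, count in pairs:
        partial[word] += int(count)
    return [(word, str(total)) for word, total in partial.items()]
```

Applied to the map output of Document_1 in the word count example, `combine` yields the partial counts `<foo, "2">`, `<bar, "2">`, `<baz, "1">`, `<test, "1">`.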
What are the benefits of map reduce?

Easy to use for programmers, who don't need to worry about
the details of distributed computing

A large set of problems can be expressed in the Map Reduce
programming model

Flexible and scalable on large clusters of machines. The fault
tolerance is elegant and works
Programs that can be expressed
with Map Reduce

Distributed Grep <word, match>

Count URL Access Frequency <URL, total_count>

Reverse Web-link graph <target, list(source)>

Term-Vector per Host <word, frequency>

Inverted index <word, list(document ID)>

Distributed Sort <key, record>
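As one illustration from the list above, an inverted index can be written with the same two functions. In this single-machine sketch, map emits a `<word, document ID>` pair for each distinct word in a document and reduce collects the sorted document IDs for each word; the helper `inverted_index` and its in-memory grouping are assumptions standing in for the library.

```python
from collections import defaultdict


def map_fn(doc_id, contents):
    # Emit each distinct word once per document.
    for word in set(contents.split()):
        yield word, doc_id


def reduce_fn(word, doc_ids):
    # Collect every document that contains the word.
    yield word, sorted(doc_ids)


def inverted_index(documents):
    groups = defaultdict(list)
    for doc_id, contents in documents.items():
        for word, d in map_fn(doc_id, contents):
            groups[word].append(d)
    return dict(next(reduce_fn(word, ids)) for word, ids in groups.items())


index = inverted_index({"d1": "foo bar", "d2": "bar baz"})
print(index)
```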
References

MapReduce: Simplified Data Processing on Large Clusters (
http://labs.google.com/papers/mapreduce-osdi04.pdf)

http://code.google.com/edu/parallel/mapreduce-tutorial.html

www.mapreduce.org

http://www.youtube.com/watch?v=yjPBkvYh-ss&feature=PlayList&p=

http://hadoop.apache.org/
Map Reduce
Questions?
