Map Reduce
By
Manuel Correa
Background

Large data sets need to be processed quickly and efficiently

To process a large data set in a reasonable amount of time, the work needs to be distributed across thousands of machines

Programmers need to focus on solving problems without worrying about the implementation
Map Reduce is the answer.
What is Map reduce?

Programming model for processing large data sets

Hides the implementation of parallelization, fault tolerance, data distribution and load balancing in a library

Inspired by some characteristics of functional programming

Functional operations do not modify data structures.
They always create new ones

Original data is not modified

Data flow is implicit within the application

The order of the operations does not matter
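The functional style described above can be illustrated with Python's built-in map and functools.reduce. This is only a single-machine sketch of the idea, not the distributed library itself; the list of words is made up for the illustration.

```python
from functools import reduce

words = ["foo", "bar", "baz"]

# map builds a new list; the input list is left untouched.
lengths = list(map(len, words))

# reduce folds the new list into a single value, again without mutating it.
total = reduce(lambda acc, n: acc + n, lengths, 0)

print(words)    # ['foo', 'bar', 'baz']
print(lengths)  # [3, 3, 3]
print(total)    # 9
```

Because neither step modifies its input, the per-element work can in principle be done in any order or on different machines.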
What is Map reduce?

There are two functions: Map and Reduce

Map

Input: Key/Value pairs

Output: Intermediate key/value pairs

Reduce

Input: a key and an iterator over the values for that key

Output: list of results
map(k1, v1) --> list(k2, v2)
reduce(k2, list(v2)) --> list(v2)
Complicated?
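The two signatures can be written down as Python type aliases together with a tiny single-process driver. The names MapFn, ReduceFn, run_mapreduce, parity_map and sum_reduce are hypothetical, chosen only for this sketch; a real library would run the map and reduce calls on many machines.

```python
from collections import defaultdict
from typing import Callable, Iterable, List, Tuple, TypeVar

K1 = TypeVar("K1")
V1 = TypeVar("V1")
K2 = TypeVar("K2")
V2 = TypeVar("V2")

# map(k1, v1) --> list(k2, v2)
MapFn = Callable[[K1, V1], Iterable[Tuple[K2, V2]]]
# reduce(k2, list(v2)) --> list(v2)
ReduceFn = Callable[[K2, List[V2]], List[V2]]


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Hypothetical in-memory driver: map, group by key, then reduce."""
    groups = defaultdict(list)
    for k1, v1 in inputs:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    return {k2: reduce_fn(k2, values) for k2, values in groups.items()}


def parity_map(name, numbers):
    # Emit each number under the intermediate key "odd" or "even".
    for n in numbers:
        yield ("even" if n % 2 == 0 else "odd", n)


def sum_reduce(key, values):
    return [sum(values)]


result = run_mapreduce(
    [("batch_1", [1, 2, 3]), ("batch_2", [4, 5])], parity_map, sum_reduce
)
print(result["odd"], result["even"])  # [9] [6]
```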
Map Reduce by example
Counting each word in a large set of documents
map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");
reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));
Map Reduce by example
Counting each word in a large set of documents
Document_1: foo bar baz foo bar test
Document_2: test foo baz bar foo
Expected results:
<foo, 4>,<bar, 3>,<baz,2>,<test,2>
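As a rough, single-process sketch, the word count pseudocode can be run on the two documents above in Python. The function names map_words and reduce_counts are made up for this example, and the grouping loop stands in for the work the library would do between the map and reduce phases.

```python
from collections import defaultdict

documents = {
    "Document_1": "foo bar baz foo bar test",
    "Document_2": "test foo baz bar foo",
}


def map_words(name, contents):
    # Emit (word, "1") for every word, mirroring EmitIntermediate(w, "1").
    for word in contents.split():
        yield word, "1"


def reduce_counts(word, values):
    # Sum the string counts for one word and return the total as a string.
    return str(sum(int(v) for v in values))


# Group intermediate values by key (done by the library in a real system).
intermediate = defaultdict(list)
for name, contents in documents.items():
    for word, one in map_words(name, contents):
        intermediate[word].append(one)

counts = {word: reduce_counts(word, values) for word, values in intermediate.items()}
print(counts)  # {'foo': '4', 'bar': '3', 'baz': '2', 'test': '2'}
```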
Map Reduce by example
Counting each word in a large set of documents
map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");
Map(document_1, contents(document_1))
<foo, "1">
<bar, "1">
<baz, "1">
<foo, "1">
<bar, "1">
<test, "1">
Map(document_2, contents(document_2))
<test, "1">
<foo, "1">
<baz, "1">
<bar, "1">
<foo, "1">
Map Reduce by example
Counting each word in a large set of documents
reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));
Reduce(word, values) on the map output of Document_1
<foo, "2">
<bar, "2">
<baz, "1">
<test, "1">
Reduce(word, values) on the map output of Document_2
<test, "1">
<foo, "2">
<baz, "1">
<bar, "1">
Map Reduce by example
Counting each word in a large set of documents
Partial counts from Document_1:
<foo, "2">
<bar, "2">
<baz, "1">
<test, "1">
Partial counts from Document_2:
<test, "1">
<foo, "2">
<baz, "1">
<bar, "1">
Reduce(word, values) over the partial counts:
<foo, "4">
<bar, "3">
<baz, "2">
<test, "2">
Expected results:
<foo, 4>,<bar, 3>,<baz,2>,<test,2>
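The two-stage counting in this example (per-document counts first, then a final merge) matches what the Refinements slide calls a combiner. A minimal Python sketch, assuming collections.Counter as a stand-in for the local pre-aggregation on each map worker:

```python
from collections import Counter

documents = {
    "Document_1": "foo bar baz foo bar test",
    "Document_2": "test foo baz bar foo",
}

# Combiner step: each document's map output is pre-aggregated locally.
partials = {name: Counter(contents.split()) for name, contents in documents.items()}

# Final reduce: merge the partial counts for each word.
final = Counter()
for partial in partials.values():
    final.update(partial)

print(dict(partials["Document_1"]))  # {'foo': 2, 'bar': 2, 'baz': 1, 'test': 1}
print(dict(final))                   # {'foo': 4, 'bar': 3, 'baz': 2, 'test': 2}
```

Pre-aggregating this way means fewer intermediate pairs have to travel to the reduce step, which only works because addition is associative and commutative.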
Implementation
Master node

The master keeps separate data structures for map and reduce tasks, in which the status of each task is maintained

Status: idle, in-progress or completed

The master node keeps track of the intermediate files that feed the reduce tasks

The master node controls the interaction between the M map tasks and the R reduce tasks
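One possible shape for the master's bookkeeping is sketched below in Python. The class names, field names and file path are hypothetical; the slides only say that the master records a status per task and tracks intermediate files.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional


class TaskState(Enum):
    IDLE = "idle"
    IN_PROGRESS = "in-progress"
    COMPLETED = "completed"


@dataclass
class Task:
    kind: str                          # "map" or "reduce"
    state: TaskState = TaskState.IDLE
    worker: Optional[str] = None       # worker currently holding the task
    # For a completed map task: where its intermediate files live.
    intermediate_files: List[str] = field(default_factory=list)


@dataclass
class MasterState:
    map_tasks: Dict[int, Task] = field(default_factory=dict)
    reduce_tasks: Dict[int, Task] = field(default_factory=dict)

    def finish_map(self, task_id: int, files: List[str]) -> None:
        task = self.map_tasks[task_id]
        task.state = TaskState.COMPLETED
        task.intermediate_files = files

    def reduce_inputs(self) -> List[str]:
        """Intermediate files reported so far by completed map tasks."""
        return [f for t in self.map_tasks.values() for f in t.intermediate_files]


master = MasterState(
    map_tasks={0: Task("map"), 1: Task("map")},
    reduce_tasks={0: Task("reduce")},
)
master.map_tasks[0].state = TaskState.IN_PROGRESS
master.map_tasks[0].worker = "worker-a"
master.finish_map(0, ["worker-a/map-0-part-0"])  # made-up file name
print(master.reduce_inputs())  # ['worker-a/map-0-part-0']
```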
Fault Tolerance

The master pings every worker periodically

If a worker fails, the master marks it as failed and assigns its task to another worker

Every worker must notify the master when it has finished its task. The master then assigns it another task

Each task is independent and can be restarted at any moment. Map Reduce is resilient to worker failures

What if the master fails? The master periodically checkpoints its status and data structures, so another master can start from the last checkpoint
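The worker-failure handling can be sketched as a small simulation in Python. The timeout value, class name and method names are assumptions made for this illustration, and timestamps are passed in explicitly so the example is deterministic. It only resets in-progress tasks, as the slide describes.

```python
PING_TIMEOUT = 10.0  # seconds; hypothetical value


class Master:
    def __init__(self, task_ids, workers):
        self.state = {t: "idle" for t in task_ids}
        self.owner = {t: None for t in task_ids}
        self.last_seen = {w: 0.0 for w in workers}
        self.failed = set()

    def record_ping(self, worker, now):
        # Called whenever a worker answers a ping.
        self.last_seen[worker] = now

    def assign(self, task, worker):
        self.state[task] = "in-progress"
        self.owner[task] = worker

    def check_workers(self, now):
        # Mark silent workers as failed and put their tasks back in the idle pool.
        for worker, seen in self.last_seen.items():
            if worker not in self.failed and now - seen > PING_TIMEOUT:
                self.failed.add(worker)
                for task, owner in list(self.owner.items()):
                    if owner == worker and self.state[task] == "in-progress":
                        self.state[task] = "idle"
                        self.owner[task] = None


master = Master(task_ids=[0, 1], workers=["w1", "w2"])
master.assign(0, "w1")
master.assign(1, "w2")
master.record_ping("w1", now=5.0)
master.record_ping("w2", now=5.0)
master.record_ping("w2", now=14.0)   # w1 has stopped answering
master.check_workers(now=16.0)       # w1 was last seen 11 s ago
state_after_failure = dict(master.state)
master.assign(0, "w2")               # the idle task goes to a healthy worker

print(state_after_failure)  # {0: 'idle', 1: 'in-progress'}
print(master.owner)         # {0: 'w2', 1: 'w2'}
```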
Task Granularity

There are M map tasks and R reduce tasks

M and R should be much larger than the number of workers

Dynamic load balancing across workers to optimize resources

The master must make O(M+R) scheduling decisions and keep O(M*R) state in memory, about one byte per map task/reduce task pair

According to the paper, Google often runs computations with M=200,000 and R=5,000 using 2,000 workers
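Plugging the quoted figures into the two bounds gives a feel for the scale. The short Python calculation below assumes roughly one byte of state per map task/reduce task pair, as the paper describes.

```python
M = 200_000      # map tasks (figure quoted from the paper)
R = 5_000        # reduce tasks
WORKERS = 2_000  # worker machines

scheduling_decisions = M + R   # O(M + R)
state_entries = M * R          # O(M * R), one entry per map/reduce pair
state_gigabytes = state_entries * 1 / 1e9  # assuming ~1 byte per entry

print(scheduling_decisions)  # 205000
print(state_entries)         # 1000000000
print(state_gigabytes)       # 1.0
print(M / WORKERS)           # 100.0 map tasks per worker on average
print(R / WORKERS)           # 2.5 reduce tasks per worker on average
```

With these numbers each worker handles about 100 map tasks on average, which is what lets the master rebalance work when some machines are slow or fail.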
Refinements

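Partition function: distributes intermediate keys across the R reduce tasks (load balancing)

Ordering guarantee: intermediate keys are processed in increasing order within each partition, which makes it easy to generate sorted output files

Combiner function: usually the same code as the Reduce function, run locally on the map output. See the word count example

Input and output readers: standard input and output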

Skipping bad records: Control of bad input

Local execution for debugging

Status information through an external application
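A common choice of partition function is a hash of the key modulo R. The Python sketch below assumes zlib.crc32 as the hash, because Python's built-in hash() for strings changes between runs; the function name and R = 3 are made up for the illustration.

```python
import zlib
from collections import defaultdict


def partition(key: str, num_reduce_tasks: int) -> int:
    # crc32 is deterministic across runs, so a key always lands
    # in the same reduce partition.
    return zlib.crc32(key.encode("utf-8")) % num_reduce_tasks


R = 3  # hypothetical number of reduce tasks
buckets = defaultdict(list)
for word in ["foo", "bar", "baz", "test"]:
    buckets[partition(word, R)].append(word)

for reduce_id in sorted(buckets):
    print(reduce_id, buckets[reduce_id])
```

Every occurrence of a given key goes to the same reduce task, and a reasonable hash tends to spread distinct keys fairly evenly across the R partitions.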
What are the benefits of map reduce?

Easy to use for programmers, who don't need to worry about the details of distributed computing

A large set of problems can be expressed in the Map Reduce programming model

Flexible and scalable in large clusters of machines. The fault
tolerance is elegant and works
Programs that can be expressed with Map Reduce

Distributed Grep <word, match>

Count URL Access Frequency <URL, total_count>

Reverse Web-link graph <target, list(source)>

Term-Vector per Host <hostname, list(<word, frequency>)>

Inverted index <word, list(document ID)>

Distributed Sort <key, record>
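As one example from this list, an inverted index can be sketched in a few lines of Python. The documents and the names map_index and reduce_index are made up for the illustration, and the grouping loop again stands in for the library.

```python
from collections import defaultdict

documents = {"doc1": "foo bar", "doc2": "bar baz"}  # made-up input


def map_index(doc_id, contents):
    # Emit (word, document ID) once per distinct word in the document.
    for word in sorted(set(contents.split())):
        yield word, doc_id


def reduce_index(word, doc_ids):
    # Collect all document IDs for a word into one sorted list.
    return sorted(doc_ids)


intermediate = defaultdict(list)
for doc_id, contents in documents.items():
    for word, emitted_id in map_index(doc_id, contents):
        intermediate[word].append(emitted_id)

index = {word: reduce_index(word, ids) for word, ids in intermediate.items()}
print(sorted(index.items()))
# [('bar', ['doc1', 'doc2']), ('baz', ['doc2']), ('foo', ['doc1'])]
```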
References

MapReduce: Simplified Data Processing on Large Clusters (http://labs.google.com/papers/mapreduce-osdi04.pdf)

http://code.google.com/edu/parallel/mapreduce-tutorial.html

www.mapreduce.org

http://www.youtube.com/watch?v=yjPBkvYh-ss&feature=PlayList&p=

http://hadoop.apache.org/
Map Reduce
Questions?
