MapReduce
Presentation – Advanced Distributed Systems
MapReduce
 MapReduce is a programming model and an associated implementation for
processing and generating large data sets with a parallel, distributed
algorithm on a cluster.
 The Map() procedure performs filtering and sorting.
 The Reduce() procedure performs a summary operation (such as a statistical
aggregation).
HADOOP
 Hadoop is a free, Java-based programming framework that supports the
processing of large data sets in a distributed computing environment. It is
part of the Apache project sponsored by the Apache Software Foundation.
MapReduce - Orchestrating
 The MapReduce system orchestrates the processing by:
1. marshalling the distributed servers,
2. running the various tasks in parallel,
3. managing all communications and data transfers between the various parts of
the system, and
4. providing for redundancy and fault tolerance.
MapReduce contributions
 The key contributions (the important value added to the system) are the
scalability (by controlling a large number of nodes) and fault tolerance
achieved for a variety of applications by optimizing the execution engine
once.
MapReduce main steps (3 phases way)
 Steps of MapReduce :
1. "Map" step: Each worker node applies the "map()" function to the local data
and writes the output to temporary storage. A master node ensures that, of
any redundant copies of the input data, only one is processed.
2. "Shuffle" step: Worker nodes redistribute data based on the output keys
(produced by the "map()" function), so that all data belonging to one key is
located on the same worker node.
3. "Reduce" step: Worker nodes now process each group of output data, per key,
in parallel.
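The three steps above can be sketched in Python. This is a minimal single-process illustration (the function names `map_phase`, `shuffle_phase`, and `reduce_phase` are my own labels, not framework APIs); a real MapReduce system runs each phase across many worker nodes:

```python
from collections import defaultdict

def map_phase(records, map_fn):
    """Apply map_fn to each local record, collecting (key, value) pairs."""
    out = []
    for rec in records:
        out.extend(map_fn(rec))
    return out

def shuffle_phase(pairs):
    """Group all values by key, as if redistributing them among workers."""
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reduce_phase(groups, reduce_fn):
    """Apply reduce_fn to each per-key group, in any order."""
    return {k: reduce_fn(k, vs) for k, vs in groups.items()}

# Word count as the classic example:
docs = ["the quick fox", "the lazy dog"]
pairs = map_phase(docs, lambda doc: [(w, 1) for w in doc.split()])
result = reduce_phase(shuffle_phase(pairs), lambda k, vs: sum(vs))
# result["the"] == 2
```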
MapReduce main steps (5 phases way)
 Another way to look at MapReduce is as a 5-step parallel and distributed
computation:
1. Prepare the Map() input – the system designates Map processors, assigns the
input key value K1 that each processor will work on, and provides each
processor with all the input data associated with that key value.
2. Run the user-provided Map() code – Map() is run exactly once for each K1 key
value, generating output organized by key values K2.
3. "Shuffle" the Map output to the Reduce processors – the system designates
Reduce processors, assigns the K2 key value each processor will work on, and
provides each processor with all the Map-generated data associated with that
key value.
4. Run the user-provided Reduce() code – Reduce() is run exactly once for each
K2 key value produced by the Map step.
5. Produce the final output – the MapReduce system collects all the Reduce
output and sorts it by K2 to produce the final outcome.
Working in parallel
 Each mapping operation is independent of the others, so they can run in
parallel.
 The limitations are:
1. the number of independent data sources, and
2. the number of CPUs near each source.
 A set of 'reducers' can perform the reduction phase; all outputs of the map
operation that share the same key are presented to the same reducer at the
same time.
 The main advantage of working in parallel is the possibility of recovering
from partial failure of servers or storage during the operation: if one
mapper or reducer fails, the work can be rescheduled – assuming the input
data is still available.
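Because each mapping operation shares no state with the others, a pool of workers can run them concurrently, and a failed map can simply be resubmitted with the same input. A minimal sketch using Python's thread pool (standing in for a cluster of worker nodes):

```python
from concurrent.futures import ThreadPoolExecutor

def count_words(document):
    """One independent mapping operation: no shared state with other maps."""
    return [(w, 1) for w in document.split()]

docs = ["alpha beta", "beta gamma", "gamma delta"]
with ThreadPoolExecutor(max_workers=3) as pool:
    # The maps are independent, so execution order does not matter;
    # on failure, the same document could be handed to another worker.
    mapped = list(pool.map(count_words, docs))
```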
Logical work
 The Map and Reduce functions of MapReduce are both defined with respect to
data structured in (key, value) pairs. Map takes one pair of data with a type
in one data domain, and returns a list of pairs in a different domain:
Map(k1,v1) → list(k2,v2)
 The Map function is applied in parallel to every pair in the input dataset. This
produces a list of pairs for each call. After that, the MapReduce framework
collects all pairs with the same key from all lists and groups them together,
creating one group for each key.
 The Reduce function is then applied in parallel to each group, which in turn
produces a collection of values in the same domain:
Reduce(k2, list (v2)) → list(v3)
 Each Reduce call typically produces either one value v3 or an empty return,
though one call is allowed to return more than one value. The returns of all calls
are collected as the desired result list.
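These signatures can be made concrete with a hypothetical inverted-index job (an assumed example, not from the slides), where the domains genuinely differ: k1 is a document id, v1 is document text, k2 is a word, v2 is a document id, and each v3 is a (word, document list) record:

```python
from collections import defaultdict

# Map(k1, v1) -> list(k2, v2)
def map_fn(doc_id, text):
    return [(word, doc_id) for word in text.split()]

# Reduce(k2, list(v2)) -> list(v3): here one v3 per call, but a call
# is allowed to return more than one value (or none).
def reduce_fn(word, doc_ids):
    return [(word, sorted(set(doc_ids)))]

docs = {"d1": "big data", "d2": "big ideas"}

# The framework's grouping step: collect all pairs sharing a key.
groups = defaultdict(list)
for k1, v1 in docs.items():
    for k2, v2 in map_fn(k1, v1):
        groups[k2].append(v2)

# The returns of all Reduce calls form the desired result list.
index = [rec for k2, v2s in groups.items() for rec in reduce_fn(k2, v2s)]
# ("big", ["d1", "d2"]) appears in index
```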
Implementation
 Distributed implementations of MapReduce require a means of connecting
the processes performing the Map and Reduce phases.
Implementation (cont)
 The canonical example counts the appearance of each word in a set of
documents:
function map(String name, String document):
    // name: document name
    // document: document contents
    for each word w in document:
        emit (w, 1)

function reduce(String word, Iterator partialCounts):
    // word: a word
    // partialCounts: a list of aggregated partial counts
    sum = 0
    for each pc in partialCounts:
        sum += ParseInt(pc)
    emit (word, sum)
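A runnable Python equivalent of the pseudocode above. The framework's shuffle is simulated by grouping map outputs in a dictionary, which is how each reducer receives its partialCounts iterator:

```python
from collections import defaultdict

def map_fn(name, document):
    # name: document name; document: document contents
    for word in document.split():
        yield (word, 1)

def reduce_fn(word, partial_counts):
    # partial_counts: a list of aggregated partial counts
    return (word, sum(int(pc) for pc in partial_counts))

documents = {"doc1": "to be or not to be", "doc2": "to do"}

# Simulated shuffle: group every emitted pair by its key.
grouped = defaultdict(list)
for name, text in documents.items():
    for word, count in map_fn(name, text):
        grouped[word].append(count)

counts = dict(reduce_fn(w, pcs) for w, pcs in grouped.items())
# counts["to"] == 3, counts["be"] == 2
```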
Dataflow
 The hot spots in the dataflow (the parts defined by the application) are:
 an input reader
 a Map function
 a partition function
 a compare function
 a Reduce function
 an output writer
Dataflow (cont)
 Input reader
 The input reader divides the input into appropriately sized splits, and the
framework assigns one split to each Map function. The input reader reads
data from stable storage (typically a distributed file system) and generates
key/value pairs.
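An input reader for line-oriented text can be sketched as follows; this mirrors the spirit of Hadoop's TextInputFormat, which keys each line by its byte offset (the toy reader below works on an in-memory string rather than a distributed file system split):

```python
def input_reader(text):
    """Yield (byte offset, line) key/value pairs from a text split."""
    offset = 0
    for line in text.splitlines(keepends=True):
        yield (offset, line.rstrip("\n"))
        offset += len(line)

pairs = list(input_reader("first line\nsecond line\n"))
# pairs == [(0, "first line"), (11, "second line")]
```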
Dataflow (cont)
 Map function
 The Map function takes a series of key/value pairs, processes each, and
generates zero or more output key/value pairs. The input and output types of
the map are often different from each other.
 For example, if the application is doing a word count, the map function
would break the line into words and output a key/value pair for each word.
Each output pair would contain the word as the key and the number of
instances of that word in the line as the value.
Dataflow (cont)
 Partition function
 Each Map function output is allocated to a particular reducer by the
application's partition function for sharding purposes. The partition
function is given the key and the number of reducers and returns the index
of the desired reducer.
 It is important to pick a partition function that gives an approximately uniform
distribution of data per shard for load-balancing purposes, otherwise the
MapReduce operation can be held up waiting for slow reducers (reducers assigned
more than their share of data) to finish.
 Between the map and reduce stages, the data is shuffled (parallel-sorted or
exchanged between nodes) in order to move it from the map node that
produced it to the shard in which it will be reduced.
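A typical default partition function hashes the key modulo the number of reducers (this sketch uses MD5 for stability; Python's built-in hash() is salted per process, so it would route the same key differently on different nodes):

```python
import hashlib

def partition(key, num_reducers):
    """Return the index of the reducer responsible for this key.

    A stable hash ensures every map node routes the same key to the
    same reducer, regardless of which node computed the pair.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_reducers

# All map outputs with the same key land in the same shard:
assert partition("apple", 4) == partition("apple", 4)
# The shard index is always within range:
assert 0 <= partition("banana", 4) < 4
```

A hash partition gives an approximately uniform distribution only when keys themselves are spread evenly; a skewed key distribution is exactly the case where a custom partition function pays off.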
Dataflow (cont)
 Comparison function
 The input for each Reduce is pulled from the machine where the Map ran and
sorted using the application's comparison function.
Dataflow (cont)
 Reduce function
 The framework calls the application's Reduce function once for each unique
key in the sorted order. The Reduce can iterate through the values that are
associated with that key and produce zero or more outputs.
 In the word count example, the Reduce function takes the input values, sums
them and generates a single output of the word and the final sum.
Dataflow (cont)
 Output writer
 The Output Writer writes the output of the Reduce to the stable storage,
usually a distributed file system.
Performance
 MapReduce programs are not guaranteed to be fast.
 The partition function and the amount of data written by the Map function
can have a large impact on the performance.
 Additional modules such as the Combiner function can help to reduce the
amount of data written to disk, and transmitted over the network.
 Communication cost often dominates the computation cost, and many
MapReduce implementations are designed to write all communication to
distributed storage for crash recovery.
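The Combiner mentioned above is effectively a mini-reduce run on each map node's local output before the shuffle. A sketch (in a real framework the combiner is invoked per map task, not called by user code):

```python
from collections import defaultdict

def combine(local_pairs):
    """Pre-aggregate (word, 1) pairs on the map node before shuffling."""
    acc = defaultdict(int)
    for word, count in local_pairs:
        acc[word] += count
    return list(acc.items())

local = [("the", 1), ("cat", 1), ("the", 1), ("the", 1)]
combined = combine(local)
# Four pairs shrink to two before crossing the network:
# ("the", 3) and ("cat", 1)
```

This works for word count because summation is associative and commutative; a combiner is only safe for operations with those properties.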
Distribution and reliability
 MapReduce achieves reliability by parceling out a number of operations on
the set of data to each node in the network (load distribution).
 Each node is expected to report back periodically with completed work and
status updates.
 If a node falls silent for longer than the expected reporting interval, the
master node records the node as dead and sends out the node's assigned work
to other nodes.
 Individual operations use atomic operations for naming file outputs as a
check to ensure that no conflicting threads are running in parallel.
 Reduce operations operate in much the same way, and are scheduled to
conserve bandwidth across the backbone network of the datacenter.
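The atomic file-naming check can be sketched as follows: each task attempt writes to a private temporary file and commits with a single atomic rename, so duplicate (speculatively re-executed) attempts can never interleave partial output. This is a simplified local-filesystem sketch; real implementations commit into a distributed file system:

```python
import os
import tempfile

def commit_output(data: str, final_path: str) -> None:
    """Write to a temp file, then atomically rename it into place."""
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(final_path) or ".")
    with os.fdopen(fd, "w") as f:
        f.write(data)
    # os.replace is an atomic rename on POSIX: readers see either no
    # file or a complete one, and the last finished attempt wins.
    os.replace(tmp_path, final_path)
```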
Uses
 MapReduce is useful in distributed pattern-based searching, distributed
sorting, web link-graph reversal, Singular Value Decomposition, web access
log statistics, inverted index construction, document clustering, machine
learning, and statistical machine translation.
 It has been adapted to several computing environments, including multi-core
and many-core systems, desktop grids, volunteer computing environments,
dynamic cloud environments, and mobile environments.
Criticism
 Lack of novelty
 Restricted programming framework