Cloud Computing
MapReduce, Batch Processing
Eva Kalyvianaki
ek264@cam.ac.uk
Contents
 Batch processing: processing large amounts of data at once, in
one go, to deliver a result according to a query on the data.
 Material is from the paper:
“MapReduce: Simplified Data Processing on Large Clusters”,
by Jeffrey Dean and Sanjay Ghemawat (Google),
published at the USENIX OSDI conference, 2004
2
Motivation: Processing Large-sets of Data
 Need for many computations over large/huge sets of data:
 Input data: crawled documents, web request logs
 Output data: inverted indices, summary of pages crawled per host, the
set of the most frequent queries in a given day, …
 Most of these computations are relatively straightforward
 To speed up computation and shorten processing time, we can
distribute the data across 100s of machines and process it in
parallel
 But, parallel computations are difficult and complex to manage:
 Race conditions, debugging, data distribution, fault-tolerance, load
balancing, etc.
 Ideally, we would like to process data in parallel but not deal
with the complexity of parallelisation and data distribution
3
MapReduce
“A new abstraction that allows us to express the simple
computations we were trying to perform but hides the messy
details of parallelization, fault-tolerance, data distribution and
load balancing in a library.”
 Programming model:
 Provides abstraction to express computation
 Library:
 To take care of the runtime parallelisation of the computation.
4
Example: Counting the number of occurrences of each
word in the text below from Wikipedia
“Cloud computing is a recently evolved computing terminology
or metaphor based on utility and consumption of computing
resources. Cloud computing involves deploying groups of
remote servers and software networks that allow centralized
data storage and online access to computer services or
resources. Cloud can be classified as public, private or hybrid.”
Word: Number of Occurrences
Cloud 1
computing 1
is 1
a 1
recently 1
evolved 1
computing 1? → 2 (second occurrence: the two counts must be merged)
terminology 1
5
Programming Model
 Input: a set of key/value pairs
 Output: a set of key/value pairs
 Computation is expressed using the two functions:
1. Map task: a single pair → a list of intermediate pairs
 map(input-key, input-value) → list(out-key, intermediate-value)
 <k_i, v_i> → { <k_int, v_int> }
2. Reduce task: all intermediate pairs with the same k_int → a list of values
 reduce(out-key, list(intermediate-value)) → list(out-values)
 <k_int, {v_int}> → <k_o, v_o>
6
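To make the model concrete, below is a minimal single-machine sketch in Python. It is illustrative only: the names run_mapreduce, map_fn and reduce_fn are our own, not part of Google's library, and the real system runs the same steps distributed across machines.

from itertools import groupby
from operator import itemgetter

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Run every map call, collecting all intermediate <k_int, v_int> pairs.
    intermediate = []
    for key, value in inputs:
        intermediate.extend(map_fn(key, value))
    # Group all values that share an intermediate key ...
    intermediate.sort(key=itemgetter(0))
    # ... and run reduce once per distinct intermediate key.
    return [
        reduce_fn(k, [v for _, v in group])
        for k, group in groupby(intermediate, key=itemgetter(0))
    ]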
Example: Counting the number of occurrences of each
word in a collection of documents
map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
  // output_key: a word
  // intermediate_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));
7
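For reference, here is a runnable Python version of the same word count, with a sequential driver standing in for the distributed runtime (the names wc_map and wc_reduce are illustrative):

from collections import defaultdict

def wc_map(doc_name, contents):
    # One <word, 1> pair per occurrence, like EmitIntermediate(w, "1").
    return [(w, 1) for w in contents.split()]

def wc_reduce(word, counts):
    # Sum all the counts collected for one word.
    return (word, sum(counts))

docs = [("doc1", "cloud computing is a recently evolved computing terminology")]
groups = defaultdict(list)            # group intermediate pairs by word
for name, text in docs:
    for word, count in wc_map(name, text):
        groups[word].append(count)
print(sorted(wc_reduce(w, c) for w, c in groups.items()))
# [('a', 1), ('cloud', 1), ('computing', 2), ('evolved', 1), ...]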
MapReduce Example Applications
 The MapReduce model can be applied to many applications:
 Distributed grep:
 map: emits a line if it matches the pattern
 reduce: identity function
 Count of URL access Frequency
 Reverse Web-Link Graph
 Inverted Index
 Distributed Sort
 ….
8
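As an example, distributed grep fits the same two-function shape. Here is a toy Python sketch under our own naming (PATTERN, grep_map and grep_reduce are assumptions, not the paper's code):

import re

PATTERN = re.compile(r"error")  # hypothetical pattern being searched for

def grep_map(file_name, contents):
    # Emit each matching line as a key; the value is unused.
    return [(line, "") for line in contents.splitlines() if PATTERN.search(line)]

def grep_reduce(line, _values):
    # Identity function: pass the matching line through to the output.
    return (line, "")

print(grep_map("log1", "ok\nerror: disk failed\nok\nerror: timeout"))
# [('error: disk failed', ''), ('error: timeout', '')]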
MapReduce Implementation
 The MapReduce implementation presented in the paper
matched Google's infrastructure at the time:
1. Large cluster of commodity PCs connected via switched Ethernet
2. Machines are typically dual-processor x86, running Linux, with 2–4 GB of
memory (slow machines by today’s standards)
3. A cluster of machines, so failures are anticipated
4. Storage with the Google File System (GFS, 2003) on IDE disks
attached to PCs. GFS is a distributed file system that uses replication
for availability and reliability.
 Scheduling system:
1. Users submit jobs
2. Each job consists of tasks; scheduler assigns tasks to machines
9
Google File System (GFS)
 A file is divided into several chunks of predefined size:
 Typically 16–64 MB
 The system replicates each chunk a number of times:
 Usually three replicas
 To achieve fault-tolerance, availability and reliability
10
Parallel Execution
 User specifies:
 M: number of map tasks
 R: number of reduce tasks
 Map:
 MapReduce library splits the input file into M pieces
 Typically 16–64 MB per piece
 Map tasks are distributed across the machines
 Reduce:
 Partitioning the intermediate key space into R pieces
 hash(intermediate_key) mod R
 Typical setting:
 2,000 machines
 M = 200,000
 R = 5,000
11
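A small Python sketch of that default partitioning step (illustrative: zlib.crc32 stands in for whatever hash function the library uses; Python's built-in hash() is randomised per process, so it would not partition consistently across machines):

import zlib

R = 5  # number of reduce tasks, normally chosen by the user

def partition(intermediate_key: str) -> int:
    # hash(intermediate_key) mod R, computed with a process-stable hash.
    return zlib.crc32(intermediate_key.encode()) % R

for key in ["cloud", "computing", "is", "a"]:
    print(key, "-> reduce task", partition(key))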
Execution Flow
[Figure: MapReduce execution overview, from the paper]
12
Master Data Structures
 For each map/reduce task, the master stores:
 Its state: {idle, in-progress, completed}
 The identity of the worker machine (for non-idle tasks)
 The locations of intermediate file regions are passed from map tasks to
reduce tasks through the master.
 This information is pushed incrementally (as map tasks finish) to
workers that have in-progress reduce tasks.
13
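A rough Python sketch of this bookkeeping, assuming simple in-memory structures (the class and field names are our own, not the paper's):

from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

@dataclass
class TaskState:
    status: str = "idle"          # one of: idle, in-progress, completed
    worker: Optional[str] = None  # worker machine id, for non-idle tasks

@dataclass
class MasterState:
    map_tasks: Dict[int, TaskState] = field(default_factory=dict)
    reduce_tasks: Dict[int, TaskState] = field(default_factory=dict)
    # (map task id, reduce partition) -> location of an intermediate region,
    # pushed incrementally to workers running in-progress reduce tasks.
    intermediate_regions: Dict[Tuple[int, int], str] = field(default_factory=dict)

master = MasterState(map_tasks={i: TaskState() for i in range(3)})
master.map_tasks[0] = TaskState("in-progress", "worker-17")
master.intermediate_regions[(0, 1)] = "worker-17:/local/map-0-part-1"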
Fault-Tolerance
Two types of failures:
1. Worker failures:
 The master identifies failures with periodic heartbeat messages. If a
worker does not respond within a certain amount of time, it is marked as dead.
 In-progress and completed map tasks are re-scheduled → idle
 In-progress reduce tasks are re-scheduled → idle
 Workers executing reduce tasks affected by failed map workers are
notified of the re-scheduling
 Question: Why do completed map tasks have to be re-scheduled?
 Answer: Map output is stored on the local file system, while reduce output
is stored on GFS
2. Master failure:
1. Rare
2. Could be recovered from checkpoints
3. Solution in the paper: abort the MapReduce computation and start again
14
Disk Locality
 Network bandwidth is a relatively scarce resource, and moving data
across the network adds latency
 The goal is to save network bandwidth
 GFS typically stores three copies of each data block on
different machines
 Map tasks are scheduled “close” to data
 On nodes that hold the input data (on local disk)
 If not, on nodes that are nearer to the input data (e.g., same switch)
15
Task Granularity
 Number of map tasks > number of worker nodes
 Better load balancing
 Better recovery
 But this increases the load on the master
 More scheduling decisions
 More state to be kept
 M could be chosen with respect to the block size of the file
system
 For locality properties
 R is usually specified by users
 Each reduce task produces one output file
16
Stragglers
 Slow workers delay the overall completion time → stragglers
 Bad disks with soft errors
 Other tasks using up resources
 Machine configuration problems, etc.
 Very close to the end of the MapReduce operation, the master schedules
backup executions of the remaining in-progress tasks.
 A task is marked as complete whenever either the primary or the
backup execution completes.
 Example: the sort operation takes 44% longer to complete when the
backup task mechanism is disabled.
17
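The backup-task mechanism can be illustrated with a toy Python simulation (not the master's actual scheduler): each remaining in-progress task is launched twice, and whichever execution finishes first marks the task complete.

import concurrent.futures as cf
import random
import time

def run_task(task_id: int, role: str):
    # Simulate a task; roughly one in four executions is a slow straggler.
    time.sleep(random.choice([0.1, 0.1, 0.1, 2.0]))
    return task_id, role

remaining = [7, 8, 9]   # tasks still in progress near the end of the job
completed = set()
with cf.ThreadPoolExecutor() as pool:
    futures = [pool.submit(run_task, t, role)
               for t in remaining for role in ("primary", "backup")]
    for fut in cf.as_completed(futures):
        task_id, role = fut.result()
        if task_id not in completed:   # whichever execution finishes first wins
            completed.add(task_id)
            print(f"task {task_id} completed by its {role} execution")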
Refinements: Partitioning Function
 The partitioning function identifies the reduce task
 Users specify the desired number of output files, R
 But there may be many more intermediate keys than R
 The function uses the intermediate key and R
 Default: hash(key) mod R
 Important to choose well-balanced partitioning functions:
 hash(hostname(urlkey)) mod R
 For output keys that are URLs
18
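A sketch of such a user-supplied partitioning function in Python (illustrative names; this assumes URL keys that urlparse can handle):

import zlib
from urllib.parse import urlparse

R = 5

def url_partition(url_key: str) -> int:
    # Hash the hostname only, so all URLs from one host land in the
    # same reduce task, and therefore in the same output file.
    host = urlparse(url_key).hostname or ""
    return zlib.crc32(host.encode()) % R

print(url_partition("http://example.com/page-a"))
print(url_partition("http://example.com/page-b"))  # same partition as above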
Refinements: Combiner Function
 Introduces a mini-reduce phase before intermediate data is
sent to the reduce tasks
 Useful when there is significant repetition of intermediate keys
 Merges values of intermediate keys before sending them to reduce tasks
 Example: in word count there are many records of the form <word_name, 1>;
merge records with the same word_name
 Similar to reduce function
 Saves network bandwidth
19
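A minimal Python sketch of a word-count combiner (illustrative; in the real library the combiner runs inside the map worker, on that worker's output alone):

from collections import Counter

def combine(intermediate_pairs):
    # Partially sum the <word, count> records produced by one map task
    # before anything is written out or shipped to the reduce tasks.
    local = Counter()
    for word, count in intermediate_pairs:
        local[word] += count
    return list(local.items())

pairs = [("cloud", 1), ("computing", 1), ("cloud", 1)]
print(combine(pairs))  # [('cloud', 2), ('computing', 1)] - fewer records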
Evaluation - Setup
 Evaluation on two programs running on a large cluster and
processing 1 TB of data:
1. grep: search over 10^10 100-byte records looking for a rare 3-character
pattern (10^10 records × 100 bytes = 1 TB)
2. sort: sorts 10^10 100-byte records
 Cluster configuration:
 1,800 machines
 Each machine has two 2 GHz Intel Xeon processors, 4 GB of memory and
two 160 GB IDE disks
 Gigabit Ethernet link
 Hosted in the same facility
20
Grep
 M = 15,000 splits of 64 MB each
 R = 1
 The entire computation finishes in ~150 s
 Startup overhead is ~60 s
 Propagation of the program to workers
 Delays interacting with GFS to open 1,000 input files
 …
 Peaks at 30 GB/s with 1,764 workers
21
Sort
 M = 15,000 splits, 64MB each
 R = 4,000 files
 Workers = 1,700
 Evaluated on three executions:
 With backup tasks
 Without backup tasks
 With machine failures
22
Sort Results
23
Top: rate at which input is read
Middle: rate at which data is sent from mappers to reducers
Bottom: rate at which sorted data is written to output files by reducers
Normal execution: with backup tasks
Without backup tasks: 5 straggler reduce tasks, a 44% increase in
completion time
With machine failures (200 out of 1,746 workers killed): a 5% increase
over the normal execution time
Implementation
 The first MapReduce library was written in 02/2003
 Use cases (back then):
 Large-scale machine learning problems
 Clustering problems for Google News
 Extraction of data for Google Zeitgeist reports
 Large-scale graph computations
24
[Figure: MapReduce jobs run in 8/2004]
Summary
 MapReduce is a very powerful and expressive model
 Performance depends a lot on implementation details
 Material is from the paper:
“MapReduce: Simplified Data Processing on Large Clusters”,
by Jeffrey Dean and Sanjay Ghemawat (Google),
published at the USENIX OSDI conference, 2004
25
Editor's Notes
1. Managing parallelization, fault tolerance, data distribution and load
balancing is invisible to the user and comes for free. Motivation: primarily
meant for Google’s web indexing; once the documents are crawled, several
MapReduce operations are run in sequence to index them. (From the MapReduce
paper notes presentation.)
2. Best-effort: round robin. Hard requirement: earliest deadline first.
3. Map tasks write their outputs to temporary files.
4. Top: rate at which input is read.