MapReduce:
Simplified Data Processing on Large Clusters
Presented by Cleverence Kombe
By Jeffrey Dean and Sanjay Ghemawat
OUTLINE
1. Introduction
2. Programming Model
3. Implementation
4. Refinements
5. Performance
6. Experience and Conclusion
1. INTRODUCTION
o Many tasks in large-scale data processing consist of:
o Computations that process large amounts of raw data and produce large amounts of derived data.
o Because the input data is so large, the computation is distributed across hundreds or thousands of machines to finish in a reasonable amount of time.
o Google had implemented many special-purpose computations over raw data such as crawled documents and web request logs, and each had to parallelize the computation, distribute the data, and handle failures.
o But handling these issues required large amounts of complex code.
o Jeffrey Dean and Sanjay Ghemawat came up with the MapReduce concept, which simplifies data processing by hiding the messy details of parallelization, fault tolerance, data distribution, and load balancing in a library.
o What is MapReduce?
A programming model (an approach) for processing large data sets.
Contains Map and Reduce functions.
Runs on a large cluster of commodity machines.
Many real world tasks are expressible in this model.
o MapReduce provides:
User-defined functions
Automatic parallelization and distribution
Fault-tolerance
I/O scheduling
Status and monitoring
1. INTRODUCTION CONT…
o Input and output are sets of key/value pairs
o The programmer specifies two functions:
1. map (in_key, in_value) -> list(out_key, intermediate_value)
Processes an input key/value pair
Produces a set of intermediate pairs
2. reduce (out_key, list(intermediate_value)) -> list(out_value)
Combines all intermediate values for a particular key
Produces a set of merged output values (in most cases, just one)
2. PROGRAMMING MODEL
o Word Count Example
[Figure: word count data flow. Input files (Input file1, Input file2) -> Splitting (each line is passed to an individual mapper instance) -> Map (key/value pairs) -> Sort and Shuffle -> Reduce (key/value pairs) -> Final output (Output file)]
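The word count flow can be sketched in a few lines of Python. This is only a single-process illustration: the names map_fn, reduce_fn, and run_mapreduce, as well as the sample lines, are assumptions made for this sketch and are not part of the MapReduce library itself.

```python
from collections import defaultdict

def map_fn(in_key, in_value):
    """Map: emit (word, 1) for every word in one line of input."""
    for word in in_value.split():
        yield word.lower(), 1

def reduce_fn(out_key, intermediate_values):
    """Reduce: sum all counts emitted for one word."""
    yield sum(intermediate_values)

def run_mapreduce(inputs, map_fn, reduce_fn):
    """Single-process stand-in for the library: map, shuffle/sort, reduce."""
    groups = defaultdict(list)
    for in_key, in_value in inputs:
        for out_key, value in map_fn(in_key, in_value):
            groups[out_key].append(value)   # shuffle: group values by key
    output = {}
    for out_key in sorted(groups):          # sort by intermediate key
        output[out_key] = list(reduce_fn(out_key, groups[out_key]))
    return output

# Hypothetical input: one (key, line) pair per input line.
lines = [("file1:0", "Deer Bear River"), ("file2:0", "Car Car River")]
counts = run_mapreduce(lines, map_fn, reduce_fn)
print(counts)
```

In a real deployment the map and reduce calls would run on different machines; only the two user functions would stay the same.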
2. PROGRAMMING MODEL
…More Examples
Distributed Grep
• The map function emits a line if it matches a supplied pattern
Count of URL Access Frequency
• The map function processes logs of web page requests and outputs <URL, 1>
Reverse Web-Link Graph
• The map function outputs <target, source> pairs for each link to a target URL found in a page named source
Term-Vector per Host
• A term vector summarizes the most important words that occur in a document or a set of documents as a list of (word, frequency) pairs
Inverted Index
• The map function parses each document and emits a sequence of (word, document ID) pairs
Distributed Sort
• The map function extracts the key from each record and emits a (key, record) pair
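As one illustration of the examples above, here is a minimal sketch of the inverted index. The function names (index_map, index_reduce, build_index) and the two sample documents are hypothetical; the grouping step stands in for the shuffle that the library would perform across machines.

```python
from collections import defaultdict

def index_map(doc_id, text):
    """Map: emit (word, document ID) once per distinct word in the document."""
    for word in set(text.lower().split()):
        yield word, doc_id

def index_reduce(word, doc_ids):
    """Reduce: sort the document IDs collected for one word."""
    return sorted(doc_ids)

def build_index(docs):
    """Group intermediate pairs by word, then reduce each group."""
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        for word, d in index_map(doc_id, text):
            groups[word].append(d)
    return {word: index_reduce(word, ids) for word, ids in sorted(groups.items())}

# Hypothetical documents keyed by document ID.
docs = {"d1": "map reduce", "d2": "reduce sort sort"}
index = build_index(docs)
print(index)
```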
3. IMPLEMENTATION
• Many different implementations are possible
• The right choice depends on the environment.
• Typical cluster (in wide use at Google: large clusters of PCs connected via switched networks):
  • Hundreds to thousands of dual-processor x86 machines running Linux, with 2-4 GB of memory per machine.
  • Connected with commodity networking hardware; limited bisection bandwidth
  • Storage is on inexpensive local IDE disks
  • GFS: a distributed file system manages the data
  • Scheduling system: users submit jobs (job = set of tasks mapped by the scheduler to a set of available machines within the cluster)
Implemented as a C++ library that is linked into user programs
Execution Overview
Map
• Divide the input into M equal-sized splits
• Each split is 16-64 MB large
Reduce
• Partitioning intermediate key space into R pieces
• hash(intermediate_key) mod R
Typical setting:
• 2,000 machines
• M = 200,000
• R = 5,000
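The partitioning rule hash(intermediate_key) mod R can be sketched as follows. The choice of zlib.crc32 as the hash is an assumption made for this sketch (Python's built-in hash() for strings varies between processes), and R = 4 and the sample keys are illustrative values only.

```python
import zlib

def partition(intermediate_key: str, R: int) -> int:
    """Assign a key to one of R reduce regions: hash(key) mod R."""
    # crc32 is a stand-in for the hash; any deterministic hash would do.
    return zlib.crc32(intermediate_key.encode("utf-8")) % R

R = 4  # illustrative; the slide's typical setting is R = 5,000
keys = ["apple", "banana", "apple", "cherry"]
regions = [partition(k, R) for k in keys]
print(regions)
```

Because the function is deterministic, every occurrence of the same key lands in the same region, so a single reduce task sees all values for that key.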
3. IMPLEMENTATION...
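Execution Overview…
[Figure: execution overview. (0) The user program calls mapreduce(spec, &result). The input is divided into M splits of 16-64 MB each. The partitioning function hash(intermediate_key) mod R divides the map output into R regions. Each reduce worker reads all intermediate data for its region and sorts it by intermediate key.]
3. IMPLEMENTATION…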
Fault Tolerance
Worker failure: handled through re-execution
• Detect failure via periodic heartbeats
• Re-execute completed and in-progress map tasks
• Why re-execute even the completed map tasks? Their output is stored on the failed machine's local disk and is therefore inaccessible.
• Re-execute in-progress reduce tasks
• Task completion is committed through the master
Master failure:
• Could be handled, but is not yet (master failure is unlikely)
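The re-execution rules for a failed worker can be sketched as a small piece of master-side bookkeeping. The Task class, its field names, and handle_worker_failure are hypothetical names for this sketch; heartbeat detection itself is not modeled.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Task:
    kind: str                     # "map" or "reduce"
    state: str = "idle"           # "idle", "in_progress", or "completed"
    worker: Optional[str] = None  # worker the task is assigned to

def handle_worker_failure(tasks, failed_worker):
    """Reset to idle every task that must be re-executed after a failure."""
    rescheduled = []
    for task in tasks:
        if task.worker != failed_worker:
            continue
        # Map output lives on the failed machine's local disk, so even
        # completed map tasks are lost; completed reduce tasks are kept.
        lost_map = task.kind == "map" and task.state in ("in_progress", "completed")
        lost_reduce = task.kind == "reduce" and task.state == "in_progress"
        if lost_map or lost_reduce:
            task.state, task.worker = "idle", None
            rescheduled.append(task)
    return rescheduled

tasks = [
    Task("map", "completed", "w1"),
    Task("map", "in_progress", "w1"),
    Task("reduce", "in_progress", "w1"),
    Task("reduce", "completed", "w1"),
    Task("map", "completed", "w2"),
]
reset = handle_worker_failure(tasks, "w1")
print(len(reset), "tasks rescheduled")
```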
3. IMPLEMENTATION…
Locality
Master scheduling policy:
• Asks GFS for the locations of replicas of the input file blocks
• Map task inputs are typically split into 64 MB pieces (the GFS block size)
• Map tasks are scheduled so that a replica of the GFS input block is on the same machine or the same rack
As a result:
• Most tasks' input data is read locally and consumes no network bandwidth
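A locality-aware choice of worker might look like the sketch below: prefer an idle worker that holds a replica, then one on the same rack as a replica, then any idle worker. The function pick_worker, the rack_of mapping, and the host names are assumptions for illustration; the slides do not describe the master's actual data structures.

```python
def pick_worker(replica_hosts, idle_workers, rack_of):
    """Choose a worker for a map task, preferring data locality."""
    # 1. Same machine: the input block can be read from local disk.
    for worker in idle_workers:
        if worker in replica_hosts:
            return worker
    # 2. Same rack: the read stays behind one network switch.
    replica_racks = {rack_of[host] for host in replica_hosts}
    for worker in idle_workers:
        if rack_of[worker] in replica_racks:
            return worker
    # 3. Otherwise fall back to any idle worker (or None if there is none).
    return idle_workers[0] if idle_workers else None

# Hypothetical cluster layout: host -> rack.
rack_of = {"a": "r1", "b": "r1", "c": "r2", "d": "r3"}
replicas = {"a", "c"}

print(pick_worker(replicas, ["b", "d", "c"], rack_of))  # worker holding a replica
print(pick_worker(replicas, ["d", "b"], rack_of))       # worker on a replica's rack
```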
3. IMPLEMENTATION…
Backup Tasks
A common cause that lengthens the total time taken for a MapReduce operation is a straggler: a machine that takes unusually long to complete one of the last few tasks.
Backup tasks are a general mechanism to alleviate the problem of stragglers.
When a MapReduce operation is close to completion, the master schedules backup executions of the remaining in-progress tasks.
This significantly reduces the time to complete large MapReduce operations (up to 40%).
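The backup-task decision can be sketched as below. The 0.95 completion threshold is purely an assumption for this sketch (the slides only say the operation is close to completion), and tasks_needing_backup and the task names are hypothetical.

```python
def tasks_needing_backup(task_states, threshold=0.95):
    """Return the in-progress tasks to duplicate once the job is nearly done."""
    total = len(task_states)
    done = sum(1 for state in task_states.values() if state == "completed")
    # Only start backups when the operation is close to completion.
    if total == 0 or done / total < threshold:
        return []
    return sorted(t for t, state in task_states.items() if state == "in_progress")

# Hypothetical job: 99 tasks finished, one straggler still running.
states = {f"t{i}": "completed" for i in range(99)}
states["t99"] = "in_progress"
print(tasks_needing_backup(states))
```

A task would then be marked completed as soon as either the primary or the backup execution finishes.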
4. REFINEMENTS
• Different partitioning functions
  • Users specify the number of reduce tasks/output files that they desire (R).
• Combiner function
  • Useful for saving network bandwidth
• Different input/output types
• Skipping bad records
  • When a record repeatedly causes failures, the master tells the next worker to skip the bad record
• Local execution
  • An alternative implementation of the MapReduce library that sequentially executes all of the work for a MapReduce operation on the local machine.
• Status info
  • Progress of the computation and more info…
• Counters
  • Count occurrences of various events (e.g., total number of words processed)
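The bandwidth saving from the combiner function refinement can be illustrated with a word count sketch. The names map_words and combine and the sample line are assumptions; the combiner here simply pre-sums counts on the map side before anything would be sent over the network.

```python
from collections import Counter

def map_words(line):
    """Map: emit (word, 1) for every word."""
    return [(word, 1) for word in line.split()]

def combine(pairs):
    """Combiner: partially merge map output locally, before the shuffle."""
    partial = Counter()
    for key, value in pairs:
        partial[key] += value
    return sorted(partial.items())

raw = map_words("the cat and the hat and the bat")
combined = combine(raw)
print(len(raw), "pairs before combining,", len(combined), "after")
```

This works for word count because addition is commutative and associative, so summing partial counts does not change the final result.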
5. PERFORMANCE
Measures the performance of MapReduce on two computations running on a large cluster of machines.
Grep
• Searches through approximately one terabyte of data looking for a particular pattern
Sort
• Sorts approximately one terabyte of data
Cluster Configuration
Specifications
Cluster: 1800 machines
Memory: 4 GB
Processors: Dual-processor 2 GHz Xeons with Hyper-threading
Hard disk: Dual 160 GB IDE disks
Network: Gigabit Ethernet per machine; aggregate bandwidth approximately 100 Gbps
5. PERFORMANCE…
Grep Computation
• Scans 10 billion 100-byte records, searching for a rare 3-character pattern (occurs in 92,337 records).
• Input is split into approximately 64 MB pieces (M = 15000); the entire output is placed in one file (R = 1).
• Startup overhead is significant for short jobs
[Figure: data transfer rate over time]
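The value of M follows from the input size and the split size. The short calculation below assumes that 64 MB means 64 * 2^20 bytes, which is an assumption of this sketch; it shows why M comes out at roughly 15000.

```python
import math

records = 10 * 10**9        # 10 billion records
record_bytes = 100          # 100 bytes per record
split_bytes = 64 * 2**20    # assumed: 64 MB = 64 * 2^20 bytes

total_bytes = records * record_bytes          # about one terabyte
m = math.ceil(total_bytes / split_bytes)      # number of map tasks
print(total_bytes, m)
```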
5. PERFORMANCE…
Sort Computation
• Backup tasks improve completion time noticeably.
• The system handles machine failures relatively quickly.
5. PERFORMANCE…
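[Figure: data transfer rates over time for different executions of the sort program. Compared with normal execution, the run without backup tasks takes 44% longer and the run with tasks killed takes 5% longer.]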
6. EXPERIENCE AND CONCLUSION
MapReduce has proven to be a useful abstraction
Greatly simplifies large-scale computations at Google
Fun to use: focus on the problem, let the library deal with the messy details
No great need for parallelization knowledge
• (relieves the user from dealing with low-level parallelization details)
Thank you!

 

Map reduce - simplified data processing on large clusters

  • 1. MapReduce: Simplified Data Processing on Large Clusters Presented by Cleverence Kombe By Jeffrey Dean and Sanjay Ghemawat
  • 2. OUTLINE 1. Introduction 2. Programming Model 3. Implementation 4. Refinements 5. Performance 6. Experience and Conclusion
  • 3. 1. INTRODUCTION o Many large-scale data processing tasks consist of computations that process large amounts of raw data and produce large amounts of derived data. o Because the input data is so large, the computation is distributed across hundreds or thousands of machines so that it completes in a reasonable amount of time. o Google has implemented many special-purpose computations over raw data such as crawled documents and web request logs; each one has to parallelize the computation, distribute the data, and handle failures. o Handling these issues requires large amounts of complex code that obscures the original simple computation. o Jeffrey Dean and Sanjay Ghemawat introduced MapReduce, which simplifies data processing by hiding the messy details of parallelization, fault tolerance, data distribution, and load balancing in a library.
  • 4. 1. INTRODUCTION CONT… o What is MapReduce? A programming model for processing large data sets. It consists of user-written Map and Reduce functions. It runs on a large cluster of commodity machines. Many real-world tasks are expressible in this model. o MapReduce provides: user-defined functions, automatic parallelization and distribution, fault tolerance, I/O scheduling, and status and monitoring.
  • 5. 2. PROGRAMMING MODEL o Input and output are sets of key/value pairs. o The programmer specifies two functions: 1. map (in_key, in_value) -> list(out_key, intermediate_value): processes an input key/value pair and produces a set of intermediate pairs. 2. reduce (out_key, list(intermediate_value)) -> list(out_value): combines all intermediate values for a particular key and produces a set of merged output values (in most cases just one).
  • 6. 2. PROGRAMMING MODEL … o Word Count Example (diagram): Input Files (Input file1, Input file2; each line is passed to an individual mapper instance) -> Map (key/value splitting) -> Sort and Shuffle -> Reduce (key/value pairs) -> Final Output (Output file).
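The word count flow can be sketched in a few lines of Python. This is a minimal single-process illustration of the programming model, not the distributed C++ library described in the slides; the helper name `run_mapreduce`, the function names `map_fn` and `reduce_fn`, and the sample inputs are assumptions made for this example.

```python
from collections import defaultdict

def map_fn(in_key, in_value):
    """in_key: document name (unused here); in_value: document contents."""
    for word in in_value.split():
        yield word, 1  # emit an intermediate (word, 1) pair per occurrence

def reduce_fn(out_key, intermediate_values):
    """Combine all counts emitted for one word into a single total."""
    yield sum(intermediate_values)

def run_mapreduce(inputs, map_fn, reduce_fn):
    """Hypothetical in-memory driver: map, group by key (shuffle), then reduce."""
    groups = defaultdict(list)
    for in_key, in_value in inputs:
        for out_key, value in map_fn(in_key, in_value):
            groups[out_key].append(value)
    # Process keys in sorted order, mirroring the sort-and-shuffle step.
    return {key: list(reduce_fn(key, values)) for key, values in sorted(groups.items())}

counts = run_mapreduce(
    [("file1", "deer bear river"), ("file2", "car car river")],
    map_fn,
    reduce_fn,
)
```

Here `counts` maps each word to a one-element list, matching the note that reduce produces just one output value in most cases.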
  • 7. 2. PROGRAMMING MODEL … More Examples. Distributed Grep: the map function emits a line if it matches a supplied pattern. Count of URL Access Frequency: the map function processes logs of web page requests and outputs <URL, 1>. Reverse Web-Link Graph: the map function outputs <target, source> pairs for each link to a target URL found in a page named source. Term-Vector per Host: a term vector summarizes the most important words that occur in a document or a set of documents as a list of <word, frequency> pairs. Inverted Index: the map function parses each document and emits a sequence of <word, document ID> pairs. Distributed Sort: the map function extracts the key from each record and emits a <key, record> pair.
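As one illustration of these examples, the inverted index can be sketched as follows. This is a single-process sketch under stated assumptions: the names `map_index`, `reduce_index`, and `build_index` and the sample documents are hypothetical, and the reduce step (sorting the document IDs for each word) follows the description in the original MapReduce paper rather than anything shown on the slide.

```python
from collections import defaultdict

def map_index(doc_id, text):
    """Emit one (word, document ID) pair per distinct word in the document."""
    for word in set(text.lower().split()):
        yield word, doc_id

def reduce_index(word, doc_ids):
    """Collect and sort all document IDs in which the word occurs."""
    return word, sorted(doc_ids)

def build_index(docs):
    """Hypothetical in-memory driver: map every document, group by word, reduce."""
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        for word, d in map_index(doc_id, text):
            groups[word].append(d)
    return dict(reduce_index(word, ids) for word, ids in groups.items())

index = build_index({"d1": "map reduce", "d2": "reduce sort"})
```

The resulting `index` lets a lookup such as `index["reduce"]` return every document that contains the word.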
  • 8. 3. IMPLEMENTATION: Many different implementations are possible; the right choice depends on the environment. Typical cluster (in wide use at Google: large clusters of PCs connected via switched networks): • Hundreds to thousands of dual-processor x86 machines running Linux, with 2-4 GB of memory per machine. • Machines are connected with commodity networking hardware; bisection bandwidth is limited. • Storage is on inexpensive local IDE disks. • GFS, a distributed file system, manages the data. • Users submit jobs to a scheduling system (a job is a set of tasks that the scheduler maps to available machines within the cluster). MapReduce is implemented as a C++ library that is linked into user programs.
  • 9. 3. IMPLEMENTATION… Execution Overview. Map: • The input is divided into M splits of roughly equal size. • Each split is 16-64 MB. Reduce: • The intermediate key space is partitioned into R pieces. • Partitioning function: hash(intermediate_key) mod R. Typical setting: • 2,000 machines • M = 200,000 • R = 5,000
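The partitioning rule hash(intermediate_key) mod R can be sketched as below. The choice of `zlib.crc32` as the hash is an assumption made for this example (Python's built-in `hash()` for strings varies between runs, so a stable hash is used instead); the slides do not say which hash function the library uses. The key list and the small value of R are also illustrative.

```python
import zlib

def partition(intermediate_key: str, R: int) -> int:
    """Assign a key to one of R reduce regions using a stable hash (assumed: CRC32)."""
    return zlib.crc32(intermediate_key.encode("utf-8")) % R

R = 4  # illustrative; the slide's typical setting is R = 5,000
keys = ["apple", "bear", "car", "deer", "river"]

# Group keys by the region (reduce task) that will receive them.
regions = {r: [] for r in range(R)}
for key in keys:
    regions[partition(key, R)].append(key)
```

Because the region depends only on the key, every intermediate pair with the same key reaches the same reduce task, no matter which map task produced it.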
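  • 10. 3. IMPLEMENTATION… Execution Overview… (diagram): (0) the user program calls mapreduce(spec, &result); the input is divided into M splits of 16-64 MB each; the partitioning function hash(intermediate_key) mod R divides each map worker's output into R regions; each reduce worker reads all intermediate data for its region and sorts it by intermediate key.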
  • 11. 3. IMPLEMENTATION… Fault Tolerance. Worker failure: handled through re-execution. • Failures are detected via periodic heartbeats. • Completed and in-progress map tasks are re-executed. • Why re-execute even completed map tasks? Their output is stored on the local disk of the failed machine and is therefore inaccessible. • In-progress reduce tasks are re-executed. • Task completion is committed through the master. Master failure: • It could be handled, but the implementation does not do so yet (master failure is unlikely).
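The re-execution rule for a failed worker can be sketched as a small function. This is a hypothetical illustration: the task records (dicts with `id`, `kind`, `state`, and `worker` fields) and the function name `tasks_to_reschedule` are assumptions for this example, not the master's actual data structures. The rule itself follows the original paper: map output lives on the failed machine's local disk, so even completed map tasks are redone, while completed reduce output is already stored in the global file system.

```python
def tasks_to_reschedule(tasks, failed_worker):
    """Return the ids of tasks that should be reset to idle after a worker fails."""
    reset = []
    for task in tasks:
        if task["worker"] != failed_worker:
            continue  # tasks on healthy workers are unaffected
        if task["kind"] == "map" and task["state"] in ("in_progress", "completed"):
            reset.append(task["id"])  # map output on the failed machine is lost
        elif task["kind"] == "reduce" and task["state"] == "in_progress":
            reset.append(task["id"])  # completed reduce output is already durable
    return reset

tasks = [
    {"id": "m1", "kind": "map", "state": "completed", "worker": "w1"},
    {"id": "m2", "kind": "map", "state": "in_progress", "worker": "w1"},
    {"id": "r1", "kind": "reduce", "state": "completed", "worker": "w1"},
    {"id": "r2", "kind": "reduce", "state": "in_progress", "worker": "w1"},
    {"id": "m3", "kind": "map", "state": "completed", "worker": "w2"},
]
to_redo = tasks_to_reschedule(tasks, "w1")
```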
  • 12. 3. IMPLEMENTATION… Locality. Master scheduling policy: • Asks GFS for the locations of the replicas of the input file blocks. • Input is typically split into 64 MB pieces (the GFS block size), one per map task. • Map tasks are scheduled so that a replica of the GFS input block is on the same machine or the same rack. As a result: • Most tasks' input data is read locally and consumes no network bandwidth.
  • 13. 3. IMPLEMENTATION… Backup Tasks. • A common cause that lengthens the total time taken by a MapReduce operation is a straggler: a machine that takes an unusually long time to complete one of the last few tasks. • Mechanism to alleviate the problem of stragglers: when the operation is close to completion, the master schedules backup executions of the remaining in-progress tasks. • This significantly reduces the time to complete large MapReduce operations (up to 40%).
  • 14. 4. REFINEMENTS • Different partitioning functions. • Users specify the number of reduce tasks/output files that they desire (R). • Combiner function: useful for saving network bandwidth. • Different input/output types. • Skipping bad records: on re-execution, the master tells the next worker to skip the bad record. • Local execution: an alternative implementation of the MapReduce library that sequentially executes all of the work for a MapReduce operation on the local machine. • Status info: progress of the computation and more. • Counters: count occurrences of various events (e.g., total number of words processed).
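To see why a combiner saves network bandwidth, the sketch below compares the number of intermediate pairs a single map task would send with and without local pre-aggregation. The function names `map_words` and `combine` and the sample text are assumptions for this example; the approach is valid for word count because addition is commutative and associative, so partial sums can be merged in any order.

```python
from collections import Counter

def map_words(text):
    """Emit one (word, 1) pair per word occurrence in this map task's split."""
    return [(word, 1) for word in text.split()]

def combine(pairs):
    """Pre-aggregate counts on the map side before anything crosses the network."""
    partial = Counter()
    for word, count in pairs:
        partial[word] += count
    return list(partial.items())

split = "the cat and the dog and the bird"
raw_pairs = map_words(split)         # what would be sent without a combiner
combined_pairs = combine(raw_pairs)  # what is sent with a combiner
```

In this sample the split yields 8 raw pairs but only 5 combined pairs, and the saving grows with how often words repeat within a split.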
  • 15. 5. PERFORMANCE: The performance of MapReduce is measured on two computations running on a large cluster of machines. Grep: • Searches through approximately one terabyte of data looking for a particular pattern. Sort: • Sorts approximately one terabyte of data.
  • 16. 5. PERFORMANCE… Cluster Configuration (specifications). Cluster: 1800 machines. Memory: 4 GB per machine. Processors: dual-processor 2 GHz Xeons with Hyper-Threading. Hard disk: dual 160 GB IDE disks. Network: Gigabit Ethernet per machine; aggregate bandwidth approximately 100 Gbps.
  • 17. 5. PERFORMANCE… Grep Computation. Scans 10 billion 100-byte records, searching for a rare 3-character pattern (which occurs in 92,337 records). The input is split into approximately 64 MB pieces (M = 15,000), and the entire output is placed in one file (R = 1). Startup overhead is significant for short jobs. Figure: data transfer rate over time.
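The figures on this slide can be checked with a short calculation. The sketch assumes that "64 MB" means 64 × 2^20 bytes, which the slide does not state; the variable names are illustrative.

```python
records = 10**10            # 10 billion records
record_bytes = 100          # each record is 100 bytes
split_bytes = 64 * 2**20    # assumed: 64 MB = 64 * 2^20 bytes

total_bytes = records * record_bytes          # total input size in bytes
num_splits = -(-total_bytes // split_bytes)   # ceiling division with integers
```

The input comes to 10^12 bytes, about one terabyte, and dividing it into 64 MB pieces gives 14,902 splits, consistent with the slide's rounded M = 15,000.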
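  • 18. 5. PERFORMANCE… Sort Computation. Backup tasks reduce completion time considerably: with backup tasks disabled, the sort takes 44% longer. The system handles machine failures relatively quickly: with 200 worker processes killed during the run, the sort takes 5% longer. Figure: data transfer rates over time for different executions of the sort program.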
  • 19. 6. Experience & Conclusions: MapReduce has proven to be a useful abstraction. It greatly simplifies large-scale computations at Google. It is fun to use: focus on the problem and let the library deal with the messy details. No extensive knowledge of parallelization is needed • (it relieves the user from dealing with low-level parallelization details).