SlideShare a Scribd company logo
1 of 15
Map-Reduce Algorithms
Introduction to Map-Reduce
• MapReduce is a programming model used for efficient processing in
parallel over large data-sets in a distributed manner.
• The data is first split and then combined to produce the final result.
• The purpose of MapReduce in Hadoop is to Map each of the jobs and
then it will reduce it to equivalent tasks for providing less overhead
over the cluster network and to reduce the processing power.
MapReduce :
Map phase and Reduce phase.
1. Map Phase:
1.As the name suggests its main use is to map the input data in
key-value pairs.
2.The input to the map may be a key-value pair where the key can
be the id of some kind of address and value is the actual value
that it keeps.
3.The Map() function will be executed in its memory repository on
each of these input key-value pairs and generates the
intermediate key-value pair which works as input for the Reducer
or Reduce() function.
2. Reduce Phase :
1. The intermediate key-value pairs that work as input for
Reducer are shuffled and sort and send to
the Reduce() function.
2. Reducer aggregate or group the data based on its key-value
pair as per the reducer algorithm written by the developer.
Example
Algorithms
MapReduce implements various mathematical algorithms to divide a task into small
parts and assign them to multiple systems. In technical terms, MapReduce algorithm
helps in sending the Map & Reduce tasks to appropriate servers in a cluster.
These mathematical algorithms may include the following −
•Sorting
•Searching
•Indexing
•TF-IDF
1. Sorting :-
Sorting is one of the basic MapReduce algorithms to process and analyze data. MapReduce
implements sorting algorithm to automatically sort the output key-value pairs from the mapper
by their keys.
Sorting methods are implemented in the mapper class itself.
2. Searching :-
• Let us assume we have employee data in four different files − A, B, C, and D. Let us also
assume there are duplicate employee records in all four files because of importing the
employee data from all database tables repeatedly. See the following illustration
• The Map phase processes each input file and provides the employee data in key-value pairs
(<k, v> : <emp name, salary>). See the following illustration.
• The combiner phase (searching technique) will accept the input from the Map phase as
a key-value pair with employee name and salary. Using searching technique, the combiner
will check all the employee salary to find the highest salaried employee in each file
Reducer phase − Form each file, you will find the highest salaried employee. To avoid
redundancy, check all the <k, v> pairs and eliminate duplicate entries, if any. The same algorithm
is used in between the four <k, v> pairs, which are coming from four input files. The final output
should be as follows −
<gopal, 50000>
3. Indexing
:-
Normally indexing is used to point to a particular data and its address. It performs batch
indexing on the input files for a particular Mapper.
The following text is the input for inverted indexing. Here T[0], T[1], and t[2] are the file
names and their content are in double quotes.
T[0] = "it is what it is"
T[1] = "what is it"
T[2] = "it is a banana"
After applying the Indexing algorithm, we get the following output −
"a": {2}
"banana": {2}
"is": {0, 1, 2}
"it": {0, 1, 2}
"what": {0, 1}
4. TF-IDF :-
TF-IDF is a text processing algorithm which is short for Term Frequency − Inverse Document
Frequency. It is one of the common web analysis algorithms. Here, the term 'frequency' refers to
the number of times a term appears in a document.
1. Term Frequency (TF)
It measures how frequently a particular term occurs in a document. It is calculated by the
number of times a word appears in a document divided by the total number of words in that
document.
TF(the) = (Number of times term the ‘the’ appears in a document)
2. Inverse Document Frequency
It measures the importance of a term. It is calculated by the number of documents in
the text database divided by the number of documents where a specific term appears.
Consider a document containing 1000 words, wherein the word hive appears 50
times.
The TF for hive is then (50 / 1000) = 0.05.
Now, assume we have 10 million documents and the word hive appears in 1000 of
these.
Then, the IDF is calculated as log(10,000,000 / 1,000) = 4.
The TF-IDF weight is the product of these quantities − 0.05 × 4 = 0.20.
Thank You

More Related Content

Similar to Introduction to Map-Reduce in Hadoop.pptx

2004 map reduce simplied data processing on large clusters (mapreduce)
2004 map reduce simplied data processing on large clusters (mapreduce)2004 map reduce simplied data processing on large clusters (mapreduce)
2004 map reduce simplied data processing on large clusters (mapreduce)
anh tuan
 
DataMiningReport
DataMiningReportDataMiningReport
DataMiningReport
?? ?
 
Tri Merge Sorting Algorithm
Tri Merge Sorting AlgorithmTri Merge Sorting Algorithm
Tri Merge Sorting Algorithm
Ashim Sikder
 

Similar to Introduction to Map-Reduce in Hadoop.pptx (20)

Introduction to the Map-Reduce framework.pdf
Introduction to the Map-Reduce framework.pdfIntroduction to the Map-Reduce framework.pdf
Introduction to the Map-Reduce framework.pdf
 
Map reduce
Map reduceMap reduce
Map reduce
 
2004 map reduce simplied data processing on large clusters (mapreduce)
2004 map reduce simplied data processing on large clusters (mapreduce)2004 map reduce simplied data processing on large clusters (mapreduce)
2004 map reduce simplied data processing on large clusters (mapreduce)
 
MapReduce.pptx
MapReduce.pptxMapReduce.pptx
MapReduce.pptx
 
MapReduceAlgorithms.ppt
MapReduceAlgorithms.pptMapReduceAlgorithms.ppt
MapReduceAlgorithms.ppt
 
Lecture 1 mapreduce
Lecture 1  mapreduceLecture 1  mapreduce
Lecture 1 mapreduce
 
Data Structures Notes
Data Structures NotesData Structures Notes
Data Structures Notes
 
MapReduce-Notes.pdf
MapReduce-Notes.pdfMapReduce-Notes.pdf
MapReduce-Notes.pdf
 
Join Algorithms in MapReduce
Join Algorithms in MapReduceJoin Algorithms in MapReduce
Join Algorithms in MapReduce
 
Map reduce presentation
Map reduce presentationMap reduce presentation
Map reduce presentation
 
Introduction to data structures and complexity.pptx
Introduction to data structures and complexity.pptxIntroduction to data structures and complexity.pptx
Introduction to data structures and complexity.pptx
 
DataMiningReport
DataMiningReportDataMiningReport
DataMiningReport
 
Unit-2 Hadoop Framework.pdf
Unit-2 Hadoop Framework.pdfUnit-2 Hadoop Framework.pdf
Unit-2 Hadoop Framework.pdf
 
Unit-2 Hadoop Framework.pdf
Unit-2 Hadoop Framework.pdfUnit-2 Hadoop Framework.pdf
Unit-2 Hadoop Framework.pdf
 
Session 19 - MapReduce
Session 19  - MapReduce Session 19  - MapReduce
Session 19 - MapReduce
 
Algorithms.
Algorithms. Algorithms.
Algorithms.
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Aggarwal Draft
Aggarwal DraftAggarwal Draft
Aggarwal Draft
 
Mapreduce Algorithms
Mapreduce AlgorithmsMapreduce Algorithms
Mapreduce Algorithms
 
Tri Merge Sorting Algorithm
Tri Merge Sorting AlgorithmTri Merge Sorting Algorithm
Tri Merge Sorting Algorithm
 

Recently uploaded

Transparency, Recognition and the role of eSealing - Ildiko Mazar and Koen No...
Transparency, Recognition and the role of eSealing - Ildiko Mazar and Koen No...Transparency, Recognition and the role of eSealing - Ildiko Mazar and Koen No...
Transparency, Recognition and the role of eSealing - Ildiko Mazar and Koen No...
EADTU
 
SPLICE Working Group: Reusable Code Examples
SPLICE Working Group:Reusable Code ExamplesSPLICE Working Group:Reusable Code Examples
SPLICE Working Group: Reusable Code Examples
Peter Brusilovsky
 
SURVEY I created for uni project research
SURVEY I created for uni project researchSURVEY I created for uni project research
SURVEY I created for uni project research
CaitlinCummins3
 

Recently uploaded (20)

Supporting Newcomer Multilingual Learners
Supporting Newcomer  Multilingual LearnersSupporting Newcomer  Multilingual Learners
Supporting Newcomer Multilingual Learners
 
BỘ LUYỆN NGHE TIẾNG ANH 8 GLOBAL SUCCESS CẢ NĂM (GỒM 12 UNITS, MỖI UNIT GỒM 3...
BỘ LUYỆN NGHE TIẾNG ANH 8 GLOBAL SUCCESS CẢ NĂM (GỒM 12 UNITS, MỖI UNIT GỒM 3...BỘ LUYỆN NGHE TIẾNG ANH 8 GLOBAL SUCCESS CẢ NĂM (GỒM 12 UNITS, MỖI UNIT GỒM 3...
BỘ LUYỆN NGHE TIẾNG ANH 8 GLOBAL SUCCESS CẢ NĂM (GỒM 12 UNITS, MỖI UNIT GỒM 3...
 
Including Mental Health Support in Project Delivery, 14 May.pdf
Including Mental Health Support in Project Delivery, 14 May.pdfIncluding Mental Health Support in Project Delivery, 14 May.pdf
Including Mental Health Support in Project Delivery, 14 May.pdf
 
Transparency, Recognition and the role of eSealing - Ildiko Mazar and Koen No...
Transparency, Recognition and the role of eSealing - Ildiko Mazar and Koen No...Transparency, Recognition and the role of eSealing - Ildiko Mazar and Koen No...
Transparency, Recognition and the role of eSealing - Ildiko Mazar and Koen No...
 
e-Sealing at EADTU by Kamakshi Rajagopal
e-Sealing at EADTU by Kamakshi Rajagopale-Sealing at EADTU by Kamakshi Rajagopal
e-Sealing at EADTU by Kamakshi Rajagopal
 
Scopus Indexed Journals 2024 - ISCOPUS Publications
Scopus Indexed Journals 2024 - ISCOPUS PublicationsScopus Indexed Journals 2024 - ISCOPUS Publications
Scopus Indexed Journals 2024 - ISCOPUS Publications
 
OSCM Unit 2_Operations Processes & Systems
OSCM Unit 2_Operations Processes & SystemsOSCM Unit 2_Operations Processes & Systems
OSCM Unit 2_Operations Processes & Systems
 
male presentation...pdf.................
male presentation...pdf.................male presentation...pdf.................
male presentation...pdf.................
 
How to Send Pro Forma Invoice to Your Customers in Odoo 17
How to Send Pro Forma Invoice to Your Customers in Odoo 17How to Send Pro Forma Invoice to Your Customers in Odoo 17
How to Send Pro Forma Invoice to Your Customers in Odoo 17
 
VAMOS CUIDAR DO NOSSO PLANETA! .
VAMOS CUIDAR DO NOSSO PLANETA!                    .VAMOS CUIDAR DO NOSSO PLANETA!                    .
VAMOS CUIDAR DO NOSSO PLANETA! .
 
Sternal Fractures & Dislocations - EMGuidewire Radiology Reading Room
Sternal Fractures & Dislocations - EMGuidewire Radiology Reading RoomSternal Fractures & Dislocations - EMGuidewire Radiology Reading Room
Sternal Fractures & Dislocations - EMGuidewire Radiology Reading Room
 
TỔNG HỢP HƠN 100 ĐỀ THI THỬ TỐT NGHIỆP THPT TOÁN 2024 - TỪ CÁC TRƯỜNG, TRƯỜNG...
TỔNG HỢP HƠN 100 ĐỀ THI THỬ TỐT NGHIỆP THPT TOÁN 2024 - TỪ CÁC TRƯỜNG, TRƯỜNG...TỔNG HỢP HƠN 100 ĐỀ THI THỬ TỐT NGHIỆP THPT TOÁN 2024 - TỪ CÁC TRƯỜNG, TRƯỜNG...
TỔNG HỢP HƠN 100 ĐỀ THI THỬ TỐT NGHIỆP THPT TOÁN 2024 - TỪ CÁC TRƯỜNG, TRƯỜNG...
 
Analyzing and resolving a communication crisis in Dhaka textiles LTD.pptx
Analyzing and resolving a communication crisis in Dhaka textiles LTD.pptxAnalyzing and resolving a communication crisis in Dhaka textiles LTD.pptx
Analyzing and resolving a communication crisis in Dhaka textiles LTD.pptx
 
Spring gala 2024 photo slideshow - Celebrating School-Community Partnerships
Spring gala 2024 photo slideshow - Celebrating School-Community PartnershipsSpring gala 2024 photo slideshow - Celebrating School-Community Partnerships
Spring gala 2024 photo slideshow - Celebrating School-Community Partnerships
 
Mattingly "AI and Prompt Design: LLMs with NER"
Mattingly "AI and Prompt Design: LLMs with NER"Mattingly "AI and Prompt Design: LLMs with NER"
Mattingly "AI and Prompt Design: LLMs with NER"
 
Đề tieng anh thpt 2024 danh cho cac ban hoc sinh
Đề tieng anh thpt 2024 danh cho cac ban hoc sinhĐề tieng anh thpt 2024 danh cho cac ban hoc sinh
Đề tieng anh thpt 2024 danh cho cac ban hoc sinh
 
Mattingly "AI & Prompt Design: Named Entity Recognition"
Mattingly "AI & Prompt Design: Named Entity Recognition"Mattingly "AI & Prompt Design: Named Entity Recognition"
Mattingly "AI & Prompt Design: Named Entity Recognition"
 
SPLICE Working Group: Reusable Code Examples
SPLICE Working Group:Reusable Code ExamplesSPLICE Working Group:Reusable Code Examples
SPLICE Working Group: Reusable Code Examples
 
SURVEY I created for uni project research
SURVEY I created for uni project researchSURVEY I created for uni project research
SURVEY I created for uni project research
 
When Quality Assurance Meets Innovation in Higher Education - Report launch w...
When Quality Assurance Meets Innovation in Higher Education - Report launch w...When Quality Assurance Meets Innovation in Higher Education - Report launch w...
When Quality Assurance Meets Innovation in Higher Education - Report launch w...
 

Introduction to Map-Reduce in Hadoop.pptx

  • 2. Introduction to Map-Reduce • MapReduce is a programming model used for efficient processing in parallel over large data-sets in a distributed manner. • The data is first split and then combined to produce the final result. • The purpose of MapReduce in Hadoop is to Map each of the jobs and then it will reduce it to equivalent tasks for providing less overhead over the cluster network and to reduce the processing power.
  • 4. Map phase and Reduce phase. 1. Map Phase: 1.As the name suggests its main use is to map the input data in key-value pairs. 2.The input to the map may be a key-value pair where the key can be the id of some kind of address and value is the actual value that it keeps. 3.The Map() function will be executed in its memory repository on each of these input key-value pairs and generates the intermediate key-value pair which works as input for the Reducer or Reduce() function.
  • 5. 2. Reduce Phase : 1. The intermediate key-value pairs that work as input for Reducer are shuffled and sort and send to the Reduce() function. 2. Reducer aggregate or group the data based on its key-value pair as per the reducer algorithm written by the developer.
  • 7. Algorithms MapReduce implements various mathematical algorithms to divide a task into small parts and assign them to multiple systems. In technical terms, MapReduce algorithm helps in sending the Map & Reduce tasks to appropriate servers in a cluster. These mathematical algorithms may include the following − •Sorting •Searching •Indexing •TF-IDF
  • 8. 1. Sorting :- Sorting is one of the basic MapReduce algorithms to process and analyze data. MapReduce implements sorting algorithm to automatically sort the output key-value pairs from the mapper by their keys. Sorting methods are implemented in the mapper class itself.
  • 9. 2. Searching :- • Let us assume we have employee data in four different files − A, B, C, and D. Let us also assume there are duplicate employee records in all four files because of importing the employee data from all database tables repeatedly. See the following illustration
  • 10. • The Map phase processes each input file and provides the employee data in key-value pairs (<k, v> : <emp name, salary>). See the following illustration. • The combiner phase (searching technique) will accept the input from the Map phase as a key-value pair with employee name and salary. Using searching technique, the combiner will check all the employee salary to find the highest salaried employee in each file
  • 11. Reducer phase − Form each file, you will find the highest salaried employee. To avoid redundancy, check all the <k, v> pairs and eliminate duplicate entries, if any. The same algorithm is used in between the four <k, v> pairs, which are coming from four input files. The final output should be as follows − <gopal, 50000>
  • 12. 3. Indexing :- Normally indexing is used to point to a particular data and its address. It performs batch indexing on the input files for a particular Mapper. The following text is the input for inverted indexing. Here T[0], T[1], and t[2] are the file names and their content are in double quotes. T[0] = "it is what it is" T[1] = "what is it" T[2] = "it is a banana" After applying the Indexing algorithm, we get the following output − "a": {2} "banana": {2} "is": {0, 1, 2} "it": {0, 1, 2} "what": {0, 1}
  • 13. 4. TF-IDF :- TF-IDF is a text processing algorithm which is short for Term Frequency − Inverse Document Frequency. It is one of the common web analysis algorithms. Here, the term 'frequency' refers to the number of times a term appears in a document. 1. Term Frequency (TF) It measures how frequently a particular term occurs in a document. It is calculated by the number of times a word appears in a document divided by the total number of words in that document. TF(the) = (Number of times term the ‘the’ appears in a document)
  • 14. 2. Inverse Document Frequency It measures the importance of a term. It is calculated by the number of documents in the text database divided by the number of documents where a specific term appears. Consider a document containing 1000 words, wherein the word hive appears 50 times. The TF for hive is then (50 / 1000) = 0.05. Now, assume we have 10 million documents and the word hive appears in 1000 of these. Then, the IDF is calculated as log(10,000,000 / 1,000) = 4. The TF-IDF weight is the product of these quantities − 0.05 × 4 = 0.20.