SlideShare a Scribd company logo
1 of 9
• What is Reduce-side join? 
• Steps used to join the datasets in Reduce-side 
join 
• Sample datasets used in this project 
• Scenario flow 
• Practical demonstration of Reduce-side join
• Joins of datasets done in the reduce phase based on join 
key are called reduce side joins. Reduce-side joins are the 
easiest to implement 
• What makes reduce-side joins straight forward is the fact 
that Hadoop sends identical keys to the same reducer, so 
by default the data is organized for us. 
• To perform the join, we simply need to cache a key and 
compare it to incoming keys. As long as the keys match, we 
can join the values from the corresponding keys. 
• The trade off with reduce-side joins is performance, since 
all of the data is shuffled across the network
• The key of the map output, of datasets being joined, has to be the 
join key - so they reach the same reducer 
• Each dataset has to be tagged with its identity, in the mapper- to 
help differentiate between the datasets in the reducer, so they can 
be processed accordingly. 
• In each reducer, the data values from both datasets, for keys 
assigned to the reducer, are available, to be processed as required. 
• A secondary sort needs to be done to ensure the ordering of the 
values sent to the reducer 
• If the input files are of different formats, we would need separate 
mappers, and we would need to use MultipleInputs class in the 
driver to add the inputs and associate the specific mapper to the 
same.
1. Map output key 
The key will be the empNo as it is the join key for the datasets employee and salary 
[Implementation: in the mapper] 
2. Tagging the data with the dataset identity 
Add an attribute called srcIndex to tag the identity of the data (1=employee, 
2=salary) 
[Implementation: in the mapper] 
3. Discarding unwanted atributes 
[Implementation: in the mapper] 
4. Composite key 
Make the map output key a composite of empNo and srcIndex 
[Implementation: create custom writable] 
5. Partitioner 
Partition the data on natural key of empNo 
[Implementation: create custom partitioner class] 
---- continue 
•
6. Sorting 
Sort the data on empNo first, and then source index 
[Implementation: create custom sorting comparator 
class] 
7. Grouping 
Group the data based on natural key 
[Implementation: create custom grouping comparator 
class] 
8. Joining 
Iterate through the values for a key and complete the 
join for employee and salary data. 
[Implementation: in the reducer]
Repartition join in mapreduce

More Related Content

What's hot

Mapreduce - Simplified Data Processing on Large Clusters
Mapreduce - Simplified Data Processing on Large ClustersMapreduce - Simplified Data Processing on Large Clusters
Mapreduce - Simplified Data Processing on Large ClustersAbhishek Singh
 
Stratosphere with big_data_analytics
Stratosphere with big_data_analyticsStratosphere with big_data_analytics
Stratosphere with big_data_analyticsAvinash Pandu
 
Applying stratosphere for big data analytics
Applying stratosphere for big data analyticsApplying stratosphere for big data analytics
Applying stratosphere for big data analyticsAvinash Pandu
 
Hadoop secondary sort and a custom comparator
Hadoop secondary sort and a custom comparatorHadoop secondary sort and a custom comparator
Hadoop secondary sort and a custom comparatorSubhas Kumar Ghosh
 
Mapreduce total order sorting technique
Mapreduce total order sorting techniqueMapreduce total order sorting technique
Mapreduce total order sorting techniqueUday Vakalapudi
 
Introduction to MapReduce
Introduction to MapReduceIntroduction to MapReduce
Introduction to MapReduceHassan A-j
 
Introduction to map reduce s. jency jayastina II MSC COMPUTER SCIENCE BON SEC...
Introduction to map reduce s. jency jayastina II MSC COMPUTER SCIENCE BON SEC...Introduction to map reduce s. jency jayastina II MSC COMPUTER SCIENCE BON SEC...
Introduction to map reduce s. jency jayastina II MSC COMPUTER SCIENCE BON SEC...jencyjayastina
 
Hive query optimization infinity
Hive query optimization infinityHive query optimization infinity
Hive query optimization infinityShashwat Shriparv
 
H base introduction & development
H base introduction & developmentH base introduction & development
H base introduction & developmentShashwat Shriparv
 
Hadoop combiner and partitioner
Hadoop combiner and partitionerHadoop combiner and partitioner
Hadoop combiner and partitionerSubhas Kumar Ghosh
 
Hadoop deconstructing map reduce job step by step
Hadoop deconstructing map reduce job step by stepHadoop deconstructing map reduce job step by step
Hadoop deconstructing map reduce job step by stepSubhas Kumar Ghosh
 
Map reduce presentation
Map reduce presentationMap reduce presentation
Map reduce presentationateeq ateeq
 
SPARJA: a Distributed Social Graph Partitioning and Replication Middleware
SPARJA: a Distributed Social Graph Partitioning and Replication MiddlewareSPARJA: a Distributed Social Graph Partitioning and Replication Middleware
SPARJA: a Distributed Social Graph Partitioning and Replication MiddlewareMaria Stylianou
 
Join optimization in hive
Join optimization in hive Join optimization in hive
Join optimization in hive Liyin Tang
 

What's hot (19)

What is MapReduce ?
What is MapReduce ?What is MapReduce ?
What is MapReduce ?
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Mapreduce - Simplified Data Processing on Large Clusters
Mapreduce - Simplified Data Processing on Large ClustersMapreduce - Simplified Data Processing on Large Clusters
Mapreduce - Simplified Data Processing on Large Clusters
 
Stratosphere with big_data_analytics
Stratosphere with big_data_analyticsStratosphere with big_data_analytics
Stratosphere with big_data_analytics
 
Applying stratosphere for big data analytics
Applying stratosphere for big data analyticsApplying stratosphere for big data analytics
Applying stratosphere for big data analytics
 
Hadoop secondary sort and a custom comparator
Hadoop secondary sort and a custom comparatorHadoop secondary sort and a custom comparator
Hadoop secondary sort and a custom comparator
 
Lecture 1 mapreduce
Lecture 1  mapreduceLecture 1  mapreduce
Lecture 1 mapreduce
 
Pregel - Paper Review
Pregel - Paper ReviewPregel - Paper Review
Pregel - Paper Review
 
Mapreduce total order sorting technique
Mapreduce total order sorting techniqueMapreduce total order sorting technique
Mapreduce total order sorting technique
 
Introduction to MapReduce
Introduction to MapReduceIntroduction to MapReduce
Introduction to MapReduce
 
Introduction to map reduce s. jency jayastina II MSC COMPUTER SCIENCE BON SEC...
Introduction to map reduce s. jency jayastina II MSC COMPUTER SCIENCE BON SEC...Introduction to map reduce s. jency jayastina II MSC COMPUTER SCIENCE BON SEC...
Introduction to map reduce s. jency jayastina II MSC COMPUTER SCIENCE BON SEC...
 
Hive query optimization infinity
Hive query optimization infinityHive query optimization infinity
Hive query optimization infinity
 
H base introduction & development
H base introduction & developmentH base introduction & development
H base introduction & development
 
Hadoop combiner and partitioner
Hadoop combiner and partitionerHadoop combiner and partitioner
Hadoop combiner and partitioner
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Hadoop deconstructing map reduce job step by step
Hadoop deconstructing map reduce job step by stepHadoop deconstructing map reduce job step by step
Hadoop deconstructing map reduce job step by step
 
Map reduce presentation
Map reduce presentationMap reduce presentation
Map reduce presentation
 
SPARJA: a Distributed Social Graph Partitioning and Replication Middleware
SPARJA: a Distributed Social Graph Partitioning and Replication MiddlewareSPARJA: a Distributed Social Graph Partitioning and Replication Middleware
SPARJA: a Distributed Social Graph Partitioning and Replication Middleware
 
Join optimization in hive
Join optimization in hive Join optimization in hive
Join optimization in hive
 

Similar to Repartition join in mapreduce

Introduction to the Map-Reduce framework.pdf
Introduction to the Map-Reduce framework.pdfIntroduction to the Map-Reduce framework.pdf
Introduction to the Map-Reduce framework.pdfBikalAdhikari4
 
Hadoop eco system with mapreduce hive and pig
Hadoop eco system with mapreduce hive and pigHadoop eco system with mapreduce hive and pig
Hadoop eco system with mapreduce hive and pigKhanKhaja1
 
Map reduce presentation
Map reduce presentationMap reduce presentation
Map reduce presentationAhmad El Tawil
 
S_MapReduce_Types_Formats_Features_07.pptx
S_MapReduce_Types_Formats_Features_07.pptxS_MapReduce_Types_Formats_Features_07.pptx
S_MapReduce_Types_Formats_Features_07.pptxRajiArun7
 
Hadoop Map Reduce OS
Hadoop Map Reduce OSHadoop Map Reduce OS
Hadoop Map Reduce OSVedant Mane
 
Introduction to Map-Reduce in Hadoop.pptx
Introduction to Map-Reduce in Hadoop.pptxIntroduction to Map-Reduce in Hadoop.pptx
Introduction to Map-Reduce in Hadoop.pptxtest1miniproject
 
Introduction to Map-Reduce in Hadoop.pptx
Introduction to Map-Reduce in Hadoop.pptxIntroduction to Map-Reduce in Hadoop.pptx
Introduction to Map-Reduce in Hadoop.pptxtest1miniproject
 
AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)Paul Chao
 
Apache Hadoop MapReduce Tutorial
Apache Hadoop MapReduce TutorialApache Hadoop MapReduce Tutorial
Apache Hadoop MapReduce TutorialFarzad Nozarian
 
Big Data Analytics Chapter3-6@2021.pdf
Big Data Analytics Chapter3-6@2021.pdfBig Data Analytics Chapter3-6@2021.pdf
Big Data Analytics Chapter3-6@2021.pdfWasyihunSema2
 
MapReduce: Ordering and Large-Scale Indexing on Large Clusters
MapReduce: Ordering and  Large-Scale Indexing on Large ClustersMapReduce: Ordering and  Large-Scale Indexing on Large Clusters
MapReduce: Ordering and Large-Scale Indexing on Large ClustersIRJET Journal
 
Hadoop interview questions - Softwarequery.com
Hadoop interview questions - Softwarequery.comHadoop interview questions - Softwarequery.com
Hadoop interview questions - Softwarequery.comsoftwarequery
 
Tutorial: Implementing your first Postgres extension | PGConf EU 2019 | Burak...
Tutorial: Implementing your first Postgres extension | PGConf EU 2019 | Burak...Tutorial: Implementing your first Postgres extension | PGConf EU 2019 | Burak...
Tutorial: Implementing your first Postgres extension | PGConf EU 2019 | Burak...Citus Data
 

Similar to Repartition join in mapreduce (20)

Introduction to the Map-Reduce framework.pdf
Introduction to the Map-Reduce framework.pdfIntroduction to the Map-Reduce framework.pdf
Introduction to the Map-Reduce framework.pdf
 
Hadoop eco system with mapreduce hive and pig
Hadoop eco system with mapreduce hive and pigHadoop eco system with mapreduce hive and pig
Hadoop eco system with mapreduce hive and pig
 
Hadoop Architecture
Hadoop ArchitectureHadoop Architecture
Hadoop Architecture
 
MapReduce.pptx
MapReduce.pptxMapReduce.pptx
MapReduce.pptx
 
Map reduce presentation
Map reduce presentationMap reduce presentation
Map reduce presentation
 
2 mapreduce-model-principles
2 mapreduce-model-principles2 mapreduce-model-principles
2 mapreduce-model-principles
 
S_MapReduce_Types_Formats_Features_07.pptx
S_MapReduce_Types_Formats_Features_07.pptxS_MapReduce_Types_Formats_Features_07.pptx
S_MapReduce_Types_Formats_Features_07.pptx
 
Hadoop Map Reduce OS
Hadoop Map Reduce OSHadoop Map Reduce OS
Hadoop Map Reduce OS
 
Introduction to Map-Reduce in Hadoop.pptx
Introduction to Map-Reduce in Hadoop.pptxIntroduction to Map-Reduce in Hadoop.pptx
Introduction to Map-Reduce in Hadoop.pptx
 
Introduction to Map-Reduce in Hadoop.pptx
Introduction to Map-Reduce in Hadoop.pptxIntroduction to Map-Reduce in Hadoop.pptx
Introduction to Map-Reduce in Hadoop.pptx
 
AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)
 
Apache Hadoop MapReduce Tutorial
Apache Hadoop MapReduce TutorialApache Hadoop MapReduce Tutorial
Apache Hadoop MapReduce Tutorial
 
Pig Experience
Pig ExperiencePig Experience
Pig Experience
 
Hadoop
HadoopHadoop
Hadoop
 
Big Data Analytics Chapter3-6@2021.pdf
Big Data Analytics Chapter3-6@2021.pdfBig Data Analytics Chapter3-6@2021.pdf
Big Data Analytics Chapter3-6@2021.pdf
 
MapReduce: Ordering and Large-Scale Indexing on Large Clusters
MapReduce: Ordering and  Large-Scale Indexing on Large ClustersMapReduce: Ordering and  Large-Scale Indexing on Large Clusters
MapReduce: Ordering and Large-Scale Indexing on Large Clusters
 
Map reduce prashant
Map reduce prashantMap reduce prashant
Map reduce prashant
 
Hadoop interview questions - Softwarequery.com
Hadoop interview questions - Softwarequery.comHadoop interview questions - Softwarequery.com
Hadoop interview questions - Softwarequery.com
 
Tutorial: Implementing your first Postgres extension | PGConf EU 2019 | Burak...
Tutorial: Implementing your first Postgres extension | PGConf EU 2019 | Burak...Tutorial: Implementing your first Postgres extension | PGConf EU 2019 | Burak...
Tutorial: Implementing your first Postgres extension | PGConf EU 2019 | Burak...
 
Hadoop map reduce concepts
Hadoop map reduce conceptsHadoop map reduce concepts
Hadoop map reduce concepts
 

More from Uday Vakalapudi

Introduction to HDFS and MapReduce
Introduction to HDFS and MapReduceIntroduction to HDFS and MapReduce
Introduction to HDFS and MapReduceUday Vakalapudi
 
Oozie workflow using HUE 2.2
Oozie workflow using HUE 2.2Oozie workflow using HUE 2.2
Oozie workflow using HUE 2.2Uday Vakalapudi
 
Apache Storm and twitter Streaming API integration
Apache Storm and twitter Streaming API integrationApache Storm and twitter Streaming API integration
Apache Storm and twitter Streaming API integrationUday Vakalapudi
 
How Hadoop Exploits Data Locality
How Hadoop Exploits Data LocalityHow Hadoop Exploits Data Locality
How Hadoop Exploits Data LocalityUday Vakalapudi
 

More from Uday Vakalapudi (10)

Introduction to pig
Introduction to pigIntroduction to pig
Introduction to pig
 
Introduction to sqoop
Introduction to sqoopIntroduction to sqoop
Introduction to sqoop
 
Introduction to hbase
Introduction to hbaseIntroduction to hbase
Introduction to hbase
 
Introduction to Hive
Introduction to HiveIntroduction to Hive
Introduction to Hive
 
Introduction to HDFS and MapReduce
Introduction to HDFS and MapReduceIntroduction to HDFS and MapReduce
Introduction to HDFS and MapReduce
 
Advanced topics in hive
Advanced topics in hiveAdvanced topics in hive
Advanced topics in hive
 
Oozie workflow using HUE 2.2
Oozie workflow using HUE 2.2Oozie workflow using HUE 2.2
Oozie workflow using HUE 2.2
 
Apache Storm and twitter Streaming API integration
Apache Storm and twitter Streaming API integrationApache Storm and twitter Streaming API integration
Apache Storm and twitter Streaming API integration
 
How Hadoop Exploits Data Locality
How Hadoop Exploits Data LocalityHow Hadoop Exploits Data Locality
How Hadoop Exploits Data Locality
 
Flume basic
Flume basicFlume basic
Flume basic
 

Recently uploaded

BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEOrtus Solutions, Corp
 
XpertSolvers: Your Partner in Building Innovative Software Solutions
XpertSolvers: Your Partner in Building Innovative Software SolutionsXpertSolvers: Your Partner in Building Innovative Software Solutions
XpertSolvers: Your Partner in Building Innovative Software SolutionsMehedi Hasan Shohan
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...gurkirankumar98700
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...MyIntelliSource, Inc.
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio, Inc.
 
Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...aditisharan08
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataBradBedford3
 
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityNeo4j
 
chapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptchapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptkotipi9215
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...kellynguyen01
 
What are the features of Vehicle Tracking System?
What are the features of Vehicle Tracking System?What are the features of Vehicle Tracking System?
What are the features of Vehicle Tracking System?Watsoo Telematics
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Andreas Granig
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...Christina Lin
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackVICTOR MAESTRE RAMIREZ
 
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样umasea
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - InfographicHr365.us smith
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureDinusha Kumarasiri
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmSujith Sukumaran
 

Recently uploaded (20)

Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
 
XpertSolvers: Your Partner in Building Innovative Software Solutions
XpertSolvers: Your Partner in Building Innovative Software SolutionsXpertSolvers: Your Partner in Building Innovative Software Solutions
XpertSolvers: Your Partner in Building Innovative Software Solutions
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
 
Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
 
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
 
EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered Sustainability
 
chapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptchapter--4-software-project-planning.ppt
chapter--4-software-project-planning.ppt
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 
What are the features of Vehicle Tracking System?
What are the features of Vehicle Tracking System?What are the features of Vehicle Tracking System?
What are the features of Vehicle Tracking System?
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStack
 
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - Infographic
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with Azure
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalm
 

Repartition join in mapreduce

  • 1.
  • 2.
  • 3. • What is Reduce-side join? • Steps used to join the datasets in Reduce-side join • Sample datasets used in this project • Scenario flow • Practical demonstration of Reduce-side join
  • 4. • Joins of datasets done in the reduce phase based on join key are called reduce side joins. Reduce-side joins are the easiest to implement • What makes reduce-side joins straight forward is the fact that Hadoop sends identical keys to the same reducer, so by default the data is organized for us. • To perform the join, we simply need to cache a key and compare it to incoming keys. As long as the keys match, we can join the values from the corresponding keys. • The trade off with reduce-side joins is performance, since all of the data is shuffled across the network
  • 5. • The key of the map output, of datasets being joined, has to be the join key - so they reach the same reducer • Each dataset has to be tagged with its identity, in the mapper- to help differentiate between the datasets in the reducer, so they can be processed accordingly. • In each reducer, the data values from both datasets, for keys assigned to the reducer, are available, to be processed as required. • A secondary sort needs to be done to ensure the ordering of the values sent to the reducer • If the input files are of different formats, we would need separate mappers, and we would need to use MultipleInputs class in the driver to add the inputs and associate the specific mapper to the same.
  • 6.
  • 7. 1. Map output key The key will be the empNo as it is the join key for the datasets employee and salary [Implementation: in the mapper] 2. Tagging the data with the dataset identity Add an attribute called srcIndex to tag the identity of the data (1=employee, 2=salary) [Implementation: in the mapper] 3. Discarding unwanted atributes [Implementation: in the mapper] 4. Composite key Make the map output key a composite of empNo and srcIndex [Implementation: create custom writable] 5. Partitioner Partition the data on natural key of empNo [Implementation: create custom partitioner class] ---- continue •
  • 8. 6. Sorting Sort the data on empNo first, and then source index [Implementation: create custom sorting comparator class] 7. Grouping Group the data based on natural key [Implementation: create custom grouping comparator class] 8. Joining Iterate through the values for a key and complete the join for employee and salary data. [Implementation: in the reducer]