View MR Design Patterns course details at www.edureka.co/mapreduce-design-patterns
Application of JOIN Pattern
MapReduce Design Patterns
Slide 2 www.edureka.co/mapreduce-design-patterns
Objectives
At the end of this module, you will be able to understand:
Why Design Patterns in MR
Who should know MapReduce Design Patterns
Available Design Patterns in MR
Join pattern
Why Design Patterns in MR?
General, reusable, optimized solutions to the most common problems
Templates for solving problems that recur in different situations
Speed up the development process
Tried and tested design principles
An initial guideline for solving the most common problems in MR
Help build sophisticated, well-optimized solutions
Who should know MR Design Patterns?
A Java developer who wants to explore the world of Big Data
A MapReduce programmer who wants to deepen their MR expertise
One who aims to become a Hadoop Architect
Available Design Patterns in MR
Summarization Pattern
Filtering Pattern
Data Organization Pattern
Join Pattern
Meta Pattern
Input & Output Pattern
Join Patterns – What is it?
 Datasets generally exist in multiple sources
 Deriving their full value requires merging them together
 Join Patterns are used for this purpose
 Performing joins on the fly on Big Data can be costly in terms of time
Example: Joining StackOverflow data from Comments & Posts on UserId
Join Patterns – What is it?
 The join patterns we will talk about are:
» Reduce Side Join/Repartition Join
» Reduce Side Join with Bloom Filter
» Replicated Join
» Composite Join
» Cartesian Product
Join – Refresher
 Inner Join
 Outer Join
» Left Outer Join
» Right Outer Join
» Full Outer Join
 Anti Join
 Cartesian Product
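As a refresher, the join types above can be sketched on two tiny in-memory lists. This is plain Python, not MapReduce code; the sample users and comments are made up, and the anti join is taken here to mean "full outer join minus inner join" (keys present on only one side), which is an assumption about the intended definition.

```python
# Toy data: (user_id, value) pairs; all values are made up for illustration.
users = [(1, "Pune"), (2, "Delhi"), (3, "Mumbai")]
comments = [(1, "nice post"), (1, "thanks"), (4, "orphan comment")]


def index(rows):
    """Group values by key so each key maps to a list of values."""
    grouped = {}
    for key, value in rows:
        grouped.setdefault(key, []).append(value)
    return grouped


def join(left, right, kind="inner"):
    """Join two lists of (key, value) pairs; None fills a missing side."""
    l, r = index(left), index(right)
    if kind == "inner":
        keys = l.keys() & r.keys()
    elif kind == "left":
        keys = l.keys()
    elif kind == "right":
        keys = r.keys()
    elif kind == "full":
        keys = l.keys() | r.keys()
    elif kind == "anti":
        # Assumed definition: keys present on exactly one side.
        keys = l.keys() ^ r.keys()
    else:
        raise ValueError(f"unknown join kind: {kind}")
    out = []
    for key in sorted(keys):
        for lv in l.get(key, [None]):
            for rv in r.get(key, [None]):
                out.append((key, lv, rv))
    return out


def cartesian(left, right):
    """Pair every left record with every right record, ignoring keys."""
    return [(lrow, rrow) for lrow in left for rrow in right]
```

For example, join(users, comments, "left") keeps users 2 and 3 with None on the comment side, while join(users, comments, "inner") returns only the two rows for user 1.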
Reduce Side Join – Description
 Easiest to implement, but can take the longest to execute
 Supports all types of join operations
 Can join multiple data sources, but expensive in terms of network resources & time
 All data is transferred across the network
Example: Join PostLinks table data in StackOverflow to Posts data
Reduce Side Join – Description (Contd.)
 Applicability – Use it when
» Multiple large data sets need to be joined
» No data source is small; if one is, consider a replicated join instead
» The data sources are linked by a foreign key
» You need support for every type of join operation
Reduce Side Join – Structure
Reduce Side Join – Structure (Contd.)
 Mapper
» Output key should reflect the foreign key
» Value can be the whole record plus a tag identifying its source
» Use projection to output only the required fields
 Combiner
» Not required; it provides no additional benefit
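The mapper's job described above can be sketched as follows. This is a plain-Python illustration, not the Hadoop Mapper API; the function name join_mapper, the tags "U" and "C", and the sample records are hypothetical, and records are assumed to be already parsed into dicts.

```python
def join_mapper(source_tag, record, key_field, keep_fields):
    """Emit one (foreign_key, (source_tag, projected_record)) pair.

    source_tag   -- short label for the record's data set, e.g. "U" or "C"
    record       -- one input record as a dict (a simplifying assumption)
    key_field    -- name of the field holding the foreign key
    keep_fields  -- projection: the only fields passed on to the reducers
    """
    key = record.get(key_field)
    if key is None:
        return  # a record without a join key cannot be joined; skip it
    projected = {f: record[f] for f in keep_fields if f in record}
    yield key, (source_tag, projected)


# Made-up sample records from two sources that share a user id.
user = {"ID": "7", "Location": "Pune", "AboutMe": "long text ..."}
comment = {"Id": "900", "UserID": "7", "upVotes": "3", "Text": "long text ..."}

pairs = list(join_mapper("U", user, "ID", ["Location"]))
pairs += list(join_mapper("C", comment, "UserID", ["upVotes"]))
```

Both records come out under the same key "7", each tagged with its source, and the bulky AboutMe and Text fields are dropped before the shuffle.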
Reduce Side Join – Structure (Contd.)
 Partitioner
» Use a custom Partitioner if required
 Reducer
» Reducer logic based on type of join required
» Reducer receives the data from all the different sources per key
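A minimal sketch of the partitioner and reducer roles, again in plain Python rather than the Hadoop API. The names partition and join_reducer, the tags "A" and "B", and the use of a CRC32 hash are illustrative assumptions; the anti join is assumed to mean "keep a key only when one side is missing".

```python
import zlib


def partition(key, num_reducers):
    """Pick a reducer for a key; a stable hash keeps equal keys together."""
    return zlib.crc32(str(key).encode("utf-8")) % num_reducers


def join_reducer(key, tagged_values, kind="inner", left_tag="A", right_tag="B"):
    """Join all values that arrived for one key.

    tagged_values -- iterable of (source_tag, record) pairs from the mappers
    kind          -- "inner", "left", "right", "full" or "anti"
    """
    values = list(tagged_values)  # materialize so it can be scanned twice
    left = [rec for tag, rec in values if tag == left_tag]
    right = [rec for tag, rec in values if tag == right_tag]

    if left and right:
        if kind == "anti":
            return []  # the key matched on both sides, so anti join drops it
        return [(key, l, r) for l in left for r in right]
    # From here on, at most one side has records for this key.
    if left and kind in ("left", "full", "anti"):
        return [(key, l, None) for l in left]
    if right and kind in ("right", "full", "anti"):
        return [(key, None, r) for r in right]
    return []
```

Only the reducer changes between join types; the mapper output and the shuffle stay the same.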
Reduce Side Join – Analogy
 Resemblances
» SQL
SELECT users.ID, users.Location, comments.upVotes
FROM users
[INNER|LEFT|RIGHT] JOIN comments
ON users.ID = comments.UserID
» Pig (supports inner & outer joins)
Inner Join:
A = JOIN comments BY userID, users BY userID;
Outer Join:
A = JOIN comments BY userID [LEFT|RIGHT|FULL] OUTER, users BY userID;
Reduce Side Join – Performance
 Performance
» The entire data set moves across the network to the reducers
» You can optimize by using projection and sending only the required fields
» The number of reducers is typically higher than normal
» If you can use any other Join type for your problem, use that instead
Reduce Side Join – Use Cases
 Join tweets with user personal information for Behavioral Analysis
 Join PostLinks and Posts tables from StackOverflow to have all related posts in one place
Reduce Side Join Example – Problem
 Your dataset is the StackOverflow dataset. Look at the PostLinks.xml & Posts.xml files. Join the two tables on PostId in PostLinks & Id in Posts
» Use MultipleInputs class
» Projection on PostLinks to output only PostId & RelatedPostId fields
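Before the demo, the shape of a solution can be sketched end to end in plain Python. This is an in-memory simulation, not the Hadoop job: the sample rows are made up, the attribute-per-field row layout and the Title and LinkTypeId attributes are assumptions about the dump files, and keeping only Title from Posts is a hypothetical projection. Using one mapper per input stands in for what the MultipleInputs class does in Hadoop.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Made-up sample rows; the attribute-per-field layout is an assumption.
POSTS = [
    '<row Id="1" PostTypeId="1" Title="How do joins work?" />',
    '<row Id="2" PostTypeId="1" Title="What is a combiner?" />',
]
POST_LINKS = [
    '<row Id="10" PostId="1" RelatedPostId="2" LinkTypeId="1" />',
    '<row Id="11" PostId="1" RelatedPostId="5" LinkTypeId="3" />',
]


def posts_mapper(line):
    """Key each post by its Id; keep only the Title (hypothetical projection)."""
    row = ET.fromstring(line).attrib
    yield row["Id"], ("P", row.get("Title", ""))


def post_links_mapper(line):
    """Key each link by PostId; project to PostId and RelatedPostId only."""
    row = ET.fromstring(line).attrib
    yield row["PostId"], ("L", row["RelatedPostId"])


def run_join(posts, post_links):
    """Simulate map, shuffle and an inner-join reduce in memory."""
    shuffled = defaultdict(list)
    # One mapper per input file, each emitting tagged values under the join key.
    for lines, mapper in ((posts, posts_mapper), (post_links, post_links_mapper)):
        for line in lines:
            for key, value in mapper(line):
                shuffled[key].append(value)
    joined = []
    for key in sorted(shuffled):
        titles = [v for tag, v in shuffled[key] if tag == "P"]
        related = [v for tag, v in shuffled[key] if tag == "L"]
        # Inner join: a key contributes rows only if both sides are present.
        joined += [(key, t, r) for t in titles for r in related]
    return joined
```

With the sample rows, run_join(POSTS, POST_LINKS) returns two rows for post 1 (related posts 2 and 5); post 2 has no links, so the inner join drops it.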
DEMO
Reduce Side Join Example
Questions