MapReduce Design Patterns

Anastasiia Kornilova,
SoftServe Data Science Group
MapReduce Components
❖ record reader
❖ map
❖ combiner
❖ partitioner
❖ shuffle and sort
❖ reduce
❖ output format

[Diagram: Reader → Mapper → Combiner → Partitioner → Shuffle and sort → Reducer → Output]
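
The stages listed above can be imitated in a single Python process. This is only a toy sketch of the data flow, not Hadoop's API: the function names, the word-count task, and the byte-sum partitioner are all assumptions made for illustration.

```python
from collections import defaultdict

def record_reader(text):
    """Record reader: turn raw input into (key, value) records, here (line number, line)."""
    for line_no, line in enumerate(text.splitlines()):
        yield line_no, line

def mapper(_key, line):
    """Map: emit (word, 1) for every word in the line."""
    for word in line.lower().split():
        yield word, 1

def combiner(pairs):
    """Combiner: pre-aggregate one mapper's output locally to reduce shuffle traffic."""
    local = defaultdict(int)
    for word, count in pairs:
        local[word] += count
    return list(local.items())

def partitioner(key, num_reducers):
    """Partitioner: pick the reducer for a key (toy deterministic hash, an assumption)."""
    return sum(key.encode("utf-8")) % num_reducers

def reducer(key, values):
    """Reduce: sum all partial counts for one key."""
    return key, sum(values)

def run_job(splits, num_reducers=2):
    # One {key: [values]} bucket per reducer.
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for split in splits:
        mapped = [kv for k, line in record_reader(split) for kv in mapper(k, line)]
        for word, count in combiner(mapped):
            buckets[partitioner(word, num_reducers)][word].append(count)
    output = []
    for bucket in buckets:
        # Shuffle and sort: each reducer receives its keys in sorted order.
        for word in sorted(bucket):
            output.append(reducer(word, bucket[word]))
    return output  # output format: a plain list of (word, count) pairs

result = run_job(["a b a", "b c"])
```

Each string in the list passed to `run_job` stands in for one input split.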
MapReduce Patterns
❖ Filtering Patterns
❖ Summarization Patterns
❖ Join Patterns
❖ Data Organization Patterns
❖ Metapatterns
❖ Input and Output Patterns
Filtering patterns
❖ Filtering
❖ Bloom filtering
❖ Top-N
❖ Distinct
Filtering
❖ Closer view of data
❖ Tracking a thread of events
❖ Distributed grep
❖ Data cleansing
❖ Simple random sampling
❖ Removing low scoring data

[Diagram: three parallel pipelines, each Input split → Filter Mapper → Output file]
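
A map-only filter of this shape can be sketched in plain Python. The helper names (`filter_mapper`, `grep`, `sample`) and the sample log lines are hypothetical; the sketch only shows that each split is filtered independently and that no reducer is involved.

```python
import random
import re

def filter_mapper(records, keep):
    """Map-only filter: emit a record unchanged when keep(record) is true."""
    for record in records:
        if keep(record):
            yield record

def grep(pattern):
    """Distributed grep: keep lines that match a regular expression."""
    regex = re.compile(pattern)
    return lambda line: regex.search(line) is not None

def sample(fraction, seed=None):
    """Simple random sampling: keep each record with probability `fraction`."""
    rng = random.Random(seed)
    return lambda _record: rng.random() < fraction

# Two input splits of made-up log lines.
splits = [["INFO start", "ERROR disk full"], ["ERROR timeout", "INFO done"]]

# Each split is filtered on its own and produces its own output.
outputs = [list(filter_mapper(split, grep(r"^ERROR"))) for split in splits]
```

Swapping `grep(...)` for `sample(0.1)` turns the same mapper into a roughly 10% random sample.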
Bloom filtering
❖ Removing most of the non-watched values
❖ Prefiltering a data set for an expensive set membership check

• Probabilistic data structure
• Based on comparing hash function values
• Answer: "probably yes" or "no"
Step 1 - Filter Training
[Diagram: Input split → Bloom Filter Training → Output file]

Step 2 - Bloom Filtering via MapReduce
[Diagram: two parallel pipelines. In each, an Input split feeds a Bloom Filter Mapper, which loads the filter from the distributed cache and runs the Bloom Filter Test. Records answered "Maybe" go to the Output file; records answered "No" are discarded.]
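
The two steps can be sketched as follows. This is a minimal in-memory Bloom filter, assuming salted SHA-256 as the hash family and arbitrary sizes; the class, the function names, and the sample data are hypothetical, and a real job would serialize the trained filter and ship it to the mappers through the distributed cache.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, item):
        # Derive num_hashes positions by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # False means "no" for certain; True only means "probably yes".
        return all(self.bits[pos] for pos in self._positions(item))

def train(watched_values):
    """Step 1: build the filter from the watched values."""
    bloom = BloomFilter()
    for value in watched_values:
        bloom.add(value)
    return bloom

def bloom_filter_mapper(records, bloom, key_of):
    """Step 2: emit records whose key passes the test, silently drop the rest."""
    for record in records:
        if bloom.might_contain(key_of(record)):
            yield record

bloom = train(["alice", "bob"])
comments = [("alice", "hi"), ("carol", "hello"), ("bob", "hey")]
kept = list(bloom_filter_mapper(comments, bloom, key_of=lambda r: r[0]))
```

Because false positives are possible, `kept` is a superset of the truly matching records; an exact membership check can then run on this much smaller set.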
Top N
❖ Outlier analysis
❖ Select interesting data
❖ Catchy dashboards

[Diagram: four Input splits, each processed by a Top Ten Mapper that emits its local top 10; a single Top Ten Reducer merges them into the final top 10 and writes the Top 10 Output]
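
The local-then-final selection can be sketched with `heapq.nlargest` from the Python standard library. The record layout `(user, score)` and the function names are assumptions for illustration.

```python
import heapq

N = 10

def top_ten_mapper(records, score_of):
    """Keep only the local top N of one input split."""
    return heapq.nlargest(N, records, key=score_of)

def top_ten_reducer(local_tops, score_of):
    """Merge every mapper's local top N and keep the global top N."""
    merged = [record for top in local_tops for record in top]
    return heapq.nlargest(N, merged, key=score_of)

def score(record):
    return record[1]

# Four made-up splits of (user, score) records with scores 0..99.
splits = [[("u%d" % i, i) for i in range(start, start + 25)]
          for start in (0, 25, 50, 75)]

local_tops = [top_ten_mapper(split, score) for split in splits]
final_top = top_ten_reducer(local_tops, score)
```

Each mapper forwards at most N records, so the single reducer handles at most N times the number of mappers, regardless of input size.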
Distinct
❖ Deduplicate data
❖ Getting distinct values
❖ Protecting against inner join explosions
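
One common way to express Distinct is to emit each record as a key with an empty value and let the grouping do the deduplication. The sketch below imitates that in plain Python; the function names are hypothetical and the local `set` plays the role of a combiner.

```python
from itertools import groupby

def distinct_mapper(records):
    """Emit each record as the key, with no value."""
    for record in records:
        yield record, None

def distinct_reducer(key, _values):
    """All duplicates of a record arrive in one group; emit the key once."""
    return key

def run_distinct(splits):
    shuffled = []
    for split in splits:
        # Combiner step: drop duplicates inside one split before the shuffle.
        shuffled.extend(set(distinct_mapper(split)))
    # Shuffle and sort: bring equal keys next to each other.
    shuffled.sort(key=lambda kv: kv[0])
    return [distinct_reducer(key, [value for _, value in group])
            for key, group in groupby(shuffled, key=lambda kv: kv[0])]

result = run_distinct([["a", "b", "a"], ["b", "c"]])
```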
Summarization patterns
❖ Numerical summarization
❖ Inverted index
❖ Counting with counters
Numerical summarization
❖ Word count
❖ Record count
❖ Min/Max/Count
❖ Average/Median/Standard deviation

[Diagram: three Mappers each emit (key, summary field) pairs; a Partitioner after each Mapper routes the pairs to two Reducers, which emit one summary per group, e.g. (group B, summary), (group D, summary)]
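
A Min/Max/Count/Average summarization following this flow can be sketched as below. The partial summary tuple `(min, max, count, sum)`, the function names, and the sample groups are assumptions for illustration.

```python
from collections import defaultdict

def summary_mapper(records):
    """Emit (key, summary field): a one-record summary (min, max, count, sum)."""
    for key, value in records:
        yield key, (value, value, 1, value)

def merge(a, b):
    """Combine two partial summaries; associative, so it also works as a combiner."""
    return (min(a[0], b[0]), max(a[1], b[1]), a[2] + b[2], a[3] + b[3])

def summary_reducer(key, summaries):
    """Fold all partial summaries of one group into the final summary."""
    summaries = iter(summaries)
    total = next(summaries)
    for partial in summaries:
        total = merge(total, partial)
    low, high, count, value_sum = total
    return key, {"min": low, "max": high, "count": count, "avg": value_sum / count}

def run(splits):
    grouped = defaultdict(list)
    for split in splits:
        for key, summary in summary_mapper(split):
            grouped[key].append(summary)  # stands in for partition + shuffle
    return dict(summary_reducer(key, grouped[key]) for key in sorted(grouped))

result = run([[("B", 4), ("D", 10)], [("B", 6), ("D", 2), ("B", 8)]])
```

The average is carried as a sum and a count so that partial results can be merged; a median cannot be merged this way and needs the reducer to see all values of the group.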
Inverted index

[Diagram: three Mappers each emit (keyword, unique ID) pairs; a Partitioner after each Mapper routes the pairs to two Reducers, which emit one list per keyword, e.g. (keyword A, list of IDs), (keyword D, list of IDs)]
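
The inverted index flow can be sketched in a few lines. Splitting text on whitespace and using integer document IDs are assumptions for illustration, as are the function names.

```python
from collections import defaultdict

def index_mapper(doc_id, text):
    """Emit (keyword, unique ID) once per distinct keyword in the document."""
    for keyword in set(text.lower().split()):
        yield keyword, doc_id

def index_reducer(keyword, doc_ids):
    """Collect all IDs seen for one keyword into a sorted list."""
    return keyword, sorted(set(doc_ids))

def build_index(documents):
    grouped = defaultdict(list)
    for doc_id, text in documents.items():
        for keyword, emitted_id in index_mapper(doc_id, text):
            grouped[keyword].append(emitted_id)  # stands in for partition + shuffle
    return dict(index_reducer(keyword, ids) for keyword, ids in grouped.items())

index = build_index({1: "hadoop pig", 2: "pig hive", 3: "hadoop hive pig"})
```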
Data Organization Patterns
❖ Structured to Hierarchical
❖ Partitioning
❖ Binning
❖ Total Order Sorting
❖ Shuffling
Join patterns
❖ Reduce Side Join
❖ Replicated Join
❖ Composite Join
❖ Cartesian Product

[Diagram: join data flow. Data Set A (three input splits) and Data Set B (two input splits) each pass through Join Mappers, which emit (key, values A) or (key, values B); after shuffle and sort, three Join Reducers each write one output part]
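
Reading the data flow above as a reduce side join, a minimal sketch looks like this: the mappers tag each record with its data set, and the reducer pairs up the two sides for every key. The tags, the function names, and the sample comments/users records are hypothetical.

```python
from collections import defaultdict

def join_mapper(records, tag, key_of):
    """Emit (join key, (data set tag, record)) so the reducer can tell sides apart."""
    for record in records:
        yield key_of(record), (tag, record)

def join_reducer(key, tagged_records, how="inner"):
    """Pair every A record with every B record that shares the key."""
    side_a = [record for tag, record in tagged_records if tag == "A"]
    side_b = [record for tag, record in tagged_records if tag == "B"]
    if how == "left" and not side_b:
        side_b = [None]  # left outer join: keep unmatched A records
    return [(key, a, b) for a in side_a for b in side_b]

def reduce_side_join(data_a, data_b, key_a, key_b, how="inner"):
    grouped = defaultdict(list)
    for key, tagged in join_mapper(data_a, "A", key_a):
        grouped[key].append(tagged)
    for key, tagged in join_mapper(data_b, "B", key_b):
        grouped[key].append(tagged)
    output = []
    for key in sorted(grouped):  # shuffle and sort
        output.extend(join_reducer(key, grouped[key], how))
    return output

comments = [(1, "first"), (1, "second"), (3, "orphan")]  # (userID, text)
users = [(1, "ann"), (2, "bo")]                          # (userID, name)

def by_user(record):
    return record[0]

inner = reduce_side_join(comments, users, by_user, by_user)
left = reduce_side_join(comments, users, by_user, by_user, how="left")
```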
Node table: id, title, tagnames, authorized, body, node type, parent id, abs parent id, added at, score, state string, last edited id, last activity id, last activity at, activity revision, extra, extra def, extra count

User table: user id, reputation, gold, silver, bronze
Pig examples
-- Inner Join:
A = JOIN comments BY userID, users BY userID;

-- Outer Join (LEFT shown; RIGHT and FULL use the same form):
A = JOIN comments BY userID LEFT OUTER, users BY userID;

-- Binning:
SPLIT data INTO
    eights IF col1 == 8,
    bigs IF col1 > 8,
    smalls IF (col1 < 8 AND col1 > 0);

-- Top Ten:
B = ORDER A BY col4 DESC;
C = LIMIT B 10;

-- Filtering:
b = FILTER a BY value < 3;
