SlideShare a Scribd company logo
MapReduce
Design Patterns

Anastasiia Kornilova,
SoftServe Data Science Group
MapReduce Components
❖

record reader

❖

map

❖

Reader

combiner

❖

partitioner

❖

Mapper

Combiner

Partitioner

Shuffle
and sort

shuffle and sort

❖

reduce

❖

output format

Reducer

Output
MapReduce Patterns
❖

Filtering Patterns

❖

Summarization Patterns

❖

Join Patterns

❖

Data Organization Patterns

❖

Metapatterns

❖

Input and Output Patterns
Filtering patterns

❖

Filtering

❖

Bloom filtering

❖

Top-N

❖

Distinct
Filtering
❖

Closer view of data

❖

Tracking a thread of events

❖

Distributed grep

❖

Data cleansing

❖

Simple random sampling

❖

Removing low scoring data
Input
split

Filter
Mapper

Output
file

Input
split

Filter
Mapper

Output
file

Input
split

Filter
Mapper

Output
file
Bloom filtering
❖

Removing most of non watched
values

❖

Prefiltering a data set for an
expensive set membership
check

•
•
•

Probabilistic data structure
Hash functions comparing
Answer: probably yes or now
Step 1 - Filter
Training
Bloom Filter
Training

Input
split

Output
file

Step 2 - Bloom Filtering via MapReduce

Input
split

Bloom
Filter
Mapper

Maybe
Bloom Filter
Test

No
Discarded

Load filter from
distributed cache

Input
split

Output
file

Bloom
Filter
Mapper

Maybe
Bloom Filter
Test

Output
file

No
Load filter from
distributed cache

Discarded
Top N
❖

Outlier analysis

❖

Select interesting data

❖

Catchy dashboards
Input
split

Top Ten
Mapper

local top 10

Input
split

Top Ten
Mapper

local top 10

Top Ten
Reducer
Input
split

Top Ten
Mapper

local top 10

Input
split

Top Ten
Mapper

local top 10

final top
10

Top 10
Output
Distinct
❖

Deduplicate data

❖

Getting distinct values

❖

Protecting from inner join
explosions
Summarization patterns
❖

Numerical summarization

❖

Inverted index

❖

Counting with counters
Numerical summarization

❖

Word count

❖

Record count

❖

Min/Max/Count

❖

Average/Median/Standart
deviation
Mapper

Mapper

Mapper

(key, summary field)
(key, summary field)

(key, summary field)
(key, summary field)

(key, summary field)
(key, summary field)

Partitoner
Reducer

(group B, summary)
(group D, summary)

Reducer

(group B, summary)
(group D, summary)

Partitoner

Partitoner
Inverted index
Mapper

(keyword, unique ID)
(keyword, unique ID)

Partitoner
Reducer

Reducer

(keyword, unique ID)
(keyword, unique ID)

(keyword A, list of IDs)
(keyword D, list of IDs)

Partitoner

Mapper

(keyword, unique ID)
(keyword, unique ID)

Mapper

(keyword A, list of IDs)
(keyword D, list of IDs)

Partitoner
Data Organization Patterns
❖

Structured to Hierarchical

❖

Partitioning

❖

Binning

❖

Total Order Sorting

❖

Shuffling
Join patterns

❖

Reduce Side Join

❖

Replicated Join

❖

Composite Join

❖

Cartesian Product
Data Set A
Input
split
Input
split
Input
split

Join
Mapper
Join
Mapper
Join
Mapper

(key, values
A)

(key, values
A)

Join
Reducer

Output
part

Join
Reducer

Output
part

Join
Reducer

Output
part

(key, values
A)

Shuffle
and sort

Data Set B
Input
split
Input
split

Join
Mapper
Join
Mapper

(key, values
B)
(key, values
B)
Node table

id
title
tagnames
authorized

User table

body
node type
parent id
abs parent id
added at
score
state string
last edited id
last activity id
last activity at
activity revision
extra
extra def
extra count

user id
reputation
gold
silver
bronze
Pig examples
- - Inner Join:
A = JOIN comments BY userID, users BY userID;

- - Outer Join:
A = JOIN comments BY userID [LEFT | RIFGT| FULL] OUTER , users BY userID;

- - Binning:
SPLIT data INTO
eights IF col1 == 8,
bigs IF col1 > 8,
smalls IF (col1 < 8 and col1 > 0 );

- - Top Ten:
B = ORDER A BY col4 DESC’
C = limit B 10;

- - Filtering:
b = FILTER a BY value < 3;

More Related Content

Similar to MapReduce Design Patterns

Flink Forward San Francisco 2019: TensorFlow Extended: An end-to-end machine ...
Flink Forward San Francisco 2019: TensorFlow Extended: An end-to-end machine ...Flink Forward San Francisco 2019: TensorFlow Extended: An end-to-end machine ...
Flink Forward San Francisco 2019: TensorFlow Extended: An end-to-end machine ...
Flink Forward
 
How to use Parquet as a basis for ETL and analytics
How to use Parquet as a basis for ETL and analyticsHow to use Parquet as a basis for ETL and analytics
How to use Parquet as a basis for ETL and analytics
Julien Le Dem
 
Hadoop london
Hadoop londonHadoop london
Database
DatabaseDatabase
Database
Naveen Sihag
 
Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0
Databricks
 
17641.ppt
17641.ppt17641.ppt
17641.ppt
vikassingh569137
 
Slides on introduction to R by ArinBasu MD
Slides on introduction to R by ArinBasu MDSlides on introduction to R by ArinBasu MD
Slides on introduction to R by ArinBasu MD
SonaCharles2
 
17641.ppt
17641.ppt17641.ppt
Data engineering and analytics using python
Data engineering and analytics using pythonData engineering and analytics using python
Data engineering and analytics using python
Purna Chander
 
Machine Learning with Azure
Machine Learning with AzureMachine Learning with Azure
Machine Learning with Azure
Barbara Fusinska
 
Scalable data pipeline
Scalable data pipelineScalable data pipeline
Scalable data pipeline
GreenM
 
Tutorial Microsoft Excel 2007
Tutorial Microsoft Excel 2007Tutorial Microsoft Excel 2007
Tutorial Microsoft Excel 2007
dhafinnaviansyah
 
Microsoft Excel 2007 Tutorial
Microsoft Excel 2007 TutorialMicrosoft Excel 2007 Tutorial
Microsoft Excel 2007 Tutorial
dhafinnaviansyah
 
Data Binding In Depth
Data Binding In DepthData Binding In Depth
Data Binding In Depth
Eyal Vardi
 
Pig latin
Pig latinPig latin
Pig latin
Bita Kazemi
 
Knowage manual
Knowage manualKnowage manual
Knowage manual
Herry Atwar
 
The D3 Toolbox
The D3 ToolboxThe D3 Toolbox
The D3 Toolbox
Mark Rickerby
 
METODOLOGIA DEA EN STATA
METODOLOGIA DEA EN STATAMETODOLOGIA DEA EN STATA
METODOLOGIA DEA EN STATA
LuhSm
 
R seminar dplyr package
R seminar dplyr packageR seminar dplyr package
R seminar dplyr package
Muhammad Nabi Ahmad
 
Practice discovering biological knowledge using networks approach.
Practice discovering biological knowledge using networks approach.Practice discovering biological knowledge using networks approach.
Practice discovering biological knowledge using networks approach.
Elena Sügis
 

Similar to MapReduce Design Patterns (20)

Flink Forward San Francisco 2019: TensorFlow Extended: An end-to-end machine ...
Flink Forward San Francisco 2019: TensorFlow Extended: An end-to-end machine ...Flink Forward San Francisco 2019: TensorFlow Extended: An end-to-end machine ...
Flink Forward San Francisco 2019: TensorFlow Extended: An end-to-end machine ...
 
How to use Parquet as a basis for ETL and analytics
How to use Parquet as a basis for ETL and analyticsHow to use Parquet as a basis for ETL and analytics
How to use Parquet as a basis for ETL and analytics
 
Hadoop london
Hadoop londonHadoop london
Hadoop london
 
Database
DatabaseDatabase
Database
 
Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0
 
17641.ppt
17641.ppt17641.ppt
17641.ppt
 
Slides on introduction to R by ArinBasu MD
Slides on introduction to R by ArinBasu MDSlides on introduction to R by ArinBasu MD
Slides on introduction to R by ArinBasu MD
 
17641.ppt
17641.ppt17641.ppt
17641.ppt
 
Data engineering and analytics using python
Data engineering and analytics using pythonData engineering and analytics using python
Data engineering and analytics using python
 
Machine Learning with Azure
Machine Learning with AzureMachine Learning with Azure
Machine Learning with Azure
 
Scalable data pipeline
Scalable data pipelineScalable data pipeline
Scalable data pipeline
 
Tutorial Microsoft Excel 2007
Tutorial Microsoft Excel 2007Tutorial Microsoft Excel 2007
Tutorial Microsoft Excel 2007
 
Microsoft Excel 2007 Tutorial
Microsoft Excel 2007 TutorialMicrosoft Excel 2007 Tutorial
Microsoft Excel 2007 Tutorial
 
Data Binding In Depth
Data Binding In DepthData Binding In Depth
Data Binding In Depth
 
Pig latin
Pig latinPig latin
Pig latin
 
Knowage manual
Knowage manualKnowage manual
Knowage manual
 
The D3 Toolbox
The D3 ToolboxThe D3 Toolbox
The D3 Toolbox
 
METODOLOGIA DEA EN STATA
METODOLOGIA DEA EN STATAMETODOLOGIA DEA EN STATA
METODOLOGIA DEA EN STATA
 
R seminar dplyr package
R seminar dplyr packageR seminar dplyr package
R seminar dplyr package
 
Practice discovering biological knowledge using networks approach.
Practice discovering biological knowledge using networks approach.Practice discovering biological knowledge using networks approach.
Practice discovering biological knowledge using networks approach.
 

More from Anastasiia Kornilova

Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
Anastasiia Kornilova
 
NLP approach for medical translation task
NLP approach for medical translation taskNLP approach for medical translation task
NLP approach for medical translation task
Anastasiia Kornilova
 
Kaggle - global Data Science community
Kaggle - global Data Science communityKaggle - global Data Science community
Kaggle - global Data Science community
Anastasiia Kornilova
 
Neural Networks and Deep Learning
Neural Networks and Deep LearningNeural Networks and Deep Learning
Neural Networks and Deep Learning
Anastasiia Kornilova
 
Stay well with machine learning
Stay well with machine learningStay well with machine learning
Stay well with machine learning
Anastasiia Kornilova
 
Recommender systems
Recommender systemsRecommender systems
Recommender systems
Anastasiia Kornilova
 
Mahout
MahoutMahout

More from Anastasiia Kornilova (7)

Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
NLP approach for medical translation task
NLP approach for medical translation taskNLP approach for medical translation task
NLP approach for medical translation task
 
Kaggle - global Data Science community
Kaggle - global Data Science communityKaggle - global Data Science community
Kaggle - global Data Science community
 
Neural Networks and Deep Learning
Neural Networks and Deep LearningNeural Networks and Deep Learning
Neural Networks and Deep Learning
 
Stay well with machine learning
Stay well with machine learningStay well with machine learning
Stay well with machine learning
 
Recommender systems
Recommender systemsRecommender systems
Recommender systems
 
Mahout
MahoutMahout
Mahout
 

Recently uploaded

Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
shyamraj55
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
Neo4j
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems S.M.S.A.
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
danishmna97
 
Mariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceXMariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceX
Mariano Tinti
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
innovationoecd
 
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceAI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
IndexBug
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
Zilliz
 
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
Edge AI and Vision Alliance
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
Uni Systems S.M.S.A.
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
Matthew Sinclair
 
GenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizationsGenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizations
kumardaparthi1024
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
Matthew Sinclair
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
DianaGray10
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
Zilliz
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
mikeeftimakis1
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
Pixlogix Infotech
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
Safe Software
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
DianaGray10
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
Octavian Nadolu
 

Recently uploaded (20)

Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
 
Mariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceXMariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceX
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
 
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceAI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
 
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
 
GenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizationsGenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizations
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
 

MapReduce Design Patterns