MapReduce Design Patterns

Anastasiia Kornilova, SoftServe Data Science Group
Transcript of "MapReduce Design Patterns"

  1. MapReduce Design Patterns. Anastasiia Kornilova, SoftServe Data Science Group
  2. MapReduce Components ❖ record reader ❖ map ❖ combiner ❖ partitioner ❖ shuffle and sort ❖ reduce ❖ output format. Diagram: Reader → Mapper → Combiner → Partitioner → Shuffle and sort → Reducer → Output
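
The stages listed on this slide can be traced in a small in-memory simulation. This is only a sketch: `run_job`, `reduce_groups`, and the word-count functions are hypothetical names chosen for illustration, not Hadoop APIs, and the combiner here runs once over all map output rather than once per mapper.

```python
from collections import defaultdict
from itertools import groupby

def reduce_groups(pairs, fn):
    """Sort pairs by key, group them, and apply fn(key, values) to each group."""
    out = []
    by_key = lambda kv: kv[0]
    for key, group in groupby(sorted(pairs, key=by_key), key=by_key):
        out.extend(fn(key, [value for _, value in group]))
    return out

def run_job(lines, mapper, reducer, combiner=None, num_reducers=2):
    # record reader: turn raw input into (offset, line) records
    records = enumerate(lines)
    # map: every record yields zero or more (key, value) pairs
    mapped = [pair for offset, line in records for pair in mapper(offset, line)]
    # combiner: optional local reduce before data crosses the network
    if combiner is not None:
        mapped = reduce_groups(mapped, combiner)
    # partitioner: pick a reducer for each key
    partitions = defaultdict(list)
    for key, value in mapped:
        partitions[hash(key) % num_reducers].append((key, value))
    # shuffle and sort, then reduce: each partition is grouped by key
    output = []
    for pairs in partitions.values():
        output.extend(reduce_groups(pairs, reducer))
    # output format: here just a sorted list of (key, value) pairs
    return sorted(output)

def wc_map(_offset, line):
    for word in line.split():
        yield word, 1

def wc_reduce(word, counts):
    yield word, sum(counts)

result = run_job(["a b a", "b a"], wc_map, wc_reduce, combiner=wc_reduce)
```

The reducer doubles as the combiner because summing counts is associative and commutative; that would not hold for every reducer.
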
  3. MapReduce Patterns ❖ Filtering Patterns ❖ Summarization Patterns ❖ Join Patterns ❖ Data Organization Patterns ❖ Metapatterns ❖ Input and Output Patterns
  4. Filtering patterns ❖ Filtering ❖ Bloom filtering ❖ Top-N ❖ Distinct
  5. Filtering ❖ Closer view of data ❖ Tracking a thread of events ❖ Distributed grep ❖ Data cleansing ❖ Simple random sampling ❖ Removing low scoring data
  6. Diagram (three parallel map-only pipelines): Input split → Filter Mapper → Output file
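
The filter diagram has no reducer: each mapper reads one split and writes one output file. A minimal sketch of that shape, using distributed grep as the predicate, might look as follows. The function names and the sample log lines are made up for illustration.

```python
import re

def filter_mapper(records, keep):
    """Map-only filter: emit each record for which keep(record) is true."""
    for record in records:
        if keep(record):
            yield record

def run_filter(splits, keep):
    # One mapper per input split and one output "file" (a list) per mapper
    return [list(filter_mapper(split, keep)) for split in splits]

# Hypothetical input: three splits of log lines
splits = [
    ["INFO start", "ERROR disk full"],
    ["ERROR timeout", "INFO done"],
    ["INFO idle"],
]
pattern = re.compile(r"ERROR")
outputs = run_filter(splits, lambda line: pattern.search(line) is not None)
```

Swapping the predicate covers the other uses on the Filtering slide, for example `lambda _: random.random() < 0.01` for simple random sampling.
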
  7. Bloom filtering ❖ Removing most of the non-watched values ❖ Prefiltering a data set for an expensive set membership check • Probabilistic data structure • Compares hash function outputs • Answer: "probably yes" or "no"
  8. Step 1 - Filter Training: Input split → Bloom Filter Training → Output file. Step 2 - Bloom Filtering via MapReduce (two parallel mappers): Input split → Bloom Filter Mapper (load filter from distributed cache) → Bloom Filter Test → Maybe: Output file; No: Discarded
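
The two steps can be sketched in plain Python. This is an illustrative toy, not Hadoop's own Bloom filter class: the bit-array size, the number of hashes, and the SHA-256-based hashing are all assumptions, and the "distributed cache" is modelled by simply passing the trained filter to the mapper.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # True means "probably yes"; False means the item was never added
        return all(self.bits[pos] for pos in self._positions(item))

def train_filter(watched_values):
    """Step 1: build the filter from the set of watched values."""
    bloom = BloomFilter()
    for value in watched_values:
        bloom.add(value)
    return bloom

def bloom_mapper(records, bloom, key_of):
    """Step 2: keep records whose key passes the filter test, discard the rest."""
    for record in records:
        if bloom.might_contain(key_of(record)):
            yield record

watched = ["alice", "bob"]
bloom = train_filter(watched)
records = [("alice", 1), ("carol", 2), ("bob", 3), ("dave", 4)]
kept = list(bloom_mapper(records, bloom, key_of=lambda rec: rec[0]))
```

Because false positives are possible, `kept` may contain a few records for non-watched users; an exact membership check downstream would remove them.
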
  9. Top N ❖ Outlier analysis ❖ Select interesting data ❖ Catchy dashboards
  10. Diagram: four Input splits, each → Top Ten Mapper → local top 10; all local top 10 lists → Top Ten Reducer → final top 10 → Top 10 Output
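
The diagram's two levels (a local top 10 per mapper, then a final top 10 in a single reducer) can be sketched with `heapq.nlargest`. The record layout `(name, score)` and the function names are assumptions made for the example.

```python
import heapq

def top_n_mapper(split, n=10):
    """Each mapper keeps only its local top n records by score."""
    return heapq.nlargest(n, split, key=lambda rec: rec[1])

def top_n_reducer(local_tops, n=10):
    """A single reducer merges the local lists into the final top n."""
    merged = [rec for top in local_tops for rec in top]
    return heapq.nlargest(n, merged, key=lambda rec: rec[1])

# Hypothetical input: two splits of (name, score) records, with n=2
splits = [
    [("a", 5), ("b", 1), ("c", 9)],
    [("d", 7), ("e", 3)],
]
local_tops = [top_n_mapper(split, n=2) for split in splits]
final_top = top_n_reducer(local_tops, n=2)
```

Each mapper sends at most n records to the reducer, so the reducer's input stays small regardless of the size of the data set.
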
  11. Distinct ❖ Deduplicate data ❖ Getting distinct values ❖ Protecting from inner join explosions
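
One common way to get distinct values, sketched here under the assumption that the whole record is the deduplication key: the mapper emits the record as the key with an empty value, the shuffle groups identical keys, and the reducer emits each key once. All names are illustrative.

```python
def distinct_mapper(split):
    # Emit the record itself as the key; the value carries no information
    for record in split:
        yield record, None

def distinct_reducer(key, _values):
    # However many times the key arrived, emit it exactly once
    yield key

def distinct_job(splits):
    grouped = {}  # stands in for shuffle and sort
    for split in splits:
        for key, value in distinct_mapper(split):
            grouped.setdefault(key, []).append(value)
    return [out for key in sorted(grouped)
            for out in distinct_reducer(key, grouped[key])]

unique = distinct_job([["a", "b", "a"], ["b", "c"]])
```
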
  12. Summarization patterns ❖ Numerical summarization ❖ Inverted index ❖ Counting with counters
  13. Numerical summarization ❖ Word count ❖ Record count ❖ Min/Max/Count ❖ Average/Median/Standard deviation
  14. Diagram: three Mappers, each emitting (key, summary field) pairs → a Partitioner per mapper → two Reducers, each emitting (group B, summary) and (group D, summary)
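
A sketch of Min/Max/Count and Average for the grouping shown in the diagram. The assumption here is that each partial summary is a tuple `(min, max, sum, count)`, which can be merged in any order and so would also work as a combiner; the names and sample values are invented.

```python
def summary_mapper(split):
    # Emit (key, partial summary) where the summary is (min, max, sum, count)
    for key, value in split:
        yield key, (value, value, value, 1)

def merge(a, b):
    """Combine two partial summaries; safe to apply in any order."""
    return (min(a[0], b[0]), max(a[1], b[1]), a[2] + b[2], a[3] + b[3])

def summarize(splits):
    groups = {}
    for split in splits:
        for key, partial in summary_mapper(split):
            groups[key] = merge(groups[key], partial) if key in groups else partial
    return {
        key: {"min": lo, "max": hi, "count": n, "avg": total / n}
        for key, (lo, hi, total, n) in groups.items()
    }

# Hypothetical input: two splits of (group, value) records
splits = [
    [("B", 2), ("D", 10)],
    [("B", 4), ("D", 20), ("B", 6)],
]
stats = summarize(splits)
```

The average is carried as a sum and a count because an average of averages is wrong when groups differ in size. Median is left out: it cannot be merged from partial summaries of this kind.
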
  15. Inverted index
  16. Diagram: three Mappers, each emitting (keyword, unique ID) pairs → a Partitioner per mapper → two Reducers, each emitting (keyword A, list of IDs) and (keyword D, list of IDs)
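
The inverted-index flow can be sketched as follows, assuming documents arrive as `{id: text}` and keywords are whitespace-separated lowercase tokens. Both are simplifications chosen for the example.

```python
from collections import defaultdict

def index_mapper(doc_id, text):
    # Emit (keyword, unique ID) once per distinct keyword in the document
    for keyword in set(text.lower().split()):
        yield keyword, doc_id

def build_index(docs):
    postings = defaultdict(list)  # stands in for shuffle and sort
    for doc_id, text in docs.items():
        for keyword, emitted_id in index_mapper(doc_id, text):
            postings[keyword].append(emitted_id)
    # Reducer: each keyword ends up with its sorted list of IDs
    return {keyword: sorted(ids) for keyword, ids in postings.items()}

docs = {1: "Hadoop Pig", 2: "pig hive"}
index = build_index(docs)
```
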
  17. Data Organization Patterns ❖ Structured to Hierarchical ❖ Partitioning ❖ Binning ❖ Total Order Sorting ❖ Shuffling
  18. Join patterns ❖ Reduce Side Join ❖ Replicated Join ❖ Composite Join ❖ Cartesian Product
  19. Diagram: Data Set A: three Input splits → Join Mappers → (key, values A). Data Set B: two Input splits → Join Mappers → (key, values B). Both → Shuffle and sort → three Join Reducers → Output parts
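
A reduce side inner join, as drawn in the diagram, can be sketched like this: mappers tag each record with its source data set, the shuffle brings both sides of a key together, and the reducer pairs them up. Records are plain dicts and the field names (`userID`, `text`, `reputation`) are illustrative.

```python
from collections import defaultdict

def join_mapper(records, tag, key_field):
    # Emit (join key, (source tag, record)) so the reducer can tell sides apart
    for record in records:
        yield record[key_field], (tag, record)

def join_reducer(_key, tagged):
    side_a = [rec for tag, rec in tagged if tag == "A"]
    side_b = [rec for tag, rec in tagged if tag == "B"]
    # Inner join: a key present on only one side yields no output
    for rec_a in side_a:
        for rec_b in side_b:
            yield {**rec_a, **rec_b}

def reduce_side_join(set_a, set_b, key_field):
    groups = defaultdict(list)  # stands in for shuffle and sort
    for records, tag in ((set_a, "A"), (set_b, "B")):
        for key, tagged_record in join_mapper(records, tag, key_field):
            groups[key].append(tagged_record)
    return [row for key in sorted(groups)
            for row in join_reducer(key, groups[key])]

comments = [
    {"userID": 1, "text": "hi"},
    {"userID": 1, "text": "ok"},
    {"userID": 3, "text": "x"},
]
users = [{"userID": 1, "reputation": 50}, {"userID": 2, "reputation": 7}]
joined = reduce_side_join(comments, users, "userID")
```

The nested loop in the reducer is where an inner join can explode when a key has many records on both sides.
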
  20. Node table: id, title, tagnames, authorized, body, node type, parent id, abs parent id, added at, score, state string, last edited id, last activity id, last activity at, activity revision, extra, extra def, extra count. User table: user id, reputation, gold, silver, bronze
  21. Pig examples
      -- Inner Join:
      A = JOIN comments BY userID, users BY userID;
      -- Outer Join:
      A = JOIN comments BY userID [LEFT | RIGHT | FULL] OUTER, users BY userID;
      -- Binning:
      SPLIT data INTO eights IF col1 == 8, bigs IF col1 > 8, smalls IF (col1 < 8 AND col1 > 0);
      -- Top Ten:
      B = ORDER A BY col4 DESC;
      C = LIMIT B 10;
      -- Filtering:
      b = FILTER a BY value < 3;