MapReduce Design Patterns

Published in: Technology

Transcript

  • 1. MapReduce Design Patterns Anastasiia Kornilova, SoftServe Data Science Group
  • 2. MapReduce Components ❖ record reader ❖ map ❖ combiner ❖ partitioner ❖ shuffle and sort ❖ reduce ❖ output format [Diagram: Reader → Mapper → Combiner → Partitioner → Shuffle and sort → Reducer → Output]
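The component list above can be sketched as a toy, single-process pipeline. This is only an illustration in plain Python, not Hadoop code: the function names (`record_reader`, `map_fn`, `combine`, `partition`, `reduce_fn`, `run_job`) and the word-count task are assumptions chosen for the example.

```python
from collections import defaultdict
from itertools import groupby

def record_reader(text):
    """Turn one raw input split into (line number, line) records."""
    return list(enumerate(text.splitlines()))

def map_fn(_key, line):
    """Emit (word, 1) for every word in a line."""
    for word in line.lower().split():
        yield word, 1

def combine(pairs):
    """Pre-aggregate one mapper's output locally to reduce shuffle traffic."""
    local = defaultdict(int)
    for key, value in pairs:
        local[key] += value
    return list(local.items())

def partition(key, num_reducers):
    """Choose a reducer for a key (a deterministic toy hash for this sketch)."""
    return sum(key.encode()) % num_reducers

def reduce_fn(key, values):
    """Sum all partial counts for one key."""
    return key, sum(values)

def run_job(splits, num_reducers=2):
    buckets = [[] for _ in range(num_reducers)]
    for split in splits:  # one mapper per input split
        mapped = [kv for k, v in record_reader(split) for kv in map_fn(k, v)]
        for key, value in combine(mapped):
            buckets[partition(key, num_reducers)].append((key, value))
    output = {}
    for bucket in buckets:  # one reducer per partition
        bucket.sort(key=lambda kv: kv[0])  # shuffle and sort
        for key, group in groupby(bucket, key=lambda kv: kv[0]):
            k, total = reduce_fn(key, [v for _, v in group])
            output[k] = total
    return output  # a dict stands in for the output format

counts = run_job(["a b a", "b c"])
```

In a real cluster each stage runs on different machines; here the stages are ordinary function calls, so only the order of the steps carries over.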
  • 3. MapReduce Patterns ❖ Filtering Patterns ❖ Summarization Patterns ❖ Join Patterns ❖ Data Organization Patterns ❖ Metapatterns ❖ Input and Output Patterns
  • 4. Filtering patterns ❖ Filtering ❖ Bloom filtering ❖ Top-N ❖ Distinct
  • 5. Filtering ❖ Closer view of data ❖ Tracking a thread of events ❖ Distributed grep ❖ Data cleansing ❖ Simple random sampling ❖ Removing low scoring data
  • 6. [Diagram, repeated for each of three input splits: Input split → Filter Mapper → Output file]
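A map-only filter like the one in the diagram can be sketched as follows. The helper names (`filter_mapper`, `grep`, `sample`, `run_filter_job`) are hypothetical, and in-memory lists stand in for input splits and output files.

```python
import random
import re

def filter_mapper(records, keep):
    """Map-only filter: emit a record unchanged when keep(record) is true."""
    for record in records:
        if keep(record):
            yield record

def grep(pattern):
    """Predicate for a distributed-grep style filter."""
    regex = re.compile(pattern)
    return lambda line: regex.search(line) is not None

def sample(fraction, seed=0):
    """Predicate for simple random sampling: keep roughly `fraction` of records."""
    rng = random.Random(seed)
    return lambda _record: rng.random() < fraction

def run_filter_job(splits, keep):
    # No reducer: each split is filtered independently and
    # produces its own output "file" (a list here).
    return [list(filter_mapper(split, keep)) for split in splits]

splits = [["error: disk", "ok"], ["ok", "error: net"]]
errors = run_filter_job(splits, grep(r"^error"))
```

Because no records need to be grouped, the pattern needs no shuffle, which is why the diagram shows one output file per mapper.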
  • 7. Bloom filtering ❖ Removing most of the non-watched values ❖ Prefiltering a data set for an expensive set membership check • Probabilistic data structure • Compares hash function outputs • Answer: probably yes, or no
  • 8. Step 1 - Filter Training: Input split → Bloom Filter Training → Output file. Step 2 - Bloom Filtering via MapReduce (per input split): Input split → Bloom Filter Mapper (load filter from distributed cache) → Bloom Filter Test → Maybe: Output file / No: Discarded
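The two steps can be sketched in plain Python. The `BloomFilter` class below is a minimal illustration written for this example, not Hadoop's implementation; the bit-array size, the number of hashes, and the SHA-256 based hashing are all assumptions.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # False means "no"; True means only "maybe".
        return all(self.bits[pos] for pos in self._positions(item))

def train_filter(watched_values):
    """Step 1: build the filter from the set of watched values."""
    bloom = BloomFilter()
    for value in watched_values:
        bloom.add(value)
    return bloom

def bloom_filter_mapper(records, bloom, key_of):
    """Step 2: keep records whose key may be in the watched set."""
    for record in records:
        if bloom.might_contain(key_of(record)):
            yield record  # "maybe": written to the output file
        # "no": the record is discarded

bloom = train_filter(["u1", "u2"])
kept = list(bloom_filter_mapper([("u1", "a"), ("u9", "b")], bloom, lambda r: r[0]))
```

The filter never drops a watched value, but it may let some unwatched ones through, so an exact membership check is still needed afterwards if false positives matter.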
  • 9. Top N ❖ Outlier analysis ❖ Select interesting data ❖ Catchy dashboards
  • 10. [Diagram: four input splits, each → Top Ten Mapper → local top 10; all local top 10 lists → Top Ten Reducer → final top 10 → Top 10 Output]
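The local-then-final structure in the diagram can be sketched with Python's `heapq`. The function names and the `(name, score)` record shape are assumptions made for this example.

```python
import heapq

def top_n_mapper(records, n, score_of):
    """Each mapper keeps only its local top n records."""
    return heapq.nlargest(n, records, key=score_of)

def top_n_reducer(local_tops, n, score_of):
    """A single reducer merges the local lists into the final top n."""
    merged = [record for top in local_tops for record in top]
    return heapq.nlargest(n, merged, key=score_of)

def run_top_n(splits, n=10, score_of=lambda record: record[1]):
    local_tops = [top_n_mapper(split, n, score_of) for split in splits]
    return top_n_reducer(local_tops, n, score_of)

splits = [[("a", 5), ("b", 1)], [("c", 9), ("d", 3)]]
best_two = run_top_n(splits, n=2)
```

Each mapper sends at most n records to the reducer, so the single reducer handles n times the number of mappers rather than the whole data set.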
  • 11. Distinct ❖ Deduplicate data ❖ Getting distinct values ❖ Protecting from inner join explosions
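One common way to express Distinct is to use the whole record as the key with an empty value, so that grouping does the deduplication. The sketch below assumes that approach; the function names are hypothetical and a dictionary stands in for the shuffle.

```python
def distinct_mapper(records):
    """Emit each record as a key with an empty value."""
    for record in records:
        yield record, None

def distinct_combiner(pairs):
    """Drop duplicates locally before they are shuffled."""
    seen = set()
    for key, _ in pairs:
        if key not in seen:
            seen.add(key)
            yield key, None

def distinct_reducer(key, _values):
    """Each key reaches the reducer once per group: emit it once."""
    return key

def run_distinct(splits):
    groups = {}
    for split in splits:
        for key, value in distinct_combiner(distinct_mapper(split)):
            groups.setdefault(key, []).append(value)
    return [distinct_reducer(key, groups[key]) for key in sorted(groups)]

unique = run_distinct([["a", "b", "a"], ["b", "c"]])
```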
  • 12. Summarization patterns ❖ Numerical summarization ❖ Inverted index ❖ Counting with counters
  • 13. Numerical summarization ❖ Word count ❖ Record count ❖ Min/Max/Count ❖ Average/Median/Standard deviation
  • 14. [Diagram: three Mappers, each emitting (key, summary field) pairs → Partitioner → two Reducers, each emitting (group B, summary), (group D, summary)]
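A min/max/count/average summarization along the lines of the diagram might look like this. The tuple layout `(min, max, count, sum)` and all function names are assumptions for the sketch; the same `merge` step serves as both combiner and reducer.

```python
def summarize_mapper(records):
    """Emit (key, partial summary) for each (key, value) record."""
    for key, value in records:
        yield key, (value, value, 1, value)  # (min, max, count, sum)

def merge(a, b):
    """Combine two partial summaries; associative, so safe in a combiner."""
    return (min(a[0], b[0]), max(a[1], b[1]), a[2] + b[2], a[3] + b[3])

def fold(pairs):
    """Group by key and merge the partial summaries per key."""
    acc = {}
    for key, summary in pairs:
        acc[key] = merge(acc[key], summary) if key in acc else summary
    return acc

def run_summary(splits):
    partials = []
    for split in splits:
        partials.extend(fold(summarize_mapper(split)).items())  # combiner
    result = {}
    for key, (lo, hi, count, total) in fold(partials).items():  # reducer
        result[key] = {"min": lo, "max": hi, "count": count, "avg": total / count}
    return result

summary = run_summary([[("B", 2), ("D", 10)], [("B", 4)]])
```

The average is carried as a sum and a count because averaging partial averages would give the wrong result when groups differ in size. A median cannot be merged this way and needs a different approach.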
  • 15. Inverted index
  • 16. [Diagram: three Mappers, each emitting (keyword, unique ID) pairs → Partitioner → two Reducers, each emitting (keyword A, list of IDs), (keyword D, list of IDs)]
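The inverted index flow in the diagram can be sketched as below. Whitespace tokenization, integer document IDs, and the function names are assumptions made for illustration.

```python
def index_mapper(doc_id, text):
    """Emit (keyword, unique ID) once per distinct keyword in a document."""
    for keyword in set(text.lower().split()):
        yield keyword, doc_id

def index_reducer(keyword, doc_ids):
    """Collect all IDs for a keyword into one sorted list."""
    return keyword, sorted(set(doc_ids))

def build_index(documents):
    grouped = {}  # stands in for partitioning plus shuffle and sort
    for doc_id, text in documents.items():
        for keyword, value in index_mapper(doc_id, text):
            grouped.setdefault(keyword, []).append(value)
    return dict(index_reducer(k, v) for k, v in grouped.items())

index = build_index({1: "hadoop pig", 2: "pig hive"})
```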
  • 17. Data Organization Patterns ❖ Structured to Hierarchical ❖ Partitioning ❖ Binning ❖ Total Order Sorting ❖ Shuffling
  • 18. Join patterns ❖ Reduce Side Join ❖ Replicated Join ❖ Composite Join ❖ Cartesian Product
  • 19. [Diagram: Data Set A (three input splits) → Join Mappers emitting (key, values A); Data Set B (two input splits) → Join Mappers emitting (key, values B); → Shuffle and sort → three Join Reducers → Output parts]
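A reduce side inner join matching the diagram can be sketched as follows. Tagging each record with its source data set ("A" or "B") is the usual trick; the function names and the tuple record shapes are assumptions for this example.

```python
def join_mapper(records, tag, key_of):
    """Emit (join key, (source tag, record)) so the reducer can tell sides apart."""
    for record in records:
        yield key_of(record), (tag, record)

def join_reducer(key, tagged):
    """Inner join: pair every A record with every B record sharing the key."""
    left = [record for tag, record in tagged if tag == "A"]
    right = [record for tag, record in tagged if tag == "B"]
    for a in left:
        for b in right:
            yield key, a, b

def reduce_side_join(data_a, data_b, key_a, key_b):
    grouped = {}  # stands in for shuffle and sort
    pairs = list(join_mapper(data_a, "A", key_a)) + list(join_mapper(data_b, "B", key_b))
    for key, value in pairs:
        grouped.setdefault(key, []).append(value)
    out = []
    for key in sorted(grouped):
        out.extend(join_reducer(key, grouped[key]))
    return out

comments = [(1, "hi"), (2, "yo"), (3, "orphan")]
users = [(1, "ann"), (2, "bob")]
joined = reduce_side_join(comments, users, lambda r: r[0], lambda r: r[0])
```

An outer join would differ only in the reducer, which would also emit records from one side when the other side is empty for that key.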
  • 20. Node table: id, title, tagnames, authorized, body, node type, parent id, abs parent id, added at, score, state string, last edited id, last activity id, last activity at, activity revision, extra, extra def, extra count. User table: user id, reputation, gold, silver, bronze.
  • 21. Pig examples
      -- Inner Join:
      A = JOIN comments BY userID, users BY userID;
      -- Outer Join:
      A = JOIN comments BY userID [LEFT | RIGHT | FULL] OUTER, users BY userID;
      -- Binning:
      SPLIT data INTO eights IF col1 == 8, bigs IF col1 > 8, smalls IF (col1 < 8 AND col1 > 0);
      -- Top Ten:
      B = ORDER A BY col4 DESC;
      C = LIMIT B 10;
      -- Filtering:
      b = FILTER a BY value < 3;