MapReduce Design Patterns

Published in: Technology

  1. MapReduce Design Patterns. Anastasiia Kornilova, SoftServe Data Science Group
  2. MapReduce Components ❖ record reader ❖ map ❖ combiner ❖ partitioner ❖ shuffle and sort ❖ reduce ❖ output format. Diagram: Reader → Mapper → Combiner → Partitioner → Shuffle and sort → Reducer → Output
  3. MapReduce Patterns ❖ Filtering Patterns ❖ Summarization Patterns ❖ Join Patterns ❖ Data Organization Patterns ❖ Metapatterns ❖ Input and Output Patterns
  4. Filtering patterns ❖ Filtering ❖ Bloom filtering ❖ Top-N ❖ Distinct
  5. Filtering ❖ Closer view of data ❖ Tracking a thread of events ❖ Distributed grep ❖ Data cleansing ❖ Simple random sampling ❖ Removing low-scoring data
  6. Diagram: Input split → Filter Mapper → Output file (the same pipeline repeated for each of three input splits)
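The map-only flow in this diagram can be sketched in plain Python rather than Hadoop. This is a minimal illustration, assuming a distributed-grep style predicate; `filter_mapper`, `run_map_only`, and the sample log lines are hypothetical and do not come from the slides.

```python
import re
from typing import Callable, Iterable, Iterator, List


def filter_mapper(records: Iterable[str], keep: Callable[[str], bool]) -> Iterator[str]:
    """Emit only the records that satisfy the predicate; no reducer is involved."""
    for record in records:
        if keep(record):
            yield record


def run_map_only(splits: List[List[str]], keep: Callable[[str], bool]) -> List[List[str]]:
    """Simulate one mapper per input split, each producing its own output file."""
    return [list(filter_mapper(split, keep)) for split in splits]


# Hypothetical input: three splits of log lines.
splits = [
    ["error: disk full", "ok"],
    ["ok", "error: timeout"],
    ["ok"],
]
pattern = re.compile(r"error")
outputs = run_map_only(splits, lambda line: pattern.search(line) is not None)
print(outputs)
```

Swapping the predicate for a random draw or a score threshold would give the sampling and low-score removal uses listed on the previous slide.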
  7. Bloom filtering ❖ Removing most of the non-watched values ❖ Prefiltering a data set for an expensive set membership check • Probabilistic data structure • Membership is tested by comparing hash function results • Answer: "probably yes" or "no"
  8. Diagram. Step 1, Bloom filter training: Input split → Bloom Filter Training → Output file. Step 2, Bloom filtering via MapReduce: Input split → Bloom Filter Mapper (loads the filter from the distributed cache) → Bloom Filter Test → "Maybe": Output file; "No": Discarded. Step 2 is shown for two input splits.
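The two steps in this diagram can be sketched as follows. This is a simplified, single-process illustration, not Hadoop's own Bloom filter class: the `BloomFilter` class, its bit-array size, the salted SHA-256 hashing scheme, and the sample records are all assumptions made for the example.

```python
import hashlib
from typing import Iterable, Iterator, List, Tuple


class BloomFilter:
    """A tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, item: str) -> Iterator[int]:
        # Derive several bit positions by salting SHA-256 with a seed (assumed scheme).
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # True means "maybe present"; False means "not present".
        return all(self.bits[pos] for pos in self._positions(item))


def train(watched: Iterable[str]) -> BloomFilter:
    """Step 1: build the filter from the watched values."""
    bloom = BloomFilter()
    for value in watched:
        bloom.add(value)
    return bloom


def bloom_mapper(records: Iterable[Tuple[str, str]], bloom: BloomFilter) -> Iterator[Tuple[str, str]]:
    """Step 2: keep records whose key passes the test, discard the rest."""
    for user, text in records:
        if bloom.might_contain(user):
            yield (user, text)


watched_users = ["alice", "bob"]  # hypothetical watched values
bloom = train(watched_users)
records: List[Tuple[str, str]] = [("alice", "hi"), ("carol", "hello"), ("bob", "hey")]
kept = list(bloom_mapper(records, bloom))
print(kept)
```

Because a "maybe" can be a false positive, the kept records are a superset of the true matches, which is why the pattern suits prefiltering before an exact membership check.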
  9. Top N ❖ Outlier analysis ❖ Select interesting data ❖ Catchy dashboards
  10. Diagram: four pipelines of Input split → Top Ten Mapper → local top 10, all feeding one Top Ten Reducer → final top 10 → Top 10 Output
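The local-then-global selection in this diagram can be sketched with Python's `heapq`. This is a minimal single-process illustration; the function names, the `(score, id)` record shape, and the sample data are assumptions, and `n` is set to 2 only to keep the example short.

```python
import heapq
from typing import List, Tuple

Record = Tuple[int, str]  # (score, record id), a hypothetical record shape


def top_n_mapper(split: List[Record], n: int = 10) -> List[Record]:
    """Each mapper keeps only the n highest-scoring records of its own split."""
    return heapq.nlargest(n, split)


def top_n_reducer(local_tops: List[List[Record]], n: int = 10) -> List[Record]:
    """A single reducer merges the local lists and selects the final top n."""
    merged = [record for local in local_tops for record in local]
    return heapq.nlargest(n, merged)


splits = [
    [(5, "a"), (42, "b"), (7, "c")],
    [(99, "d"), (1, "e")],
    [(13, "f"), (64, "g"), (2, "h")],
]
local_tops = [top_n_mapper(split, n=2) for split in splits]
final_top = top_n_reducer(local_tops, n=2)
print(final_top)
```

Each mapper sends at most `n` records to the reducer, so the single reducer handles only a small fraction of the input.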
  11. Distinct ❖ Deduplicate data ❖ Getting distinct values ❖ Protecting from inner join explosions
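The slide lists uses of the Distinct pattern but not its mechanics. One common way to realize it, sketched here as an assumption rather than as the author's method, is to emit each record as a key with an empty value and let the shuffle group the duplicates together. All names and data below are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def distinct_mapper(records: Iterable[str]) -> Iterator[Tuple[str, None]]:
    """Emit the whole record as the key; the value carries no information."""
    for record in records:
        yield (record, None)


def shuffle(pairs: Iterable[Tuple[str, None]]) -> Dict[str, List[None]]:
    """Stand-in for shuffle and sort: group all values by key."""
    groups: Dict[str, List[None]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def distinct_reducer(key: str, values: List[None]) -> str:
    """Each key reaches the reducer once, so writing the key deduplicates."""
    return key


records = ["u1", "u2", "u1", "u3", "u2"]
groups = shuffle(distinct_mapper(records))
distinct = sorted(distinct_reducer(key, values) for key, values in groups.items())
print(distinct)
```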
  12. Summarization patterns ❖ Numerical summarization ❖ Inverted index ❖ Counting with counters
  13. Numerical summarization ❖ Word count ❖ Record count ❖ Min/Max/Count ❖ Average/Median/Standard deviation
  14. Diagram: three Mappers each emit (key, summary field) pairs → Partitioner (one per mapper) → two Reducers, each emitting (group, summary) pairs such as (group B, summary) and (group D, summary)
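The Min/Max/Count case of this flow can be sketched as below. This is a single-process illustration with hypothetical names and data; the `(min, max, count)` tuple is an assumed encoding of the "summary field" in the diagram.

```python
from typing import Dict, Iterable, Iterator, Tuple

Summary = Tuple[int, int, int]  # (min, max, count)


def summary_mapper(records: Iterable[Tuple[str, int]]) -> Iterator[Tuple[str, Summary]]:
    """Emit (key, summary field): each value starts as its own min and max."""
    for key, value in records:
        yield (key, (value, value, 1))


def combine(a: Summary, b: Summary) -> Summary:
    """Merge two partial summaries; the order of merging does not matter."""
    return (min(a[0], b[0]), max(a[1], b[1]), a[2] + b[2])


def summary_reducer(pairs: Iterable[Tuple[str, Summary]]) -> Dict[str, Summary]:
    """Fold all partial summaries that share a key into one summary per group."""
    result: Dict[str, Summary] = {}
    for key, summary in pairs:
        result[key] = combine(result[key], summary) if key in result else summary
    return result


records = [("B", 4), ("D", 10), ("B", 9), ("D", 3), ("B", 1)]
summaries = summary_reducer(summary_mapper(records))
print(summaries)
```

Because `combine` is associative and commutative, the same function could also run as a combiner on the map side. An average can be carried the same way as a `(sum, count)` pair, whereas a median cannot be merged from partial results like this.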
  15. Inverted index
  16. Diagram: three Mappers each emit (keyword, unique ID) pairs → Partitioner (one per mapper) → two Reducers, each emitting (keyword, list of IDs) pairs such as (keyword A, list of IDs) and (keyword D, list of IDs)
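The inverted index flow can be sketched as follows. This is a minimal single-process illustration; whitespace tokenization, the function names, and the three sample documents are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def index_mapper(doc_id: int, text: str) -> Iterator[Tuple[str, int]]:
    """Emit (keyword, unique ID) once per distinct word in the document."""
    for word in set(text.lower().split()):
        yield (word, doc_id)


def index_reducer(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Collect the IDs for each keyword into (keyword, list of IDs)."""
    index: Dict[str, List[int]] = defaultdict(list)
    for keyword, doc_id in pairs:
        index[keyword].append(doc_id)
    return {keyword: sorted(ids) for keyword, ids in index.items()}


# Hypothetical documents keyed by ID.
docs = {1: "map reduce patterns", 2: "reduce side join", 3: "map side join"}
pairs = [pair for doc_id, text in docs.items() for pair in index_mapper(doc_id, text)]
index = index_reducer(pairs)
print(index["map"], index["join"])
```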
  17. Data Organization Patterns ❖ Structured to Hierarchical ❖ Partitioning ❖ Binning ❖ Total Order Sorting ❖ Shuffling
  18. Join patterns ❖ Reduce Side Join ❖ Replicated Join ❖ Composite Join ❖ Cartesian Product
  19. Diagram: Data Set A input splits → Join Mappers → (key, values A); Data Set B input splits → Join Mappers → (key, values B); both streams → Shuffle and sort → three Join Reducers → Output parts
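The join flow in this diagram, where both data sets are mapped to a shared key and matched in the reducer, can be sketched as below for the inner-join case. The source tags "A" and "B", the function names, and the sample comments and users are assumptions for illustration.

```python
from collections import defaultdict
from itertools import product
from typing import Dict, Iterable, Iterator, List, Tuple

Tagged = Tuple[str, str]  # (source tag, value)


def join_mapper(records: Iterable[Tuple[int, str]], tag: str) -> Iterator[Tuple[int, Tagged]]:
    """Emit (key, tagged value) so the reducer can tell the data sets apart."""
    for key, value in records:
        yield (key, (tag, value))


def shuffle_and_sort(pairs: Iterable[Tuple[int, Tagged]]) -> Dict[int, List[Tagged]]:
    """Stand-in for shuffle and sort: group tagged values by key, keys in order."""
    groups: Dict[int, List[Tagged]] = defaultdict(list)
    for key, tagged in pairs:
        groups[key].append(tagged)
    return dict(sorted(groups.items()))


def inner_join_reducer(key: int, tagged_values: List[Tagged]) -> Iterator[Tuple[int, str, str]]:
    """Pair every A value with every B value that shares the key."""
    left = [value for tag, value in tagged_values if tag == "A"]
    right = [value for tag, value in tagged_values if tag == "B"]
    for a, b in product(left, right):
        yield (key, a, b)


comments = [(1, "nice post"), (2, "thanks"), (1, "agreed"), (4, "orphan")]  # data set A
users = [(1, "alice"), (2, "bob"), (3, "carol")]                            # data set B
pairs = list(join_mapper(comments, "A")) + list(join_mapper(users, "B"))
joined = [
    row
    for key, values in shuffle_and_sort(pairs).items()
    for row in inner_join_reducer(key, values)
]
print(joined)
```

Keys present in only one data set (3 and 4 here) produce no output. An outer join would instead emit those rows with an empty counterpart.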
  20. Node table: id, title, tagnames, authorized, body, node type, parent id, abs parent id, added at, score, state string, last edited id, last activity id, last activity at, activity revision, extra, extra def, extra count. User table: user id, reputation, gold, silver, bronze
  21. Pig examples
      -- Inner Join:
      A = JOIN comments BY userID, users BY userID;
      -- Outer Join:
      A = JOIN comments BY userID [LEFT | RIGHT | FULL] OUTER, users BY userID;
      -- Binning:
      SPLIT data INTO eights IF col1 == 8, bigs IF col1 > 8, smalls IF (col1 < 8 AND col1 > 0);
      -- Top Ten:
      B = ORDER A BY col4 DESC;
      C = LIMIT B 10;
      -- Filtering:
      b = FILTER a BY value < 3;