MapReduce Design Patterns
Usage Rights

© All Rights Reserved
MapReduce Design Patterns: Presentation Transcript

  • MapReduce Design Patterns Anastasiia Kornilova, SoftServe Data Science Group
  • MapReduce Components ❖ record reader ❖ map ❖ combiner ❖ partitioner ❖ shuffle and sort ❖ reduce ❖ output format
    Diagram: Reader → Mapper → Combiner → Partitioner → Shuffle and sort → Reducer → Output
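The component chain on this slide can be traced in a small in-memory simulation. This is only a sketch, not Hadoop code: `run_job`, `group_by_key`, `word_mapper`, and `sum_reducer` are hypothetical names, the input splits are plain Python lists standing in for the record reader, and a CRC32 hash stands in for the partitioner.

```python
import zlib
from collections import defaultdict


def group_by_key(pairs):
    """Group (key, value) pairs into {key: [values]}, as shuffle and sort does."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def run_job(splits, mapper, reducer, combiner=None, num_reducers=2):
    """Toy single-process job: map, combine, partition, shuffle and sort, reduce."""
    partitions = [[] for _ in range(num_reducers)]
    for split in splits:
        # Record reader + map: every record yields zero or more (key, value) pairs.
        pairs = [pair for record in split for pair in mapper(record)]
        # Combiner: optional local aggregation of one mapper's output.
        if combiner is not None:
            pairs = [out
                     for key, values in group_by_key(pairs).items()
                     for out in combiner(key, values)]
        # Partitioner: a deterministic hash decides which reducer sees each key.
        for key, value in pairs:
            index = zlib.crc32(str(key).encode("utf-8")) % num_reducers
            partitions[index].append((key, value))
    output = []
    for partition in partitions:
        # Shuffle and sort: group values by key and visit keys in sorted order.
        groups = group_by_key(partition)
        for key in sorted(groups):
            # Reduce: one call per key; the returned list stands in for the output format.
            output.extend(reducer(key, groups[key]))
    return output


def word_mapper(line):
    for word in line.split():
        yield word.lower(), 1


def sum_reducer(word, counts):
    yield word, sum(counts)


splits = [["the quick fox", "the lazy dog"], ["the end"]]
counts = dict(run_job(splits, word_mapper, sum_reducer, combiner=sum_reducer))
```

Because addition is associative and commutative, `sum_reducer` can double as the combiner here; a reducer without that property could not be reused this way.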
  • MapReduce Patterns ❖ Filtering Patterns ❖ Summarization Patterns ❖ Join Patterns ❖ Data Organization Patterns ❖ Metapatterns ❖ Input and Output Patterns
  • Filtering patterns ❖ Filtering ❖ Bloom filtering ❖ Top-N ❖ Distinct
  • Filtering ❖ Closer view of data ❖ Tracking a thread of events ❖ Distributed grep ❖ Data cleansing ❖ Simple random sampling ❖ Removing low scoring data
  • Diagram: Input split → Filter Mapper → Output file (three parallel tasks, one per input split)
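A map-only filter like the one in the diagram could look roughly like the sketch below, which covers two of the listed uses (distributed grep and simple random sampling). The names `grep_mapper`, `sample_mapper`, and `run_map_only` are hypothetical, and each inner list is assumed to stand for one input split that produces its own output file.

```python
import random
import re


def grep_mapper(line, pattern):
    """Distributed grep: keep a line only if it matches the pattern."""
    if pattern.search(line):
        yield line


def sample_mapper(record, rate, rng):
    """Simple random sampling: keep each record with probability `rate`."""
    if rng.random() < rate:
        yield record


def run_map_only(splits, mapper, *args):
    """Apply the filter mapper to each split; one output list per split, no reducer."""
    return [[out for record in split for out in mapper(record, *args)]
            for split in splits]


splits = [["INFO start", "ERROR disk full"], ["WARN slow", "ERROR timeout"]]
errors = run_map_only(splits, grep_mapper, re.compile(r"^ERROR"))
sample = run_map_only(splits, sample_mapper, 0.5, random.Random(0))
```

Since every record is judged on its own, no shuffle is needed and the job scales with the number of mappers.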
  • Bloom filtering ❖ Removing most of the non-watched values ❖ Prefiltering a data set for an expensive set membership check
    • Probabilistic data structure
    • Membership is tested by comparing hash function results
    • Answer: probably yes, or no
  • Step 1 - Filter Training: Input split → Bloom Filter Training → Output file
    Step 2 - Bloom Filtering via MapReduce: Input split → Bloom Filter Mapper (load filter from distributed cache) → Bloom Filter Test → Maybe: Output file / No: Discarded (two parallel mappers shown)
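The two steps can be sketched with a small hand-rolled Bloom filter. Everything here is an illustrative assumption rather than the deck's implementation: the `BloomFilter` class, its sizes, the use of salted SHA-256 as the hash family, and the helper names `train_filter` and `bloom_mapper`. In a real job the trained filter would be shipped to the mappers through the distributed cache.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter; sizes are arbitrary illustration values."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, item):
        # Derive several hash positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for position in self._positions(item):
            self.bits[position] = 1

    def might_contain(self, item):
        # True means "maybe" (false positives possible); False means "definitely not".
        return all(self.bits[position] for position in self._positions(item))


def train_filter(watched_values):
    """Step 1: build the filter from the set of watched values."""
    bloom = BloomFilter()
    for value in watched_values:
        bloom.add(value)
    return bloom


def bloom_mapper(record, bloom):
    """Step 2: keep records whose key passes the filter test, discard the rest."""
    user_id, _payload = record
    if bloom.might_contain(user_id):
        yield record


bloom = train_filter(["u1", "u7"])
records = [("u1", "a"), ("u2", "b"), ("u7", "c")]
kept = [out for record in records for out in bloom_mapper(record, bloom)]
```

Records that pass are only "maybe" members, so an exact membership check is still needed afterwards if false positives matter.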
  • Top N ❖ Outlier analysis ❖ Select interesting data ❖ Catchy dashboards
  • Diagram: four input splits → one Top Ten Mapper per split (local top 10) → single Top Ten Reducer (final top 10) → Top 10 Output
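The local-then-global selection in the diagram can be sketched as follows. The function names and the `(id, score)` record layout are assumptions made for illustration, and N is set to 3 instead of 10 only to keep the sample data short.

```python
import heapq


def top_n_mapper(split, n, score_of):
    """Each mapper emits only its local top N records."""
    return heapq.nlargest(n, split, key=score_of)


def top_n_reducer(local_tops, n, score_of):
    """A single reducer merges the local lists into the final top N."""
    candidates = [record for top in local_tops for record in top]
    return heapq.nlargest(n, candidates, key=score_of)


def score_of(record):
    return record[1]


splits = [
    [("a", 5), ("b", 42), ("c", 7), ("h", 3)],
    [("d", 99), ("e", 1)],
    [("f", 42), ("g", 13)],
]
local_tops = [top_n_mapper(split, 3, score_of) for split in splits]
final_top = top_n_reducer(local_tops, 3, score_of)
```

The reducer sees at most N records per mapper, which is why a single reducer is enough even for large inputs.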
  • Distinct ❖ Deduplicate data ❖ Getting distinct values ❖ Protecting from inner join explosions
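One common way to express the distinct pattern is to use the record itself as the key, so that the shuffle groups duplicates together and the reducer emits each key once. The sketch below assumes that approach; `distinct_mapper`, `distinct_reducer`, and `run_distinct` are hypothetical names.

```python
def distinct_mapper(record):
    """Emit the record as the key; the value carries no information."""
    yield record, None


def distinct_reducer(key, _values):
    """All duplicates arrive under one key, so emit the key exactly once."""
    yield key


def run_distinct(splits):
    groups = {}
    for split in splits:
        for record in split:
            for key, value in distinct_mapper(record):
                groups.setdefault(key, []).append(value)
    return [out
            for key in sorted(groups)
            for out in distinct_reducer(key, groups[key])]


unique_users = run_distinct([["u1", "u2", "u1"], ["u2", "u3"]])
```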
  • Summarization patterns ❖ Numerical summarization ❖ Inverted index ❖ Counting with counters
  • Numerical summarization ❖ Word count ❖ Record count ❖ Min/Max/Count ❖ Average/Median/Standard deviation
  • Diagram: three Mappers, each emitting (key, summary field) pairs → Partitioner → two Reducers, each emitting (group B, summary), (group D, summary)
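A min/max/count/average summarization along the lines of the diagram might be sketched as below. The key idea, assumed here rather than taken from the slides, is to carry `(min, max, count, sum)` tuples so the same merge function can serve as both combiner and reducer; the mean is computed only at the very end because averages of averages are not safe to combine. All names are hypothetical.

```python
from collections import defaultdict


def summary_mapper(record):
    """Emit (key, summary field) as a (min, max, count, sum) tuple."""
    group, value = record
    yield group, (value, value, 1, value)


def merge_summaries(key, summaries):
    """Merge partial summaries; usable as combiner and as reducer."""
    mins, maxs, counts, sums = zip(*summaries)
    yield key, (min(mins), max(maxs), sum(counts), sum(sums))


def summarize(splits):
    shuffled = defaultdict(list)
    for split in splits:
        local = defaultdict(list)
        for record in split:
            for key, value in summary_mapper(record):
                local[key].append(value)
        # Combiner step: one partial summary per key leaves each mapper.
        for key, values in local.items():
            for out_key, combined in merge_summaries(key, values):
                shuffled[out_key].append(combined)
    result = {}
    for key, values in shuffled.items():
        for out_key, (low, high, count, total) in merge_summaries(key, values):
            result[out_key] = {"min": low, "max": high,
                               "count": count, "mean": total / count}
    return result


splits = [[("B", 4), ("D", 10), ("B", 8)], [("B", 6), ("D", 2)]]
summary = summarize(splits)
```

Median and standard deviation need more state than this tuple carries, so they are left out of the sketch.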
  • Inverted index
  • Diagram: three Mappers, each emitting (keyword, unique ID) pairs → Partitioner → two Reducers, each emitting (keyword A, list of IDs), (keyword D, list of IDs)
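The (keyword, unique ID) flow in the diagram can be illustrated with a small sketch. The whitespace tokenizer, the sample documents, and the names `index_mapper`, `index_reducer`, and `build_index` are assumptions made for the example.

```python
from collections import defaultdict


def index_mapper(doc_id, text):
    """Emit (keyword, unique ID) once per distinct keyword in the document."""
    for keyword in set(text.lower().split()):
        yield keyword, doc_id


def index_reducer(keyword, doc_ids):
    """Collect every ID seen for a keyword into one sorted list."""
    yield keyword, sorted(set(doc_ids))


def build_index(documents):
    groups = defaultdict(list)
    for doc_id, text in documents.items():
        for keyword, found_in in index_mapper(doc_id, text):
            groups[keyword].append(found_in)
    return dict(out
                for keyword in groups
                for out in index_reducer(keyword, groups[keyword]))


documents = {1: "hadoop map reduce", 2: "pig on hadoop", 3: "map side join"}
index = build_index(documents)
```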
  • Data Organization Patterns ❖ Structured to Hierarchical ❖ Partitioning ❖ Binning ❖ Total Order Sorting ❖ Shuffling
  • Join patterns ❖ Reduce Side Join ❖ Replicated Join ❖ Composite Join ❖ Cartesian Product
  • Diagram: Data Set A (three input splits) → Join Mappers emit (key, values A); Data Set B (two input splits) → Join Mappers emit (key, values B) → Shuffle and sort → three Join Reducers → Output parts
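A reduce side inner join along the lines of the diagram can be sketched by tagging each mapper output with its source data set. The tags "A" and "B", the `userID` join key (borrowed from the Pig examples), the sample rows, and the function names are illustrative assumptions.

```python
from collections import defaultdict


def join_mapper(record, tag):
    """Emit the join key with the record, tagged by its source data set."""
    yield record["userID"], (tag, record)


def inner_join_reducer(key, tagged_records):
    """Pair every A record with every B record that shares the key."""
    left = [record for tag, record in tagged_records if tag == "A"]
    right = [record for tag, record in tagged_records if tag == "B"]
    for a in left:
        for b in right:
            yield {**a, **b}


def reduce_side_join(data_set_a, data_set_b):
    groups = defaultdict(list)  # stands in for shuffle and sort
    for tag, data_set in (("A", data_set_a), ("B", data_set_b)):
        for record in data_set:
            for key, value in join_mapper(record, tag):
                groups[key].append(value)
    return [row
            for key in sorted(groups)
            for row in inner_join_reducer(key, groups[key])]


comments = [{"userID": 1, "text": "nice"}, {"userID": 2, "text": "thanks"},
            {"userID": 1, "text": "+1"}, {"userID": 9, "text": "orphan"}]
users = [{"userID": 1, "reputation": 50}, {"userID": 2, "reputation": 7},
         {"userID": 3, "reputation": 0}]
joined = reduce_side_join(comments, users)
```

An outer join would differ only in the reducer, which would also emit rows when one of the two lists is empty.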
  • Node table: id, title, tagnames, authorized, body, node type, parent id, abs parent id, added at, score, state string, last edited id, last activity id, last activity at, activity revision, extra, extra def, extra count
    User table: user id, reputation, gold, silver, bronze
  • Pig examples
    -- Inner Join:
    A = JOIN comments BY userID, users BY userID;
    -- Outer Join:
    A = JOIN comments BY userID [LEFT | RIGHT | FULL] OUTER, users BY userID;
    -- Binning:
    SPLIT data INTO eights IF col1 == 8, bigs IF col1 > 8, smalls IF (col1 < 8 AND col1 > 0);
    -- Top Ten:
    B = ORDER A BY col4 DESC;
    C = LIMIT B 10;
    -- Filtering:
    b = FILTER a BY value < 3;