Map Reduce

Functional ProgrammingMapApply function to transform elements of a list Return results as a listReduceApply function to all elements of a listCollect results and return as a single value

DefinitionMapReduce: A software framework to support processing of massive data sets across distributed computers

SAMPLE USE CASEBack end credit card processorNightly processing of millions of transactionsProcessing requires grouping, sorting, and merchant wide analysisCan’t just divide the over all list into equal parts as further analysis is necessaryTight processing window

DescriptionSimple, powerful programming modelLanguage independentCan run on a single machine, but shines for distributed computing and extreme datasetsBreak down the processing problem into embarrassingly parallel atomic operations

AlgorithmMap Phase Raw data analyzed and converted to name/value pairShuffle PhaseAll name/value pairs are sorted and grouped by their keysReduce PhaseAll values associated with a key are processed for results

MapReduce Walk ThroughGoal: Construct a word frequency of all the words in Wikipedia

Step 0: Split DataRaw input data divided into N partsN > number of machinesSplit must be context specific

Step 1: MapEach machine takes/receives a single slice of the raw input for mappingThe map function processes the input file and emits a name/value pair of the relevant data

STEP 2: ShuffleThe results of the map phase are sorted and grouped by the key in each key value pair.

STEP 3: ReduceResults from shuffle phase divided into M partsM >number of machinesEach machine runs a reduction method on a part of shuffle results.

MAPREDUCE BENEFITSScaleProcessing speed increases with number of machines involvedReliableLoss of any one machine doesn’t stop processingCostOften built from heterogeneous commodity grade computers

Use Case ResultsProcessing time of 1 million records Originally ~3 hoursReduced to 40 minutes on 5 computers

Other MapReduce installationsGoogle – Index buildingVisa – Transaction ProcessingFacebook – Facebook LexiconIntelligence CommunityYahoo/Google – Terabyte Sort10 billion, 100 byte recordsYahoo: 910 nodes, 206 secondsGoogle: ~1,000 nodes, 68 seconds

Map Reduce

More Related Content

What's hot

Viewers also liked

Similar to Map Reduce

Recently uploaded

Map Reduce