MapReduce

Functional Programming
  • Map
      • Apply a function to transform the elements of a list
      • Return the results as a list
  • Reduce
      • Apply a function to all elements of a list
      • Collect the results and return them as a single value
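
The two functional building blocks can be tried directly in Python. This is only a small illustration using the built-in `map` and `functools.reduce`; the list `numbers` and the lambdas are made-up examples, not part of the original slides.

```python
from functools import reduce

numbers = [1, 2, 3, 4]  # illustrative input

# Map: apply a function to every element and get a list back.
squares = list(map(lambda n: n * n, numbers))

# Reduce: fold all elements into a single value.
total = reduce(lambda acc, n: acc + n, squares, 0)

print(squares)  # [1, 4, 9, 16]
print(total)    # 30
```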

Definition
  • MapReduce: a software framework that supports the processing of massive data sets across distributed computers

Sample Use Case
  • Back-end credit card processor
  • Nightly processing of millions of transactions
  • Processing requires grouping, sorting, and merchant-wide analysis
  • Can't just divide the overall list into equal parts, because further analysis is necessary
  • Tight processing window

Description
  • Simple, powerful programming model
  • Language independent
  • Can run on a single machine, but shines for distributed computing and extreme datasets
  • Breaks the processing problem down into embarrassingly parallel atomic operations

Algorithm
  • Map Phase
      • Raw data is analyzed and converted to key/value pairs
  • Shuffle Phase
      • All key/value pairs are sorted and grouped by their keys
  • Reduce Phase
      • All values associated with a key are processed to produce results
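
The three phases can be sketched on a single machine as follows. This is a minimal in-memory illustration, not how any particular framework implements them; the helper names (`map_phase`, `shuffle_phase`, `reduce_phase`, `run_mapreduce`) and the merchant records are assumptions made up for this example.

```python
from collections import defaultdict

def map_phase(records, map_fn):
    """Convert each raw record into zero or more (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(map_fn(record))
    return pairs

def shuffle_phase(pairs):
    """Group all values by key, with the keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))

def reduce_phase(groups, reduce_fn):
    """Process all values associated with each key into one result."""
    return {key: reduce_fn(key, values) for key, values in groups.items()}

def run_mapreduce(records, map_fn, reduce_fn):
    return reduce_phase(shuffle_phase(map_phase(records, map_fn)), reduce_fn)

# Hypothetical input: "merchant,amount_in_cents" lines.
records = ["acme,1200", "bolt,500", "acme,300"]

def parse_transaction(line):
    merchant, amount = line.split(",")
    return [(merchant, int(amount))]

def total_amount(merchant, amounts):
    return sum(amounts)

totals = run_mapreduce(records, parse_transaction, total_amount)
print(totals)  # {'acme': 1500, 'bolt': 500}
```

In a distributed setting each phase would run across many machines, but the data flow between the phases stays the same.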

MapReduce Walk-Through
  • Goal: construct a word-frequency count of all the words in Wikipedia

Step 0: Split Data
  • Raw input data is divided into N parts
  • N > number of machines
  • The split must be context specific (for example, it must not cut a word or record in half)
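
One way to picture a context-specific split for the word-count goal is to cut only at word boundaries. The sketch below is an assumption about what such a split could look like; `split_text` is a hypothetical helper, and a real system would split files or byte ranges rather than an in-memory string.

```python
def split_text(text, n_parts):
    """Split text into at most n_parts slices without cutting a word in half."""
    words = text.split()
    if not words:
        return []
    size = -(-len(words) // n_parts)  # ceiling division: words per slice
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

text = "the quick brown fox jumps over the lazy dog"
slices = split_text(text, 4)
print(slices)  # ['the quick brown', 'fox jumps over', 'the lazy dog']
```

Choosing N larger than the number of machines lets a machine that finishes early pick up another slice.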

Step 1: Map
  • Each machine receives a single slice of the raw input for mapping
  • The map function processes the input slice and emits key/value pairs for the relevant data
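
For the word-frequency goal, a map function could emit a `(word, 1)` pair for every word it sees. The function name `map_words` and the simple regular expression used to find words are assumptions for illustration only.

```python
import re

def map_words(slice_text):
    """Emit a (word, 1) pair for every word in one input slice."""
    return [(word, 1) for word in re.findall(r"[a-z']+", slice_text.lower())]

pairs = map_words("The cat saw the dog")
print(pairs)
# [('the', 1), ('cat', 1), ('saw', 1), ('the', 1), ('dog', 1)]
```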

Step 2: Shuffle
  • The results of the map phase are sorted and grouped by the key in each key/value pair
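
A minimal sketch of the shuffle step, assuming the mapped pairs fit in memory: sort by key, then collect the values for each key. The function name `shuffle` and the sample pairs are hypothetical.

```python
from itertools import groupby
from operator import itemgetter

def shuffle(pairs):
    """Sort pairs by key and group the values that share a key."""
    ordered = sorted(pairs, key=itemgetter(0))
    return [(key, [value for _, value in group])
            for key, group in groupby(ordered, key=itemgetter(0))]

pairs = [("the", 1), ("cat", 1), ("saw", 1), ("the", 1), ("dog", 1)]
grouped = shuffle(pairs)
print(grouped)
# [('cat', [1]), ('dog', [1]), ('saw', [1]), ('the', [1, 1])]
```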

Step 3: Reduce
  • Results from the shuffle phase are divided into M parts
  • M > number of machines
  • Each machine runs a reduction method on a part of the shuffle results
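
The reduce step can be sketched by assigning each key to one of M parts and summing the values per key. Partitioning by a CRC32 checksum of the key is an assumption chosen here because it is deterministic; the slides do not say how the parts are chosen. `partition` and `reduce_counts` are hypothetical names.

```python
import zlib

def partition(grouped, m_parts):
    """Assign each (key, values) group to one of m_parts by a hash of the key."""
    parts = [[] for _ in range(m_parts)]
    for key, values in grouped:
        index = zlib.crc32(key.encode("utf-8")) % m_parts
        parts[index].append((key, values))
    return parts

def reduce_counts(part):
    """Reduction method for word counting: sum the values for each key."""
    return {key: sum(values) for key, values in part}

grouped = [("cat", [1]), ("dog", [1]), ("saw", [1]), ("the", [1, 1])]

counts = {}
for part in partition(grouped, 3):     # each part could run on its own machine
    counts.update(reduce_counts(part))

print(sorted(counts.items()))
# [('cat', 1), ('dog', 1), ('saw', 1), ('the', 2)]
```

Because every key lands in exactly one part, the per-part results can simply be merged without further coordination.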

MapReduce Benefits
  • Scale
      • Processing speed increases with the number of machines involved
  • Reliable
      • Loss of any one machine doesn't stop processing
  • Cost
      • Often built from heterogeneous commodity-grade computers

Use Case Results
  • Processing time for 1 million records
      • Originally ~3 hours
      • Reduced to 40 minutes on 5 computers
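
The figures above imply a speedup and a parallel efficiency that are easy to work out. The calculation below treats "~3 hours" as exactly 180 minutes, which is an approximation, so the results are rough.

```python
original_minutes = 3 * 60   # "~3 hours", taken as 180 minutes
reduced_minutes = 40
machines = 5

speedup = original_minutes / reduced_minutes  # how many times faster
efficiency = speedup / machines               # fraction of ideal linear speedup

print(f"speedup: {speedup:.1f}x")        # speedup: 4.5x
print(f"efficiency: {efficiency:.0%}")   # efficiency: 90%
```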

Other MapReduce Installations
  • Google – Index building
  • Visa – Transaction processing
  • Facebook – Facebook Lexicon
  • Intelligence community
  • Yahoo/Google – Terabyte Sort
      • 10 billion 100-byte records
      • Yahoo: 910 nodes, 206 seconds
      • Google: ~1,000 nodes, 68 seconds
Questions
