Map reduce


  1. MAP REDUCE PATTERN (cafe.naver.com/architect1) (itmentor@gmail.com)
  2. • Automatic parallelization & distribution • Fault-tolerant • Provides status and monitoring tools • Clean abstraction for programmers
  3. MAP REDUCE • Google ◦ Map reduce ◦ Page rank, crawler, Google Maps • Hadoop ◦ Map function, reduce function • Qizmt ◦ C# map function, reduce function • etc. ◦ C++, C#, Java, Haskell ◦ http://en.wikipedia.org/wiki/MapReduce
  4. MAP map f lst: ('a -> 'b) -> ('a list) -> ('b list) Applies f to every element of lst. In MapReduce, map produces <key, value> pairs.
  5. REDUCE (= fold, accumulate, compress, inject) fold f x0 lst: ('a * 'b -> 'b) -> 'b -> ('a list) -> 'b Applies f to each element of lst together with an accumulator. In MapReduce, reduce combines the values that share a key.
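The two signatures on slides 4 and 5 can be sketched in C++ roughly as follows. This is a minimal illustration, not code from the deck: the names `map_list`, `fold`, `to_pair`, and `add_count` are made up here, and lists are assumed to be `std::vector`. Note that `fold` takes the element first and the accumulator second, matching `'a * 'b -> 'b`.

```cpp
#include <cassert>
#include <string>
#include <type_traits>
#include <utility>
#include <vector>

// map f lst: ('a -> 'b) -> 'a list -> 'b list
// Applies f to every element and collects the results in order.
template <typename A, typename F>
auto map_list(F f, const std::vector<A>& lst) {
    using B = std::decay_t<decltype(f(lst.front()))>;
    std::vector<B> out;
    out.reserve(lst.size());
    for (const A& x : lst) out.push_back(f(x));
    return out;
}

// fold f x0 lst: ('a * 'b -> 'b) -> 'b -> 'a list -> 'b
// Threads an accumulator (starting at x0) through the list.
template <typename A, typename B, typename F>
B fold(F f, B x0, const std::vector<A>& lst) {
    B acc = std::move(x0);
    for (const A& x : lst) acc = f(x, acc);
    return acc;
}

// Hypothetical helpers: turn a word into a <key, value> pair,
// and add a pair's value to a running total.
std::pair<std::string, int> to_pair(const std::string& word) {
    return {word, 1};
}

int add_count(const std::pair<std::string, int>& kv, int acc) {
    return kv.second + acc;
}
```

With these helpers, `map_list(to_pair, words)` turns a list of words into `<word, 1>` pairs, and `fold(add_count, 0, pairs)` sums the values, which is the shape the word count example later in the deck relies on.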
  6. MAPREDUCE ? • Job configuration • Input format • Input location • Map function • Reduce function • Output format • Output location
     Figure 2-1. Parts of a MapReduce job (from Chapter 2, "The Basics of a MapReduce Job"). Provided by the user: job configuration, input format, input locations, map function, number of reduce tasks, reduce function, output key type, output value type, output format, output location. Provided by the Hadoop framework: input splitting & distribution, start of individual map tasks, shuffle and partition/sort per map output, merge sort of map outputs for each reduce task, start of individual reduce tasks, collection of final output. The user is responsible for handling the job setup.
  7. MAP REDUCE Logical Flow: Input -> Map -> Shuffle -> Reduce -> Output 1. The output of map() is grouped by key, and each key is passed to Reduce. 2. map() and reduce() both work on (key,val) pairs.
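The logical flow on slide 7 can be sketched as a single-process word count. This is only an in-memory illustration under assumptions: nothing is distributed, and the names `map_record`, `shuffle_by_key`, `reduce_values`, and `run_word_count` are invented for this sketch rather than taken from any framework.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

using KeyValue = std::pair<std::string, int>;

// Map: one input record (a line of text) -> a list of (word, 1) pairs.
std::vector<KeyValue> map_record(const std::string& line) {
    std::vector<KeyValue> out;
    std::istringstream words(line);
    std::string word;
    while (words >> word) out.emplace_back(word, 1);
    return out;
}

// Shuffle: group all intermediate values by key.
std::map<std::string, std::vector<int>> shuffle_by_key(
        const std::vector<KeyValue>& pairs) {
    std::map<std::string, std::vector<int>> grouped;
    for (const KeyValue& kv : pairs) grouped[kv.first].push_back(kv.second);
    return grouped;
}

// Reduce: collapse the values of one key into a single count.
int reduce_values(const std::vector<int>& values) {
    int total = 0;
    for (int v : values) total += v;
    return total;
}

// Input -> Map -> Shuffle -> Reduce -> Output, all in one process.
std::map<std::string, int> run_word_count(
        const std::vector<std::string>& input) {
    std::vector<KeyValue> intermediate;
    for (const std::string& line : input) {
        std::vector<KeyValue> pairs = map_record(line);
        intermediate.insert(intermediate.end(), pairs.begin(), pairs.end());
    }
    std::map<std::string, int> output;
    for (const auto& entry : shuffle_by_key(intermediate)) {
        output[entry.first] = reduce_values(entry.second);
    }
    return output;
}
```

In a real framework the map calls and the reduce calls run on different machines, and the shuffle moves data between them; only the shape of the three phases is shown here.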
  8. MAP REDUCE Physical Flow
  9. MAP REDUCE Physical Flow Job
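  10. ? PROGRAM (each entry: what the Map function emits, then what the Reduce function emits)
      • Distributed Grep: Map emits matched lines; Reduce passes them through
      • Reverse Web link graph: Map emits <target, source>; Reduce emits <target, list(src)>
      • URL access count: Map emits <URL, 1>; Reduce emits <URL, total count>
      • Term-Vector per Host: Map emits <hostname, term-vector>; Reduce emits <hostname, all-term-vector>
      • Inverted Index: Map emits <word, doc id>; Reduce emits <word, list(doc id)>
      • Distributed Sort: Map emits <key,value>; Reduce passes them through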
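As one concrete case from slide 10, the inverted index entry could look roughly like this in C++. It is a single-process sketch under assumptions: document ids are taken to be positions in the input vector, duplicate ids are dropped and the list is sorted in the reduce step (the slide does not say either way), and the names `map_document`, `reduce_doc_ids`, and `build_inverted_index` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

using Posting = std::pair<std::string, int>;  // <word, doc id>

// Map: emit <word, doc id> for every word in one document.
std::vector<Posting> map_document(int doc_id, const std::string& text) {
    std::vector<Posting> out;
    std::istringstream words(text);
    std::string word;
    while (words >> word) out.emplace_back(word, doc_id);
    return out;
}

// Reduce: turn all doc ids seen for one word into a sorted list
// without duplicates (an assumption of this sketch).
std::vector<int> reduce_doc_ids(std::vector<int> doc_ids) {
    std::sort(doc_ids.begin(), doc_ids.end());
    doc_ids.erase(std::unique(doc_ids.begin(), doc_ids.end()), doc_ids.end());
    return doc_ids;
}

// Runs map over every document, groups by word, then reduces each group.
std::map<std::string, std::vector<int>> build_inverted_index(
        const std::vector<std::string>& docs) {
    std::map<std::string, std::vector<int>> grouped;
    for (std::size_t id = 0; id < docs.size(); ++id) {
        for (const Posting& p : map_document(static_cast<int>(id), docs[id])) {
            grouped[p.first].push_back(p.second);
        }
    }
    for (auto& entry : grouped) {
        entry.second = reduce_doc_ids(std::move(entry.second));
    }
    return grouped;
}
```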
  11. CLUSTER - HADOOP • Master ◦ Name node ◦ Job tracker • Slave (= Worker) ◦ Data node ◦ Task tracker
      Figure 3-2. A simple six-node cluster (from Chapter 3, "The Basics of Multimachine Clusters"): the Master runs the NameNode (http://master:50070/) and the JobTracker (http://master:50030/); Slave01 through Slave05 each run a DataNode and a TaskTracker.
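  12. MAP REDUCE - GOOGLE
      1. The input files are split into pieces of 16MB ~ 64MB each, and copies of the program are started on the machines of the cluster.
      2. One copy is the Master; the rest are Workers that the Master assigns work to (map task, reduce task). The master picks idle workers and gives each one a task.
      3. A worker with a Map task reads its input split and passes it to the map function, which produces intermediate key/value pairs.
      4. The buffered pairs are written to local disk, partitioned into one region per Reduce task. The locations of the pairs are sent to the master, and the master forwards them to the reduce workers.
      5. A reduce worker that the master has notified uses RPC to read the map workers' buffered data (intermediate key/value pairs). It then sorts the data by intermediate key, using an external sort if the data is too large for memory.
      6. The reduce worker iterates over the sorted data and passes each key with its values to the reduce function. The output of reduce is appended to the final output file (for that reduce partition).
      7. When all map and reduce tasks are done, the master wakes up the user program, and the MapReduce call returns.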
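Step 4 above splits the intermediate pairs into one region per reduce task. The slide does not name the partitioning function; the sketch below assumes hash(key) mod R, which is the default described in the Google MapReduce paper. The names `region_for` and `partition_pairs` are hypothetical, and `std::hash` stands in for whatever hash a real system would use.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

using KeyValue = std::pair<std::string, std::string>;

// Assumed partitioning function: hash(key) mod R, where R is the
// number of reduce tasks. Equal keys always map to the same region.
std::size_t region_for(const std::string& key, std::size_t num_reduce_tasks) {
    return std::hash<std::string>{}(key) % num_reduce_tasks;
}

// Splits one map worker's buffered pairs into R regions, one per
// reduce task. In the real system each region is written to local disk.
std::vector<std::vector<KeyValue>> partition_pairs(
        const std::vector<KeyValue>& pairs, std::size_t num_reduce_tasks) {
    std::vector<std::vector<KeyValue>> regions(num_reduce_tasks);
    for (const KeyValue& kv : pairs) {
        regions[region_for(kv.first, num_reduce_tasks)].push_back(kv);
    }
    return regions;
}
```

Because the region depends only on the key, every pair with a given key ends up at the same reduce worker no matter which map worker produced it, which is what lets step 6 see all values for a key together.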
  13. ? • (DFS) • Google MapReduce - Bigtable • Hadoop - HBase • Hypertable (commercial)
  14. EXAMPLE SOURCE CODE Google MapReduce example: Word count
  15. EXAMPLE - WORDCOUNT
      A Word Frequency (To appear in OSDI 2004)
      This section contains a program that counts the number of occurrences of each unique word in a set of input files specified on the command line.

      #include "mapreduce/mapreduce.h"

      // User's map function
      class WordCounter : public Mapper {
       public:
        virtual void Map(const MapInput& input) {
          const string& text = input.value();
          const int n = text.size();
          for (int i = 0; i < n; ) {
            // Skip past leading whitespace
            while ((i < n) && isspace(text[i]))
              i++;

            // Find word end
            int start = i;
            while ((i < n) && !isspace(text[i]))
              i++;
            if (start < i)
              Emit(text.substr(start,i-start),"1");
          }
        }
      };
      REGISTER_MAPPER(WordCounter);

      // User's reduce function
      class Adder : public Reducer {
        virtual void Reduce(ReduceInput* input) {
          // Iterate over all entries with the
          // same key and add the values
          int64 value = 0;
          while (!input->done()) {
            value += StringToInt(input->value());
            input->NextValue();
          }

          // Emit sum for input->key()
          Emit(IntToString(value));
        }
      };
      REGISTER_REDUCER(Adder);

      int main(int argc, char** argv) {
        ParseCommandLineFlags(argc, argv);

        MapReduceSpecification spec;

        // Store list of input files into "spec"
        for (int i = 1; i < argc; i++) {
          MapReduceInput* input = spec.add_input();
          input->set_format("text");
          input->set_filepattern(argv[i]);
          input->set_mapper_class("WordCounter");
        }

        // Specify the output files:
        //   /gfs/test/freq-00000-of-00100
        //   /gfs/test/freq-00001-of-00100
        //   ...
        MapReduceOutput* out = spec.output();
        out->set_filebase("/gfs/test/freq");
        out->set_num_tasks(100);
        out->set_format("text");
        out->set_reducer_class("Adder");

        // Optional: do partial sums within map
        // tasks to save network bandwidth
        out->set_combiner_class("Adder");

        // Tuning parameters: use at most 2000
        // machines and 100 MB of memory per task
        spec.set_machines(2000);
        spec.set_map_megabytes(100);
        spec.set_reduce_megabytes(100);

        // Now run it
        MapReduceResult result;
        if (!MapReduce(spec, &result)) abort();

        // Done: 'result' structure contains info
        // about counters, time taken, number of
        // machines used, etc.

        return 0;
      }
  16. QIZMT Qizmt - Map reduce framework on Windows
  17. QIZMT FEATURES
  18. CORE MYSPACE QIZMT FEATURES • C# mapreducer jobs • Built-in IDE/Debugger • mapreducer job / / / • Delta-only exchange option for Mapreduce jobs • Easily add machines to a cluster to increase processing power and capacity • CAC (Cluster Assembly Cache) for exposing .Net DLLs to mapreduce jobs • Job ◦ Mapreduce ◦ Remote ◦ Local - For orchestrating a pipeline of Mapreducer and Remote jobs • Shuffle ◦ Sorted - Shuffle, Key ◦ Grouped ◦ Hashsorted - core hashtable, Key
      Flow: Input -> Map -> Shuffle (Sorted / Grouped / Hashsorted) -> Reduce -> Output. map() and reduce() work on (key,val) pairs.
  19. EXAMPLE - WORD COUNT
  20. QIZMT EXAMPLE WORDCOUNT
  21. • Hadoop • C++ map, reduce • But, cygwin • Qizmt • Master • IDE
  22. Q&A
