Map Reduce


  • The framework (HDFS plus the MapReduce runtime) takes care of the details of data partitioning, scheduling program execution, handling machine failures, and handling inter-machine communication and data transfers.
  • Pool commodity servers in a single hierarchical namespace. Designed for large files that are written once and read many times. Example here shows what happens with a replication factor of 3: each data block is present in at least 3 separate data nodes. Typical Hadoop node is eight cores with 16 GB RAM and four 1 TB SATA disks. Default block size is 64 MB, though most folks now set it to 128 MB.

    1. MapReduce
       Rahul Agarwal
    2. Attributions
        • Dean, Ghemawat:
    3. Agenda
        • Hadoop Principles
        • MapReduce
        • Programming model
        • Examples
        • Execution
        • Refinements
        • Q&A
    10. Hadoop principles and MapReduce
        • HDFS: Fault-tolerant high-bandwidth clustered storage
        • Automatically and transparently route around failure
        • Master (NameNode) and slave architecture
        • Speculatively execute redundant tasks if certain nodes are detected to be slow
        • Move compute to data
        • Lower latency, lower bandwidth consumption
    11. HDFS: Hadoop Distributed FS
        • Block Size = 64 MB
        • Replication Factor = 3
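To make the block size and replication factor concrete, here is a minimal arithmetic sketch. The function name `hdfs_footprint` and the 1 GB example file are assumptions for illustration, not part of the slides or of any Hadoop API.

```python
MB = 1024 * 1024

def hdfs_footprint(file_bytes, block_bytes=64 * MB, replication=3):
    """Return (blocks, block_replicas, raw_bytes) for a single file.

    Hypothetical helper: it only does the arithmetic implied by the
    slide's block size and replication factor.
    """
    # Ceiling division: a trailing partial block still counts as a block.
    blocks = -(-file_bytes // block_bytes)
    # Every block is stored `replication` times across data nodes.
    block_replicas = blocks * replication
    # Raw disk usage is the file size times the replication factor.
    raw_bytes = file_bytes * replication
    return blocks, block_replicas, raw_bytes

blocks, replicas, raw = hdfs_footprint(1024 * MB)
print(blocks, replicas, raw // MB)  # 16 48 3072
```

With these defaults a 1 GB file becomes 16 blocks, 48 stored block replicas, and 3 GB of raw disk.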
    12. MapReduce
        • Patented by Google
        • “programming model… for processing and generating large data sets”
        • Allows such programs to be “automatically parallelized and executed on a large cluster”
        • Works with structured and unstructured data
        • Map function processes a key/value pair to generate a set of intermediate key/value pairs
        • Reduce function merges all intermediate values with the same intermediate key
    13. MapReduce
        map (in_key, in_value) -> list(intermediate_key, intermediate_value)
        reduce (intermediate_key, list(intermediate_value)) -> list(out_value)
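The two signatures can be exercised with a tiny single-process driver. This is only a sketch of the programming model, not Hadoop's or Google's implementation; `run_mapreduce`, `parity_map`, and `sum_reduce` are hypothetical names chosen for the example.

```python
from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn):
    """Run map_fn and reduce_fn over (in_key, in_value) pairs in one process."""
    # Map phase: collect every intermediate value under its intermediate key.
    groups = defaultdict(list)
    for in_key, in_value in inputs:
        for inter_key, inter_value in map_fn(in_key, in_value):
            groups[inter_key].append(inter_value)
    # Reduce phase: one reduce call per intermediate key, in sorted key order.
    return {key: list(reduce_fn(key, groups[key])) for key in sorted(groups)}

def parity_map(name, number):
    # Emit the number under the intermediate key "even" or "odd".
    yield ("even" if number % 2 == 0 else "odd"), number

def sum_reduce(parity, numbers):
    # Merge all values that share one intermediate key.
    yield sum(numbers)

result = run_mapreduce([("a", 1), ("b", 2), ("c", 3), ("d", 4)],
                       parity_map, sum_reduce)
print(result)  # {'even': [6], 'odd': [4]}
```

A real cluster runs the map and reduce calls on different machines and shuffles the groups over the network; the grouping step here stands in for that shuffle.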
    14. Example: count word occurrences
        map (String key, String value):
            // key: document name
            // value: document contents
            for each word w in value:
                EmitIntermediate(w, "1");
        reduce (String key, Iterator values):
            // key: a word
            // values: a list of counts
            int result = 0;
            for each v in values:
                result += ParseInt(v);
            Emit(AsString(result));
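A runnable single-process version of the word count pseudocode might look like the following. It assumes words are separated by whitespace, and `wc_map`, `wc_reduce`, and `word_count` are illustrative names; counts are kept as strings only to mirror the slide.

```python
from collections import defaultdict

def wc_map(doc_name, contents):
    # Emit (word, "1") for every whitespace-separated word.
    for word in contents.split():
        yield word, "1"

def wc_reduce(word, counts):
    # Sum the string counts for one word and return the total as a string.
    return str(sum(int(c) for c in counts))

def word_count(documents):
    # Group intermediate values by word (the shuffle), then reduce each group.
    grouped = defaultdict(list)
    for name, contents in documents.items():
        for word, count in wc_map(name, contents):
            grouped[word].append(count)
    return {word: wc_reduce(word, counts) for word, counts in grouped.items()}

counts = word_count({"a.txt": "the cat sat", "b.txt": "the dog"})
print(counts["the"], counts["dog"])  # 2 1
```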
    15. Example: distributed grep
        map (String key, String value):
            // key: document name
            // value: document contents
            for each line in value:
                if line.match(pattern):
                    EmitIntermediate(key, line);
        reduce (String key, Iterator values):
            // key: document name
            // values: a list of matching lines
            for each v in values:
                Emit(v);
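The grep example can be sketched in Python as below. It assumes the pattern is a regular expression matched anywhere in the line with `re.search`; the slide does not say which matching semantics `line.match` has. All function names are illustrative.

```python
import re
from collections import defaultdict

def grep_map(doc_name, contents, pattern):
    # Emit (document name, line) for every line that matches the pattern.
    for line in contents.splitlines():
        if re.search(pattern, line):
            yield doc_name, line

def grep_reduce(doc_name, lines):
    # Identity reduce: pass the matching lines through unchanged.
    yield from lines

def distributed_grep(documents, pattern):
    # Group matches by document name, then apply the identity reduce.
    grouped = defaultdict(list)
    for name, contents in documents.items():
        for key, line in grep_map(name, contents, pattern):
            grouped[key].append(line)
    return {name: list(grep_reduce(name, lines))
            for name, lines in grouped.items()}

docs = {"log1": "ok\nerror: disk\nok", "log2": "error: net\nfine"}
matches = distributed_grep(docs, r"^error")
print(matches)  # {'log1': ['error: disk'], 'log2': ['error: net']}
```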
    16. Example: URL access frequency
        map (String key, String value):
            // key: log name
            // value: log contents
            for each line in value:
                EmitIntermediate(URL(line), "1");
        reduce (String key, Iterator values):
            // key: URL
            // values: list of counts
            int result = 0;
            for each v in values:
                result += ParseInt(v);
            Emit(AsString(result));
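A possible Python rendering of the URL access frequency job follows. The log format (`<timestamp> <method> <url>` per line) and the helper `extract_url`, which plays the role of `URL(line)`, are assumptions; the slides do not specify either.

```python
from collections import defaultdict

def extract_url(line):
    # Assumed log format: "<timestamp> <method> <url>", space separated.
    return line.split()[2]

def url_map(log_name, contents):
    # Emit (url, 1) for every non-empty log line.
    for line in contents.splitlines():
        if line.strip():
            yield extract_url(line), 1

def url_reduce(url, counts):
    # Total number of accesses for one URL.
    return sum(counts)

def url_frequency(logs):
    # Group the emitted counts by URL, then reduce each group.
    grouped = defaultdict(list)
    for name, contents in logs.items():
        for url, count in url_map(name, contents):
            grouped[url].append(count)
    return {url: url_reduce(url, counts) for url, counts in grouped.items()}

logs = {"access.log": "t1 GET /home\nt2 GET /about\nt3 POST /home\n"}
freq = url_frequency(logs)
print(freq)  # {'/home': 2, '/about': 1}
```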
    17. Example: Reverse web-link graph
        map (String key, String value):
            // key: source document name
            // value: document contents
            for each link in value:
                EmitIntermediate(link, key);
        reduce (String key, Iterator values):
            // key: each target link
            // values: list of sources
            source_list = empty list;
            for each v in values:
                source_list.add(v);
            Emit(AsPair(key, source_list));
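The reverse web-link graph can be sketched the same way. Link extraction here uses a simple `href="..."` regular expression, which is an assumption for the example and not a robust HTML parser; the function names are illustrative.

```python
import re
from collections import defaultdict

HREF = re.compile(r'href="([^"]+)"')

def links_map(source, contents):
    # Emit (target, source) for every link found in the source document.
    for target in HREF.findall(contents):
        yield target, source

def links_reduce(target, sources):
    # Pair each target with the sorted list of documents linking to it.
    return target, sorted(sources)

def reverse_links(pages):
    # Group sources by target, then build one (target, sources) pair each.
    grouped = defaultdict(list)
    for source, contents in pages.items():
        for target, src in links_map(source, contents):
            grouped[target].append(src)
    return dict(links_reduce(t, s) for t, s in grouped.items())

pages = {
    "a.html": '<a href="b.html">b</a> <a href="c.html">c</a>',
    "b.html": '<a href="c.html">c</a>',
}
graph = reverse_links(pages)
print(graph)  # {'b.html': ['a.html'], 'c.html': ['a.html', 'b.html']}
```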
    18. Execution
    19. Execution Optimization
        • Locality
            • Network bandwidth is scarce
            • Compute on local copies, which are distributed by HDFS
        • Task Granularity
            • Ratio of map tasks (M) to reduce tasks (R)
            • Ideally M and R should be much larger than the number of machines in the cluster
            • Typically M is chosen so that there is one map task per 64 MB block; R is a small multiple of the number of machines
            • E.g. M = 200,000 and R = 5,000 with 2,000 machines
        • Backup Tasks
            • Re-run the tasks of ‘straggling’ workers on other machines
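The task granularity rule of thumb can be written as a small calculation. This is a sketch only: `plan_tasks` is a hypothetical helper, and the default of 2.5 reduce tasks per machine is an assumption chosen so that the output reproduces the slide's figures (R = 5,000 on 2,000 machines).

```python
MB = 1024 * 1024

def plan_tasks(input_bytes, machines, split_bytes=64 * MB,
               reduces_per_machine=2.5):
    """Return (M, R): the number of map tasks and reduce tasks."""
    # One map task per input split (ceiling division covers a partial split).
    m = -(-input_bytes // split_bytes)
    # R is a small multiple of the number of worker machines.
    r = int(machines * reduces_per_machine)
    return m, r

# Input sized so that it splits into exactly 200,000 blocks of 64 MB.
m, r = plan_tasks(200_000 * 64 * MB, machines=2_000)
print(m, r, m // 2_000)  # 200000 5000 100
```

With these numbers each machine handles about 100 map tasks over the life of the job, which is what lets fast machines take on extra work and makes recovery from a failed worker cheap.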
    20. Refinements
        • Partitioning Function
            • How to distribute the intermediate results to Reduce workers?
            • Default: hash(key) mod R
            • E.g. hash(Hostname(URL)) mod R, so that all URLs from one host go to the same reduce task
        • Combiner Function
            • Partial merging of data before the reduce step
            • Saves bandwidth
            • E.g. lots of <the,1> in word counting
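Both refinements are easy to sketch. The code below assumes `zlib.crc32` as the hash function because it is stable across runs, unlike Python's built-in `hash()` on strings; the slides do not name a specific hash. The function names are illustrative.

```python
import zlib
from collections import Counter
from urllib.parse import urlparse

def default_partition(key, num_reducers):
    # hash(key) mod R, with crc32 standing in for the hash function.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def host_partition(url, num_reducers):
    # hash(Hostname(URL)) mod R: URLs on the same host share a partition.
    host = urlparse(url).hostname or ""
    return default_partition(host, num_reducers)

def combine_word_counts(pairs):
    # Combiner: merge (word, count) pairs produced by one map task
    # before they are sent over the network to the reducers.
    combined = Counter()
    for word, count in pairs:
        combined[word] += count
    return list(combined.items())

R = 4
same_host = (host_partition("http://example.com/a", R)
             == host_partition("http://example.com/b?x=1", R))
print(same_host)  # True

merged = combine_word_counts([("the", 1), ("cat", 1), ("the", 1), ("the", 1)])
print(merged)  # [('the', 3), ('cat', 1)]
```

The combiner is only safe here because addition is associative and commutative, so partial sums on the map side give the same final total.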
    21. Refinements
        • Ordering
            • Process values and produce ordered results
        • Strong typing
            • Strongly type input/output values
        • Skip bad records
            • Skip records that consistently cause failures
        • Counters
            • Shared counters
            • May be updated from any map/reduce worker
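Counters and bad-record skipping can be illustrated together in one process. This is a simplified sketch: a real cluster aggregates counters from many workers and decides to skip a record only after it has failed repeatedly, whereas here a single `ValueError` is enough. The names `run_map_with_skipping` and `parse_int_map` are hypothetical.

```python
from collections import Counter

def run_map_with_skipping(records, map_fn, counters):
    """Apply map_fn to each record, skipping records that raise ValueError."""
    output = []
    for record in records:
        try:
            # Materialize first so a failing record contributes no partial output.
            pairs = list(map_fn(record))
        except ValueError:
            counters["records_skipped"] += 1
            continue
        output.extend(pairs)
        counters["records_ok"] += 1
    return output

def parse_int_map(record):
    # Raises ValueError on records that are not integers.
    yield "sum", int(record)

counters = Counter()
pairs = run_map_with_skipping(["1", "2", "oops", "4"], parse_int_map, counters)
print(pairs)           # [('sum', 1), ('sum', 2), ('sum', 4)]
print(dict(counters))  # {'records_ok': 3, 'records_skipped': 1}
```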