MapReduce

Speaker notes
  • HDFS takes care of the details of data partitioning, scheduling program execution, handling machine failures, and handling inter-machine communication and data transfers.
  • Pool commodity servers in a single hierarchical namespace. Designed for large files that are written once and read many times. The example here shows what happens with a replication factor of 3: each data block is present in at least 3 separate data nodes. A typical Hadoop node is eight cores with 16 GB RAM and four 1 TB SATA disks. The default block size is 64 MB, though most folks now set it to 128 MB.
  • Transcript

    • 1. MapReduce
      Rahul Agarwal
      irahul.com
    • 2. Dean, Ghemawat: http://labs.google.com/papers/mapreduce.html
      Attributions
    • 3. Agenda
    • 10. HDFS: Fault-tolerant high-bandwidth clustered storage
      Automatically and transparently route around failure
      Master (NameNode) – Slave (DataNode) architecture
      Speculatively execute redundant tasks if certain nodes are detected to be slow
      Move compute to data
      Lower latency, less bandwidth consumed
      Hadoop principles and MapReduce
    • 11. HDFS: Hadoop Distributed FS
      Block Size = 64MB
      Replication Factor = 3
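      These two values map onto real HDFS configuration knobs; as a rough sketch of the corresponding hdfs-site.xml entries on a cluster of that era (values illustrative, not from the deck):

      <!-- hdfs-site.xml: illustrative values matching the slide -->
      <configuration>
        <property>
          <name>dfs.replication</name>
          <value>3</value> <!-- each block stored on 3 DataNodes -->
        </property>
        <property>
          <name>dfs.block.size</name>
          <value>67108864</value> <!-- 64 MB, in bytes -->
        </property>
      </configuration>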
    • 12. Patented by Google
      “programming model… for processing and generating large data sets”
      Allows such programs to be “automatically parallelized and executed on a large cluster”
      Works with structured and unstructured data
      Map function processes a key/value pair to generate a set of intermediate key/value pairs
      Reduce function merges all intermediate values with the same intermediate key
      MapReduce
    • 13. map (in_key, in_value) -> list(intermediate_key, intermediate_value)
      reduce (intermediate_key, list(intermediate_value)) -> list(out_value)
      MapReduce
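      To make these signatures concrete, here is a minimal single-process Python sketch of the same flow (map, then shuffle/group by key, then reduce). The names run_mapreduce, map_fn, and reduce_fn are illustrative only; they are not part of Hadoop or Google's implementation:

      from collections import defaultdict

      def run_mapreduce(inputs, map_fn, reduce_fn):
          # Map phase: each (in_key, in_value) pair yields intermediate
          # (key, value) pairs.
          intermediate = defaultdict(list)
          for in_key, in_value in inputs:
              for key, value in map_fn(in_key, in_value):
                  intermediate[key].append(value)
          # The grouping above plays the role of the shuffle; reduce then
          # merges all values that share an intermediate key.
          return {key: list(reduce_fn(key, values))
                  for key, values in intermediate.items()}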
    • 14. Example: count word occurrences
      map (String key, String value):
        // key: document name
        // value: document contents
        for each word w in value:
          EmitIntermediate(w, "1");
      reduce (String key, Iterator values):
        // key: a word
        // values: a list of counts
        result = 0;
        for each v in values:
          result += ParseInt(v);
        Emit(AsString(result));
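      The same logic as runnable Python, reusing the run_mapreduce sketch from above (an illustration, not a real Hadoop API):

      def map_wordcount(doc_name, contents):
          # emit <word, 1> for every word in the document
          for word in contents.split():
              yield word, 1

      def reduce_wordcount(word, counts):
          # merge the partial counts for one word
          yield sum(counts)

      print(run_mapreduce([("doc1", "the quick fox and the dog")],
                          map_wordcount, reduce_wordcount))
      # {'the': [2], 'quick': [1], 'fox': [1], 'and': [1], 'dog': [1]}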
    • 15. Example: distributed grep
      map (String key, String value):
        // key: document name
        // value: document contents
        for each line in value:
          if line.match(pattern)
            EmitIntermediate(key, line);
      reduce (String key, Iterator values):
        // key: document name
        // values: a list of matching lines
        for each v in values:
          Emit(v);
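      A runnable Python version of the same job, again on top of run_mapreduce; the hard-coded pattern is a simplification of this sketch (a real job would receive it as configuration):

      def map_grep(doc_name, contents, pattern="error"):
          # emit <document, line> for every line that matches the pattern
          for line in contents.splitlines():
              if pattern in line:
                  yield doc_name, line

      def reduce_grep(doc_name, lines):
          # identity reduce: matching lines pass straight through
          for line in lines:
              yield line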
    • 16. Example: URL access frequency
      map (String key, String value):
        // key: log name
        // value: log contents
        for each line in value:
          EmitIntermediate(URL(line), "1");
      reduce (String key, Iterator values):
        // key: URL
        // values: a list of counts
        result = 0;
        for each v in values:
          result += ParseInt(v);
        Emit(AsString(result));
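      In Python, assuming (purely for this sketch) that the requested URL is the first whitespace-separated field of each log line:

      def map_url_freq(log_name, log_contents):
          # emit <URL, 1> for every request in the log
          for line in log_contents.splitlines():
              fields = line.split()
              if fields:
                  yield fields[0], 1

      def reduce_url_freq(url, counts):
          # merge the partial counts for one URL
          yield sum(counts)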
    • 17. Example: reverse web-link graph
      map (String key, String value):
        // key: source document name
        // value: document contents
        for each link in value:
          EmitIntermediate(link, key);
      reduce (String key, Iterator values):
        // key: a target link
        // values: a list of sources
        source_list = [];
        for each v in values:
          source_list.add(v);
        Emit(AsPair(key, source_list));
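      In Python, with a simple href regex standing in for real link extraction (an assumption of this sketch, not how production crawlers parse pages):

      import re

      def map_reverse_links(source, contents):
          # emit <target, source> for every outgoing link on the page
          for target in re.findall(r'href="([^"]+)"', contents):
              yield target, source

      def reduce_reverse_links(target, sources):
          # collect every page that links to this target
          yield sorted(sources)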
    • 18. Execution
    • 19. Locality
      Network bandwidth is scarce
      Compute on local copies of the data, which HDFS has already distributed
      Task Granularity
      Ratio of Map (M) to Reduce (R) workers
      Ideally M and R are much larger than the number of machines in the cluster
      Typically M is chosen so that each task handles one 64MB block; R is a small multiple of the worker count
      E.g. M = 200,000 and R = 5,000 with 2,000 machines
      Backup Tasks
      Run backup copies of in-progress tasks to work around ‘straggling’ workers
      Execution Optimization
    • 20. Partitioning Function
      How to distribute the intermediate results to Reduce workers?
      Default: hash(key) mod R
      E.g. hash(Hostname(URL)) mod R
      Combiner Function
      Partial merging of data before the reduce step
      Saves bandwidth
      E.g. the many <the, 1> pairs in word counting
      Refinements
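      A small Python sketch of both refinements: partition mirrors the default hash(key) mod R rule, and combine_wordcounts performs the map-side partial merge (Python's built-in hash merely stands in for a stable hash function):

      from collections import Counter

      def partition(key, R):
          # default partitioner: route an intermediate key to one of R reducers
          return hash(key) % R

      def combine_wordcounts(pairs):
          # combiner: pre-sum counts per word on the map worker, turning many
          # <the, 1> pairs into a single <the, n> pair before the shuffle
          combined = Counter()
          for word, count in pairs:
              combined[word] += count
          return combined.items()

      for word, count in combine_wordcounts([("the", 1), ("quick", 1), ("the", 1)]):
          print(word, count, "-> reduce worker", partition(word, 4))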
    • 21. Ordering
      Within each partition, process intermediate keys in increasing order, so output is sorted
      Strong typing
      Strongly type input/output values
      Skip bad records
      Optionally skip records that deterministically crash the map or reduce function
      Counters
      Shared counters
      May be updated from any map/reduce worker
      Refinements