  • HDFS takes care of the details of data partitioning, scheduling program execution, handling machine failures, and handling inter-machine communication and data transfers.
  • Pool commodity servers in a single hierarchical namespace. Designed for large files that are written once and read many times. The example here shows what happens with a replication factor of 3: each data block is present in at least 3 separate data nodes. A typical Hadoop node is eight cores with 16GB RAM and four 1TB SATA disks. The default block size is 64MB, though most folks now set it to 128MB.

Transcript

  • 1. MapReduce
    Rahul Agarwal
    irahul.com
  • 2. Dean, Ghemawat: http://labs.google.com/papers/mapreduce.html
    Attributions
  • 3. Agenda
  • 10. HDFS: Fault-tolerant high-bandwidth clustered storage
    Automatically and transparently route around failure
    Master (named-node) – Slave architecture
    Speculatively execute redundant tasks if certain nodes are detected to be slow
    Move compute to data
    Lower latency, lower bandwidth
    Hadoop principles and MapReduce
  • 11. HDFS: Hadoop Distributed FS
    Block Size = 64MB
    Replication Factor = 3
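As a back-of-the-envelope illustration of what those two defaults imply, here is a small Python sketch; the 1GB file size is a hypothetical example, not from the slides:

```python
import math

BLOCK_SIZE_MB = 64        # HDFS default block size from the slide
REPLICATION_FACTOR = 3    # default replication factor from the slide

file_size_mb = 1024       # hypothetical 1GB file, for illustration only

# HDFS splits the file into fixed-size blocks (the last block may be partial).
num_blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)

# Each block is stored on REPLICATION_FACTOR separate data nodes.
total_block_copies = num_blocks * REPLICATION_FACTOR

print(num_blocks)          # 16 blocks
print(total_block_copies)  # 48 block copies across the cluster
```

So replication triples the raw storage cost in exchange for surviving the loss of any two nodes holding a given block.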
  • 12. Patented by Google
    “programming model… for processing and generating large data sets”
    Allows such programs to be “automatically parallelized and executed on a large cluster”
    Works with structured and unstructured data
    Map function processes a key/value pair to generate a set of intermediate key/value pairs
    Reduce function merges all intermediate values with the same intermediate key
    MapReduce
  • 13. map (in_key, in_value) -> list(intermediate_key, intermediate_value)
    reduce (intermediate_key, list(intermediate_value)) -> list(out_value)
    MapReduce
  • 14. Example: count word occurrences
    map (String key, String value):
    //key: document name
    //value: document contents
    for each word w in value:
    EmitIntermediate(w, "1");
    reduce (String key, Iterator values):
    //key: a word
    //values: a list of counts
    int result = 0;
    for each v in values:
    result += ParseInt(v);
    Emit(AsString(result));
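The slide's pseudocode can be made runnable as a single-process Python sketch; the `groups` dictionary here stands in for the distributed shuffle that a real MapReduce runtime performs between the map and reduce phases:

```python
from collections import defaultdict

def map_word_count(key, value):
    # key: document name, value: document contents
    for w in value.split():
        yield (w, "1")

def reduce_word_count(key, values):
    # key: a word, values: a list of counts
    result = 0
    for v in values:
        result += int(v)
    return str(result)

# Minimal in-memory "shuffle": group intermediate pairs by key.
docs = {"doc1": "the quick brown fox", "doc2": "the lazy dog"}
groups = defaultdict(list)
for name, contents in docs.items():
    for k, v in map_word_count(name, contents):
        groups[k].append(v)

counts = {k: reduce_word_count(k, vs) for k, vs in groups.items()}
print(counts["the"])  # "2"
```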
  • 15. Example: distributed grep
    map (String key, String value):
    //key: document name
    //value: document contents
    for each line in value:
    if line.match(pattern)
    EmitIntermediate(key, line);
    reduce (String key, Iterator values):
    //key: document name
    //values: a list of lines
    for each v in values:
    Emit(v);
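A runnable Python sketch of the same idea follows. Passing `pattern` as an explicit parameter is a simplification for illustration; in a real MapReduce job the pattern would be fixed in the job configuration rather than threaded through the map call:

```python
import re

def map_grep(key, value, pattern):
    # key: document name, value: document contents
    for line in value.splitlines():
        if re.search(pattern, line):
            yield (key, line)

def reduce_grep(key, values):
    # Identity reduce: pass the matching lines through unchanged.
    return list(values)

docs = {"doc1": "error: disk full\nok\nerror: timeout", "doc2": "all good"}
matches = {}
for name, contents in docs.items():
    lines = [line for _, line in map_grep(name, contents, r"error")]
    if lines:
        matches[name] = reduce_grep(name, lines)
print(matches)  # {'doc1': ['error: disk full', 'error: timeout']}
```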
  • 16. Example: URL access frequency
    map (String key, String value):
    //key: log name
    //value: log contents
    for each line in value:
    EmitIntermediate(URL(line), "1");
    reduce (String key, Iterator values):
    //key: URL
    //values: list of counts
    int result = 0;
    for each v in values:
    result += ParseInt(v);
    Emit(AsString(result));
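A Python sketch of this example follows. The `url_of` helper and the `"<url> <status> <bytes>"` log format are hypothetical; the slide's `URL(line)` would depend on the actual log layout:

```python
from collections import defaultdict

def url_of(line):
    # Hypothetical log format: "<url> <status> <bytes>".
    return line.split()[0]

def map_url_freq(key, value):
    # key: log name, value: log contents
    for line in value.splitlines():
        yield (url_of(line), "1")

def reduce_url_freq(key, values):
    # key: URL, values: list of counts
    result = 0
    for v in values:
        result += int(v)
    return str(result)

log = "/index 200 512\n/about 200 128\n/index 304 0"
groups = defaultdict(list)
for k, v in map_url_freq("access.log", log):
    groups[k].append(v)
freq = {k: reduce_url_freq(k, vs) for k, vs in groups.items()}
print(freq)  # {'/index': '2', '/about': '1'}
```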
  • 17. Example: Reverse web-link graph
    map (String key, String value):
    //key: source document name
    //value: document contents
    for each link in value:
    EmitIntermediate(link, key);
    reduce (String key, Iterator values):
    //key: each target link
    //values: list of sources
    source_list = [];
    for each v in values:
    source_list.add(v);
    Emit(AsPair(key, source_list));
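A Python sketch of the reverse-link example follows. For brevity the map input here is already a parsed list of outgoing links, whereas the slide's map receives raw document contents and would extract the links itself; the sample link graph is hypothetical:

```python
from collections import defaultdict

def map_reverse_links(key, value):
    # key: source document name, value: list of outgoing links
    for link in value:
        yield (link, key)

def reduce_reverse_links(key, values):
    # key: target link, values: all documents that link to it
    source_list = []
    for v in values:
        source_list.append(v)
    return (key, source_list)

docs = {"a.html": ["b.html", "c.html"], "b.html": ["c.html"]}
groups = defaultdict(list)
for name, links in docs.items():
    for k, v in map_reverse_links(name, links):
        groups[k].append(v)
reverse = {k: reduce_reverse_links(k, vs)[1] for k, vs in groups.items()}
print(reverse["c.html"])  # ['a.html', 'b.html']
```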
  • 18. Execution
  • 19. Locality
    Network bandwidth is scarce
    Compute on local copies, which are distributed by HDFS
    Task Granularity
    Ratio of Map (M) to Reduce (R) workers
    Ideally M and R should be much larger than the number of machines in the cluster
    Typically M is chosen so there is one task per 64MB block of input, and R is a small multiple of the number of machines
    Eg: 200,000 M for 5,000 R with 2,000 machines
    Backup Tasks
    ‘Straggling’ workers
    Execution Optimization
  • 20. Partitioning Function
    How to distribute the intermediate results to Reduce workers?
    Default: hash(key) mod R
    Eg: hash(Hostname(URL)) mod R
    Combiner Function
    Partial merging of data before the reduce step
    Save bandwidth
    Eg: Lots of <the,1> in word counting
    Refinements
  • 21. Ordering
    Process values and produce ordered results
    Strong typing
    Strongly type input/output values
    Skip bad records
    Skip consistently failing data
    Counters
    Shared counters
    May be updated from any map/reduce worker
    Refinements