Introduction to MapReduce and Hadoop


This was the first session about Hadoop and MapReduce. It introduces what Hadoop is and its main components. It also covers how to program your first MapReduce task and how to run it on a pseudo-distributed Hadoop installation.

This session was given in Arabic, and I may provide a video of the session soon.

Published in Software
  • Lucene is a full-featured text indexing and search library.
    Nutch tried to build a complete web search engine on top of Lucene, with a web crawler, an HTML parser, and so on.
    Problem: there are billions of web pages out there! What can poor Nutch do?
    > Google described GFS and MapReduce in papers published in 2003 and 2004, and said it was using these techniques in its search engine … really? :/ <
    Doug and his team used these techniques for Nutch, and then Hadoop was born.

    Doug Cutting
  • Challenges in processing Large Data in a distributed way.
  • Accessible: Hadoop runs on large clusters of commodity machines or on cloud computing services such as Amazon’s Elastic Compute Cloud (EC2).

    Robust: Because it is intended to run on commodity hardware, Hadoop is architected with the assumption of frequent hardware malfunctions. It can gracefully handle most such failures.

    Scalable: Hadoop scales linearly to handle larger data by adding more nodes to the cluster.

    Simple: Hadoop allows users to quickly write efficient parallel code.

    Hadoop in Action section 1.2
  • REF:
  • REF:
  • Table from “Hadoop In Action”
    Images source:
  • Pseudo-code for map and reduce functions for word counting
    Source: Hadoop In Action
  • We now have a general overview of MapReduce; let’s see how Hadoop works.
  • Hadoop In Action Figure 2.1
  • Local (standalone) mode:
      No HDFS.
      No Hadoop daemons.
      Used for debugging and testing the logic of a MapReduce program.
    Pseudo-distributed mode:
      All daemons run on a single machine.
      Used for debugging your code, allowing you to examine memory usage, HDFS input/output issues, and other daemon interactions.
    Fully distributed mode.
  • This slide is initially left blank.
  • When the reducer task receives the output from the various mappers, it sorts the
    incoming data on the key of the (key/value) pair and groups together all values of
    the same key.


  • 1. What to expect ● History. ● What is Hadoop. ● Hadoop vs SQL. ● MapReduce. ● Hadoop Building Blocks. ● Installing, Configuring and Running Hadoop. ● Anatomy of a MapReduce program.
  • 2. Hadoop Series Resources
  • 3. How was Hadoop born? Doug Cutting
  • 4. Challenges of Distributed Processing of Large Data ● How to distribute the work? ● How to store and distribute the data itself? ● How to overcome failures? ● How to balance the load? ● How to deal with unstructured data? ● ...
  • 5. Hadoop tackles these challenges! So, what’s Hadoop?
  • 6. What is Hadoop? Hadoop is an open source framework for writing and running distributed applications that process large amounts of data. Key distinctions of Hadoop: ● Accessible ● Robust ● Scalable ● Simple
  • 7. Hadoop vs SQL ● Structured and Unstructured data. ● Datastore and Data Analysis. ● Scale-out and Scale-up. ● Offline batch processing and Online transactions.
  • 8. Hadoop Uses MapReduce What is MapReduce?...
  • 9. What is MapReduce? ● Parallel programming model for clusters of commodity machines. ● MapReduce provides: o Automatic parallelization & distribution. o Fault tolerance. o Locality of data.
  • 10. MapReduce … Map then Reduce
  • 11. Keys and Values ● Key/Value pairs. ● Keys divide Reduce Space.
              Input             Output
      Map     <k1, v1>          list(<k2, v2>)
      Reduce  <k2, list(v2)>    list(<k3, v3>)
  • 12. WordCount in Action
    Input:
      foo.txt: “This is the foo file”
      bar.txt: “And this is the bar one”
    Map output:
      foo.txt: this 1, is 1, the 1, foo 1, file 1
      bar.txt: and 1, this 1, is 1, the 1, bar 1, one 1
    Reduce#1: Input: this, [1, 1] Output: this, 2
    Reduce#2: Input: is, [1, 1] Output: is, 2
    Reduce#3: Input: foo, [1] Output: foo, 1
    ...
    Final output: this 2, is 2, the 2, foo 1, file 1, and 1, bar 1, one 1
  • 13. WordCount with MapReduce
    map(String filename, String document) {
      List<String> T = tokenize(document);
      for each token in T {
        emit((String) token, (Integer) 1);
      }
    }
    reduce(String token, List<Integer> values) {
      Integer sum = 0;
      for each value in values {
        sum = sum + value;
      }
      emit((String) token, (Integer) sum);
    }
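The pseudo-code above can be tried out without a cluster. What follows is a minimal sketch in plain Java that imitates the map, shuffle, and reduce phases in memory; it does not use Hadoop at all. The class name `WordCountSketch` and its method names are made up for illustration, and lowercasing the tokens is an assumption taken from the WordCount example, where “This” and “this” are counted together.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory imitation of the map, shuffle, and reduce phases of WordCount (no Hadoop involved). */
final class WordCountSketch {

    /** Map phase: emit a (token, 1) pair for every whitespace-separated token. */
    static List<Map.Entry<String, Integer>> map(String document) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : document.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                // Assumption: tokens are lowercased so "This" and "this" share a key.
                pairs.add(new AbstractMap.SimpleEntry<>(token.toLowerCase(), 1));
            }
        }
        return pairs;
    }

    /** Shuffle phase: group all values that share a key, with keys in sorted order. */
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    /** Reduce phase: sum the list of counts collected for one token. */
    static int reduce(String token, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    /** Runs the three phases over every document and returns token -> count. */
    static Map<String, Integer> wordCount(List<String> documents) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String document : documents) {
            mapped.addAll(map(document));
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(mapped).entrySet()) {
            counts.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("This is the foo file", "And this is the bar one");
        // prints {and=1, bar=1, file=1, foo=1, is=2, one=1, the=2, this=2}
        System.out.println(wordCount(docs));
    }
}
```

In a real job the framework performs the shuffle step between the mappers and the reducers; it is written out here only to make the grouping by key visible.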
  • 14. Hadoop Building Blocks How does hadoop work?...
  • 15. Hadoop Building Blocks 1. NameNode 2. DataNode 3. Secondary NameNode 4. JobTracker 5. TaskTracker
  • 16. HDFS: NameNode and DataNodes
  • 17. JobTracker and TaskTracker
  • 18. Typical Hadoop Cluster
  • 19. Running Hadoop Three modes to run Hadoop: 1. Local (standalone) mode. 2. Pseudo-distributed mode (“cluster of one”). 3. Fully distributed mode.
  • 20. An Action Running Hadoop on Local Machine
  • 21. Actions ... 1. Installing Hadoop. 2. Configuring Hadoop (Pseudo-distributed mode). 3. Running WordCount example. 4. Web-based cluster UI.
  • 22. HDFS 1. HDFS is a filesystem designed for large-scale distributed data processing. 2. HDFS isn’t a native Unix filesystem. Basic File Commands:
    $ hadoop fs -cmd <args>
    $ hadoop fs -ls
    $ hadoop fs -mkdir /user/chuck
    $ hadoop fs -copyFromLocal
  • 23. Anatomy of a MapReduce program MapReduce and beyond
  • 24. Hadoop 1. Data Types 2. Mapper 3. Reducer 4. Partitioner 5. Combiner 6. Reading and Writing a. InputFormat b. OutputFormat
  • 25. Anatomy of a MapReduce program
  • 26. Hadoop Data Types ● Hadoop defines its own way of serializing key/value pairs. ● Values should implement the Writable interface. ● Keys should implement the WritableComparable interface. ● Some predefined classes: o BooleanWritable. o ByteWritable. o IntWritable. o ...
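To make the serialization idea concrete, here is a rough sketch of what a custom key type could look like. It runs on the plain JDK: `SketchWritable` is a hypothetical stand-in that mirrors the two methods of Hadoop's `Writable` (`write` and `readFields`), and `WordLengthKey` is an invented example type, not something from the slides. In a real job the class would implement Hadoop's `WritableComparable` instead.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class WritableSketch {

    /** Hypothetical stand-in mirroring the two methods of Hadoop's Writable interface. */
    interface SketchWritable {
        void write(DataOutput out) throws IOException;
        void readFields(DataInput in) throws IOException;
    }

    /** Invented key type: serializable like a Writable, and Comparable so keys can be sorted. */
    static final class WordLengthKey implements SketchWritable, Comparable<WordLengthKey> {
        private String word = "";
        private int length;

        WordLengthKey() { } // no-arg constructor: create an empty instance, then fill it via readFields

        WordLengthKey(String word) {
            this.word = word;
            this.length = word.length();
        }

        String word() {
            return word;
        }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeUTF(word);
            out.writeInt(length);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            // Fields must be read in the same order they were written.
            word = in.readUTF();
            length = in.readInt();
        }

        /** Orders keys by length first, then alphabetically. */
        @Override
        public int compareTo(WordLengthKey other) {
            int byLength = Integer.compare(length, other.length);
            return byLength != 0 ? byLength : word.compareTo(other.word);
        }
    }

    /** Serializes a key to bytes and reads it back into a fresh instance. */
    static WordLengthKey roundTrip(WordLengthKey key) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            key.write(new DataOutputStream(bytes));
            WordLengthKey copy = new WordLengthKey();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        WordLengthKey copy = roundTrip(new WordLengthKey("hadoop"));
        System.out.println(copy.word());
    }
}
```

The comparison method is what lets keys be sorted before they reach the reducer, which is why keys need more than plain serialization.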
  • 27. Mapper
  • 28. Mapper 1. Extends Mapper<K1,V1,K2,V2> 2. Overrides method: void map(K1 key, V1 value, Context context) 3. Use context.write(K2, V2) to emit key/value pairs.
  • 29. WordCount Mapper
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text();
      @Override
      public void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        // Split the input line into tokens and emit <token, 1> for each one.
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
          word.set(tokenizer.nextToken());
          context.write(word, one);
        }
      }
    }
  • 30. Predefined Mappers
  • 31. Reducer
  • 32. Reducer 1. Extends Reducer<K2,V2,K3,V3> 2. Overrides method: void reduce(K2 key, Iterable<V2> values, Context context) 3. Use context.write(K3, V3) to emit key/value pairs.
  • 33. WordCount Reducer
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
      @Override
      public void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
        // Sum all counts received for this word and emit <word, total>.
        int sum = 0;
        for (IntWritable val : values) {
          sum += val.get();
        }
        context.write(key, new IntWritable(sum));
      }
    }
  • 34. Predefined Reducers
  • 35. Partitioner
  • 36. Partitioner The partitioner decides which key goes where.
    class WordSizePartitioner extends Partitioner<Text, IntWritable> {
      @Override
      public int getPartition(Text word, IntWritable count, int numOfPartitions) {
        return 0; // placeholder: sends every key to partition 0
      }
    }
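The slide's `getPartition` simply returns 0, so every key would go to the same reducer. As a hedged illustration of what a non-trivial rule might look like, the plain-Java sketch below shows two options: a hash rule (to the best of my knowledge the same idea Hadoop's default `HashPartitioner` uses) and a word-size rule suggested by the class name on the slide. `PartitionSketch` and both method names are hypothetical, and strings stand in for `Text` keys.

```java
/** Two plain-Java partitioning rules; this class is illustrative and not part of Hadoop. */
final class PartitionSketch {

    /** Hash rule: clear the sign bit so the result is never negative, then take the remainder. */
    static int hashPartition(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    /** Word-size rule: words of the same length always land in the same partition. */
    static int wordSizePartition(String word, int numPartitions) {
        return word.length() % numPartitions;
    }

    public static void main(String[] args) {
        for (String word : new String[] {"this", "is", "the", "foo", "file"}) {
            System.out.println(word + " -> hash " + hashPartition(word, 3)
                    + ", size " + wordSizePartition(word, 3));
        }
    }
}
```

Whatever the rule, it has to be deterministic and return a value between 0 and the number of partitions minus one, so that all values for one key meet at the same reducer.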
  • 37. Combiner
  • 38. Combiner A combiner is a local reduce task that runs on the mapper’s output. WordCount Mapper Output: 1. Without Combiner: <the, 1>, <file, 1>, <the, 1>, … 2. With Combiner: <the, 2>, <file, 2>, ...
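A small sketch of the effect, in plain Java rather than Hadoop: the tokens emitted by one mapper are summed locally before anything is sent on to the reducers. `CombinerSketch` and `combine` are hypothetical names, and the sample token list is invented to match the shape of the slide's example.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Imitates a combiner: a local reduce over the output of a single mapper. */
final class CombinerSketch {

    /** Sums the implicit 1 emitted for each token, keeping tokens in first-seen order. */
    static Map<String, Integer> combine(List<String> mapperTokens) {
        Map<String, Integer> combined = new LinkedHashMap<>();
        for (String token : mapperTokens) {
            combined.merge(token, 1, Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        // Without a combiner this mapper would send four <token, 1> pairs over the network.
        List<String> tokens = Arrays.asList("the", "file", "the", "file");
        // With a combiner it sends two pairs; prints {the=2, file=2}
        System.out.println(combine(tokens));
    }
}
```

This only works because summing is associative and commutative: adding partial sums in the reducer gives the same total as adding the individual 1s.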
  • 39. Reading and Writing
  • 40. Reading and Writing 1. Input data usually resides in large files. 2. MapReduce’s processing power comes from splitting the input data into chunks (InputSplits) that can be processed in parallel. 3. Hadoop’s FileSystem provides the class FSDataInputStream for file reading. It extends DataInputStream with random read access.
  • 41. InputFormat Classes ● TextInputFormat o <offset, line> ● KeyValueTextInputFormat o key\tvalue => <key, value> ● NLineInputFormat o <offset, nLines> You can define your own InputFormat class ...
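The difference between the first two formats is easiest to see on a tiny input. The sketch below is plain Java that imitates the records each format would hand to a mapper; it does not call Hadoop. `InputFormatSketch` and its methods are hypothetical, offsets assume one byte per character (ASCII input), and treating a line without a tab as a key with an empty value is my understanding of the usual behaviour rather than something stated on the slide.

```java
import java.util.ArrayList;
import java.util.List;

/** Imitates the <key, value> records produced by two input formats (no Hadoop involved). */
final class InputFormatSketch {

    /** TextInputFormat style: key = offset of the line's first character, value = the line. */
    static List<String[]> offsetLineRecords(String content) {
        List<String[]> records = new ArrayList<>();
        long offset = 0;
        for (String line : content.split("\n")) {
            records.add(new String[] {Long.toString(offset), line});
            offset += line.length() + 1; // +1 for the newline; assumes ASCII input
        }
        return records;
    }

    /** KeyValueTextInputFormat style: split each line at the first tab character. */
    static List<String[]> keyValueRecords(String content) {
        List<String[]> records = new ArrayList<>();
        for (String line : content.split("\n")) {
            int tab = line.indexOf('\t');
            if (tab < 0) {
                records.add(new String[] {line, ""}); // assumption: no tab means an empty value
            } else {
                records.add(new String[] {line.substring(0, tab), line.substring(tab + 1)});
            }
        }
        return records;
    }

    public static void main(String[] args) {
        String content = "foo\t1\nbar\t22\n";
        for (String[] record : offsetLineRecords(content)) {
            System.out.println("<" + record[0] + ", " + record[1] + ">");
        }
        for (String[] record : keyValueRecords(content)) {
            System.out.println("<" + record[0] + ", " + record[1] + ">");
        }
    }
}
```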
  • 42. OutputFormat 1. The output has no splits. 2. Each reducer generates an output file named part-nnnnn, where nnnnn is the partition ID of the reducer. Predefined OutputFormat classes: TextOutputFormat: <k, v> => k\tv
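As a quick illustration of the two conventions on this slide, the plain-Java sketch below builds a part-nnnnn file name from a partition ID and formats a key/value pair as a tab-separated line. `OutputFormatSketch` and its methods are hypothetical helpers written for this note; they only mimic the naming and line layout described above and do not call Hadoop.

```java
import java.util.Locale;

/** Mimics the file naming and line layout described for reducer output (no Hadoop involved). */
final class OutputFormatSketch {

    /** Builds the part-nnnnn name for a reducer's partition ID, zero-padded to five digits. */
    static String partFileName(int partitionId) {
        return String.format(Locale.ROOT, "part-%05d", partitionId);
    }

    /** Formats one record the way the slide describes TextOutputFormat: key, tab, value. */
    static String textLine(String key, int value) {
        return key + "\t" + value;
    }

    public static void main(String[] args) {
        System.out.println(partFileName(3));     // prints part-00003
        System.out.println(textLine("this", 2)); // prints this, a tab, then 2
    }
}
```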
  • 43. Recap
  • 44. END OF SESSION #1
  • 45. Q