This was the first session about Hadoop and MapReduce. It introduces what Hadoop is and its main components, covers how to program your first MapReduce task, and shows how to run it on a pseudo-distributed Hadoop installation.
This session was given in Arabic, and I may provide a video for the session soon.
2. Expected … what will be said!
● History.
● What is Hadoop.
● Hadoop vs SQL.
● MapReduce.
● Hadoop Building Blocks.
● Installing, Configuring and Running Hadoop.
● Anatomy of a MapReduce program.
6. Challenges of Distributed Processing of Large Data
● How to distribute the work?
● How to store and distribute the data itself?
● How to overcome failures?
● How to balance the load?
● How to deal with unstructured data?
● ...
8. What is Hadoop?
Hadoop is an open source framework for writing and
running distributed applications that process large
amounts of data.
Key distinctions of Hadoop:
● Accessible
● Robust
● Scalable
● Simple
9. Hadoop vs SQL
● Structured and Unstructured data.
● Datastore and Data Analysis.
● Scale-out and Scale-up.
● Offline batch processing and Online transactions.
11. What is MapReduce?
● Parallel programming model for clusters of commodity machines.
● MapReduce provides:
o Automatic parallelization & distribution.
o Fault tolerance.
o Locality of data.
14. WordCount in Action
Input:
foo.txt:
“This is the foo file”
bar.txt:
“And this is the bar one”
Map output (one <word, 1> pair per token):
foo.txt => (this, 1), (is, 1), (the, 1), (foo, 1), (file, 1)
bar.txt => (and, 1), (this, 1), (is, 1), (the, 1), (bar, 1), (one, 1)
Reduce#1: Input: this, [1, 1] Output: this, 2
Reduce#2: Input: is, [1, 1] Output: is, 2
Reduce#3: Input: foo, [1] Output: foo, 1
...
Final output:
this 2
is 2
the 2
foo 1
file 1
and 1
bar 1
one 1
15. WordCount with MapReduce
map(String filename, String document) {
    List<String> T = tokenize(document);
    for each token in T {
        emit((String) token, (Integer) 1);
    }
}

reduce(String token, List<Integer> values) {
    Integer sum = 0;
    for each value in values {
        sum = sum + value;
    }
    emit((String) token, (Integer) sum);
}
28. Hadoop Data Types
● Hadoop has its own well-defined way of serializing key/value pairs.
● Values should implement the Writable interface.
● Keys should implement the WritableComparable interface.
● Some predefined classes:
o BooleanWritable.
o ByteWritable.
o IntWritable.
o ...
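To make the Writable contract concrete, here is a minimal sketch of a custom WritableComparable; the class and field names are my own hypothetical choices, not from the slides:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical key type: wraps a single word.
public class WordKey implements WritableComparable<WordKey> {
    private String word = "";

    public void set(String w) { word = w; }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(word);                // serialize the fields
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        word = in.readUTF();               // deserialize in the same order
    }

    @Override
    public int compareTo(WordKey other) {
        return word.compareTo(other.word); // keys must sort, hence WritableComparable
    }

    @Override
    public int hashCode() {
        return word.hashCode();            // used by the default HashPartitioner
    }
}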
31. WordCount Mapper
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}
35. WordCount Reducer
public static class Reduce
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
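The slides show the Mapper and Reducer but not the driver that wires them into a job. As a minimal sketch, assuming the Map and Reduce classes above and the standard org.apache.hadoop.mapreduce.Job API (input/output paths taken from the command line):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "wordcount");
        job.setJarByClass(WordCount.class);

        job.setMapperClass(Map.class);      // the Mapper from slide 31
        job.setReducerClass(Reduce.class);  // the Reducer from slide 35

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}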
38. Partitioner
The partitioner decides which key goes where.
class WordSizePartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text word, IntWritable count, int numOfPartitions) {
        return 0; // stub: sends every key to partition 0
    }
}
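The stub above sends every key to partition 0, i.e. to a single reducer. As the class name suggests, a real word-size partitioner could route keys by word length; this is a sketch of one possible implementation, not code from the slides:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

class WordSizePartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text word, IntWritable count, int numOfPartitions) {
        // Words of different lengths land in different partitions,
        // so each reducer (and each part-nnnnn file) covers one bucket of sizes.
        return word.getLength() % numOfPartitions;
    }
}

It is registered in the driver with job.setPartitionerClass(WordSizePartitioner.class).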
42. Reading and Writing
1. Input data usually resides in large files.
2. MapReduce's processing power comes from splitting the input data into chunks (InputSplits) that are processed in parallel.
3. Hadoop's FileSystem provides the class FSDataInputStream for file reading. It extends DataInputStream with random read access.
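A small sketch of what reading through FSDataInputStream looks like, including the random access mentioned above (the file path is a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration()); // the configured FileSystem (HDFS here)
        FSDataInputStream in = fs.open(new Path("/user/demo/foo.txt")); // placeholder path

        byte[] buffer = new byte[16];
        in.readFully(0, buffer); // positioned read: the random access plain DataInputStream lacks
        in.seek(0);              // jump back to the beginning
        int firstByte = in.read();

        in.close();
        fs.close();
    }
}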
43. InputFormat Classes
● TextInputFormat
o <offset, line>
● KeyValueTextInputFormat
o key<TAB>value => <key, value>
● NLineInputFormat
o <offset, nLines>
You can define your own InputFormat class ...
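Choosing an InputFormat is a one-line job setting; a sketch, assuming a Job object like the one in the driver above:

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

public class InputFormatConfig {
    static void configure(Job job) {
        // Default is TextInputFormat: <byte offset, line>.
        // KeyValueTextInputFormat splits each line on the first tab instead:
        job.setInputFormatClass(KeyValueTextInputFormat.class);

        // Or hand every mapper a fixed number of lines per split:
        // job.setInputFormatClass(NLineInputFormat.class);
        // NLineInputFormat.setNumLinesPerSplit(job, 10);
    }
}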
44. OutputFormat
1. The output has no splits.
2. Each reducer generates an output file named part-nnnnn, where nnnnn is the partition ID of the reducer.
Predefined OutputFormat classes:
> TextOutputFormat: <k, v> => k<TAB>v
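The OutputFormat and the output directory are likewise set in the driver; a minimal sketch with a placeholder path:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class OutputFormatConfig {
    static void configure(Job job) {
        // TextOutputFormat writes one "key<TAB>value" line per record.
        job.setOutputFormatClass(TextOutputFormat.class);
        // Each reducer writes its own part-nnnnn file under this directory:
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output")); // placeholder
    }
}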
Lucene is a full-featured text indexing and searching library.
Nutch was an attempt to build a complete web search engine on top of Lucene; it has a web crawler, an HTML parser, and so on.
Problem: there are billions of web pages out there!! What could poor Nutch do?
> Google announced GFS and MapReduce in 2004 and said they were using these techniques in their search engine … really? :/ <
Doug Cutting and his team used these techniques for Nutch, and then Hadoop was born.
Doug Cutting
Challenges in processing Large Data in a distributed way.
Accessible—Hadoop runs on large clusters of commodity machines or on cloud computing services such as Amazon’s Elastic Compute Cloud (EC2).
Robust—Because it is intended to run on commodity hardware, Hadoop is architected with the assumption of frequent hardware malfunctions. It can gracefully
handle most such failures.
Scalable—Hadoop scales linearly to handle larger data by adding more nodes to the cluster.
Simple—Hadoop allows users to quickly write efficient parallel code.
Hadoop in Action section 1.2
Table from “Hadoop In Action”
Images source:
https://developer.yahoo.com/hadoop/tutorial/module4.html
Pseudo-code for map and reduce functions for word counting
Source: Hadoop In Action
Now that we have a general overview of MapReduce, let's see how Hadoop works.
Hadoop In Action Figure 2.1
Local (standalone) mode.
No HDFS.
No Hadoop Daemons.
Debugging and testing the logic of a MapReduce program.
Pseudo-distributed mode.
All daemons running on a single machine.
Debugging your code, allowing you to examine memory usage, HDFS input/output issues, and other daemon interactions.
Fully distributed mode.
When the reducer task receives the output from the various mappers, it sorts the
incoming data on the key of the (key/value) pair and groups together all values of
the same key.