MapReduce
• Inspired by functional programming operations
– Map
– Reduce
• Functional operations do not modify data; they generate new data
• Original data remains unmodified
Functional Operations
Map:
def sqr(n):
    return n * n

list = [1, 2, 3, 4]
map(sqr, list) -> [1, 4, 9, 16]

Reduce:
def add(i, j):
    return i + j

list = [1, 2, 3, 4]
reduce(add, list) -> 10
MapReduce
def MapReduce(data, mapper, reducer):
    return reduce(reducer, map(mapper, data))

MapReduce(list, sqr, add) -> 30
Python code
What is Hadoop?
• Framework for large-scale data processing
• Based on Google’s MapReduce and GFS
• An Apache Software Foundation project
• Open Source!
• Written in Java
Why Hadoop?
• Need to process lots of data (petabyte scale)
• Need to parallelize processing across a multitude of CPUs
• Achieves the above while keeping software simple (KISS)
• Gives scalability with low-cost commodity hardware
When to use (and not use) Hadoop?
Hadoop is a good choice for:
• Indexing data
• Log Analysis
• Image manipulation
• Sorting large-scale data
• Data Mining
Hadoop is not a good choice for:
• Real-time processing
• Processing-intensive tasks with little data
• Situations where you already have a supercomputer like Jaguar or RoadRunner in stock
HDFS – Overview
• Hadoop Distributed File System
• Based on Google’s GFS (Google File System)
• Write-once, read-many access model
• Fault tolerant
• Efficient for batch processing
HDFS – Blocks
[Diagram: Input Data split into Block 1, Block 2, and Block 3]
• HDFS splits input data into blocks
• Block size in HDFS: 64 MB or 128 MB (configurable; a per-file override is sketched below)
• Typical block size on *nix filesystems: 4 KB
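As a rough sketch (the path, buffer size, and values are illustrative assumptions, not from the slides), the block size can be overridden per file at create time through the standard FileSystem API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Create a (hypothetical) file with an explicit 128 MB block size
        FSDataOutputStream out = fs.create(
                new Path("/demo/big-file"), // illustrative path
                true,                       // overwrite if present
                4096,                       // I/O buffer size in bytes
                (short) 3,                  // replication factor
                128L * 1024 * 1024);        // block size: 128 MB
        out.writeUTF("payload");
        out.close();
        fs.close();
    }
}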
HDFS – Replication
[Diagram: Blocks 1, 2, and 3 replicated across nodes, each block stored on more than one node]
• Blocks are replicated across nodes to handle hardware failure (a per-file example follows below)
• Node failure is handled gracefully, without loss of data
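A minimal sketch (the file path is a hypothetical example) of adjusting the replication factor of an existing file; the NameNode then re-replicates blocks in the background:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Request 3 replicas of this (hypothetical) file; the NameNode
        // schedules copying/removal of block replicas to meet the target
        fs.setReplication(new Path("/demo/big-file"), (short) 3);
        fs.close();
    }
}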
HDFS – Architecture
[Diagram: a Client talks to the NameNode and to the DataNodes in the cluster]
HDFS – NameNode
• NameNode (Master)
– Manages filesystem metadata
– Manages replication of blocks
– Manages read/write access to files
• Metadata
– List of files
– List of blocks that constitute a file
– List of DataNodes on which blocks reside, etc. (queried in the sketch below)
• Single Point of Failure (candidate for spending $$)
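To make the metadata concrete, a small sketch (the file path is assumed) that asks the NameNode which DataNodes hold the blocks of a file:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/demo/big-file")); // illustrative path
        // The NameNode answers from its metadata: one entry per block,
        // each listing the DataNodes that hold a replica of that block
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println(java.util.Arrays.toString(block.getHosts()));
        }
        fs.close();
    }
}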
HDFS – DataNode
• DataNode (Slave)
– Contains actual data
– Manages data blocks
– Informs NameNode about block IDs stored
– Clients read/write data blocks directly from/to DataNodes
– Performs block replication as instructed by NameNode
• Block Replication
– Supports various pluggable replication strategies
– Clients read blocks from nearest DataNode
• Data Pipelining
– Client writes a block to the first DataNode
– The first DataNode forwards the data to the next DataNode in the pipeline
– Once the block is replicated across all replicas, the next block is sent (see the client-side sketch below)
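As a rough illustration (the path and payload are assumptions; this uses the standard FileSystem client API), a client write that triggers the pipeline above, followed by a read served from the nearest replica:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/demo/hello.txt"); // illustrative path

        // Write: the client streams data to the first DataNode, which
        // forwards it along the replication pipeline described above
        FSDataOutputStream out = fs.create(file);
        out.writeUTF("Hello, HDFS");
        out.close();

        // Read: the client asks the NameNode for block locations and
        // reads each block from the nearest DataNode holding a replica
        System.out.println(fs.open(file).readUTF());
        fs.close();
    }
}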
Hadoop - Architecture
[Diagram: a User submits jobs to the JobTracker, which coordinates the TaskTrackers; the NameNode manages the DataNodes that store the data blocks]
Hadoop - Terminology
• JobTracker (Master)
– One JobTracker per cluster
– Accepts job requests from users
– Schedules Map and Reduce tasks for TaskTrackers
– Monitors task and TaskTracker status
– Re-executes tasks on failure
• TaskTracker (Slave)
– Multiple TaskTrackers in a cluster
– Run Map and Reduce tasks
MapReduce – Flow
Input Data → Map → Shuffle + Sort → Reduce → Output Data
[Diagram: multiple Map tasks feed multiple Reduce tasks through the Shuffle + Sort phase]
Word Count
Hadoop’s “Hello World”
Word Count Example
• Input
– Text files
• Output
– Single file containing (Word <TAB> Count)
• Map Phase
– Generates (Word, Count) pairs
– e.g., three mappers emit [{a,1}, {b,1}, {a,1}], [{a,2}, {b,3}, {c,5}], [{a,3}, {b,1}, {c,1}]
– The framework then groups the pairs by word (Shuffle + Sort)
• Reduce Phase
– For each word, calculates the aggregate count
– e.g., [{a,7}, {b,5}, {c,6}]
Word Count – Mapper
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // Emits (word, 1) for every token in the input line
    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> out, Reporter reporter)
            throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            out.collect(word, one);
        }
    }
}
Word Count – Reducer
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class WordCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    // Sums the counts emitted for each word
    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> out, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        out.collect(key, new IntWritable(sum));
    }
}
Word Count – Config
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class WordCountConfig {
    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.exit(1);
        }
        JobConf conf = new JobConf(WordCountConfig.class);
        conf.setJobName("Word Counter");
        FileInputFormat.addInputPath(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        conf.setMapperClass(WordCountMapper.class);
        conf.setCombinerClass(WordCountReducer.class);
        conf.setReducerClass(WordCountReducer.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        // Declare the job's output types; the defaults would not
        // match the (Text, IntWritable) pairs the reducer emits
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        JobClient.runJob(conf);
    }
}
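Packaged into a jar together with the mapper and reducer, the driver above would typically be launched with something like hadoop jar wordcount.jar WordCountConfig <input dir> <output dir> (the jar name here is illustrative).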
Diving Deeper
• http://hadoop.apache.org/
• Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters
• Tom White, Hadoop: The Definitive Guide, O’Reilly
• Setting up a Single-Node Cluster: http://bit.ly/glNzs4
• Setting up a Multi-Node Cluster: http://bit.ly/f5KqCP
Catching-Up
• Follow me on twitter @mohitsoni
• http://mohitsoni.com/