Hadoop is a framework for the distributed processing of large data sets across clusters of computers. It is based on the MapReduce programming model: input data is processed by a map function, the mapped data is shuffled and sorted, and the result is aggregated by a reduce function. At Hadoop's core are the Hadoop Distributed File System (HDFS), which stores data reliably across a large cluster, and MapReduce, which processes that data in parallel across the cluster. Hadoop is an open-source Apache project and is useful for applications such as log analysis, data mining, and processing large volumes of unstructured data.
4. MapReduce
• Inspired by functional operations
– Map
– Reduce
• Functional operations do not modify data;
they generate new data
• The original data remains unmodified
BarCamp Chennai - 5 Mohit Soni
5. Functional Operations
Map:
    def sqr(n):
        return n * n
    list = [1,2,3,4]
    map(sqr, list) -> [1,4,9,16]
Reduce:
    def add(i, j):
        return i + j
    list = [1,2,3,4]
    reduce(add, list) -> 10
MapReduce:
    def MapReduce(data, mapper, reducer):
        return reduce(reducer, map(mapper, data))
    MapReduce(list, sqr, add) -> 30
(Python code)
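The snippets above target Python 2, where map returns a list and reduce is a builtin. A runnable Python 3 sketch of the same idea (map_reduce is my name for the slide's MapReduce):

```python
from functools import reduce  # reduce moved to functools in Python 3

def sqr(n):
    return n * n

def add(i, j):
    return i + j

def map_reduce(data, mapper, reducer):
    # Apply the mapper to every element, then fold the results with the
    # reducer. Neither step modifies `data`; both generate new values.
    return reduce(reducer, map(mapper, data))

data = [1, 2, 3, 4]
print(list(map(sqr, data)))        # [1, 4, 9, 16]
print(reduce(add, data))           # 10
print(map_reduce(data, sqr, add))  # 30
```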
7. What is Hadoop?
• Framework for large-scale data processing
• Based on Google's MapReduce and GFS papers
• An Apache Software Foundation project
• Open source!
• Written in Java
8. Why Hadoop?
• Need to process lots of data (petabyte scale)
• Need to parallelize processing across a
multitude of CPUs
• Achieves the above while keeping the
software simple
• Gives scalability with low-cost commodity
hardware
10. When to use (and not use) Hadoop?
Hadoop is a good choice for:
• Indexing data
• Log analysis
• Image manipulation
• Sorting large-scale data
• Data mining
Hadoop is not a good choice:
• For real-time processing
• For processing-intensive tasks with little data
• If you have a Jaguar or a RoadRunner (supercomputers) in stock
11. HDFS – Overview
• Hadoop Distributed File System
• Based on Google's GFS (Google File System)
• Write-once, read-many access model
• Fault tolerant
• Efficient for batch processing
12. HDFS – Blocks
[Diagram: Input Data split into Block 1, Block 2, Block 3]
• HDFS splits input data into blocks
• Block size in HDFS: 64/128 MB (configurable)
• Typical block size on *nix filesystems: 4 KB
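The split can be sketched in a few lines of Python (the function name and return shape are mine, not an HDFS API):

```python
def block_sizes(file_size, block_size=64 * 1024 * 1024):
    # A file is stored as a sequence of fixed-size blocks;
    # only the last block may be smaller than block_size.
    sizes = []
    remaining = file_size
    while remaining > 0:
        sizes.append(min(block_size, remaining))
        remaining -= block_size
    return sizes

MB = 1024 * 1024
# A 200 MB file with 64 MB blocks: three full blocks plus an 8 MB tail.
print([s // MB for s in block_sizes(200 * MB)])  # [64, 64, 64, 8]
```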
13. HDFS – Replication
[Diagram: Blocks 1–3 replicated across multiple nodes]
• Blocks are replicated across nodes to handle hardware failure
• Node failure is handled gracefully, without loss of data
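A toy sketch of why replication tolerates node failure (round-robin placement; real HDFS uses a rack-aware policy, and all names here are illustrative):

```python
def place_replicas(blocks, nodes, replication=3):
    # Put each block's replicas on `replication` distinct nodes, round-robin.
    placement = {}
    for i, block in enumerate(blocks):
        placement[block] = [nodes[(i + r) % len(nodes)]
                            for r in range(replication)]
    return placement

layout = place_replicas(["blk_1", "blk_2", "blk_3"],
                        ["n1", "n2", "n3", "n4"], replication=2)
print(layout)

# If any single node fails, every block still has a surviving replica.
for failed in ["n1", "n2", "n3", "n4"]:
    assert all(any(n != failed for n in replicas)
               for replicas in layout.values())
```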
14. HDFS – Architecture
[Diagram: Client talks to the NameNode; DataNodes form the cluster]
15. HDFS – NameNode
• NameNode (Master)
– Manages filesystem metadata
– Manages replication of blocks
– Manages read/write access to files
• Metadata
– List of files
– List of blocks that constitute a file
– List of DataNodes on which blocks reside, etc.
• Single point of failure (a good candidate for spending $$ on reliable hardware)
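The metadata above can be pictured as two maps: file → block list, and block → DataNode list (a toy model; the names are illustrative, not HDFS internals):

```python
metadata = {
    "files": {  # which blocks make up each file
        "/logs/day1.txt": ["blk_1", "blk_2"],
        "/logs/day2.txt": ["blk_3"],
    },
    "blocks": {  # which DataNodes hold each block
        "blk_1": ["dn-a", "dn-b", "dn-c"],
        "blk_2": ["dn-b", "dn-c", "dn-d"],
        "blk_3": ["dn-a", "dn-b", "dn-d"],
    },
}

def locations(path):
    # A client asks the NameNode where a file's blocks live,
    # then reads each block directly from one of those DataNodes.
    return [metadata["blocks"][blk] for blk in metadata["files"][path]]

print(locations("/logs/day1.txt"))
```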
16. HDFS – DataNode
• DataNode (Slave)
– Stores the actual data
– Manages data blocks
– Informs the NameNode about the block IDs it stores
– Clients read/write data blocks directly from/to DataNodes
– Performs block replication as instructed by the NameNode
• Block Replication
– Supports various pluggable replication strategies
– Clients read blocks from the nearest DataNode
• Data Pipelining
– The client writes a block to the first DataNode
– The first DataNode forwards the data to the next DataNode in the pipeline
– When the block has been replicated to all replicas, the next block is sent
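The pipelining steps above can be sketched as (a toy model, not the HDFS wire protocol):

```python
def pipeline_write(block, pipeline):
    # The client hands the block to the first DataNode; each DataNode
    # stores its copy and forwards the block to the next node in line.
    stored = {}
    for node in pipeline:
        stored[node] = block  # this node persists its replica...
        # ...and forwards the block downstream (the next loop iteration)
    return stored

replicas = pipeline_write(b"block-1 bytes", ["dn1", "dn2", "dn3"])
print(sorted(replicas))  # ['dn1', 'dn2', 'dn3']
```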
17. Hadoop – Architecture
[Diagram: the user submits jobs to the JobTracker, which drives the TaskTrackers; the NameNode manages the DataNodes]
18. Hadoop – Terminology
• JobTracker (Master)
– One JobTracker per cluster
– Accepts job requests from users
– Schedules Map and Reduce tasks for TaskTrackers
– Monitors the status of tasks and TaskTrackers
– Re-executes tasks on failure
• TaskTracker (Slave)
– Multiple TaskTrackers in a cluster
– Run Map and Reduce tasks
19. MapReduce – Flow
Input Data → Map → Shuffle + Sort → Reduce → Output Data
(Several Map tasks feed the Shuffle + Sort phase, which feeds one or more Reduce tasks)
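The whole flow fits in a few lines of Python for word count (a sketch of the model, not Hadoop's implementation):

```python
from collections import defaultdict

def map_reduce_flow(documents):
    # Map: emit a (word, 1) pair for every word in the input
    pairs = [(word, 1) for doc in documents for word in doc.split()]
    # Shuffle + Sort: group the pairs by key
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    # Reduce: aggregate each key's values
    return {word: sum(counts) for word, counts in sorted(groups.items())}

print(map_reduce_flow(["a b a", "b c", "a"]))  # {'a': 3, 'b': 2, 'c': 1}
```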
20. Word Count
Hadoop’s HelloWorld
21. Word Count Example
• Input
– Text files
• Output
– A single file containing (Word <TAB> Count) lines
• Map Phase
– Generates (Word, Count) pairs
– Per-mapper outputs: [{a,1}, {b,1}, {a,1}], [{a,2}, {b,3}, {c,5}], [{a,3}, {b,1}, {c,1}]
• Reduce Phase
– For each word, calculates the aggregate count
– [{a,7}, {b,5}, {c,6}]
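The reduce phase's arithmetic above checks out; summing the three mapper outputs:

```python
from collections import Counter

mapper_outputs = [
    [("a", 1), ("b", 1), ("a", 1)],
    [("a", 2), ("b", 3), ("c", 5)],
    [("a", 3), ("b", 1), ("c", 1)],
]

totals = Counter()
for output in mapper_outputs:
    for word, count in output:
        totals[word] += count

print(dict(totals))  # {'a': 7, 'b': 5, 'c': 6}
```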
22. Word Count – Mapper
public class WordCountMapper extends MapReduceBase implements
    Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> out, Reporter reporter)
      throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      out.collect(word, one); // emit (word, 1) for every token
    }
  }
}
23. Word Count – Reducer
public class WordCountReducer extends MapReduceBase implements
    Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> out, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get(); // add up the 1s emitted for this word
    }
    out.collect(key, new IntWritable(sum));
  }
}
24. Word Count – Config
public class WordCountConfig {
  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.exit(1);
    }
    JobConf conf = new JobConf(WordCountConfig.class);
    conf.setJobName("Word Counter");
    FileInputFormat.addInputPath(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    conf.setMapperClass(WordCountMapper.class);
    conf.setCombinerClass(WordCountReducer.class); // reducer doubles as combiner
    conf.setReducerClass(WordCountReducer.class);
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);
    JobClient.runJob(conf);
  }
}
25. Diving Deeper
• http://hadoop.apache.org/
• Jeffrey Dean and Sanjay Ghemawat, "MapReduce:
Simplified Data Processing on Large Clusters"
• Tom White, Hadoop: The Definitive Guide, O'Reilly
• Setting up a Single-Node Cluster: http://bit.ly/glNzs4
• Setting up a Multi-Node Cluster: http://bit.ly/f5KqCP
26. Catching-Up
• Follow me on twitter @mohitsoni
• http://mohitsoni.com/