Hadoop is a framework for the distributed processing of large data sets across clusters of computers. It is based on the MapReduce programming model: input data is processed by a map function, the mapped data is shuffled and sorted, and the result is aggregated by a reduce function. At Hadoop's core are the Hadoop Distributed File System (HDFS), which stores data reliably across a large cluster, and MapReduce, which processes that data in parallel across the cluster. Hadoop is an open-source Apache project and is useful for applications such as log analysis, data mining, and processing large volumes of unstructured data.
4. MapReduce
• Inspired by functional operations
– Map
– Reduce
• Functional operations do not modify data;
they generate new data
• The original data remains unmodified
BarCamp Chennai - 5 Mohit Soni
5. Functional Operations
Map:
    def sqr(n):
        return n * n
    list = [1,2,3,4]
    map(sqr, list) -> [1,4,9,16]
Reduce:
    def add(i, j):
        return i + j
    list = [1,2,3,4]
    reduce(add, list) -> 10
MapReduce:
    def MapReduce(data, mapper, reducer):
        return reduce(reducer, map(mapper, data))
    MapReduce(list, sqr, add) -> 30
(Python code)
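The snippets above target Python 2, where map returns a list and reduce is a builtin. A runnable Python 3 sketch of the same idea (map_reduce is my name for the slide's MapReduce):

```python
from functools import reduce  # reduce moved to functools in Python 3

def sqr(n):
    return n * n

def add(i, j):
    return i + j

def map_reduce(data, mapper, reducer):
    # Apply the mapper to every element, then fold the results with the
    # reducer. Neither step modifies `data`; both generate new values.
    return reduce(reducer, map(mapper, data))

data = [1, 2, 3, 4]
print(list(map(sqr, data)))        # [1, 4, 9, 16]
print(reduce(add, data))           # 10
print(map_reduce(data, sqr, add))  # 30
```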
7. What is Hadoop?
• Framework for large-scale data processing
• Based on Google's MapReduce and GFS papers
• An Apache Software Foundation project
• Open source!
• Written in Java
8. Why Hadoop?
• Need to process lots of data (petabyte scale)
• Need to parallelize processing across a
multitude of CPUs
• Achieves the above while keeping the
software simple
• Gives scalability with low-cost commodity
hardware
10. When to use (and not use) Hadoop?
Hadoop is a good choice for:
• Indexing data
• Log analysis
• Image manipulation
• Sorting large-scale data
• Data mining
Hadoop is not a good choice:
• For real-time processing
• For processing-intensive tasks with little data
• If you have a Jaguar or a RoadRunner (supercomputers) in stock
11. HDFS – Overview
• Hadoop Distributed File System
• Based on Google's GFS (Google File System)
• Write-once, read-many access model
• Fault tolerant
• Efficient for batch processing
12. HDFS – Blocks
[Diagram: Input Data split into Block 1, Block 2, Block 3]
• HDFS splits input data into blocks
• Block size in HDFS: 64/128 MB (configurable)
• Typical block size on *nix filesystems: 4 KB
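The split can be sketched in a few lines of Python (the function name and return shape are mine, not an HDFS API):

```python
def block_sizes(file_size, block_size=64 * 1024 * 1024):
    # A file is stored as a sequence of fixed-size blocks;
    # only the last block may be smaller than block_size.
    sizes = []
    remaining = file_size
    while remaining > 0:
        sizes.append(min(block_size, remaining))
        remaining -= block_size
    return sizes

MB = 1024 * 1024
# A 200 MB file with 64 MB blocks: three full blocks plus an 8 MB tail.
print([s // MB for s in block_sizes(200 * MB)])  # [64, 64, 64, 8]
```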
13. HDFS – Replication
[Diagram: Blocks 1–3 replicated across multiple nodes]
• Blocks are replicated across nodes to handle hardware failure
• Node failure is handled gracefully, without loss of data
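A toy sketch of why replication tolerates node failure (round-robin placement; real HDFS uses a rack-aware policy, and all names here are illustrative):

```python
def place_replicas(blocks, nodes, replication=3):
    # Put each block's replicas on `replication` distinct nodes, round-robin.
    placement = {}
    for i, block in enumerate(blocks):
        placement[block] = [nodes[(i + r) % len(nodes)]
                            for r in range(replication)]
    return placement

layout = place_replicas(["blk_1", "blk_2", "blk_3"],
                        ["n1", "n2", "n3", "n4"], replication=2)
print(layout)

# If any single node fails, every block still has a surviving replica.
for failed in ["n1", "n2", "n3", "n4"]:
    assert all(any(n != failed for n in replicas)
               for replicas in layout.values())
```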
14. HDFS – Architecture
[Diagram: Client talks to the NameNode; DataNodes form the cluster]
15. HDFS – NameNode
• NameNode (Master)
– Manages filesystem metadata
– Manages replication of blocks
– Manages read/write access to files
• Metadata
– List of files
– List of blocks that constitute a file
– List of DataNodes on which blocks reside, etc.
• Single point of failure (a good candidate for spending $$ on reliable hardware)
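The metadata above can be pictured as two maps: file → block list, and block → DataNode list (a toy model; the names are illustrative, not HDFS internals):

```python
metadata = {
    "files": {  # which blocks make up each file
        "/logs/day1.txt": ["blk_1", "blk_2"],
        "/logs/day2.txt": ["blk_3"],
    },
    "blocks": {  # which DataNodes hold each block
        "blk_1": ["dn-a", "dn-b", "dn-c"],
        "blk_2": ["dn-b", "dn-c", "dn-d"],
        "blk_3": ["dn-a", "dn-b", "dn-d"],
    },
}

def locations(path):
    # A client asks the NameNode where a file's blocks live,
    # then reads each block directly from one of those DataNodes.
    return [metadata["blocks"][blk] for blk in metadata["files"][path]]

print(locations("/logs/day1.txt"))
```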
16. HDFS – DataNode
• DataNode (Slave)
– Stores the actual data
– Manages data blocks
– Informs the NameNode about the block IDs it stores
– Clients read/write data blocks directly from/to DataNodes
– Performs block replication as instructed by the NameNode
• Block Replication
– Supports various pluggable replication strategies
– Clients read blocks from the nearest DataNode
• Data Pipelining
– The client writes a block to the first DataNode
– The first DataNode forwards the data to the next DataNode in the pipeline
– When the block has been replicated to all replicas, the next block is sent
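The pipelining steps above can be sketched as (a toy model, not the HDFS wire protocol):

```python
def pipeline_write(block, pipeline):
    # The client hands the block to the first DataNode; each DataNode
    # stores its copy and forwards the block to the next node in line.
    stored = {}
    for node in pipeline:
        stored[node] = block  # this node persists its replica...
        # ...and forwards the block downstream (the next loop iteration)
    return stored

replicas = pipeline_write(b"block-1 bytes", ["dn1", "dn2", "dn3"])
print(sorted(replicas))  # ['dn1', 'dn2', 'dn3']
```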
17. Hadoop – Architecture
[Diagram: the user submits jobs to the JobTracker, which drives the TaskTrackers; the NameNode manages the DataNodes]
18. Hadoop – Terminology
• JobTracker (Master)
– One JobTracker per cluster
– Accepts job requests from users
– Schedules Map and Reduce tasks for TaskTrackers
– Monitors the status of tasks and TaskTrackers
– Re-executes tasks on failure
• TaskTracker (Slave)
– Multiple TaskTrackers in a cluster
– Run Map and Reduce tasks
19. MapReduce – Flow
Input Data → Map → Shuffle + Sort → Reduce → Output Data
(Several Map tasks feed the Shuffle + Sort phase, which feeds one or more Reduce tasks)
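The whole flow fits in a few lines of Python for word count (a sketch of the model, not Hadoop's implementation):

```python
from collections import defaultdict

def map_reduce_flow(documents):
    # Map: emit a (word, 1) pair for every word in the input
    pairs = [(word, 1) for doc in documents for word in doc.split()]
    # Shuffle + Sort: group the pairs by key
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    # Reduce: aggregate each key's values
    return {word: sum(counts) for word, counts in sorted(groups.items())}

print(map_reduce_flow(["a b a", "b c", "a"]))  # {'a': 3, 'b': 2, 'c': 1}
```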
20. Word Count
Hadoop’s HelloWorld
21. Word Count Example
• Input
– Text files
• Output
– A single file containing (Word <TAB> Count) lines
• Map Phase
– Generates (Word, Count) pairs
– Per-mapper outputs: [{a,1}, {b,1}, {a,1}], [{a,2}, {b,3}, {c,5}], [{a,3}, {b,1}, {c,1}]
• Reduce Phase
– For each word, calculates the aggregate count
– [{a,7}, {b,5}, {c,6}]
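The reduce phase's arithmetic above checks out; summing the three mapper outputs:

```python
from collections import Counter

mapper_outputs = [
    [("a", 1), ("b", 1), ("a", 1)],
    [("a", 2), ("b", 3), ("c", 5)],
    [("a", 3), ("b", 1), ("c", 1)],
]

totals = Counter()
for output in mapper_outputs:
    for word, count in output:
        totals[word] += count

print(dict(totals))  # {'a': 7, 'b': 5, 'c': 6}
```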
22. Word Count – Mapper
public class WordCountMapper extends MapReduceBase implements
    Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> out, Reporter reporter)
      throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      out.collect(word, one); // emit (word, 1) for every token
    }
  }
}
23. Word Count – Reducer
public class WordCountReducer extends MapReduceBase implements
    Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> out, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get(); // add up the 1s emitted for this word
    }
    out.collect(key, new IntWritable(sum));
  }
}
24. Word Count – Config
public class WordCountConfig {
  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.exit(1);
    }
    JobConf conf = new JobConf(WordCountConfig.class);
    conf.setJobName("Word Counter");
    FileInputFormat.addInputPath(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    conf.setMapperClass(WordCountMapper.class);
    conf.setCombinerClass(WordCountReducer.class); // reducer doubles as combiner
    conf.setReducerClass(WordCountReducer.class);
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);
    JobClient.runJob(conf);
  }
}
25. Diving Deeper
• http://hadoop.apache.org/
• Jeffrey Dean and Sanjay Ghemawat, "MapReduce:
Simplified Data Processing on Large Clusters"
• Tom White, Hadoop: The Definitive Guide, O'Reilly
• Setting up a Single-Node Cluster: http://bit.ly/glNzs4
• Setting up a Multi-Node Cluster: http://bit.ly/f5KqCP
26. Catching-Up
• Follow me on twitter @mohitsoni
• http://mohitsoni.com/