MapReduce
• Inspired by the functional operations Map and Reduce
• Functional operations do not modify data; they generate new data
• The original data remains unmodified
BarCamp Chennai - 5 Mohit Soni
Functional Operations

Map:
    def sqr(n):
        return n * n

    list = [1, 2, 3, 4]
    map(sqr, list) -> [1, 4, 9, 16]

Reduce:
    def add(i, j):
        return i + j

    list = [1, 2, 3, 4]
    reduce(add, list) -> 10

MapReduce:
    def MapReduce(data, mapper, reducer):
        return reduce(reducer, map(mapper, data))

    MapReduce(list, sqr, add) -> 30

(Python code)
What is Hadoop?
• A framework for large-scale data processing
• Based on Google’s MapReduce and GFS
• An Apache Software Foundation project
• Open Source!
• Oh, btw: written in Java
Why Hadoop?
• Need to process lots of data (petabyte scale)
• Need to parallelize processing across a multitude of CPUs
• Achieves the above while Keeping It Simple
• Gives scalability with low-cost commodity hardware
When to use (and not use) Hadoop?

Hadoop is a good choice for:
• Indexing data
• Log analysis
• Image manipulation
• Sorting large-scale data
• Data mining

Hadoop is not a good choice:
• For real-time processing
• For processing-intensive tasks with little data
• If you have a Jaguar or a RoadRunner (the supercomputers) in stock
HDFS – Overview
• Hadoop Distributed File System
• Based on Google’s GFS (Google File System)
• Write-once, read-many access model
• Fault tolerant
• Efficient for batch processing
HDFS – Blocks

[Diagram: Input Data split into Block 1, Block 2, Block 3]

• HDFS splits input data into blocks
• Block size in HDFS: 64/128 MB (configurable)
• Block size on *nix filesystems: 4 KB
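The block math above can be sketched in a few lines of plain Python (illustrative only; `split_into_blocks` is not a Hadoop API):

```python
import math

def split_into_blocks(file_size, block_size=64 * 1024 * 1024):
    """Return the number of HDFS-style blocks for a file and the
    size of the (possibly partial) last block, in bytes."""
    num_blocks = math.ceil(file_size / block_size)
    last_block = file_size - (num_blocks - 1) * block_size
    return num_blocks, last_block

# A 200 MB file with 64 MB blocks -> 4 blocks, the last holding 8 MB
print(split_into_blocks(200 * 1024 * 1024))  # (4, 8388608)
```

Note that the last block only occupies as much space as it actually holds, unlike the fixed 4 KB blocks of a local filesystem.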
HDFS – Replication

[Diagram: each block replicated on multiple nodes]

• Blocks are replicated across nodes to handle hardware failure
• Node failure is handled gracefully, without loss of data
HDFS – NameNode
• NameNode (Master)
  – Manages filesystem metadata
  – Manages replication of blocks
  – Manages read/write access to files
• Metadata
  – List of files
  – List of blocks that constitute a file
  – List of DataNodes on which blocks reside, etc.
• Single point of failure (a candidate for spending $$)
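A toy sketch of what that metadata looks like, as plain Python dictionaries (the file, block, and node names are made up for illustration; this is not Hadoop's actual data structure):

```python
# Toy model of NameNode metadata: which blocks make up a file,
# and which DataNodes hold a replica of each block.
file_to_blocks = {
    "/logs/day1.log": ["blk_1", "blk_2", "blk_3"],
}
block_to_datanodes = {
    "blk_1": ["dn1", "dn2", "dn3"],  # replication factor 3
    "blk_2": ["dn2", "dn3", "dn4"],
    "blk_3": ["dn1", "dn3", "dn5"],
}

def locate(path):
    """For each block of a file, list the DataNodes a client can read it from."""
    return [(blk, block_to_datanodes[blk]) for blk in file_to_blocks[path]]

for blk, nodes in locate("/logs/day1.log"):
    print(blk, "->", nodes)
```

The NameNode answers exactly this kind of lookup for clients; the clients then fetch the block contents directly from the DataNodes.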
HDFS – DataNode
• DataNode (Slave)
  – Stores the actual data
  – Manages data blocks
  – Informs the NameNode about the block IDs it stores
  – Clients read/write data blocks from/to DataNodes
  – Performs block replication as instructed by the NameNode
• Block Replication
  – Supports various pluggable replication strategies
  – Clients read blocks from the nearest DataNode
• Data Pipelining
  – The client writes a block to the first DataNode
  – The first DataNode forwards the data to the next DataNode in the pipeline
  – When the block is replicated across all replicas, the next block is chosen
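The data-pipelining steps above can be mimicked in a small Python sketch (a simulation of the idea, not how HDFS is implemented):

```python
def pipeline_write(block, datanodes):
    """Simulate HDFS data pipelining: the client hands the block to the
    first DataNode, and each DataNode forwards it to the next one in the
    chain until every replica holds a copy."""
    stored_on = []
    for node in datanodes:              # client -> dn1 -> dn2 -> dn3
        node["blocks"].append(block)    # this node stores the block...
        stored_on.append(node["name"])  # ...then forwards it downstream
    return stored_on

nodes = [{"name": n, "blocks": []} for n in ("dn1", "dn2", "dn3")]
print(pipeline_write("blk_1", nodes))  # ['dn1', 'dn2', 'dn3']
```

Pipelining means the client only ships each block once; the DataNodes do the fan-out among themselves.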
Hadoop - Architecture

[Diagram: a User submits jobs to the JobTracker, which drives the TaskTrackers; the NameNode manages the DataNodes]
Hadoop - Terminology
• JobTracker (Master)
  – One JobTracker per cluster
  – Accepts job requests from users
  – Schedules Map and Reduce tasks on TaskTrackers
  – Monitors task and TaskTracker status
  – Re-executes tasks on failure
• TaskTracker (Slave)
  – Multiple TaskTrackers per cluster
  – Run Map and Reduce tasks
MapReduce – Flow

[Diagram: Input → Map → Shuffle + Sort → Reduce → Output; the input data fans out to multiple Map tasks, whose output is shuffled and sorted, then aggregated by the Reduce tasks into the output]
Word Count Example
• Input
  – Text files
• Output
  – Single file containing (Word <TAB> Count) lines
• Map Phase
  – Generates (Word, Count) pairs
  – [{a,1}, {b,1}, {a,1}] [{a,2}, {b,3}, {c,5}] [{a,3}, {b,1}, {c,1}]
• Reduce Phase
  – For each word, calculates the aggregate count
  – [{a,7}, {b,5}, {c,6}]
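The same three phases can be simulated in a few lines of plain Python (no Hadoop required; the input lines are made up for illustration):

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every word in a line of text."""
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    """Group all values by key, as the Shuffle + Sort step does."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """For each word, sum up its counts."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["a b a", "a a b b b c c c c c"]
pairs = [p for line in lines for p in map_phase(line)]
print(reduce_phase(shuffle(pairs)))  # {'a': 4, 'b': 4, 'c': 5}
```

In real Hadoop, the shuffle happens across the network between mapper and reducer nodes; the grouping logic is the same.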
Word Count – Mapper

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> out,
                    Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            out.collect(word, one);   // emit (word, 1)
        }
    }
}
Word Count – Reducer

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> out,
                       Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();   // aggregate counts for this word
        }
        out.collect(key, new IntWritable(sum));
    }
}
Word Count – Config

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCountConfig {
    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.exit(1);
        }
        JobConf conf = new JobConf(WordCountConfig.class);
        conf.setJobName("Word Counter");

        FileInputFormat.addInputPath(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        conf.setMapperClass(WordCountMapper.class);
        conf.setCombinerClass(WordCountReducer.class);
        conf.setReducerClass(WordCountReducer.class);

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        JobClient.runJob(conf);
    }
}
Diving Deeper
• http://hadoop.apache.org/
• Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters"
• Tom White, "Hadoop: The Definitive Guide", O’Reilly
• Setting up a single-node cluster: http://bit.ly/glNzs4
• Setting up a multi-node cluster: http://bit.ly/f5KqCP