Hadoop 101

Slide deck from Barcamp Chennai

  Hadoop 101
    Mohit Soni, eBay Inc.
    BarCamp Chennai - 5
  About Me
    • I work as a Software Engineer at eBay
    • Worked on large-scale data processing with eBay Research Labs
  MapReduce
    • Inspired by functional operations
      – Map
      – Reduce
    • Functional operations do not modify data; they generate new data
    • Original data remains unmodified
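  Functional Operations (Python code)
    Map
      def sqr(n):
          return n * n

      nums = [1, 2, 3, 4]
      map(sqr, nums) -> [1, 4, 9, 16]

    Reduce
      def add(i, j):
          return i + j

      nums = [1, 2, 3, 4]
      reduce(add, nums) -> 10

    MapReduce
      # Python 3: reduce lives in functools, and map() returns an iterator,
      # so wrap it in list() to see the values shown above.
      from functools import reduce

      def MapReduce(data, mapper, reducer):
          return reduce(reducer, map(mapper, data))

      MapReduce(nums, sqr, add) -> 30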
  What is Hadoop?
    • Framework for large-scale data processing
    • Based on Google's MapReduce and GFS
    • An Apache Software Foundation project
    • Open Source!
    • Written in Java
    • Oh, btw
  Why Hadoop?
    • Need to process lots of data (petabyte scale)
    • Need to parallelize processing across a multitude of CPUs
    • Achieves the above while KeepIng Software Simple
    • Gives scalability with low-cost commodity hardware
  Hadoop fans
    Source: Hadoop Wiki
  When to use and not use Hadoop?
    Hadoop is a good choice for:
    • Indexing data
    • Log analysis
    • Image manipulation
    • Sorting large-scale data
    • Data mining
    Hadoop is not a good choice:
    • For real-time processing
    • For processing-intensive tasks with little data
    • If you already have a supercomputer such as Jaguar or Roadrunner at hand
  HDFS – Overview
    • Hadoop Distributed File System
    • Based on Google's GFS (Google File System)
    • Write-once, read-many access model
    • Fault tolerant
    • Efficient for batch processing
  HDFS – Blocks
    [Diagram: Input Data split into Block 1, Block 2, Block 3]
    • HDFS splits input data into blocks
    • Block size in HDFS: 64/128 MB (configurable)
    • Block size in *nix filesystems: 4 KB
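To make the block arithmetic concrete, here is a minimal Python sketch of cutting input into fixed-size blocks. The helper names `split_into_blocks` and `block_count` are made up for illustration and are not part of any Hadoop API; real HDFS splits files on the cluster, not in client memory.

```python
from typing import List


def split_into_blocks(data: bytes, block_size: int) -> List[bytes]:
    """Cut data into consecutive blocks of at most block_size bytes."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def block_count(file_size: int, block_size: int) -> int:
    """Number of blocks needed for file_size bytes (ceiling division)."""
    return -(-file_size // block_size)


MB = 1024 * 1024
# A 1 GB file needs 16 blocks of 64 MB, or 8 blocks of 128 MB.
print(block_count(1024 * MB, 64 * MB))   # 16
print(block_count(1024 * MB, 128 * MB))  # 8
# Toy run with a tiny block size so the result is easy to read.
print(split_into_blocks(b"abcdefghij", 4))  # [b'abcd', b'efgh', b'ij']
```

Note that the last block may be shorter than the configured block size.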
  HDFS – Replication
    [Diagram: two copies each of Block 1, Block 2, and Block 3]
    • Blocks are replicated across nodes to handle hardware failure
    • Node failure is handled gracefully, without loss of data
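The following sketch illustrates why replication protects against node failure. It assumes a simple round-robin placement, which is a stand-in chosen for brevity and not the policy HDFS actually uses; `place_replicas` and `surviving_replicas` are hypothetical helper names.

```python
def place_replicas(block_ids, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Illustrative policy only; it is not the placement HDFS really performs.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for i, block in enumerate(block_ids):
        placement[block] = [nodes[(i + r) % len(nodes)]
                            for r in range(replication)]
    return placement


def surviving_replicas(placement, failed_node):
    """Replica locations that remain after one node fails."""
    return {block: [n for n in holders if n != failed_node]
            for block, holders in placement.items()}


placement = place_replicas(["block1", "block2", "block3"],
                           ["node1", "node2", "node3", "node4"],
                           replication=2)
after_failure = surviving_replicas(placement, "node2")
# Every block still has at least one live replica, so no data is lost.
print(all(holders for holders in after_failure.values()))  # True
```

With two or more replicas per block on distinct nodes, losing any single node always leaves a readable copy.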
  HDFS – Architecture
    [Diagram: Client, NameNode, and a cluster of DataNodes]
  HDFS – NameNode
    • NameNode (Master)
      – Manages filesystem metadata
      – Manages replication of blocks
      – Manages read/write access to files
    • Metadata
      – List of files
      – List of blocks that constitute a file
      – List of DataNodes on which blocks reside, etc.
    • Single Point of Failure (candidate for spending $$)
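One way to picture the NameNode's metadata is as two lookup tables: file to blocks, and block to DataNodes. The class below is a toy in-memory model under that assumption; `NameNodeMetadata` and its method names are invented for illustration and do not mirror Hadoop's internal classes.

```python
class NameNodeMetadata:
    """Toy in-memory model of the metadata lists from the slide."""

    def __init__(self):
        self.file_blocks = {}      # file path -> ordered list of block IDs
        self.block_locations = {}  # block ID -> set of DataNode names

    def add_file(self, path, block_ids):
        """Register a file and the blocks that make it up."""
        self.file_blocks[path] = list(block_ids)
        for block_id in block_ids:
            self.block_locations.setdefault(block_id, set())

    def report_block(self, datanode, block_id):
        """Record that a DataNode holds a copy of a block."""
        self.block_locations.setdefault(block_id, set()).add(datanode)

    def locate(self, path):
        """For each block of the file, in order, list the DataNodes holding it."""
        return [(b, sorted(self.block_locations[b]))
                for b in self.file_blocks[path]]


meta = NameNodeMetadata()
meta.add_file("/logs/day1.txt", ["blk_1", "blk_2"])
meta.report_block("datanode-a", "blk_1")
meta.report_block("datanode-b", "blk_1")
meta.report_block("datanode-b", "blk_2")
print(meta.locate("/logs/day1.txt"))
# [('blk_1', ['datanode-a', 'datanode-b']), ('blk_2', ['datanode-b'])]
```

Because all of this lives in one place, losing it makes every file unreachable, which is why the slide calls the NameNode a single point of failure.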
  HDFS – DataNode
    • DataNode (Slave)
      – Contains actual data
      – Manages data blocks
      – Informs NameNode about the block IDs it stores
      – Clients read/write data blocks from DataNodes
      – Performs block replication as instructed by NameNode
    • Block Replication
      – Supports various pluggable replication strategies
      – Clients read blocks from the nearest DataNode
    • Data Pipelining
      – Client writes a block to the first DataNode
      – First DataNode forwards data to the next DataNode in the pipeline
      – When the block is replicated across all replicas, the next block is chosen
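The data pipelining steps can be sketched as a chain of forwarding calls. This is a simplified model that moves whole blocks in one call and ignores acknowledgements and failures; the `DataNode` class and `client_write` function here are illustrative names, not Hadoop code.

```python
class DataNode:
    """Toy DataNode that stores blocks in a dict."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def write(self, block_id, data, downstream):
        """Store the block locally, then forward it along the rest of the pipeline."""
        self.blocks[block_id] = data
        if downstream:
            downstream[0].write(block_id, data, downstream[1:])


def client_write(blocks, pipeline):
    """Write blocks one at a time; the client talks only to the first DataNode."""
    for block_id, data in blocks:
        # write() returns only after every node in the pipeline has stored
        # the block, so the next block starts once this one is fully replicated.
        pipeline[0].write(block_id, data, pipeline[1:])


nodes = [DataNode("dn1"), DataNode("dn2"), DataNode("dn3")]
client_write([("blk_1", b"hello"), ("blk_2", b"world")], nodes)
print([sorted(n.blocks) for n in nodes])
# [['blk_1', 'blk_2'], ['blk_1', 'blk_2'], ['blk_1', 'blk_2']]
```

The point of the design is that the client sends each block once, and the DataNodes share the work of copying it to the other replicas.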
  Hadoop - Architecture
    [Diagram: User, JobTracker, two TaskTrackers, NameNode, six DataNodes]
  Hadoop - Terminology
    • JobTracker (Master)
      – 1 JobTracker per cluster
      – Accepts job requests from users
      – Schedules Map and Reduce tasks for TaskTrackers
      – Monitors the status of tasks and TaskTrackers
      – Re-executes tasks on failure
    • TaskTracker (Slave)
      – Multiple TaskTrackers in a cluster
      – Run Map and Reduce tasks
  MapReduce – Flow
    Input -> Map -> Shuffle + Sort -> Reduce -> Output
    [Diagram: input data flows through three Map tasks, then Shuffle + Sort, into two Reduce tasks that produce the output data]
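The flow above can be imitated on a single machine in a few lines of Python. This is a sketch under the assumption that all data fits in memory; `run_mapreduce`, `length_mapper`, and `collect_reducer` are hypothetical names chosen for the example, and nothing here is distributed.

```python
from itertools import groupby
from operator import itemgetter


def run_mapreduce(records, mapper, reducer):
    """Run map, shuffle + sort, and reduce over an in-memory list of records."""
    # Map: each input record yields zero or more (key, value) pairs.
    pairs = [pair for record in records for pair in mapper(record)]
    # Shuffle + sort: order pairs by key so equal keys become adjacent.
    pairs.sort(key=itemgetter(0))
    # Reduce: call the reducer once per key with all of that key's values.
    return [reducer(key, [value for _, value in group])
            for key, group in groupby(pairs, key=itemgetter(0))]


def length_mapper(word):
    """Key each word by its length."""
    yield (len(word), word)


def collect_reducer(length, words):
    """Gather all words of one length into a sorted list."""
    return (length, sorted(words))


print(run_mapreduce(["pig", "hive", "hdfs", "map"],
                    length_mapper, collect_reducer))
# [(3, ['map', 'pig']), (4, ['hdfs', 'hive'])]
```

The sort step is what guarantees that each reducer call sees every value for its key, which is the job Shuffle + Sort does between the Map and Reduce tasks.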
  Word Count
    Hadoop's HelloWorld
  Word Count Example
    • Input
      – Text files
    • Output
      – Single file containing (Word <TAB> Count)
    • Map Phase
      – Generates (Word, Count) pairs
      – [{a,1}, {b,1}, {a,1}] [{a,2}, {b,3}, {c,5}] [{a,3}, {b,1}, {c,1}]
    • Reduce Phase
      – For each word, calculates aggregate
      – [{a,7}, {b,5}, {c,6}]
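As a quick check of the numbers on this slide, the two phases can be sketched in plain Python. The function names `map_phase` and `reduce_phase` are illustrative only, and the three lists of pairs are copied from the slide.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every whitespace-separated token."""
    return [(word, 1) for word in line.split()]


def reduce_phase(mapper_outputs):
    """Sum the counts for each word across all mapper outputs."""
    totals = defaultdict(int)
    for pairs in mapper_outputs:
        for word, count in pairs:
            totals[word] += count
    return dict(sorted(totals.items()))


print(map_phase("a b a"))  # [('a', 1), ('b', 1), ('a', 1)]

# The three lists of (Word, Count) pairs shown on the slide.
slide_pairs = [
    [("a", 1), ("b", 1), ("a", 1)],
    [("a", 2), ("b", 3), ("c", 5)],
    [("a", 3), ("b", 1), ("c", 1)],
]
print(reduce_phase(slide_pairs))  # {'a': 7, 'b': 5, 'c': 6}
```

The totals match the Reduce Phase result on the slide: a appears 7 times, b 5 times, and c 6 times.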
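  Word Count – Mapper

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class WordCountMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        // Emits (token, 1) for every whitespace-separated token in the line.
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> out,
                        Reporter reporter) throws IOException {
            String l = value.toString();
            StringTokenizer t = new StringTokenizer(l);
            while (t.hasMoreTokens()) {
                word.set(t.nextToken());
                out.collect(word, one);
            }
        }
    }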
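  Word Count – Reducer

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class WordCountReducer extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {

        // Sums all counts received for one word and emits (word, total).
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> out,
                           Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            out.collect(key, new IntWritable(sum));
        }
    }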
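  Word Count – Config

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.TextInputFormat;
    import org.apache.hadoop.mapred.TextOutputFormat;

    public class WordCountConfig {
        public static void main(String[] args) throws Exception {
            // Expects exactly two arguments: input path and output path.
            if (args.length != 2) {
                System.exit(1);
            }
            JobConf conf = new JobConf(WordCountConfig.class);
            conf.setJobName("Word Counter");
            FileInputFormat.addInputPath(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));
            conf.setMapperClass(WordCountMapper.class);
            // The reducer doubles as a combiner because summing is associative.
            conf.setCombinerClass(WordCountReducer.class);
            conf.setReducerClass(WordCountReducer.class);
            // Output types must match what the mapper and reducer emit.
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);
            conf.setInputFormat(TextInputFormat.class);
            conf.setOutputFormat(TextOutputFormat.class);
            JobClient.runJob(conf);
        }
    }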
  Diving Deeper
    • Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters
    • Tom White, Hadoop: The Definitive Guide, O'Reilly
    • Setting up a Single-Node Cluster:
    • Setting up a Multi-Node Cluster:
  Catching-Up
    • Follow me on Twitter @mohitsoni