
Hadoop 101

Slide deck from Barcamp Chennai



  1. Hadoop 101
     Mohit Soni, eBay Inc.
     BarCamp Chennai - 5
  2. About Me
     • I work as a Software Engineer at eBay
     • Worked on large-scale data processing with eBay Research Labs
  3. First Things First
  4. MapReduce
     • Inspired by functional operations
       – Map
       – Reduce
     • Functional operations do not modify data; they generate new data
     • Original data remains unmodified
  5. Functional Operations
     Map:
        def sqr(n):
            return n * n
        list = [1,2,3,4]
        map(sqr, list) -> [1,4,9,16]
     Reduce:
        def add(i, j):
            return i + j
        list = [1,2,3,4]
        reduce(add, list) -> 10
     MapReduce:
        def MapReduce(data, mapper, reducer):
            return reduce(reducer, map(mapper, data))
        MapReduce(list, sqr, add) -> 30
     (Python code)
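The snippets on slide 5 target Python 2, where map() returns a list and reduce() is a builtin. A runnable Python 3 equivalent of the same example (reduce now lives in functools, and map() returns a lazy iterator):

```python
from functools import reduce  # in Python 3, reduce moved to functools


def sqr(n):
    return n * n


def add(i, j):
    return i + j


def MapReduce(data, mapper, reducer):
    # map() is lazy in Python 3; reduce folds its output to a single value
    return reduce(reducer, map(mapper, data))


nums = [1, 2, 3, 4]
print(list(map(sqr, nums)))        # [1, 4, 9, 16]
print(reduce(add, nums))           # 10
print(MapReduce(nums, sqr, add))   # 30
```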
  6. (image-only slide)
  7. What is Hadoop?
     • Framework for large-scale data processing
     • Based on Google’s MapReduce and GFS
     • An Apache Software Foundation project
     • Open Source!
     • Written in Java
     • Oh, btw
  8. Why Hadoop?
     • Need to process lots of data (petabyte scale)
     • Need to parallelize processing across a multitude of CPUs
     • Achieves the above while KeepIng Software Simple
     • Gives scalability with low-cost commodity hardware
  9. Hadoop fans
     (image slide; source: Hadoop Wiki)
  10. When to use and not use Hadoop?
      Hadoop is a good choice for:
      • Indexing data
      • Log analysis
      • Image manipulation
      • Sorting large-scale data
      • Data mining
      Hadoop is not a good choice:
      • For real-time processing
      • For processing-intensive tasks with little data
      • If you have a Jaguar or RoadRunner in your stock
  11. HDFS – Overview
      • Hadoop Distributed File System
      • Based on Google’s GFS (Google File System)
      • Write-once, read-many access model
      • Fault tolerant
      • Efficient for batch processing
  12. HDFS – Blocks
      (diagram: input data split into Block 1, Block 2, Block 3)
      • HDFS splits input data into blocks
      • Block size in HDFS: 64/128 MB (configurable)
      • Typical block size on *nix filesystems: 4 KB
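As a quick sanity check on those numbers, the block count for a file is just ceiling division of the file size by the block size. A sketch of the arithmetic (plain Python, not a Hadoop API call):

```python
def num_blocks(file_size_bytes, block_size_bytes=64 * 1024 * 1024):
    # ceiling division: the last block may be smaller than the block size
    return -(-file_size_bytes // block_size_bytes)


one_gb = 1024 ** 3
print(num_blocks(one_gb))                      # 16 blocks at 64 MB
print(num_blocks(one_gb, 128 * 1024 * 1024))   # 8 blocks at 128 MB
```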
  13. HDFS – Replication
      (diagram: blocks replicated across nodes)
      • Blocks are replicated across nodes to handle hardware failure
      • Node failure is handled gracefully, without loss of data
  14. HDFS – Architecture
      (diagram: Client, NameNode, and a cluster of DataNodes)
  15. HDFS – NameNode
      • NameNode (Master)
        – Manages filesystem metadata
        – Manages replication of blocks
        – Manages read/write access to files
      • Metadata
        – List of files
        – List of blocks that constitute a file
        – List of DataNodes on which blocks reside, etc.
      • Single Point of Failure (candidate for spending $$)
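The metadata the NameNode holds boils down to two mappings: file path to ordered block list, and block to replica locations. A toy sketch with hypothetical names (plain Python dicts, heavily simplified compared to the real in-memory structures):

```python
# Hypothetical, heavily simplified model of NameNode metadata.
namespace = {
    # file path -> ordered list of block IDs
    "/logs/day1.log": ["blk_1", "blk_2"],
}
block_locations = {
    # block ID -> DataNodes holding a replica (default replication: 3)
    "blk_1": ["datanode-a", "datanode-b", "datanode-c"],
    "blk_2": ["datanode-b", "datanode-c", "datanode-d"],
}


def locate(path):
    # A client read starts by asking the NameNode where each block lives;
    # the block data itself is then fetched from the DataNodes directly.
    return [(blk, block_locations[blk]) for blk in namespace[path]]


print(locate("/logs/day1.log"))
```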
  16. HDFS – DataNode
      • DataNode (Slave)
        – Contains actual data
        – Manages data blocks
        – Informs NameNode about block IDs stored
        – Clients read/write data blocks from/to DataNodes
        – Performs block replication as instructed by NameNode
      • Block Replication
        – Supports various pluggable replication strategies
        – Clients read blocks from the nearest DataNode
      • Data Pipelining
        – Client writes a block to the first DataNode
        – First DataNode forwards the data to the next DataNode in the pipeline
        – When the block is replicated across all replicas, the next block is chosen
  17. Hadoop – Architecture
      (diagram: User submits jobs to the JobTracker, which drives TaskTrackers;
       the NameNode manages the DataNodes)
  18. Hadoop – Terminology
      • JobTracker (Master)
        – 1 JobTracker per cluster
        – Accepts job requests from users
        – Schedules Map and Reduce tasks for TaskTrackers
        – Monitors task and TaskTracker status
        – Re-executes tasks on failure
      • TaskTracker (Slave)
        – Multiple TaskTrackers in a cluster
        – Run Map and Reduce tasks
  19. MapReduce – Flow
      Input -> Map -> Shuffle + Sort -> Reduce -> Output
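The Shuffle + Sort step between Map and Reduce can be sketched in a few lines of Python: every value emitted for the same key is grouped together, and keys reach the reducers in sorted order (a sketch of the semantics, not of the distributed implementation):

```python
from collections import defaultdict


def shuffle_sort(mapped_pairs):
    # Group every value emitted for the same key, then sort by key,
    # as the framework does between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return sorted(groups.items())


pairs = [("b", 1), ("a", 1), ("b", 1), ("a", 1), ("c", 1)]
print(shuffle_sort(pairs))  # [('a', [1, 1]), ('b', [1, 1]), ('c', [1])]
```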
  20. Word Count – Hadoop’s “Hello World”
  21. Word Count Example
      • Input – Text files
      • Output – Single file containing (Word <TAB> Count)
      • Map Phase
        – Generates (Word, Count) pairs
        – [{a,1}, {b,1}, {a,1}] [{a,2}, {b,3}, {c,5}] [{a,3}, {b,1}, {c,1}]
      • Reduce Phase
        – For each word, calculates the aggregate count
        – [{a,7}, {b,5}, {c,6}]
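The whole pipeline above can be simulated in plain Python (a sketch of the semantics, not of Hadoop's distributed execution); the input lines here are chosen so the totals match the slide's [{a,7}, {b,5}, {c,6}]:

```python
from collections import defaultdict


def wc_map(line):
    # Map phase: emit (word, 1) for every word in the line
    return [(word, 1) for word in line.split()]


def wc_reduce(word, counts):
    # Reduce phase: aggregate all counts for one word
    return (word, sum(counts))


lines = ["a b a", "a a b b b c c c c c", "a a a b c"]
pairs = [pair for line in lines for pair in wc_map(line)]

groups = defaultdict(list)           # shuffle: group values by key
for word, count in pairs:
    groups[word].append(count)

result = dict(wc_reduce(w, c) for w, c in sorted(groups.items()))
print(result)  # {'a': 7, 'b': 5, 'c': 6}
```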
  22. Word Count – Mapper

      public class WordCountMapper extends MapReduceBase
              implements Mapper<LongWritable, Text, Text, IntWritable> {
          private final static IntWritable one = new IntWritable(1);
          private Text word = new Text();

          public void map(LongWritable key, Text value,
                          OutputCollector<Text, IntWritable> out,
                          Reporter reporter) throws IOException {
              String line = value.toString();
              StringTokenizer tokenizer = new StringTokenizer(line);
              while (tokenizer.hasMoreTokens()) {
                  word.set(tokenizer.nextToken());
                  out.collect(word, one);
              }
          }
      }
  23. Word Count – Reducer

      public class WordCountReducer extends MapReduceBase
              implements Reducer<Text, IntWritable, Text, IntWritable> {
          public void reduce(Text key, Iterator<IntWritable> values,
                             OutputCollector<Text, IntWritable> out,
                             Reporter reporter) throws IOException {
              int sum = 0;
              while (values.hasNext()) {
                  sum += values.next().get();
              }
              out.collect(key, new IntWritable(sum));
          }
      }
  24. Word Count – Config

      public class WordCountConfig {
          public static void main(String[] args) throws Exception {
              if (args.length != 2) {
                  System.exit(1);
              }
              JobConf conf = new JobConf(WordCountConfig.class);
              conf.setJobName("Word Counter");
              FileInputFormat.addInputPath(conf, new Path(args[0]));
              FileOutputFormat.setOutputPath(conf, new Path(args[1]));
              conf.setMapperClass(WordCountMapper.class);
              conf.setCombinerClass(WordCountReducer.class);
              conf.setReducerClass(WordCountReducer.class);
              conf.setInputFormat(TextInputFormat.class);
              conf.setOutputFormat(TextOutputFormat.class);
              JobClient.runJob(conf);
          }
      }
  25. Diving Deeper
      • Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters
      • Tom White, Hadoop: The Definitive Guide, O’Reilly
      • Setting up a Single-Node Cluster:
      • Setting up a Multi-Node Cluster:
  26. Catching Up
      • Follow me on twitter @mohitsoni