Hadoop 101

  1. Hadoop 101 Mohit Soni eBay Inc. BarCamp Chennai - 5 Mohit Soni
  2. About Me
     • I work as a Software Engineer at eBay
     • Worked on large-scale data processing with eBay Research Labs
  3. First Things First
  4. MapReduce
     • Inspired by the functional operations map and reduce
     • Functional operations do not modify data; they generate new data
     • Original data remains unmodified
  5. Functional Operations (Python code)
     Map:
         def sqr(n):
             return n * n
         list = [1,2,3,4]
         map(sqr, list) -> [1,4,9,16]
     Reduce:
         def add(i, j):
             return i + j
         list = [1,2,3,4]
         reduce(add, list) -> 10
     MapReduce:
         def MapReduce(data, mapper, reducer):
             return reduce(reducer, map(mapper, data))
         MapReduce(list, sqr, add) -> 30
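The slide's snippet uses Python 2 semantics. A runnable Python 3 version (a sketch: `reduce` now lives in `functools`, `map` returns a lazy iterator, and `data`/`map_reduce` are renamed here to avoid shadowing the `list` builtin) might look like:

```python
from functools import reduce

def sqr(n):
    return n * n

def add(i, j):
    return i + j

data = [1, 2, 3, 4]

print(list(map(sqr, data)))  # [1, 4, 9, 16]
print(reduce(add, data))     # 10

def map_reduce(data, mapper, reducer):
    # Compose the two functional operations: map every element, then fold.
    return reduce(reducer, map(mapper, data))

print(map_reduce(data, sqr, add))  # 30 = 1 + 4 + 9 + 16
```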
  7. What is Hadoop?
     • Framework for large-scale data processing
     • Based on Google's MapReduce and GFS
     • An Apache Software Foundation project
     • Open Source!
     • Written in Java
  8. Why Hadoop?
     • Need to process lots of data (petabyte scale)
     • Need to parallelize processing across a multitude of CPUs
     • Achieves the above while keeping software simple (KISS)
     • Gives scalability with low-cost commodity hardware
  9. Hadoop fans (Source: Hadoop Wiki)
  10. When to use and when not to use Hadoop?
      Hadoop is a good choice for:
      • Indexing data
      • Log analysis
      • Image manipulation
      • Sorting large-scale data
      • Data mining
      Hadoop is not a good choice:
      • For real-time processing
      • For processing-intensive tasks with little data
      • If you have Jaguar or RoadRunner in stock
  11. HDFS – Overview
      • Hadoop Distributed File System
      • Based on Google's GFS (Google File System)
      • Write-once, read-many access model
      • Fault tolerant
      • Efficient for batch processing
  12. HDFS – Blocks
      (Diagram: Input Data split into Block 1, Block 2, Block 3)
      • HDFS splits input data into blocks
      • Block size in HDFS: 64/128MB (configurable)
      • Typical block size on *nix filesystems: 4KB
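The splitting step can be sketched in a few lines of plain Python (a toy illustration, not HDFS code; an 8-byte block size stands in for the real 64/128MB):

```python
def split_into_blocks(data: bytes, block_size: int):
    """Split a byte string into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

# 18 bytes with an 8-byte block size -> 3 blocks of sizes 8, 8, and 2.
blocks = split_into_blocks(b"hello hadoop world", 8)
print(blocks)  # [b'hello ha', b'doop wor', b'ld']
```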
  13. HDFS – Replication
      (Diagram: Blocks 1–3 replicated across multiple nodes)
      • Blocks are replicated across nodes to handle hardware failure
      • Node failure is handled gracefully, without loss of data
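A minimal sketch of replica placement, assuming a simple round-robin policy (HDFS's real rack-aware placement is more sophisticated; the node and block names here are hypothetical):

```python
def place_replicas(block_ids, nodes, replication=3):
    """Toy placement: each block goes to `replication` distinct nodes, round-robin.

    If any one node fails, every block still has replicas elsewhere.
    """
    placement = {}
    for i, block in enumerate(block_ids):
        placement[block] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement

nodes = ["node1", "node2", "node3", "node4"]
print(place_replicas(["blk_1", "blk_2"], nodes))
# {'blk_1': ['node1', 'node2', 'node3'], 'blk_2': ['node2', 'node3', 'node4']}
```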
  14. HDFS – Architecture
      (Diagram: Client ↔ NameNode; Cluster of DataNodes)
  15. HDFS – NameNode
      • NameNode (Master)
        – Manages filesystem metadata
        – Manages replication of blocks
        – Manages read/write access to files
      • Metadata
        – List of files
        – List of blocks that constitute a file
        – List of DataNodes on which blocks reside, etc.
      • Single Point of Failure (candidate for spending $$)
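The metadata the NameNode keeps in memory can be pictured as two mappings: file → blocks and block → DataNodes. A toy sketch (all paths, block IDs, and node names here are hypothetical):

```python
# Toy in-memory metadata, mirroring the lists on the slide.
namenode_metadata = {
    "files": {  # file path -> ordered list of block IDs
        "/logs/access.log": ["blk_1", "blk_2", "blk_3"],
    },
    "blocks": {  # block ID -> DataNodes holding a replica
        "blk_1": ["dn1", "dn2", "dn3"],
        "blk_2": ["dn2", "dn3", "dn4"],
        "blk_3": ["dn1", "dn3", "dn4"],
    },
}

def datanodes_for_file(meta, path):
    """Resolve a file to the DataNodes a client must contact, block by block."""
    return [meta["blocks"][block_id] for block_id in meta["files"][path]]

print(datanodes_for_file(namenode_metadata, "/logs/access.log"))
```

Only this metadata lives on the NameNode; the block contents themselves stay on the DataNodes, which is why losing the NameNode (but not its metadata) is the single point of failure.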
  16. HDFS – DataNode
      • DataNode (Slave)
        – Contains the actual data
        – Manages data blocks
        – Informs the NameNode about the block IDs it stores
        – Clients read/write data blocks from/to DataNodes
        – Performs block replication as instructed by the NameNode
      • Block Replication
        – Supports various pluggable replication strategies
        – Clients read blocks from the nearest DataNode
      • Data Pipelining
        – Client writes a block to the first DataNode
        – The first DataNode forwards the data to the next DataNode in the pipeline
        – When the block is replicated across all replicas, the next block is chosen
  17. Hadoop – Architecture
      (Diagram: User → JobTracker → TaskTrackers; NameNode → DataNodes)
  18. Hadoop – Terminology
      • JobTracker (Master)
        – 1 JobTracker per cluster
        – Accepts job requests from users
        – Schedules Map and Reduce tasks for TaskTrackers
        – Monitors task and TaskTracker status
        – Re-executes tasks on failure
      • TaskTracker (Slave)
        – Multiple TaskTrackers in a cluster
        – Run Map and Reduce tasks
  19. MapReduce – Flow
      Input → Map → Shuffle + Sort → Reduce → Output
  20. Word Count: Hadoop's HelloWorld
  21. Word Count Example
      • Input: text files
      • Output: single file containing (Word <TAB> Count)
      • Map Phase: generates (Word, Count) pairs
        – [{a,1}, {b,1}, {a,1}], [{a,2}, {b,3}, {c,5}], [{a,3}, {b,1}, {c,1}]
      • Reduce Phase: for each word, calculates the aggregate
        – [{a,7}, {b,5}, {c,6}]
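The map → shuffle → reduce flow above can be simulated in a few lines of plain Python (a sketch of the data flow only, not Hadoop code):

```python
from collections import defaultdict

def map_phase(line):
    # Emit (word, 1) for every token, like the Java Mapper on the next slide.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Group values by key, as the framework does between map and reduce.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # Sum the counts for one word.
    return key, sum(values)

lines = ["a b a", "b b c"]
pairs = [kv for line in lines for kv in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'a': 2, 'b': 3, 'c': 1}
```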
  22. Word Count – Mapper

      public class WordCountMapper extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, IntWritable> {

          private final static IntWritable one = new IntWritable(1);
          private Text word = new Text();

          public void map(LongWritable key, Text value,
                          OutputCollector<Text, IntWritable> out,
                          Reporter reporter) throws IOException {
              String line = value.toString();
              StringTokenizer tokenizer = new StringTokenizer(line);
              while (tokenizer.hasMoreTokens()) {
                  word.set(tokenizer.nextToken());
                  out.collect(word, one);
              }
          }
      }
  23. Word Count – Reducer

      public class WordCountReducer extends MapReduceBase
          implements Reducer<Text, IntWritable, Text, IntWritable> {

          public void reduce(Text key, Iterator<IntWritable> values,
                             OutputCollector<Text, IntWritable> out,
                             Reporter reporter) throws IOException {
              int sum = 0;
              while (values.hasNext()) {
                  sum += values.next().get();
              }
              out.collect(key, new IntWritable(sum));
          }
      }
  24. Word Count – Config

      public class WordCountConfig {
          public static void main(String[] args) throws Exception {
              if (args.length != 2) {
                  System.exit(1);
              }
              JobConf conf = new JobConf(WordCountConfig.class);
              conf.setJobName("Word Counter");

              FileInputFormat.addInputPath(conf, new Path(args[0]));
              FileOutputFormat.setOutputPath(conf, new Path(args[1]));

              conf.setMapperClass(WordCountMapper.class);
              conf.setCombinerClass(WordCountReducer.class);
              conf.setReducerClass(WordCountReducer.class);

              conf.setOutputKeyClass(Text.class);
              conf.setOutputValueClass(IntWritable.class);

              conf.setInputFormat(TextInputFormat.class);
              conf.setOutputFormat(TextOutputFormat.class);

              JobClient.runJob(conf);
          }
      }
  25. Diving Deeper
      • http://hadoop.apache.org/
      • Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters
      • Tom White, Hadoop: The Definitive Guide, O'Reilly
      • Setting up a Single-Node Cluster: http://bit.ly/glNzs4
      • Setting up a Multi-Node Cluster: http://bit.ly/f5KqCP
  26. Catching Up
      • Follow me on twitter @mohitsoni
      • http://mohitsoni.com/