
Advanced MapReduce - Apache Hadoop Big Data training by Design Pathshala


Learn Hadoop and Big Data analytics: join Design Pathshala's training programs on big data and analytics.

This deck covers advanced MapReduce concepts in Hadoop and Big Data.

For training queries you can contact us:

Email: admin@designpathshala.com
Call us at: +91 98 188 23045
Visit us at: http://designpathshala.com
Join us at: http://www.designpathshala.com/contact-us
Course details: http://www.designpathshala.com/course/view/65536
Big data Analytics Course details: http://www.designpathshala.com/course/view/1441792
Business Analytics Course details: http://www.designpathshala.com/course/view/196608



  1. 1. Apache Hadoop Design Pathshala April 22, 2014 www.designpathshala.com 1
  2. 2. Apache Hadoop Interacting with HDFS Design Pathshala April 22, 2014 www.designpathshala.com 2
  3. 3. Apache Hadoop Bigdata Training By Design Pathshala Contact us on: admin@designpathshala.com Or Call us at: +91 120 260 5512 or +91 98 188 23045 Visit us at: http://designpathshala.com www.designpathshala.com | +91 120 260 5512 | +91 98 188 23045 | admin@designpathshala.com | http://designpathshala.com 3
  4. 4. Basic file commands  Commands for HDFS users:  hadoop fs -mkdir /foodir  hadoop fs -ls /  hadoop fs -lsr /  hadoop fs -put abc.txt /usr/dp  hadoop fs -get /usr/dp/abc.txt .  hadoop fs -cat /foodir/myfile.txt  hadoop fs -rm /foodir/myfile.txt www.designpathshala.com 4
  5. 5. Reading & Writing Programmatically  The relevant classes live in org.apache.hadoop.fs  Configuration conf = new Configuration();  FileSystem hdfs = FileSystem.get(conf);  FileSystem local = FileSystem.getLocal(conf); www.designpathshala.com 5
  6. 6. (continued on slide 8)
     public static void main(String[] args) throws IOException {
       Configuration conf = new Configuration();
       FileSystem hdfs = FileSystem.get(conf);
       FileSystem local = FileSystem.getLocal(conf);
       Path inputDir = new Path(args[0]);
       Path hdfsFile = new Path(args[1]);
       try {
         FileStatus[] inputFiles = local.listStatus(inputDir);
         FSDataOutputStream out = hdfs.create(hdfsFile);
     www.designpathshala.com 6
  8. 8. (continuation of slide 6)
         for (int i = 0; i < inputFiles.length; i++) {
           System.out.println(inputFiles[i].getPath().getName());
           FSDataInputStream in = local.open(inputFiles[i].getPath());
           byte[] buffer = new byte[256];
           int bytesRead = 0;
           while ((bytesRead = in.read(buffer)) > 0) {
             out.write(buffer, 0, bytesRead);
           }
           in.close();
         }
         out.close();
       } catch (IOException e) {
         e.printStackTrace();
       }
     }
     www.designpathshala.com 8
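For reference, the two code slides above assemble into a single compilable program: copy every file in a local directory into one HDFS file. This is a sketch; the PutMerge class name is an assumption (the slides never name the class), and the 256-byte buffer is kept as shown.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Merges every file in a local directory into a single HDFS file.
    public class PutMerge {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(conf);
        FileSystem local = FileSystem.getLocal(conf);
        Path inputDir = new Path(args[0]);   // local source directory
        Path hdfsFile = new Path(args[1]);   // destination file on HDFS
        try {
          FileStatus[] inputFiles = local.listStatus(inputDir);
          FSDataOutputStream out = hdfs.create(hdfsFile);
          for (int i = 0; i < inputFiles.length; i++) {
            System.out.println(inputFiles[i].getPath().getName());
            FSDataInputStream in = local.open(inputFiles[i].getPath());
            byte[] buffer = new byte[256];
            int bytesRead = 0;
            while ((bytesRead = in.read(buffer)) > 0) {
              out.write(buffer, 0, bytesRead);
            }
            in.close();
          }
          out.close();
        } catch (IOException e) {
          e.printStackTrace();
        }
      }
    }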
  9. 9. Apache Hadoop Map Reduce Basics Design Pathshala April 22, 2014 www.designpathshala.com 9
  10. 10. MapReduce - Dataflow www.designpathshala.com 10
  11. 11. Map-Reduce Execution Engine (Example: Color Count) [Diagram: input blocks on HDFS feed parse-hash Map tasks that produce (k', v') pairs such as (color, 1); shuffle & sort groups them by key into (k, [1,1,1,1,1,1,...]); Reduce tasks consume (k, [v]) and produce (k, v) totals such as (color, 100). Users only provide the "Map" and "Reduce" functions. The output file has 3 parts (Part0001, Part0002, Part0003) on probably 3 different machines.] www.designpathshala.com 11
  12. 12. Apache Hadoop Bigdata Training By Design Pathshala Contact us on: admin@designpathshala.com Or Call us at: +91 120 260 5512 or +91 98 188 23045 Visit us at: http://designpathshala.com www.designpathshala.com | +91 120 260 5512 | +91 98 188 23045 | admin@designpathshala.com | http://designpathshala.com 12
  13. 13. [Diagram: large-scale data splits flow through parse-hash Map tasks emitting <key, 1>; reducers (say, Count) aggregate them into partitioned outputs P-0000/count1, P-0001/count2, P-0002/count3.] www.designpathshala.com 13
  14. 14. Properties of MapReduce Engine (Cont'd)  Task Tracker is the slave node (runs on each datanode)  Receives the task from the Job Tracker  Runs the task until completion (either a map or a reduce task)  Always in communication with the Job Tracker, reporting progress [Diagram: in this example, one MapReduce job consists of 4 map tasks and 3 reduce tasks.] www.designpathshala.com 14
  15. 15. Key-Value Pairs  Mappers and Reducers are users’ code (provided functions)  Just need to obey the Key-Value pairs interface  Mappers:  Consume <key, value> pairs  Produce <key, value> pairs  Reducers:  Consume <key, <list of values>>  Produce <key, value>  Shuffling and Sorting:  Hidden phase between mappers and reducers  Groups all similar keys from all mappers, sorts and passes them to a certain reducer in the form of <key, <list of values>> www.designpathshala.com 15
  17. 17. Example 2: Color Filter  Job: select only the blue colors  Each map task reads input blocks on HDFS, produces (k, v) pairs such as (blue, 1), and writes the selected records directly to HDFS  No reduce phase is needed  The output file has 4 parts (Part0001-Part0004) on probably 4 different machines www.designpathshala.com 17
  18. 18. How does MapReduce work?  The runtime partitions the input and provides it to different Map instances  Map (key, value) → (key', value')  The runtime collects the (key', value') pairs and distributes them to several Reduce functions so that each Reduce function gets the pairs with the same key'  Each Reduce produces a single output file (or none)  Map and Reduce are user-written functions www.designpathshala.com 18
  19. 19. Example 3: Count Fruits  Job: count the occurrences of each fruit in a data set [Diagram: map tasks and reduce tasks.] www.designpathshala.com 19
  20. 20. Word Count Example  Mapper  Input: value: a line of input text  Output: key: word, value: 1  Reducer  Input: key: word, value: set of counts  Output: key: word, value: sum  Launching program  Defines the job  Submits the job to the cluster www.designpathshala.com 20
  22. 22. Example MapReduce: Mapper
     public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
       Text word = new Text();
       public void map(LongWritable key, Text value, Context context)
           throws IOException, InterruptedException {
         String line = value.toString();
         StringTokenizer tokenizer = new StringTokenizer(line);
         while (tokenizer.hasMoreTokens()) {
           word.set(tokenizer.nextToken().trim());
           context.write(word, new IntWritable(1));
         }
       }
     }
     www.designpathshala.com 22
  23. 23. Reducer
     public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
       private IntWritable result = new IntWritable();
       public void reduce(Text key, Iterable<IntWritable> values, Context context)
           throws IOException, InterruptedException {
         int sum = 0;
         for (IntWritable val : values) {
           sum += val.get();
         }
         result.set(sum);
         context.write(key, result);
       }
     }
     www.designpathshala.com 23
  24. 24. Job
     public static void main(String[] args) throws Exception {
       JobConf conf = new JobConf(WordCount.class);
       conf.setJobName("wordcount");
       conf.setOutputKeyClass(Text.class);
       conf.setOutputValueClass(IntWritable.class);
       conf.setMapperClass(Map.class);
       conf.setReducerClass(Reduce.class);
       conf.setInputFormat(TextInputFormat.class);
       conf.setOutputFormat(TextOutputFormat.class);
       FileInputFormat.setInputPaths(conf, new Path(args[0]));
       FileOutputFormat.setOutputPath(conf, new Path(args[1]));
       JobClient.runJob(conf);
     }
     (Note: this driver uses the older mapred API (JobConf); with the new-API Mapper and Reducer from slides 22-23 you would configure an org.apache.hadoop.mapreduce.Job instead.)
     www.designpathshala.com 24
  25. 25. Terminology Example  Running “Word Count” across 20 files is one job  20 input splits to be mapped imply 20 map tasks + some number of reduce tasks  At least 20 map task attempts will be performed… more if a machine crashes, etc. www.designpathshala.com 25
  27. 27. MapReduce - Features  Fine grained Map and Reduce tasks  Improved load balancing  Faster recovery from failed tasks  Automatic re-execution on failure  In a large cluster, some nodes are always slow or flaky  Framework re-executes failed tasks  Locality optimizations  With large data, bandwidth to data is a problem  Map-Reduce + HDFS is a very effective solution  Map-Reduce queries HDFS for locations of input data  Map tasks are scheduled close to the inputs when possible www.designpathshala.com 27
  28. 28. What is Writable?  Hadoop defines its own “box” classes for strings (Text), integers (IntWritable), etc.  All values are instances of Writable  All keys are instances of WritableComparable www.designpathshala.com 28
  29. 29. Hadoop Data Types (class, size in bytes, description, sort policy):  BooleanWritable, 1 - wrapper for a standard Boolean; sorts false before true  ByteWritable, 1 - wrapper for a single byte; ascending order  DoubleWritable, 8 - wrapper for a Double; ascending order  FloatWritable, 4 - wrapper for a Float; ascending order  IntWritable, 4 - wrapper for an Integer; ascending order  LongWritable, 8 - wrapper for a Long; ascending order  Text, up to 2GB - wrapper to store text in Unicode UTF-8 format; alphabetic order  NullWritable - placeholder when the key or value is not needed; undefined order  Your Writable - implement the Writable interface for a value, or WritableComparable<T> for a key; your own sort policy www.designpathshala.com 29
  30. 30. WritableComparable  Compares WritableComparable data  Will call compareTo method to do comparison www.designpathshala.com 30
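A minimal sketch of a custom key type, to make the compareTo point concrete. The YearMonth name and its two fields are illustrative, not from the slides:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.WritableComparable;

    public class YearMonth implements WritableComparable<YearMonth> {
      private int year;
      private int month;

      public YearMonth() {}  // Hadoop needs a no-arg constructor for deserialization

      public void write(DataOutput out) throws IOException {
        out.writeInt(year);
        out.writeInt(month);
      }

      public void readFields(DataInput in) throws IOException {
        year = in.readInt();
        month = in.readInt();
      }

      // compareTo defines the sort order applied during shuffle & sort
      public int compareTo(YearMonth other) {
        if (year != other.year) return year < other.year ? -1 : 1;
        if (month != other.month) return month < other.month ? -1 : 1;
        return 0;
      }

      // hashCode is used by the default partitioner to assign keys to reducers
      public int hashCode() { return year * 31 + month; }

      public boolean equals(Object o) {
        if (!(o instanceof YearMonth)) return false;
        YearMonth y = (YearMonth) o;
        return year == y.year && month == y.month;
      }
    }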
  32. 32. Input Split Size  Input splits are logical divisions of records, whereas HDFS blocks are physical divisions of the input data  Processing is most efficient when splits and blocks coincide, but in practice they rarely align exactly  The machine processing a particular split may therefore fetch a fragment of a record from a block other than its "main" block, and that block may reside remotely  FileInputFormat will divide large files into chunks  Exact size controlled by mapred.min.split.size  RecordReaders receive the file, offset, and length of the chunk (the input split) www.designpathshala.com 32
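A one-line sketch of the knob named above, set programmatically on the old-API JobConf used elsewhere in this deck (the 128 MB figure is just an example):

    JobConf conf = new JobConf(WordCount.class);
    // force splits of at least 128 MB, even if the HDFS block size is smaller
    conf.setLong("mapred.min.split.size", 128L * 1024 * 1024);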
  33. 33. Getting Data To The Mapper [Diagram: the InputFormat divides each input file into InputSplits; a RecordReader turns each split into records for a Mapper, which emits intermediates.] www.designpathshala.com 33
  34. 34. Reading Data  Data sets are specified by InputFormats  Defines input data (e.g., a directory)  Identifies partitions of the data that form an InputSplit  Factory for RecordReader objects to extract (k, v) records from the input source www.designpathshala.com 34
  36. 36. FileInputFormat  TextInputFormat - treats each newline-terminated line of a file as a value; the key is the byte offset of the line  Key: LongWritable  Value: Text  KeyValueTextInputFormat - each line in the text file is a record; a separator character divides each line, with everything before the separator the key and everything after it the value  Separator is set by the key.value.separator.in.input.line property  Default separator is the tab character ("\t")  Key: Text  Value: Text  SequenceFileInputFormat - input format for reading sequence files; keys and values are user defined; sequence files are a Hadoop-specific compressed binary file format  NLineInputFormat - same as TextInputFormat, but each split is guaranteed to have exactly N lines  N is set by the mapred.line.input.format.linespermap property  Key: LongWritable  Value: Text www.designpathshala.com 36
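A short sketch wiring KeyValueTextInputFormat into the old-API driver from slide 24, with the separator changed from the default tab to a comma (the comma is just an example):

    JobConf conf = new JobConf(WordCount.class);
    conf.setInputFormat(KeyValueTextInputFormat.class);
    // everything before the first ',' becomes the key, the rest the value
    conf.set("key.value.separator.in.input.line", ",");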
  37. 37. Filtering File Inputs  FileInputFormat will read all files out of a specified directory and send them to the mapper  Delegates filtering this file list to a method subclasses may override  e.g., Create your own “xyzFileInputFormat” to read *.xyz from directory list www.designpathshala.com 37
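Besides subclassing FileInputFormat, the same filtering effect can be had with a PathFilter; a minimal sketch, where the ".xyz" extension and the class name are illustrative:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.PathFilter;

    public class XyzPathFilter implements PathFilter {
      // accept only files whose names end in .xyz
      public boolean accept(Path path) {
        return path.getName().endsWith(".xyz");
      }
    }

    // In the driver (old API):
    //   FileInputFormat.setInputPathFilter(conf, XyzPathFilter.class);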
  38. 38. Record Readers  Each InputFormat provides its own RecordReader implementation  Responsible for parsing input splits into records  Then parsing each record into a key value pair  LineRecordReader – Reads a line from a text file  Used in TextInputFormat  KeyValueRecordReader – Used by KeyValueTextInputFormat  Custom Record Readers can be created by implementing RecordReader<K,V> www.designpathshala.com 38
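A hedged sketch of a custom reader, written against the new-API RecordReader abstract class and delegating to LineRecordReader; re-keying each line on its first space-separated token is illustrative:

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

    public class FirstTokenRecordReader extends RecordReader<Text, Text> {
      private final LineRecordReader delegate = new LineRecordReader();
      private final Text key = new Text();
      private final Text value = new Text();

      public void initialize(InputSplit split, TaskAttemptContext context)
          throws IOException, InterruptedException {
        delegate.initialize(split, context);
      }

      // key each record on its first token instead of its byte offset
      public boolean nextKeyValue() throws IOException {
        if (!delegate.nextKeyValue()) return false;
        String line = delegate.getCurrentValue().toString();
        int space = line.indexOf(' ');
        key.set(space < 0 ? line : line.substring(0, space));
        value.set(space < 0 ? "" : line.substring(space + 1));
        return true;
      }

      public Text getCurrentKey() { return key; }
      public Text getCurrentValue() { return value; }
      public float getProgress() throws IOException { return delegate.getProgress(); }
      public void close() throws IOException { delegate.close(); }
    }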
  39. 39. Creating the Mapper  Extend the Mapper abstract class  Class Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT>  protected void cleanup(org.apache.hadoop.mapreduce.Mapper.Context context) - called once at the end of the task  protected void map(KEYIN key, VALUEIN value, org.apache.hadoop.mapreduce.Mapper.Context context) - called once for each key/value pair in the input split  void run(org.apache.hadoop.mapreduce.Mapper.Context context) - expert users can override this method for more complete control over the execution of the Mapper  protected void setup(org.apache.hadoop.mapreduce.Mapper.Context context) - called once at the beginning of the task  The Context receives the output of the mapping process (the old API used an OutputCollector for this) www.designpathshala.com 39
  41. 41. Mapper  public void map( Object key, Text value, Context context)  Key types implement WritableComparable  Value types implement Writable www.designpathshala.com 41
  42. 42. Some useful mappers  IdentityMapper<K,V> - maps the input directly to the output  InverseMapper<K,V> - swaps the key and value  RegexMapper<K> - implements Mapper<K,Text,Text,LongWritable> and generates a (match, 1) pair for every regular expression match  TokenCountMapper<K> - implements Mapper<K,Text,Text,LongWritable> and generates a (token, 1) pair for each token when the input value is tokenized www.designpathshala.com 42
  43. 43. Reducer  void reduce( Text key, Iterable<IntWritable> values, Context context)  Key types implement WritableComparable  Value types implement Writable www.designpathshala.com 43
  45. 45. Finally: Writing The Output [Diagram: each Reducer writes through its RecordWriter to its own output file, under the control of the OutputFormat.] www.designpathshala.com 45
  46. 46. Some useful reducers  IdentityReducer<K,V> - maps the input directly to the output  LongSumReducer<K> - implements Reducer<K,LongWritable,K,LongWritable> and computes the sum of all values corresponding to the given key www.designpathshala.com 46
  47. 47. OutputFormat  TextOutputFormat - writes each record as a line of text; keys and values are written as strings, separated by a tab ("\t")  SequenceFileOutputFormat - writes keys and values in Hadoop's proprietary sequence file format  NullOutputFormat - outputs nothing, for when you want to suppress the output completely www.designpathshala.com 47
  48. 48. Apache Hadoop Common MapReduce Algorithms Design Pathshala April 22, 2014 www.designpathshala.com 48
  50. 50. Some handy tools  Partitioners  Combiners  Compression  Zero Reduces  Distributed File Cache www.designpathshala.com 50
  51. 51. Partitioners  Partitioners are application code that defines how keys are assigned to reducers  Default partitioning spreads keys evenly, but randomly  Uses key.hashCode() % num_reduces  Custom partitioning is often required, for example, to produce a total order in the output  Should implement the Partitioner interface  Set by calling conf.setPartitionerClass(MyPart.class)  To get a total order, sample the map output keys and pick split points that divide the keys into roughly equal buckets, then use those in your partitioner www.designpathshala.com 51
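A minimal sketch of a custom partitioner against the new-API abstract class (the old-API interface differs only in ceremony); bucketing words by first letter is illustrative, not the total-order sampling scheme described above:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
      // spread keys across reducers by the first letter of the word
      public int getPartition(Text key, IntWritable value, int numPartitions) {
        String s = key.toString();
        char first = s.isEmpty() ? 'a' : Character.toLowerCase(s.charAt(0));
        int letter = Math.max(0, Math.min(25, first - 'a'));  // clamp non-letters
        return letter * numPartitions / 26;                   // always in [0, numPartitions)
      }
    }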
  52. 52. Partition And Shuffle [Diagram: each Mapper's intermediates pass through a Partitioner; shuffling delivers each partition's intermediates to its Reducer.] www.designpathshala.com 52
  54. 54. Combiners  When maps produce many repeated keys  It is often useful to do a local aggregation following the map  Done by specifying a Combiner  Goal is to decrease the size of the transient data  Combiners have the same interface as Reducers, and often are the same class  Combiners must have no side effects, because they may run an indeterminate number of times  In WordCount: conf.setCombinerClass(Reduce.class); www.designpathshala.com 54
  55. 55. Compression  Compressing the outputs and intermediate data will often yield huge performance gains  Can be specified via a configuration file or set programmatically  Set mapred.output.compress to true to compress job output  Set mapred.compress.map.output to true to compress map outputs  Compression Types (mapred(.map)?.output.compression.type)  “block” - Group of keys and values are compressed together  “record” - Each value is compressed individually  Block compression is almost always best  Compression Codecs (mapred(.map)?.output.compression.codec)  Default (zlib) - slower, but more compression  LZO - faster, but less compression www.designpathshala.com 55
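A sketch of these knobs set programmatically on the old-API JobConf used in this deck (the property names are the pre-Hadoop-2 ones shown on the slide; newer releases renamed them under mapreduce.output.fileoutputformat.*):

    JobConf conf = new JobConf(WordCount.class);
    conf.setBoolean("mapred.output.compress", true);        // compress job output
    conf.setBoolean("mapred.compress.map.output", true);    // compress intermediate map output
    conf.set("mapred.output.compression.type", "BLOCK");    // block compression is almost always best
    conf.set("mapred.output.compression.codec",
             "org.apache.hadoop.io.compress.DefaultCodec"); // zlib: slower, but more compression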
  57. 57. Zero Reduces  Frequently, we only need to run a filter on the input data  No sorting or shuffling required by the job  Set the number of reduces to 0  Output from maps will go directly to OutputFormat and disk www.designpathshala.com 57
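In the slide-24 driver style, a map-only job is one extra line (the BlueFilter class name is illustrative):

    JobConf conf = new JobConf(BlueFilter.class);
    conf.setNumReduceTasks(0);  // map output bypasses sort/shuffle, goes straight to the OutputFormat
    // ...set mapper, input/output formats, and paths as usual...
    JobClient.runJob(conf);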
  58. 58. Distributed File Cache  Sometimes need read-only copies of data on the local computer  Downloading 1GB of data for each Mapper is expensive  Define list of files you need to download in JobConf  Files are downloaded once per computer  Add to launching program: DistributedCache.addCacheFile(new URI(“hdfs://nn:8020/foo”), conf);  Add to task: Path[] files = DistributedCache.getLocalCacheFiles(conf); www.designpathshala.com 58
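The two calls from this slide, shown in context; reading the cached file back in the task is a sketch under the assumption that it is a plain-text lookup file:

    // In the launching program: register the HDFS file with the cache
    DistributedCache.addCacheFile(new URI("hdfs://nn:8020/foo"), conf);

    // In the task (e.g., in the mapper's configure/setup): read the local copy
    Path[] files = DistributedCache.getLocalCacheFiles(conf);
    BufferedReader reader = new BufferedReader(new FileReader(files[0].toString()));
    String line;
    while ((line = reader.readLine()) != null) {
      // load lookup data into memory, e.g. into a HashMap
    }
    reader.close();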
