Apache Hadoop Java API


Short introduction to MapReduce Java API for Apache Hadoop

Published in: Technology

  1. Short Apache Hadoop API Overview
      Adam Kawa, Data Engineer @ Spotify
      2/24/13
  2. Image source: http://developer.yahoo.com/hadoop/tutorial/module4.html
  3. InputFormat Responsibilities
      - Divides input data into logical input splits
        - Data in HDFS is divided into blocks, but processed as input splits
        - An InputSplit may contain any number of blocks (usually 1)
        - Each Mapper processes one input split
      - Creates RecordReaders to extract <key, value> pairs
  4. InputFormat Class
          public abstract class InputFormat<K, V> {
            public abstract List<InputSplit> getSplits(JobContext context) throws ...;
            public abstract RecordReader<K, V> createRecordReader(InputSplit split, TaskAttemptContext context) throws ...;
          }
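  5. Most Common InputFormats
      - TextInputFormat
        - Each \n-terminated line is a value
        - The byte offset of that line is a key
        - Why not a line number? (A reader that starts in the middle of a file cannot know how many lines precede its split without reading the earlier splits.)
      - KeyValueTextInputFormat
        - Key and value are separated by a separator (tab by default)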
  6. Binary InputFormats
      - SequenceFileInputFormat
        - SequenceFiles are flat files consisting of binary <key, value> pairs
      - AvroInputFormat
        - Avro supports rich data structures (not necessarily <key, value> pairs) serialized to files or messages
        - Compact, fast, language-independent, self-describing, dynamic
  7. Some Other InputFormats
      - NLineInputFormat
        - The input should not be too big, since splits are calculated in a single thread (NLineInputFormat#getSplitsForFile)
      - CombineFileInputFormat
        - An abstract class, but not so difficult to extend
      - SeparatorInputFormat
        - How-to here: http://blog.rguha.net/?p=293
  8. Some Other InputFormats
      - MultipleInputs
        - Supports multiple input paths with a different InputFormat and Mapper for each path
          MultipleInputs.addInputPath(job, firstPath, FirstInputFormat.class, FirstMapper.class);
          MultipleInputs.addInputPath(job, secondPath, SecondInputFormat.class, SecondMapper.class);
  9. InputFormat Class (Partial) Hierarchy
  10. InputFormat Interesting Facts
      - Ideally, the InputSplit size is equal to the HDFS block size
        - Or the InputSplit contains multiple collocated HDFS blocks
      - An InputFormat may prevent splitting a file
        - The whole file is then processed by a single mapper (e.g. gzip)
          protected boolean FileInputFormat#isSplitable(JobContext context, Path filename);
  11. InputFormat Interesting Facts
      - A Mapper knows the file/offset/size of the split that it processes
          MapContext#getInputSplit()
        - Useful for later debugging on a local machine
  12. InputFormat Interesting Facts
      - A PathFilter (included in InputFormat) specifies which files to include in the input data
          PathFilter hiddenFileFilter = new PathFilter() {
            public boolean accept(Path p) {
              String name = p.getName();
              return !name.startsWith("_") && !name.startsWith(".");
            }
          };
  13. RecordReader
      - Extracts <key, value> pairs from the corresponding InputSplit
      - Examples:
        - LineRecordReader
        - KeyValueLineRecordReader
        - SequenceFileRecordReader
  14. RecordReader Logic
      - Must handle the common situation where InputSplit and HDFS block boundaries do not match
      Image source: Hadoop: The Definitive Guide by Tom White
  15. RecordReader Logic
      - Example solution, based on LineRecordReader
        - Skips* everything in its own block up to the first \n (that partial line belongs to the previous reader)
        - Keeps reading into the next block until it sees \n
        - *except for the very first block (the one whose offset equals 0)
      Image source: Hadoop: The Definitive Guide by Tom White
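The skip-and-read-past rule above can be tried out without a cluster. The following is only a sketch under stated assumptions: `SplitLineReader` and `readSplit` are hypothetical names, an in-memory byte array stands in for the HDFS file, and this is not Hadoop's actual `LineRecordReader` code.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop: applies the split-boundary rule to a byte array.
final class SplitLineReader {

    /** Returns the lines owned by the split [start, start + length) of data. */
    static List<String> readSplit(byte[] data, int start, int length) {
        int end = start + length;
        int pos = start;
        if (start != 0) {
            // Not the first split: skip through the first '\n'.
            // The reader of the previous split is responsible for that line.
            pos = skipLine(data, pos);
        }
        List<String> lines = new ArrayList<>();
        // A line that starts at or before 'end' belongs to this split,
        // even if its bytes run past 'end' into the next split.
        while (pos <= end && pos < data.length) {
            int next = skipLine(data, pos);
            int lineEnd = (next > pos && data[next - 1] == '\n') ? next - 1 : next;
            lines.add(new String(data, pos, lineEnd - pos, StandardCharsets.UTF_8));
            pos = next;
        }
        return lines;
    }

    /** Returns the index just past the next '\n' at or after pos (or data.length). */
    private static int skipLine(byte[] data, int pos) {
        while (pos < data.length && data[pos] != '\n') {
            pos++;
        }
        return Math.min(pos + 1, data.length);
    }
}
```

With the 12-byte input "aaa\nbbbb\ncc\n" cut into splits of 5, 5 and 2 bytes, every line is returned by exactly one split, although "bbbb" and "cc" both cross a boundary.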
  16. Keys And Values
      - Keys must implement the WritableComparable interface
        - Since they are sorted before being passed to the Reducers
      - Values must implement "at least" the Writable interface
  17. WritableComparables Hierarchy
      Image source: Hadoop: The Definitive Guide by Tom White
  18. Writable And WritableComparable
          public interface Writable {
            void write(DataOutput out) throws IOException;
            void readFields(DataInput in) throws IOException;
          }
          public interface WritableComparable<T> extends Writable, Comparable<T> {
          }
          public interface Comparable<T> {
            public int compareTo(T o);
          }
  19. Example: SongWritable
          class SongWritable implements Writable {
            String title;
            int year;
            byte[] content;
            …
            public void write(DataOutput out) throws ... {
              out.writeUTF(title);
              out.writeInt(year);
              out.writeInt(content.length);
              out.write(content);
            }
          }
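The slide shows only `write`; `readFields` has to read the same fields back in the same order. Here is a minimal sketch of that symmetry, assuming a local stand-in for the `Writable` interface so that it compiles without Hadoop on the classpath. `SongWritableSketch` and `roundTrip` are hypothetical names, not the author's class.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Local stand-in for org.apache.hadoop.io.Writable (same two methods). */
interface Writable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Hypothetical class for illustration only.
final class SongWritableSketch implements Writable {
    String title = "";
    int year;
    byte[] content = new byte[0];

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(title);
        out.writeInt(year);
        out.writeInt(content.length); // length prefix, so the reader knows how much to read
        out.write(content);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // Must mirror write(): same fields, same order, same encodings.
        title = in.readUTF();
        year = in.readInt();
        content = new byte[in.readInt()];
        in.readFully(content);
    }

    /** Serializes a song to bytes and deserializes it into a fresh instance. */
    static SongWritableSketch roundTrip(SongWritableSketch song) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            song.write(new DataOutputStream(bytes));
            SongWritableSketch copy = new SongWritableSketch();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Writing the array length before the array is what makes the record self-delimiting, which matters because the framework reads many records back to back from one stream.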
  20. Mapper
      - Takes input in the form of a <key, value> pair
      - Emits a set of intermediate <key, value> pairs
      - Stores them locally and later passes them to the Reducers
        - But before that: partition + sort + spill + merge
  21. Mapper Methods
          protected void setup(Context context) throws ... {}
          protected void cleanup(Context context) throws ... {}
          protected void map(KEYIN key, VALUEIN value, Context context) ... {
            context.write((KEYOUT) key, (VALUEOUT) value);
          }
          public void run(Context context) throws ... {
            setup(context);
            while (context.nextKeyValue()) {
              map(context.getCurrentKey(), context.getCurrentValue(), context);
            }
            cleanup(context);
          }
  22. MapContext Object
      - Allows the user's map code to communicate with the MapReduce system
          public InputSplit getInputSplit();
          public TaskAttemptID getTaskAttemptID();
          public void setStatus(String msg);
          public boolean nextKeyValue() throws ...;
          public KEYIN getCurrentKey() throws ...;
          public VALUEIN getCurrentValue() throws ...;
          public void write(KEYOUT key, VALUEOUT value) throws ...;
          public Counter getCounter(String groupName, String counterName);
  23. Examples Of Mappers
      - Implement highly specialized Mappers and reuse/chain them when possible
        - IdentityMapper
        - InverseMapper
        - RegexMapper
        - TokenCounterMapper
  24. TokenCounterMapper
          public class TokenCounterMapper extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
              StringTokenizer itr = new StringTokenizer(value.toString());
              while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
              }
            }
          }
  25. General Advice
      - Reuse Writable instances instead of creating a new one each time
      - The Apache Commons StringUtils class seems to be the most efficient for String tokenization
  26. Chain Of Mappers
      - Use multiple Mapper classes within a single Map task
      - The output of the first Mapper becomes the input of the second, and so on until the last Mapper
      - The output of the last Mapper is written to the task's output
      - Encourages implementation of reusable and highly specialized Mappers
  27. Example Chain Of Mappers
          JobConf mapAConf = new JobConf(false);
          ...
          ChainMapper.addMapper(conf, AMap.class, LongWritable.class, Text.class,
              Text.class, Text.class, true, mapAConf);

          JobConf mapBConf = new JobConf(false);
          ...
          ChainMapper.addMapper(conf, BMap.class, Text.class, Text.class,
              LongWritable.class, Text.class, false, mapBConf);

          FileInputFormat.setInputPaths(conf, inDir);
          FileOutputFormat.setOutputPath(conf, outDir);
          JobClient jc = new JobClient(conf);
          RunningJob job = jc.submitJob(conf);
  28. Partitioner
      - Specifies which Reducer a given <key, value> pair is sent to
      - Aim for an even distribution of the intermediate data
      - Skewed data may overload a single reducer and make the whole job run longer
          public abstract class Partitioner<KEY, VALUE> {
            public abstract int getPartition(KEY key, VALUE value, int numPartitions);
          }
  29. HashPartitioner
      - The default choice for general-purpose use cases
          public int getPartition(K key, V value, int numReduceTasks) {
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
          }
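The `& Integer.MAX_VALUE` in the formula above clears the sign bit. Without it, a key with a negative `hashCode()` would produce a negative partition number, because Java's `%` keeps the sign of the dividend. This small sketch compares both variants; `HashPartitionSketch` and its method names are hypothetical.

```java
// Hypothetical helper for illustration; only the masked formula comes from the slide.
final class HashPartitionSketch {

    /** The formula from the slide: always yields a value in [0, numReduceTasks). */
    static int masked(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    /** The same formula without the mask: can go negative. */
    static int unmasked(Object key, int numReduceTasks) {
        return key.hashCode() % numReduceTasks;
    }

    public static void main(String[] args) {
        Integer key = -7; // Integer.hashCode() is the value itself, so the hash is -7
        System.out.println(unmasked(key, 4)); // -3, not a valid partition
        System.out.println(masked(key, 4));   // 1
    }
}
```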
  30. TotalOrderPartitioner
      - A partitioner that aims at a total order of the output
  31. TotalOrderPartitioner
      - Before the job runs, it samples the input data to provide a fairly even distribution over keys
  32. TotalOrderPartitioner
      - Three samplers
        - InputSampler.RandomSampler<K,V>
          Samples from random points in the input
        - InputSampler.IntervalSampler<K,V>
          Samples from s splits at regular intervals
        - InputSampler.SplitSampler<K,V>
          Samples the first n records from s splits
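The idea behind sampling can be sketched in plain Java: sort the sample, pick evenly spaced keys as split points, and route each key to the range it falls into. This is a sketch under assumptions and not Hadoop's `InputSampler` or `TotalOrderPartitioner` implementation: keys are plain `int`s, the sample is already in memory, the split points are assumed distinct, and `TotalOrderSketch` is a hypothetical name.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only.
final class TotalOrderSketch {

    /** Picks numPartitions - 1 evenly spaced keys from the sorted sample. */
    static int[] splitPoints(int[] sample, int numPartitions) {
        int[] sorted = sample.clone();
        Arrays.sort(sorted);
        int[] points = new int[numPartitions - 1];
        for (int i = 1; i < numPartitions; i++) {
            points[i - 1] = sorted[(int) ((long) i * sorted.length / numPartitions)];
        }
        return points;
    }

    /** Partition p receives the keys k with points[p - 1] <= k < points[p]. */
    static int getPartition(int key, int[] points) {
        int idx = Arrays.binarySearch(points, key);
        // Found: the key equals a split point and goes to the range on its right.
        // Not found: binarySearch returns -(insertionPoint) - 1.
        return idx >= 0 ? idx + 1 : -idx - 1;
    }
}
```

Because every key in partition p is smaller than every key in partition p + 1, concatenating the sorted reducer outputs in partition order gives a globally sorted result.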
  33. Reducer
      - Gets list(<key, list(value)>)
      - Keys are sorted, but the values for a given key are not sorted
      - Emits a set of output <key, value> pairs
  34. Reducer Run Method
          public void run(Context context) throws … {
            setup(context);
            while (context.nextKey()) {
              reduce(context.getCurrentKey(), context.getValues(), context);
            }
            cleanup(context);
          }
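To see how the map side and the reduce side fit together, the whole flow can be imitated in memory. This is only a toy sketch with no Hadoop involved: a `TreeMap` plays the role of the shuffle (grouping values by key and keeping keys sorted), and `MiniWordCount` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical in-memory imitation of map -> shuffle -> reduce for word counting.
final class MiniWordCount {

    static Map<String, Integer> run(List<String> lines) {
        // Map + shuffle: emit <token, 1> and group the values by key.
        // TreeMap keeps the keys sorted, as the framework does before reducing.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            StringTokenizer itr = new StringTokenizer(line);
            while (itr.hasMoreTokens()) {
                grouped.computeIfAbsent(itr.nextToken(), k -> new ArrayList<>()).add(1);
            }
        }
        // Reduce: one "reduce call" per key, summing that key's values.
        Map<String, Integer> output = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int value : entry.getValue()) {
                sum += value;
            }
            output.put(entry.getKey(), sum);
        }
        return output;
    }
}
```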
  35. Chain Of Mappers After A Reducer
      - The ChainReducer class allows chaining multiple Mapper classes after a Reducer within the Reducer task
      - Combined with ChainMapper, one could get [MAP+ / REDUCE MAP*]
          ChainReducer.setReducer(conf, XReduce.class, LongWritable.class, Text.class,
              Text.class, Text.class, true, reduceConf);
          ChainReducer.addMapper(conf, CMap.class, Text.class, Text.class,
              LongWritable.class, Text.class, false, null);
          ChainReducer.addMapper(conf, DMap.class, LongWritable.class, Text.class,
              LongWritable.class, LongWritable.class, true, null);
  36. OutputFormat Class Hierarchy
      Image source: Hadoop: The Definitive Guide by Tom White
  37. MultipleOutputs
          MultipleOutputs.addNamedOutput(job, "text", TextOutputFormat.class,
              LongWritable.class, Text.class);
          MultipleOutputs.addNamedOutput(job, "seq", SequenceFileOutputFormat.class,
              LongWritable.class, Text.class);

          // mos is a MultipleOutputs instance, typically created in setup()
          public void reduce(WritableComparable key, Iterable<Writable> values, Context context) throws ... {
            ...
            mos.write("text", key, new Text("Hello"));
            mos.write("seq", new LongWritable(1), new Text("Bye"), "seq_a");
            mos.write("seq", new LongWritable(2), new Text("Chau"), "seq_b");
            mos.write(key, new Text("value"), generateFileName(key, new Text("value")));
          }
  38. Other Useful Features
      - Combiner
      - Skipping bad records
      - Compression
      - Profiling
      - Isolation Runner
  39. Job Class Methods
      - Job configuration
          public void setInputFormatClass(..);
          public void setOutputFormatClass(..);
          public void setMapperClass(..);
          public void setCombinerClass(..);
          public void setReducerClass(..);
          public void setPartitionerClass(..);
          public void setMapOutputKeyClass(..);
          public void setMapOutputValueClass(..);
          public void setOutputKeyClass(..);
          public void setOutputValueClass(..);
          public void setSortComparatorClass(..);
          public void setGroupingComparatorClass(..);
          public void setNumReduceTasks(int tasks);
          public void setJobName(String name);
      - Job control and status
          public float mapProgress();
          public float reduceProgress();
          public boolean isComplete();
          public boolean isSuccessful();
          public void killJob();
          public void submit();
          public boolean waitForCompletion(..);
  40. ToolRunner
      - Supports parsing of generic options, which allows the user to specify configuration options on the command line
          hadoop jar examples.jar SongCount
            -D mapreduce.job.reduces=10
            -D artist.gender=FEMALE
            -files dictionary.dat
            -libjars math.jar,spotify.jar
            songs counts
  41. Side Data Distribution
          public class MyMapper<K, V> extends Mapper<K, V, V, K> {
            String gender = null;
            File dictionary = null;

            protected void setup(Context context) throws … {
              Configuration conf = context.getConfiguration();
              gender = conf.get("artist.gender", "MALE");
              dictionary = new File("dictionary.dat");
            }
          }
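  42. import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.conf.Configured;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
      import org.apache.hadoop.util.Tool;
      import org.apache.hadoop.util.ToolRunner;

      public class WordCount extends Configured implements Tool {
        // otherArgs holds what is left after ToolRunner has consumed the generic options
        public int run(String[] otherArgs) throws Exception {
          if (otherArgs.length != 2) {
            System.out.printf("Usage: %s [options] <input> <output>%n", getClass().getSimpleName());
            return -1;
          }
          Job job = new Job(getConf());
          FileInputFormat.setInputPaths(job, new Path(otherArgs[0]));
          FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
          ...
          return job.waitForCompletion(true) ? 0 : 1;
        }

        public static void main(String[] allArgs) throws Exception {
          int exitCode = ToolRunner.run(new Configuration(), new WordCount(), allArgs);
          System.exit(exitCode);
        }
      }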
  43. MRUnit
      - Built on top of JUnit
      - Provides a mock InputSplit, Context and other classes
      - Can test
        - The Mapper class
        - The Reducer class
        - The full MapReduce job
        - The pipeline of MapReduce jobs
  44. MRUnit Example
          public class IdentityMapTest extends TestCase {
            private MapDriver<Text, Text, Text, Text> driver;

            @Before
            public void setUp() {
              driver = new MapDriver<Text, Text, Text, Text>(new MyMapper<Text, Text, Text, Text>());
            }

            @Test
            public void testMyMapper() {
              driver
                  .withInput(new Text("foo"), new Text("bar"))
                  .withOutput(new Text("oof"), new Text("rab"))
                  .runTest();
            }
          }
  45. Example: Secondary Sort
      - The reduce(key, Iterator<value>) method gets an iterator over the values
      - These values are not sorted for a given key
      - Sometimes we want to get them sorted
      - Useful for finding the minimum or maximum value quickly
  46. Secondary Sort Is Tricky
      - A couple of custom classes are needed
        - WritableComparable
        - Partitioner
        - SortComparator (optional, but recommended)
        - GroupingComparator
  47. Composite Key
      - Leverages the "traditional" sorting mechanism of intermediate keys
      - The intermediate key becomes a composite of the "natural" key and the value
          (Disturbia, 1) → (Disturbia#1, 1)
          (SOS, 4)       → (SOS#4, 4)
          (Disturbia, 7) → (Disturbia#7, 7)
          (Fast car, 2)  → (Fast car#2, 2)
          (Fast car, 6)  → (Fast car#6, 6)
          (Disturbia, 4) → (Disturbia#4, 4)
          (Fast car, 2)  → (Fast car#2, 2)
  48. Custom Partitioner
      - HashPartitioner uses a hash of the whole key
        - The same titles may go to different reducers (because they are combined with ts in a key)
      - Use a custom partitioner that partitions only on the first part of the key
          int getPartition(TitleWithTs key, LongWritable value, int num) {
            return hashPartitioner.getPartition(key.title, value, num);
          }
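The difference between the two partitioning strategies can be shown with a stripped-down composite key. This is a sketch under assumptions: `TitleWithTs` here is a plain Java class with an invented `hashCode`, not the author's Writable, and `TitlePartitionSketch` is a hypothetical name.

```java
// Hypothetical classes for illustration; no Hadoop types involved.
final class TitlePartitionSketch {

    /** Composite key: the natural key (title) plus the part to sort by (ts). */
    static final class TitleWithTs {
        final String title;
        final long ts;

        TitleWithTs(String title, long ts) {
            this.title = title;
            this.ts = ts;
        }

        @Override
        public int hashCode() {
            return 31 * title.hashCode() + Long.hashCode(ts);
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof TitleWithTs)) {
                return false;
            }
            TitleWithTs other = (TitleWithTs) o;
            return title.equals(other.title) && ts == other.ts;
        }
    }

    /** Hashes the whole composite key: the result depends on ts as well. */
    static int wholeKeyPartition(TitleWithTs key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    /** Hashes only the natural key: all records of one title share a reducer. */
    static int titlePartition(TitleWithTs key, int numReducers) {
        return (key.title.hashCode() & Integer.MAX_VALUE) % numReducers;
    }
}
```

Only `titlePartition` guarantees that (Disturbia#1), (Disturbia#4) and (Disturbia#7) reach the same reducer, which is a precondition for sorting and grouping them there.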
  49. Ordering Of Keys
      - Keys need to be ordered before being passed to the reducer
      - Order by the natural key and, for the same natural key, by the value portion of the key
      - Implement the ordering in the WritableComparable or use a Comparator class
          job.setSortComparatorClass(SongWithTsComparator.class);
  50. Data Passed To The Reducer
      - By default, each unique key triggers a separate call to the reduce() method
          (Disturbia#1, 1) → reduce method is invoked
          (Disturbia#4, 4) → reduce method is invoked
          (Disturbia#7, 7) → reduce method is invoked
          (Fast car#2, 2)  → reduce method is invoked
          (Fast car#2, 2)
          (Fast car#6, 6)  → reduce method is invoked
          (SOS#4, 4)       → reduce method is invoked
  51. Data Passed To The Reducer
      - The grouping comparator (set via setGroupingComparatorClass) determines which keys and values are passed in a single call to the reduce method
      - Just look at the natural key when grouping
          (Disturbia#1, 1) → reduce method is invoked
          (Disturbia#4, 4)
          (Disturbia#7, 7)
          (Fast car#2, 2)  → reduce method is invoked
          (Fast car#2, 2)
          (Fast car#6, 6)
          (SOS#4, 4)       → reduce method is invoked
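The combined effect of the sort comparator and the grouping comparator can be imitated in memory with the song data from the slides. This is a sketch, not Hadoop code: `SecondarySortSketch`, `Key` and `reduceCalls` are hypothetical names, and a map entry stands in for one `reduce()` call.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory imitation of secondary sort.
final class SecondarySortSketch {

    /** Composite key: natural key (title) plus the value portion (ts). */
    static final class Key {
        final String title;
        final int ts;

        Key(String title, int ts) {
            this.title = title;
            this.ts = ts;
        }
    }

    /** Sort comparator: natural key first, then the value portion. */
    static final Comparator<Key> SORT =
            Comparator.comparing((Key k) -> k.title).thenComparingInt(k -> k.ts);

    /**
     * Sorts by the full composite key, then groups on the natural key only.
     * Each map entry corresponds to one reduce() call and its (sorted) values.
     */
    static Map<String, List<Integer>> reduceCalls(List<Key> keys) {
        List<Key> sorted = new ArrayList<>(keys);
        sorted.sort(SORT);
        Map<String, List<Integer>> calls = new LinkedHashMap<>();
        for (Key k : sorted) {
            calls.computeIfAbsent(k.title, t -> new ArrayList<>()).add(k.ts);
        }
        return calls;
    }
}
```

For the seven records used on the slides this yields three groups, Disturbia with [1, 4, 7], Fast car with [2, 2, 6] and SOS with [4], so the first value of each group is already the minimum.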
  52. Question
      - How to calculate a median from a set of numbers using Java MapReduce?
  53. Question: A Possible Answer
      - Implement TotalSort, but
        - Each Reducer produces an additional file containing a pair <minimum_value, number_of_values>
      - After the job ends, a single-threaded application
        - Reads these files to build the index
        - Calculates which value in which file is the median
        - Finds this value in this file
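The "calculate which value in which file" step reduces to a walk over the per-file counts. This is a sketch under assumptions: the counts are already listed in key order (which the minimum values from the index would establish), the lower median is used when the total is even, and `MedianIndexSketch` is a hypothetical name.

```java
// Hypothetical helper for illustration only.
final class MedianIndexSketch {

    /**
     * Given the number of values in each sorted output file (files in key order),
     * returns {fileIndex, offsetInFile} of the lower median, both 0-based.
     */
    static long[] locateMedian(long[] countsPerFile) {
        long total = 0;
        for (long count : countsPerFile) {
            total += count;
        }
        if (total == 0) {
            throw new IllegalArgumentException("no values");
        }
        long target = (total - 1) / 2; // 0-based global rank of the lower median
        for (int file = 0; file < countsPerFile.length; file++) {
            if (target < countsPerFile[file]) {
                return new long[] {file, target};
            }
            target -= countsPerFile[file]; // skip this whole file
        }
        throw new IllegalStateException("unreachable: target is always below total");
    }
}
```

For files holding 3, 4 and 2 values, the global rank is 4, so the median is the value at offset 1 of the second file; only that one file has to be opened.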
  54. Thanks!
      - Would you like to use the Hadoop API at Spotify?
      - Apply via jobs@spotify.com