Upgrading To The New Map Reduce API

Opening presentation from Yahoo! for the Hadoop User Group (this presentation was not by me).

  1. September 23rd, 2009: Yahoo! welcomes you to the Hadoop Bay Area User Group
  2. Agenda
     - Upgrading to the New MapReduce API (Owen O'Malley, Yahoo!)
     - Mining the web with Hadoop, Cascading & Bixo (Ken Krugler)
     - Q&A and Open Discussion
  3. September 23rd, 2009: Upgrading to the New MapReduce API
  4. Top Level Changes
     - Change all of the “mapred” packages to “mapreduce” (see the import sketch below).
     - Methods can throw InterruptedException as well as IOException.
     - Use Configuration instead of JobConf.
     - Library classes moved to mapreduce.lib.{input,map,output,partition,reduce}.*
     - Don’t Panic!
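
As a rough sketch of the package move, the import block of a typical job changes along these lines (the exact set of imports depends on the job):

    // Old API: everything lived under org.apache.hadoop.mapred
    //   import org.apache.hadoop.mapred.JobConf;
    //   import org.apache.hadoop.mapred.Mapper;
    //   import org.apache.hadoop.mapred.OutputCollector;
    //   import org.apache.hadoop.mapred.Reporter;

    // New API: core types under org.apache.hadoop.mapreduce,
    // library classes under mapreduce.lib.*
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
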
  5. Mapper
     - Change the map function signature from:
         map(K1 key, V1 value, OutputCollector<K2,V2> output, Reporter reporter)
       to:
         map(K1 key, V1 value, Context context)
       The context replaces output and reporter.
     - Change close() to cleanup(Context context).
  6. Mapper (cont)
     - Change output.collect(K,V) to context.write(K,V).
     - There is also setup(Context context), which can replace your configure method.
     - A migrated Mapper is sketched below.
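
A minimal sketch of a Mapper on the new API, assuming a word-count-style job (MyMapper and the whitespace tokenizing are illustrative, not from the slides):

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
      private final IntWritable one = new IntWritable(1);
      private final Text word = new Text();

      @Override
      protected void setup(Context context) {
        // Replaces configure(JobConf); settings come from context.getConfiguration().
      }

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        // One Context argument replaces OutputCollector and Reporter.
        for (String token : value.toString().split("\\s+")) {
          word.set(token);
          context.write(word, one);   // replaces output.collect(word, one)
        }
      }

      @Override
      protected void cleanup(Context context) {
        // Replaces close(); the Context is still available for final writes.
      }
    }
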
  7. MapRunnable
     - Use mapreduce.Mapper.
     - Change from:
         void run(RecordReader<K1,V1> input, OutputCollector<K2,V2> output, Reporter reporter)
       to:
         void run(Context context)
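
A sketch of overriding run on the new Mapper; this mirrors the default run loop, which is the natural starting point for custom MapRunnable-style logic (MyRunMapper is an illustrative name):

    import java.io.IOException;

    import org.apache.hadoop.mapreduce.Mapper;

    public class MyRunMapper<K1, V1, K2, V2> extends Mapper<K1, V1, K2, V2> {
      // The Context plays the roles that RecordReader, OutputCollector,
      // and Reporter played in the old run() signature.
      @Override
      public void run(Context context) throws IOException, InterruptedException {
        setup(context);
        while (context.nextKeyValue()) {
          map(context.getCurrentKey(), context.getCurrentValue(), context);
        }
        cleanup(context);
      }
    }
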
  8. Reducer and Combiner
     - Replace:
         void reduce(K2, Iterator<V2> values, OutputCollector<K3,V3> output)
       with:
         void reduce(K2, Iterable<V2> values, Context context)
     - Also replace close() with void cleanup(Context context).
     - Also have setup and run!
  9. Reducer and Combiner (cont)
     - Replace:
         while (values.hasNext()) { V2 value = values.next(); ... }
       with:
         for (V2 value : values) { ... }
     - Users of the grouping comparator can use context.getCurrentKey to get the real current key.
     - A migrated Reducer is sketched below.
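
A minimal sketch of a Reducer on the new API, continuing the illustrative word-count example from the Mapper sketch above:

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
      private final IntWritable result = new IntWritable();

      @Override
      protected void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        // Iterable replaces Iterator, so a for-each loop replaces hasNext()/next().
        for (IntWritable value : values) {
          sum += value.get();
        }
        result.set(sum);
        context.write(key, result);   // replaces output.collect(key, result)
      }
    }
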
  10. Submitting Jobs
      - Replace the JobConf and JobClient with Job. The Job represents the entire job instead of just the configuration:
        - Set properties of the job.
        - Get the status of the job.
        - Wait for the job to complete.
  11. Submitting Jobs (cont)
      - Job constructor; replace:
          job = new JobConf(conf, MyMapper.class);
          job.setJobName(“job name”);
        with:
          job = new Job(conf, “job name”);
          job.setJarByClass(MyMapper.class);
      - Job has getConfiguration.
      - FileInputFormat is in mapreduce.lib.input.
      - FileOutputFormat is in mapreduce.lib.output.
  12. Submitting Jobs (cont)
      - Replace:
          JobClient.runJob(job)
        with:
          System.exit(job.waitForCompletion(true) ? 0 : 1)
      - A full driver sketch follows.
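
Putting slides 10 through 12 together, a minimal driver sketch on the new API (WordCount is an illustrative class name, and MyMapper/MyReducer refer to the sketches above; argument handling is kept deliberately bare):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Job replaces both JobConf (configuration) and JobClient (submission).
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount.class);   // replaces new JobConf(conf, WordCount.class)
        job.setMapperClass(MyMapper.class);
        job.setCombinerClass(MyReducer.class);
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // The library classes now live under mapreduce.lib.input / mapreduce.lib.output.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Replaces JobClient.runJob(job).
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }
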
  13. InputFormats
      - Replace:
          InputSplit[] getSplits(JobConf job, int numSplits)
        with:
          List<InputSplit> getSplits(JobContext context)
      - Replace:
          RecordReader<K,V> getRecordReader(InputSplit split, JobConf job, Reporter reporter)
        with:
          RecordReader<K,V> createRecordReader(InputSplit split, TaskAttemptContext context)
  14. InputFormat (cont)
      - There is no replacement for numSplits (mapred.map.tasks).
      - FileInputFormat just uses:
        - block size
        - mapreduce.input.fileinputformat.minsize
        - mapreduce.input.fileinputformat.maxsize
      - Replace MultiFileInputFormat with CombineFileInputFormat.
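
A skeleton of a custom InputFormat against the new signatures (MyInputFormat and MyRecordReader are illustrative names; the split computation is deliberately omitted):

    import java.io.IOException;
    import java.util.List;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputFormat;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;

    public class MyInputFormat extends InputFormat<LongWritable, Text> {

      // New API: a List<InputSplit> and a JobContext, with no numSplits hint.
      @Override
      public List<InputSplit> getSplits(JobContext context)
          throws IOException, InterruptedException {
        throw new UnsupportedOperationException("split logic omitted in this sketch");
      }

      // createRecordReader replaces getRecordReader; the framework calls
      // initialize(split, context) on the returned reader before the first record.
      @Override
      public RecordReader<LongWritable, Text> createRecordReader(
          InputSplit split, TaskAttemptContext context)
          throws IOException, InterruptedException {
        return new MyRecordReader();   // sketched under slide 16 below
      }
    }
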
  15. RecordReader
      - Replace:
          boolean next(K key, V value)
          K createKey()
          V createValue()
        with:
          boolean nextKeyValue()
          K getCurrentKey()
          V getCurrentValue()
  16. RecordReader (cont)
      - The interface supports generic serialization formats instead of just Writable.
      - Note that getCurrentKey and getCurrentValue may or may not return the same object.
      - The getPos method has gone away, since not all RecordReaders have byte positions.
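
A skeleton of a RecordReader on the new API, pairing with the MyInputFormat sketch above (the actual reading logic is omitted; the shape of the contract is the point):

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;

    public class MyRecordReader extends RecordReader<LongWritable, Text> {
      private LongWritable key;
      private Text value;

      @Override
      public void initialize(InputSplit split, TaskAttemptContext context)
          throws IOException, InterruptedException {
        // Open the split here; replaces constructor-based setup in the old API.
      }

      // Replaces next(key, value): advance first, then hand out the current pair.
      @Override
      public boolean nextKeyValue() throws IOException, InterruptedException {
        return false;   // record-advancing logic omitted in this sketch
      }

      // Replace createKey()/createValue(): the reader owns its key/value objects,
      // which may or may not be reused between calls.
      @Override
      public LongWritable getCurrentKey() { return key; }

      @Override
      public Text getCurrentValue() { return value; }

      // getProgress survives; getPos does not, since not all readers have byte offsets.
      @Override
      public float getProgress() { return 0.0f; }

      @Override
      public void close() throws IOException { }
    }
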
  17. OutputFormat
      - Replace:
          RecordWriter<K,V> getRecordWriter(FileSystem ignored, JobConf conf, String name, Progressable progress)
        with:
          RecordWriter<K,V> getRecordWriter(TaskAttemptContext ctx)
  18. OutputFormat (cont)
      - Replace:
          void checkOutputSpecs(FileSystem ignore, JobConf conf)
        with:
          void checkOutputSpecs(JobContext ctx)
      - The OutputCommitter is returned by the OutputFormat instead of configured separately!
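
A skeleton of the new OutputFormat contract (MyOutputFormat is an illustrative name; the writer and committer bodies are omitted):

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.OutputCommitter;
    import org.apache.hadoop.mapreduce.OutputFormat;
    import org.apache.hadoop.mapreduce.RecordWriter;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;

    public class MyOutputFormat extends OutputFormat<Text, IntWritable> {

      // One TaskAttemptContext argument replaces FileSystem/JobConf/name/Progressable.
      @Override
      public RecordWriter<Text, IntWritable> getRecordWriter(TaskAttemptContext ctx)
          throws IOException, InterruptedException {
        throw new UnsupportedOperationException("writer omitted in this sketch");
      }

      @Override
      public void checkOutputSpecs(JobContext ctx)
          throws IOException, InterruptedException {
        // Validate the configuration, e.g. that an output path is set and not taken.
      }

      // New in this API: the OutputFormat hands back its own OutputCommitter
      // instead of the committer being configured separately.
      @Override
      public OutputCommitter getOutputCommitter(TaskAttemptContext ctx)
          throws IOException, InterruptedException {
        throw new UnsupportedOperationException("committer omitted in this sketch");
      }
    }
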
  19. Cluster Information (in 0.21)
      - Replace JobClient with Cluster.
      - Replace ClusterStatus with ClusterMetrics.
      - Replace RunningJob with JobStatus.
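
A brief sketch of the 0.21-era replacements; hedged, since that API was unreleased at the time of the talk (ClusterInfo and the printed metric are illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Cluster;
    import org.apache.hadoop.mapreduce.ClusterMetrics;

    public class ClusterInfo {
      public static void main(String[] args) throws Exception {
        // Cluster replaces JobClient as the client-side handle on the cluster.
        Cluster cluster = new Cluster(new Configuration());

        // ClusterMetrics replaces ClusterStatus.
        ClusterMetrics metrics = cluster.getClusterStatus();
        System.out.println("Task trackers: " + metrics.getTaskTrackerCount());
      }
    }
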
  20. September 23rd, 2009: Mining the web with Hadoop, Cascading & Bixo
  21. Thank you. See you at the next Bay Area User Group, Oct 21st, 2009.
