Introduction To Hadoop                                    Kenneth Heafield                                          Google ...
Outline1    Word Count Code      Mapper      Reducer      Main2    How it Works       Serialization       Data Flow3    La...
Word Count Code   MapperMapper< “wikipedia.org”, “The Free” >→ < “The”, 1 >, < “Free”, 1 >public void map(WritableComparab...
Word Count Code   MapperMapper< “wikipedia.org”, “The Free” >→ < “The”, 1 >, < “Free”, 1 >public void map(WritableComparab...
Word Count Code   MapperMapper< “wikipedia.org”, “The Free” >→ < “The”, 1 >, < “Free”, 1 >public void map(WritableComparab...
Word Count Code   MapperMapper< “wikipedia.org”, “The Free” >→ < “The”, 1 >, < “Free”, 1 >public void map(WritableComparab...
Word Count Code   MapperMapper< “wikipedia.org”, “The Free” >→ < “The”, 1 >, < “Free”, 1 >public void map(WritableComparab...
Word Count Code   ReducerReducer< “The”, 1 >, < “The”, 1 >→ < “The”, 2 >public void reduce(WritableComparable key,        ...
Word Count Code   ReducerReducer< “The”, 1 >, < “The”, 1 >→ < “The”, 2 >public void reduce(WritableComparable key,        ...
Word Count Code   ReducerReducer< “The”, 1 >, < “The”, 1 >→ < “The”, 2 >public void reduce(WritableComparable key,        ...
Word Count Code   ReducerReducer< “The”, 1 >, < “The”, 1 >→ < “The”, 2 >public void reduce(WritableComparable key,        ...
Word Count Code   MainMainpublic static void main(String[] args)    throws IOException {  JobConf conf = new JobConf(WordC...
Word Count Code   MainMainpublic static void main(String[] args)    throws IOException {  JobConf conf = new JobConf(WordC...
Word Count Code   MainMainpublic static void main(String[] args)    throws IOException {  JobConf conf = new JobConf(WordC...
Word Count Code   MainMainpublic static void main(String[] args)    throws IOException {  JobConf conf = new JobConf(WordC...
How it Works   SerializationTypesPurposeSimple serialization for keys, values, and other dataInterface Writable     Read a...
How it Works   SerializationA Writablepublic class IntPairWritable implements Writable {  public int first;  public int se...
How it Works   SerializationWritableComparable Methodpublic int compareTo(Object other) {  IntPairWritable o = (IntPairWri...
How it Works   Data FlowData FlowDefault Flow 1 Mappers read from HDFS 2   Map output is partitioned by key and sent to Re...
How it Works   Data FlowCombinersConcept    Add counts at Mapper before sending to Reducer.     Word count is 6 minutes wi...
LabExercisesRecommended: Word CountGet word count running.BigramsCount bigrams and unigrams efficiently.CapitalizationWith w...
LabInstructions  1   Login to the cluster successfully (and set your password).  2   Get Eclipse installed, so you can bui...
Upcoming SlideShare
Loading in...5
×

Hadoop

454

Published on

0 Comments
2 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
454
On Slideshare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
16
Comments
0
Likes
2
Embeds 0
No embeds

No notes for slide

Transcript of "Hadoop"

  1. 1. Introduction To Hadoop Kenneth Heafield Google Inc January 14, 2008Example code from Hadoop 0.13.1 used under the Apache License Version 2.0and modified for presentation. Except as otherwise noted, the content of thispresentation is licensed under the Creative Commons Attribution 2.5 License. Kenneth Heafield (Google Inc) Introduction To Hadoop January 14, 2008 1 / 12
  2. 2. Outline1 Word Count Code Mapper Reducer Main2 How it Works Serialization Data Flow3 Lab Kenneth Heafield (Google Inc) Introduction To Hadoop January 14, 2008 2 / 12
  3. 3. Word Count Code MapperMapper< “wikipedia.org”, “The Free” >→ < “The”, 1 >, < “Free”, 1 >public void map(WritableComparable key, Writable value, OutputCollector output, Reporter reporter) throws IOException { String line = ((Text)value).toString(); StringTokenizer itr = new StringTokenizer(line); Text word = new Text(); while (itr.hasMoreTokens()) { word.set(itr.nextToken()); output.collect(word, new IntWritable(1)); }} Kenneth Heafield (Google Inc) Introduction To Hadoop January 14, 2008 3 / 12
  4. 4. Word Count Code MapperMapper< “wikipedia.org”, “The Free” >→ < “The”, 1 >, < “Free”, 1 >public void map(WritableComparable key, Writable value, OutputCollector output, Reporter reporter) throws IOException { String line = ((Text)value).toString(); StringTokenizer itr = new StringTokenizer(line); Text word = new Text(); while (itr.hasMoreTokens()) { word.set(itr.nextToken()); output.collect(word, new IntWritable(1)); }} Kenneth Heafield (Google Inc) Introduction To Hadoop January 14, 2008 3 / 12
  5. 5. Word Count Code MapperMapper< “wikipedia.org”, “The Free” >→ < “The”, 1 >, < “Free”, 1 >public void map(WritableComparable key, Writable value, OutputCollector output, Reporter reporter) throws IOException { String line = ((Text)value).toString(); StringTokenizer itr = new StringTokenizer(line); Text word = new Text(); while (itr.hasMoreTokens()) { word.set(itr.nextToken()); output.collect(word, new IntWritable(1)); }} Kenneth Heafield (Google Inc) Introduction To Hadoop January 14, 2008 3 / 12
  6. 6. Word Count Code MapperMapper< “wikipedia.org”, “The Free” >→ < “The”, 1 >, < “Free”, 1 >public void map(WritableComparable key, Writable value, OutputCollector output, Reporter reporter) throws IOException { String line = ((Text)value).toString(); StringTokenizer itr = new StringTokenizer(line); Text word = new Text(); while (itr.hasMoreTokens()) { word.set(itr.nextToken()); output.collect(word, new IntWritable(1)); }} Kenneth Heafield (Google Inc) Introduction To Hadoop January 14, 2008 3 / 12
  7. 7. Word Count Code MapperMapper< “wikipedia.org”, “The Free” >→ < “The”, 1 >, < “Free”, 1 >public void map(WritableComparable key, Writable value, OutputCollector output, Reporter reporter) throws IOException { String line = ((Text)value).toString(); StringTokenizer itr = new StringTokenizer(line); Text word = new Text(); while (itr.hasMoreTokens()) { word.set(itr.nextToken()); output.collect(word, new IntWritable(1)); }} Kenneth Heafield (Google Inc) Introduction To Hadoop January 14, 2008 3 / 12
  8. 8. Word Count Code ReducerReducer< “The”, 1 >, < “The”, 1 >→ < “The”, 2 >public void reduce(WritableComparable key, Iterator values, OutputCollector output, Reporter reporter) throws IOException { int sum = 0; while (values.hasNext()) { sum += ((IntWritable) values.next()).get(); } output.collect(key, new IntWritable(sum));} Kenneth Heafield (Google Inc) Introduction To Hadoop January 14, 2008 4 / 12
  9. 9. Word Count Code ReducerReducer< “The”, 1 >, < “The”, 1 >→ < “The”, 2 >public void reduce(WritableComparable key, Iterator values, OutputCollector output, Reporter reporter) throws IOException { int sum = 0; while (values.hasNext()) { sum += ((IntWritable) values.next()).get(); } output.collect(key, new IntWritable(sum));} Kenneth Heafield (Google Inc) Introduction To Hadoop January 14, 2008 4 / 12
  10. 10. Word Count Code ReducerReducer< “The”, 1 >, < “The”, 1 >→ < “The”, 2 >public void reduce(WritableComparable key, Iterator values, OutputCollector output, Reporter reporter) throws IOException { int sum = 0; while (values.hasNext()) { sum += ((IntWritable) values.next()).get(); } output.collect(key, new IntWritable(sum));} Kenneth Heafield (Google Inc) Introduction To Hadoop January 14, 2008 4 / 12
  11. 11. Word Count Code ReducerReducer< “The”, 1 >, < “The”, 1 >→ < “The”, 2 >public void reduce(WritableComparable key, Iterator values, OutputCollector output, Reporter reporter) throws IOException { int sum = 0; while (values.hasNext()) { sum += ((IntWritable) values.next()).get(); } output.collect(key, new IntWritable(sum));} Kenneth Heafield (Google Inc) Introduction To Hadoop January 14, 2008 4 / 12
  12. 12. Word Count Code MainMainpublic static void main(String[] args) throws IOException { JobConf conf = new JobConf(WordCount.class); conf.setJobName("wordcount"); conf.setMapperClass(MapClass.class); conf.setCombinerClass(ReduceClass.class); conf.setReducerClass(ReduceClass.class); conf.setNumMapTasks(new Integer(40)); conf.setNumReduceTasks(new Integer(30)); conf.setInputPath(new Path("/shared/wikipedia_small")); conf.setOutputPath(new Path("/user/kheafield/word_count")); conf.setOutputKeyClass(Text.class); conf.setOutputValueClass(IntWritable.class); JobClient.runJob(conf);} Kenneth Heafield (Google Inc) Introduction To Hadoop January 14, 2008 5 / 12
  13. 13. Word Count Code MainMainpublic static void main(String[] args) throws IOException { JobConf conf = new JobConf(WordCount.class); conf.setJobName("wordcount"); conf.setMapperClass(MapClass.class); conf.setCombinerClass(ReduceClass.class); conf.setReducerClass(ReduceClass.class); conf.setNumMapTasks(new Integer(40)); conf.setNumReduceTasks(new Integer(30)); conf.setInputPath(new Path("/shared/wikipedia_small")); conf.setOutputPath(new Path("/user/kheafield/word_count")); conf.setOutputKeyClass(Text.class); conf.setOutputValueClass(IntWritable.class); JobClient.runJob(conf);} Kenneth Heafield (Google Inc) Introduction To Hadoop January 14, 2008 5 / 12
  14. 14. Word Count Code MainMainpublic static void main(String[] args) throws IOException { JobConf conf = new JobConf(WordCount.class); conf.setJobName("wordcount"); conf.setMapperClass(MapClass.class); conf.setCombinerClass(ReduceClass.class); conf.setReducerClass(ReduceClass.class); conf.setNumMapTasks(new Integer(40)); conf.setNumReduceTasks(new Integer(30)); conf.setInputPath(new Path("/shared/wikipedia_small")); conf.setOutputPath(new Path("/user/kheafield/word_count")); conf.setOutputKeyClass(Text.class); conf.setOutputValueClass(IntWritable.class); JobClient.runJob(conf);} Kenneth Heafield (Google Inc) Introduction To Hadoop January 14, 2008 5 / 12
  15. 15. Word Count Code MainMainpublic static void main(String[] args) throws IOException { JobConf conf = new JobConf(WordCount.class); conf.setJobName("wordcount"); conf.setMapperClass(MapClass.class); conf.setCombinerClass(ReduceClass.class); conf.setReducerClass(ReduceClass.class); conf.setNumMapTasks(new Integer(40)); conf.setNumReduceTasks(new Integer(30)); conf.setInputPath(new Path("/shared/wikipedia_small")); conf.setOutputPath(new Path("/user/kheafield/word_count")); conf.setOutputKeyClass(Text.class); conf.setOutputValueClass(IntWritable.class); JobClient.runJob(conf);} Kenneth Heafield (Google Inc) Introduction To Hadoop January 14, 2008 5 / 12
  16. 16. How it Works SerializationTypesPurposeSimple serialization for keys, values, and other dataInterface Writable Read and write binary format Convert to String for text formats WritableComparable adds sorting order for keysExample Implementations ArrayWritable is only Writable BooleanWritable IntWritable sorts in increasing order Text holds a String Kenneth Heafield (Google Inc) Introduction To Hadoop January 14, 2008 6 / 12
  17. 17. How it Works SerializationA Writablepublic class IntPairWritable implements Writable { public int first; public int second; public void write(DataOutput out) throws IOException { out.writeInt(first); out.writeInt(second); } public void readFields(DataInput in) throws IOException { first = in.readInt(); second = in.readInt(); } public int hashCode() { return first + second; } public String toString() { return Integer.toString(first) + "," + Integer.toString(second); } Kenneth Heafield (Google Inc) Introduction To Hadoop January 14, 2008 7 / 12
  18. 18. How it Works SerializationWritableComparable Methodpublic int compareTo(Object other) { IntPairWritable o = (IntPairWritable)other; if (first < o.first) return -1; if (first > o.first) return 1; if (second < o.second) return -1; if (second > o.second) return 1; return 0;} Kenneth Heafield (Google Inc) Introduction To Hadoop January 14, 2008 8 / 12
  19. 19. How it Works Data FlowData FlowDefault Flow 1 Mappers read from HDFS 2 Map output is partitioned by key and sent to Reducers 3 Reducers sort input by key 4 Reduce output is written to HDFS 1 HDFS 2 Mapper 3 Reducer 4 HDFS Input Map Sort Reduce Output Input Map Sort Reduce Output Kenneth Heafield (Google Inc) Introduction To Hadoop January 14, 2008 9 / 12
  20. 20. How it Works Data FlowCombinersConcept Add counts at Mapper before sending to Reducer. Word count is 6 minutes with combiners and 14 without.Implementation Mapper caches output and periodically calls Combiner Input to Combine may be from Map or Combine Combiner uses interface as Reducer Mapper Combine Sort Reduce OutputInput Map Cache Sort Reduce Output Kenneth Heafield (Google Inc) Introduction To Hadoop January 14, 2008 10 / 12
  21. 21. LabExercisesRecommended: Word CountGet word count running.BigramsCount bigrams and unigrams efficiently.CapitalizationWith what probability is a word capitalized?IndexerIn what documents does each word appear? Where in the documents? Kenneth Heafield (Google Inc) Introduction To Hadoop January 14, 2008 11 / 12
  22. 22. LabInstructions 1 Login to the cluster successfully (and set your password). 2 Get Eclipse installed, so you can build Java code. 3 Install the Hadoop plugin for Eclipse so you can deploy jobs to the cluster. 4 Set up your Eclipse workspace from a template that we provide. 5 Run the word counter example over the Wikipedia data set. Kenneth Heafield (Google Inc) Introduction To Hadoop January 14, 2008 12 / 12
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×