Hadoop Hackathon Reader

This reader contains an introduction to MapReduce jobs. It covers some important classes within the r0.20.2 version of Hadoop, the setup of an empty application, and a simple assignment that can be used to get familiar with the framework. It was created for the Hadoop Hackathon at SARA (http://www.sara.nl) on December 7th, 2010.

Note that the URLs used in this document might not persist.

Usage rights: CC Attribution-ShareAlike License

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

    Hadoop Hackathon Reader Hadoop Hackathon Reader Document Transcript

SARA Hadoop Hackathon December 2010

Table of Contents

An introduction to Java MapReduce jobs in Apache Hadoop
    org.apache.hadoop.mapreduce.InputFormat
    org.apache.hadoop.mapreduce.Mapper
    org.apache.hadoop.io.SequenceFile and org.apache.hadoop.mapreduce.Partitioner
    org.apache.hadoop.mapreduce.Reducer
    org.apache.hadoop.mapreduce.OutputFormat
An empty Hadoop MapReduce job in Java
    org.apache.hadoop.util.Tool
    org.apache.hadoop.mapreduce.Mapper
    org.apache.hadoop.mapreduce.Reducer
A simple try-out: top Wikipedia page views
    The setup
    Our Tool
    Our Mapper
    Our Reducer

An introduction to Java MapReduce jobs in Apache Hadoop

A MapReduce job written in Java typically consists of the following components:

1. An InputFormat
2. A Mapper
3. A SequenceFile and Partitioner
4. A Reducer
5. An OutputFormat

org.apache.hadoop.mapreduce.InputFormat

@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/InputFormat.html
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/InputSplit.html
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/RecordReader.html

It is the InputFormat's responsibility to:

• Validate the input of the job
• Split up the input into logical InputSplits, each of which will be assigned to a Mapper
• Provide an implementation of a RecordReader, which is used by a Mapper to read input records from its logical InputSplit

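As a small illustration of these responsibilities (a sketch only, not part of this reader's code; the class name and path are made up), the snippet below points a job at one of Hadoop's bundled InputFormats. TextInputFormat validates the input paths, splits plain-text files into InputSplits, and supplies a RecordReader that hands each Mapper (byte offset, line) pairs.

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class InputSetup {

    /**
     * Point a job at plain-text input. TextInputFormat then takes care of
     * validating the paths, creating InputSplits and reading records.
     */
    public static void configureInput(Job job) throws IOException {
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path("in-dir"));  // placeholder path
    }
}
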
org.apache.hadoop.mapreduce.Mapper

@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/Mapper.html
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/RecordReader.html
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/InputSplit.html
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/Mapper.Context.html

A Mapper implements application-specific logic. It reads a set of key/value pairs as input from a RecordReader, and generates a set of key/value pairs as output.

A Mapper should override the map function:

public void map(KEYIN key, VALUEIN value, Mapper.Context context)

Every time a Mapper gets initialized – which happens once for each InputSplit – a function is called to set up the object. You can optionally override this function and do your own setup:

public void setup(Mapper.Context context)

Similarly, you can override a cleanup function that is called when the Mapper object is destroyed:

public void cleanup(Mapper.Context context)

Output from a Mapper is collected from within Mapper.map(). The Context object, provided as a parameter to the function, exposes a function that must be used for this task:

public void write(KEYOUT key, VALUEOUT value)

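To make the roles of map(), setup() and Context.write() concrete, here is a small, self-contained Mapper (an illustration only, not part of the hackathon code; the class name is invented) that splits every input line into words and emits a (word, 1) pair for each of them:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // The key is the byte offset of the line; the value is the line itself.
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, ONE);   // collect output through the Context
        }
    }
}

Reusing the Text and IntWritable instances instead of creating new objects for every record is a common Hadoop idiom.
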
org.apache.hadoop.io.SequenceFile and org.apache.hadoop.mapreduce.Partitioner

@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/io/SequenceFile.html
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/Partitioner.html
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/io/SequenceFile.Sorter.html

Temporary outputs from the Mappers are stored in SequenceFiles. A SequenceFile is a binary representation of key/value pairs. A SequenceFile object provides a:

• SequenceFile.Reader
• SequenceFile.Writer
• and a SequenceFile.Sorter.

If the job is configured to use more than one Reducer, the sorted SequenceFile is partitioned by a Partitioner, creating as many partitions as there are Reducers. The partitioning is done by executing some function on each key in the SequenceFile, typically a hash function. Each Reducer then fetches a range of keys, assembled from all SequenceFiles produced by the Mappers, over the internal network using HTTP. These individual sorted ranges are then merged into a single sorted range. These events are usually collectively referred to as the "shuffle phase".

org.apache.hadoop.mapreduce.Reducer

@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/Reducer.html

A Reducer, like a Mapper, implements application-specific logic. You can draw an analogy with SQL to understand the distinction. In an SQL SELECT query, the input data (a table) is filtered by zero or more conditions in a WHERE clause. The resulting data is optionally grouped, for example by a GROUP BY clause, and after that the aggregate functions can be applied (SUM(), AVG(), COUNT(), etcetera). In MapReduce terms, the conditional logic of a query is done by the Mapper. When the Mappers are finished, the resulting data is sorted on the keys. The Reducers take care of the aggregate functions (and can be arbitrarily complex). (This analogy has actually been part of a discussion for some time now; see http://www.databasecolumn.com/2008/01/mapreduce-a-major-step-back.html.)

A Reducer, after the shuffle phase has completed, has a number of keys, each with one or more values, to apply its logic to. Like the Mapper, the Reducer has setup and cleanup functions that can be overridden. The application logic is applied through the function:

public void reduce(KEYIN key, Iterable<VALUEIN> values, Reducer.Context context)

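Continuing the SQL analogy, the sketch below (an illustration only, pairing with the hypothetical WordCountMapper above; not part of the hackathon code) shows a Reducer playing the role of SUM(): after the shuffle phase all counts for a given word arrive together, and the Reducer adds them up.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // All counts for this key (word) are summed, like SQL's SUM().
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
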
org.apache.hadoop.mapreduce.OutputFormat

@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/OutputFormat.html
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/RecordWriter.html
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/fs/FileSystem.html

The OutputFormat is responsible for:

• Validating the job's output specification
• Providing an implementation of RecordWriter, which is used to write the output files of the job. The output is written to a FileSystem.

An empty Hadoop MapReduce job in Java

Any MapReduce job in Java implements a minimum of three classes:

• a Tool
• a Mapper
• and a Reducer

An implementation of an empty MapReduce job that can be used as a base for new jobs can be found in a SARA Subversion repository: https://subtrac.sara.nl/oss/svn/hadoop/trunk/BaseProject/. Read-only access is provided for anonymous users. The example code in this document is simplified code from this repository.

org.apache.hadoop.util.Tool

@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/util/Tool.html
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/conf/Configurable.html
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/conf/Configured.html
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/util/ToolRunner.html
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/conf/Configuration.html
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/Job.html
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.html
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/lib/output/FileOutputFormat.html

An implementation of Tool is the single point of entry for any Hadoop MapReduce application. The implementing class should expose a main() method. It is commonly used to configure the job – through the parsing of command-line options, static configuration in the code itself, or a combination of both. The Tool interface has Configurable as its superinterface. Therefore, an implementation of Tool must either subclass an implementation of Configurable or implement the interface itself. The typical Hadoop MapReduce application subclasses Configured, which is an implementation of Configurable.

Next to the main method, an implementation of Tool must implement:

public int run(String[] args) throws Exception

The run method is responsible for actually configuring and running the Job. Below is a simplified implementation of Tool, followed by a step-by-step explanation.

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class RunnerTool extends Configured implements Tool {

    /**
     * An org.apache.commons.logging.Log object
     */
    private static final Log LOG = LogFactory.getLog(RunnerTool.class.getName());

    /**
     * This function handles configuration and submission of your
     * MapReduce job.
     * @return 1 on failure, 0 on success
     * @throws Exception
     */
    @Override
    public int run(String[] arg0) throws Exception {
        Configuration conf = getConf();
        Job job = new Job(conf);
        job.setJarByClass(RunnerTool.class);
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);

        FileInputFormat.addInputPath(job, new Path("in-dir"));
        FileOutputFormat.setOutputPath(job, new Path("out-dir-" + System.nanoTime()));

        if (!job.waitForCompletion(true)) {
            LOG.error("Job failed!");
            return 1;
        }
        return 0;
    }

    /**
     * Main method. Runs the application.
     * @param args
     * @throws Exception
     */
    public static void main(String... args) throws Exception {
        System.exit(ToolRunner.run(new RunnerTool(), args));
    }
}

1. The main method uses the static ToolRunner.run() method. This method parses generic Hadoop command-line options (see http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/util/GenericOptionsParser.html#GenericOptions) and, if necessary, modifies the Configuration object. After that it calls RunnerTool.run().
2. Our RunnerTool.run() method starts by fetching the job's Configuration object. The object can then be used to further configure the job, using the set*() functions (a small sketch follows this list).
3. Then a Job is created using the Configuration object, and we let the Job know which jar it came from by calling its setJarByClass() method.
4. We need to tell our Job which Mapper and Reducer it should use by calling the setMapperClass() and setReducerClass() methods.
5. Now we tell the Job what data it will operate on (FileInputFormat.addInputPath(), called once for each input file) and where it should store its output data (FileOutputFormat.setOutputPath()). Note: the output directory must not yet exist!
6. The Job now has all the information it needs, and is submitted by calling job.waitForCompletion().

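As a sketch of step 2 (the property name my.job.n is made up for illustration and is not part of the base project): a value stored in the Configuration inside run() can be read back later from within a Mapper or Reducer through its Context.

In RunnerTool.run(), before the Job is created:

    Configuration conf = getConf();
    conf.setInt("my.job.n", 10);    // any of the Configuration.set*() variants works

Later, for example in MyMapper.setup(), the same value is available again:

    @Override
    public void setup(Context context) {
        int n = context.getConfiguration().getInt("my.job.n", 10);   // 10 is the default
    }
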
org.apache.hadoop.mapreduce.Mapper

@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/io/LongWritable.html
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/io/Text.html

Our empty Mapper class only provides the setup() and map() functions. It is worth noting that, using Java generics, we tell the Mapper that the type of:

1. the input key will be LongWritable (an object wrapper for the long datatype)
2. the input value will be Text (an object wrapper for text)
3. the output key will be LongWritable as well
4. the output value will be Text.

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MyMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

    /**
     * An org.apache.commons.logging.Log object
     */
    private static final Log LOG = LogFactory.getLog(MyMapper.class.getName());

    /**
     * This function is called once during the start of the map phase.
     * @param context The job Context
     */
    @Override
    public void setup(Context context) {

    }

    /**
     * This function holds the mapper logic.
     * @param key The key of the K/V input pair
     * @param value The value of the K/V input pair
     * @param context The context of the application
     */
    @Override
    public void map(LongWritable key, Text value, Context context) {

    }
}

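A side note on the generics above (not covered in this reader, but a common pitfall): by default the job assumes the Mapper's output types are the same as the job's final output types. If they differ, the intermediate types have to be declared explicitly in the Tool's run() method, for example:

    job.setMapOutputKeyClass(Text.class);        // key type emitted by the Mapper
    job.setMapOutputValueClass(Text.class);      // value type emitted by the Mapper
    job.setOutputKeyClass(LongWritable.class);   // key type written by the Reducer
    job.setOutputValueClass(Text.class);         // value type written by the Reducer
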
org.apache.hadoop.mapreduce.Reducer

@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/io/LongWritable.html
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/io/Text.html

Our empty Reducer, like the Mapper, only provides the setup() and reduce() functions. Also like our Mapper, using Java generics, we tell the Reducer that the type of:

1. the input key will be LongWritable
2. the input value will be Text
3. the output key will be LongWritable as well
4. the output value will be Text.

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MyReducer extends Reducer<LongWritable, Text, LongWritable, Text> {

    /**
     * The LOG Object
     */
    private static final Log LOG = LogFactory.getLog(MyReducer.class.getName());

    /**
     * This function is called once during the start of the reduce phase.
     * @param context The job Context
     */
    @Override
    public void setup(Context context) {

    }

    /**
     * This function holds the reducer logic.
     * @param key The key of the input K/V pair
     * @param values Values associated with key
     * @param context The context of the application
     */
    @Override
    public void reduce(LongWritable key, Iterable<Text> values, Context context) {

    }
}

A simple try-out: top Wikipedia page views

Courtesy of Edgar Meij (edgar.meij@uva.nl), UvA ILPS, we have access to a sample dataset containing the number of page views per article, per language code, during a single hour. The data is structured as follows:

[language_code] [article_name] [page_views] [transferred_bytes]

In the example data below, the English-language article about Amsterdam has been viewed 215 times during a certain hour, and these views generated a total of 23312999 bytes (~23 MB) of traffic:

en Amsterdam 215 23312999

You can download the sample dataset from https://subtrac.sara.nl/oss/svn/hadoop/branches/WikipediaPageCounts/in-dir/.

Data like this could give us an interesting view on the usage of Wikipedia. Say we have this data collected over a period of months or even longer. We would be able to see the rise and fall in popularity of a certain page over time, and maybe try to find a relation between the evolution of the article and its relative size by looking at the total amount of transferred bytes.

But you can start simpler: by extracting the top [N] viewed pages per language code. You can use the empty MapReduce classes from the previous chapter as a starting point.

The setup

Our Mapper will output the language code as key, and the page views and article title as value – for each line in our input file.

Our Reducer – which gets the data after the shuffle phase is done and all values are sorted – will get all pages associated with a single language code. The Reducer will maintain a top [N] list of the pages it has seen, and output this list when it has checked all values.

The implementation of Tool we will use has the responsibility to read a single argument: [N]. It furthermore needs to tell the Job how to handle our Mapper and Reducer, particularly which InputFormat and OutputFormat to expect, and which outputKeyClass and outputValueClass to use.

Our Tool

@JavaDoc: https://subtrac.sara.nl/oss/svn/hadoop/branches/WikipediaPageCounts/doc/javadoc/nl/sara/hadoop/RunnerTool.html
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/lib/input/TextInputFormat.html
@See: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/lib/output/TextOutputFormat.html

These are the steps you need to take to get a functional implementation of Tool for this job (a sketch of the resulting run() method follows the list). Hint: use the previous chapters if you are missing information, and try to get familiar with the APIs by looking at the documentation.

1. Our Tool will accept a single argument, N. It will have to pass the argument on from the main() method to the run() method – keeping missing input in mind, of course. After that it should use the Configuration.set() method to pass the configuration on to the job.
2. Since we are dealing with plain text, organized in single lines, we can use Hadoop's native TextInputFormat type to deal with our input file and create FileSplits for our Mapper.
3. The output will be lines in the form of [language_code] [article_name] [page_views]. We can easily store this as plain text, so we can use Hadoop's TextOutputFormat.
4. Since both the key and the value we will store in our TextOutputFormat will be of type Text, we should tell our job to expect these types.

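One possible shape of the run() method after working through these steps (a sketch only, not the repository solution; the property name topn.n is made up, it assumes MyMapper and MyReducer have been re-typed to emit Text keys and values for this exercise, and it needs imports for TextInputFormat and TextOutputFormat in addition to those of the base RunnerTool):

    @Override
    public int run(String[] args) throws Exception {
        if (args.length < 1) {
            System.err.println("Usage: RunnerTool <N>");
            return 1;
        }
        Configuration conf = getConf();
        conf.set("topn.n", args[0]);                        // step 1: hand N to the tasks

        Job job = new Job(conf);
        job.setJarByClass(RunnerTool.class);
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);

        job.setInputFormatClass(TextInputFormat.class);     // step 2: plain-text lines in
        job.setOutputFormatClass(TextOutputFormat.class);   // step 3: plain-text lines out
        job.setOutputKeyClass(Text.class);                  // step 4: both output key and
        job.setOutputValueClass(Text.class);                //         output value are Text

        FileInputFormat.addInputPath(job, new Path("in-dir"));
        FileOutputFormat.setOutputPath(job, new Path("out-dir-" + System.nanoTime()));

        return job.waitForCompletion(true) ? 0 : 1;
    }
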
Our Mapper

@Javadoc: https://subtrac.sara.nl/oss/svn/hadoop/branches/WikipediaPageCounts/doc/javadoc/nl/sara/hadoop/MyMapper.html

Our Mapper is trivially simple. It needs to split the input value (the TextInputFormat gives the line itself as value, and the position of the first character of the line in the file as key) on spaces. If that was successful, it should output the first word – the language code – as key, and the remainder as value.

Even though this is a trivial action and can be written as a single line of code, make sure to deal with Exceptions. You cannot expect every line in the input to be structured in exactly the same way, and a fact of life is that most datasets you will work with do not strictly adhere to their stated structure. Fault tolerance can be achieved by wrapping the fragile parts of your code in try/catch blocks, while (especially during development) logging all entries that raise an Exception.

Our Reducer

@Javadoc: https://subtrac.sara.nl/oss/svn/hadoop/branches/WikipediaPageCounts/doc/javadoc/nl/sara/hadoop/MyReducer.html

The Reducer is a bit less trivial. We want to loop over all values we receive for a certain key – a language code in our case – and maintain a top [N] of the most viewed pages. Every time we process a new value, we should check whether it is higher than the lowest value in our top [N], and replace the lowest value with the current one if it is.

A TreeMap object comes in handy for storing the top [N], since it keeps its entries sorted by key. Using the page-view count as the TreeMap key makes finding the current lowest entry of your top [N] very easy.
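To make the TreeMap idea concrete, here is one possible reduce() (a sketch, not the repository solution). It assumes Reducer<Text, Text, Text, Text>, that the Mapper emits the remainder of the input line – [article_name] [page_views] [transferred_bytes] – as its value, that N was stored under the made-up property name topn.n, and imports for java.io.IOException, java.util.Map and java.util.TreeMap:

    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        int n = context.getConfiguration().getInt("topn.n", 10);   // N, with a default

        // A TreeMap keeps its entries sorted by key (the view count here), so the
        // smallest entry of the current top [N] is always at firstKey(). Note that
        // articles with identical view counts overwrite each other in this sketch.
        TreeMap<Long, String> top = new TreeMap<Long, String>();

        for (Text value : values) {
            try {
                String[] fields = value.toString().split("\\s+");
                long views = Long.parseLong(fields[1]);
                top.put(views, fields[0]);
                if (top.size() > n) {
                    top.remove(top.firstKey());                    // drop the lowest entry
                }
            } catch (Exception e) {
                // Malformed record: skip it (and log it during development).
            }
        }

        // Emit the top [N] for this language code, highest view count first.
        for (Map.Entry<Long, String> entry : top.descendingMap().entrySet()) {
            context.write(key, new Text(entry.getValue() + " " + entry.getKey()));
        }
    }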