Short Apache Hadoop API Overview

Adam Kawa
Data Engineer @ Spotify

2/24/13

Image Source http://developer.yahoo.com/hadoop/tutorial/module4.html
InputFormat Responsibilities
Divide input data into logical input splits
   Data in HDFS is divided into blocks, but processed as input
   splits
   An InputSplit may contain any number of blocks (usually 1)
   Each Mapper processes one input split
Create RecordReaders to extract <key, value> pairs


InputFormat Class
public abstract class InputFormat<K, V> {
    public abstract
      List<InputSplit> getSplits(JobContext context) throws ...;

    public abstract
      RecordReader<K,V> createRecordReader(InputSplit split,
                      TaskAttemptContext context) throws ...;
}


Most Common InputFormats
TextInputFormat
   Each \n-terminated line is a value
   The byte offset of that line is a key
      Why not a line number?


KeyValueTextInputFormat
   Key and value are separated by a separator (tab by default)
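
A small illustration of the TextInputFormat keys, written in plain Java without any Hadoop classes (ByteOffsetKeys and records are names made up for this sketch). It lists the byte offset that would serve as the key for each line. A likely answer to "Why not a line number?": a Mapper that starts in the middle of a file cannot know how many lines precede its split without reading the earlier splits, while the byte offset is known from the split itself.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Plain-Java illustration (no Hadoop classes) of TextInputFormat-style keys. */
final class ByteOffsetKeys {

    /** Maps the byte offset of each '\n'-terminated line to the text of that line. */
    static Map<Long, String> records(byte[] data) {
        Map<Long, String> out = new LinkedHashMap<>();
        int lineStart = 0;
        for (int i = 0; i <= data.length; i++) {
            boolean atEnd = (i == data.length);
            // Emit at every '\n', and once more at the end for an unterminated last line.
            if (atEnd ? lineStart < data.length : data[i] == '\n') {
                out.put((long) lineStart,
                        new String(data, lineStart, i - lineStart, StandardCharsets.UTF_8));
                lineStart = i + 1;
            }
        }
        return out;
    }
}
```

Because the keys are byte offsets, a multi-byte character shifts the key of the following line by more than one.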
Binary InputFormats
SequenceFileInputFormat
   SequenceFiles are flat files consisting of binary <key,
   value> pairs
AvroInputFormat
   Avro supports rich data structures (not necessarily <key,
   value> pairs) serialized to files or messages
   Compact, fast, language-independent, self-describing,
   dynamic
Some Other InputFormats
NLineInputFormat
   The input should not be too big, since splits are calculated in a
   single thread (NLineInputFormat#getSplitsForFile)
CombineFileInputFormat
   An abstract class, but not so difficult to extend
SeparatorInputFormat
   How-to: http://blog.rguha.net/?p=293

Some Other InputFormats
MultipleInputs
   Supports multiple input paths with a different
   InputFormat and Mapper for each path

MultipleInputs.addInputPath(job,
      firstPath, FirstInputFormat.class, FirstMapper.class);
MultipleInputs.addInputPath(job,
      secondPath, SecondInputFormat.class, SecondMapper.class);

InputFormat Class (Partial) Hierarchy




InputFormat Interesting Facts
Ideally, the InputSplit size is equal to the HDFS block size
   Or an InputSplit contains multiple collocated HDFS blocks
An InputFormat may prevent splitting a file
   The whole file is then processed by a single mapper (e.g. gzip)
   protected boolean FileInputFormat#isSplitable(JobContext context, Path filename);



InputFormat Interesting Facts
Mapper knows the file/offset/size of the split that it processes
   MapContext#getInputSplit()
   Useful for later debugging on a local machine




InputFormat Interesting Facts
PathFilter (included in InputFormat) specifies which files
  to include in the input data

PathFilter hiddenFileFilter = new PathFilter() {
   public boolean accept(Path p) {
          String name = p.getName();
          // Skip files whose names start with "_" or "."
          return !name.startsWith("_") && !name.startsWith(".");
   }
};
RecordReader
Extracts <key, value> pairs from the corresponding InputSplit
Examples:
   LineRecordReader
   KeyValueLineRecordReader
   SequenceFileRecordReader



RecordReader Logic
  Must handle a common situation when InputSplit and
    HDFS block boundaries do not match





Image source: Hadoop: The Definitive Guide by Tom White
RecordReader Logic
  Example solution, based on LineRecordReader
       Skips* everything from its block until the first '\n'
       Reads from the second block until it sees '\n'
       *except for the very first block (offset equal to 0)
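
The skip-and-read-ahead rule above can be simulated on a byte array in plain Java. This is only a sketch under stated assumptions: SplitLineReader and readSplit are made-up names, no Hadoop classes are used, and the rule implemented is "a split that does not start at offset 0 discards everything up to and including the first '\n', and every split finishes any line that starts at or before its end". The real LineRecordReader additionally deals with compression, '\r\n' line endings and maximum line lengths.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Plain-Java simulation of how lines are assigned to byte-range splits. */
final class SplitLineReader {

    /** Returns the lines owned by the split [start, start + length) of data. */
    static List<String> readSplit(byte[] data, int start, int length) {
        int end = start + length;
        int pos = start;
        if (start != 0) {
            // Not the first split: whatever precedes the first '\n' belongs to the previous split.
            pos = indexAfterNewline(data, pos);
        }
        List<String> lines = new ArrayList<>();
        // A line that starts at or before 'end' is read to completion, even past the boundary.
        while (pos <= end && pos < data.length) {
            int next = indexAfterNewline(data, pos);
            int lineEnd = (data[next - 1] == '\n') ? next - 1 : next;
            lines.add(new String(data, pos, lineEnd - pos, StandardCharsets.UTF_8));
            pos = next;
        }
        return lines;
    }

    /** Index just past the first '\n' at or after 'from', or data.length if there is none. */
    private static int indexAfterNewline(byte[] data, int from) {
        int i = from;
        while (i < data.length && data[i] != '\n') {
            i++;
        }
        return Math.min(i + 1, data.length);
    }
}
```

With the input "aaa\nbbbb\ncc\n" and splits of 5, 5 and 2 bytes, the first split yields "aaa" and "bbbb", the second yields "cc", and the third yields nothing, so every line is read exactly once.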





Image source: Hadoop: The Definitive Guide by Tom White
Keys And Values
Keys must implement the WritableComparable interface
   Since they are sorted before being passed to the Reducers
Values must implement “at least” the Writable interface




WritableComparables Hierarchy





Image source: Hadoop: The Definitive Guide by Tom White
Writable And WritableComparable
public interface Writable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}
public interface WritableComparable<T> extends Writable,
   Comparable<T> {
}
public interface Comparable<T> {
    public int compareTo(T o);
}
Example: SongWritable
class SongWritable implements Writable {
 String title;
 int year;
 byte[] content;
 …
 public void write(DataOutput out) throws ... {
     out.writeUTF(title);
     out.writeInt(year);
     out.writeInt(content.length);
     out.write(content);
 }
}
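
The slide shows only write(). A matching readFields() has to read the fields back in exactly the same order. The sketch below uses only java.io so that it runs without Hadoop; SongRecord and roundTrip are illustrative names, and in a real job the class would implement org.apache.hadoop.io.Writable.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Same field layout as SongWritable, without the Hadoop Writable interface. */
final class SongRecord {
    String title = "";
    int year;
    byte[] content = new byte[0];

    public void write(DataOutput out) throws IOException {
        out.writeUTF(title);
        out.writeInt(year);
        out.writeInt(content.length);   // length prefix so that readFields knows how much to read
        out.write(content);
    }

    public void readFields(DataInput in) throws IOException {
        // Fields are read in the order in which write() emitted them.
        title = in.readUTF();
        year = in.readInt();
        content = new byte[in.readInt()];
        in.readFully(content);
    }

    /** Serializes a record to memory and reads it back into a fresh instance. */
    static SongRecord roundTrip(SongRecord song) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            song.write(new DataOutputStream(buffer));
            SongRecord copy = new SongRecord();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(buffer.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A round trip like this is a cheap unit test for any custom Writable, since a mismatch between write() and readFields() otherwise shows up only at job run time.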
Mapper
Takes input in the form of a <key, value> pair
Emits a set of intermediate <key, value> pairs
Stores them locally and later passes them to the Reducers
   But first: partition + sort + spill + merge




Mapper Methods
protected void setup(Context context) throws ... {}
protected void cleanup(Context context) throws ... {}
protected void map(KEYIN key, VALUEIN value, Context context) throws ... {
    context.write((KEYOUT) key, (VALUEOUT) value);
}
public void run(Context context) throws ... {
    setup(context);
    while (context.nextKeyValue()) {
        map(context.getCurrentKey(), context.getCurrentValue(), context);
    }
    cleanup(context);
}
MapContext Object
Allows the user's map code to communicate with the MapReduce system


public InputSplit getInputSplit();
public TaskAttemptID getTaskAttemptID();
public void setStatus(String msg);
public boolean nextKeyValue() throws ...;
public KEYIN getCurrentKey() throws ...;
public VALUEIN getCurrentValue() throws ...;
public void write(KEYOUT key, VALUEOUT value) throws ...;
public Counter getCounter(String groupName, String counterName);

Examples Of Mappers
Implement highly specialized Mappers and reuse/chain them
  when possible


IdentityMapper
InverseMapper
RegexMapper
TokenCounterMapper

TokenCounterMapper
public class TokenCounterMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();


    @Override
    public void map(Object key, Text value, Context context
                        ) throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}
General Advice
Reuse Writable instances instead of creating a new one each time
The Apache Commons StringUtils class seems to be the most
  efficient for String tokenization




Chain Of Mappers
Use multiple Mapper classes within a single Map task
The output of the first Mapper becomes the input of the
  second, and so on until the last Mapper
The output of the last Mapper will be written to the task's
  output
Encourages implementation of reusable and highly
  specialized Mappers

Example Chain Of Mappers
 JobConf mapAConf = new JobConf(false);
 ...
 ChainMapper.addMapper(conf, AMap.class, LongWritable.class, Text.class,
   Text.class, Text.class, true, mapAConf);
 
 JobConf mapBConf = new JobConf(false);
 ...
 ChainMapper.addMapper(conf, BMap.class, Text.class, Text.class,
   LongWritable.class, Text.class, false, mapBConf);


 FileInputFormat.setInputPaths(conf, inDir);
 FileOutputFormat.setOutputPath(conf, outDir);
 JobClient jc = new JobClient(conf);
 RunningJob job = jc.submitJob(conf);

Partitioner
Specifies which Reducer a given <key, value> pair is sent to
An even distribution of the intermediate data is desirable
Skewed data may overload a single reducer and make the whole
  job run longer
public abstract class Partitioner<KEY, VALUE> {
    public abstract
       int getPartition(KEY key, VALUE value, int numPartitions);
}
HashPartitioner
The default choice for general-purpose use cases

public int getPartition(K key, V value, int numReduceTasks) {
    return
    (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}
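
Why the "& Integer.MAX_VALUE" mask: hashCode() may be negative, and in Java the % operator then returns a negative remainder, which is not a valid partition index. The plain-Java sketch below (HashPartitionDemo and naivePartition are made-up names, no Hadoop classes) contrasts the two variants and assumes numReduceTasks > 0.

```java
/** Plain-Java demo of the HashPartitioner formula. */
final class HashPartitionDemo {

    /** Same formula as on the slide: clear the sign bit, then take the remainder. */
    static int getPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    /** Naive variant without the mask; may return a negative index. */
    static int naivePartition(Object key, int numReduceTasks) {
        return key.hashCode() % numReduceTasks;
    }
}
```

For the key Integer.valueOf(-7), whose hash code is -7, and 4 reducers, the naive variant returns -3 while the masked variant returns 1. Math.abs would not be a safe replacement for the mask, because Math.abs(Integer.MIN_VALUE) is still negative.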




TotalOrderPartitioner
A partitioner that aims at a total order of the output




TotalOrderPartitioner
Before the job runs, it samples the input data to provide a fairly
  even distribution over keys




TotalOrderPartitioner
Three samplers
   InputSampler.RandomSampler<K,V>
          Samples from random points in the input
   InputSampler.IntervalSampler<K,V>
          Samples from s splits at regular intervals
   InputSampler.SplitSampler<K,V>
          Samples the first n records from s splits
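
A rough sketch of the idea behind sampling and total ordering, in plain Java with String keys. It is not the TotalOrderPartitioner code: TotalOrderSketch, splitPoints and getPartition are names invented here, and the sample is assumed to be non-empty. Split points are taken at even intervals from the sorted sample, and a key is routed to the partition whose key range contains it, so every key in partition i sorts before every key in partition i + 1.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Plain-Java sketch of range partitioning driven by a key sample. */
final class TotalOrderSketch {

    /** Picks numPartitions - 1 split points at even intervals from the sorted sample. */
    static List<String> splitPoints(List<String> sample, int numPartitions) {
        List<String> sorted = new ArrayList<>(sample);
        Collections.sort(sorted);
        List<String> points = new ArrayList<>();
        for (int i = 1; i < numPartitions; i++) {
            points.add(sorted.get(i * sorted.size() / numPartitions));
        }
        return points;
    }

    /** Partition index = number of split points that are less than or equal to the key. */
    static int getPartition(String key, List<String> points) {
        int idx = Collections.binarySearch(points, key);
        // Found at idx: the key opens partition idx + 1. Not found: idx encodes -(insertionPoint) - 1.
        return idx >= 0 ? idx + 1 : -(idx + 1);
    }
}
```

With the sample [d, a, c, b, f, e] and three partitions, the split points are c and e, so a and b go to partition 0, c and d to partition 1, and e and f to partition 2. The better the sample reflects the real key distribution, the more even the partitions.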
Reducer
Gets list(<key, list(value)>)
Keys are sorted, but values for a given key are not sorted
Emits a set of output <key, value> pairs




Reducer Run Method
public void run(Context context) throws … {
    setup(context);
    while (context.nextKey()) {
        reduce(context.getCurrentKey(),
               context.getValues(), context);
    }
    cleanup(context);
}
Chain Of Mappers After A Reducer
The ChainReducer class allows chaining multiple Mapper classes after a
  Reducer within the Reducer task
Combined with ChainMapper, one could get [MAP+ / REDUCE MAP*]

ChainReducer.setReducer(conf, XReduce.class, LongWritable.class, Text.class,
   Text.class, Text.class, true, reduceConf);
ChainReducer.addMapper(conf, CMap.class, Text.class, Text.class,
   LongWritable.class, Text.class, false, null);
ChainReducer.addMapper(conf, DMap.class, LongWritable.class, Text.class,
   LongWritable.class, LongWritable.class, true, null);

OutputFormat Class Hierarchy





Image source: Hadoop: The Definitive Guide by Tom White
MultipleOutputs
MultipleOutputs.addNamedOutput(job, "text", TextOutputFormat.class,
  LongWritable.class, Text.class);
MultipleOutputs.addNamedOutput(job, "seq", SequenceFileOutputFormat.class,
  LongWritable.class, Text.class);


public void reduce(WritableComparable key, Iterable<Writable> values, Context
  context) throws ... {
     ...
     mos.write("text", key, new Text("Hello"));
     mos.write("seq", new LongWritable(1), new Text("Bye"), "seq_a");
     mos.write("seq", new LongWritable(2), new Text("Chau"), "seq_b");
     mos.write(key, new Text("value"), generateFileName(key, new Text("value")));
}

Other Useful Features
Combiner
Skipping bad records
Compression
Profiling
Isolation Runner



Job Class Methods
public void setInputFormatClass(..);          public void setNumReduceTasks(int tasks);
public void setOutputFormatClass(..);         public void setJobName(String name);
public void setMapperClass(..);               public float mapProgress();
public void setCombinerClass(..);             public float reduceProgress();
public void setReducerClass(...);             public boolean isComplete();
public void setPartitionerClass(..);          public boolean isSuccessful();
public void setMapOutputKeyClass(..);         public void killJob();
public void setMapOutputValueClass(..);       public void submit();
public void setOutputKeyClass(..);            public boolean waitForCompletion(..);
public void setOutputValueClass(..);
public void setSortComparatorClass(..);
public void setGroupingComparatorClass(..);


ToolRunner
Supports parsing of generic options, which allows the user to specify
  configuration options on the command line
hadoop jar examples.jar SongCount
  -D mapreduce.job.reduces=10
  -D artist.gender=FEMALE
  -files dictionary.dat
  -libjars math.jar,spotify.jar
  songs counts


Side Data Distribution
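public class MyMapper<K, V> extends Mapper<K, V, V, K> {
  String gender = null;
  File dictionary = null;

  @Override
  protected void setup(Context context) throws … {
     Configuration conf = context.getConfiguration();
     // Set with -D artist.gender=...; falls back to "MALE"
     gender = conf.get("artist.gender", "MALE");
     // Shipped with -files, so it is available in the task's working directory
     dictionary = new File("dictionary.dat");
  }
}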
public class WordCount extends Configured implements Tool {
    @Override
    public int run(String[] otherArgs) throws Exception {
        // otherArgs holds what is left after ToolRunner has consumed the generic options
        if (otherArgs.length != 2) {
            System.err.printf("Usage: %s [options] <input> <output>%n", getClass().getSimpleName());
            return -1;
        }
        Job job = new Job(getConf());
        FileInputFormat.setInputPaths(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        ...
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] allArgs) throws Exception {
        int exitCode = ToolRunner.run(new Configuration(), new WordCount(), allArgs);
        System.exit(exitCode);
    }
}

MRUnit
Built on top of JUnit
Provides a mock InputSplit, Context and other classes
Can test
  The Mapper class,
  The Reducer class,
  The full MapReduce job
  The pipeline of MapReduce jobs
MRUnit Example
public class IdentityMapTest extends TestCase {
    private MapDriver<Text, Text, Text, Text> driver;
    @Before
    public void setUp() {
        driver = new MapDriver<Text, Text, Text, Text>(new MyMapper<Text, Text, Text, Text>());
    }
    @Test
    public void testMyMapper() {
        driver
           .withInput(new Text("foo"), new Text("bar"))
           .withOutput(new Text("oof"), new Text("rab"))
        .runTest();
    }
}

Example: Secondary Sort
The reduce(key, Iterator<value>) method gets an iterator
  over values
These values are not sorted for a given key
Sometimes we want to get them sorted
Useful for finding the minimum or maximum value quickly




Secondary Sort Is Tricky
A couple of custom classes are needed
   WritableComparable
   Partitioner
   SortComparator (optional, but recommended)
   GroupingComparator



Composite Key
Leverages the “traditional” sorting mechanism of intermediate keys
The intermediate key becomes a composite of the “natural” key and the value

(Disturbia, 1)   → (Disturbia#1, 1)
(SOS, 4)         → (SOS#4, 4)
(Disturbia, 7)   → (Disturbia#7, 7)
(Fast car, 2)    → (Fast car#2, 2)
(Fast car, 6)    → (Fast car#6, 6)
(Disturbia, 4)   → (Disturbia#4, 4)
(Fast car, 2)    → (Fast car#2, 2)

Custom Partitioner
HashPartitioner uses a hash on keys
    The same titles may go to different reducers (because they are
    combined with ts in a key)
Use a custom partitioner that partitions only on the first part of the key

int getPartition(TitleWithTs key, LongWritable value, int num) {
    return hashPartitioner.getPartition(key.title, value, num);
}


Ordering Of Keys
Keys need to be ordered before being passed to the reducer
Order by the natural key and, for the same natural key, by the
  value portion of the key
Implement sorting in WritableComparable or use a
  Comparator class

job.setSortComparatorClass(SongWithTsComparator.class);


Data Passed To The Reducer
By default, each unique key triggers a separate call to the reduce() method
(Disturbia#1, 1) → reduce method is invoked
(Disturbia#4, 4) → reduce method is invoked
(Disturbia#7, 7) → reduce method is invoked
(Fast car#2, 2)   → reduce method is invoked
(Fast car#2, 2)
(Fast car#6, 6)   → reduce method is invoked
(SOS#4, 4)        → reduce method is invoked


Data Passed To The Reducer
The grouping comparator (set with setGroupingComparatorClass) determines
  which keys and values are passed in a single call to the reduce method
Just look at the natural key when grouping
(Disturbia#1, 1) → reduce method is invoked
(Disturbia#4, 4)
(Disturbia#7, 7)
(Fast car#2, 2)    → reduce method is invoked
(Fast car#2, 2)
(Fast car#6, 6)
(SOS#4, 4)         → reduce method is invoked
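
The interplay of composite key, partitioner, sort comparator and grouping comparator can be tried out in plain Java before wiring it into a job. The sketch below simulates it in memory; SecondarySortSketch, reduceCalls and the comparator constants are names invented for illustration, and no Hadoop classes are involved. In a real job the key would be a WritableComparable and the comparators would be registered with setSortComparatorClass and setGroupingComparatorClass.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** In-memory simulation of secondary sort. */
final class SecondarySortSketch {

    /** Composite key: the natural key (title) plus the value to sort by (ts). */
    static final class TitleWithTs {
        final String title;
        final long ts;

        TitleWithTs(String title, long ts) {
            this.title = title;
            this.ts = ts;
        }
    }

    /** Sort comparator: natural key first, then the value part. */
    static final Comparator<TitleWithTs> SORT =
            Comparator.comparing((TitleWithTs k) -> k.title).thenComparingLong(k -> k.ts);

    /** Grouping comparator: natural key only. */
    static final Comparator<TitleWithTs> GROUP =
            Comparator.comparing((TitleWithTs k) -> k.title);

    /** Partitions on the natural key only, so all records of a title meet in one reducer. */
    static int getPartition(TitleWithTs key, int numPartitions) {
        return (key.title.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    /** Returns one "title=[ts, ...]" string per simulated reduce() call. */
    static List<String> reduceCalls(List<TitleWithTs> keys) {
        List<TitleWithTs> sorted = new ArrayList<>(keys);
        sorted.sort(SORT);
        List<String> calls = new ArrayList<>();
        List<Long> values = new ArrayList<>();
        TitleWithTs groupKey = null;
        for (TitleWithTs k : sorted) {
            if (groupKey == null || GROUP.compare(groupKey, k) != 0) {
                if (groupKey != null) {
                    calls.add(groupKey.title + "=" + values);   // close the previous group
                }
                groupKey = k;
                values = new ArrayList<>();
            }
            values.add(k.ts);
        }
        if (groupKey != null) {
            calls.add(groupKey.title + "=" + values);
        }
        return calls;
    }
}
```

Fed with the seven records from the Composite Key slide, reduceCalls returns three entries: Disturbia=[1, 4, 7], Fast car=[2, 2, 6] and SOS=[4], which matches the grouping shown above.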
Question
How to calculate a median from a set of numbers using Java
  MapReduce?




Question – A Possible Answer
Implement TotalSort, but
   Each Reducer produces an additional file containing a pair
   <minimum_value, number_of_values>
After the job ends, a single-threaded application
   Reads these files to build the index
   Calculates which value in which file is the median
   Finds this value in this file
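
The "calculate which value in which file" step could look roughly like the plain-Java sketch below. MedianLocator and locate are hypothetical names. The counts are assumed to be listed in output-file order already (the minimum values from the index would be used to establish that order), and for an even number of values the lower median is returned.

```java
/** Locates the median in totally ordered reducer outputs, given only the value counts. */
final class MedianLocator {

    /**
     * counts[i] is the number of values in reducer i's output; with total-order
     * partitioning, file i holds only values that sort before those in file i + 1.
     * Returns {fileIndex, offsetWithinFile} of the lower median (both 0-based).
     */
    static long[] locate(long[] counts) {
        long total = 0;
        for (long c : counts) {
            total += c;
        }
        if (total == 0) {
            throw new IllegalArgumentException("no values");
        }
        long target = (total - 1) / 2;   // 0-based global rank of the lower median
        for (int i = 0; i < counts.length; i++) {
            if (target < counts[i]) {
                return new long[] {i, target};
            }
            target -= counts[i];         // skip this file entirely
        }
        throw new IllegalStateException("unreachable: target is always smaller than total");
    }
}
```

For counts {3, 4, 2} the total is 9, so the median has global rank 4 and is found at offset 1 of the second file. Only that one file then has to be scanned.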
Thanks!
Would you like to use Hadoop API at Spotify?
Apply via jobs@spotify.com




UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
DianaGray10
 
Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science
RinaMondal9
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
Matthew Sinclair
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
Ralf Eggert
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
sonjaschweigert1
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
Neo4j
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
Aftab Hussain
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
Neo4j
 
GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...
ThomasParaiso2
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
Dorra BARTAGUIZ
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
Kari Kakkonen
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
mikeeftimakis1
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Paige Cruz
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
SOFTTECHHUB
 

Recently uploaded (20)

PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfSAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
 
Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
 
GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
 

Apache Hadoop Java API

  • 1. Short Apache Hadoop API Overview Adam Kawa Data Engineer @ Spotify 2/24/13
  • 3. InputFormat Responsibilities
      Divide input data into logical input splits
         Data in HDFS is divided into blocks, but processed as input splits
         An InputSplit may contain any number of blocks (usually 1)
         Each Mapper processes one input split
      Create RecordReaders to extract <key, value> pairs
  • 4. InputFormat Class
      public abstract class InputFormat<K, V> {
          public abstract List<InputSplit> getSplits(JobContext context) throws ...;

          public abstract RecordReader<K, V> createRecordReader(InputSplit split,
                  TaskAttemptContext context) throws ...;
      }
  • 5. Most Common InputFormats
      TextInputFormat
         Each \n-terminated line is a value
         The byte offset of that line is the key
            Why not a line number? (Splits are read independently, so a reader cannot know how many lines precede its split)
      KeyValueTextInputFormat
         Key and value are separated by a separator (tab by default)
  • 6. Binary InputFormats
      SequenceFileInputFormat
         SequenceFiles are flat files consisting of binary <key, value> pairs
      AvroInputFormat
         Avro supports rich data structures (not necessarily <key, value> pairs) serialized to files or messages
         Compact, fast, language-independent, self-describing, dynamic
  • 7. Some Other InputFormats
      NLineInputFormat
         The input should not be too big, since splits are calculated in a single thread (NLineInputFormat#getSplitsForFile)
      CombineFileInputFormat
         An abstract class, but not so difficult to extend
      SeparatorInputFormat
         How-to here: http://blog.rguha.net/?p=293
  • 8. Some Other InputFormats
      MultipleInputs
         Supports multiple input paths with a different InputFormat and Mapper for each path
      MultipleInputs.addInputPath(job,
            firstPath, FirstInputFormat.class, FirstMapper.class);
      MultipleInputs.addInputPath(job,
            secondPath, SecondInputFormat.class, SecondMapper.class);
  • 9. InputFormat Class (Partial) Hierarchy
  • 10. InputFormat Interesting Facts
      Ideally, the InputSplit size is equal to the HDFS block size
         Or an InputSplit contains multiple collocated HDFS blocks
      An InputFormat may prevent splitting a file
         The whole file is then processed by a single mapper (e.g. gzip)
         boolean FileInputFormat#isSplitable(JobContext context, Path filename);
  • 11. InputFormat Interesting Facts
      A Mapper knows the file/offset/size of the split that it processes
         MapContext#getInputSplit()
         Useful for later debugging on a local machine
  • 12. InputFormat Interesting Facts
      PathFilter (included in InputFormat) specifies which files to include in the input data
      PathFilter hiddenFileFilter = new PathFilter() {
          public boolean accept(Path p) {
              String name = p.getName();
              return !name.startsWith("_") && !name.startsWith(".");
          }
      };
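The split-size facts above can be illustrated with a small stand-alone sketch. It does not use the Hadoop classes; `SplitSizeSketch` and its methods are hypothetical names, and the clamping formula is an assumption modelled on how FileInputFormat is commonly described (block size bounded by a minimum and maximum split size). Block locations and other details of the real implementation are left out.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-alone sketch (no Hadoop dependency) of cutting a file into input splits.
class SplitSizeSketch {

    // Assumed rule: clamp the block size between a minimum and a maximum split size.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Returns {offset, length} pairs covering a file of the given length.
    // A non-splittable file (e.g. gzip) becomes one split handled by one mapper.
    static List<long[]> splits(long fileLength, long splitSize, boolean splittable) {
        List<long[]> result = new ArrayList<>();
        if (!splittable) {
            result.add(new long[] {0, fileLength});
            return result;
        }
        for (long offset = 0; offset < fileLength; offset += splitSize) {
            result.add(new long[] {offset, Math.min(splitSize, fileLength - offset)});
        }
        return result;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long splitSize = computeSplitSize(64 * mb, 1, Long.MAX_VALUE);
        System.out.println(splits(200 * mb, splitSize, true).size());   // splittable file
        System.out.println(splits(200 * mb, splitSize, false).size());  // e.g. a gzip file
    }
}
```

With the default bounds the split size equals the block size, so each mapper ideally reads one local block.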
  • 13. RecordReader
      Extracts <key, value> pairs from the corresponding InputSplit
      Examples:
         LineRecordReader
         KeyValueLineRecordReader
         SequenceFileRecordReader
  • 14. RecordReader Logic
      Must handle the common situation when InputSplit and HDFS block boundaries do not match
      Image source: Hadoop: The Definitive Guide by Tom White
  • 15. RecordReader Logic
      Exemplary solution, based on LineRecordReader
         Skips* everything from its block until the first '\n'
         Reads from the second block until it sees '\n'
         *except for the very first block (offset equal to 0)
      Image source: Hadoop: The Definitive Guide by Tom White
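The skip-and-read-past-the-boundary rule can be tried out on a plain byte array. This is a minimal sketch of the idea only, not the actual LineRecordReader code; `LineSplitSketch` and `readSplit` are hypothetical names, and the file is assumed to fit in memory.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Stand-alone sketch of line reading across split boundaries.
class LineSplitSketch {

    // Returns the lines owned by the reader of split [start, start + length).
    static List<String> readSplit(byte[] data, int start, int length) {
        int end = start + length;
        int pos = start;
        if (start != 0) {
            // Not the first split: skip up to and including the first '\n'.
            // The skipped line belongs to the reader of the previous split.
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            pos++;
        }
        List<String> lines = new ArrayList<>();
        // Read every line that starts at or before 'end', even if it
        // finishes in the next split (i.e. in the next block).
        while (pos <= end && pos < data.length) {
            int lineStart = pos;
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            lines.add(new String(data, lineStart, pos - lineStart, StandardCharsets.UTF_8));
            pos++; // step over the '\n'
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbeta\ngamma\ndelta\n".getBytes(StandardCharsets.UTF_8);
        for (int start = 0; start < data.length; start += 8) {
            int length = Math.min(8, data.length - start);
            System.out.println(start + ": " + readSplit(data, start, length));
        }
    }
}
```

Whatever the split size, every line is returned by exactly one reader, which is the property the real reader has to guarantee.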
  • 16. Keys And Values
      Keys must implement the WritableComparable interface
         Since they are sorted before being passed to the Reducers
      Values must implement “at least” the Writable interface
  • 17. WritableComparables Hierarchy
      Image source: Hadoop: The Definitive Guide by Tom White
  • 18. Writable And WritableComparable
      public interface Writable {
          void write(DataOutput out) throws IOException;
          void readFields(DataInput in) throws IOException;
      }

      public interface WritableComparable<T> extends Writable, Comparable<T> {
      }

      public interface Comparable<T> {
          public int compareTo(T o);
      }
  • 19. Example: SongWritable
      class SongWritable implements Writable {
          String title;
          int year;
          byte[] content;
          …
          public void write(DataOutput out) throws ... {
              out.writeUTF(title);
              out.writeInt(year);
              out.writeInt(content.length);
              out.write(content);
          }
      }
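The slide shows only write(); a matching readFields() has to read the fields back in the same order. The sketch below uses only java.io, so it mirrors the shape of Writable without implementing the Hadoop interface. `SongRecord` and `roundTrip` are hypothetical names introduced for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Stand-alone sketch: same field layout as SongWritable, no Hadoop dependency.
class SongRecord {
    String title;
    int year;
    byte[] content;

    SongRecord() {
    }

    SongRecord(String title, int year, byte[] content) {
        this.title = title;
        this.year = year;
        this.content = content;
    }

    void write(DataOutput out) throws IOException {
        out.writeUTF(title);
        out.writeInt(year);
        out.writeInt(content.length); // length prefix so the reader knows how much to read
        out.write(content);
    }

    // Reads the fields in exactly the order in which write() emitted them.
    void readFields(DataInput in) throws IOException {
        title = in.readUTF();
        year = in.readInt();
        content = new byte[in.readInt()];
        in.readFully(content);
    }

    // Serializes a record to bytes and deserializes it into a fresh instance.
    static SongRecord roundTrip(SongRecord song) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            song.write(new DataOutputStream(buffer));
            SongRecord copy = new SongRecord();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(buffer.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The no-argument constructor matters: the framework creates an empty instance first and then fills it through readFields().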
  • 20. Mapper
      Takes input in the form of a <key, value> pair
      Emits a set of intermediate <key, value> pairs
      Stores them locally and later passes them to the Reducers
         But earlier: partition + sort + spill + merge
  • 21. Mapper Methods
      protected void setup(Context context) throws ... {}
      protected void cleanup(Context context) throws ... {}

      protected void map(KEYIN key, VALUEIN value, Context context) ... {
          context.write((KEYOUT) key, (VALUEOUT) value);
      }

      public void run(Context context) throws ... {
          setup(context);
          while (context.nextKeyValue()) {
              map(context.getCurrentKey(), context.getCurrentValue(), context);
          }
          cleanup(context);
      }
  • 22. MapContext Object
      Allows the user's map code to communicate with the MapReduce system
      public InputSplit getInputSplit();
      public TaskAttemptID getTaskAttemptID();
      public void setStatus(String msg);
      public boolean nextKeyValue() throws ...;
      public KEYIN getCurrentKey() throws ...;
      public VALUEIN getCurrentValue() throws ...;
      public void write(KEYOUT key, VALUEOUT value) throws ...;
      public Counter getCounter(String groupName, String counterName);
  • 23. Examples Of Mappers
      Implement highly specialized Mappers and reuse/chain them when possible
         IdentityMapper
         InverseMapper
         RegexMapper
         TokenCounterMapper
  • 24. TokenCounterMapper
      public class TokenCounterMapper
              extends Mapper<Object, Text, Text, IntWritable> {

          private final static IntWritable one = new IntWritable(1);
          private Text word = new Text();

          @Override
          public void map(Object key, Text value, Context context)
                  throws IOException, InterruptedException {
              StringTokenizer itr = new StringTokenizer(value.toString());
              while (itr.hasMoreTokens()) {
                  word.set(itr.nextToken());
                  context.write(word, one);
              }
          }
      }
  • 25. General Advice
      Reuse Writable objects instead of creating a new one each time
      The Apache Commons StringUtils class seems to be the most efficient for String tokenization
  • 26. Chain Of Mappers
      Use multiple Mapper classes within a single Map task
      The output of the first Mapper becomes the input of the second, and so on until the last Mapper
      The output of the last Mapper will be written to the task's output
      Encourages implementation of reusable and highly specialized Mappers
  • 27. Exemplary Chain Of Mappers
      JobConf mapAConf = new JobConf(false);
      ...
      ChainMapper.addMapper(conf, AMap.class, LongWritable.class, Text.class,
            Text.class, Text.class, true, mapAConf);

      JobConf mapBConf = new JobConf(false);
      ...
      ChainMapper.addMapper(conf, BMap.class, Text.class, Text.class,
            LongWritable.class, Text.class, false, mapBConf);

      FileInputFormat.setInputPaths(conf, inDir);
      FileOutputFormat.setOutputPath(conf, outDir);

      JobClient jc = new JobClient(conf);
      RunningJob job = jc.submitJob(conf);
  • 28. Partitioner
      Specifies which Reducer a given <key, value> pair is sent to
      Aim for an even distribution of the intermediate data
         Skewed data may overload a single reducer and make the whole job run longer
      public abstract class Partitioner<KEY, VALUE> {
          public abstract int getPartition(KEY key, VALUE value, int numPartitions);
      }
  • 29. HashPartitioner
      The default choice for general-purpose use cases
      public int getPartition(K key, V value, int numReduceTasks) {
          return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
      }
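The mask with Integer.MAX_VALUE in the formula above clears the sign bit, which matters because hashCode() can be negative and Java's % keeps the sign of the dividend. A small plain-Java sketch of the same expression (`HashPartitionSketch` is a hypothetical name, not a Hadoop class):

```java
// Stand-alone sketch of the HashPartitioner formula.
class HashPartitionSketch {

    // Same expression as on the slide, applied to any object key.
    static int partition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        Integer key = -7;                        // Integer.hashCode() is the value itself
        System.out.println(key.hashCode() % 4);  // prints -3: not a valid partition index
        System.out.println(partition(key, 4));   // prints 1: always within [0, 4)
    }
}
```

Equal keys always produce the same index, so all values of one key meet in the same reducer.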
  • 30. TotalOrderPartitioner
      A partitioner that aims at a total order of the output
  • 31. TotalOrderPartitioner
      Before the job runs, it samples the input data to provide a fairly even distribution over keys
  • 32. TotalOrderPartitioner
      Three samplers
         InputSampler.RandomSampler<K,V>
            Samples from random points in the input
         InputSampler.IntervalSampler<K,V>
            Samples from s splits at regular intervals
         InputSampler.SplitSampler<K,V>
            Samples the first n records from s splits
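The idea behind sampling can be sketched without Hadoop: sort a sample of keys, pick numPartitions - 1 split points from it, and route each key by binary search. This is an illustration under simplifying assumptions (int keys, an in-memory sample); `TotalOrderSketch` and its methods are hypothetical names, not the TotalOrderPartitioner implementation.

```java
import java.util.Arrays;

// Stand-alone sketch of range partitioning driven by a key sample.
class TotalOrderSketch {

    // Picks numPartitions - 1 evenly spaced split points from the sorted sample.
    static int[] splitPoints(int[] sample, int numPartitions) {
        int[] sorted = sample.clone();
        Arrays.sort(sorted);
        int[] points = new int[numPartitions - 1];
        for (int i = 1; i < numPartitions; i++) {
            points[i - 1] = sorted[i * sorted.length / numPartitions];
        }
        return points;
    }

    // Keys below the first split point go to partition 0, keys from the first
    // split point up to (but excluding) the second go to partition 1, and so on.
    static int partition(int key, int[] points) {
        int idx = Arrays.binarySearch(points, key);
        return idx >= 0 ? idx + 1 : -idx - 1;
    }

    public static void main(String[] args) {
        int[] points = splitPoints(new int[] {9, 3, 7, 1, 5, 8, 2, 6, 4, 0}, 3);
        System.out.println(Arrays.toString(points));
        System.out.println(partition(2, points) + " " + partition(5, points) + " " + partition(42, points));
    }
}
```

Because a larger key never lands in a lower partition, concatenating the sorted reducer outputs in partition order gives a totally ordered result.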
  • 33. Reducer
      Gets list(<key, list(value)>)
      Keys are sorted, but values for a given key are not sorted
      Emits a set of output <key, value> pairs
  • 34. Reducer Run Method
      public void run(Context context) throws … {
          setup(context);
          while (context.nextKey()) {
              reduce(context.getCurrentKey(), context.getValues(), context);
          }
          cleanup(context);
      }
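To see how the Mapper and Reducer contracts fit together, the whole flow can be simulated in memory: tokenize each line as TokenCounterMapper does, group the emitted values by key in sorted key order, then call a reduce step once per key. This is a single-process sketch with hypothetical names (`WordCountSketch`, `run`); it ignores partitioning, spilling and merging.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Stand-alone simulation of map -> shuffle -> reduce for a word count.
class WordCountSketch {

    static Map<String, Integer> run(List<String> lines) {
        // Map phase plus shuffle: emit (token, 1) and group the values by key.
        // A TreeMap keeps the keys sorted, as they are when they reach a Reducer.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            StringTokenizer itr = new StringTokenizer(line);
            while (itr.hasMoreTokens()) {
                grouped.computeIfAbsent(itr.nextToken(), k -> new ArrayList<>()).add(1);
            }
        }
        // Reduce phase: one call per key, receiving all values of that key.
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int value : entry.getValue()) {
                sum += value;
            }
            counts.put(entry.getKey(), sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        lines.add("a b a");
        lines.add("b c");
        System.out.println(run(lines)); // prints {a=2, b=2, c=1}
    }
}
```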
  • 35. Chain Of Mappers After A Reducer
      The ChainReducer class allows chaining multiple Mapper classes after a Reducer within the Reducer task
      Combined with ChainMapper, one could get [MAP+ / REDUCE MAP*]
      ChainReducer.setReducer(conf, XReduce.class, LongWritable.class, Text.class,
            Text.class, Text.class, true, reduceConf);
      ChainReducer.addMapper(conf, CMap.class, Text.class, Text.class,
            LongWritable.class, Text.class, false, null);
      ChainReducer.addMapper(conf, DMap.class, LongWritable.class, Text.class,
            LongWritable.class, LongWritable.class, true, null);
  • 36. OutputFormat Class Hierarchy
      Image source: Hadoop: The Definitive Guide by Tom White
  • 37. MultipleOutputs
      MultipleOutputs.addNamedOutput(job, "text", TextOutputFormat.class,
            LongWritable.class, Text.class);
      MultipleOutputs.addNamedOutput(job, "seq", SequenceFileOutputFormat.class,
            LongWritable.class, Text.class);

      public void reduce(WritableComparable key, Iterable<Writable> values,
            Context context) throws ... {
          ...
          mos.write("text", key, new Text("Hello"));
          mos.write("seq", new LongWritable(1), new Text("Bye"), "seq_a");
          mos.write("seq", new LongWritable(2), new Text("Chau"), "seq_b");
          mos.write(key, new Text("value"), generateFileName(key, new Text("value")));
      }
  • 38. Other Useful Features
      Combiner
      Skipping bad records
      Compression
      Profiling
      Isolation Runner
  • 39. Job Class Methods
      public void setInputFormatClass(..);
      public void setOutputFormatClass(..);
      public void setMapperClass(..);
      public void setCombinerClass(..);
      public void setReducerClass(..);
      public void setPartitionerClass(..);
      public void setMapOutputKeyClass(..);
      public void setMapOutputValueClass(..);
      public void setOutputKeyClass(..);
      public void setOutputValueClass(..);
      public void setSortComparatorClass(..);
      public void setGroupingComparatorClass(..);

      public void setNumReduceTasks(int tasks);
      public void setJobName(String name);
      public float mapProgress();
      public float reduceProgress();
      public boolean isComplete();
      public boolean isSuccessful();
      public void killJob();
      public void submit();
      public boolean waitForCompletion(..);
  • 40. ToolRunner
      Supports parsing of generic options, which allows the user to specify configuration options on the command line
      hadoop jar examples.jar SongCount
        -D mapreduce.job.reduces=10
        -D artist.gender=FEMALE
        -files dictionary.dat
        -libjars math.jar,spotify.jar
        songs counts
  • 41. Side Data Distribution
      public class MyMapper<K, V> extends Mapper<K, V, V, K> {
          String gender = null;
          File dictionary = null;

          protected void setup(Context context) throws … {
              Configuration conf = context.getConfiguration();
              gender = conf.get("artist.gender", "MALE");
              dictionary = new File("dictionary.dat");
          }
      }
  • 42. public class WordCount extends Configured implements Tool {
          public int run(String[] otherArgs) throws Exception {
              if (otherArgs.length != 2) {
                  System.err.printf("Usage: %s [options] <input> <output>%n",
                          getClass().getSimpleName());
                  return -1;
              }
              Job job = new Job(getConf());
              FileInputFormat.setInputPaths(job, new Path(otherArgs[0]));
              FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
              ...
              return job.waitForCompletion(true) ? 0 : 1;
          }

          public static void main(String[] allArgs) throws Exception {
              int exitCode = ToolRunner.run(new Configuration(), new WordCount(), allArgs);
              System.exit(exitCode);
          }
      }
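What ToolRunner does with the command line before calling run() can be approximated in a few lines: `-D key=value` pairs go into the configuration and everything else is handed over as the remaining arguments. The sketch below is a rough stand-in with hypothetical names (`GenericOptionsSketch`, `properties`, `remaining`); it handles only `-D` and deliberately ignores the other generic options such as `-files`.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Stand-alone sketch of separating -D options from application arguments.
class GenericOptionsSketch {
    final Map<String, String> properties = new LinkedHashMap<>();
    final List<String> remaining = new ArrayList<>();

    GenericOptionsSketch(String[] args) {
        for (int i = 0; i < args.length; i++) {
            if (args[i].equals("-D") && i + 1 < args.length) {
                // "-D key=value": split at the first '=' only
                String[] keyValue = args[++i].split("=", 2);
                properties.put(keyValue[0], keyValue.length > 1 ? keyValue[1] : "");
            } else {
                // Anything else is passed on to the tool's run() method
                remaining.add(args[i]);
            }
        }
    }

    public static void main(String[] args) {
        GenericOptionsSketch parsed = new GenericOptionsSketch(new String[] {
            "-D", "mapreduce.job.reduces=10", "-D", "artist.gender=FEMALE", "songs", "counts"});
        System.out.println(parsed.properties);
        System.out.println(parsed.remaining); // prints [songs, counts]
    }
}
```

This is why run() in the WordCount example can simply check that exactly two arguments, the input and the output path, are left.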
  • 43. MRUnit
      Built on top of JUnit
      Provides a mock InputSplit, Context and other classes
      Can test
         The Mapper class,
         The Reducer class,
         The full MapReduce job
         The pipeline of MapReduce jobs
  • 44. MRUnit Example
      public class IdentityMapTest extends TestCase {
          private MapDriver<Text, Text, Text, Text> driver;

          @Before
          public void setUp() {
              driver = new MapDriver<Text, Text, Text, Text>(new MyMapper<Text, Text, Text, Text>());
          }

          @Test
          public void testMyMapper() {
              driver
                  .withInput(new Text("foo"), new Text("bar"))
                  .withOutput(new Text("oof"), new Text("rab"))
                  .runTest();
          }
      }
  • 45. Example: Secondary Sort
      The reduce(key, Iterator<value>) method gets an iterator over the values
         These values are not sorted for a given key
      Sometimes we want to get them sorted
         Useful to find the minimum or maximum value quickly
  • 46. Secondary Sort Is Tricky
      A couple of custom classes are needed
         WritableComparable
         Partitioner
         SortComparator (optional, but recommended)
         GroupingComparator
  • 47. Composite Key
      Leverages the “traditional” sorting mechanism of intermediate keys
      The intermediate key becomes a composite of the “natural” key and the value
         (Disturbia, 1) → (Disturbia#1, 1)
         (SOS, 4) → (SOS#4, 4)
         (Disturbia, 7) → (Disturbia#7, 7)
         (Fast car, 2) → (Fast car#2, 2)
         (Fast car, 6) → (Fast car#6, 6)
         (Disturbia, 4) → (Disturbia#4, 4)
         (Fast car, 2) → (Fast car#2, 2)
  • 48. Custom Partitioner
      HashPartitioner uses a hash on keys
         The same titles may go to different reducers (because they are combined with ts in a key)
      Use a custom partitioner that partitions only on the first part of the key
      int getPartition(TitleWithTs key, LongWritable value, int num) {
          return hashPartitioner.getPartition(key.title, value, num);
      }
  • 49. Ordering Of Keys
      Keys need to be ordered before being passed to the reducer
      Order by the natural key and, for the same natural key, by the value portion of the key
      Implement sorting in the WritableComparable or use a Comparator class
      job.setSortComparatorClass(SongWithTsComparator.class);
  • 50. Data Passed To The Reducer
      By default, each unique key triggers a separate call to the reduce() method
         (Disturbia#1, 1) → reduce method is invoked
         (Disturbia#4, 4) → reduce method is invoked
         (Disturbia#7, 7) → reduce method is invoked
         (Fast car#2, 2) → reduce method is invoked
         (Fast car#2, 2)
         (Fast car#6, 6) → reduce method is invoked
         (SOS#4, 4) → reduce method is invoked
  • 51. Data Passed To The Reducer
      The grouping comparator (set with setGroupingComparatorClass) determines which keys and values are passed in a single call to the reduce method
      Just look at the natural key when grouping
         (Disturbia#1, 1) → reduce method is invoked
         (Disturbia#4, 4)
         (Disturbia#7, 7)
         (Fast car#2, 2) → reduce method is invoked
         (Fast car#2, 2)
         (Fast car#6, 6)
         (SOS#4, 4) → reduce method is invoked
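The three pieces of the secondary sort (partition on the natural key, sort on the composite key, group on the natural key) can be simulated in memory with the song data from the slides. This is a sketch in plain Java, not Hadoop code; `SecondarySortSketch`, `SORT` and `reduceCalls` are hypothetical names, and only `TitleWithTs` is taken from the slides.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Stand-alone simulation of secondary sort on a single reducer.
class SecondarySortSketch {

    // Composite key: the natural key (title) plus the value portion (ts).
    static final class TitleWithTs {
        final String title;
        final long ts;

        TitleWithTs(String title, long ts) {
            this.title = title;
            this.ts = ts;
        }
    }

    // Sort comparator: natural key first, then the value portion.
    static final Comparator<TitleWithTs> SORT =
            Comparator.comparing((TitleWithTs k) -> k.title).thenComparingLong(k -> k.ts);

    // Custom partitioner: hashes the natural key only, so all composite keys
    // of one title end up in the same reducer.
    static int partition(TitleWithTs key, int numPartitions) {
        return (key.title.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Grouping on the natural key: each map entry stands for one reduce() call,
    // and its list holds the values in the order the reducer would see them.
    static Map<String, List<Long>> reduceCalls(List<TitleWithTs> keys) {
        List<TitleWithTs> sorted = new ArrayList<>(keys);
        sorted.sort(SORT);
        Map<String, List<Long>> calls = new LinkedHashMap<>();
        for (TitleWithTs key : sorted) {
            calls.computeIfAbsent(key.title, t -> new ArrayList<>()).add(key.ts);
        }
        return calls;
    }

    public static void main(String[] args) {
        List<TitleWithTs> keys = new ArrayList<>();
        keys.add(new TitleWithTs("Disturbia", 1));
        keys.add(new TitleWithTs("SOS", 4));
        keys.add(new TitleWithTs("Disturbia", 7));
        keys.add(new TitleWithTs("Fast car", 2));
        keys.add(new TitleWithTs("Fast car", 6));
        keys.add(new TitleWithTs("Disturbia", 4));
        keys.add(new TitleWithTs("Fast car", 2));
        // prints {Disturbia=[1, 4, 7], Fast car=[2, 2, 6], SOS=[4]}
        System.out.println(reduceCalls(keys));
    }
}
```

The first value of each group is the minimum and the last is the maximum, which is the benefit named on the "Example: Secondary Sort" slide.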
  • 52. Question
      How to calculate a median from a set of numbers using Java MapReduce?
  • 53. Question – A Possible Answer
      Implement TotalSort, but
         Each Reducer produces an additional file containing a pair <minimum_value, number_of_values>
      After the job ends, a single-thread application
         Reads these files to build the index
         Calculates which value in which file is the median
         Finds this value in this file
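The single-thread step of this answer can be sketched as follows. Each reducer output is modelled as a sorted long[] whose first element and length play the role of the <minimum_value, number_of_values> pair. The names (`MedianSketch`, `median`) are hypothetical, the arrays stand in for files, and the lower median is returned when the count is even.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Stand-alone sketch of locating the median in totally sorted reducer outputs.
class MedianSketch {

    // Each array is one reducer's sorted output; the value ranges of different
    // outputs do not overlap, as guaranteed by a total sort.
    static long median(List<long[]> reducerOutputs) {
        // Build the index: order the non-empty outputs by their minimum value.
        List<long[]> index = new ArrayList<>();
        long total = 0;
        for (long[] output : reducerOutputs) {
            if (output.length > 0) {
                index.add(output);
                total += output.length;
            }
        }
        if (total == 0) {
            throw new IllegalArgumentException("no values");
        }
        index.sort(Comparator.comparingLong((long[] output) -> output[0]));

        // Walk the index to find which output holds the median rank.
        long rank = (total - 1) / 2;
        for (long[] output : index) {
            if (rank < output.length) {
                return output[(int) rank]; // only this one "file" has to be read
            }
            rank -= output.length;
        }
        throw new IllegalStateException("rank not found");
    }

    public static void main(String[] args) {
        List<long[]> outputs = new ArrayList<>();
        outputs.add(new long[] {7, 8, 9});
        outputs.add(new long[] {1, 2});
        outputs.add(new long[] {3, 4, 5, 6});
        System.out.println(median(outputs)); // prints 5
    }
}
```

Only the small index and one output file are touched after the job, so the sequential part stays cheap even for large inputs.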
  • 54. Thanks! Would you like to use Hadoop API at Spotify? Apply via jobs@spotify.com