Apache Flink - Hadoop MapReduce Compatibility


Talk by Fabian Hueske, Apache Flink Meetup Berlin, 28th January 2015

  1. Apache Flink Hadoop Compatibility (Fabian Hueske, @fhueske)
  2. Hadoop MapReduce Jobs
     [Diagram: Input, Map, Reduce, Output; implemented by InputFormat, Mapper, Reducer, OutputFormat]
     • Jobs have a static structure.
     • Input, Output, Map, and Reduce run your custom (or library) code.
     • If the application logic is too complex, you need more than one job.
  3. Flink Programs
     [Diagram: data flow with the operators Source, Map, Reduce, Source, Source, Filter, Join, CoGroup, Sink]
     • Flink programs are DAG data flows.
     • Data sources, data sinks, Map, and Reduce operators are included.
     • Everything that MapReduce offers and much more (a superset).
     • Much better performance
       – especially if more than one MR job is executed.
  4. Run your Hadoop code with Flink?
     • Hadoop data types (Writable) are natively supported.
     • Hadoop file systems are natively supported.
     • Flink features Input- & OutputFormats, Map, and Reduce functions, just like Hadoop MapReduce.
     • Concepts are the same, but interfaces are not :-(
     But Flink provides wrappers for Hadoop code :-)
     • mapred.* API: In/OutputFormat, Mappers, & Reducers
     • mapreduce.* API: In/OutputFormat
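The wrapper idea on slide 4 ("concepts are the same, but interfaces are not") can be illustrated with a minimal sketch in plain Java. Everything here is an assumption made for illustration: `HadoopStyleMapper`, `FlinkStyleFlatMap`, `wrap`, and `tokenize` are hypothetical stand-ins, not the real Hadoop or Flink interfaces, and the real wrappers additionally deal with Writable types, configuration, and reporters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

final class WrapperSketch {

    /** Hypothetical Hadoop-style mapper: emits key/value pairs through a collector. */
    interface HadoopStyleMapper<KI, VI, KO, VO> {
        void map(KI key, VI value, BiConsumer<KO, VO> output);
    }

    /** Hypothetical Flink-style flat-map: one record in, any number of records out. */
    interface FlinkStyleFlatMap<I, O> {
        void flatMap(I record, List<O> out);
    }

    /** Adapter: exposes a Hadoop-style mapper through the Flink-style interface. */
    public static <KI, VI, KO, VO>
    FlinkStyleFlatMap<Map.Entry<KI, VI>, Map.Entry<KO, VO>> wrap(
            HadoopStyleMapper<KI, VI, KO, VO> mapper) {
        // Unpack the incoming pair, call the mapper, and repack every emitted pair.
        return (record, out) -> mapper.map(
                record.getKey(), record.getValue(),
                (k, v) -> out.add(Map.entry(k, v)));
    }

    /** Runs a WordCount-style tokenizer, written against the Hadoop-style interface, via the adapter. */
    public static List<Map.Entry<String, Long>> tokenize(String line) {
        HadoopStyleMapper<Long, String, String, Long> tokenizer = (offset, text, output) -> {
            for (String word : text.toLowerCase().split("\\W+")) {
                if (!word.isEmpty()) {
                    output.accept(word, 1L);
                }
            }
        };
        FlinkStyleFlatMap<Map.Entry<Long, String>, Map.Entry<String, Long>> wrapped = wrap(tokenizer);
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        wrapped.flatMap(Map.entry(0L, line), out);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("To be or not to be"));
    }
}
```

The mapper code itself stays untouched; only the thin adapter knows about both interfaces, which is the property the slide attributes to Flink's Hadoop wrappers.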
  5. Alright, sounds good… but will my WordCount still work?!?
  6. Code:

         final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

         // set up Hadoop InputFormat
         HadoopInputFormat<LongWritable, Text> hadoopInputFormat =
             new HadoopInputFormat<LongWritable, Text>(
                 new TextInputFormat(), LongWritable.class, Text.class, new JobConf());
         TextInputFormat.addInputPath(hadoopInputFormat.getJobConf(), new Path(inputPath));

         // read data with Hadoop InputFormat
         DataSet<Tuple2<LongWritable, Text>> text = env.createInput(hadoopInputFormat);

         DataSet<Tuple2<Text, LongWritable>> words =
             // apply Hadoop Mapper
             text.flatMap(new HadoopMapFunction<LongWritable, Text, Text, LongWritable>(new Tokenizer()))
             // apply Hadoop Reducer
             .groupBy(0).reduceGroup(
                 new HadoopReduceFunction<Text, LongWritable, Text, LongWritable>(new Counter()));

         // set up Hadoop Output Format
         HadoopOutputFormat<Text, LongWritable> hadoopOutputFormat =
             new HadoopOutputFormat<Text, LongWritable>(
                 new TextOutputFormat<Text, LongWritable>(), new JobConf());
         hadoopOutputFormat.getJobConf().set("mapred.textoutputformat.separator", " ");
         TextOutputFormat.setOutputPath(hadoopOutputFormat.getJobConf(), new Path(outputPath));

         // write data with Hadoop OutputFormat
         words.output(hadoopOutputFormat);

         // execute the program
         env.execute("Hadoop Compat WordCount");

     Slide callouts: Hadoop Data Types, Hadoop Input- & OutputFormats, Your Hadoop Functions.
     Yes, it will…
  7. Use MapReduce like you always wanted
     [Diagram labels: Input, Map, Reduce, Input, Output, Reduce, Map, Reduce, Output]
     • Freely assemble your functions into a program.
     • Very efficient, pipelined execution.
       – Program is executed on Flink (no Hadoop involved).
       – No writing to/reading from HDFS within a program.
     • Caveat: No support for custom Hadoop partitioners & sorters, yet :-(
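To make "freely assemble your functions into a program" on slide 7 concrete, here is a rough sketch that chains two map/reduce stages in memory, with no intermediate files between them. It uses plain `java.util.stream` as a stand-in for a data flow, not the Flink API, and the class and method names (`ChainedStagesSketch`, `wordCounts`, `countHistogram`) are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

final class ChainedStagesSketch {

    /** Stage 1 (map + reduce): split lines into words and count each word. */
    public static Map<String, Long> wordCounts(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\W+")))
                .filter(word -> !word.isEmpty())
                .collect(Collectors.groupingBy(word -> word, TreeMap::new, Collectors.counting()));
    }

    /** Stage 2 (map + reduce): for each count n, how many distinct words occur n times. */
    public static Map<Long, Long> countHistogram(Map<String, Long> counts) {
        return counts.values().stream()
                .collect(Collectors.groupingBy(count -> count, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = List.of("to be or not to be", "to do");
        // The output of stage 1 feeds stage 2 directly; nothing is written out in between.
        System.out.println(countHistogram(wordCounts(lines)));
    }
}
```

With classic MapReduce, each of the two stages would be a separate job with its result written to and read back from the file system; in a single data flow program the second stage simply consumes the first stage's output.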
  9. Do not change a single line of code!
     [Diagram labels: Hadoop Job; Source, Map, Reduce, Source, Source, Filter, Join, CoGroup, Sink]
     • Inject MapReduce jobs as a whole into Flink programs
       – with support for custom partitioners, sorters, groupers.
     • Run Hadoop MapReduce jobs on Flink
       – without changing a single line of code.
  10. Looking for some fun? Try Hadoop on Flink!