Apache Flink - Hadoop MapReduce Compatibility


Talk by Fabian Hueske, Apache Flink Meetup Berlin, 28th January 2015

Published in: Data & Analytics

  1. Apache Flink Hadoop Compatibility
     Fabian Hueske, @fhueske
  2. Hadoop MapReduce Jobs
     [Diagram: Input, Map, Reduce, Output; implemented by InputFormat, Mapper, Reducer, OutputFormat]
     • Jobs have a static structure.
     • Input, Output, Map, and Reduce run your custom (or library) code.
     • If the application logic is too complex, you need more than one job.
  3. Flink Programs
     [Diagram: a data flow built from Source, Map, Reduce, Filter, Join, CoGroup, and Sink operators]
     • Flink programs are DAG data flows.
     • Data sources, data sinks, Map, and Reduce operators are included.
     • Everything that MapReduce offers and much more (a superset).
     • Much better performance, especially if more than one MapReduce job is executed.
  4. Run your Hadoop code with Flink?
     • Hadoop data types (Writable) are natively supported.
     • Hadoop file systems are natively supported.
     • Flink features Input- & OutputFormats, Map, and Reduce functions, just like Hadoop MapReduce.
     • Concepts are the same, but interfaces are not :-(
     But Flink provides wrappers for Hadoop code :-)
     • mapred.* API: In/OutputFormat, Mappers, & Reducers
     • mapreduce.* API: In/OutputFormat
  5. Alright, sounds good… but will my WordCount still work?!?
  6. final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

     // set up Hadoop InputFormat
     HadoopInputFormat<LongWritable, Text> hadoopInputFormat =
         new HadoopInputFormat<LongWritable, Text>(
             new TextInputFormat(), LongWritable.class, Text.class, new JobConf());
     TextInputFormat.addInputPath(hadoopInputFormat.getJobConf(), new Path(inputPath));

     // read data with Hadoop InputFormat
     DataSet<Tuple2<LongWritable, Text>> text = env.createInput(hadoopInputFormat);

     DataSet<Tuple2<Text, LongWritable>> words =
         // apply Hadoop Mapper
         text.flatMap(new HadoopMapFunction<LongWritable, Text, Text, LongWritable>(new Tokenizer()))
         // apply Hadoop Reducer
         .groupBy(0).reduceGroup(new HadoopReduceFunction<Text, LongWritable, Text, LongWritable>(new Counter()));

     // set up Hadoop Output Format
     HadoopOutputFormat<Text, LongWritable> hadoopOutputFormat =
         new HadoopOutputFormat<Text, LongWritable>(
             new TextOutputFormat<Text, LongWritable>(), new JobConf());
     hadoopOutputFormat.getJobConf().set("mapred.textoutputformat.separator", " ");
     TextOutputFormat.setOutputPath(hadoopOutputFormat.getJobConf(), new Path(outputPath));

     // write data with Hadoop OutputFormat
     words.output(hadoopOutputFormat);

     // execute the program
     env.execute("Hadoop Compat WordCount");

     [Callouts: Hadoop Data Types; Hadoop Input- & OutputFormats; Your Hadoop Functions]
     Yes, it will…
  7. Use MapReduce like you always wanted
     [Diagram: a program assembled from Input, Map, Reduce, and Output steps]
     • Freely assemble your functions into a program.
     • Very efficient, pipelined execution.
       – Program is executed on Flink (no Hadoop involved).
       – No writing to/reading from HDFS within a program.
     • Caveat: No support for custom Hadoop partitioners & sorters, yet :-(
  9. Do not change a single line of code!
     [Diagram: a Hadoop Job embedded in a data flow of Source, Map, Reduce, Filter, Join, CoGroup, and Sink operators]
     • Inject MapReduce jobs as a whole into Flink programs
       – with support for custom partitioners, sorters, groupers.
     • Run Hadoop MapReduce jobs on Flink
       – without changing a single line of code.
  10. Looking for some fun? Try Hadoop on Flink!
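
The point on slide 2 that a MapReduce job has a static structure, and that more complex logic needs more than one job, can be illustrated with a small plain-Java sketch. It does not use Hadoop or Flink: `runJob`, `wordCount`, and `countHistogram` are hypothetical names made up for this illustration, and the in-memory `TreeMap` only stands in for the shuffle step. Each call to `runJob` is one fixed map, group, reduce pass, so asking "how many words occur n times?" takes two jobs run back to back.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiFunction;
import java.util.function.Function;

public class StaticJobSketch {

    // One "job" (hypothetical helper): a fixed map phase, a group-by-key step, a reduce phase.
    static <I, K extends Comparable<K>, V, O> List<O> runJob(
            List<I> input,
            Function<I, List<Map.Entry<K, V>>> mapper,
            BiFunction<K, List<V>, O> reducer) {
        // Map phase: each record may emit any number of key/value pairs.
        // The sorted map stands in for the shuffle that groups values by key.
        Map<K, List<V>> groups = new TreeMap<>();
        for (I record : input) {
            for (Map.Entry<K, V> pair : mapper.apply(record)) {
                groups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        // Reduce phase: one call per distinct key, in key order.
        List<O> output = new ArrayList<>();
        for (Map.Entry<K, List<V>> group : groups.entrySet()) {
            output.add(reducer.apply(group.getKey(), group.getValue()));
        }
        return output;
    }

    // Job 1: count how often each word occurs.
    static List<Map.Entry<String, Long>> wordCount(List<String> lines) {
        Function<String, List<Map.Entry<String, Long>>> tokenize = line -> {
            List<Map.Entry<String, Long>> pairs = new ArrayList<>();
            for (String token : line.toLowerCase().split("\\W+")) {
                if (!token.isEmpty()) {
                    pairs.add(new SimpleImmutableEntry<>(token, 1L));
                }
            }
            return pairs;
        };
        BiFunction<String, List<Long>, Map.Entry<String, Long>> count =
                (word, ones) -> new SimpleImmutableEntry<>(word, (long) ones.size());
        return runJob(lines, tokenize, count);
    }

    // Job 2: consumes the output of job 1 and counts how many words share each frequency.
    static List<Map.Entry<Long, Long>> countHistogram(List<Map.Entry<String, Long>> wordCounts) {
        Function<Map.Entry<String, Long>, List<Map.Entry<Long, Long>>> keyByCount = wc -> {
            List<Map.Entry<Long, Long>> pairs = new ArrayList<>();
            pairs.add(new SimpleImmutableEntry<>(wc.getValue(), 1L));
            return pairs;
        };
        BiFunction<Long, List<Long>, Map.Entry<Long, Long>> countWords =
                (count, ones) -> new SimpleImmutableEntry<>(count, (long) ones.size());
        return runJob(wordCounts, keyByCount, countWords);
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> counts = wordCount(Arrays.asList("to be or not to be"));
        System.out.println(counts);
        System.out.println(countHistogram(counts));
    }
}
```

In a real MapReduce setup the hand-off between the two jobs would go through the file system, which is the overhead slides 3 and 7 say a single Flink data flow avoids.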
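
Slide 4 says that the concepts are the same but the interfaces are not, and that Flink bridges the gap with wrappers. The following is a minimal sketch of that adapter idea in plain Java, not Flink's actual implementation. `HadoopStyleMapper` and `FlinkStyleFlatMap` are hypothetical stand-ins invented here for the two interface styles (they are not the real Hadoop or Flink types), and `wrap` plays the role that `HadoopMapFunction` plays on slide 6.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;
import java.util.function.Consumer;

public class WrapperSketch {

    // Hypothetical stand-in for a Hadoop-style mapper: separate key and value, pair collector.
    interface HadoopStyleMapper<KI, VI, KO, VO> {
        void map(KI key, VI value, BiConsumer<KO, VO> collector);
    }

    // Hypothetical stand-in for a Flink-style flat-map: one record in, any number of records out.
    interface FlinkStyleFlatMap<IN, OUT> {
        void flatMap(IN record, Consumer<OUT> out);
    }

    // The wrapper: unpacks each key/value record, calls the unchanged mapper,
    // and repacks every emitted pair into a single output record.
    static <KI, VI, KO, VO> FlinkStyleFlatMap<Map.Entry<KI, VI>, Map.Entry<KO, VO>> wrap(
            HadoopStyleMapper<KI, VI, KO, VO> mapper) {
        return (record, out) -> mapper.map(
                record.getKey(),
                record.getValue(),
                (key, value) -> out.accept(new SimpleImmutableEntry<>(key, value)));
    }

    // Existing "Hadoop" user code, written once against the Hadoop-style interface.
    static final HadoopStyleMapper<Long, String, String, Long> TOKENIZER =
            (offset, line, collector) -> {
                for (String token : line.toLowerCase().split("\\W+")) {
                    if (!token.isEmpty()) {
                        collector.accept(token, 1L);
                    }
                }
            };

    // Runs the wrapped mapper on one (offset, line) record and gathers what it emits.
    static List<Map.Entry<String, Long>> tokenize(long offset, String line) {
        FlinkStyleFlatMap<Map.Entry<Long, String>, Map.Entry<String, Long>> wrapped = wrap(TOKENIZER);
        List<Map.Entry<String, Long>> results = new ArrayList<>();
        wrapped.flatMap(new SimpleImmutableEntry<>(offset, line), results::add);
        return results;
    }

    public static void main(String[] args) {
        System.out.println(tokenize(0L, "Hello Flink hello"));
    }
}
```

The mapper itself never changes; only the thin adapter knows about both interface styles, which is why the user functions on slide 6 can be reused as they are.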