
21.04.2016 Meetup: Spark vs. Flink


After Apache Spark established itself in 2015 as a serious alternative among the big data frameworks and began to outpace Hadoop MapReduce, unexpected competition is now coming out of Berlin in the form of Apache Flink.

Video for the slides:
https://www.youtube.com/watch?v=-MmX44pjJ9s&list=PL6ceXNIVUaAKIxQO_aBLlWpp48x-cRzOE&index=2

Zur Spark & Hadoop User Group:
http://www.meetup.com/Hadoop-User-Group-Munich/

  1. Spark vs Flink: Rumble in the (Big Data) Jungle, München, 2016-04-20, Konstantin Knauf, Michael Pisula
  2. Background
  3. The Big Data Ecosystem: Apache Top-Level Projects over Time (timeline: 2008, 2010, 2013, 2014, 2015)
  4. The New Guard
  5. Spark vs. Flink
                           Spark                            Flink
     Origin                Berkeley University              TU Berlin
     Apache Incubator      2013                             04/2014
     Apache Top-Level      02/2014                          01/2015
     Company               databricks                       data Artisans
     Supported languages   Scala, Java, Python, R           Java, Scala, Python
     Implemented in        Scala                            Java
     Cluster               Stand-Alone, Mesos, EC2, YARN    Stand-Alone, Mesos, EC2, YARN
     Teaser                Lightning-fast cluster computing Scalable Batch and Stream Data Processing
  6. The Challenge
  7. Real-Time Analysis of a Superhero Fight Club
     Stream: Fight (hitter: Int, hittee: Int, hitpoints: Int)
     Batch:  Segment (id: Int, name: String, segment: String)
             Detail (name: String, gender: Int, birthYear: Int, noOfAppearances: Int)
             combined into Hero (id: Int, name: String, segment: String, gender: Int, birthYear: Int, noOfAppearances: Int)
  8. The Setup
     Architecture: a data generator writes Avro-encoded events to a Kafka cluster; on an AWS cluster, batch processing combines the Segment and Detail data into Heroes, stream processing consumes the fight events, and stream and batch results are combined.
  9. Round 1: Setting Up
  10. Dependencies
      compile "org.apache.flink:flink-java:1.0.0"
      compile "org.apache.flink:flink-streaming-java_2.11:1.0.0"
      // For local execution from the IDE
      compile "org.apache.flink:flink-clients_2.11:1.0.0"

      Skeleton
      // Batch (DataSet API)
      ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
      // Stream (DataStream API)
      StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

      // Processing logic

      // For streaming
      env.execute();
  11. Dependencies
      compile 'org.apache.spark:spark-core_2.10:1.5.0'
      compile 'org.apache.spark:spark-streaming_2.10:1.5.0'

      Skeleton
      SparkConf conf = new SparkConf().setAppName(appName).setMaster(master);
      // Batch
      JavaSparkContext sparkContext = new JavaSparkContext(conf);
      // Stream
      JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

      // Processing logic

      // For streaming
      jssc.start();
  12. First Impressions
      Practically no boilerplate
      Easy to get started and play around with
      Runs in the IDE
      Hadoop MapReduce is much harder to get into
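     A minimal sketch of what "runs in the IDE" means in practice for Flink 1.0, assuming the flink-clients dependency from slide 10; the class name LocalIdeExample is illustrative only:

      import org.apache.flink.api.common.functions.FilterFunction;
      import org.apache.flink.api.java.ExecutionEnvironment;

      public class LocalIdeExample {
          public static void main(String[] args) throws Exception {
              // Outside a cluster this returns an embedded local environment,
              // so the job runs directly in the IDE process.
              ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
              env.fromElements("Spark", "Flink", "Hadoop")
                 .filter(new FilterFunction<String>() {
                     @Override
                     public boolean filter(String value) {
                         return value.startsWith("F");
                     }
                 })
                 .print(); // print() triggers execution, no env.execute() needed for this sink
          }
      }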
  13. Round 2: Static Data Analysis (combine both static data parts)
  14. Read the CSV file and transform it
      JavaRDD<String> segmentFile = sparkContext.textFile("s3://...");
      JavaPairRDD<String, SegmentTableRecord> segmentTable = segmentFile
          .map(line -> line.split(","))
          .filter(array -> array.length == 3)
          .mapToPair((String[] parts) -> {
              int id = Integer.parseInt(parts[0]);
              String name = parts[1], segment = parts[2];
              return new Tuple2<>(name, new SegmentTableRecord(id, name, segment));
          });

      Join with the detail data, keep only the humans and write the output
      segmentTable.join(detailTable)
          .mapValues(tuple -> {
              SegmentTableRecord s = tuple._1();
              DetailTableRecord d = tuple._2();
              return new Hero(s.getId(), s.getName(), s.getSegment(),
                              d.getGender(), d.getBirthYear(), d.getNoOfAppearances());
          })
          .map(tuple -> tuple._2())
          .filter(hero -> hero.getSegment().equals(HUMAN_SEGMENT))
          .saveAsTextFile("s3://...");
  15. Loading files from S3 into POJOs
      DataSource<SegmentTableRecord> segmentTable = env.readCsvFile("s3://...")
          .ignoreInvalidLines()
          .pojoType(SegmentTableRecord.class, "id", "name", "segment");

      Join and filter
      DataSet<Hero> humanHeros = segmentTable.join(detailTable)
          .where("name")
          .equalTo("name")
          .with((s, d) -> new Hero(s.id, s.name, s.segment,
                                   d.gender, d.birthYear, d.noOfAppearances))
          .filter(hero -> hero.segment.equals("Human"));

      Write back to S3
      humanHeros.writeAsFormattedText(outputTablePath, WriteMode.OVERWRITE, h -> h.toCsv());
  16. Performance
      TeraSort 1: Flink ca. 66% of Spark's runtime
      TeraSort 2: Flink ca. 68% of Spark's runtime
      HashJoin: Flink ca. 32% of Spark's runtime
      (Iterative processes: Flink ca. 50% of Spark's runtime, ca. 7% with delta iterations)
  17. 2nd Round Points
      Generally similar abstraction and feature set
      Flink has a nicer syntax, more sugar
      Spark is pretty bare-metal
      Flink is faster
  18. Round 3: Simple Real-Time Analysis (total hitpoints over the last minute)
  19. Configuring the environment for event time
      StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
      env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
      ExecutionConfig config = env.getConfig();
      config.setAutoWatermarkInterval(500);

      Creating a stream from Kafka
      Properties properties = new Properties();
      properties.put("bootstrap.servers", KAFKA_BROKERS);
      properties.put("zookeeper.connect", ZOOKEEPER_CONNECTION);
      properties.put("group.id", KAFKA_GROUP_ID);
      DataStreamSource<FightEvent> hitStream =
          env.addSource(new FlinkKafkaConsumer08<>("FightEventTopic",
                                                   new FightEventDeserializer(),
                                                   properties));
  20. Processing Logic
      hitStream.assignTimestamps(new FightEventTimestampExtractor(6000))
          .timeWindowAll(Time.of(60, TimeUnit.SECONDS), Time.of(10, TimeUnit.SECONDS))
          .apply(new SumAllWindowFunction<FightEvent>() {
              @Override
              public long getSummand(FightEvent fightEvent) {
                  return fightEvent.getHitPoints();
              }
          })
          .writeAsCsv("s3://...");

      Example output
      3> (1448130670000,1448130730000,290789)
      4> (1448130680000,1448130740000,289395)
      5> (1448130690000,1448130750000,291768)
      6> (1448130700000,1448130760000,292634)
      7> (1448130710000,1448130770000,293869)
      8> (1448130720000,1448130780000,293356)
      1> (1448130730000,1448130790000,293054)
      2> (1448130740000,1448130800000,294209)
  21. Create the context and get an Avro stream from Kafka
      JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));
      HashSet<String> topicsSet = Sets.newHashSet("FightEventTopic");
      HashMap<String, String> kafkaParams = new HashMap<String, String>();
      kafkaParams.put("metadata.broker.list", "xxx:11211");
      kafkaParams.put("group.id", "spark");
      JavaPairInputDStream<String, FightEvent> kafkaStream =
          KafkaUtils.createDirectStream(jssc, String.class, FightEvent.class,
                                        StringDecoder.class, AvroDecoder.class,
                                        kafkaParams, topicsSet);

      Analyze the number of hit points over a sliding window
      kafkaStream.map(tuple -> tuple._2().getHitPoints())
          .reduceByWindow((hit1, hit2) -> hit1 + hit2,
                          Durations.seconds(60), Durations.seconds(10))
          .foreachRDD((rdd, time) -> {
              rdd.saveAsTextFile(outputPath + "/round1-" + time.milliseconds());
              LOGGER.info("Hitpoints in the last minute {}", rdd.take(5));
              return null;
          });
  22. Output
      20:19:32 Hitpoints in the last minute [80802]
      20:19:42 Hitpoints in the last minute [101019]
      20:19:52 Hitpoints in the last minute [141012]
      20:20:02 Hitpoints in the last minute [184759]
      20:20:12 Hitpoints in the last minute [215802]
  23. 3rd Round Points
      Flink supports event-time windows
      Kafka and Avro worked seamlessly in both
      Spark uses micro-batches, not a real stream
      Both have at-least-once delivery guarantees
      Exactly-once depends a lot on the sink/source
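     A hedged sketch of how the last two points play out on the Flink side: enabling checkpointing, which together with a replayable source such as the Kafka consumer from slide 19 is what the stronger state guarantees rest on; the 5000 ms interval is an arbitrary example value:

      import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

      StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
      // Snapshot all operator state every 5 seconds (example value). On failure the job
      // restarts from the last snapshot and replays the source from there, so state is
      // neither lost nor double-counted; what the sink sees still depends on the sink itself.
      env.enableCheckpointing(5000);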
  24. Round 4: Connecting Static Data with Real-Time Data (total hitpoints over the last minute per gender)
  25. Read the static data using objectFile and map the genders
      JavaRDD<Hero> staticRdd = jssc.sparkContext().objectFile(lookupPath);
      JavaPairRDD<String, String> genderLookup = staticRdd.mapToPair(user -> {
          int genderIndicator = user.getGender();
          String gender;
          switch (genderIndicator) {
              case 1:  gender = "MALE";   break;
              case 2:  gender = "FEMALE"; break;
              default: gender = "OTHER";  break;
          }
          return new Tuple2<>(user.getId(), gender);
      });

      Analyze the number of hit points per hitter over a sliding window
      JavaPairDStream<String, Long> hitpointWindowedStream = kafkaStream
          .mapToPair(tuple -> {
              FightEvent fight = tuple._2();
              return new Tuple2<>(fight.getHitterId(), fight.getHitPoints());
          })
          .reduceByKeyAndWindow((hit1, hit2) -> hit1 + hit2,
                                Durations.seconds(60), Durations.seconds(10));
  26. Join with the static data to find the gender for each hitter
      hitpointWindowedStream.foreachRDD((rdd, time) -> {
          JavaPairRDD<String, Long> hpg = rdd.leftOuterJoin(genderLookup)
              .mapToPair(joinedTuple -> {
                  Optional<String> maybeGender = joinedTuple._2()._2();
                  Long hitpoints = joinedTuple._2()._1();
                  return new Tuple2<>(maybeGender.or("UNKNOWN"), hitpoints);
              })
              .reduceByKey((hit1, hit2) -> hit1 + hit2);
          hpg.saveAsTextFile(outputPath + "/round2-" + time.milliseconds());
          LOGGER.info("Hitpoints per gender {}", hpg.take(5));
          return null;
      });

      Output
      20:30:44 Hitpoints [(FEMALE,35869), (OTHER,435), (MALE,66226)]
      20:30:54 Hitpoints [(FEMALE,48805), (OTHER,644), (MALE,87014)]
      20:31:04 Hitpoints [(FEMALE,55332), (OTHER,813), (MALE,99722)]
      20:31:14 Hitpoints [(FEMALE,65543), (OTHER,813), (MALE,116416)]
      20:31:24 Hitpoints [(FEMALE,67507), (OTHER,813), (MALE,123750)]
  27. Loading Static Data in Every Map
      public FightEventEnricher(String bucket, String keyPrefix) {
          this.bucket = bucket;
          this.keyPrefix = keyPrefix;
      }

      @Override
      public void open(Configuration parameters) {
          populateHeroMapFromS3(bucket, keyPrefix);
      }

      @Override
      public EnrichedFightEvent map(FightEvent event) throws Exception {
          return new EnrichedFightEvent(event,
                  idToHero.get(event.getHitterId()),
                  idToHero.get(event.getHitteeId()));
      }

      private void populateHeroMapFromS3(String bucket, String keyPrefix) {
          // Omitted
      }
  28. Processing Logic
      hitStream.assignTimestamps(new FightEventTimestampExtractor(6000))
          .map(new FightEventEnricher("s3_bucket", "output/heros"))
          .filter(value -> value.getHittingHero() != null)
          .keyBy(enrichedFightEvent -> enrichedFightEvent.getHittingHero().getGender())
          .timeWindow(Time.of(60, TimeUnit.SECONDS), Time.of(10, TimeUnit.SECONDS))
          .apply(new SumWindowFunction<EnrichedFightEvent, Integer>() {
              @Override
              public long getSummand(EnrichedFightEvent value) {
                  return value.getFightEvent().getHitPoints();
              }
          })

      Example output
      2> (1448191350000,1448191410000,1,28478)
      3> (1448191350000,1448191410000,2,264650)
      2> (1448191360000,1448191420000,1,28290)
      3> (1448191360000,1448191420000,2,263521)
      2> (1448191370000,1448191430000,1,29327)
      3> (1448191370000,1448191430000,2,265526)
  29. 4th Round Points
      Spark makes combining batch and stream easier
      Windowing by key works well in both
      Spark's Java API can be annoying
  30. Round 5: More Advanced Real-Time Analysis (best hitter over the last minute per gender)
  31. Processing Logic
      hitStream.assignTimestamps(new FightEventTimestampExtractor(6000))
          .map(new FightEventEnricher("s3_bucket", "output/heros"))
          .filter(value -> value.getHittingHero() != null)
          .keyBy(fightEvent -> fightEvent.getHittingHero().getName())
          .timeWindow(Time.of(60, TimeUnit.SECONDS), Time.of(10, TimeUnit.SECONDS))
          .apply(new SumWindowFunction<EnrichedFightEvent, String>() {
              @Override
              public long getSummand(EnrichedFightEvent value) {
                  return value.getFightEvent().getHitPoints();
              }
          })
          .assignTimestamps(new AscendingTimestampExtractor<...>() {
              @Override
              public long extractAscendingTimestamp(Tuple4<...> tuple, long l) {
                  return tuple.f0;
              }
          })
          .timeWindowAll(Time.of(10, TimeUnit.SECONDS))
          .maxBy(3)
          .print();
  32. Example Output
      1> (1448200070000,1448200130000,Tengu,546)
      2> (1448200080000,1448200140000,Louis XIV,621)
      3> (1448200090000,1448200150000,Louis XIV,561)
      4> (1448200100000,1448200160000,Louis XIV,552)
      5> (1448200110000,1448200170000,Phil Dexter,620)
      6> (1448200120000,1448200180000,Phil Dexter,552)
      7> (1448200130000,1448200190000,Kalamity,648)
      8> (1448200140000,1448200200000,Jakita Wagner,656)
      1> (1448200150000,1448200210000,Jakita Wagner,703)
  33. Read the static data using objectFile and map the names
      JavaRDD<Hero> staticRdd = jssc.sparkContext().objectFile(lookupPath);
      JavaPairRDD<String, String> userNameLookup = staticRdd
          .mapToPair(user -> new Tuple2<>(user.getId(), user.getName()));

      Analyze the number of hit points per hitter over a sliding window
      JavaPairDStream<String, Long> hitters = kafkaStream
          .mapToPair(kafkaTuple -> new Tuple2<>(kafkaTuple._2().getHitterId(),
                                                kafkaTuple._2().getHitPoints()))
          .reduceByKeyAndWindow((accum, current) -> accum + current,
                                (accum, remove) -> accum - remove,
                                Durations.seconds(60), Durations.seconds(10));
  34. Join with the static data to find the user name for each hitter
      hitters.foreachRDD((rdd, time) -> {
          JavaRDD<Tuple2<String, Long>> namedHitters = rdd
              .leftOuterJoin(userNameLookup)
              .map(joinedTuple -> {
                  String username = joinedTuple._2()._2().or("No name");
                  Long hitpoints = joinedTuple._2()._1();
                  return new Tuple2<>(username, hitpoints);
              })
              .sortBy(Tuple2::_2, false, PARTITIONS);
          namedHitters.saveAsTextFile(outputPath + "/round3-" + time);
          LOGGER.info("Five highest hitters (total: {}){}",
                      namedHitters.count(), namedHitters.take(5));
          return null;
      });

      Output
      15/11/25 20:34:23 Five highest hitters (total: 200) [(Nick Fury,691), (Lady Blackhawk,585), (Choocho Colon,585), (Purple Man,539),
      15/11/25 20:34:33 Five highest hitters (total: 378) [(Captain Dorja,826), (Choocho Colon,773), (Nick Fury,691), (Kari Limbo,646),
      15/11/25 20:34:43 Five highest hitters (total: 378) [(Captain Dorja,1154), (Choocho Colon,867), (Wendy Go,723), (Kari Limbo,699),
  35. 15/11/25 20:34:53 Five highest hitters (total: 558) [(Captain Dorja,1154), (Wendy Go,931), (Choocho Colon,867), (Fyodor Dostoyevsky,
  36. Performance: Yahoo Streaming Benchmark
  37. 5th Round Points
      Spark makes some things easier
      But Flink is real streaming
      In Spark you often have to specify partitions
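     A small sketch of the partitioning point: many of Spark's pair-RDD operations take an explicit partition count as an extra argument; the sample data and the count of 8 below are arbitrary example values:

      import java.util.Arrays;
      import org.apache.spark.SparkConf;
      import org.apache.spark.api.java.JavaPairRDD;
      import org.apache.spark.api.java.JavaSparkContext;
      import scala.Tuple2;

      SparkConf conf = new SparkConf().setAppName("partition-example").setMaster("local[2]");
      JavaSparkContext sc = new JavaSparkContext(conf);
      JavaPairRDD<String, Long> pairs = sc.parallelizePairs(Arrays.asList(
              new Tuple2<>("MALE", 10L), new Tuple2<>("FEMALE", 20L), new Tuple2<>("MALE", 5L)));
      // The second argument is the number of result partitions; leaving it out means
      // relying on a default that often has to be tuned by hand.
      JavaPairRDD<String, Long> totals = pairs.reduceByKey((a, b) -> a + b, 8);
      System.out.println(totals.collectAsMap());
      sc.stop();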
  38. The Judges' Call
  39. Development
      Compared to Hadoop, both are awesome
      Both provide a unified programming model for diverse scenarios
      Comfort level of abstraction varies with the use case
      Spark's Java API is cumbersome compared to the Scala API
      Working with both is fun
      Docs are ok, but spotty
  40. Testing
      Testing distributed systems will always be hard
      Functionally both can be tested nicely
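     A hedged sketch of what "tested nicely" can look like on the Flink side: the pipeline runs embedded inside a plain JUnit test and collect() brings the result back for assertions; the class and test names are made up for illustration:

      import static org.junit.Assert.assertEquals;

      import java.util.List;
      import org.apache.flink.api.java.ExecutionEnvironment;
      import org.junit.Test;

      public class HitPointSumTest {

          @Test
          public void sumsHitPointsLocally() throws Exception {
              // Outside a cluster this is an embedded local environment, so no setup is needed
              ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

              List<Integer> result = env.fromElements(10, 20, 30)
                      .reduce((a, b) -> a + b)   // sum all elements; reduce keeps the element type
                      .collect();                // executes the job and returns the result locally

              assertEquals(Integer.valueOf(60), result.get(0));
          }
      }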
  41. Monitoring
  42. Monitoring
  43. Community
  44. The Judges' Call: It depends...
  45. Use Spark, if...
      You have Cloudera, Hortonworks, etc. support and depend on it
      You want to heavily use the Graph and ML libraries
      You want to use the more mature project
  46. Use Flink, if...
      Real-time processing is important for your use case
      You want more complex window operations
      You develop in Java only
      You want to support a German project
  47. Benchmark References
      [1] http://shelan.org/blog/2016/01/31/reproducible-experiment-to-compare-apache-spark-and-apache-flink-batch-processing/
      [2] http://eastcirclek.blogspot.de/2015/06/terasort-for-spark-and-flink-with-range.html
      [3] http://eastcirclek.blogspot.de/2015/07/hash-join-on-tez-spark-and-flink.html
      [4] https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
      [5] http://data-artisans.com/extending-the-yahoo-streaming-benchmark/
  48. Thank You! Questions? michael.pisula@tng.tech, konstantin.knauf@tng.tech
