
Flink Apachecon Presentation


These are the slides that supported the presentation on Apache Flink at the ApacheCon Budapest.

Apache Flink is a platform for efficient, distributed, general-purpose data processing.


  1. The Flink Big Data Analytics Platform
     Marton Balassi, Gyula Fora {mbalassi, gyfora}@apache.org
     Hungarian Academy of Sciences
  2. What is Apache Flink?
     Open Source
     • Started in 2009 by the Berlin-based database research groups
     • In the Apache Incubator since mid-2014
     • 61 contributors as of the 0.7.0 release
     Fast + Reliable
     • Fast, general-purpose distributed data processing system
     • Combines batch and stream processing
     • Up to 100x faster than Hadoop
     Ready to use
     • Programming APIs for Java and Scala
     • Tested on large clusters
     (Slide footer throughout: 11/18/2014 Flink - M. Balassi & Gy. Fora)
  3. What is Apache Flink?
     (Diagram: an analytical program is submitted through the Flink client & optimizer to a Flink cluster consisting of a master and several workers.)
  4. This Talk
     • Introduction to Flink
     • API overview
     • Distinguishing Flink
     • Flink from a user perspective
     • Performance
     • Flink roadmap and closing
  5. Open source Big Data Landscape
     (Diagram of the stack:)
     • Applications: Hive, Mahout, Crunch, Cascading, Pig, …
     • Data processing engines: MapReduce, Flink, Spark, Storm, Tez
     • App and resource management: YARN, Mesos
     • Storage, streams: HDFS, HBase, Kafka, …
  6. Flink stack
     (Diagram of the stack:)
     • APIs: Java API (batch), Scala API (batch), Java API (streaming), Python API, Graph API („Spargel“), Apache MRQL
     • Common API, Flink Optimizer, Batch and Streaming
     • Hybrid Batch/Streaming Runtime; runs on Apache Tez, Local Execution, Java Collections
     • Cluster Manager: Native, YARN, EC2
     • Storage: Files, HDFS, S3, JDBC, Azure; Streams: Kafka, RabbitMQ, Redis, …
  7. Flink APIs
  8. Programming model
     Data abstractions: DataSet, DataStream
     (Diagram: a program of operators X and Y over DataSets/DataStreams A, B, and C, and its parallel execution with parallel instances A(1), A(2), … of each data set and operator.)
  9. Flexible pipelines
     Operators: Map, FlatMap, MapPartition, Filter, Project, Reduce, ReduceGroup, Aggregate, Distinct, Join, CoGroup, Cross, Iterate, Iterate Delta, Iterate-Vertex-Centric, Windowing
     (Diagram: a dataflow from two sources through Map, Reduce, Join, Iterate and a second Reduce to a sink.)
  10. WordCount, Java API
      DataSet<String> text = env.readTextFile(input);
      DataSet<Tuple2<String, Integer>> result = text
          .flatMap((str, out) -> {
              for (String token : str.split("\\W")) {
                  out.collect(new Tuple2<>(token, 1));
              }
          })
          .groupBy(0)
          .sum(1);
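The job's semantics (tokenize on non-word characters, group by token, sum the ones) can be checked without a cluster. This dependency-free sketch mirrors what the Flink pipeline computes; the class and method names are ours, not Flink's:

```java
import java.util.HashMap;
import java.util.Map;

public class WordCountSketch {
    // Mirrors flatMap(split on \W) -> groupBy(token) -> sum(1).
    public static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.split("\\W+")) {
            if (token.isEmpty()) continue; // split can yield an empty leading token
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }
}
```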
  11. WordCount, Scala API
      val input = env.readTextFile(path)
      val words = input flatMap { line => line.split("\\W+") }
      val counts = words groupBy { word => word } count()
  12. WordCount, Streaming API
      DataStream<String> text = env.readTextFile(input);
      DataStream<Tuple2<String, Integer>> result = text
          .flatMap((str, out) -> {
              for (String token : str.split("\\W")) {
                  out.collect(new Tuple2<>(token, 1));
              }
          })
          .groupBy(0)
          .sum(1);
  13. Is there anything beyond WordCount?
  14. Beyond Key/Value Pairs
      class Impression {
          public String url;
          public long count;
      }
      class Page {
          public String url;
          public String topic;
      }
      DataSet<Page> pages = ...;
      DataSet<Impression> impressions = ...;
      // outputs pairs of urls and aggregated impression counts
      DataSet<Impression> aggregated = impressions
          .groupBy("url")
          .sum("count");
      // outputs pairs of matching pages and impressions
      pages.join(impressions).where("url").equalTo("url")
          .print();
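What groupBy("url").sum("count") does on plain POJOs can be sketched without Flink: group objects by one field and sum another. The class below is an illustrative stand-in (our names, not Flink API):

```java
import java.util.HashMap;
import java.util.Map;

public class ImpressionAggregation {
    // Plain-Java stand-in for the slide's POJO; field names match the example.
    public static class Impression {
        public String url;
        public long count;
        public Impression(String url, long count) { this.url = url; this.count = count; }
    }

    // Mirrors impressions.groupBy("url").sum("count"):
    // one aggregated count per distinct URL.
    public static Map<String, Long> aggregate(Impression[] impressions) {
        Map<String, Long> byUrl = new HashMap<>();
        for (Impression i : impressions) {
            byUrl.merge(i.url, i.count, Long::sum);
        }
        return byUrl;
    }
}
```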
  15. Preview: Logical Types
      DataSet<Row> dates = env.readCsv(...).as("order_id", "date");
      DataSet<Row> sessions = env.readCsv(...).as("id", "session");
      DataSet<Row> joined = dates
          .join(sessions).where("order_id").equalTo("id");
      joined.groupBy("date").reduceGroup(new SessionFilter());

      class SessionFilter implements GroupReduceFunction<SessionType> {
          public void reduce(Iterable<SessionType> values, Collector out) {
              ...
          }
      }
      public class SessionType {
          public String order_id;
          public Date date;
          public String session;
      }
  16. Distinguishing Flink
  17. Hybrid batch/streaming runtime
      • True data streaming on the runtime layer
      • No unnecessary synchronization steps
      • Batch and stream processing seamlessly integrated
      • Easy testing and development for both approaches
  18. Flink Streaming
      • Most DataSet operators are also available for DataStreams
      • Temporal and streaming-specific operators
        – Windowing semantics (applying a transformation on chunks of data)
        – Window reduce, join, cross, etc.
      • Connectors for different data sources
        – Kafka, Flume, RabbitMQ, Twitter, etc.
      • Support for iterative stream processing
  19. Flink Streaming
      //Build new model every minute
      DataStream<Double[]> model = env
          .addSource(new TrainingDataSource())
          .window(60000)
          .reduceGroup(new ModelBuilder());
      //Predict new data using the most up-to-date model
      DataStream<Integer> prediction = env
          .addSource(new NewDataSource())
          .connect(model)
          .map(new Predictor());
      (Diagram: the training-data stream feeds the model builder; the new-data stream and the model feed the predictor, which emits the output.)
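The window(60000).reduceGroup(...) step groups the stream into 60-second chunks and reduces each chunk independently. A minimal stand-in for that windowing idea, assuming simple processing-time tumbling windows (class and method names are ours, not Flink's runtime):

```java
import java.util.Map;
import java.util.TreeMap;

public class WindowSketch {
    // Illustrative tumbling-window sum: events are (timestampMillis, value)
    // pairs, and every windowMillis-wide bucket is reduced independently,
    // like .window(60000).reduceGroup(...) on the slide.
    public static Map<Long, Double> windowedSum(long[] timestamps, double[] values, long windowMillis) {
        Map<Long, Double> sums = new TreeMap<>();
        for (int i = 0; i < timestamps.length; i++) {
            long windowStart = (timestamps[i] / windowMillis) * windowMillis;
            sums.merge(windowStart, values[i], Double::sum);
        }
        return sums;
    }
}
```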
  20. Lambda architecture
      Want to do some dev-ops?
      Source: https://www.mapr.com/developercentral/lambda-architecture
  21. Lambda architecture in Flink
      - One system
      - One API
      - One cluster
  22. Iterations in other systems
      (Diagram: the client drives the loop outside the system, resubmitting a job for every step.)
  23. Iterations in Flink
      Streaming dataflow with feedback
      The system is iteration-aware and performs automatic optimization
      (Diagram: a map, reduce, join dataflow with a feedback edge.)
  24. Dependability
      • Flink manages its own memory
      • Caching and data processing happen in a dedicated memory fraction
      • The system never breaks the JVM heap, gracefully spills
      (Diagram of the JVM heap: an unmanaged heap holding network buffers and user code, and a Flink-managed heap used for hashing/sorting/caching and shuffles/broadcasts.)
  25. Operating on Serialized Data
      • Hadoop serializes data every time → highly robust, never gives up on you
      • Spark works on objects, RDDs may be stored serialized → serialization considered slow, used only when needed
      • Flink makes serialization really cheap: partial deserialization, operates on the serialized form → efficient and robust!
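"Operating on the serialized form" means, for example, sorting records by comparing their raw bytes instead of deserializing them into objects first. A heavily simplified sketch of that idea (our class names; a toy version of the normalized-key trick, not Flink's actual implementation):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class SerializedSort {
    // Records are kept as raw UTF-8 byte arrays and sorted by comparing
    // bytes directly, without deserializing back into String objects.
    public static byte[][] sortSerialized(byte[][] records) {
        byte[][] sorted = records.clone();
        Arrays.sort(sorted, (a, b) -> {
            int n = Math.min(a.length, b.length);
            for (int i = 0; i < n; i++) {
                int cmp = Byte.toUnsignedInt(a[i]) - Byte.toUnsignedInt(b[i]);
                if (cmp != 0) return cmp;
            }
            return a.length - b.length;
        });
        return sorted;
    }

    public static byte[] ser(String s) { return s.getBytes(StandardCharsets.UTF_8); }
    public static String deser(byte[] b) { return new String(b, StandardCharsets.UTF_8); }
}
```

For ASCII keys, comparing unsigned UTF-8 bytes gives the same order as comparing the strings themselves, which is why the sort never needs the objects.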
  26. Flink from a user perspective
  27. Flink programs run everywhere
      (Diagram: the same program runs as a cluster batch job, as a cluster streaming job, locally for debugging, as a Java collection program, or embedded (e.g., in a web container), on the Flink runtime or on Apache Tez.)
  28. Migrate Easily
      Flink out-of-the-box supports
      • Hadoop data types (writables)
      • Hadoop Input/Output Formats
      • Hadoop functions and object model
      (Diagram: a Hadoop Input → Map → Reduce → Output pipeline embedded in a Flink dataflow of DataSets with map, reduce, and join operators.)
  29. Little tuning or configuration required
      • Requires no memory thresholds to configure
        – Flink manages its own memory
      • Requires no complicated network configs
        – The pipelining engine requires much less memory for data exchange
      • Requires no serializers to be configured
        – Flink handles its own type extraction and data representation
      • Programs can be adjusted to data automatically
        – Flink’s optimizer can choose execution strategies automatically
  30. Understanding Programs
      Visualizes the operations and the data movement of programs
      Analyze before execution
      (Screenshot from Flink’s plan visualizer)
  31. Understanding Programs
      Analyze after execution (times, stragglers, …)
  32. Performance
  33. Distributed Grep
      • 1 TB of data (log files)
      • 24 machines with
        – 32 GB memory
        – regular HDDs
      • HDFS 2.4.0
      • Flink 0.7-incubating-SNAPSHOT
      • Spark 1.2.0-SNAPSHOT
      → Flink up to 2.5x faster
      (Diagram: HDFS input feeding three filters, Term 1–3, each writing its matches.)
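The shape of this benchmark is one scan of the input feeding three independent filters. A toy single-machine version of that shape (our names; the real job writes matches to HDFS rather than counting them):

```java
import java.util.HashMap;
import java.util.Map;

public class GrepSketch {
    // One pass over the lines feeds all filters, so the input is read
    // once rather than once per search term.
    public static Map<String, Integer> matchCounts(String[] lines, String[] terms) {
        Map<String, Integer> counts = new HashMap<>();
        for (String term : terms) counts.put(term, 0);
        for (String line : lines) {          // single scan of the data
            for (String term : terms) {      // every filter sees each line
                if (line.contains(term)) counts.merge(term, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

Reading the 1 TB input once instead of once per term is exactly the difference the next slides attribute to the Flink and Spark execution plans.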
  34. Distributed Grep: Flink Execution
  35. Distributed Grep: Spark Execution
      Spark executes the job in 3 stages:
      (Diagram: Stage 1 reads HDFS and filters Term 1; Stage 2 reads HDFS again and filters Term 2; Stage 3 reads HDFS again and filters Term 3.)
  36. Spark in-memory pinning
      Cache in-memory to avoid reading the data for each filter
      JavaSparkContext sc = new JavaSparkContext(conf);
      JavaRDD<String> file = sc.textFile(inFile)
          .persist(StorageLevel.MEMORY_AND_DISK());
      for (int p = 0; p < patterns.length; p++) {
          final String pattern = patterns[p];
          JavaRDD<String> res = file.filter(new Function<String, Boolean>() { … });
      }
      (Diagram: HDFS read once into an in-memory RDD, then filtered for Terms 1–3.)
  37. Spark in-memory pinning
      (Chart: 9 minutes while the RDD is 100% in-memory; 37 minutes once Spark starts spilling the RDD to disk.)
  38. PageRank
      • Dataset: Twitter follower graph
        – 41,652,230 vertices (users)
        – 1,468,365,182 edges (followings)
        – 12 GB input data
  39. PageRank results
  40. PageRank on Flink with Delta Iterations
      • The algorithm runs 60 iterations until convergence (runtime includes the convergence check)
      (Chart: 100 minutes vs. 8.5 minutes.)
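The "runs until convergence" behavior can be sketched in plain Java: iterate PageRank and stop as soon as no vertex's rank moved by more than a threshold, which is when a delta iteration's workset would be empty. This is an illustrative toy (our names, damping 0.85), not Flink's delta-iteration machinery:

```java
import java.util.Arrays;

public class PageRankSketch {
    // Iterates PageRank on an adjacency list until converged or maxRounds.
    // Assumes every vertex has at least one out-edge (dangling vertices
    // are simply skipped here).
    public static double[] run(int[][] outEdges, double eps, int maxRounds) {
        int n = outEdges.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);
        for (int round = 0; round < maxRounds; round++) {
            double[] next = new double[n];
            Arrays.fill(next, 0.15 / n);  // random-jump term
            for (int v = 0; v < n; v++) {
                if (outEdges[v].length == 0) continue;
                double share = 0.85 * rank[v] / outEdges[v].length;
                for (int target : outEdges[v]) next[target] += share;
            }
            double maxDelta = 0.0;
            for (int v = 0; v < n; v++) {
                maxDelta = Math.max(maxDelta, Math.abs(next[v] - rank[v]));
            }
            rank = next;
            if (maxDelta < eps) break;  // converged: the workset would be empty
        }
        return rank;
    }
}
```

A delta iteration exploits the same observation per vertex: once a vertex's rank stops changing, it drops out of the workset and costs nothing in later rounds, which is where the large speedup on the previous slide comes from.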
  41. Again, explaining the difference
      On average, a (delta) iteration runs for 6.7 seconds
      Flink (Bulk): 48 sec. Spark (Bulk): 99 sec.
  42. Streaming performance
  43. Flink Roadmap
      • Flink has a major release every 3 months, with >= 1 bug-fixing release in between
      • Finer-grained fault tolerance
      • Logical (SQL-like) field addressing
      • Python API
      • Flink Streaming, Lambda architecture support
      • Flink on Tez
      • ML on Flink (e.g., Mahout DSL)
      • Graph DSL on Flink
      • … and more
  44.
  45. http://flink.incubator.apache.org
      github.com/apache/incubator-flink
      @ApacheFlink
      Marton Balassi, Gyula Fora {mbalassi, gyfora}@apache.org
