
ApacheCon: Apache Flink - Fast and Reliable Large-Scale Data Processing

My talk on Apache Flink at ApacheCon North America 2015


  1. 1. Apache Flink Fast and Reliable Large-Scale Data Processing Fabian Hueske fhueske@apache.org @fhueske 1
  2. 2. About me Fabian Hueske fhueske@apache.org • PMC member of Apache Flink • With Flink since its beginnings in 2009 – back then an academic research project called Stratosphere
  3. 3. What is Apache Flink? Distributed Data Flow Processing System • Focused on large-scale data analytics • Real-time stream and batch processing • Easy and powerful APIs (Java / Scala) • Robust execution backend 3
  4. 4. Flink in the Hadoop Ecosystem [Stack diagram] Libraries: Table API, Gelly library, ML library, Apache SAMOA, Dataflow, Apache MRQL • Flink core: Optimizer, DataSet API (Java/Scala), DataStream API (Java/Scala), Stream Builder, Runtime • Execution environments: Local, Cluster, YARN, Apache Tez, Embedded • Data sources: HDFS, S3, JDBC, HCatalog, Apache HBase, Apache Kafka, Apache Flume, RabbitMQ, Hadoop IO, ... 4
  5. 5. What is Flink good at? It's a general-purpose data analytics system • Complex and heavy ETL jobs • Real-time stream processing with flexible windows • Analyzing huge graphs • Machine learning on large data sets • ... 5
  6. 6. Flink in the ASF • Flink entered the ASF about one year ago – 04/2014: Incubation – 12/2014: Graduation • Strongly growing community – 17 committers – 15 PMC members [Chart: # unique git committers (w/o manual de-dup), growing steadily from Nov-10 to Dec-14] 6
  7. 7. Where is Flink moving? A "use-case complete" framework to unify batch & stream processing • Data streams: Kafka, RabbitMQ, ... • "Historic" data: HDFS, JDBC, ... • Analytical workloads: ETL, relational processing, graph analysis, machine learning, streaming data analysis • Goal: Treat batch as finite stream 7
  8. 8. HOW TO USE FLINK? Programming Model & APIs 8
  9. 9. DataSets and Transformations ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment(); DataSet<String> input = env.readTextFile(inputPath); DataSet<String> first = input .filter(str -> str.contains("Apache Flink")); DataSet<String> second = first .map(str -> str.toLowerCase()); second.print(); env.execute(); [Dataflow: Input → filter → First → map → Second] 9
  10. 10. Expressive Transformations • Element-wise – map, flatMap, filter, project • Group-wise – groupBy, reduce, reduceGroup, combineGroup, mapPartition, aggregate, distinct • Binary – join, coGroup, union, cross • Iterations – iterate, iterateDelta • Physical re-organization – rebalance, partitionByHash, sortPartition • Streaming – window, windowMap, coMap, ... 10
  11. 11. Rich Type System • Use any Java/Scala classes as a data type – Special support for tuples, POJOs, and case classes – Not restricted to key-value pairs • Define (composite) keys directly on data types – Expression – Tuple position – Selector function 11
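The key-selection idea on slide 11 can be sketched in plain Java (no Flink); the `Page` record and `hitsByDomain` helper are hypothetical stand-ins for a user-defined data type and a key-selector function on one of its fields:

```java
import java.util.*;
import java.util.stream.*;

class PageStats {
    // a POJO-like record type, not restricted to key-value pairs
    record Page(String domain, String lang, int hits) {}

    // key selector: group by the 'domain' field and aggregate 'hits',
    // analogous to Flink's expression keys / selector functions
    static Map<String, Integer> hitsByDomain(List<Page> pages) {
        return pages.stream().collect(
            Collectors.groupingBy(Page::domain, Collectors.summingInt(Page::hits)));
    }
}
```

The same grouping could equally be keyed on a composite such as `(domain, lang)`, which mirrors Flink's composite keys.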
  12. 12. Counting Words in Batch and Stream case class WordCount(word: String, count: Int) DataStream API (streaming): val lines: DataStream[String] = env.fromSocketStream(...) lines.flatMap { line => line.split(" ") .map(word => WordCount(word, 1)) } .window(Count.of(1000)).every(Count.of(100)) .groupBy("word").sum("count") .print() DataSet API (batch): val lines: DataSet[String] = env.readTextFile(...) lines.flatMap { line => line.split(" ") .map(word => WordCount(word, 1)) } .groupBy("word").sum("count") .print() 12
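The batch variant of the word count can be mirrored in plain Java without Flink, which makes the dataflow explicit: flatMap lines into words, then group by word and sum. This is a conceptual sketch, not Flink's API:

```java
import java.util.*;
import java.util.stream.*;

class BatchWordCount {
    // flatMap each line into words, then groupBy(word).sum(count)
    static Map<String, Integer> count(List<String> lines) {
        return lines.stream()
            .flatMap(line -> Arrays.stream(line.split(" ")))
            .collect(Collectors.groupingBy(w -> w, Collectors.summingInt(w -> 1)));
    }
}
```

The streaming variant differs only in that the grouping and summing are evaluated per window over an unbounded input.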
  13. 13. Table API • Execute SQL-like expressions on table data – Tight integration with Java and Scala APIs – Available for batch and streaming programs val orders = env.readCsvFile(…) .as('oId, 'oDate, 'shipPrio) .filter('shipPrio === 5) val items = orders .join(lineitems).where('oId === 'id) .select('oId, 'oDate, 'shipPrio, 'extdPrice * (Literal(1.0f) - 'discnt) as 'revenue) val result = items .groupBy('oId, 'oDate, 'shipPrio) .select('oId, 'revenue.sum, 'oDate, 'shipPrio) 13
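The relational query on slide 13 (filter on ship priority, join orders with line items, compute discounted revenue, group by order) can be sketched in plain Java streams; the record names and the priority value 5 follow the slide, everything else is illustrative:

```java
import java.util.*;
import java.util.stream.*;

class RevenueQuery {
    record Order(int oId, String oDate, int shipPrio) {}
    record Lineitem(int id, double extdPrice, double discnt) {}

    // filter('shipPrio === 5), join on oId == id,
    // revenue = extdPrice * (1 - discnt), grouped by order id
    static Map<Integer, Double> revenuePerOrder(List<Order> orders, List<Lineitem> items) {
        return orders.stream()
            .filter(o -> o.shipPrio() == 5)
            .flatMap(o -> items.stream()
                .filter(l -> l.id() == o.oId())
                .map(l -> Map.entry(o.oId(), l.extdPrice() * (1.0 - l.discnt()))))
            .collect(Collectors.groupingBy(Map.Entry::getKey,
                     Collectors.summingDouble(Map.Entry::getValue)));
    }
}
```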
  14. 14. WHAT IS HAPPENING INSIDE? Processing Engine 14
  15. 15. System Architecture [Diagram: Flink Program → Client (pre-flight, optimization) → Master (coordination) → Workers (processing)] 15
  16. 16. Lots of cool technology inside Flink • Batch and Streaming in one system • Memory-safe execution • Native data flow iterations • Cost-based data flow optimizer • Flexible windows on data streams • Type extraction and custom serialization utilities • Static code analysis on user functions • and much more... 16
  17. 17. STREAM AND BATCH IN ONE SYSTEM Pipelined Data Transfer 17
  18. 18. Stream and Batch in one System • Most systems do either stream or batch • In the past, Flink focused on batch processing – Flink's runtime has always done stream processing – Operators pipeline data forward as soon as it is processed – Some operators are blocking (such as sort) • Pipelining gives better performance for many batch workloads – Avoids materialization of large intermediate results 18
  19. 19. Pipelined Data Transfer [Diagram: a program joins a large input with a small input, then maps the result. Pipelined execution: Pipeline 1 builds a hash table from the small input; Pipeline 2 streams the large input through the probe side of the join and directly into the map. No intermediate materialization!] 19
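The pipelined hash join on slide 19 can be sketched in miniature: only the small input is materialized (as a hash table), while the large input streams through the probe phase record by record, so no intermediate result is written out. Names and the string-valued payloads are illustrative:

```java
import java.util.*;
import java.util.stream.*;

class PipelinedJoin {
    // build phase: the small input is already hashed (the Map);
    // probe phase: the large input streams through lazily
    static List<String> join(Map<Integer, String> small,
                             Stream<Map.Entry<Integer, String>> large) {
        return large
            .filter(e -> small.containsKey(e.getKey()))
            .map(e -> small.get(e.getKey()) + "|" + e.getValue()) // emit joined record
            .collect(Collectors.toList()); // stand-in for the next pipelined operator
    }
}
```

Because `Stream` is lazy, each large-side record flows through probe and map without the join's output ever being materialized as a whole, which is the point of the slide.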
  20. 20. Flink Data Stream Processing • Pipelining enables Flink's runtime to process streams – API to define stream programs – Operators for stream processing • Stream API and operators are recent contributions – Evolving very quickly under heavy development – Integration with batch mode 20
  21. 21. MEMORY SAFE EXECUTION Memory Management and Out-of-Core Algorithms 21
  22. 22. Memory-safe Execution • Challenge of JVM-based data processing systems – OutOfMemoryErrors due to data objects on the heap • Flink runs complex data flows without memory tuning – C++-style memory management – Robust out-of-core algorithms 22
  23. 23. Managed Memory • Active memory management – Workers allocate 70% of JVM memory as byte arrays – Algorithms serialize data objects into byte arrays – In-memory processing as long as data is small enough – Otherwise partial destaging to disk • Benefits – Safe memory bounds (no OutOfMemoryError) – Scales to very large JVMs – Reduced GC pressure 23
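The managed-memory idea on slide 23 can be illustrated with a toy "memory segment": records are serialized into a fixed-size byte array instead of living as objects on the heap, and a full segment signals the caller to destage to disk. The class and the 64-byte segment size are illustrative, not Flink's actual memory manager:

```java
import java.nio.ByteBuffer;
import java.util.*;

class MemorySegmentSketch {
    final ByteBuffer segment = ByteBuffer.allocate(64); // tiny fixed-size segment

    // serialize a record into the segment; false = segment full, caller spills to disk
    boolean write(long record) {
        if (segment.remaining() < Long.BYTES) return false;
        segment.putLong(record);
        return true;
    }

    // deserialize all records back out (reads a copy of the buffer state)
    List<Long> readAll() {
        ByteBuffer r = segment.duplicate();
        r.flip();
        List<Long> out = new ArrayList<>();
        while (r.remaining() >= Long.BYTES) out.add(r.getLong());
        return out;
    }
}
```

Because the segment is pre-allocated and bounded, the memory footprint is fixed up front: running out of space is an explicit, handleable condition rather than an `OutOfMemoryError`.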
  24. 24. Going out-of-core [Chart: single-core join of 1KB Java objects beyond memory (4 GB); blue bars are in-memory, orange bars (partially) out-of-core] 24
  25. 25. GRAPH ANALYSIS Native Data Flow Iterations 25
  26. 26. Native Data Flow Iterations • Many graph and ML algorithms require iterations • Flink features two types of iterations – Bulk iterations – Delta iterations [Figure: small example graph with weighted edges] 26
  27. 27. Iterative Data Flows • Flink runs iterations "natively" as cyclic data flows – No loop unrolling – Operators are scheduled once – Data is fed back through backflow channel – Loop-invariant data is cached • Operator state is preserved across iterations! [Diagram: initial input → iteration head → join → reduce → iteration tail → result, with a feedback channel to the head and other data sets joining in] 27
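A bulk iteration as on slide 27 amounts to applying a step function repeatedly to an evolving state while loop-invariant data stays cached. A minimal sketch (not Flink's API), using Newton's method for sqrt(2) as the step, where the constant 2.0 plays the role of the cached loop-invariant input:

```java
import java.util.function.DoubleUnaryOperator;

class BulkIteration {
    // apply the step function 'rounds' times to the evolving state;
    // in Flink the operators for 'step' are scheduled once, not per round
    static double iterate(double initial, int rounds, DoubleUnaryOperator step) {
        double state = initial;
        for (int i = 0; i < rounds; i++) state = step.applyAsDouble(state);
        return state;
    }
}
```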
  28. 28. Delta Iterations • Delta iteration computes – Delta update of solution set – Work set for next iteration • Work set drives computations of next iteration – Workload of later iterations significantly reduced – Fast convergence • Applicable to certain problem domains – Graph processing [Chart: # of elements updated per iteration drops steeply over ~50 iterations] 28
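The solution-set/work-set pattern of slide 28 can be sketched with connected components, a classic delta-iteration workload: the solution set holds a component id per vertex, and the work set holds only the vertices whose id changed in the last round, so later rounds touch ever fewer elements. This is a single-machine illustration of the idea, not Flink's delta-iteration API:

```java
import java.util.*;

class DeltaIterationCC {
    // edges as int[]{u, v}; returns minimum vertex id per component
    static Map<Integer, Integer> components(List<int[]> edges, Set<Integer> vertices) {
        Map<Integer, List<Integer>> adj = new HashMap<>();
        for (int[] e : edges) {
            adj.computeIfAbsent(e[0], k -> new ArrayList<>()).add(e[1]);
            adj.computeIfAbsent(e[1], k -> new ArrayList<>()).add(e[0]);
        }
        Map<Integer, Integer> solution = new HashMap<>();   // solution set
        for (int v : vertices) solution.put(v, v);          // initial: own id
        Set<Integer> workSet = new HashSet<>(vertices);     // work set
        while (!workSet.isEmpty()) {
            Set<Integer> nextWorkSet = new HashSet<>();
            for (int v : workSet) {
                for (int n : adj.getOrDefault(v, List.of())) {
                    if (solution.get(v) < solution.get(n)) { // propagate smaller id
                        solution.put(n, solution.get(v));    // delta update
                        nextWorkSet.add(n);                  // only changed vertices recurse
                    }
                }
            }
            workSet = nextWorkSet; // shrinks as components stabilize
        }
        return solution;
    }
}
```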
  29. 29. Iteration Performance [Chart: PageRank on Twitter follower graph, 30 iterations vs. 61 iterations (convergence)] 29
  30. 30. WHAT IS COMING NEXT? Roadmap 30
  31. 31. Just released a new version • 0.9.0-milestone1 preview release just got out! • Features – Table API – Flink ML Library – Gelly Library – Flink on Tez – … 31
  32. 32. Flink’s Roadmap Mission: Unified stream and batch processing • Exactly-once streaming semantics with flexible state checkpointing • Extending the ML library • Extending graph library • Integration with Apache Zeppelin (incubating) • SQL on top of the Table API • And much more… 32
  33. 33. tl;dr – What's worth remembering? • Flink is a general-purpose data analytics system • Unifies streaming and batch processing • Expressive high-level APIs • Robust and fast execution engine 33
  34. 34. I Flink, do you? ;-) If you find Flink exciting, get involved and start a discussion on Flink's mailing list, or stay tuned by subscribing to news@flink.apache.org or following @ApacheFlink on Twitter 34
  36. 36. BACKUP 36
  37. 37. Data Flow Optimizer • Database-style optimizations for parallel data flows • Optimizes all batch programs • Optimizations – Task chaining – Join algorithms – Re-use partitioning and sorting for later operations – Caching for iterations 37
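One of the optimizer's decisions, choosing a join strategy, can be sketched as a simple cost rule: broadcast the smaller input if it fits under a threshold, otherwise hash-partition both inputs on the join key. The class, threshold parameter, and two-way choice are illustrative assumptions, not Flink's actual cost model:

```java
class JoinStrategyChooser {
    enum Strategy { BROADCAST_SMALL_SIDE, PARTITION_BOTH }

    // hypothetical cost rule: ship the small side everywhere if cheap enough,
    // otherwise pay for repartitioning both sides
    static Strategy choose(long leftBytes, long rightBytes, long broadcastLimit) {
        return Math.min(leftBytes, rightBytes) <= broadcastLimit
            ? Strategy.BROADCAST_SMALL_SIDE
            : Strategy.PARTITION_BOTH;
    }
}
```

The real optimizer weighs many more factors (existing partitionings, sort orders, chaining), but the shape of the decision is the same: pick the physical plan with the lower estimated cost.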
  38. 38. Data Flow Optimizer val orders = … val lineitems = … val filteredOrders = orders .filter(o => dataFormat.parse(o.shipDate).after(date)) .filter(o => o.shipPrio > 2) val lineitemsOfOrders = filteredOrders .join(lineitems) .where("orderId").equalTo("orderId") .apply((o, l) => new SelectedItem(o.orderDate, l.extdPrice)) val priceSums = lineitemsOfOrders .groupBy("orderDate") .sum("extdPrice") 38
  39. 39. Data Flow Optimizer [Two alternative execution plans for the previous program: (1) filter orders, broadcast them to a hybrid hash join (build hash table on orders, probe lineitems forwarded locally), then combine and sort-based reduce; (2) hash-partition both inputs on the join key, hybrid hash join, then reduce with partial sort. Best plan depends on relative sizes of input files] 39
