Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Apache Flink@ Strata & Hadoop World London

3,427 views

Published on

Published in: Software
  • Login to see the comments

Apache Flink@ Strata & Hadoop World London

  1. 1. Stephan Ewen Flink committer co-founder @ data Artisans @StephanEwen Apache Flink
  2. 2. 1 year of Flink - code April 2014 April 2015
  3. 3. What is Flink 3 Gelly Table ML SAMOA DataSet (Java/Scala/Python) DataStream (Java/Scala) HadoopM/R Local Remote Yarn Tez Embedded Dataflow Dataflow(WiP) MRQL Table Cascading(WiP) Streaming dataflow runtime
  4. 4. Native workload support 5 Flink Streaming topologies Long batch pipelines Machine Learning at scale How can an engine natively support all these workloads? And what does "native" mean? Graph Analysis
  5. 5. E.g.: Non-native iterations 6 Step Step Step Step Step Client for (int i = 0; i < maxIterations; i++) { // Execute MapReduce job }
  6. 6. E.g.: Non-native streaming 7 stream discretizer Job Job Job Job while (true) { // get next few records // issue batch job }
  7. 7. Native workload support 8 Flink Streaming topologies Heavy batch jobs Machine Learning at scale How can an engine natively support all these workloads? And what does native mean?
  8. 8. Flink Engine 1. Execute everything as streams 2. Allow some iterative (cyclic) dataflows 3. Allow some mutable state 4. Operate on managed memory 9
  9. 9. Program compilation 10 case class Path (from: Long, to: Long) val tc = edges.iterate(10) { paths: DataSet[Path] => val next = paths .join(edges) .where("to") .equalTo("from") { (path, edge) => Path(path.from, edge.to) } .union(paths) .distinct() next } Optimizer Type extraction stack Task scheduling Dataflow metadata Pre-flight (Client) Master Workers Data Source orders.tbl Filter Map DataSourc e lineitem.tbl Join Hybrid Hash build HT probe hash-part [0] hash-part [0] GroupRed sort forward Program Dataflow Graph deploy operators track intermediate results
  10. 10. Flink by Use Case 11
  11. 11. Data Streaming Analysis streaming dataflows 12
  12. 12. 3 Parts of a Streaming Infrastructure 13 Gathering Broker Analysis Sensors Transaction logs … Server Logs
  13. 13. 3 Parts of a Streaming Infrastructure 14 Gathering Broker Analysis Sensors Transaction logs … Server Logs Result may be fed back to the broker
  14. 14. Cornerstones of Flink Streaming  Pipelined stream processor (low latency)  Expressive APIs  Flexible operator state, streaming windows  Efficient fault tolerance for streams and state. 15
  15. 15. Pipelined stream processor 16 Streaming Shuffle!
  16. 16. Expressive APIs 17 case class Word (word: String, frequency: Int) val lines: DataStream[String] = env.fromSocketStream(...) lines.flatMap {line => line.split(" ") .map(word => Word(word,1))} .window(Time.of(5,SECONDS)).every(Time.of(1,SECONDS)) .groupBy("word").sum("frequency") .print() val lines: DataSet[String] = env.readTextFile(...) lines.flatMap {line => line.split(" ") .map(word => Word(word,1))} .groupBy("word").sum("frequency") .print() DataSet API (batch): DataStream API (streaming):
  17. 17. Windows 18 More at: http://flink.apache.org/news/2015/02/09/streaming-example.html
  18. 18. Checkpointing / Recovery 19 Chandy-Lamport Algorithm for consistent asynchronous distributed snapshots Pushes checkpoint barriers through the data flow Operator checkpoint starting Checkpoint done Data Stream barrier Before barrier = part of the snapshot After barrier = Not in snapshot Checkpoint done checkpoint in progress (backup till next snapshot)
  19. 19. Long batch pipelines Batch on Streaming 20
  20. 20. Batch Pipelines 21
  21. 21. Batch on Streaming  Batch programs are a special kind of streaming program 22 Infinite Streams Finite Streams Stream Windows Global View Pipelined Data Exchange Pipelined or Blocking Exchange Streaming Programs Batch Programs
  22. 22. Batch Pipelines 23 Data exchange (shuffle / broadcast) is mostly streamed Some operators block (e.g. sorts / hash tables)
  23. 23. Operators Execution Overlaps 24
  24. 24. Memory Management 25
  25. 25. Memory Management 26
  26. 26. Smooth out-of-core performance 27 More at: http://flink.apache.org/news/2015/03/13/peeking-into-Apache-Flinks-Engine-Room.html Blue bars are in-memory, orange bars (partially) out-of-core
  27. 27. Table API 28 val customers = envreadCsvFile(…).as('id, 'mktSegment) .filter("mktSegment = AUTOMOBILE") val orders = env.readCsvFile(…) .filter( o => dateFormat.parse(o.orderDate).before(date) ) .as("orderId, custId, orderDate, shipPrio") val items = orders .join(customers).where("custId = id") .join(lineitems).where("orderId = id") .select("orderId, orderDate, shipPrio, extdPrice * (Literal(1.0f) – discount) as revenue") val result = items .groupBy("orderId, orderDate, shipPrio") .select('orderId, revenue.sum, orderDate, shipPrio")
  28. 28. Machine Learning Algorithms Iterative data flows 29
  29. 29. Iterate by looping  for/while loop in client submits one job per iteration step  Data reuse by caching in memory and/or disk Step Step Step Step Step Client 30
  30. 30. Iterate in the Dataflow 31
  31. 31. Example: Matrix Factorization 32 Factorizing a matrix with 28 billion ratings for recommendations More at: http://data-artisans.com/computing-recommendations-with-flink.html
  32. 32. Graph Analysis Stateful Iterations 33
  33. 33. Iterate natively with state/deltas 34
  34. 34. Effect of delta iterations… 0 5000000 10000000 15000000 20000000 25000000 30000000 35000000 40000000 45000000 1 6 11 16 21 26 31 36 41 46 51 56 61 #ofelementsupdated iteration
  35. 35. … fast graph analysis 36More at: http://data-artisans.com/data-analysis-with-flink.html
  36. 36. Closing 37
  37. 37. Flink Roadmap for 2015 Some highlights that we are working on  More flexible state and state backends in streaming  Master Failover  Improved monitoring  Integration with other Apache projects • SAMOA, Zeppelin, Ignite  More additions to the libraries 38
  38. 38. 39
  39. 39. flink.apache.org @ApacheFlink

×