
Debunking Six Common Myths in Stream Processing



Presentation at November Apache Flink Meetup London



  1. Apache Flink®: State of the Union and What's Next. Kostas Tzoumas (@kostas_tzoumas), Flink London Meetup, November 3, 2016
  2. Debunking Six Common Myths in Stream Processing. Kostas Tzoumas (@kostas_tzoumas), Flink London Meetup, November 3, 2016
  3. Original creators of Apache Flink®; providers of the dA Platform, a supported Flink distribution
  4. Outline
     • What is data streaming
     • Myth 1: The Lambda architecture
     • Myth 2: The throughput/latency tradeoff
     • Myth 3: Exactly once not possible
     • Myth 4: Streaming is for (near) real-time
     • Myth 5: Batching and buffering
     • Myth 6: Streaming is hard
  5. The streaming architecture
  6. Reconsideration of data architecture
     • Better app isolation
     • More real-time reaction to events
     • Robust continuous applications
     • Process both real-time and historical data
  7. [Diagram: applications with their own state connected through an event log, served by a query service]
  8. What is (distributed) streaming
     • Computations on never-ending "streams" of data records ("events")
     • A stream processor distributes the computation in a cluster, running parallel instances of your code
  9. What is stateful streaming
     • Computation and state, e.g., counters, windows of past events, state machines, trained ML models
     • The result depends on the history of the stream
     • A stateful stream processor gives you the tools to manage state: recover, roll back, version, upgrade, etc.
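To make "result depends on the history of the stream" concrete, here is a toy sketch of a keyed counter. This is plain Python, not Flink code; in a real stream processor this state would live in a managed, fault-tolerant state backend:

```python
from collections import defaultdict

def process(events):
    """Count events per key: the result emitted for each event depends
    on the whole history of the stream seen so far (the state)."""
    counts = defaultdict(int)  # the operator's state
    out = []
    for key in events:
        counts[key] += 1
        out.append((key, counts[key]))
    return out

# Each output record reflects the stream's history, not just the event itself.
print(process(["a", "b", "a", "a"]))  # [('a', 1), ('b', 1), ('a', 2), ('a', 3)]
```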
  10. What is event-time streaming
     • Data records associated with timestamps (time series data)
     • Processing depends on timestamps
     • An event-time stream processor gives you the tools to reason about time, e.g., handle streams that are out of order
     • A core feature is watermarks: a clock to measure event time
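A watermark can be sketched in a few lines. The version below is a toy "bounded out-of-orderness" generator (the names and the fixed allowance are illustrative, not Flink's API): the event-time clock trails the largest timestamp seen, so moderately late events do not move it backwards:

```python
def watermarks(timestamps, max_out_of_orderness):
    """Yield (event_time, watermark) pairs: the watermark trails the
    largest timestamp seen by a fixed allowance for late events."""
    max_ts = float("-inf")
    for ts in timestamps:
        max_ts = max(max_ts, ts)
        yield ts, max_ts - max_out_of_orderness

# The out-of-order event t=4 (arriving after t=7) leaves the clock at 5.
for event_time, wm in watermarks([1, 3, 7, 4, 9], max_out_of_orderness=2):
    print(event_time, wm)
```

An event-time window can then be closed once the watermark passes its end, rather than when wall-clock time does.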
  11. What is streaming
     • Continuous processing on data that is continuously generated
     • I.e., pretty much all "big" data
     • It's all about state and time
  12. (image slide)
  13. Myth 1: The Lambda architecture
  14. Myth variations
     • Stream processing is approximate
     • Stream processing is for transient data
     • Stream processing cannot handle high data volume
     • Hence, stream processing needs to be coupled with batch processing
  15. Lambda architecture [Diagram: a batch layer (files, jobs, scheduler) running alongside a streaming job, both feeding a serve & store layer]
  16. Lambda no longer needed
     • Lambda was useful in the early days of stream processing (the beginning of Apache Storm)
     • Not any more: stream processors can handle very large volumes and compute accurate results
     • The good news is I don't hear about Lambda so often anymore
  17. Myth 2: The throughput/latency tradeoff
  18. Myth flavors
     • Low-latency systems cannot support high throughput
     • In general, you need to trade one off for the other
     • There is a "high throughput" category and a "low latency" category (naming varies)
  19. Physical limits
     • Most stream processing pipelines are network-bottlenecked
     • The network dictates both (1) latency and (2) throughput
     • A well-engineered system achieves the physical limits allowed by the network
  20. Buffering
     • It is natural to handle many records together; all software and hardware systems do that, e.g., the network bundles bytes into frames
     • Every streaming system buffers records for performance (Flink certainly does)
     • You don't want to send single records over the network; "record-at-a-time" does not exist at the physical level
  21. Buffering (2)
     • Buffering is a performance optimization
     • It should be opaque to the user
     • It should not dictate system behavior in any other way
     • It should not impose artificial boundaries
     • It should not limit what you can do with the system
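The point that buffering is transparent to semantics can be illustrated with a toy sender (the `BufferedSender` name and flush policy are made up for this sketch): records are shipped in batches, yet exactly the same records arrive in exactly the same order as with record-at-a-time sending:

```python
class BufferedSender:
    """Batch records before handing them to a transport: a performance
    optimization that never changes *what* is delivered, only how."""
    def __init__(self, transport, max_batch=4):
        self.transport = transport      # callable taking a list of records
        self.max_batch = max_batch
        self.buffer = []

    def send(self, record):
        self.buffer.append(record)
        if len(self.buffer) >= self.max_batch:
            self.flush()

    def flush(self):
        if self.buffer:
            self.transport(list(self.buffer))
            self.buffer.clear()

batches = []
sender = BufferedSender(batches.append, max_batch=4)
for r in range(10):
    sender.send(r)
sender.flush()  # drain the tail so no record is ever lost to the buffer
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Flattening the batches reproduces the original stream, which is exactly the "opaque to the user" property the slide asks for.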
  22. Some numbers (benchmark chart)
  23. Some more (benchmark chart: TeraSort, relational join, classic batch jobs, graph processing, linear algebra)
  24. Myth 3: Exactly once not possible
  25. What is "exactly once"
     • Under failures, the system computes the result as if there was no failure
     • In contrast to at most once (no guarantees) and at least once (duplicates possible)
     • Exactly-once state versus exactly-once delivery
  26. Myth variations
     • Exactly once is not possible in nature
     • Exactly once is not possible end-to-end
     • Exactly once is not needed
     • You need to trade off performance for exactly once
     (Usually perpetuated by folks until they implement exactly once)
  27. Transactions
     • "Exactly once" is transactions: either all actions succeed or none succeed
     • Transactions are possible
     • Transactions are useful
     • Let's not start eventual consistency all over again…
  28. Flink checkpoints
     • Periodic asynchronous consistent snapshots of application state
     • Provide exactly-once state guarantees under failures
     [Figure: checkpoint barriers n-1 and n flowing through the data stream, separating records into checkpoints n-1, n, and n+1]
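The recovery idea behind checkpoints can be sketched in miniature. This toy (not Flink's actual barrier mechanism; the function and parameter names are made up) snapshots a counter's state every few records, and on a crash restores the snapshot and replays the stream from that position, so the failed run computes exactly the failure-free result:

```python
import copy

def run(events, checkpoint_every=3, crash_at=None):
    """Count events per key, snapshotting the state periodically; on a
    crash, restore the last snapshot and replay from its position."""
    state, snapshot, snap_pos = {}, {}, 0
    i = 0
    while i < len(events):
        if crash_at is not None and i == crash_at:
            crash_at = None                               # fail only once
            state, i = copy.deepcopy(snapshot), snap_pos  # recover & rewind
            continue
        state[events[i]] = state.get(events[i], 0) + 1
        i += 1
        if i % checkpoint_every == 0:
            snapshot, snap_pos = copy.deepcopy(state), i
    return state

events = ["a", "b", "a", "c", "b", "a", "a"]
# The crashed-and-recovered run matches the clean run: exactly-once state.
print(run(events) == run(events, crash_at=5))  # True
```

Without the rewind, the records between the last checkpoint and the crash would be counted twice, which is exactly the at-least-once anomaly the previous slides describe.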
  29. End-to-end exactly once
     • Checkpoints double as a transaction coordination mechanism
     • Source and sink operators can take part in checkpoints (transactional sinks)
     • Exactly once internally; "effectively once" end to end, e.g., Flink + Cassandra with idempotent updates
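"Effectively once" via idempotent updates boils down to this: if a write applied twice leaves the store in the same state as a write applied once, replays after failure are harmless. A minimal sketch with a dict standing in for a key-value store like Cassandra (the function name is illustrative):

```python
def write_idempotent(store, updates):
    """Idempotent upserts keyed by primary key: replaying the same
    updates after a failure leaves the store unchanged."""
    for key, value in updates:
        store[key] = value  # upsert: applying twice == applying once
    return store

updates = [("user:1", 10), ("user:2", 7), ("user:1", 12)]
once = write_idempotent({}, updates)
# Simulate a failure-triggered replay of the same updates.
twice = write_idempotent(write_idempotent({}, updates), updates)
print(once == twice)  # True: the duplicate delivery had no observable effect
```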
  30. State management
     • Checkpoints triple as a state versioning mechanism (savepoints)
     • Go back and forth in time while maintaining state consistency
     • Eases code upgrades (Flink or app), maintenance, migration, debugging, what-if simulations, and A/B tests
  31. Myth 4: Streaming = real time
  32. Myth variations
     • I don't have low-latency applications, hence I don't need stream processing
     • Stream processing is only relevant for data before it is stored
     • We need a batch processor to do heavy offline computations
  33. Low-latency and high-latency streams [Timeline figure over hourly partitions: a low-latency stream over recent data, a batch job as a bounded stream over a historical range, and a high-latency stream spanning both]
  34. Robust continuous applications
  35. Accurate computation
     • Batch processing is not an accurate computation model for continuous data: it misses the right concepts and primitives (time handling, state across batch boundaries)
     • Stateful stream processing is a better model; real-time/low-latency is the icing on the cake
  36. Myth 5: Batching and buffering
  37. Myth variations
     • There is a "mini-batch" category between batch and streaming
     • "Record-at-a-time" versus "mini-batching", or similar "choices"
     • Mini-batch systems can get better throughput
  38. Myth variations (2)
     • The difference between mini-batching and streaming is latency
     • I don't need low latency, hence I need mini-batching
     • I have a mini-batching use case
  39. We have answered this already
     • You can get both throughput and latency (myth #2): every system buffers data, from the network to the OS to Flink
     • Streaming is a model, not just fast (myth #4): it is about time and state; low latency is the icing on the cake
  40. Continuous operation
     • Data is continuously produced
     • Computation should track data production, with dynamic scaling and pause-and-resume
     • Restarting our pipelines every second is not a great idea, and not just for latency reasons
  41. Myth 6: Streaming is hard
  42. Myth variations
     • Streaming is hard to learn
     • Streaming is hard to reason about
     • Windows? Event time? Triggers? Oh, my!
     • Streaming needs to be coupled with batch
     • I know batch already
  43. It's about your data and code
     • What's the form of your data? Unbounded (e.g., clicks, sensors, logs), or bounded (e.g., ???*)
     • What changes more often? My code changes faster than my data, or my data changes faster than my code
     * Please help me find a great example of naturally static data
  44. It's about your data and code
     • If your data changes faster than your code, you have a streaming problem
     • You may be solving it with hourly batch jobs, depending on someone else to create the hourly batches
     • You are probably living with inaccurate results without knowing it
  45. It's about your data and code
     • If your code changes faster than your data, you have an exploration problem
     • Using notebooks or other tools for quick data exploration is a good idea
     • Once your code stabilizes you will have a streaming problem, so you might as well think of it as such from the beginning
  46. Flink in the real world
  47. Flink community
     • > 240 contributors, 95 contributors in Flink 1.1
     • 42 meetups around the world with > 15,000 members
     • 2x-3x growth in 2015, similar in 2016
  48. Powered by Flink
     • Zalando, one of the largest ecommerce companies in Europe, uses Flink for real-time business process monitoring
     • King, the creators of Candy Crush Saga, uses Flink to provide data science teams with real-time analytics
     • Bouygues Telecom uses Flink for real-time event processing over billions of Kafka messages per day
     • Alibaba, the world's largest retailer, built a Flink-based system (Blink) to optimize search rankings in real time
     See more at flink.apache.org/poweredby.html
  49. Production deployments
     • 30 Flink applications in production for more than one year; 10 billion events (2 TB) processed daily
     • Complex jobs of > 30 operators running 24/7, processing 30 billion events daily, maintaining state of 100s of GB with exactly-once guarantees
     • Largest job has > 20 operators, runs on > 5,000 vCores in a 1,000-node cluster, and processes millions of events per second
  50. (image slide)
  51. Flink Forward 2016
  52. Current work in Flink
  53. Flink's unique combination of features
     • Performance: low latency, high throughput, well-behaved flow control (back pressure), fast and large out-of-core state
     • Consistency: exactly-once semantics for fault tolerance, savepoints (replays, A/B testing, upgrades, versioning), works on real-time and historic data
     • Event time: out-of-order events, flexible windows (time, count, session, roll-your-own)
     • APIs and libraries: stateful streaming, windows and user-defined state, complex event processing, fluent API
  54. Flink 1.1
     • Connectors
     • Metric system
     • (Stream) SQL
     • Session windows
     • Library enhancements
  55. Flink 1.1 + ongoing development
     • Flink 1.1: connectors, session windows, (stream) SQL, library enhancements, metric system
     • Ongoing: metrics & visualization, dynamic scaling, savepoint compatibility, checkpoints to savepoints, more connectors, Stream SQL windows, large state maintenance, fine-grained recovery, side in-/outputs, window DSL, security, authentication, Mesos & others, dynamic resource management, queryable state
  56. Flink 1.1 + ongoing development, grouped by theme: operations, ecosystem, application features, and a broader audience (metrics & visualization, dynamic scaling, savepoint compatibility, checkpoints to savepoints, more connectors, Stream SQL windows, large state maintenance, fine-grained recovery, side in-/outputs, window DSL, security, authentication, Mesos & others, dynamic resource management, queryable state)
  57. A longer-term vision for Flink
  58. Streaming use cases (application → technology)
     • (Near) real-time apps → low-latency streaming
     • Continuous apps → high-latency streaming
     • Analytics on historical data → batch as a special case of streaming
     • Request/response apps → large queryable state
  59. Request/response applications
     • Queryable state: query Flink state directly instead of pushing results into a database
     • Large state support and a query API are coming in Flink
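The idea is that the job's state itself serves the request/response path. A toy stand-in (plain Python, not Flink's queryable state API; class and method names are invented for illustration):

```python
from collections import defaultdict

class CountingJob:
    """A stream job whose live state is queryable directly, instead of
    being pushed to an external database after every update."""
    def __init__(self):
        self.counts = defaultdict(int)  # the job's keyed state

    def on_event(self, key):
        self.counts[key] += 1

    def query(self, key):
        # The request/response path reads the live state in place.
        return self.counts[key]

job = CountingJob()
for click in ["ad1", "ad2", "ad1"]:
    job.on_event(click)
print(job.query("ad1"))  # 2
```

The design choice this sketches: one system maintains and serves the state, removing the extra write path, the serving database, and the consistency gap between them.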
  60. In summary
     • The need for streaming comes from a rethinking of data infrastructure architecture; stream processing then just becomes natural
     • Debunking six myths:
     • Myth 1: The Lambda architecture
     • Myth 2: The throughput/latency tradeoff
     • Myth 3: Exactly once not possible
     • Myth 4: Streaming is for (near) real-time
     • Myth 5: Batching and buffering
     • Myth 6: Streaming is hard
  61. Thank you! @kostas_tzoumas @ApacheFlink @dataArtisans
  62. We are hiring! data-artisans.com/careers
