
Flink Forward SF 2017: Ufuk Celebi - The Stream Processor as a Database: Building Online Applications directly on Streams

We present a new design pattern for streaming applications built with Apache Flink and Apache Kafka: building applications directly on top of the stream processor, rather than on top of key/value databases populated by data streams. Unlike classical setups that use stream processors or libraries to pre-process and aggregate events and then update a database with the results, this setup gives the role of the database to the stream processor (here Apache Flink), routing queries to its workers, which answer them directly from their internal state computed over the log of events (Apache Kafka). The talk covers a high-level introduction to the architecture, the techniques in Flink and Kafka that make this approach possible, and a live demo.


Transcript

  1. Ufuk Celebi (@iamuce): The Stream Processor as a Database
  2. The (Classic) Use Case: Realtime Counts and Aggregates
  3. (Real-)Time Series Statistics: a stream of events feeds real-time statistics
  4. The Architecture: collect → message queue → analyze → serve & store
  5. The Flink Job (slides 5–9 repeat this code, highlighting one stage at a time):

    case class Impressions(id: String, impressions: Long)

    val events: DataStream[Event] = env
      .addSource(new FlinkKafkaConsumer09(…))

    val impressions: DataStream[Impressions] = events
      .filter(evt => evt.isImpression)
      .map(evt => Impressions(evt.id, evt.numImpressions))

    val counts: DataStream[Impressions] = impressions
      .keyBy("id")
      .timeWindow(Time.hours(1))
      .sum("impressions")
  10. The Flink Job as a parallel dataflow: Kafka Source → filter() → map() → keyBy() → window()/sum() (with local State) → Sink, shown with two parallel instances of each operator.
  11. Putting it all together: periodically (every second) flush new aggregates to Redis (a sink sketch follows below).
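For reference, a rough sketch (not from the slides) of what this classic setup looks like in code: a Flink sink that writes each windowed aggregate to Redis. It assumes the Jedis client library; the class name, host, and port are placeholders.

    import org.apache.flink.configuration.Configuration
    import org.apache.flink.streaming.api.functions.sink.RichSinkFunction
    import redis.clients.jedis.Jedis

    // Hypothetical sink: overwrites the latest count per id in Redis.
    class RedisImpressionsSink extends RichSinkFunction[Impressions] {
      @transient private var jedis: Jedis = _

      override def open(parameters: Configuration): Unit = {
        jedis = new Jedis("localhost", 6379) // placeholder connection
      }

      override def invoke(value: Impressions): Unit = {
        // One synchronous write per aggregate -- this is the cross-system
        // interaction that becomes the bottleneck on the next slide.
        jedis.set(value.id, value.impressions.toString)
      }

      override def close(): Unit = {
        if (jedis != null) jedis.close()
      }
    }

    // counts.addSink(new RedisImpressionsSink)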
  12. The Bottleneck: writes to the key/value store take too long.
  13.–14. Queryable State
  15. Queryable State: optional, and only at the end of windows (a sketch follows below).
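In code, exposing the windowed counts as queryable state looks roughly like the following sketch; the state name "impressions" is an assumption for illustration, not from the talk.

    import org.apache.flink.api.common.state.ValueStateDescriptor
    import org.apache.flink.api.scala._

    // Publish each window result as queryable keyed state instead of
    // (or in addition to) writing it to an external store.
    val queryable = counts
      .keyBy(_.id)
      .asQueryableState(
        "impressions",
        new ValueStateDescriptor[Impressions](
          "impressions", createTypeInformation[Impressions]))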
  16. Queryable State: Application View. The application queries a query service: real-time results for current time windows come straight from the stream processor, while older results for past time windows come from a database.
  17. Queryable State Enablers (a keyed-state sketch follows this list):
    - Flink has state as a first-class citizen.
    - State is fault tolerant (exactly-once semantics).
    - State is partitioned (sharded) together with the operators that create/update it.
    - State is continuous (not mini-batched).
    - State is scalable.
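As a sketch of what "state as a first-class citizen" means in code: keyed state is declared and used inside an operator, and Flink checkpoints and shards it with the operator instances. The running total below is illustrative, not from the talk.

    import org.apache.flink.api.common.functions.RichFlatMapFunction
    import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
    import org.apache.flink.configuration.Configuration
    import org.apache.flink.util.Collector

    // Keeps a running impression total per key; the registered state is
    // fault tolerant and partitioned together with this operator.
    class RunningTotal extends RichFlatMapFunction[Impressions, Impressions] {
      @transient private var total: ValueState[java.lang.Long] = _

      override def open(parameters: Configuration): Unit = {
        total = getRuntimeContext.getState(
          new ValueStateDescriptor[java.lang.Long]("total", classOf[java.lang.Long]))
      }

      override def flatMap(in: Impressions, out: Collector[Impressions]): Unit = {
        val current = if (total.value() == null) 0L else total.value().longValue()
        val updated = current + in.impressions
        total.update(updated)
        out.collect(Impressions(in.id, updated))
      }
    }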
  18. State in Flink: operators (Source / filter() / map(), window()/sum()) keep their state in a local state index (e.g., RocksDB). Events are persistent and ordered (per partition/key) in the message queue (e.g., Apache Kafka) and flow through the pipeline without replication or synchronous writes.
  19. State in Flink: a checkpoint is triggered by injecting a checkpoint barrier at the sources.
  20. State in Flink: when the barrier reaches a stateful operator, the operator takes a state snapshot, triggering a copy-on-write of its state.
  21. State in Flink: state snapshots are durably persisted asynchronously while the processing pipeline continues.
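Enabling this mechanism is a one-liner plus a state backend choice. A minimal sketch; the interval and checkpoint path are placeholders.

    import org.apache.flink.contrib.streaming.state.RocksDBStateBackend
    import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Inject a checkpoint barrier at the sources every 10 seconds.
    env.enableCheckpointing(10000)

    // Keep working state in an embedded RocksDB index; snapshots are
    // persisted asynchronously to the given durable location.
    env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints"))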
  22. Queryable State: Implementation. Each TaskManager registers the local state of its operators (e.g., window()/sum()) in a State Registry; the JobManager tracks state locations via the ExecutionGraph and a State Location Server, kept current through deploy status messages. A query of the form /job/state-name/key is answered in four steps: (1) the Query Client asks for the location of the key partition of the job, (2) the State Location Server looks up the location, (3) the location is returned to the client, (4) the client queries the owning TaskManager directly for the given state name and key.
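On the client side, this lookup is wrapped by QueryableStateClient. A minimal sketch against the client API as shipped with Flink 1.4+ (the 2017 demo used an earlier, lower-level API); host, port, job ID, and key below are placeholders.

    import org.apache.flink.api.common.JobID
    import org.apache.flink.api.common.state.ValueStateDescriptor
    import org.apache.flink.api.common.typeinfo.BasicTypeInfo
    import org.apache.flink.api.scala._
    import org.apache.flink.queryablestate.client.QueryableStateClient

    // Connects to a queryable state proxy on a TaskManager.
    val client = new QueryableStateClient("localhost", 9069)

    val resultFuture = client.getKvState(
      JobID.fromHexString("<job-id>"),   // ID of the running job (placeholder)
      "impressions",                     // queryable state name
      "some-ad-id",                      // key to look up (placeholder)
      BasicTypeInfo.STRING_TYPE_INFO,
      new ValueStateDescriptor[Impressions](
        "impressions", createTypeInformation[Impressions]))

    resultFuture.thenAccept(state => println(state.value()))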
  23. Queryable State Performance
  24. Conclusion
  25. Takeaways:
    - Streaming applications are often not bound by the stream processor itself; cross-system interaction is frequently the biggest bottleneck.
    - Queryable state mitigates a big bottleneck: communication with external key/value stores to publish real-time results.
    - Apache Flink's sophisticated support for state makes this possible.
  26. Takeaways: Performance of Queryable State:
    - Data persistence is fast with logs: append-only writes and streaming replication.
    - Computed state is fast thanks to local data structures and no synchronous replication.
    - Flink's checkpointing mechanism makes computed state persistent with low overhead.
  27. Questions?
    - Email: uce@apache.org
    - Twitter: @iamuce
    - Code/Demo: https://github.com/dataArtisans/flink-queryable_state_demo
  28. Appendix
  29. Flink Runtime + APIs: the DataStream API, Table API & Stream SQL, and ProcessFunction API sit on top of the runtime (distributed streaming dataflow). Building blocks: Streams, Time, State.
  30. Apache Flink Architecture Review
