BIRTE Panel at VLDB: Are we solving the core Problems in stream processing?

This presentation highlights recent developments in the Apache Flink community and recent related research publications. It was given at the panel discussion of the BIRTE workshop at VLDB 2018 in Rio de Janeiro.

More information can be found at http://db.cs.pitt.edu/birte2018/

Panel Title:
Are we making any attempts towards solving the hardest problems in stream processing today?

Panel Abstract:
Most of today’s Internet applications are data-centric and generate vast amounts of data that need to be processed and analyzed for detailed reporting, enhancing user experience, and increasing monetization. Streaming data processing systems must be designed based on a varying set of requirements. The list of requirements can be categorized based on different properties of such systems:
1. Consistency: Does every record in the input (or equivalently an input event) need to be committed exactly-once or at-least-once or at-most-once to the output? Is the event committed atomically or eventually to all outputs?
2. Scale: How many events per second can the system process? Tens of events per second? Or Thousands? Millions? Billions or even more? Does the system auto-scale to a new workload?
3. Failure Resilience: What kind of failures is the system resilient to? Machine-level or partial datacenter-level or entire datacenter-level? Is it enough to ensure that the data processing system itself is failure-resistant? Does the output need to be stored in a globally consistent way? Is the system resilient to a bug in the input data, a bug in the user’s business logic, etc.?
4. Latency: How long does each event take from the time it is generated to the time it is committed? Milliseconds or seconds or minutes or hours or days? Should we target SLOs for median latencies or 90th percentile or higher tail latencies?
5. Expressiveness: What kind of operations can the user express in the system? From simple stateless operations (e.g. filter) to complex joins or stateful operations (e.g. HAVING clause in SQL)? How flexible is the system to add more input sources and output sinks?
6. Cost: This includes not only hardware cost (CPU, RAM, disk, network, etc.) but also engineering design complexity, the cost of production support to run as a service, the cost of providing SLOs for latency / completeness, etc. From a pure business perspective, all this cost needs to be justified by the value the end user gets.
7. Service: Does the system run as a service for the users? Multi-tenant? What kind of isolation (e.g. performance, security, etc) is provided amongst users? How is business logic isolated from infrastructure? How easy is it for users to modify business logic in a self-service way?
Lots of systems provide a lambda architecture: Use stream processing for best-effort (approximate) analysis, and use batch processing (e.g. daily) for strong consistency, high reliability, etc. This represents an easy way out. But is it the right thing to do?
[...]
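
To make the consistency question in the abstract concrete, the following is a minimal Python sketch of the three delivery guarantees it names. It is a toy simulation, not code from any streaming system: the function `deliver`, the `failed_first_attempt` set, and the integer record ids are all hypothetical names introduced for illustration. For at-most-once the failed write is simply lost; for the other two modes the write succeeds but the acknowledgement is lost, so the sender retries.

```python
from typing import List, Set, Tuple

# (record_id, payload); the id is a hypothetical stand-in for an offset or sequence number
Record = Tuple[int, str]


def deliver(records: List[Record], mode: str, failed_first_attempt: Set[int]) -> List[str]:
    """Simulate committing records to one output under a given delivery guarantee.

    mode is "at-most-once", "at-least-once", or "exactly-once".
    failed_first_attempt holds the ids whose first delivery attempt goes wrong.
    """
    if mode not in ("at-most-once", "at-least-once", "exactly-once"):
        raise ValueError(f"unknown mode: {mode}")

    output: List[str] = []
    seen_ids: Set[int] = set()  # only consulted by the idempotent (exactly-once) sink

    def write(record: Record) -> None:
        record_id, payload = record
        if mode == "exactly-once":
            if record_id in seen_ids:
                return  # duplicate from a retry: ignore it
            seen_ids.add(record_id)
        output.append(payload)

    for record in records:
        record_id = record[0]
        if mode == "at-most-once":
            # No retries: a failed write is dropped for good.
            if record_id not in failed_first_attempt:
                write(record)
        else:
            write(record)  # the write itself reaches the sink
            if record_id in failed_first_attempt:
                write(record)  # ack was lost, so the sender retries
    return output
```

With records `[(1, "a"), (2, "b"), (3, "c")]` and a failure on id 2, the three modes yield `["a", "c"]`, `["a", "b", "b", "c"]`, and `["a", "b", "c"]` respectively. Deduplicating by id is only one way to obtain exactly-once output; transactional commits are another, and the sketch does not model them.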

Published in: Science

BIRTE Panel at VLDB: Are we solving the core Problems in stream processing?

  1. Jonas Traub, BIRTE @ VLDB, 2018. Are we solving the core problems in stream processing? Jonas Traub, Technische Universität Berlin / DFKI IAM, www.dima.tu-berlin.de | jonas.traub@tu-berlin.de. Panel Discussion with: Manpreet Singh (Google), Karthik Ramasamy (Streamlio), C. Mohan (IBM), Badrish Chandramouli (Microsoft), Neng Lu (Twitter), Alok Pareek (Striim), Jonas Traub (TU-Berlin)
  2. Are we solving the core problems in stream processing? Jonas Traub, Technische Universität Berlin / DFKI IAM, www.dima.tu-berlin.de | jonas.traub@tu-berlin.de
  3. Are we solving the core problems in stream processing? Yes, we do!
  4. Are we solving the core problems in stream processing? Yes, we do! Apache Flink and its success story. What are the core problems and how are we solving them? Examples.
  5. Apache Flink Timeline
  7. Apache Flink - Stateful Computations over Data Streams (source: flink.apache.org). Event-driven Applications; Stream & Batch Analytics; Data Pipelines & ETL; Exactly-once state consistency; Event-time processing; Sophisticated late data handling; Scale-out architecture; Support for very large state; Incremental checkpointing.
  8. Examples: What are core problems and how are we solving them?
  9. Examples: Expressiveness: Event-time processing and sophisticated late data handling. The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing (Akidau et al.)
  10. Examples: Expressiveness: Event-time processing and sophisticated late data handling. Service: Common APIs and feature sets. The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing (Akidau et al.). Apache Beam: An advanced unified programming model. “Implement batch and streaming data processing jobs that run on any execution engine.” (beam.apache.org)
  11. Examples: Consistency: Exactly-once state consistency. Expressiveness: Event-time processing and sophisticated late data handling. Service: Common APIs and feature sets. Lightweight asynchronous snapshots for distributed dataflows (P Carbone, G Fóra, S Ewen, S Haridi, K Tzoumas). State management in Apache Flink: consistent stateful distributed stream processing (P Carbone, S Ewen, G Fóra, S Haridi, S Richter, K Tzoumas). The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing (Akidau et al.). Apache Beam: An advanced unified programming model. “Implement batch and streaming data processing jobs that run on any execution engine.” (beam.apache.org)
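
The slides name event-time processing and late data handling without showing what they involve. The following is a simplified plain-Python sketch of the idea, assuming tumbling windows, a watermark that trails the largest event time seen by a fixed bound, and an allowed-lateness period. It is not Flink or Beam code, and the function name `tumbling_counts` and its parameters are hypothetical; real engines differ in details such as exactly when a window fires and when its state is purged.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def tumbling_counts(
    event_times: Iterable[int],
    size: int,
    max_delay: int,
    allowed_lateness: int = 0,
) -> Tuple[List[Tuple[int, int]], List[int]]:
    """Count events per tumbling event-time window.

    event_times: event timestamps in ARRIVAL order (may be out of order).
    size: window length; window [s, s + size) starts at a multiple of size.
    max_delay: the watermark trails the max event time seen by this much.
    allowed_lateness: how long after firing a window still accepts updates.

    Returns (emitted, dropped): emitted is a list of (window_start, count)
    results in firing order; dropped lists timestamps that arrived too late.
    """
    counts: Dict[int, int] = defaultdict(int)
    fired: Set[int] = set()
    emitted: List[Tuple[int, int]] = []
    dropped: List[int] = []
    watermark = float("-inf")

    for ts in event_times:
        start = (ts // size) * size
        end = start + size
        if watermark >= end + allowed_lateness:
            dropped.append(ts)  # lateness period is over for this window
        else:
            counts[start] += 1
            if start in fired:
                # Late but still allowed: emit an updated result.
                emitted.append((start, counts[start]))
        # Advance the watermark, then fire every window it has passed.
        watermark = max(watermark, ts - max_delay)
        for s in sorted(counts):
            if s not in fired and watermark >= s + size:
                fired.add(s)
                emitted.append((s, counts[s]))

    # End of input: flush windows that never fired.
    for s in sorted(counts):
        if s not in fired:
            emitted.append((s, counts[s]))
    return emitted, dropped
```

For arrival order `[1, 4, 12, 3, 16, 7, 40, 8]` with `size=10`, `max_delay=5`, and `allowed_lateness=10`, the window starting at 0 first fires with count 3, is updated to 4 when the late event 7 arrives, and the event 8 is dropped because it arrives after the lateness period has passed. With `allowed_lateness=0`, both 7 and 8 are dropped.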