This presentation highlights recent developments in the Apache Flink community and recent related research publications. It was presented during the panel discussion of the BIRTE workshop at VLDB 2018 in Rio de Janeiro.
More information can be found at http://db.cs.pitt.edu/birte2018/
Are we making any attempts towards solving the hardest problems in stream processing today?
Most of today’s Internet applications are data-centric and generate vast amounts of data that need to be processed and analyzed for detailed reporting, for enhancing the user experience, and for increasing monetization. Streaming data processing systems must be designed against a varying set of requirements, which can be categorized by different properties of such systems:
1. Consistency: Does every record in the input (or, equivalently, every input event) need to be committed to the output exactly once, at least once, or at most once? Is each event committed atomically or only eventually to all outputs?
2. Scale: How many events per second can the system process? Tens? Thousands? Millions? Billions, or even more? Does the system auto-scale to a new workload?
3. Failure Resilience: What kinds of failures is the system resilient to? Machine-level, partial datacenter-level, or entire datacenter-level failures? Is it enough to ensure that the data processing system itself is failure-resistant, or does the output also need to be stored in a globally consistent way? Is the system resilient to a bug in the input data, a bug in the user’s business logic, etc.?
4. Latency: How long does each event take from the time it is generated to the time it is committed? Milliseconds, seconds, minutes, hours, or days? Should we target SLOs for median latencies, the 90th percentile, or higher tail latencies?
5. Expressiveness: What kinds of operations can the user express in the system? From simple stateless operations (e.g. filters) to complex joins and stateful operations (e.g. a HAVING clause in SQL)? How flexible is the system in adding more input sources and output sinks?
6. Cost: This includes not only hardware cost (CPU, RAM, disk, network, etc.) but also engineering design complexity, the cost of production support to run the system as a service, and the cost of providing SLOs for latency and completeness. From a pure business perspective, all of this cost needs to be justified by the value the end user gets.
7. Service: Does the system run as a service for its users? Is it multi-tenant? What kind of isolation (e.g. performance, security) is provided among users? How is business logic isolated from the infrastructure? How easy is it for users to modify business logic in a self-service way?
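To make the consistency trade-off in point 1 concrete, here is a minimal, engine-agnostic sketch (not tied to Flink or any particular system) of one common technique: a sink that turns at-least-once delivery into effectively-once output by deduplicating on event IDs. The class and names are hypothetical, and real systems keep the seen-ID set in durable, checkpointed state.

```python
class DedupSink:
    """Turns at-least-once delivery into effectively-once output by
    remembering which event IDs have already been committed."""

    def __init__(self):
        self.seen_ids = set()  # in a real system this would be durable state
        self.output = []

    def commit(self, event_id, value):
        if event_id in self.seen_ids:
            return False  # duplicate redelivery after a failure: drop it
        self.seen_ids.add(event_id)
        self.output.append(value)
        return True

sink = DedupSink()
# The source redelivers event 1, simulating a retry after a failure.
for event_id, value in [(1, "a"), (2, "b"), (1, "a"), (3, "c")]:
    sink.commit(event_id, value)

print(sink.output)  # effectively-once output: ['a', 'b', 'c']
```

The same idea is often phrased as making the sink idempotent: retries from an at-least-once source then become harmless.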
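For point 4, a small sketch of why the choice of percentile matters when setting latency SLOs: a handful of tail events can leave the median looking healthy while the 90th percentile blows far past a target. The helper below is hypothetical and uses the nearest-rank definition of a percentile.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Observed end-to-end latencies in milliseconds; two tail outliers.
latencies_ms = [12, 15, 11, 14, 250, 13, 16, 12, 900, 14]
p50 = percentile(latencies_ms, 50)
p90 = percentile(latencies_ms, 90)
print(p50, p90)  # median 14 ms looks fine; the 90th percentile is 250 ms
```

An SLO stated only on the median would pass here even though one in ten events takes 250 ms or more.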
Many systems settle on a lambda architecture: use stream processing for best-effort (approximate) analysis, and use batch processing (e.g. daily) for strong consistency, high reliability, and the like. This is the easy way out, but is it the right thing to do?
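The lambda pattern just described can be sketched in a few lines; all class and method names here are illustrative, not taken from any real system. A fast streaming layer serves approximate counts immediately, and a periodic batch layer recomputes exact results from the raw data and overwrites them.

```python
from collections import Counter

class LambdaView:
    """Toy lambda architecture: a speed layer with best-effort counts
    and a batch layer that periodically recomputes exact results."""

    def __init__(self):
        self.speed = Counter()  # approximate; may double-count on retries
        self.batch = {}         # exact; recomputed from raw data daily

    def on_stream_event(self, key):
        self.speed[key] += 1

    def run_batch(self, raw_events):
        # The batch job recomputes from the source of truth, so any
        # duplicates introduced by the streaming path are corrected here.
        self.batch = dict(Counter(raw_events))

    def query(self, key):
        # Serve the exact batch result when available, otherwise the
        # best-effort streaming estimate.
        return self.batch.get(key, self.speed[key])

view = LambdaView()
raw = ["click", "click", "view"]
for e in raw + ["click"]:  # the streaming path sees a duplicate "click"
    view.on_stream_event(e)
print(view.query("click"))  # before the batch run: approximate -> 3
view.run_batch(raw)
print(view.query("click"))  # after the batch run: exact -> 2
```

The sketch also shows the cost of the pattern: two code paths computing the same answer, and results that are only eventually correct, which is exactly the trade-off the question above challenges.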