Efficient Migration of Very Large Distributed State for Scalable Stream Processing
1. Efficient Migration of Very Large Distributed State for Scalable Stream Processing
PhD Candidate: Bonaventura Del Monte
Advisors: Prof. Dr. Volker Markl, Prof. Dr. Tilmann Rabl
PhD Workshop, VLDB 2017
This work has been partially funded by the European Union’s Horizon 2020 research and innovation program under grant agreement n° 687691
2. Outline
• Research Goal
• Problem Statement
• Proposed Solution
• Research Issues
• Evaluation Plan
• Conclusion and future directions
3. Distributed Stateful Stream Processing
[Figure: dataflow of sources S1 and S2 into operators OP1, OP2, OP3; each parallel instance pairs a stream processor executing a UDF with its state storage]
• State is co-partitioned with the input stream by key
• State is internally stored and managed
4. State Management In Current Systems
• Fault-tolerance
• Resource elasticity
• Queries maintenance
• Load balancing
• Partitioned State
• Partially Distributed State
• Hundreds of gigabytes
5. Motivational Example: a Real-World Deployment
• Many analytics executed at the same time:
• Machine Learning models, e.g., collaborative filtering, fraud detection, NLP: 100s of GB per model
• Different types of temporal aggregations/joins: 100s of GB
6. Motivational Example: a Real-World Deployment
[Figure: topology with sources S1 and S2, operators OP1, OP2, OP3, and sinks SINK1 and SINK2; sources, operators, and sinks all hold state]
7. Motivational Example: a Real-World Deployment
[Figure: the stateful topology runs on CLUSTER A, with CLUSTER B providing additional computing resources; required actions: add/remove resources, handle failures, balance load, migrate state]
8. Research Goal
• Fault-tolerance
• Resource Elasticity
• Queries Maintenance
• Load Balancing
• Distributed State a.k.a. Shared Mutable State
• Terabyte Sizes
9. Problem Statement
• Fault-tolerance
• Resource Elasticity
• Queries Maintenance
• Load Balancing
• State Transfer
• Consistent state for exactly-once processing
• Robust Query Performance
25. Optimal Placement of Key Ranges
• Dynamic Hungarian Method
• Why Dynamic? To handle Resource Elasticity
• Rescalable Key Range as the smallest unit
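A minimal sketch of this idea (all names and constants are illustrative, not from the proposal): the key space is hashed into a fixed number of key ranges, and the range, not the individual key, is the unit that placement assigns and migration moves, so rescaling only remaps ranges.

```python
# Illustrative sketch: key ranges as the smallest rescalable unit.
NUM_RANGES = 8

def key_range(key):
    # Stable key -> range mapping; the range owns the key's state
    return hash(key) % NUM_RANGES

def assign_ranges(parallelism):
    # Round-robin baseline placement: range r -> instance r % parallelism
    return {r: r % parallelism for r in range(NUM_RANGES)}

placement = assign_ranges(parallelism=2)
rescaled = assign_ranges(parallelism=4)   # scale out: only ranges move
moved = [r for r in range(NUM_RANGES) if placement[r] != rescaled[r]]
r = key_range("user-42")                  # which range a key's state lives in
```

Because only whole ranges migrate, the amount of state to transfer on rescaling is bounded by the ranges that change owner (here `moved`), rather than requiring a full reshuffle of all keys.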
26. State of the Art

| | SEEP | Apache Spark | AIM | Ding et al. | Apache Flink | Naiad | ChronoStream | System X |
|---|---|---|---|---|---|---|---|---|
| State Distribution Pattern | Distributed | Partitioned | Distributed | Partitioned | Partitioned | Partitioned | Partitioned | Distributed |
| Fault Tolerance | Async local checkp. w/ log recovery | RDD lineage & interm. RDD checkp. | Log-based | Periodic checkp. | Upstream backup & global async checkp. | Sync global checkpoint | Slice reconstr. w/ async delta checkp. | Upstream backup & async incr. checkp. & handover |
| Job Rescaling | Threshold | Manual | Manual | N/A | Manual | Manual | Horizontal & Vertical | Dynamic Horizontal & Vertical |
| Load Balancing | Hash | Hash | Hash | Hash | Hash | Hash | Hash | Hybrid: hash w/ dynamic repart. |
Editor's Notes
This talk is structured as follows: I will first give you an insight into the core aspects of my proposal, then I will walk you through the research issues and how I intend to proceed in order to assess my work.
Before we dig into the details, I need to explain how distributed stateful stream processing is done today. A streaming job is defined as a weakly connected DAG. Streams are ingested into a stream processing system through source vertices. Each input tuple is optionally keyed. In order to exploit parallelism, we use hash partitioning to shuffle the input tuples onto downstream operators. Each of those operators may be stateful, i.e., it holds some state according to its logic. Each parallel operator processes a range of keys depending on how the input is partitioned. State is internally stored and managed. Each parallel instance couples a stream operator, executing a UDF, with a state storage that the UDF reads from and writes to. Each parallel instance can work only on its internal state. The global state of the topology is checkpointed from time to time.
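The routing described above can be sketched in a few lines; this is a toy model under assumed names (`KeyedOperator`, `route`), not the implementation of any particular system.

```python
# Minimal sketch of hash-partitioned, keyed, stateful stream processing.
class KeyedOperator:
    """One parallel instance: owns the state for the keys routed to it."""
    def __init__(self):
        self.state = {}                       # internal, locally managed state

    def process(self, key, value):
        # Example UDF: a running sum per key
        self.state[key] = self.state.get(key, 0) + value

def route(key, parallelism):
    # Hash partitioning: the same key always reaches the same instance
    return hash(key) % parallelism

parallelism = 3
instances = [KeyedOperator() for _ in range(parallelism)]

for key, value in [("a", 1), ("b", 2), ("a", 3), ("c", 5)]:
    instances[route(key, parallelism)].process(key, value)
```

The key property this illustrates is co-partitioning: all state for key `"a"` lives in exactly one instance, which is what lets each instance work only on its internal state.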
State introduces new challenges: we need state management techniques to support fault tolerance, resource elasticity, queries maintenance, and load balancing, while continuing to process the input streams. Currently, there are systems and research papers addressing a subset of those techniques, yet they constrain their focus to partitioned state or partially distributed state. Moreover, the state sizes they consider are on the order of hundreds of gigabytes.
In my opinion, those assumptions limit the capabilities and the supported use cases. Let's consider the following real-world deployment scenario. We run an online marketplace and we need always up-to-date analytics about our platform: we want to perform on-the-fly recommendations (thus, we use collaborative filtering), fraud detection, and natural language processing to improve the user experience within our platform. The sizes of those models grow with the number of items and users. Furthermore, we want to compute non-ML analytics, e.g., heavy hitters and temporal aggregations/joins. This adds more data to our global state.
We end up with a fairly complex topology with parallel operators holding internal state. ML algorithms require mutable shared state: one parallel instance, while processing its substream, might trigger an update to a partition of the state that is held by another parallel instance. Moreover, as we want to perform stream processing with exactly-once processing guarantees, we need stateful sources and stateful sinks.
Therefore, we need to address spikes in the ingestion rate, meaning we need to add or remove computing resources. We need to perform load balancing because there could be skew in the key distribution, so some parallel instances could end up with larger state shards. We need to address fault tolerance as we perform all these computations in an online fashion. Last but not least, we may need to migrate state among different operational environments: we might have many development environments, staging environments, and production, and at some point we might need to migrate state from one cluster to another in order to hand over the computation between them.
To support these use cases, my research goal is to improve the aforementioned state management techniques when shared mutable state is involved and state reaches terabyte sizes.
The problems behind providing those state management techniques in the presence of very large distributed state deal with state transfer: in order to scale up or balance load, we need to copy state from one node to another, which is hardly feasible when large state is involved. The shared mutable nature of the state must not undermine its consistency when providing exactly-once processing. Moreover, a streaming system has to provide robust query performance, and the main KPIs here are high throughput as well as low latency.
To address the aforementioned problems, the solution I propose defines a replication protocol (à la Hadoop) that creates a replica group for each key range and replicates it Q times. The replica groups are kept in sync through incremental checkpoints.
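A hedged sketch of this replication idea, with all class and method names being illustrative assumptions: the primary for a key range tracks what changed since the last checkpoint and ships only that delta to its Q replicas, rather than the full state.

```python
Q = 2  # replication factor (illustrative value)

class Primary:
    """Primary owner of a key range: tracks changes for incremental checkpoints."""
    def __init__(self):
        self.state, self.dirty = {}, {}

    def update(self, key, value):
        self.state[key] = self.state.get(key, 0) + value
        self.dirty[key] = self.state[key]     # record entries changed since last checkpoint

    def incremental_checkpoint(self):
        delta, self.dirty = self.dirty, {}    # ship only the changed entries
        return delta

class Replica:
    """Member of the replica group: applies deltas to stay in sync."""
    def __init__(self):
        self.state = {}

    def apply(self, delta):
        self.state.update(delta)

primary = Primary()
replicas = [Replica() for _ in range(Q)]

primary.update("a", 1)
primary.update("a", 2)
delta = primary.incremental_checkpoint()
for r in replicas:
    r.apply(delta)                            # replica group now in sync with the primary
```

The point of the incremental checkpoint is that its size is proportional to the changes since the last sync, not to the (potentially terabyte-scale) total state.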
We need an optimal placement schema for those replica groups in order to minimize the migration cost. Here we plan to use the dynamic Hungarian method to also support dynamic rescaling of operator parallelism. For those of you who are not familiar with the Hungarian method: it is an algorithm for solving maximum-weight matchings on bipartite graphs. Here, our bipartite graph models how we place replica groups onto the parallel instances.
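For intuition only: the Hungarian method solves the assignment problem (optimal matching on a weighted bipartite graph). The brute-force search below computes the same optimum on a tiny made-up cost matrix; the actual algorithm runs in O(n³), and its dynamic variant repairs the solution after cost changes instead of re-solving from scratch.

```python
from itertools import permutations

# cost[r][i] = migration cost of placing replica group r on instance i
# (made-up numbers for illustration)
cost = [
    [4, 1, 3],
    [2, 0, 5],
    [3, 2, 2],
]

def optimal_assignment(cost):
    """Exhaustive min-cost assignment; same optimum the Hungarian method finds."""
    n = len(cost)
    best = min(permutations(range(n)),
               key=lambda p: sum(cost[r][p[r]] for r in range(n)))
    return list(best), sum(cost[r][best[r]] for r in range(n))

assignment, total = optimal_assignment(cost)
# assignment[r] is the instance chosen for replica group r
```

In the placement problem above, the left vertices are replica groups (or key ranges), the right vertices are parallel instances, and the edge weights are migration costs.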
Last but not least, we need a handover protocol that enables smoothly moving the computation of a key range between the primary operator instance and one of its replica groups. I will describe how this handover protocol works in the next slides, but before moving forward, I must quickly summarize: this protocol leverages the optimally placed replica groups to move the processing of a key range away from an overloaded instance, for example, or from a failing node, or onto a newly provisioned instance.
I am going to show how the handover protocol works. To make the explanation easier, we'll consider a scenario with load imbalance. Let's also assume that each colour marks a different key range, so tuples with the same colour influence the state of the same key range. For instance, we see the yellow tuples flowing from S1 and S2 to OP1. The replication factor is set to 1, and the primary state is incrementally migrated to its replica group. The same holds for green, brown, and blue.
Suppose the system detects, according to some load balancing policy, that instance number two is overloaded. Overloaded here means either that a parallel instance cannot keep up with its ingestion rate (leading to backpressure) or that the size of the state of its key ranges is hitting the instance's physical storage limit. When the system detects such a scenario, it decides, according to some policy, that it has to migrate the processing of the green keys from the 2nd to the 3rd instance. The system could do this either through a centralized entity or through consensus. It then tells the sources to inject a KeyMove marker, which informs the instances to migrate the processing. Please note that after the markers flow on the channels to the 2nd instance, no green tuples follow. Vice versa, on the channels to instance three, green tuples start flowing from the sources after the markers are injected.
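The marker's channel-ordering guarantee can be sketched as follows (a rough model, not the proposed implementation; `MARKER`, `source_emit`, and the channel layout are assumptions): once the source injects the marker, it reroutes the moving key range, so on the old owner's channel no tuples of that range ever follow the marker.

```python
# Illustrative KeyMove marker injection at the source.
MARKER = ("KEY_MOVE", "green")

def source_emit(tuples, moving_range, old, new, channels):
    """Inject the marker on both channels, then reroute the moving range."""
    marker_sent = False
    for rng, payload in tuples:
        if not marker_sent:
            channels[old].append(MARKER)      # marks the last green tuple on the old channel
            channels[new].append(MARKER)      # marks the first green tuple on the new channel
            marker_sent = True
        target = new if rng == moving_range else old
        channels[target].append((rng, payload))

channels = {2: [], 3: []}                     # instance id -> in-flight channel contents
stream = [("green", 1), ("yellow", 2), ("green", 3)]
source_emit(stream, moving_range="green", old=2, new=3, channels=channels)
```

Because the marker is ordered within each FIFO channel, the old instance knows it has seen its final green tuple, and the new instance knows every green tuple it receives belongs to the post-handover epoch.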
Upon receiving the markers, parallel instance number two generates a new incremental checkpoint and sends it to the third instance. According to some user-defined state merging policy, there are two scenarios on the third instance. If the state update has the associative property, incoming tuples directly update the replica group; if it does not, the incoming tuples are buffered, like in this case, and the instance then merges the old replica with the last incremental checkpoint and the buffered tuples. This guarantees eventual consistency of the state after the handover is complete.
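The buffering path can be sketched like this (function name, data shapes, and values are all illustrative assumptions): the new owner first supersedes its stale replica entries with the final incremental checkpoint, then replays the tuples it buffered during the handover.

```python
from collections import defaultdict

def buffered_handover(old_replica, last_delta, buffered_tuples):
    """Non-associative merge path: old replica + final incremental
    checkpoint, then replay of the tuples buffered during handover."""
    state = dict(old_replica)
    state.update(last_delta)                 # checkpoint entries supersede stale replica entries
    merged = defaultdict(int, state)
    for key, value in buffered_tuples:       # replay buffered input in arrival order
        merged[key] += value
    return dict(merged)

new_state = buffered_handover(
    old_replica={"g1": 10, "g2": 4},         # replica as of the previous sync
    last_delta={"g1": 12},                   # g1 changed on the primary since then
    buffered_tuples=[("g1", 1), ("g3", 2)],  # arrived at the new owner during handover
)
```

With an associative state update, the buffering step could be skipped and tuples applied on arrival, since the order of merging checkpoint and tuples would not affect the final state.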
Finally, we have the processing of the green keys completely migrated from the second instance to the third. A new replica is going to be created on the 4th instance, and the previous state is going to be discarded from instance number two. As there is no experimental evaluation yet, when and how to perform this last step might require further investigation in order to achieve robust and consistent processing.
Now the next question is: how to assess the proposed protocols and how to declare success? To this end, I plan to define a set of metrics that stress some critical aspects of the system. Indeed, the protocols should have a negligible effect on query processing while improving cluster resource utilization and preventing bottlenecks such as backpressure. Furthermore, those protocols should never undermine the consistency and the exactly-once processing guarantees of the system.
Since this is just a proposal and I have no experimental results, I think it is too early to provide a conclusion; therefore, I would like to point out some future directions my PhD could take once the above protocols are in place. First of all, we would finally have a system providing true continuous stream processing, because as of today no open system fully achieves that. Furthermore, I am assuming the system has shared mutable state in place, yet as there is no complete system providing such a type of state, I will probably need to spend some research effort on it. Nevertheless, shared mutable state might open new challenges, such as how to scale it in the presence of streaming HTAP workloads. Investigating new hardware trends might also be an interesting research activity, as well as applying data compression and approximation to reduce state size. Of course, I do not plan to do all of them in my PhD, only the most interesting ones research-wise.