Efficient Migration of Very Large Distributed State for Scalable Stream Processing
1. Efficient Migration of Very Large Distributed State for Scalable Stream Processing
PhD Candidate: Bonaventura Del Monte
Advisors: Prof. Dr. Volker Markl, Prof. Dr. Tilmann Rabl
PhD Workshop, VLDB 2017
This work has been partially funded by the European Union’s Horizon 2020 research and innovation program under grant agreement n° 687691
2. Outline
• Research Goal
• Problem Statement
• Proposed Solution
• Research Issues
• Evaluation Plan
• Conclusion and future directions
3. Distributed Stateful Stream Processing
[Figure: dataflow of sources S1 and S2 into operators OP1, OP2, OP3; each parallel instance pairs a stream processor executing a UDF with its state storage]
• State is co-partitioned with the input stream by key
• State is internally stored and managed
4. State Management In Current Systems
• Fault-tolerance
• Resource elasticity
• Queries maintenance
• Load balancing
• Partitioned State
• Partially Distributed State
• Hundreds of gigabytes
5. Motivational Example: a Real-World Deployment
• Many analytics executed at the same time:
• Machine Learning models, e.g., collaborative filtering, fraud detection, NLP: 100s of GB per model
• Different types of temporal aggregations/joins: 100s of GB
6. Motivational Example: a Real-World Deployment
[Figure: topology with sources S1 and S2, operators OP1, OP2, OP3, and sinks SINK1 and SINK2; sources, operators, and sinks all hold state]
7. Motivational Example: a Real-World Deployment
[Figure: the stateful topology runs on CLUSTER A, with CLUSTER B providing additional computing resources; required actions: add/remove resources, handle failures, balance load, migrate state]
8. Research Goal
• Fault-tolerance
• Resource Elasticity
• Queries Maintenance
• Load Balancing
• Distributed State a.k.a. Shared Mutable State
• Terabyte Sizes
9. Problem Statement
• Fault-tolerance
• Resource Elasticity
• Queries Maintenance
• Load Balancing
• State Transfer
• Consistent state for exactly-once processing
• Robust Query Performance
25. Optimal Placement of Key Ranges
• Dynamic Hungarian Method
• Why Dynamic? To handle Resource Elasticity
• Rescalable Key Range as the smallest unit
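A minimal sketch of this idea (all names and constants are illustrative, not from the proposal): the key space is hashed into a fixed number of key ranges, and the range, not the individual key, is the unit that placement assigns and migration moves, so rescaling only remaps ranges.

```python
# Illustrative sketch: key ranges as the smallest rescalable unit.
NUM_RANGES = 8

def key_range(key):
    # Stable key -> range mapping; the range owns the key's state
    return hash(key) % NUM_RANGES

def assign_ranges(parallelism):
    # Round-robin baseline placement: range r -> instance r % parallelism
    return {r: r % parallelism for r in range(NUM_RANGES)}

placement = assign_ranges(parallelism=2)
rescaled = assign_ranges(parallelism=4)   # scale out: only ranges move
moved = [r for r in range(NUM_RANGES) if placement[r] != rescaled[r]]
r = key_range("user-42")                  # which range a key's state lives in
```

Because only whole ranges migrate, the amount of state to transfer on rescaling is bounded by the ranges that change owner (here `moved`), rather than requiring a full reshuffle of all keys.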
26. State of the Art

| | SEEP | Apache Spark | AIM | Ding et al. | Apache Flink | Naiad | ChronoStream | System X |
|---|---|---|---|---|---|---|---|---|
| State Distribution Pattern | Distributed | Partitioned | Distributed | Partitioned | Partitioned | Partitioned | Partitioned | Distributed |
| Fault Tolerance | Async local checkp. w/ log recovery | RDD lineage & interm. RDD checkp. | Log-based | Periodic checkp. | Upstream backup & global async checkp. | Sync global checkpoint | Slice reconstr. w/ async delta checkp. | Upstream backup & async incr. checkp. & handover |
| Job Rescaling | Threshold | Manual | Manual | N/A | Manual | Manual | Horizontal & Vertical | Dynamic Horizontal & Vertical |
| Load Balancing | Hash | Hash | Hash | Hash | Hash | Hash | Hash | Hybrid: hash w/ dynamic repart. |
Editor's Notes
This talk is structured as follows: I will first give you an insight into the core aspects of my proposal, then I will walk you through the research issues and how I intend to proceed in order to assess my work.
Before we dig into the details, I need to explain how distributed stateful stream processing is done today. A streaming job is defined as a weakly connected DAG. Streams are ingested into a stream processing system through source vertices. Each input tuple is optionally keyed. In order to exploit parallelism, we use hash partitioning to shuffle the input tuples onto downstream operators. Each of those operators may be stateful, i.e., it holds some state according to its logic. Each parallel operator processes a range of keys depending on how the input is partitioned. State is internally stored and managed. Each parallel instance couples a stream operator, executing a UDF, with a state storage that the UDF reads from and writes to. Each parallel instance can work only on its internal state. The global state of the topology is checkpointed from time to time.
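The routing described above can be sketched in a few lines; this is a toy model under assumed names (`KeyedOperator`, `route`), not the implementation of any particular system.

```python
# Minimal sketch of hash-partitioned, keyed, stateful stream processing.
class KeyedOperator:
    """One parallel instance: owns the state for the keys routed to it."""
    def __init__(self):
        self.state = {}                       # internal, locally managed state

    def process(self, key, value):
        # Example UDF: a running sum per key
        self.state[key] = self.state.get(key, 0) + value

def route(key, parallelism):
    # Hash partitioning: the same key always reaches the same instance
    return hash(key) % parallelism

parallelism = 3
instances = [KeyedOperator() for _ in range(parallelism)]

for key, value in [("a", 1), ("b", 2), ("a", 3), ("c", 5)]:
    instances[route(key, parallelism)].process(key, value)
```

The key property this illustrates is co-partitioning: all state for key `"a"` lives in exactly one instance, which is what lets each instance work only on its internal state.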
State introduces new challenges: we need state management techniques to support fault tolerance, resource elasticity, queries maintenance, and load balancing, while continuing to process the input streams. Currently, there are systems and research papers addressing a subset of those techniques, yet they constrain their focus to partitioned state or partially distributed state. Moreover, the state sizes they consider are on the order of hundreds of gigabytes.
In my opinion, those assumptions limit the capabilities and the supported use cases. Let's consider the following real-world deployment scenario. We run an online marketplace and we need always up-to-date analytics about our platform: we want to perform on-the-fly recommendations (thus, we use collaborative filtering), fraud detection, and natural language processing to improve the user experience within our platform. The sizes of those models grow with the number of items and users. Furthermore, we want to compute non-ML analytics, e.g., heavy hitters and temporal aggregations/joins. This adds more data to our global state.
We end up with a fairly complex topology with parallel operators holding internal state. ML algorithms require mutable shared state: one parallel instance, while processing its substream, might trigger an update to a partition of the state that is held by another parallel instance. Moreover, as we want to perform stream processing with exactly-once processing guarantees, we need stateful sources and stateful sinks.
Therefore, we need to address spikes in the ingestion rate, meaning we need to add or remove computing resources. We need to perform load balancing because there could be skew in the key distribution, so some parallel instances could end up with larger state shards. We need to address fault tolerance as we perform all these computations in an online fashion. Last but not least, we may need to migrate state among different operational environments: we might have many development environments, staging environments, and production, and at some point we might need to migrate state from one cluster to another in order to hand over the computation between them.
To support these use cases, my research goal is to improve the aforementioned state management techniques when shared mutable state is involved and state reaches terabyte sizes.
The problems behind providing those state management techniques in the presence of very large distributed state deal with state transfer: in order to scale up or balance load, we need to copy state from one node to another, which is hardly feasible when large state is involved. The shared mutable nature of the state must not undermine its consistency when providing exactly-once processing. Moreover, a streaming system has to provide robust query performance, and the main KPIs here are high throughput as well as low latency.
To address the aforementioned problems, the solution I propose defines a replication protocol (à la Hadoop) that creates a replica group for each key range and replicates it Q times. The replica groups are kept in sync through incremental checkpoints.
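A hedged sketch of this replication idea, with all class and method names being illustrative assumptions: the primary for a key range tracks what changed since the last checkpoint and ships only that delta to its Q replicas, rather than the full state.

```python
Q = 2  # replication factor (illustrative value)

class Primary:
    """Primary owner of a key range: tracks changes for incremental checkpoints."""
    def __init__(self):
        self.state, self.dirty = {}, {}

    def update(self, key, value):
        self.state[key] = self.state.get(key, 0) + value
        self.dirty[key] = self.state[key]     # record entries changed since last checkpoint

    def incremental_checkpoint(self):
        delta, self.dirty = self.dirty, {}    # ship only the changed entries
        return delta

class Replica:
    """Member of the replica group: applies deltas to stay in sync."""
    def __init__(self):
        self.state = {}

    def apply(self, delta):
        self.state.update(delta)

primary = Primary()
replicas = [Replica() for _ in range(Q)]

primary.update("a", 1)
primary.update("a", 2)
delta = primary.incremental_checkpoint()
for r in replicas:
    r.apply(delta)                            # replica group now in sync with the primary
```

The point of the incremental checkpoint is that its size is proportional to the changes since the last sync, not to the (potentially terabyte-scale) total state.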
We need an optimal placement schema for those replica groups in order to minimize the migration cost. Here we plan to use the dynamic Hungarian method to also support dynamic rescaling of operator parallelism. For those of you who are not familiar with the Hungarian method: it is an algorithm for solving maximum-weight matchings on bipartite graphs. Here, our bipartite graph models how we place replica groups onto the parallel instances.
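For intuition only: the Hungarian method solves the assignment problem (optimal matching on a weighted bipartite graph). The brute-force search below computes the same optimum on a tiny made-up cost matrix; the actual algorithm runs in O(n³), and its dynamic variant repairs the solution after cost changes instead of re-solving from scratch.

```python
from itertools import permutations

# cost[r][i] = migration cost of placing replica group r on instance i
# (made-up numbers for illustration)
cost = [
    [4, 1, 3],
    [2, 0, 5],
    [3, 2, 2],
]

def optimal_assignment(cost):
    """Exhaustive min-cost assignment; same optimum the Hungarian method finds."""
    n = len(cost)
    best = min(permutations(range(n)),
               key=lambda p: sum(cost[r][p[r]] for r in range(n)))
    return list(best), sum(cost[r][best[r]] for r in range(n))

assignment, total = optimal_assignment(cost)
# assignment[r] is the instance chosen for replica group r
```

In the placement problem above, the left vertices are replica groups (or key ranges), the right vertices are parallel instances, and the edge weights are migration costs.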
Last but not least, we need a handover protocol that enables smoothly moving the computation of a key range between the primary operator instance and one of its replica groups. I will describe how this handover protocol works in the next slides, but before moving forward, I must quickly summarize: this protocol leverages the optimally placed replica groups to move the processing of a key range away from an overloaded instance, for example, or from a failing node, or onto a newly provisioned instance.
I am going to show how the handover protocol works. To make the explanation easier, we'll consider a scenario with load imbalance. Let's also assume that each colour marks a different key range, so tuples with the same colour influence the state of the same key range. For instance, we see the yellow tuples flowing from S1 and S2 to OP1. The replication factor is set to 1, and the primary state is incrementally migrated to its replica group. The same holds for green, brown, and blue.
Suppose the system detects, according to some load balancing policy, that instance number two is overloaded. Overloaded here means either that a parallel instance cannot keep up with its ingestion rate (leading to backpressure) or that the size of the state of its key ranges is hitting the instance's physical storage limit. When the system detects such a scenario, it decides, according to some policy, that it has to migrate the processing of the green keys from the 2nd to the 3rd instance. The system could do this either through a centralized entity or through consensus. It then tells the sources to inject a KeyMove marker, which informs the instances to migrate the processing. Please note that after the markers flow on the channels to the 2nd instance, no green tuples follow. Vice versa, on the channels to instance three, green tuples start flowing from the sources after the markers are injected.
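The marker's channel-ordering guarantee can be sketched as follows (a rough model, not the proposed implementation; `MARKER`, `source_emit`, and the channel layout are assumptions): once the source injects the marker, it reroutes the moving key range, so on the old owner's channel no tuples of that range ever follow the marker.

```python
# Illustrative KeyMove marker injection at the source.
MARKER = ("KEY_MOVE", "green")

def source_emit(tuples, moving_range, old, new, channels):
    """Inject the marker on both channels, then reroute the moving range."""
    marker_sent = False
    for rng, payload in tuples:
        if not marker_sent:
            channels[old].append(MARKER)      # marks the last green tuple on the old channel
            channels[new].append(MARKER)      # marks the first green tuple on the new channel
            marker_sent = True
        target = new if rng == moving_range else old
        channels[target].append((rng, payload))

channels = {2: [], 3: []}                     # instance id -> in-flight channel contents
stream = [("green", 1), ("yellow", 2), ("green", 3)]
source_emit(stream, moving_range="green", old=2, new=3, channels=channels)
```

Because the marker is ordered within each FIFO channel, the old instance knows it has seen its final green tuple, and the new instance knows every green tuple it receives belongs to the post-handover epoch.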
Upon receiving the markers, parallel instance number two generates a new incremental checkpoint and sends it to the third instance. According to some user-defined state merging policy, there are two scenarios on the third instance. If the state update has the associative property, incoming tuples directly update the replica group; if it does not, the incoming tuples are buffered, like in this case, and the instance then merges the old replica with the last incremental checkpoint and the buffered tuples. This guarantees eventual consistency of the state after the handover is complete.
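The buffering path can be sketched like this (function name, data shapes, and values are all illustrative assumptions): the new owner first supersedes its stale replica entries with the final incremental checkpoint, then replays the tuples it buffered during the handover.

```python
from collections import defaultdict

def buffered_handover(old_replica, last_delta, buffered_tuples):
    """Non-associative merge path: old replica + final incremental
    checkpoint, then replay of the tuples buffered during handover."""
    state = dict(old_replica)
    state.update(last_delta)                 # checkpoint entries supersede stale replica entries
    merged = defaultdict(int, state)
    for key, value in buffered_tuples:       # replay buffered input in arrival order
        merged[key] += value
    return dict(merged)

new_state = buffered_handover(
    old_replica={"g1": 10, "g2": 4},         # replica as of the previous sync
    last_delta={"g1": 12},                   # g1 changed on the primary since then
    buffered_tuples=[("g1", 1), ("g3", 2)],  # arrived at the new owner during handover
)
```

With an associative state update, the buffering step could be skipped and tuples applied on arrival, since the order of merging checkpoint and tuples would not affect the final state.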
Finally, we have the processing of the green keys completely migrated from the second instance to the third. A new replica is going to be created on the 4th instance, and the previous state is going to be discarded from instance number two. As there is no experimental evaluation yet, when and how to perform this last step might require further investigation in order to achieve robust and consistent processing.
Now the next question is: how to assess the proposed protocols and how to declare success? To this end, I plan to define a set of metrics that stress some critical aspects of the system. Indeed, the protocols should have a negligible effect on query processing while improving cluster resource utilization and preventing bottlenecks such as backpressure. Furthermore, those protocols should never undermine the consistency and the exactly-once processing guarantees of the system.
Since this is just a proposal and I have no experimental results, I think it is too early to provide a conclusion; therefore, I would like to point out some future directions my PhD could take once the above protocols are in place. First of all, we would finally have a system providing true continuous stream processing, because as of today no open system fully achieves that. Furthermore, I am assuming the system has shared mutable state in place, yet as there is no complete system providing such a type of state, I will probably need to spend some research effort on it. Nevertheless, shared mutable state might open new challenges, such as how to scale it in the presence of streaming HTAP workloads. Investigating new hardware trends might also be an interesting research activity, as well as applying data compression and approximation to reduce state size. Of course, I do not plan to do all of them in my PhD, only the most interesting ones research-wise.