Current 2022 talk.
Speaker: Yingjun Wu
Title: Rethinking State Management in Cloud-Native Streaming Systems.
Abstract:
Stream processing is becoming increasingly essential for extracting business value from data in real time. To meet strict user-defined SLAs under constantly changing workloads, modern streaming systems have started taking advantage of the cloud for scalable and resilient resources. This new demand opens new opportunities and challenges for state management, which is at the core of streaming systems. Existing approaches typically use embedded key-value storage so that each worker can access state locally and achieve high performance. However, this design requires an external durable file system for checkpointing, makes redistributing state during scaling and migration complicated and time-consuming, and is prone to performance throttling. We therefore propose shared storage based on an LSM-tree. State is stored in cloud object storage, which makes it durable by default, and the high bandwidth of cloud storage enables fast recovery. The location of a state partition is decoupled from compute nodes, which makes scaling straightforward and more efficient. Compaction in this shared LSM-tree is globally coordinated, with opportunistic serverless boosting, instead of relying on individual compute nodes. We design a streaming-aware compaction and caching strategy to achieve smoother and better end-to-end performance.
2. https://www.risingwave.com/
About Us
• Yingjun Wu
• Founder @RisingWave Labs
• Software Engineer @AWS Redshift
• Researcher @IBM Research - Almaden
• Ph.D., National University of Singapore
• RisingWave Labs
• Series-A startup founded in January 2021
• Building RisingWave, a cloud-native streaming database
6. https://www.risingwave.com/
Stream Processing: Values and Costs
Unbounded
< 1 sec < 1 min < 10 min < 1 hour
Modified from: https://www.oreilly.com/content/ubers-case-for-incremental-processing-on-hadoop/
Result
freshness
Batch processing
Stream processing
Cost $$$$
Business value
7. https://www.risingwave.com/
Stream Processing: Values and Costs
Unbounded
< 1 sec < 1 min < 10 min < 1 hour
Modified from: https://www.oreilly.com/content/ubers-case-for-incremental-processing-on-hadoop/
Result
freshness
Batch processing
Stream processing
Business value
Cost $
30. https://www.risingwave.com/
State Management: Big Data Era
• Coupled compute-storage architecture
• Embarrassingly parallel execution
• Utilize resources in a brute-force manner
State
State
State
State
31. https://www.risingwave.com/
State Management: Big Data Era
• Consider joining two streams
• Impression stream
• Click stream
Output (adId, impressionTime, clickTime)
Impression (adId, impressionTime)
Click (adId, clickTime)
State
State
Hash table for impression stream
Hash table for click stream
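The join on this slide can be sketched as a symmetric hash join. This is a minimal illustration, not the speaker's implementation: the class name `SymmetricHashJoin`, the method names, and the integer timestamps are all assumptions made for the example. Each arriving event is stored in its own side's hash table and probed against the other side's table, which is why both tables count as operator state.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Output = Tuple[str, int, int]  # (adId, impressionTime, clickTime)


class SymmetricHashJoin:
    """Joins an impression stream and a click stream on adId (hypothetical sketch)."""

    def __init__(self) -> None:
        # The two hash tables are the operator's state.
        self.impressions: Dict[str, List[int]] = defaultdict(list)
        self.clicks: Dict[str, List[int]] = defaultdict(list)

    def on_impression(self, ad_id: str, impression_time: int) -> List[Output]:
        # Store the event on its own side, then probe the other side.
        self.impressions[ad_id].append(impression_time)
        return [(ad_id, impression_time, c) for c in self.clicks.get(ad_id, [])]

    def on_click(self, ad_id: str, click_time: int) -> List[Output]:
        self.clicks[ad_id].append(click_time)
        return [(ad_id, i, click_time) for i in self.impressions.get(ad_id, [])]


join = SymmetricHashJoin()
results: List[Output] = []
results += join.on_impression("ad-1", 100)  # no matching click yet
results += join.on_click("ad-1", 105)       # matches the stored impression
results += join.on_click("ad-2", 107)       # no matching impression yet
results += join.on_impression("ad-2", 110)  # matches the stored click
print(results)
```

Neither table ever shrinks in this sketch, so state grows with the input; a real operator would bound it, for example with a time window.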
32. https://www.risingwave.com/
State Management: Big Data Era
• Node (machine) is the minimum resource unit
• If you run out of compute/storage resources, just add more nodes!
Output (adId, impressionTime, clickTime)
Impression (adId, impressionTime)
Click (adId, clickTime)
Hash table for impression stream
Hash table for click stream
State
State
33. https://www.risingwave.com/
State Management: Big Data Era
• Node (machine) is the minimum resource unit
• If you run out of compute/storage resources, just add more nodes!
Output (adId, impressionTime, clickTime)
Impression (adId, impressionTime)
Click (adId, clickTime)
Hash table for impression stream
Hash table for click stream
State
State
State
State
State
State
This consumes too many resources!
38. https://www.risingwave.com/
State Management: Cloud Era
• Compute and storage resources are managed separately
• If running out of compute, just buy more compute instances!
• Storage resources can scale automatically!
Storage (S3)
Compute (EC2)
40. https://www.risingwave.com/
State Management: Cloud Era
• Consider joining two streams
• Impression stream
• Click stream
Output (adId, impressionTime, clickTime)
Impression (adId, impressionTime)
Click (adId, clickTime)
State
State
Hash table for impression stream
Hash table for click stream
41. https://www.risingwave.com/
State Management: Cloud Era
• Naïve solution: maintain state in remote cloud storage
Output (adId, impressionTime, clickTime)
Impression (adId, impressionTime)
Click (adId, clickTime)
Hash table for impression stream
Hash table for click stream
State
State
Stored in S3
Compute in EC2
43. https://www.risingwave.com/
State Management: Cloud Era
• Naïve solution: maintain state in remote cloud storage
• If running out of compute… then just add more EC2!
Output (adId, impressionTime, clickTime)
Impression (adId, impressionTime)
Click (adId, clickTime)
Hash table for impression stream
Hash table for click stream
State
State
Stored in S3
Compute in EC2
Compute in EC2
Compute in EC2
45. https://www.risingwave.com/
State Management: Cloud Era
• Naïve solution: maintain state in remote cloud storage
• If running out of storage… S3 will automatically scale itself!
Output (adId, impressionTime, clickTime)
Impression (adId, impressionTime)
Click (adId, clickTime)
Hash table for impression stream
Hash table for click stream
State
State
Stored in S3
Compute in EC2
46. https://www.risingwave.com/
State Management: Cloud Era
• Naïve solution: maintain state in remote cloud storage
Output (adId, impressionTime, clickTime)
Impression (adId, impressionTime)
Click (adId, clickTime)
Hash table for impression stream
Hash table for click stream
State
State
Stored in S3
Compute in EC2
State manipulation becomes remote access!
49. https://www.risingwave.com/
Tiered Storage
• Luckily, we can maintain data in different services
• EC2: “volatile” storage
• Super fast!
• Data will get lost if it’s not well replicated
• EBS: “semi-persistent” storage
• Fast
• 99.999% durability (5 nines)
• S3: persistent storage
• slow
• 99.999999999% durability (11 nines)
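To make the gap between "5 nines" and "11 nines" concrete, here is a small arithmetic sketch. It assumes the durability figures are annual per-item survival probabilities and that losses are independent; the function name `expected_annual_losses` and the item count are hypothetical.

```python
from fractions import Fraction


def expected_annual_losses(num_items: int, nines: int) -> Fraction:
    """Expected items lost per year at a durability of `nines` nines."""
    # Loss probability is 1 - durability, e.g. 1 - 0.99999 = 10^-5 for 5 nines.
    loss_probability = Fraction(1, 10 ** nines)
    return num_items * loss_probability


items = 10_000_000  # hypothetical number of stored items
five_nines = expected_annual_losses(items, 5)     # 99.999% durability
eleven_nines = expected_annual_losses(items, 11)  # 99.999999999% durability
print(float(five_nines), float(eleven_nines))
```

Exact fractions are used because `1 - 0.99999999999` is not represented exactly in floating point. Under these assumptions the two tiers differ by six orders of magnitude in expected loss.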
51. https://www.risingwave.com/
Tiered Storage for State Management
• Use an LSM-tree-like structure to maintain internal state across different storage media
EC2
EBS
S3
Hot data
Warm data
Cold data
Streaming data ingested in
Compaction
Compaction
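The hot/warm/cold layout with compaction between tiers can be sketched as a toy LSM-like store. This is an illustration under stated assumptions, not RisingWave's storage engine: the class `TieredStateStore`, its thresholds, and the use of in-memory dicts to stand in for EC2 memory, EBS, and S3 are all hypothetical.

```python
from typing import Dict, List, Optional


class TieredStateStore:
    """Toy three-tier store: hot (memory), warm (local disk), cold (object storage)."""

    def __init__(self, hot_limit: int = 2, warm_run_limit: int = 2) -> None:
        self.hot: Dict[str, str] = {}          # mutable in-memory buffer
        self.warm: List[Dict[str, str]] = []   # flushed runs, newest first
        self.cold: List[Dict[str, str]] = []   # compacted runs, newest first
        self.hot_limit = hot_limit
        self.warm_run_limit = warm_run_limit

    def put(self, key: str, value: str) -> None:
        # Streaming data is ingested into the hot tier first.
        self.hot[key] = value
        if len(self.hot) >= self.hot_limit:
            self._flush()

    def get(self, key: str) -> Optional[str]:
        # Search newest data first: hot, then warm runs, then cold runs.
        for run in [self.hot, *self.warm, *self.cold]:
            if key in run:
                return run[key]
        return None

    def _flush(self) -> None:
        # Move the full hot buffer down to the warm tier as an immutable run.
        self.warm.insert(0, self.hot)
        self.hot = {}
        if len(self.warm) >= self.warm_run_limit:
            self._compact()

    def _compact(self) -> None:
        # Merge warm runs oldest first so newer values overwrite older ones.
        merged: Dict[str, str] = {}
        for run in reversed(self.warm):
            merged.update(run)
        self.cold.insert(0, merged)
        self.warm = []


store = TieredStateStore()
for key, value in [("a", "1"), ("b", "2"), ("a", "3"), ("c", "4"), ("d", "5")]:
    store.put(key, value)
print(store.get("a"), store.get("d"), store.get("z"))
```

Reads get slower the further down they have to search, which is why keeping hot data in the upper tiers matters.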
58. https://www.risingwave.com/
Rethinking State Management Design
State State State
States
State State State
Compute
nodes
Persistent
storage
States
Checkpoint
Coupled compute-storage architecture Decoupled compute-storage architecture
Cache Cache Cache
“state as checkpoint”
Small state?
Big state?
68. https://www.risingwave.com/
Failure Recovery
State State State
States
State State State
Compute
nodes
Persistent
storage
States
Checkpoint
Coupled compute-storage architecture Decoupled compute-storage architecture
Cache Cache Cache
“state as checkpoint”
State
Recover from
checkpoint
71. https://www.risingwave.com/
Failure Recovery
State State State
States
State State State
Compute
nodes
Persistent
storage
States
Checkpoint
Coupled compute-storage architecture Decoupled compute-storage architecture
Cache Cache Cache
“state as checkpoint”
Read from
remote state
State
Recover from
checkpoint
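The decoupled side of this diagram, where a replacement node reads from remote state instead of restoring a checkpoint, can be sketched as a read-through cache. The classes `SharedStateStore` and `ComputeNode` are hypothetical, and the write-through policy is a simplification chosen for brevity; the talk does not specify how writes are persisted.

```python
from typing import Dict, Optional


class SharedStateStore:
    """Stands in for durable shared storage such as an object store."""

    def __init__(self) -> None:
        self.data: Dict[str, int] = {}
        self.remote_reads = 0

    def read(self, key: str) -> Optional[int]:
        self.remote_reads += 1
        return self.data.get(key)

    def write(self, key: str, value: int) -> None:
        self.data[key] = value


class ComputeNode:
    """Holds only a cache; the source of truth lives in the shared store."""

    def __init__(self, store: SharedStateStore) -> None:
        self.store = store
        self.cache: Dict[str, int] = {}

    def get(self, key: str) -> Optional[int]:
        if key not in self.cache:
            value = self.store.read(key)  # cache miss: read from remote state
            if value is None:
                return None
            self.cache[key] = value
        return self.cache[key]

    def put(self, key: str, value: int) -> None:
        self.cache[key] = value
        self.store.write(key, value)  # write-through (assumption, for simplicity)


shared = SharedStateStore()
node = ComputeNode(shared)
node.put("ad-1", 3)
del node  # simulate a crash: the cache is lost, the shared state is not

replacement = ComputeNode(shared)     # starts immediately with an empty cache
first = replacement.get("ad-1")       # served by a remote read
second = replacement.get("ad-1")      # served from the warmed cache
print(first, second, shared.remote_reads)
```

The replacement node does no bulk restore before it starts serving; it pays for remote reads only on the keys it actually touches.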
76. https://www.risingwave.com/
Elastic Scaling
State State State
Compute
nodes
Persistent
storage
States
Checkpoint
Coupled compute-storage architecture
Scale out State State State
States
Decoupled compute-storage architecture
Cache Cache Cache
“state as checkpoint”
Scale out
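With state held in shared storage, scaling out can be reduced to changing which node owns which partition. The sketch below assumes a fixed number of state partitions and a round-robin assignment; the function `assign_partitions` and the node names are hypothetical, and real systems may use other placement schemes.

```python
from typing import Dict, List


def assign_partitions(num_partitions: int, nodes: List[str]) -> Dict[int, str]:
    """Round-robin mapping from state partition to compute node."""
    return {p: nodes[p % len(nodes)] for p in range(num_partitions)}


before = assign_partitions(8, ["node-a", "node-b"])
after = assign_partitions(8, ["node-a", "node-b", "node-c", "node-d"])

# Partitions whose owner changed. In the decoupled design this is only a
# metadata update: the new owner starts with a cold cache and reads the
# partition from shared storage, so no state is copied between nodes.
moved = [p for p in before if before[p] != after[p]]
print(moved)
```

In the coupled design the same ownership change would require shipping each moved partition's state from the old node to the new one.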
81. https://www.risingwave.com/
Challenging Problems: #1
• LSM tree compaction
EC2
EBS
S3
Hot data
Warm data
Cold data
Compaction
Compaction
Compaction can result in performance drops!
Remote compaction?
Lambda function?
Still incurs high CPU utilization!
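Compaction is essentially a merge of sorted runs, which is why it is CPU-heavy wherever it runs. The sketch below is a generic k-way merge, not the talk's compaction strategy; the `Entry` layout and the `compact` function are assumptions made for the example.

```python
import heapq
from typing import Iterable, List, Tuple

Entry = Tuple[str, int, str]  # (key, sequence number, value)


def compact(runs: Iterable[List[Entry]]) -> List[Entry]:
    """Merge sorted runs, keeping only the newest version of each key.

    Each input run must already be sorted by key ascending, then by
    sequence number descending.
    """
    merged = heapq.merge(*runs, key=lambda e: (e[0], -e[1]))
    output: List[Entry] = []
    for entry in merged:
        # The first entry seen for a key has the highest sequence number.
        if not output or output[-1][0] != entry[0]:
            output.append(entry)
    return output


new_run = [("a", 7, "a2"), ("c", 8, "c1")]
old_run = [("a", 3, "a1"), ("b", 4, "b1")]
compacted = compact([new_run, old_run])
print(compacted)
```

Because the function only reads immutable runs and produces a new one, it can in principle run on any machine with access to shared storage, such as a dedicated compactor or a serverless function. Moving it off the compute nodes relocates the CPU cost rather than removing it.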
89. https://www.risingwave.com/
Challenging Problems: #3
• Implementing “state as checkpoint”
• Multi-version concurrency control
• Use “epoch” to identify versions
State State State
States
Cache Cache Cache
“state as checkpoint”
Decoupled compute-storage architecture
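One way to read "use epochs to identify versions" is a multi-version store in which every write carries an epoch and a checkpoint is simply the latest committed epoch. The talk does not give details, so the class `EpochStore` and its methods are assumptions for illustration only.

```python
from typing import Dict, List, Optional, Tuple


class EpochStore:
    """Toy multi-version store where each write is tagged with an epoch."""

    def __init__(self) -> None:
        # key -> list of (epoch, value), appended in increasing epoch order
        self.versions: Dict[str, List[Tuple[int, str]]] = {}
        self.committed_epoch = 0

    def write(self, key: str, value: str, epoch: int) -> None:
        self.versions.setdefault(key, []).append((epoch, value))

    def commit(self, epoch: int) -> None:
        # Committing an epoch makes it the latest consistent snapshot.
        self.committed_epoch = max(self.committed_epoch, epoch)

    def read(self, key: str, epoch: Optional[int] = None) -> Optional[str]:
        # Return the newest value written at or before the requested epoch.
        snapshot = self.committed_epoch if epoch is None else epoch
        for version_epoch, value in reversed(self.versions.get(key, [])):
            if version_epoch <= snapshot:
                return value
        return None

    def recover(self) -> None:
        # Discard writes from epochs that never committed.
        for key, history in self.versions.items():
            self.versions[key] = [
                (e, v) for e, v in history if e <= self.committed_epoch
            ]


state = EpochStore()
state.write("count", "1", epoch=1)
state.commit(1)
state.write("count", "2", epoch=2)                 # epoch 2 is still in flight
committed_view = state.read("count")               # sees the committed epoch
in_flight_view = state.read("count", epoch=2)      # sees the uncommitted write
state.recover()                                    # simulate failure recovery
after_recovery = state.read("count", epoch=2)      # uncommitted write is gone
print(committed_view, in_flight_view, after_recovery)
```

Recovery here needs no separate checkpoint file: rolling back to the last committed epoch is the checkpoint.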
93. https://www.risingwave.com/
Performance Evaluation
• I will not show any performance numbers in this talk!
• Not a fan of performance “bench-marketing”
• The objective is to maximize cost efficiency, not performance
• Yes, we have the performance numbers, and they look nice!
• DM me if you want to read the performance report!
• yingjunwu@risingwave-labs.com