Designing Stateful Apps for
Cloud and Kubernetes
Evan Chan — Nov 10, 2020
State is hard!
THE hardest problem in distributed systems
@mathiasverraes
“There are only two hard problems in distributed
systems: 2. Exactly-once delivery 1. Guaranteed
order of messages 2. Exactly-once delivery”
What kind of state?
• Structured
• Semi-structured (logs, JSON, etc.)
• Graphs and networks
• Unstructured
• Config and passwords
• ML models and parameters
Characteristics of State
• Mutable vs Immutable
• Persistence (Temporary? Permanent? How permanent?)
• Availability
• Latency to retrieve and mutate
• Consistency
Stateless Kubernetes —
Just punt the state to the DB!
Kubernetes Stateless
“Stateless App”
RDS
Postgres
Op1 Op2 Op3 Op4 Kubernetes
S3
Kinesis
MongoDB
Container
Stateless - Where is my State?
Requests/
Events
Memory
App
ReadOnly Disk Images
Temp local disk
Cloud Storage
— not persistent
“BUT WAIT… I thought stateless will solve all
my problems??”
Observations about Stateless
• A pattern that works for many scenarios
• All state pushed to other services - $$$
• Latency - stateless means every state change involves a network round trip
• Recovery - all local state must be recovered
• Many cloud data services are cloud-specific (e.g. Dynamo, Kinesis) - going multi-cloud or moving clouds is a huge amount of work
• Keeping state consistent across the cluster can be tricky
Container
Stateless vs Serverless
Requests/
Events
Memory
App
ReadOnly Disk Images
Temp local disk
Function
Temp mem/disk
across invocations
Mem use within
invocation only
Container
Using Local State with Cloud Storage
Requests/
Events
Memory
App
ReadOnly Disk Images
Temp local disk
Cloud Storage (S3?)
— not persistent
Local
State
Local
State
Logs: Reasoning about State
“Stateless App”
Op1 Op2 Op3 Op4 Kubernetes
DB
DB2Event
Checkpoint
Logs: Reasoning about State
• A log of events and mutations is kept
• Checkpoints in the log represent snapshots of state
• Consistent state of system that can be recovered to
• Replaying the log allows predictable reconstruction of state and changes
• The foundation of all modern databases and data systems
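The checkpoint-and-replay idea above can be sketched in a few lines. This is a minimal in-memory illustration, not any particular database's implementation; the `(key, delta)` event shape and the names `EventLog`, `Checkpoint`, `apply_event`, and `recover` are assumptions made up for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Event = Tuple[str, int]  # hypothetical event shape: (key, delta)


@dataclass
class Checkpoint:
    offset: int            # number of log entries already folded into `state`
    state: Dict[str, int]  # snapshot of the state at that offset


@dataclass
class EventLog:
    events: List[Event] = field(default_factory=list)

    def append(self, event: Event) -> int:
        self.events.append(event)
        return len(self.events)  # offset just past the new entry


def apply_event(state: Dict[str, int], event: Event) -> None:
    """Apply one mutation to the state in place."""
    key, delta = event
    state[key] = state.get(key, 0) + delta


def take_checkpoint(log: EventLog, state: Dict[str, int]) -> Checkpoint:
    """Snapshot the current state together with the log position it reflects."""
    return Checkpoint(offset=len(log.events), state=dict(state))


def recover(log: EventLog, checkpoint: Checkpoint) -> Dict[str, int]:
    """Rebuild state: start from the snapshot, replay only the later events."""
    state = dict(checkpoint.state)
    for event in log.events[checkpoint.offset:]:
        apply_event(state, event)
    return state


log = EventLog()
live: Dict[str, int] = {}
for ev in [("a", 1), ("b", 5), ("a", 2)]:
    log.append(ev)
    apply_event(live, ev)

cp = take_checkpoint(log, live)  # snapshot after three events

for ev in [("b", -1), ("c", 7)]:
    log.append(ev)
    apply_event(live, ev)

# Simulate losing `live`: only the log and the checkpoint survive.
recovered = recover(log, cp)
```

Because replay is deterministic, `recovered` ends up equal to `live`. The checkpoint only bounds how much of the log has to be replayed.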
Container
Example: FiloDB
Requests/
Events
Memory
App
ReadOnly Disk Images
Temp local disk
Cassandra
Failure: recover state
from Kafka
Column
Cache
Lucene
Kafka
Millions of samples/sec
Thousands of chunk writes/sec
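One way to read the gap between millions of samples per second coming in and thousands of chunk writes per second going out is batching: samples accumulate in memory and are written out as larger chunks. The sketch below illustrates that general pattern only. It is not FiloDB's actual code, and `ChunkBuffer`, `chunk_size`, and `write_chunk` are hypothetical names.

```python
from typing import Callable, List, Tuple

Sample = Tuple[int, float]  # hypothetical sample shape: (timestamp, value)


class ChunkBuffer:
    """Accumulates samples in memory and writes them out in fixed-size chunks."""

    def __init__(self, chunk_size: int,
                 write_chunk: Callable[[List[Sample]], None]) -> None:
        self.chunk_size = chunk_size
        self.write_chunk = write_chunk  # stand-in for a write to the chunk store
        self.pending: List[Sample] = []

    def ingest(self, sample: Sample) -> None:
        self.pending.append(sample)
        if len(self.pending) >= self.chunk_size:
            self.flush()

    def flush(self) -> None:
        if self.pending:
            self.write_chunk(self.pending)
            self.pending = []  # start a fresh list; the written chunk is untouched


written: List[List[Sample]] = []  # stand-in for the persistent store
buf = ChunkBuffer(chunk_size=1000, write_chunk=written.append)
for t in range(2500):
    buf.ingest((t, float(t)))
```

After the loop, 2500 ingested samples have produced two chunk writes, and 500 samples are still only in memory. Those unflushed samples are exactly the local state that has to be recovered from the log after a failure.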
Logs == Reactive/Streams?
“App/Data Processing”
Events
Checkpoint
Output Data
Output Commands/
State Changes
Stateful Kubernetes —
Persistent Volumes
Container
Kubernetes Persistent Volumes
Requests/
Events
Memory
App
ReadOnly Disk Images
Cloud Storage
— not persistent
PV
• Persistent
• Survives pod restarts
Kubernetes Persistent Volumes
• Standard POSIX file semantics - just a mounted volume
• YAML config of volume type, size, replication factor, desired speed, etc.
• Local Persistent Volumes - basically an HDD/SSD
• Networked, Replicated, not shared (single pod attachment only)
• AWS EBS, GCE PD, Azure Disk, Ceph, ScaleIO
• Can be very close to local disk in performance
• Replicated, Shared Network Storage (Multi pod attach)
• AWS EFS, CephFS, GlusterFS, NFS
Sample PV provisioning .yaml
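apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: slow
provisioner: kubernetes.io/aws-ebs
parameters:
  type: io1          # EBS provisioned-IOPS SSD volume type
  iopsPerGB: "10"    # IOPS provisioned per GiB of requested volume size
  fsType: ext4       # filesystem created on the volume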
• Decide on storage characteristics at deploy time!
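A StorageClass only describes the kind of storage. An app asks for an actual volume with a PersistentVolumeClaim that names the class. The sketch below builds such a claim as a plain dictionary and prints it as JSON, which `kubectl apply -f` accepts as well as YAML. The claim name `app-data` and the size `100Gi` are assumptions for illustration. `slow` is the StorageClass from the sample above.

```python
import json


def pvc_manifest(name: str, storage_class: str, size: str) -> dict:
    """Build a PersistentVolumeClaim manifest for a single-pod (RWO) volume."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            # ReadWriteOnce: mounted read-write by a single node at a time
            "accessModes": ["ReadWriteOnce"],
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": size}},
        },
    }


# "app-data" and "100Gi" are hypothetical values chosen for the example.
manifest = pvc_manifest("app-data", "slow", "100Gi")
print(json.dumps(manifest, indent=2))
```

The storage characteristics (volume type, IOPS, filesystem) stay in the StorageClass, so the same claim can be pointed at a different class on another cloud by changing one string.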
Kubernetes StatefulSets: State Affinity
Pod 1
Memory
App
ReadOnly Disk Images
PV 1
Pod 2
Memory
App
ReadOnly Disk Images
PV 2
Pod 3
Memory
App
ReadOnly Disk Images
PV 3
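State affinity works because a StatefulSet gives each pod a stable name, and each claim created from a `volumeClaimTemplates` entry is named after the pod that owns it. The helper below spells out that naming convention. The StatefulSet name `app` and the template name `data` are assumptions for the example. Note that Kubernetes numbers ordinals from 0 by default, while the diagram counts pods from 1.

```python
from typing import List, Tuple


def pod_name(statefulset: str, ordinal: int) -> str:
    """StatefulSet pods are named <statefulset>-<ordinal>."""
    return f"{statefulset}-{ordinal}"


def pvc_name(claim_template: str, statefulset: str, ordinal: int) -> str:
    """Claims from volumeClaimTemplates are named <template>-<pod name>."""
    return f"{claim_template}-{pod_name(statefulset, ordinal)}"


def stable_identities(statefulset: str, claim_template: str,
                      replicas: int) -> List[Tuple[str, str]]:
    """List the (pod, claim) pairs a StatefulSet of this size would use."""
    return [
        (pod_name(statefulset, i), pvc_name(claim_template, statefulset, i))
        for i in range(replicas)
    ]


# "app" and "data" are hypothetical names.
ids = stable_identities("app", "data", 3)
for pod, claim in ids:
    print(f"{pod} -> {claim}")
```

When a pod is rescheduled it comes back with the same name and is bound to the same claim, so the restarted pod sees the files its predecessor wrote.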
Why Stateful Kubernetes?
• Run stateful services and databases yourself - to save $$
• You need local state persisted, or have a large amount of state
• Caching - lower latency
• ML models, iterative data transformations
• Want faster recovery for local state
• Need to work with local files (eg Lucene)
• Design for PVs: one abstraction, usable on any cloud
Replicated DBs on Kubernetes
Leader Pod
Memory
Postgres L
ReadOnly Disk Images
PV 1
Follower Pod
Memory
Postgres F
ReadOnly Disk Images
PV 2
Kubernetes PVs vs S3
• Persistent, fast Local State
• S3/Remote Storage only for backups
• Kafka/persistent logging eliminated or reduced
Pod 1
App
ReadOnly Disk Images
PV 1
Memory
Container
Kafka
App
ReadOnly Disk Images
Local Disk
S3
Local
State
Memory
Local
State
PV
• Persistent logging (Kafka) and cloud storage both essential
The Power of Replicated File Storage
• Don’t reinvent distributed coordination and replication in every data system/database.
• Reuse a solid data replication system.
Using Replicated Storage as a Building Block
App
RocksDB Lucene
Replicated PV
ML Model
Pod 1
Replicated Local State Using Kubernetes
App Shard 1
RocksDB Lucene
Replicated PV 1
SQLite
• Shard your app, each shard gets replicated storage
• Consistent snapshotting and state for different parts of your app
Pod 2
App Shard 2
RocksDB Lucene
Replicated PV 2
SQLite
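A rough sketch of "shard your app, each shard gets replicated storage": keys are routed to a shard by hash, and each shard keeps its state in an embedded SQLite file under its own directory. Here the directories are temporary folders standing in for the mount points of Replicated PV 1 and 2, and both shards live in one process purely for demonstration. In a real deployment each pod would own one shard. `shard_for`, `Shard`, `put`, and `get` are hypothetical names.

```python
import hashlib
import sqlite3
import tempfile
from pathlib import Path
from typing import List, Optional


def shard_for(key: str, num_shards: int) -> int:
    """Route a key to a shard with a hash that is stable across processes."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class Shard:
    """One app shard whose state lives in a SQLite file on its own volume."""

    def __init__(self, volume_dir: Path) -> None:
        volume_dir.mkdir(parents=True, exist_ok=True)
        self.db = sqlite3.connect(str(volume_dir / "state.db"))
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v TEXT)")
        self.db.commit()

    def put(self, key: str, value: str) -> None:
        with self.db:  # commits on success, rolls back on error
            self.db.execute(
                "INSERT OR REPLACE INTO kv (k, v) VALUES (?, ?)", (key, value))

    def get(self, key: str) -> Optional[str]:
        row = self.db.execute(
            "SELECT v FROM kv WHERE k = ?", (key,)).fetchone()
        return row[0] if row else None


# Temporary directories stand in for the PV mount points (an assumption).
root = Path(tempfile.mkdtemp())
shards: List[Shard] = [Shard(root / f"pv-{i}") for i in range(2)]


def put(key: str, value: str) -> None:
    shards[shard_for(key, len(shards))].put(key, value)


def get(key: str) -> Optional[str]:
    return shards[shard_for(key, len(shards))].get(key)


put("user:1", "alice")
put("user:2", "bob")
```

The app only handles routing. Durability and replication of each `state.db` file are left to the storage layer underneath the volume.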
Reactive Event Streaming + Stateful K8s!
Pod 1
App Shard 1
RocksDB Lucene
Replicated PV 1
SQLite
Pod 2
App Shard 2
RocksDB Lucene
Replicated PV 2
SQLite
Akka Cluster Sharding
Akka Event Sourcing etc.
Shared K8s PV for Machine Learning
• Shared networked Persistent Volume (FSx for Lustre)
• Training job writes files to FSx
• Kubernetes pods serve models from FSx
https://aws.amazon.com/blogs/storage/using-high-performance-persistent-storage-for-machine-learning-workloads-on-kubernetes/
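With one job writing model files and other pods reading them from the same volume, a reader should never see a half-written file. A common pattern is to write to a temporary file in the same directory and then rename it into place. The sketch below uses a local temporary directory in place of the shared mount, and the names `publish_model` and `load_model` are hypothetical. It assumes the file system provides atomic rename, which should be checked for the specific shared storage in use.

```python
import os
import tempfile
from pathlib import Path


def publish_model(shared_dir: Path, name: str, payload: bytes) -> Path:
    """Write a model file so readers see the old or the new version, never a partial one."""
    shared_dir.mkdir(parents=True, exist_ok=True)
    final_path = shared_dir / name
    # The temp file must be in the same directory so the rename stays on one file system.
    fd, tmp_name = tempfile.mkstemp(
        dir=str(shared_dir), prefix=name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to storage before publishing
        os.replace(tmp_name, final_path)  # atomic on POSIX file systems
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return final_path


def load_model(shared_dir: Path, name: str) -> bytes:
    """Serving side: read whichever complete version is currently published."""
    return (shared_dir / name).read_bytes()


# A temporary directory stands in for the shared volume mount (an assumption).
shared = Path(tempfile.mkdtemp())
publish_model(shared, "model.bin", b"weights-v1")
publish_model(shared, "model.bin", b"weights-v2")
current = load_model(shared, "model.bin")
```

After both publishes, `current` holds the second version and no temporary files are left behind in the directory.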
PVs vs Cloud Data Services
Cloud Data Services | Persistent Volumes
Replication and distribution handled by service/database | Data in volume is replicated (if replicated PV used). App needs to shard and handle coordination.
Each database/service has its own APIs | Standard POSIX volume
Additional network latency of cloud services | Varies, but options for latency and performance close to local drives
Each data service has its own consistency and failure-handling characteristics | All data shared on the same PV has the same consistency & failure characteristics
Where State can Live
Type of State | Cloud Service | Local/Persistent Volume
Structured/SQL | MySQL, Postgres, Redshift, etc. etc. | SQLite, H2, etc.
Key/Value | Cassandra, Redis | RocksDB, LMDB, MapDB, etc.
Semi-structured | MongoDB, etc. etc. |
Unstructured (binary, ML models, etc.) | S3 | Files on PV
Config | K8s ConfigMap | K8s ConfigMap
In Conclusion
• Super important:
• Where is your state?
• What are its characteristics?
• Think about state recovery and failure handling during design phase
• Replicated storage (PVs) is a very useful paradigm for data systems
Thank You
• Evan Chan
• @evanfchan (Twitter)
• @platypus.arts (Instagram)
• https://velvia.github.io/about
