No stress with state
Hands-on tips for scalable applications

Uwe Friedrichsen, codecentric AG, 2014
@ufried
Uwe Friedrichsen | uwe.friedrichsen@codecentric.de | http://slideshare.net/ufried | http://ufried.tumblr.com
A word about scalability ...
Your system is only scalable if you can easily scale up and down.
State in scalable systems
State is the enemy of scalable applications.

State …

•  … prevents load distribution
•  … requires replication
•  … makes systems brittle
•  … makes scaling down hard
General design rule

•  Move state to the frontend and/or the backend
•  Make everything else stateless
•  Use caches if that’s too slow (but be aware of their tradeoffs)
[Diagram: stateful frontend (client) – stateless application – stateful backend (store)]
Frontend client state examples

•  Cookies (fondly handcrafted – see the sketch below)
•  Play framework (keeps session state in a cookie)
•  JavaScript client
•  HTML5 storage
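As a minimal illustration of hand-crafted cookie state: a servlet sketch that keeps a visit counter entirely in a cookie, so the server side stays stateless. The servlet and cookie names are made up for the example; in practice, state-bearing cookies should additionally be signed or encrypted.

import java.io.IOException;
import javax.servlet.http.Cookie;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Keeps the user's state (here: a simple visit counter) in the client.
// Every request carries its own state, so the server stays stateless.
public class VisitCounterServlet extends HttpServlet {

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws IOException {
        int visits = readVisits(req) + 1;

        Cookie cookie = new Cookie("visits", Integer.toString(visits));
        cookie.setHttpOnly(true);   // not readable from JavaScript
        cookie.setMaxAge(3600);     // keep for one hour
        resp.addCookie(cookie);     // the state travels back to the client

        resp.getWriter().println("Visit number " + visits);
    }

    private int readVisits(HttpServletRequest req) {
        if (req.getCookies() == null) return 0;
        for (Cookie c : req.getCookies()) {
            if ("visits".equals(c.getName())) {
                try {
                    return Integer.parseInt(c.getValue());
                } catch (NumberFormatException e) {
                    return 0;   // tampered or garbled cookie: start over
                }
            }
        }
        return 0;
    }
}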
And what if we are bound to a web framework that heavily uses session state?
Standard session state handling

•  Sticky Sessions
•  Session Replication
Sticky Sessions

[Diagram: the load balancer keeps a session table (session 4711 → node 2, session 0815 → node 4, …) and routes every request for session 4711 to node 2]
Sticky Sessions
Problem: Node is down

[Diagram: node 2 fails; the load balancer assigns session 4711 to a new node 4, but node 4 does not know session 4711]
Sticky Sessions
Problem: Load is unequally distributed

[Diagram: the session table pins each session to its node, so the load balancer cannot redistribute requests to even out the load]
Sticky Sessions
Problem: Scale down

[Diagram: five nodes, each holding one session (#Sessions: 1); no node can be shut down without losing its session]
Session Replication

[Diagram: the load balancer forwards each request to an arbitrary node; session state is replicated across all nodes]
But what if we really need to scale?

Lazy session replication leaves you with inconsistencies.
Eager session replication does not scale well.
What we really want

•  Scaling (up) behavior of sticky sessions
•  Node independence of session replication

→ Isolated nodes with shared session cache
Isolated nodes with shared session cache

[Diagram: the load balancer sends each request to an arbitrary application node; all application nodes read and write session state from a separate layer of cache nodes]
Isolated nodes with shared session cache

•  Easy load distribution
•  Easy up- and down-scaling
•  Tolerant of node failures
•  Independent scaling of application and cache nodes
But isn’t that way too slow?
Scalability vs. Latency

You need to give up some latency for an individual user to provide acceptable latency for many users.

[Chart: latency distributions for a single user vs. many users; the non-scalable system has the lower single-user latency but exceeds the acceptable latency with many users, while the scalable system stays within the acceptable latency in both cases]
And how do I implement it?
Interception points

•  StandardSession (Tomcat)
•  HttpSessionAttributeListener (JEE) – see the sketch below
•  Valve (Tomcat)
•  Filter (JEE)
•  SessionManager (ManagerBase, Tomcat)
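One way to hook in at the HttpSessionAttributeListener interception point: a minimal sketch, assuming the Jedis client (see the library list below) is on the classpath. The Redis key layout ("session:" + id) and the String.valueOf "serialization" are illustrative simplifications; the ready-made libraries below handle this properly.

import javax.servlet.annotation.WebListener;
import javax.servlet.http.HttpSessionAttributeListener;
import javax.servlet.http.HttpSessionBindingEvent;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPool;

// Mirrors every session attribute change into a shared Redis cache,
// so that any application node can serve any session.
@WebListener
public class SessionMirrorListener implements HttpSessionAttributeListener {

    private final JedisPool pool = new JedisPool("localhost", 6379);

    @Override
    public void attributeAdded(HttpSessionBindingEvent e) {
        write(e.getSession().getId(), e.getName(), e.getValue());
    }

    @Override
    public void attributeReplaced(HttpSessionBindingEvent e) {
        // On replace, getValue() returns the OLD value; read the new one from the session.
        write(e.getSession().getId(), e.getName(),
              e.getSession().getAttribute(e.getName()));
    }

    @Override
    public void attributeRemoved(HttpSessionBindingEvent e) {
        try (Jedis jedis = pool.getResource()) {
            jedis.hdel("session:" + e.getSession().getId(), e.getName());
        }
    }

    private void write(String sessionId, String name, Object value) {
        try (Jedis jedis = pool.getResource()) {
            // A real implementation serializes the value; String.valueOf is a placeholder.
            jedis.hset("session:" + sessionId, name, String.valueOf(value));
        }
    }
}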
Available libraries

•  For Redis
•  For Memcached
•  For MongoDB
•  For Cassandra
•  …
Example library

•  Redis Session Manager for Apache Tomcat
   https://github.com/jcoleman/tomcat-redis-session-manager
•  A bit outdated, but easy to use
•  Usage
   •  Install and start Redis
   •  Copy tomcat-redis-session-manager.jar and jedis-<version>.jar to the Tomcat lib directory
   •  Update the Tomcat configuration
   •  Restart Tomcat
context.xml

<Context>
  ...
  <Valve className="com.radiadesign.catalina.session.RedisSessionHandlerValve" />
  <Manager className="com.radiadesign.catalina.session.RedisSessionManager"
           host="localhost"
           port="6379"
           database="0"
           maxInactiveInterval="60" />
</Context>
Tradeoffs

•  Easy to set up and use
•  No code required, only configuration
•  Works smoothly

•  Not best for web frameworks that store lots of session state
•  Does not work well with AJAX
•  Does not work with single-page applications

→ Another reason to prefer ROCA style?
General design rule

•  Move state to the frontend and/or the backend
•  Make everything else stateless
•  Use caches if that’s too slow (but be aware of their tradeoffs)

[Diagram: stateful frontend (client) – stateless application – stateful backend (store)]
Challenges of scalable datastores

•  Require relaxed consistency guarantees
•  Lead to (temporary) anomalies and inconsistencies: short-term due to replication or long-term due to partitioning

It might happen. Thus, it will happen!
[Diagram: CAP triangle with the corners C (Consistency), A (Availability), P (Partition Tolerance)]

Strict Consistency: ACID / 2PC
Strong Consistency: Quorum R&W / Paxos
Eventual Consistency: CRDT / Gossip / Hinted Handoff
Strict Consistency (CA)

•  Great programming model: no anomalies or inconsistencies need to be considered
•  Does not scale well: best for single-node databases

"We know ACID – It works!"
"We know 2PC – It sucks!"

Use for moderate data amounts.
And what if I need more data?

•  Distributed datastore
•  Partition tolerance is a must
•  Need to give up strict consistency (CP or AP)
Strong Consistency (CP)

•  Majority-based consistency model: can tolerate up to N nodes failing out of 2N+1 nodes
•  Good programming model: single-copy consistency
•  Trades availability for consistency in case of partitioning

Paxos (for sequential consistency)
Quorum-based reads & writes (worked example below)
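To make the quorum numbers concrete (standard quorum arithmetic, not tied to any particular product): with 2N+1 = 5 replicas and read and write quorums of 3, every read quorum overlaps every write quorum (3 + 3 > 5), so a read always sees the latest successful write; and since a quorum of 3 can still be formed with two replicas down, the system tolerates up to N = 2 failures.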
And what if I need more availability?

•  Need to give up strong consistency (CP)
•  Relax the required consistency properties even more
•  Leads to eventual consistency (AP)
Eventual Consistency (AP)

•  Gives up some consistency guarantees: no sequential consistency, anomalies become visible
•  Maximum availability possible: can tolerate up to N-1 nodes failing out of N nodes
•  Challenging programming model: anomalies usually need to be resolved explicitly

Gossip / Hinted Handoffs
CRDT
CRDT
Conflict-free Replicated Data Types

•  Eventually consistent, self-stabilizing data structures
•  Designed for maximum availability
•  Tolerate up to N-1 out of N nodes failing

State-based CRDT: Convergent Replicated Data Type (CvRDT)
Operation-based CRDT: Commutative Replicated Data Type (CmRDT)
A bit of theory first ...
Convergent Replicated Data Type

State-based CRDT – CvRDT

•  All replicas (usually) connected
•  Exchange state between replicas, calculate the new state on the target replica
•  State transfer at least once over eventually-reliable channels
•  The set of possible states forms a semilattice
   •  A partially ordered set in which every pair of elements has a least upper bound (LUB)
•  All state changes advance upwards with respect to the partial order
Commutative Replicated Data Type

Operation-based CRDT – CmRDT

•  All replicas (usually) connected
•  Exchange update operations between replicas, apply them on the target replica
•  Reliable broadcast with an ordering guarantee for non-concurrent updates
•  Concurrent updates must be commutative
That's enough theory ...
Counter

Op-based Counter

Data
  Integer i
Init
  i ≔ 0
Query
  return i
Operations
  increment(): i ≔ i + 1
  decrement(): i ≔ i - 1
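Translated into Java, the op-based counter is little more than an integer with an apply method. A sketch that assumes some reliable broadcast layer (not shown) delivers every operation exactly once to each replica; because +1 and -1 commute, the delivery order of concurrent updates does not matter:

import java.util.concurrent.atomic.AtomicLong;

// Operation-based (CmRDT) counter: replicas broadcast increment/decrement
// operations and every replica applies them locally.
public final class OpBasedCounter {
    private final AtomicLong i = new AtomicLong(0);   // Data + Init: i := 0

    public long query() { return i.get(); }           // Query: return i

    public void increment() { apply(+1); /* ...and broadcast +1 to the other replicas */ }
    public void decrement() { apply(-1); /* ...and broadcast -1 to the other replicas */ }

    // Called both for local updates and for operations received from other replicas.
    public void apply(long delta) { i.addAndGet(delta); }
}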
State-based G-Counter (grow only)
(Naïve approach)

Data
  Integer i
Init
  i ≔ 0
Query
  return i
Update
  increment(): i ≔ i + 1
Merge(j)
  i ≔ max(i, j)
State-based G-Counter (grow only)
(Naïve approach)

[Diagram: replicas R1, R2, R3 all initialize i = 0; R1 and R3 each increment to i = 1; after merging with max(i, j) every replica ends up with i = 1, although two increments happened: the naïve merge loses updates]
State-based G-Counter (grow only)
(Vector-based approach)

Data
  Integer V[]   / one element per replica
Init
  V ≔ [0, 0, ... , 0]
Query
  return ∑i V[i]
Update
  increment(): V[i] ≔ V[i] + 1   / i is the local replica's number
Merge(V')
  ∀i ∈ [0, n-1] : V[i] ≔ max(V[i], V'[i])
State-based G-Counter (grow only)
(Vector-based approach)

[Diagram: replicas R1, R2, R3 all initialize V = [0, 0, 0]; R1 increments to V = [1, 0, 0], R3 increments to V = [0, 0, 1]; element-wise max merges converge every replica to V = [1, 0, 1], so the query correctly returns 2]
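The vector spec transcribes almost one-to-one into Java. A minimal sketch, assuming a fixed, known number of replicas and an external mechanism that ships replica state around for merging:

import java.util.Arrays;

// State-based (CvRDT) G-Counter: one vector slot per replica;
// merge is the element-wise max, i.e. the LUB in the semilattice.
public final class GCounter {
    private final int myId;    // index of the local replica
    private final long[] v;    // one slot per replica

    public GCounter(int myId, int numReplicas) {
        this.myId = myId;
        this.v = new long[numReplicas];   // Init: V := [0, 0, ..., 0]
    }

    public void increment() { v[myId]++; }                    // only the local slot grows

    public long query() { return Arrays.stream(v).sum(); }    // sum over all slots

    public void merge(GCounter other) {                       // element-wise max
        for (int i = 0; i < v.length; i++) {
            v[i] = Math.max(v[i], other.v[i]);
        }
    }
}

Replaying the diagram: after R1 and R3 each increment once and all replicas merge, every replica holds V = [1, 0, 1] and query() returns 2, instead of losing an update as the naïve scalar max did.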
State-based PN-Counter (pos./neg.)

•  The simple vector approach of the G-Counter does not work
   •  A decrement would move the state downwards and violate the monotonicity requirement of the semilattice
•  Need to use two vectors
   •  Vector P to track increments
   •  Vector N to track decrements
•  Query result is ∑i P[i] – N[i]
State-based PN-Counter (pos./neg.)

Data
  Integer P[], N[]   / one element per replica
Init
  P ≔ [0, 0, ... , 0], N ≔ [0, 0, ... , 0]
Query
  return ∑i P[i] – N[i]
Update
  increment(): P[i] ≔ P[i] + 1   / i is the local replica's number
  decrement(): N[i] ≔ N[i] + 1
Merge(P', N')
  ∀i ∈ [0, n-1] : P[i] ≔ max(P[i], P'[i])
  ∀i ∈ [0, n-1] : N[i] ≔ max(N[i], N'[i])
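The PN-Counter follows the same pattern; a sketch along the spec above, with two grow-only vectors merged independently:

// State-based PN-Counter: a pair of G-Counter vectors,
// P for increments, N for decrements.
public final class PNCounter {
    private final int myId;    // index of the local replica
    private final long[] p;    // increments per replica
    private final long[] n;    // decrements per replica

    public PNCounter(int myId, int numReplicas) {
        this.myId = myId;
        this.p = new long[numReplicas];
        this.n = new long[numReplicas];
    }

    public void increment() { p[myId]++; }
    public void decrement() { n[myId]++; }

    public long query() {                         // Query: ∑i P[i] - N[i]
        long sum = 0;
        for (int i = 0; i < p.length; i++) sum += p[i] - n[i];
        return sum;
    }

    public void merge(PNCounter other) {          // element-wise max of both vectors
        for (int i = 0; i < p.length; i++) {
            p[i] = Math.max(p[i], other.p[i]);
            n[i] = Math.max(n[i], other.n[i]);
        }
    }
}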
Non-negative Counter

Problem: How do you check a global invariant with local information only?

•  Approach 1: Only decrement if the local state is > 0
   •  Concurrent decrements could still lead to a negative value
•  Approach 2: Externalize negative values as 0
   •  Incrementing a negative value becomes a no-op, which violates counter semantics
•  Approach 3: Local invariant: only allow a decrement if P[i] - N[i] > 0
   •  Works, but may be too strong a limitation (see the sketch below)
•  Approach 4: Synchronize
   •  Works, but violates the assumptions and prerequisites of CRDTs
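Approach 3 can be sketched as a small guard on the PNCounter above: a replica may only consume increments it has issued itself. A hypothetical helper method, not part of the standard PN-Counter spec:

// Add to the PNCounter sketch above.
// Approach 3: only decrement while the local slots hold surplus increments.
public boolean tryDecrement() {
    if (p[myId] - n[myId] > 0) {
        n[myId]++;
        return true;
    }
    return false;   // the local invariant would be violated: reject the decrement
}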
More datatypes

•  Register
•  Set
•  Dictionary (Map)
•  Tree
•  Graph
•  Array
•  List

plus several representations for each datatype
Limitations of CRDTs

•  Very weak consistency guarantees: strive for "quiescent consistency"
•  Eventually consistent: not suitable for a high-volume ID generator or the like
•  Not easy to understand and model
•  Not all data structures are representable

Use if availability is extremely important.
Further reading on CRDTs

1.  Shapiro et al., "Conflict-free Replicated Data Types", Inria research report, 2011
2.  Shapiro et al., "A comprehensive study of Convergent and Commutative Replicated Data Types", Inria research report, 2011
3.  Basho Tag Archives: CRDT, https://basho.com/tag/crdt/
4.  Leslie Lamport, "Time, clocks, and the ordering of events in a distributed system", Communications of the ACM, 1978
Wrap-up

•  Scalability means scaling up and down
•  State is the enemy of scalability
•  Move state to the frontend and backend
•  Shared session state layer
•  CRDTs

The best way to deal with state is to avoid it.
@ufried
Uwe Friedrichsen | uwe.friedrichsen@codecentric.de | http://slideshare.net/ufried | http://ufried.tumblr.com