Implementation of Cluster-wide Causal Consistency in MongoDB
Outline
- What is causal consistency
- Academics' view on causal consistency
- MongoDB architecture
- Causal consistency building blocks
- Making causal consistency secure
- Making causal consistency fast
- Making causal consistency reliable
- Causal consistency for end users
Client-side properties of causal consistency
- Read your writes
- Writes follow reads
- Monotonic reads
- Monotonic writes
Implementing with a non-causally-consistent system
- Single server systems are causally consistent
- Read and write from the same node
- Add application logic to handle the scenarios that have to be causally consistent
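Ordering of Events in a Distributed System
[Figure: space-time diagram of processes P, Q and R with logical counters <C1>, <C2>, <C3>. Initial values: <C1> = 10, <C2> = 10, <C3> = 0. Events: p1; q1, q2 (<C2> = 11), q3; r1 (<C3> = 11), r2 (<C3> = 12).]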
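The counters in the diagram behave like a Lamport logical clock. Below is a minimal sketch, assuming each process keeps one integer counter. The class and method names are illustrative, and the message pattern (R receiving a message stamped 10) is an assumption, since the arrows of the original figure are not recoverable.

```javascript
// Lamport logical clock: a minimal sketch (names are illustrative).
class LamportClock {
  constructor(start = 0) {
    this.time = start;
  }
  // Local event: advance the counter by one.
  tick() {
    this.time += 1;
    return this.time;
  }
  // Message receive: move past the sender's timestamp if it is ahead.
  receive(remoteTime) {
    this.time = Math.max(this.time, remoteTime) + 1;
    return this.time;
  }
}

// R starts at 0, as <C3> does in the diagram.
const r = new LamportClock(0);
const r1 = r.receive(10); // assumed: a message stamped 10 arrives at R
const r2 = r.tick();      // the next local event on R
console.log(r1, r2);
```

Because a receiver always moves past the timestamp it was sent, an event that causally follows another always carries a greater counter value.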
Server-side causal consistency
Causal consistency is a partial order of events in a distributed
system. If an event A causes another event B, then causal
consistency guarantees that every other process of the
system observes event A before observing event B.
If an event A is not causally related to an event B, then they are
concurrent.
System’s Architecture
Gossiping clusterTime
ClusterTime: {uint64}
Timestamp(1495470881, 6)
Ticking clusterTime
{ts: 6} ...
{ts: 5} ...
{ts: 11} ...
insert({x:1}, clusterTime: 4)
<clusterTime>: 6
<Wall clock>: 11
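The numbers on this slide can be reproduced with a simple rule, sketched below. The rule itself (take the greater of clusterTime + 1 and the wall clock) and both function names are assumptions chosen to match the slide's values, not the server's actual code.

```javascript
// Hypothetical model of a primary ticking clusterTime on an oplog write.
function tickClusterTime(clusterTime, wallClock) {
  // Never go backwards, and catch up with the wall clock when it is ahead.
  return Math.max(clusterTime + 1, wallClock);
}

// Gossip: a node adopts a received clusterTime only if it is greater than its own.
function advanceClusterTime(current, received) {
  return Math.max(current, received);
}

let clusterTime = 6;
clusterTime = advanceClusterTime(clusterTime, 4); // the client sent 4, so the node keeps 6
const ts = tickClusterTime(clusterTime, 11);      // the wall clock is ahead, so the write is stamped 11
console.log(clusterTime, ts);
```

When clusterTime has already caught up with the wall clock (both at 11), the same rule yields 12 for the next write.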
Reporting operationTime
insert({x:1})
{ts: 11} ...
{ts: 5} ...
{ok:1}, { operationTime: 12}
{ts: 12} ...
<clusterTime>: 11
<Wall clock>: 11
Waiting for afterClusterTime
find({x:1}, afterClusterTime: {10},
clusterTime: {15})
{ts: 6} ...
{ts: 5} ...
{x:1}, { operationTime: {11}}
{ts: 11} ...
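The waiting behaviour can be sketched as a node that parks a read until its oplog has caught up with the requested afterClusterTime. This is a simplified, single-threaded model; the class, its fields, and the callback style are all hypothetical.

```javascript
// Sketch of a storage node that delays reads until its oplog has caught up.
class StorageNode {
  constructor() {
    this.lastApplied = 0; // greatest ts applied from the oplog
    this.waiters = [];    // reads waiting for a later ts
  }
  // Apply a replicated oplog entry and run any reads it unblocks.
  applyOplogEntry(ts) {
    this.lastApplied = Math.max(this.lastApplied, ts);
    const ready = this.waiters.filter((w) => w.afterClusterTime <= this.lastApplied);
    this.waiters = this.waiters.filter((w) => w.afterClusterTime > this.lastApplied);
    ready.forEach((w) => w.run(this.lastApplied));
  }
  // Run the read now if possible, otherwise park it.
  read(afterClusterTime, run) {
    if (afterClusterTime <= this.lastApplied) run(this.lastApplied);
    else this.waiters.push({ afterClusterTime, run });
  }
}

const node = new StorageNode();
node.applyOplogEntry(5);
node.applyOplogEntry(6);
const served = [];
node.read(10, (operationTime) => served.push(operationTime)); // parked, since 6 < 10
const parkedCount = node.waiters.length;
node.applyOplogEntry(11); // the entry with ts 11 unblocks the read
console.log(parkedCount, served);
```

The read is answered with the operationTime of the node at the moment it runs, which is 11 here, as on the slide.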
Breaking clusterTime
{ts: Timestamp(1495470881, 6), term: 1},
...
{ts: Timestamp(1495470881, 5), term: 1},
...
{Timestamp(0xFFFFFFFF, 0xFFFFFFFF)}
...
insert({x:1}, clusterTime:
{0xFFFFFFFF, 0xFFFFFFFE})
LogicalClock:clusterTime =
Timestamp(0xFFFFFFFF, 0xFFFFFFFE)
Protecting clusterTime
"$clusterTime" : {
"clusterTime" : Timestamp(1495470881, 5),
"signature" : {
"hash" : BinData(0,"7olYjQCLtnfORsI9IAhdsftESR4="),
"keyId" : NumberLong("6422998367101517844")
}
}
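One way to picture the signature is as a keyed hash over the clusterTime value. The sketch below uses Node's built-in crypto module with HMAC-SHA1; that algorithm is an assumption based on the 20-byte hash in the example above, and the key handling and function names are hypothetical.

```javascript
const crypto = require("crypto");

// Sign a clusterTime with a shared secret key (hypothetical key handling).
function signClusterTime(clusterTime, key) {
  return crypto.createHmac("sha1", key).update(String(clusterTime)).digest("base64");
}

// Accept a gossiped clusterTime only if its signature matches.
function verifyClusterTime(clusterTime, signature, key) {
  const expected = Buffer.from(signClusterTime(clusterTime, key), "base64");
  const actual = Buffer.from(signature, "base64");
  return expected.length === actual.length && crypto.timingSafeEqual(expected, actual);
}

const key = "demo-secret"; // illustrative only
const signature = signClusterTime("1495470881:5", key);
const genuine = verifyClusterTime("1495470881:5", signature, key);
const tampered = verifyClusterTime("4294967295:4294967294", signature, key);
console.log(genuine, tampered);
```

A client that raises the clusterTime without knowing the key cannot produce a matching signature, so the node can refuse to advance its clock.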
Protecting against operator errors
{ts: 6} ...
{ts: 5} ...
insert({x:1}, clusterTime: 100,000)
<clusterTime>: 6
<Wall clock>: 11
{Error}, { operationTime: 6}
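The check on this slide can be sketched as a bound on how far a received clusterTime may run ahead of the node's wall clock. The limit value and the function name below are hypothetical; only the behaviour (reject and keep the current time) comes from the slide.

```javascript
// Hypothetical limit on how far a gossiped clusterTime may run ahead of the wall clock.
const MAX_DRIFT = 1000;

function advanceWithDriftCheck(current, received, wallClock) {
  if (received > wallClock + MAX_DRIFT) {
    // Too far in the future: reject and keep our own clusterTime.
    return { ok: false, clusterTime: current };
  }
  return { ok: true, clusterTime: Math.max(current, received) };
}

const rejected = advanceWithDriftCheck(6, 100000, 11); // the slide's case
const accepted = advanceWithDriftCheck(6, 9, 11);      // a plausible value
console.log(rejected, accepted);
```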
Signing a range of clusterTime
find({x:1}, clusterTime: <val>,
signature:<hash>)
<timeRange> =
<val> | 0x0000'0000'0000'FFFF
cache:{ <timeRange>:<hash> }
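The caching idea can be sketched with BigInt arithmetic: every clusterTime is mapped to the top of its range by setting the low 16 bits, and one signature is computed per range. The class name and the stand-in signing function are hypothetical, as is packing a Timestamp into 64 bits as (seconds << 32) | increment.

```javascript
// Map a 64-bit clusterTime to the top of its 2^16-wide range.
function timeRange(clusterTime) {
  return BigInt(clusterTime) | 0xFFFFn;
}

// Cache one signature per range so most requests skip the hash computation.
class SignatureCache {
  constructor(sign) {
    this.sign = sign;
    this.cache = new Map();
    this.signCalls = 0;
  }
  signatureFor(clusterTime) {
    const range = timeRange(clusterTime);
    if (!this.cache.has(range)) {
      this.signCalls += 1;
      this.cache.set(range, this.sign(range));
    }
    return this.cache.get(range);
  }
}

// Stand-in for a real keyed hash, for illustration only.
const cache = new SignatureCache((range) => "sig:" + range.toString(16));
const t5 = (1495470881n << 32n) | 5n; // assumed packing of Timestamp(1495470881, 5)
const t6 = (1495470881n << 32n) | 6n; // assumed packing of Timestamp(1495470881, 6)
const sameSignature = cache.signatureFor(t5) === cache.signatureFor(t6);
console.log(sameSignature, cache.signCalls);
```

Both timestamps fall into the same range, so the signing function runs once and the second lookup is served from the cache.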
Use dummy signatures
- When authentication is off
- When a user has advanceClusterTime privilege
How end users see it
let session=db.getMongo().startSession({causalConsistency: true})
db = session.getDatabase(db.getName());
startSession()
update({name: "misha", checking: 100})
{ok:1} operationTime: 15
find({name: "misha"}) afterClusterTime: 15
{checking:100}
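What the session does on the client's behalf can be modelled in a few lines. This is a toy model, not driver code: the class, its methods, and the "accounts" collection name are hypothetical.

```javascript
// Hypothetical client-side session: remembers the greatest operationTime seen.
class CausalSession {
  constructor() {
    this.operationTime = 0;
  }
  // Attach afterClusterTime to an outgoing read once a time is known.
  decorate(command) {
    return this.operationTime > 0
      ? { ...command, afterClusterTime: this.operationTime }
      : command;
  }
  // Record the operationTime carried by a reply.
  observe(reply) {
    this.operationTime = Math.max(this.operationTime, reply.operationTime);
    return reply;
  }
}

const session = new CausalSession();
session.observe({ ok: 1, operationTime: 15 }); // reply to the update
const findCmd = session.decorate({ find: "accounts", filter: { name: "misha" } });
console.log(findCmd);
```

The read that follows the update carries afterClusterTime: 15, so whichever node serves it has to have applied the update first.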
Misha Tyulenev
misha@mongodb.com

A detailed look at how Causal Consistency is implemented in MongoDB / Mikhail Tyulenev (MongoDB)

Editor's Notes

  • #4 Even if the read request goes to the primary, it is not guaranteed to read its own writes; for example, read concern level "majority" may delay it.
  • #11 Add a logical clock object to each cluster node (routers, storage, clients). Every client tracks the greatest operationTime inside a causally consistent session.
  • #12 clusterTime is incremented only on a write to the oplog (storage). The pair (clusterTime, election term) is the primary key in the oplog collection.
  • #13 Every command returns an operationTime: the greatest clusterTime stored with an oplog entry at the time the command finishes its execution.
  • #14 Every request includes the afterClusterTime. A storage node waits for the oplog to replicate the entry with clusterTime >= afterClusterTime.
  • #15 Clients have to participate, but we don't trust the clients. There is a maximum time after which the primary can't do a write, so we want to be sure that all cluster times from clients come from a trusted source.
  • #17 clusterTime is incremented only on a write to the oplog (storage). The pair (clusterTime, election term) is the primary key in the oplog collection.