2. Doveryai, no Proveryai
A Russian proverb which means “Trust,
but verify”.
Popular during the Cold War when the
US and Soviet Union were signing
nuclear disarmament accords.
3. Talk overview
1. Problem definition
2. What is TLA+, PlusCal, TLC...
3. Example 1 : Childcare facility
4. Example 2 : Dining Philosophers
5. Example 3 : Alternating Bit Protocol
6. Concluding observations
Code : https://github.com/sanjosh/tlaplus
Slides: https://www.slideshare.net/SandeepJoshi55/
4. Hard to prove correctness in a distributed system
In a distributed system, how do you prove
1. Safety : Something bad will never happen
2. Liveness : Something good will eventually happen
When you have
1. Multiple agents/actors, each with its own state machine (FSM)
2. Non-determinism, which leads to arbitrarily interleaved execution
3. Failures and restarts
5. Microsoft .NET remote authentication FSMs https://msdn.microsoft.com/en-us/library/ms973909.aspx
Can you verify that this 2-process FSM (.NET) is correct?
6. Or that this 2-process FSM (for TCP) is correct?
https://thewalnut.io/app/release/73/
8. How to reason about time in a distributed system
Required :
1. A formal theory
2. A language to express the problem
3. A tool to verify
9. How to reason about time in a distributed system
Required :
1. A formal theory : Temporal Logic
2. A language to express the problem : TLA+ and others.
3. A tool to verify : TLC and other model checkers
10. Temporal logic simplified
In programs, we write formulae using Boolean operators (AND, OR, NOT).
“Assert (a > 0 AND b < 0)”
Temporal logic adds temporal operators (such as “always” and “eventually”) which
describe how a formula holds along a path of execution. Branching-time logics such
as CTL also add “path quantifiers” (“on all paths”, “on some path”).
1. I will like chocolate from now on.
2. After the weather becomes cold, at some point, I will start eating chocolate.
https://en.wikipedia.org/wiki/Computation_tree_logic#Examples
11. What is TLA+
● TLA+ is a specification language created by Leslie Lamport, based on the Temporal Logic of Actions (TLA).
● PlusCal is a simpler, pseudocode-like algorithm language which is translated into TLA+ (this talk uses PlusCal).
● TLC is the “model checker”: it explores the reachable states of your PlusCal
program and checks the properties you state.
● TLA+ has a GUI called the Toolbox. In this talk, only the command-line tools are used.
12. How to get started with TLA+
● Read general background on model checkers
● Download the TLA toolbox (GUI + java jar file)
● Read the PlusCal manual and Lamport’s book “Specifying Systems”
● Read sample PlusCal programs written by others
● Start with a small problem and try writing your own program
● Run it (with tla2tools.jar on the Java classpath)...
$ java pcal.trans myspec.tla
$ java tlc2.TLC myspec.tla
13. Childcare facility problem
Children and adults continuously enter and exit a childcare facility.
Ensure that there is always one adult present for every three children.
[ from The Little Book of Semaphores by Allen Downey ]
14. Childcare constraints
Adults can enter anytime, but exit ONLY if
1. Three times the NEW number of adults is at least the number of children
Children can exit anytime, but enter ONLY if
1. Three times the number of adults is at least the NEW number of children
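The two guards can be written as plain predicates. The following is a minimal Python sketch rather than code from the talk: the names `invariant`, `child_may_enter` and `adult_may_exit` are illustrative assumptions, and the ratio of 3 comes from the problem statement.

```python
RATIO = 3  # one adult for every three children (from the problem statement)

def invariant(adults: int, children: int) -> bool:
    """Safety property: enough adults for the children currently inside."""
    return adults * RATIO >= children

def child_may_enter(adults: int, children: int) -> bool:
    """A child may enter only if the invariant holds for the NEW child count."""
    return invariant(adults, children + 1)

def adult_may_exit(adults: int, children: int) -> bool:
    """An adult may exit only if the invariant holds for the NEW adult count."""
    return adults >= 1 and invariant(adults - 1, children)
```

Both guards evaluate the invariant on the state that would exist after the move, which is the point the later slides make about the await condition.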
15. Childcare - create child & adult process
Define a PlusCal “process” for each actor in your system
--algorithm childcare {
  process (a \in 1..ADULTS) { ... }
  \* process ids must not overlap with the adult ids
  process (c \in ADULTS+1..ADULTS+CHILDREN) { ... }
}
16. Childcare - “labels” denote Atomic actions
Use one PlusCal label for each atomic action of the child.
A child performs two actions: enter and exit the childcare facility.
process (c \in ...) {
  c_enter: number_children := number_children + 1;
  c_exit : number_children := number_children - 1;
}
17. What are PlusCal Labels
All statements within a label are executed as one atomic step.
TLC internally interleaves the steps of the processes in every possible order
to verify correctness
Steps of two processes (Child 1 and Adult 2) that TLC may interleave:
LabelA : Y := X + 1
Label1 : X := Y + 1
Label2 : X := Y - 1
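To see what interleaving atomic steps means, the sketch below enumerates every order in which two processes can take their labelled steps and collects the final states. It is a Python illustration and not TLC itself. It assumes that Label1 and Label2 belong to one process and LabelA to the other (the slide text does not say which is which) and that X and Y both start at 0.

```python
def interleavings(a, b):
    """Yield every merge of step lists a and b that keeps each list's own order."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest

# Each step is (label, function from state to new state); a state is a dict.
# Which process owns which label is an assumption made for this sketch.
child = [("Label1", lambda s: {**s, "X": s["Y"] + 1}),
         ("Label2", lambda s: {**s, "X": s["Y"] - 1})]
adult = [("LabelA", lambda s: {**s, "Y": s["X"] + 1})]

def run(schedule, state):
    """Apply the steps of one schedule in order and return the final state."""
    for _label, step in schedule:
        state = step(state)
    return state

def final_states(init):
    """Set of final states over all interleavings, each as a sorted tuple of items."""
    return {tuple(sorted(run(s, dict(init)).items()))
            for s in interleavings(child, adult)}
```

With three steps split two and one there are three schedules, and under these assumptions each one ends in a different state. That is why a checker has to look at all of them and not only the order you happened to have in mind.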
18. Childcare - use “await” to wait for a condition
Every child waits until enough adults are present inside
c_enter : await (number_adults * 3 >= number_children + 1);
          number_children := number_children + 1;
c_exit  : number_children := number_children - 1;
          assert (number_adults * 3 >= number_children);
19. Childcare - specify adult process
Follow the same steps to define the adult process, using process, label and await
process (a \in ...) {
  a_enter: number_adults := number_adults + 1;
  a_exit : await (number_adults * 3 >= number_children);
           number_adults := number_adults - 1;
           assert (number_adults * 3 >= number_children);
}
20. TLC (model checker) Failure output
At this point the assert fires, since the adult exited due to the incorrect “await” condition
21. Childcare - correct the condition
Change the await condition to check the new value instead of the old one
process (a \in ...) {
  a_enter: number_adults := number_adults + 1;
  a_exit : await ((number_adults - 1) * 3 >= number_children);
           number_adults := number_adults - 1;
}
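The difference between the two await conditions can be reproduced with a small explicit-state search. This is a Python sketch of the idea behind a model checker and not how TLC is implemented. It assumes each adult and child enters once and exits once, as in the processes above, and it checks the invariant in every reachable state instead of only at the asserts. The function names are illustrative.

```python
from collections import deque

RATIO = 3

def invariant(a_in, c_in):
    return a_in * RATIO >= c_in

def buggy_exit_guard(a_in, c_in):
    return a_in * RATIO >= c_in          # checks the OLD adult count

def fixed_exit_guard(a_in, c_in):
    return (a_in - 1) * RATIO >= c_in    # checks the NEW adult count

def successors(state, adult_exit_guard):
    """Yield (label, next_state); state = (adults waiting, adults inside,
    children waiting, children inside)."""
    a_wait, a_in, c_wait, c_in = state
    if a_wait > 0:
        yield "a_enter", (a_wait - 1, a_in + 1, c_wait, c_in)
    if a_in > 0 and adult_exit_guard(a_in, c_in):
        yield "a_exit", (a_wait, a_in - 1, c_wait, c_in)
    if c_wait > 0 and a_in * RATIO >= c_in + 1:
        yield "c_enter", (a_wait, a_in, c_wait - 1, c_in + 1)
    if c_in > 0:
        yield "c_exit", (a_wait, a_in, c_wait, c_in - 1)

def find_violation(adults, children, adult_exit_guard):
    """Breadth-first search; return the shortest list of labels that leads to
    a state violating the invariant, or None if no such state is reachable."""
    init = (adults, 0, children, 0)
    parent = {init: None}
    queue = deque([init])
    while queue:
        state = queue.popleft()
        if not invariant(state[1], state[3]):
            trace = []
            while parent[state] is not None:
                state, label = parent[state]
                trace.append(label)
            return trace[::-1]
        for label, nxt in successors(state, adult_exit_guard):
            if nxt not in parent:
                parent[nxt] = (state, label)
                queue.append(nxt)
    return None
```

Calling `find_violation(1, 1, buggy_exit_guard)` returns the steps `a_enter`, `c_enter`, `a_exit`: the only adult leaves while a child is still inside. With `fixed_exit_guard` the search finds no violating state for the sizes tried here.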
24. Dining Philosophers Problem
Each philosopher keeps doing the following
1. Think
2. Take right fork
3. Take left fork
4. Eat
5. Put down both forks
25. Dining Philosophers with PlusCal
Define five philosopher instances; step through three labels (atomic actions)
process (ph \in 1..5) {
  Wait_first_fork : await (forks[right] = FALSE);
                    forks[right] := TRUE;
}
26. Dining Philosophers with PlusCal
Define five philosopher instances; step through three labels (atomic actions)
process (ph \in 1..5) {
  Wait_first_fork : await (forks[right] = FALSE);
                    forks[right] := TRUE;
  Wait_second_fork: await (forks[left] = FALSE);
                    forks[left] := TRUE;
}
27. Dining Philosophers with PlusCal
Define five philosopher instances; step through three labels (atomic actions)
process (ph \in 1..5) {
  Wait_first_fork : await (forks[right] = FALSE);
                    forks[right] := TRUE;
  Wait_second_fork: await (forks[left] = FALSE);
                    forks[left] := TRUE;
  Done_eating     : forks[left] := FALSE || forks[right] := FALSE;
}
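TLC reports deadlock by default, and the three-label process above can reach one. The Python sketch below searches the state space for states in which some philosopher has not finished but nobody can take a step. It is an illustration and not TLC. The slides do not define `left` and `right`, so the sketch assumes philosopher i has fork i on the right and fork (i + 1) mod 5 on the left, and that each philosopher eats once.

```python
from collections import deque

N = 5
FIRST, SECOND, EATING, DONE = range(4)   # program-counter values per philosopher

def right(i):
    return i                 # assumption: fork i is to the right of philosopher i

def left(i):
    return (i + 1) % N       # assumption: fork i + 1 is to the left

def successors(state):
    """Yield every state reachable by one atomic step of one philosopher."""
    pcs, forks = state
    for i in range(N):
        pc = pcs[i]
        new_pcs, new_forks = list(pcs), list(forks)
        if pc == FIRST and not forks[right(i)]:
            new_forks[right(i)] = True                        # Wait_first_fork
        elif pc == SECOND and not forks[left(i)]:
            new_forks[left(i)] = True                         # Wait_second_fork
        elif pc == EATING:
            new_forks[left(i)] = new_forks[right(i)] = False  # Done_eating
        else:
            continue                                          # blocked or finished
        new_pcs[i] = pc + 1
        yield (tuple(new_pcs), tuple(new_forks))

def find_deadlocks():
    """Return reachable states with no successor in which someone is not DONE."""
    init = ((FIRST,) * N, (False,) * N)
    seen, queue, deadlocks = {init}, deque([init]), []
    while queue:
        state = queue.popleft()
        nexts = list(successors(state))
        if not nexts and any(pc != DONE for pc in state[0]):
            deadlocks.append(state)
        for nxt in nexts:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return deadlocks
```

Under these assumptions the search finds one deadlocked state: every philosopher holds the right fork and waits for the left one.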
34. Alternating bit protocol over lossy channel
[Figure: Sender and Receiver connected by a message channel and an ack channel; both channels are lossy]
https://en.wikipedia.org/wiki/Alternating_bit_protocol
Discussed in Lamport’s book “Specifying Systems”.
35. Alternating bit protocol - define channel
Use the “Sequences” module to define the communication channels
Declare the channels as sequences
variables msgChan = <<>>, ackChan = <<>>;
Append to a channel
msgChan := Append(msgChan, m)
Read the first message with “Head(msgChan)”; the remaining messages are “Tail(msgChan)”
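For readers who have not used the Sequences module: the three operators are pure functions that return new values and never modify the channel in place. The following Python analogue uses tuples. It is an illustrative sketch and not part of TLA+; raising an error on an empty sequence is a choice made here, since TLA+ leaves Head and Tail of an empty sequence unspecified.

```python
def Append(seq, elem):
    """Return a new sequence with elem added at the end."""
    return seq + (elem,)

def Head(seq):
    """Return the first element of a non-empty sequence."""
    if not seq:
        raise ValueError("Head of an empty sequence")
    return seq[0]

def Tail(seq):
    """Return the sequence without its first element."""
    if not seq:
        raise ValueError("Tail of an empty sequence")
    return seq[1:]

# Receiving from a channel is a Head followed by replacing the channel with its Tail.
msg_chan = Append(Append((), "m1"), "m2")
received, msg_chan = Head(msg_chan), Tail(msg_chan)
```

The last two lines show why both operators are needed to take a message off a channel: Head reads it, Tail gives the channel that remains.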
36. Alternating bit protocol - sender and receiver process
Define one process each for Sender and Receiver
process (Sender = "S") {
Send message
OR
Receive Ack
}
process (Receiver = "R") {
Receive message
OR
Send Ack
}
37. Alternating bit protocol - sender and receiver process
Define one process each for Sender and Receiver
process (Sender = "S") {
  either {
    msgChan := Append(msgChan, <<input>>);
  } or {
    Recv(ack, ackChan);
  }
}
process (Receiver = "R") {
  either {
    ackChan := Append(ackChan, rbit);
  } or {
    Recv(msg, msgChan);
  }
}
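The slides leave out the bit handling that gives the protocol its name. The sketch below shows one way it can work, in Python. It is a lock-step simplification and not the author's PlusCal spec: the sender resends after every loss (a stand-in for a timeout), there is at most one message in flight, and `drop(n)` is a hypothetical helper that says whether the n-th transmission is lost.

```python
def run_abp(data, drop):
    """Deliver `data` in order over channels that lose some transmissions.

    `drop(n)` returns True if the n-th transmission (message or ack) is lost.
    """
    delivered = []
    sbit = 0   # bit the sender attaches to the current item
    rbit = 0   # bit the receiver expects next
    n = 0      # transmission counter, used only to decide losses
    for item in data:
        while True:
            n += 1
            if drop(n):
                continue          # message lost: sender resends
            msg_bit, payload = sbit, item
            if msg_bit == rbit:   # new item: deliver it and flip the expected bit
                delivered.append(payload)
                rbit ^= 1
            # otherwise it is a duplicate of an item already delivered: ignore it
            ack_bit = msg_bit     # receiver acknowledges the bit it just saw
            n += 1
            if drop(n):
                continue          # ack lost: sender resends the same item
            if ack_bit == sbit:   # ack matches: move on to the next item
                sbit ^= 1
                break
    return delivered
```

For example, `run_abp(["a", "b", "c"], lambda n: n in {2, 5})` loses one ack and one message and still delivers each item exactly once. The loop only terminates if `drop` eventually lets transmissions through, which corresponds to the fairness assumption a liveness proof of the protocol needs.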
38. PlusCal - Either Or
“Either-Or” is an important feature of the PlusCal language
It lets you model non-determinism
TLC (model checker) explores every branch, not just one of them.
either { Do this }
or { Do that }
39. Alternating Bit protocol - simulate lossy channel
To simulate a lossy channel, add another process which non-deterministically deletes
messages.
process (LoseMsg = "L") {
non-deterministically delete messages from either channel
}
40. Alternating Bit protocol - simulate lossy channel
To simulate a lossy channel, add another process which non-deterministically deletes
messages.
process (LoseMsg = "L") {
  while (TRUE) {
    either with (i \in 1..Len(msgChan)) {
      msgChan := Remove(i, msgChan);
    } or with (i \in 1..Len(ackChan)) {
      ackChan := Remove(i, ackChan);
    }
  }
}
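A model checker does not pick one message to drop at random; it treats every possible choice as a separate successor state. The Python sketch below lists the successors of one LoseMsg step. It is an illustration only: `remove` mirrors the `Remove(i, seq)` operator used on the slide, whose definition is not shown, so its 1-based behaviour here is an assumption.

```python
def remove(i, seq):
    """Return seq without its i-th element (1-based, like TLA+ sequences)."""
    return seq[:i - 1] + seq[i:]

def lose_msg_successors(msg_chan, ack_chan):
    """Yield every (msg_chan, ack_chan) pair reachable by one LoseMsg step."""
    for i in range(1, len(msg_chan) + 1):    # `either` branch, one per choice of i
        yield remove(i, msg_chan), ack_chan
    for i in range(1, len(ack_chan) + 1):    # `or` branch, one per choice of i
        yield msg_chan, remove(i, ack_chan)
```

When both channels are empty the function yields nothing, which matches a `with` over an empty set: the step is simply not enabled.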
41. PlusCal constructs introduced
1. Algorithm : A problem that you want to model.
2. Process : An actor/thread of execution within the algorithm.
3. Labels : All statements inside a label are atomically executed.
4. Await : Execution proceeds only once the condition becomes true.
5. Either-Or : Non-deterministic execution of alternatives.
6. With : Non-deterministically choose one element out of a Set.
42. Notable users of TLA+
1. Intel CPU cache coherence protocol [Brannon Batson]
2. Microsoft CosmosDB
3. Amazon : S3, DynamoDB, EBS, Distributed Lock Manager [Chris Newcombe]
Newcombe (Amazon) has released two of their TLA+ specs
(See my github for a copy)
None of the others are publicly available
43. Conclusion
1. TLC can find bugs.
2. Complex programs can take hours to check (TLC also has a “simulation” mode
which checks randomly chosen executions instead of all of them)
Learning curve
1. Formulation : Lack of sample programs, but the Google group is helpful.
2. Debugging : Check the error trace; add prints!
3. Mastery over TLA+ requires some mathematics knowledge (e.g. set theory).
4. [Newcombe, Experience of Software Engineers using TLA+]
http://tla2012.loria.fr/contributed/newcombe-slides.pdf
45. TLA+ operators
1. <> P : P is true at some point in the execution (eventually)
2. [] P : P is true in every state of the execution (always)
3. Q ~> P : Whenever Q becomes true, P is true then or at some later point (leads to)
4. <>[] P : at some point P becomes true and stays true
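The four operators can be tried out on a recorded trace. The Python sketch below evaluates them over a finite list of states. This is a simplification: TLA+ defines these operators over infinite behaviours, so the sketch assumes the last state of the trace repeats forever. The function names are illustrative.

```python
def always(p, trace):
    """[] P : P holds in every state of the trace."""
    return all(p(s) for s in trace)

def eventually(p, trace):
    """<> P : P holds in at least one state of the trace."""
    return any(p(s) for s in trace)

def leads_to(q, p, trace):
    """Q ~> P : from every state where Q holds, P holds then or later."""
    return all(eventually(p, trace[i:])
               for i, s in enumerate(trace) if q(s))

def eventually_always(p, trace):
    """<>[] P : from some state onwards, P holds in every state."""
    return any(always(p, trace[i:]) for i in range(len(trace)))

# Example trace: a counter that climbs to 3 and stays there.
trace = [0, 1, 2, 3, 3]
```

On this trace, `always(lambda x: x >= 0, trace)` and `eventually_always(lambda x: x == 3, trace)` are true, while `always(lambda x: x == 3, trace)` is false because the counter starts below 3.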
46. Other model checkers besides TLA+
https://en.wikipedia.org/wiki/List_of_model_checking_tools