As developers, we have a number of well-known practices to ensure code quality, such as unit tests, code review and so on. But these practices often break down when we need to design concurrent systems. Often, there can be subtle and serious bugs that are not found with conventional practices.
But there’s another approach that you can use -- model-checking -- that can detect potential concurrency errors at design time, and so dramatically increase your confidence in your code. In this talk, I’ll demonstrate and demystify TLA+, a powerful design and model-checking system. We’ll see how it can check your concurrent designs for errors, saving you time up front and frustration later!
Building confidence in concurrent code with a model checker: TLA+ for programmers
1. Building confidence in
concurrent code
using a model checker
(aka TLA+ for programmers)
@ScottWlaschin
fsharpforfunandprofit.com
Warning – this talk will have
too much information!
2. People who have
written concurrent
code
People who have had
weird painful bugs in
concurrent code
Why concurrent code in particular?
4. People who have
written concurrent
code
People who have had
weird painful bugs in
concurrent code
A perfect circle
Why concurrent code in particular?
9. Tools to improve confidence
All of the above, plus
• "Model checking"
11. What is "model checking"?
• Use a special DSL to design a "model"
• Then "check" the model:
– Are all the constraints met?
– Does anything unexpected happen?
– Does it deadlock?
• This is part of a "formal methods" approach
12. Two popular model checkers
• TLA+ (TLC)
– Focuses on temporal properties
– Good for modeling concurrent systems
• Alloy (Alloy Analyzer)
– Focuses on relational logic
– Good for modeling structures
15. Here's what TLA+ looks like
Start(s) == serverState[s] = "online_v1"
    /\ ~(\E other \in servers : serverState[other] = "offline")
    /\ serverState' = [serverState EXCEPT ![s] = "offline"]
Finish(s) == serverState[s] = "offline"
    /\ serverState' = [serverState EXCEPT ![s] = "online_v2"]
UpgradeStep == \E s \in servers : Start(s) \/ Finish(s)
Done == \A s \in servers : serverState[s] = "online_v2"
    /\ UNCHANGED serverState
Spec == /\ Init /\ [][Next]_serverState
    /\ WF_serverState(UpgradeStep)
By the end of the talk you should be
able to make sense of it!
19. Outline of this talk
• How confident are you?
• Introducing TLA+
• Examples:
– Using TLA+ for a simple model
– Checking a Producer/Consumer model
– Checking a zero-downtime deployment model
21. To sort a list:
1) If the list is empty or has 1 element, it is already sorted.
So just return it unchanged.
2) Otherwise, take the first element (called the "pivot")
3) Divide the remaining elements into two piles:
* those < than the pivot
* those > than the pivot
4) Sort each of the two piles using this sort algorithm
5) Return the sorted list by concatenating:
* the sorted "smaller" list
* then the pivot
* then the sorted "bigger" list
Here's a spec for a sort algorithm
Link to live poll: bit.ly/tlapoll
23. Poll #2 results:
"What is your confidence in
the design of this sort algorithm?"
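One way to test your confidence is to transcribe the spec literally and run it. The Python sketch below is an illustration added to this transcript, not code from the talk, and `sort_per_spec` is a hypothetical name. Read literally, step 3 has no pile for elements equal to the pivot, so the sketch drops duplicates.

```python
def sort_per_spec(items):
    """Sort by following the five steps of the spec exactly as written."""
    # 1) Empty or single-element lists are already sorted.
    if len(items) <= 1:
        return list(items)
    # 2) Take the first element as the pivot.
    pivot, rest = items[0], items[1:]
    # 3) Two piles: strictly smaller and strictly bigger than the pivot.
    smaller = [x for x in rest if x < pivot]
    bigger = [x for x in rest if x > pivot]
    # 4) and 5) Sort each pile, then concatenate around the pivot.
    return sort_per_spec(smaller) + [pivot] + sort_per_spec(bigger)


print(sort_per_spec([3, 1, 2]))     # [1, 2, 3]
print(sort_per_spec([2, 1, 2, 3]))  # [1, 2, 3]: the second 2 fell into neither pile
```

A spec can read as obviously correct and still lose data on an input nobody thought to try.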
25. Some approaches to gain confidence
• Careful inspection and code review
• Create an implementation
and then test it thoroughly
– E.g. Using property-based tests
• Use a mathematical proof assistant tool
27. A concurrent producer/consumer system
[Diagram: Producer writes to queue; Consumer reads from queue]
Consumer spec (2 separate steps)
1) Check if queue is not empty
2) If true, then read item from queue
Producer spec (2 separate steps)
1) Check if queue is not full
2) If true, then write item to queue
28. Given a bounded queue of items
And 1 producer, 1 consumer running concurrently
Constraints:
* never read from an empty queue
* never add to a full queue
Producer spec (separate steps)
1) Check if queue is not full
2) If true, then write item to queue
3) Go to step 1
Consumer spec (separate steps)
1) Check if queue is not empty
2) If true, then read item from queue
3) Go to step 1
A spec for a producer/consumer system
29. Poll #3 results:
"What is your confidence in the design
of this producer/consumer system?"
30. Given a bounded queue of items
And 2 producers, 2 consumers running concurrently
Constraints:
* never read from an empty queue
* never add to a full queue
Producer spec (separate steps)
1) Check if queue is not full
2) If true, then write item to queue
3) Go to step 1
Consumer spec (separate steps)
1) Check if queue is not empty
2) If true, then read item from queue
3) Go to step 1
A spec for a producer/consumer system
31. Poll #4 results:
"What is your confidence in the design
of this producer/consumer system
(now with multiple clients)?"
33. How to gain confidence for concurrency?
• Careful inspection and code review
– Human intuition for concurrency is very bad
• Create an implementation and then test it
– Many concurrency errors might never show up
• Use a mathematical proof assistant tool
– A model checker is much easier!
36. TLA+ was designed by Leslie Lamport
– Famous "Time & Clocks" paper
– Paxos algorithm for consensus
– Turing award winner
– Initial developer of LaTeX
44. Boolean Logic
Boolean   Mathematics   TLA+      Programming
AND       a ∧ b         a /\ b    a && b
OR        a ∨ b         a \/ b    a || b
NOT       ¬a            ~a        !a; not a
You all know how
this works, I hope!
45. Boolean Logic
A "predicate" is an expression that returns a boolean
\* TLA-style definition
operator(a,b,c) ==
    (a /\ b) \/ (a /\ ~c)
// programming language definition
function(a,b,c) {
    (a && b) || (a && !c)
}
50. States and transitions in TLA+
[Diagram: "hello" -> "goodbye"; the transition is an "action"]
State before, in TLA+: state = "hello"
State after, in TLA+: state' = "goodbye"
51. States and transitions in TLA+
[Diagram: "hello" -- Next --> "goodbye"]
In TLA+, define the action "Next" like this
Next ==
    state = "hello"
    /\ state' = "goodbye"
Or in English:
state before is "hello"
AND state after is "goodbye"
54. Actions are not assignments.
Actions are tests
state = "hello" /\ state' = "goodbye"
"hello" -> "goodbye"   Does match
"hello" -> "ciao"      Doesn't match
"howdy" -> "goodbye"   Doesn't match
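To make "actions are tests" concrete, here is a small Python sketch (added for illustration, with the hypothetical name `next_action`). The action is a predicate over a pair of states, before and after. It assigns nothing.

```python
def next_action(before: str, after: str) -> bool:
    # Mirrors the action: state = "hello" AND state' = "goodbye"
    return before == "hello" and after == "goodbye"


print(next_action("hello", "goodbye"))  # True
print(next_action("hello", "ciao"))     # False
print(next_action("howdy", "goodbye"))  # False
```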
56. TLA+ models a series of state transitions over time
In TLA+ you can ask questions like:
• Is something always true?
• Is something ever true?
• If X happens, must Y happen afterwards?
57. Temporal Logic of Actions
Boolean logic of state transitions over time
82. Staying in the same state is
almost always a valid state transition!
What is the difference between these two systems?
[Diagram: one system steps through 1, 2, 3; the other repeats states (1, 1, 2, 2, 3, 3) via the transitions 1 -> 1, 2 -> 2, 3 -> 3]
83. "Count to three" with stuttering
Init == x=1
Step1 == x=1 /\ x'=2
Step2 == x=2 /\ x'=3
Done == x=3 /\ UNCHANGED x
Next == Step1 \/ Step2 \/ Done \/ UNCHANGED x
1 2 3
85. Temporal properties
A property applies to the whole system over time
– Not just to individual states
Checking these properties is important
– Humans are bad at this
– Programming languages are bad at this too
– TLA+ is good at this!
86. Useful properties to check
• Always true
– For all states, "x > 0"
• Eventually true
– At some point in time, "x = 2"
• Eventually always
– x eventually becomes 3 and then stays there
• Leads to
– if x ever becomes 2 then it will become 3 later
90. Properties for "count to three"
In English                                         Formally                      In TLA+
x is always > 0                                    Always (x > 0)                [] (x > 0)
At some point x is 2                               Eventually (x = 2)            <> (x = 2)
x eventually becomes 3 and then stays there.       Eventually (Always (x = 3))   <>[] (x = 3)
If x ever becomes 2 then it will become 3 later.   (x = 2) leads to (x = 3)      (x = 2) ~> (x = 3)
91. Adding properties to the script
\* Always, x >= 1 && x <= 3
AlwaysWithinBounds == [](x >= 1 /\ x <= 3)
\* At some point, x = 2
EventuallyTwo == <>(x = 2)
\* At some point, x = 3 and stays there
EventuallyAlwaysThree == <>[](x = 3)
\* Whenever x=2, then x=3 later
TwoLeadsToThree == (x = 2) ~> (x = 3)
92. Tell the model checker what
the properties are,
and run the model checker again
Adding properties to the model in the TLA+ toolbox
94. Poll #5 results:
"How many of these
properties are true?"
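The poll question can be answered mechanically for a system this small. The Python sketch below is an added illustration, not the TLC model checker, and all function names are hypothetical. It walks the "count to three" state graph and checks the four properties with no fairness condition assumed. With stuttering allowed, a behaviour may repeat the same state forever, so only the bounds property holds.

```python
def next_states(x, stutter):
    """Successors of x under Step1, Step2 and Done, plus optional stuttering."""
    succ = set()
    if x == 1:
        succ.add(2)          # Step1
    if x == 2:
        succ.add(3)          # Step2
    if x == 3:
        succ.add(3)          # Done: x = 3 and x is unchanged
    if stutter:
        succ.add(x)          # UNCHANGED x: any state may repeat
    return succ


def reachable(init, stutter):
    seen, todo = {init}, [init]
    while todo:
        for nxt in next_states(todo.pop(), stutter):
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return seen


def can_avoid_forever(start, goal, stutter):
    """True if some infinite behaviour from `start` never reaches a goal state."""
    avoid = {s for s in reachable(start, stutter) if not goal(s)}
    changed = True
    while changed:
        changed = False
        for s in list(avoid):
            # Drop states whose every successor leaves the avoiding set.
            if not (next_states(s, stutter) & avoid):
                avoid.discard(s)
                changed = True
    return start in avoid


def check_properties(stutter):
    states = reachable(1, stutter)
    is_two = lambda x: x == 2
    is_three = lambda x: x == 3
    return {
        "AlwaysWithinBounds": all(1 <= x <= 3 for x in states),
        "EventuallyTwo": not can_avoid_forever(1, is_two, stutter),
        # 3 can only step to 3, so "eventually always 3" reduces to "eventually 3" here.
        "EventuallyAlwaysThree": next_states(3, stutter) == {3}
        and not can_avoid_forever(1, is_three, stutter),
        "TwoLeadsToThree": not any(
            can_avoid_forever(x, is_three, stutter) for x in states if is_two(x)
        ),
    }


print(sum(check_properties(stutter=True).values()))   # 1
print(sum(check_properties(stutter=False).values()))  # 4
```

The `stutter=False` run is there only for contrast. It shows that the three "eventually" properties fail because of stuttering, not because of the steps themselves.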
110. Modeling a Producer/Consumer system
[Diagram: Producer writes to queue; Consumer reads from queue]
Consumer spec (2 separate steps)
1) Check if queue is not empty
2) If true, then read item from queue
Producer spec (2 separate steps)
1) Check if queue is not full
2) If true, then write item to queue
120. And if we run this script?
• Detects "8 distinct states"
– Good
• No errors!
– Means invariant was always true.
– We now have confidence in this design!
– But only with a single producer/consumer
We don't need to guess, as
we did in the earlier poll!
123. TLA plus… Set theory
Set theory                                                           Mathematics     TLA+             Programming
e is an element of set S                                             e ∈ S           e \in S
Define a set by enumeration                                          {1,2,3}         {1,2,3}          [1,2,3]
Define a set by predicate "p"                                        { e ∈ S | p }   {e \in S : p}    Set.filter(p)
For all e in Set, some predicate "p" is true                         ∀ e ∈ S : p     \A e \in S : p   Set.all(p)
There exists e in Set such that some predicate "p" is true           ∃ e ∈ S : p     \E e \in S : p   Set.any(p)
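For readers who think in code, the rows of the table map onto everyday Python. This is a small added illustration. The set `S` and the predicate `p` ("e is even") are arbitrary examples, not taken from the talk.

```python
S = {1, 2, 3, 4}


def p(e):
    return e % 2 == 0              # example predicate: "e is even"


print(3 in S)                      # True    e \in S
print({e for e in S if p(e)})      # {2, 4}  {e \in S : p}
print(all(p(e) for e in S))        # False   \A e \in S : p
print(any(p(e) for e in S))        # True    \E e \in S : p
```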
126. • We need
– a set of producers
– a set of consumers
• Need to use the set-description part of TLA+
producers={"p1","p2"}
consumers={"c1","c2"}
128. Producer/Consumer Spec, part 1
CONSTANT producers, consumers
    \* e.g.
    \* 2 producers={"p1","p2"}
    \* 2 consumers={"c1","c2"}
VARIABLES queueSize, producerState, consumerState
MaxQueueSize == 2
Init ==
    queueSize = 0
    /\ producerState = [p \in producers |-> "ready"]
       \* same as {"p1":"ready","p2":"ready"}
    /\ consumerState = [c \in consumers |-> "ready"]
For each producer, set
the state to be "ready"
130. CheckWritable(p) ==
    producerState[p] = "ready"
    /\ queueSize < MaxQueueSize
    /\ producerState' =
        [producerState EXCEPT ![p] = "canWrite"]
    /\ UNCHANGED queueSize
    /\ UNCHANGED consumerState
Parameterized by a producer
Update one element of the
state map/dictionary only
Check the state
134. CheckReadable(c) ==
    consumerState[c] = "ready"
    /\ queueSize > 0
    /\ consumerState' =
        [consumerState EXCEPT ![c] = "canRead"]
    /\ UNCHANGED queueSize
    /\ UNCHANGED producerState
Parameterized by a consumer
Update one element of the
state map/dictionary only
Check the state
136. CheckReadable(c) ==
    consumerState[c] = "ready"
    /\ queueSize > 0
    /\ consumerState' =
        [consumerState EXCEPT ![c] = "canRead"]
    /\ UNCHANGED queueSize
    /\ UNCHANGED producerState
Read(c) ==
    consumerState[c] = "canRead"
    /\ queueSize' = queueSize - 1
    /\ consumerState' =
        [consumerState EXCEPT ![c] = "ready"]
    /\ UNCHANGED producerState
ConsumerAction ==
    \E c \in consumers : CheckReadable(c) \/ Read(c)
Find any consumer which has a valid action
137. And if we run this script?
• Run model checker with 2 producers, 2 consumers
– And same "AlwaysWithinBounds" property
• Detects 38 distinct states now
– Too many for human inspection
• Error: "Invariant AlwaysWithinBounds is violated"
– We are confident that this design doesn't work!
We don't need to guess, as
we did in the earlier poll!
138. Fixing the error
• TLA+ won't tell you how to fix it
– You have to think!
• But it is easy to test fixes:
– Update the model with the fix
• Atomic operations (or locks, or whatever)
– Then rerun the model checker
– You have confidence that the fix works (or not!)
• All this in only 50 lines of code
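To show roughly what the model checker is doing, here is a breadth-first state explorer in Python for the producer/consumer design. It is an added sketch, not TLC. It rests on three assumptions: the producer's write step mirrors `Read(c)`, `AlwaysWithinBounds` means `0 <= queueSize <= MaxQueueSize`, and the `atomic` flag stands in for one possible fix (check and act as a single step).

```python
from collections import deque

MAX_QUEUE_SIZE = 2


def successors(state, atomic):
    """Yield every state reachable in one step by a single producer or consumer."""
    size, producers, consumers = state
    for i, p in enumerate(producers):
        if atomic:
            if size < MAX_QUEUE_SIZE:                    # check and write in one step
                yield (size + 1, producers, consumers)
        elif p == "ready" and size < MAX_QUEUE_SIZE:     # CheckWritable
            yield (size, producers[:i] + ("canWrite",) + producers[i + 1:], consumers)
        elif p == "canWrite":                            # Write (assumed mirror of Read)
            yield (size + 1, producers[:i] + ("ready",) + producers[i + 1:], consumers)
    for i, c in enumerate(consumers):
        if atomic:
            if size > 0:                                 # check and read in one step
                yield (size - 1, producers, consumers)
        elif c == "ready" and size > 0:                  # CheckReadable
            yield (size, producers, consumers[:i] + ("canRead",) + consumers[i + 1:])
        elif c == "canRead":                             # Read
            yield (size - 1, producers, consumers[:i] + ("ready",) + consumers[i + 1:])


def check(n_producers, n_consumers, atomic=False):
    """Return (states_seen, trace), where trace is None if the invariant always holds."""
    init = (0, ("ready",) * n_producers, ("ready",) * n_consumers)
    parent = {init: None}
    todo = deque([init])
    while todo:
        state = todo.popleft()
        if not 0 <= state[0] <= MAX_QUEUE_SIZE:          # AlwaysWithinBounds (assumed)
            trace = []
            while state is not None:                     # walk back to the initial state
                trace.append(state)
                state = parent[state]
            return len(parent), trace[::-1]
        for nxt in successors(state, atomic):
            if nxt not in parent:
                parent[nxt] = state
                todo.append(nxt)
    return len(parent), None


print(check(1, 1))                         # (8, None)
print(check(2, 2)[1] is not None)          # True: some interleaving breaks the bounds
print(check(2, 2, atomic=True)[1])         # None
```

With one producer and one consumer the sketch visits 8 states, matching the count reported earlier. With two of each, the returned trace shows a step-by-step interleaving that breaks the bounds.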
140. Using TLA+ as a tool to improve design
The process is:
– Sketch the design in TLA+
– Then check it with the model checker
– Then fix it
– Then check it again
– Repeat until TLA+ says the design is correct
Think of it as TDD but for concurrency design
[Diagram: Red, Green, Remodel cycle]
141. Modeling a zero-downtime deployment
What to model
– We have a bunch of servers
– Each server must be upgraded from v1 to v2
– Each server goes offline during the upgrade
Conditions to check
– There must always be an online server
– All servers must be upgraded eventually
Idea credit: https://www.hillelwayne.com/post/modeling-deployments/
142. Sketching the design
[Diagram, "Server state": Online(v1) -- Start --> Offline -- Finish --> Online(v2), Done]
\* a dictionary of key/value pairs: server => state
VARIABLES serverState
Init == serverState = [s \in servers |-> "online_v1"]
Start(s) ==
    serverState[s] = "online_v1"
    /\ serverState' = [serverState EXCEPT ![s] = "offline"]
Finish(s) ==
    serverState[s] = "offline"
    /\ serverState' = [serverState EXCEPT ![s] = "online_v2"]
143. Sketching the design
[Diagram, "Server state": Online(v1) -- Start --> Offline -- Finish --> Online(v2), Done]
\* try to find a server to start or finish
UpgradeStep == \E s \in servers : Start(s) \/ Finish(s)
\* done if ALL servers are finished
Done ==
    \A s \in servers : serverState[s] = "online_v2"
    /\ UNCHANGED serverState
\* overall state transition
Next == UpgradeStep \/ Done
144. Stop and check
• Run the script now to check our assumptions
– With 1 server: 3 distinct states (as expected)
– With 2 servers: 9 distinct states
– With 3 servers: 27 distinct states
• The number of states gets large very quickly!
– Eyeballing for errors will not work
145. Now let's add some properties
• Zero downtime
– "Not all servers should be offline at once"
• Upgrade should complete
– "All servers should eventually be upgraded to v2"
Temporal properties
146. Temporal properties
\* It is always true that there exists
\* a server that is not offline (!= is /= in TLA)
ZeroDowntime ==
    [](\E s \in servers : serverState[s] /= "offline")
Always, there exists a server, such that the state for that server is not "offline"
147. Temporal properties
\* Eventually, all servers will be online at v2
EventuallyUpgraded ==
    <>(\A s \in servers : serverState[s] = "online_v2")
Eventually, for all servers, the state for that server is "v2"
\* It is always true that there exists
\* a server that is not offline (!= is /= in TLA)
ZeroDowntime ==
    [](\E s \in servers : serverState[s] /= "offline")
148. Running the script
If we run this script with two servers
Error: "Invariant ZeroDowntime is violated"
The model checker trace shows us how:
s1 -> "online_v1", s2 -> "online_v1"
s1 -> "offline", s2 -> "online_v1"
s1 -> "offline", s2 -> "offline" // boom!
No problem, we think we
have a fix for this
149. Improving the design with upgrade condition
Start(s) ==
    \* server is ready
    serverState[s] = "online_v1"
    \* NEW: there does not exist any other server which is offline
    /\ ~(\E other \in servers : serverState[other] = "offline")
    \* then transition
    /\ serverState' = [serverState EXCEPT ![s] = "offline"]
A new condition for the Start action:
You can only transition to "offline" if no other servers are offline.
150. Running the script
Now re-run this script with two servers
• "ZeroDowntime" works
– We have confidence in the design!
• "EventuallyUpgraded" fails
– Because of stuttering
– But add fairness and it works again, yay!
We now have confidence in the design!
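The zero-downtime check can also be reproduced by brute force. The Python sketch below is an added illustration with hypothetical names, not TLC. It enumerates every reachable server-state combination, first with the original `Start` and then with the new guard, and looks for a state where every server is offline. It covers only the `ZeroDowntime` invariant, not `EventuallyUpgraded` or fairness.

```python
from collections import deque


def successors(state, guarded):
    """Yield the result of Start(s) or Finish(s) for every server s."""
    for i, s in enumerate(state):
        # Start: allowed from online_v1; the guarded version also requires
        # that no server is currently offline.
        if s == "online_v1" and not (guarded and "offline" in state):
            yield state[:i] + ("offline",) + state[i + 1:]
        elif s == "offline":                             # Finish
            yield state[:i] + ("online_v2",) + state[i + 1:]


def check_zero_downtime(n_servers, guarded):
    """Return (distinct_states, first_all_offline_state_or_None)."""
    init = ("online_v1",) * n_servers
    seen, todo, violation = {init}, deque([init]), None
    while todo:
        state = todo.popleft()
        if violation is None and all(s == "offline" for s in state):
            violation = state
        for nxt in successors(state, guarded):
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return len(seen), violation


print([check_zero_downtime(n, guarded=False)[0] for n in (1, 2, 3)])  # [3, 9, 27]
print(check_zero_downtime(2, guarded=False))   # (9, ('offline', 'offline'))
print(check_zero_downtime(2, guarded=True))    # (8, None)
```

The unguarded state counts match the 3, 9 and 27 reported earlier. With the guard, the single all-offline state is no longer reachable for two servers.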
151. Adding another condition
New rule! All online servers must be running the same version
\* Define the set of servers which are online.
OnlineServers ==
    { s \in servers : serverState[s] /= "offline" }
\* It is always true that
\* any two online servers are the same version
SameVersion ==
    [] (\A s1,s2 \in OnlineServers :
        serverState[s1] = serverState[s2])
152. Running the script
Now run this script with the new property
Error "Invariant SameVersion is violated"
The model checker trace shows us how:
s1 -> "online_v1", s2 -> "online_v1"
s1 -> "offline", s2 -> "online_v1"
s1 -> "online_v2", s2 -> "online_v1" // boom!
Let's add a load balancer to fix this
153. Improving the design with a load balancer
VARIABLES serverState, loadBalancer
\* initialize all servers to "online_v1"
Init == serverState = [s \in servers |-> "online_v1"]
    /\ loadBalancer = "v1"
\* the online servers depend on the load balancer
OnlineServers ==
    IF loadBalancer = "v1"
    THEN { s \in servers : serverState[s] = "online_v1" }
    ELSE { s \in servers : serverState[s] = "online_v2" }
The load balancer points to only "v1" or "v2" servers
154. Improving the design with a load balancer
Finish(s) ==
    serverState[s] = "offline"
    /\ serverState' = [serverState EXCEPT ![s] = "online_v2"]
    \* and load balancer can point to v2 pool now
    /\ loadBalancer' = "v2"
Then, when one server has successfully upgraded,
the load balancer can switch over to using v2
155. Running the script
Now re-run this script with the load balancer
• "ZeroDowntime" works
• "EventuallyUpgraded" works
• "SameVersion" works
156. Our sketch is complete (for now)
Think of TLA+ as "agile" modeling
for software systems
A few minutes of sketching =>
much more confidence!
157. Some common questions
• How to handle failures?
– Just add failure cases to the state diagram!
• How does this model convert to code?
– It doesn't! Modeling is a tool for thinking, not a
code generator.
– It's about having confidence in the design.
158. Conclusion
• TLA+ and model checking is not that scary
– It's just agile modeling for software systems!
– For concurrency, it's essential
– Check it out! A bigger toolbox is a good thing to have
• TLA+ can do much more than I showed today
– Not just model checking, but refinements, proofs, etc
• More information:
– TLA+ Home Page with videos, book, papers, etc
– learntla.com book (and trainings!) by Hillel Wayne
159. Slides and video here
fsharpforfunandprofit.com/tlaplus
Thank you!
"Domain Modeling Made Functional" book
fsharpforfunandprofit.com/books
@ScottWlaschin Me on twitter