2. Error Handling in Concurrent Systems
Aka Building Concurrent Systems in a Hostile Environment
Turning the dumpster fire we have into the one we deserve.
3. Hi
I’m Angus, I guess.
I work at LiveOps Cloud. My opinions are my own
(as much as you can own an opinion, man).
@angusiguess on Twitter, angusiguess on GitHub.
I like bikes. A lot of bikes. A lot.
4. Why I’m interested
This time last year, I was interested in systems working, so I
talked about correctness.
A lot has happened since then.
5. Why I’m interested
Namely, a lot of my code has gone to production.
And a lot of that code has failed.
Sometimes silently.
And things have gotten weird.
6. So I did some reading
Because someone smarter than me probably solved
this in the 60’s through 80’s.
And I found this:
"Making reliable distributed systems in the presence of
software errors"
The Open Telecon Platform Model
7. An almost certainly reductionist history of computing.
For a while, computers could work synchronously.
Instructions could be processed in order.
A lot of tricks to deal with I/O, memory mapping, hardware.
8. Then communication networks happened:
Computing borrowed ideas from railroads and telegraphs
Then computers were used to drive phones
Then phones were used to connect computers.
9. Gave rise to two obvious paradigms
Sequential
Concurrent
10. Modelling problems
A lot of computation benefits from being modelled
sequentially.
Problems where order matters
Numerical problems
Reading from and writing to things
Even executing programs
11. Modelling problems
A lot of computation suffers when modelled sequentially.
Communication
Sensory data
Modelling things affecting each other rather than the world
affecting things
12. When communication gets important, so does concurrency
1986, Joe Armstrong starts work on Erlang, to program
telephone systems.
13. Erlang is a strange language
Doesn’t like to share memory
Programs are split into processes
Processes have to send messages to each other
No guarantees that a message has been received
Processes don’t always know where to find each other
14. Everything old is new again.
2013, the Clojure core team starts work on core.async, based on
Go’s goroutines
goroutines don’t share memory
goroutines communicate by putting messages on channels
No guarantees about a message being received
no way to even determine who is listening to a channel
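A minimal sketch of that model in Go (the names here are made up for illustration): two goroutines share nothing, the producer just puts messages on a channel with no idea who, if anyone, is listening.

```go
package main

import "fmt"

// producer sends messages on a channel; it shares no memory with the
// consumer and gets no guarantee about who (if anyone) is listening.
func producer(out chan<- string) {
	out <- "hello"
	out <- "world"
	close(out)
}

func main() {
	ch := make(chan string)
	go producer(ch)
	// The consumer only sees what arrives on the channel.
	for msg := range ch {
		fmt.Println(msg)
	}
}
```

Note that if nobody ever reads from `ch`, the producer simply blocks: the channel itself is the whole contract between the two sides.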
15. What fresh hell is this?
These seem like strong constraints.
Why assume them?
16. Shared Memory
Suppose process P1 and P2 each have a list of instructions
P1 and P2 start executing at roughly the same time,
modifying memory.
We can’t guarantee the order that P1 and P2 will interleave
How can we write safe programs?
Well, we kind of can’t. We can write some safe programs, though.
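One of the safe programs we can write: instead of letting P1 and P2 write to shared memory in an unpredictable interleaving, give the state to a single owner goroutine and make everyone else send it messages. A sketch (the counter setup is invented for illustration):

```go
package main

import "fmt"

// counter owns the total; nobody else touches it. Updates arrive as
// messages, so their interleaving can't corrupt the state.
func counter(incr <-chan int, result chan<- int) {
	total := 0
	for n := range incr {
		total += n
	}
	result <- total
}

func main() {
	incr := make(chan int)
	result := make(chan int)
	go counter(incr, result)

	done := make(chan bool)
	// Two "processes" racing to bump the count; the owner goroutine
	// serializes their messages, so the total is always 2000.
	for p := 0; p < 2; p++ {
		go func() {
			for i := 0; i < 1000; i++ {
				incr <- 1
			}
			done <- true
		}()
	}
	<-done
	<-done
	close(incr)
	fmt.Println(<-result) // always 2000, regardless of interleaving
}
```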
17. Function Calls
Depend on the existence of a receiving function.
Couple the caller to the receiver
18. Not knowing about places
Assume the receiver will be there when we ask for something
Also a way to enforce no shared state
19. Still unclear
Synchronous systems fail as one.
Like a magic eight ball.
Concurrent systems fail partially
Like a highway or a casino
20. We can’t assume that all of our system will be intact
How are we supposed to work like this?
I quit
I always wanted to be a bike messenger anyway
21. No wait don’t go!
We can fix it
We just have to change how we think
Haha jk don’t try to fix it
22. Rule #1: Don’t try to fix it.
If we have a single process, we can try as hard as we want
before we fail
Things will either work, kind of work, or not work at all
If we have lots of processes we have to think about all the
ways a piece could fail.
It’s too much, so what if we just don’t?
23. Exceptions, Errors, and Failures
Exceptions are when the runtime hits something unspecified
Errors are when programmers don’t know what to do
Failures are when the system doesn’t know what to do about
programmers not knowing what to do.
24. Why let it crash then?
If a small piece of a system fails, we probably know what to do
with it.
25. Let’s try this really quick.
We’re processing a stream of events that looks like this:
[num-of-events, num-of-seconds]
We want to track the total events per second to take an
average later.
(+ acc (divider event))
acc = acc + divider(event)
26. Let’s try this really quick.
We get an event [0, 0]
Our code throws an exception, we can catch it before addition
happens.
What would we want from the function call?
27. Let’s try this really quick.
What if our code looked like:
(* acc (divider event))
acc = acc * divider(event)
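Sketching the stream example in Go (the `divider` name comes from the slides; the error handling is one possible policy, not the only one): a `[0, 0]` event has no sensible rate, so `divider` refuses to guess. With addition we could treat the bad event as 0 and move on; with multiplication a silently substituted 0 would destroy the accumulator forever, which is exactly why guessing a value deep in the call is dangerous.

```go
package main

import (
	"errors"
	"fmt"
)

// divider turns an event [numEvents, numSeconds] into an events-per-second
// rate. A [0, 0] event is meaningless, so we refuse to invent a value.
func divider(event [2]int) (float64, error) {
	if event[1] == 0 {
		return 0, errors.New("divider: zero-second event")
	}
	return float64(event[0]) / float64(event[1]), nil
}

func main() {
	acc := 0.0
	events := [][2]int{{10, 2}, {0, 0}, {6, 3}}
	for _, e := range events {
		rate, err := divider(e)
		if err != nil {
			// Safe for (+ acc ...): dropping the event is like adding 0.
			// For (* acc ...) no substituted value is safe, so the caller
			// has to decide; here we report and skip.
			fmt.Println("dropping bad event:", err)
			continue
		}
		acc += rate
	}
	fmt.Println(acc) // 10/2 + 6/3 = 7
}
```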
28. What about a database?
We request something from a database and:
The query is wrong
Crash the process
The query fails.
Try again, could be a connection blip.
The query times out.
Maybe chill out there for a second, no sense in knocking our
database over.
29. Rule #2: Ask for help
If a small part of a program doesn’t know what to do, maybe a
larger part will.
30. Supervisors
Processes that watch other processes and decide how to act.
A supervisor can restart a process or fail and throw an
exception.
Supervisors decouple error handling from business logic
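A toy supervisor in Go (Erlang supervisors are much richer; this is just the shape of the idea, with invented names): the worker stays pure business logic, and the decision to restart or give up lives entirely in the supervisor.

```go
package main

import "fmt"

// supervise runs worker and, if it panics, restarts it up to maxRestarts
// times. When it runs out of restarts it fails upward, so a bigger
// supervisor can decide what to do.
func supervise(worker func() error, maxRestarts int) error {
	var err error
	for i := 0; i <= maxRestarts; i++ {
		err = func() (err error) {
			defer func() {
				if r := recover(); r != nil {
					err = fmt.Errorf("worker crashed: %v", r)
				}
			}()
			return worker()
		}()
		if err == nil {
			return nil
		}
		fmt.Println("restarting after:", err)
	}
	return err
}

func main() {
	attempts := 0
	worker := func() error {
		attempts++
		if attempts < 3 {
			panic("transient fault")
		}
		return nil
	}
	fmt.Println(supervise(worker, 5)) // succeeds after two restarts
}
```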
31. Things fit together
We start to get an idea of how things fit together.
It’s easier to see how parts of a system should fail and interact.
Trees are pretty intuitive.
32. Maybe it’s time for an example
Matchmaking server for a multiplayer game.
Checks which players are available
Determines whether these two players can be routed to each
other
Sends off a command to create a session
33. Maybe it’s time for an example
API gets REST requests, updates system state.
Matchmaker searches state for good matches, checks to see if
a connection can be made, sends them to a game session
service.
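The pipeline shape of that system can be sketched with channels (the stage names follow the slides; the wiring and pairing logic are invented for illustration): each stage is its own process, connected to its neighbours only by messages.

```go
package main

import "fmt"

// matchmaker pairs up players as they arrive from the API stage and
// sends match commands on to the session stage. It holds no state the
// other stages can see.
func matchmaker(players <-chan string, sessions chan<- [2]string) {
	var waiting string
	for p := range players {
		if waiting == "" {
			waiting = p // hold this player until a partner shows up
			continue
		}
		sessions <- [2]string{waiting, p}
		waiting = ""
	}
	close(sessions)
}

func main() {
	players := make(chan string)
	sessions := make(chan [2]string)
	go matchmaker(players, sessions)

	go func() {
		// Stand-in for the API stage: REST requests become player events.
		for _, p := range []string{"ana", "ben", "cal", "dee"} {
			players <- p
		}
		close(players)
	}()

	// Stand-in for the game session service.
	for m := range sessions {
		fmt.Println("create session:", m[0], "vs", m[1])
	}
}
```

Because the stages only touch each other through channels, any one of them can crash and be restarted by a supervisor without the others needing to know.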
34. This seems nicer
We can reason about errors a little better
Parts of this system can run independently
It’s clearer what the system needs to run
35. So I guess my point is:
There are nice ways to model concurrent systems.
When building systems, think about ways to:
Isolate failure (let it crash)
Recover and operate partially
Cut down on dependencies
36. Shouts out to:
Joe Armstrong for writing an incredible dissertation and a cool
language.
My co-worker Simon Robinson for chatting with me long and
hard about availability in systems.
You
37. PS
If you want to do any of this
LiveOps is hiring
Get at me at the after party.
38. Never put in a questions slide, they said.
Questions?