2. Error Handling in Concurrent Systems
Aka Building Concurrent Systems in a Hostile Environment
Turning the dumpster fire we have into the one we deserve.
3. Hi
I’m Angus, I guess.
I work at LiveOps Cloud. My opinions are my own
(as much as you can own an opinion, man).
@angusiguess on Twitter, angusiguess on GitHub.
I like bikes. A lot of bikes. A lot.
4. Why I’m interested
This time last year, I was interested in systems working, so I
talked about correctness.
A lot has happened since then.
5. Why I’m interested
Namely, a lot of my code has gone to production.
And a lot of that code has failed.
Sometimes silently.
And things have gotten weird.
6. So I did some reading
Because someone smarter than me probably solved
this in the 60’s through 80’s.
And I found this:
"Making reliable distributed systems in the presence of
software errors"
The Open Telecon Platform Model
7. An almost certainly reductionist history of computing.
For a while, computers could work synchronously.
Instructions could be processed in order.
A lot of tricks to deal with I/O, memory mapping, hardware.
8. Then communication networks happened:
Computing borrowed ideas from railroads and telegraphs
Then computers were used to drive phones
Then phones were used to connect computers.
9. Gave rise to two obvious paradigms
Sequential
Concurrent
10. Modelling problems
A lot of computation benefits from being modelled
sequentially.
Problems where order matters
Numerical problems
Reading from and writing to things
Even executing programs
11. Modelling problems
A lot of computation suffers when modelled sequentially.
Communication
Sensory data
Modelling things affecting each other rather than the world
affecting things
12. When communication gets important, so does concurrency
1986, Joe Armstrong starts work on Erlang, to program
telephone systems.
13. Erlang is a strange language
Doesn’t like to share memory
Programs are split into processes
Processes have to send messages to each other
No guarantees that a message has been received
Processes don’t always know where to find each other
14. Everything old is new again.
2013, the Clojure core team starts work on core.async, based on
Go’s goroutines
goroutines don’t share memory
goroutines communicate by putting messages on channels
No guarantees about a message being received
no way to even determine who is listening to a channel
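A minimal sketch of that model in Go (the names here are made up for illustration): two goroutines share nothing, the producer just puts messages on a channel with no idea who, if anyone, is listening.

```go
package main

import "fmt"

// producer sends messages on a channel; it shares no memory with the
// consumer and gets no guarantee about who (if anyone) is listening.
func producer(out chan<- string) {
	out <- "hello"
	out <- "world"
	close(out)
}

func main() {
	ch := make(chan string)
	go producer(ch)
	// The consumer only sees what arrives on the channel.
	for msg := range ch {
		fmt.Println(msg)
	}
}
```

Note that if nobody ever reads from `ch`, the producer simply blocks: the channel itself is the whole contract between the two sides.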
15. What fresh hell is this?
These seem like strong constraints.
Why assume them?
16. Shared Memory
Suppose process P1 and P2 each have a list of instructions
P1 and P2 start executing at roughly the same time,
modifying memory.
We can’t guarantee the order that P1 and P2 will interleave
How can we write safe programs?
Well, we kind of can’t. We can write some safe programs, though.
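One of the safe programs we can write: instead of letting P1 and P2 write to shared memory in an unpredictable interleaving, give the state to a single owner goroutine and make everyone else send it messages. A sketch (the counter setup is invented for illustration):

```go
package main

import "fmt"

// counter owns the total; nobody else touches it. Updates arrive as
// messages, so their interleaving can't corrupt the state.
func counter(incr <-chan int, result chan<- int) {
	total := 0
	for n := range incr {
		total += n
	}
	result <- total
}

func main() {
	incr := make(chan int)
	result := make(chan int)
	go counter(incr, result)

	done := make(chan bool)
	// Two "processes" racing to bump the count; the owner goroutine
	// serializes their messages, so the total is always 2000.
	for p := 0; p < 2; p++ {
		go func() {
			for i := 0; i < 1000; i++ {
				incr <- 1
			}
			done <- true
		}()
	}
	<-done
	<-done
	close(incr)
	fmt.Println(<-result) // always 2000, regardless of interleaving
}
```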
17. Function Calls
Depend on the existence of a receiving function.
Couple the caller to the receiver
18. Not knowing about places
Assume the receiver will be there when we ask for something
Also a way to enforce no shared state
19. Still unclear
Synchronous systems fail as one.
Like a magic eight ball.
Concurrent systems fail partially
Like a highway or a casino
20. We can’t assume that all of our system will be intact
How are we supposed to work like this?
I quit
I always wanted to be a bike messenger anyway
21. No wait don’t go!
We can fix it
We just have to change how we think
Haha jk don’t try to fix it
22. Rule #1: Don’t try to fix it.
If we have a single process, we can try as hard as we want
before we fail
Things will either work, kind of work, or not work at all
If we have lots of processes we have to think about all the
ways a piece could fail.
It’s too much, so what if we just don’t?
23. Exceptions, Errors, and Failures
Exceptions are when the runtime hits something unspecified
Errors are when programmers don’t know what to do
Failures are when the system doesn’t know what to do about
programmers not knowing what to do.
24. Why let it crash then?
If a small piece of a system fails, we probably know what to do
with it.
25. Let’s try this really quick.
We’re processing a stream of events that looks like this:
[num-of-events, num-of-seconds]
We want to track the total events per second to take an
average later.
(+ acc (divider event))
acc = acc + divider(event)
26. Let’s try this really quick.
We get an event [0, 0]
Our code throws an exception, we can catch it before addition
happens.
What would we want from the function call?
27. Let’s try this really quick.
What if our code looked like:
(* acc (divider event))
acc = acc * divider(event)
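Sketching the stream example in Go (the `divider` name comes from the slides; the error handling is one possible policy, not the only one): a `[0, 0]` event has no sensible rate, so `divider` refuses to guess. With addition we could treat the bad event as 0 and move on; with multiplication a silently substituted 0 would destroy the accumulator forever, which is exactly why guessing a value deep in the call is dangerous.

```go
package main

import (
	"errors"
	"fmt"
)

// divider turns an event [numEvents, numSeconds] into an events-per-second
// rate. A [0, 0] event is meaningless, so we refuse to invent a value.
func divider(event [2]int) (float64, error) {
	if event[1] == 0 {
		return 0, errors.New("divider: zero-second event")
	}
	return float64(event[0]) / float64(event[1]), nil
}

func main() {
	acc := 0.0
	events := [][2]int{{10, 2}, {0, 0}, {6, 3}}
	for _, e := range events {
		rate, err := divider(e)
		if err != nil {
			// Safe for (+ acc ...): dropping the event is like adding 0.
			// For (* acc ...) no substituted value is safe, so the caller
			// has to decide; here we report and skip.
			fmt.Println("dropping bad event:", err)
			continue
		}
		acc += rate
	}
	fmt.Println(acc) // 10/2 + 6/3 = 7
}
```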
28. What about a database?
We request something from a database and:
The query is wrong
Crash the process
The query fails.
Try again, could be a connection blip.
The query times out.
Maybe chill out there for a second, no sense in knocking our
database over.
29. Rule #2: Ask for help
If a small part of a program doesn’t know what to do, maybe a
larger part will.
30. Supervisors
Processes that watch other processes and decide how to act.
A supervisor can restart a process or fail and throw an
exception.
Supervisors decouple error handling from business logic
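A toy supervisor in Go (Erlang supervisors are much richer; this is just the shape of the idea, with invented names): the worker stays pure business logic, and the decision to restart or give up lives entirely in the supervisor.

```go
package main

import "fmt"

// supervise runs worker and, if it panics, restarts it up to maxRestarts
// times. When it runs out of restarts it fails upward, so a bigger
// supervisor can decide what to do.
func supervise(worker func() error, maxRestarts int) error {
	var err error
	for i := 0; i <= maxRestarts; i++ {
		err = func() (err error) {
			defer func() {
				if r := recover(); r != nil {
					err = fmt.Errorf("worker crashed: %v", r)
				}
			}()
			return worker()
		}()
		if err == nil {
			return nil
		}
		fmt.Println("restarting after:", err)
	}
	return err
}

func main() {
	attempts := 0
	worker := func() error {
		attempts++
		if attempts < 3 {
			panic("transient fault")
		}
		return nil
	}
	fmt.Println(supervise(worker, 5)) // succeeds after two restarts
}
```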
31. Things fit together
We start to get an idea of how things fit together.
It’s easier to see how parts of a system should fail and interact.
Trees are pretty intuitive.
32. Maybe it’s time for an example
Matchmaking server for a multiplayer game.
Checks which players are available
Determines whether these two players can be routed to each
other
Sends off a command to create a session
33. Maybe it’s time for an example
API gets REST requests, updates system state.
Matchmaker searches state for good matches, checks to see if
a connection can be made, sends them to a game session
service.
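The pipeline shape of that system can be sketched with channels (the stage names follow the slides; the wiring and pairing logic are invented for illustration): each stage is its own process, connected to its neighbours only by messages.

```go
package main

import "fmt"

// matchmaker pairs up players as they arrive from the API stage and
// sends match commands on to the session stage. It holds no state the
// other stages can see.
func matchmaker(players <-chan string, sessions chan<- [2]string) {
	var waiting string
	for p := range players {
		if waiting == "" {
			waiting = p // hold this player until a partner shows up
			continue
		}
		sessions <- [2]string{waiting, p}
		waiting = ""
	}
	close(sessions)
}

func main() {
	players := make(chan string)
	sessions := make(chan [2]string)
	go matchmaker(players, sessions)

	go func() {
		// Stand-in for the API stage: REST requests become player events.
		for _, p := range []string{"ana", "ben", "cal", "dee"} {
			players <- p
		}
		close(players)
	}()

	// Stand-in for the game session service.
	for m := range sessions {
		fmt.Println("create session:", m[0], "vs", m[1])
	}
}
```

Because the stages only touch each other through channels, any one of them can crash and be restarted by a supervisor without the others needing to know.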
34. This seems nicer
We can reason about errors a little better
Parts of this system can run independently
It’s clearer what the system needs to run
35. So I guess my point is:
There are nice ways to model concurrent systems.
When building systems, think about ways to:
Isolate failure (let it crash)
Recover and operate partially
Cut down on dependencies
36. Shouts out to:
Joe Armstrong for writing an incredible dissertation and a cool
language.
My co-worker Simon Robinson for chatting with me long and
hard about availability in systems.
You
37. PS
If you want to do any of this
LiveOps is hiring
Get at me at the after party.
38. Never put in a questions slide, they said.
Questions?