
Practical Fault Tolerance in Elixir - Alexei Sholik

Elixir Club 10
March 17, 2018
Kyiv


  1. Practical Fault Tolerance in Elixir
     Alexei Sholik
     Elixir Club Kyiv, 17 Mar 2018
  2. About me
     Backend engineer at Contractbook.co.
     Co-host at BeamEaters podcast.
     Contributor to Elixir.
     github.com/alco
  3. What is fault tolerance?
  4. Why it’s important
  5. Why do only Erlang/Elixir communities seem to care about it?
  6. A practical example (demo)
  7. “Let it crash” is not the full story
  8. Fail fast → restart → try again
  9. Building blocks of fault tolerance
  10. Process
  11. Error
  12. Link
  13. Monitor
  14. def call(process, request, timeout) do
        monitor = Process.monitor(process)
        send(process, {:"$gen_call", {self(), monitor}, request})

        receive do
          {^monitor, reply} ->
            Process.demonitor(monitor, [:flush])
            {:ok, reply}

          {:DOWN, ^monitor, _, _, reason} ->
            exit(reason)
        after
          timeout ->
            Process.demonitor(monitor, [:flush])
            exit(:timeout)
        end
      end
  15. Task
  16. Supervisor
  17. Bohrbug
  18. Heisenbug
  19. Improving our example (demo)
  20. ...
      DB
      SuperDocs.Web.Endpoint
      <request process>
      <db_connection process>
      Ecto’s connection pool
      supervisor
  21. DB
      SuperDocs.Web.Endpoint
      <request process>
      <db_connection process>
      SuperDocs.TaskSup
      User.send_confirmation_email
      ...
  22. Durable message queue
      hex.pm/packages/amqp
  23. DB
      SuperDocs.Web.Endpoint
      <request process>
      <db_connection process>
      RabbitMQ
      SuperDocs.TaskSup
      ImportService.import_documents_for
      AMQPClient
      AMQPWorker
      AMQPWorker
      ...
  24. Other tools and techniques
  25. Remote error tracking
      Sentry, Rollbar, log aggregators
  26. Exponential backoff
      hex.pm/packages/db_connection
      hex.pm/packages/gen_retry
  27. Alternative supervisor
      hex.pm/packages/director
  28. And so on...
  29. Recap
      ● Trust in OTP…
      ● ...but don’t dismiss other tools
      ● Anticipate failures
      ● Isolate failures
      ● Fail fast, restart, try again
  30. Reading material
      ● Why do computers fail and what can be done about it?
      ● Making reliable distributed systems in the presence of software errors
      ● It's About the Guarantees
      ● On Erlang, State and Crashes
      ● Error Kernels
  31. Thank you! Questions?
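
Slides 15, 16 and 21 put work such as User.send_confirmation_email under a task supervisor (SuperDocs.TaskSup), so that a failure there stays isolated from the request process. The deck does not show that code, so the following is only a minimal sketch of the idea. The module name TaskSupSketch, the function run_isolated/3 and the :smtp_down reason are hypothetical, and using Task.Supervisor.async_nolink/2 (rather than a fire-and-forget start_child/2) is an assumption made here so that the outcome is observable.

```elixir
defmodule TaskSupSketch do
  # Hypothetical helper: runs `fun` under the given Task.Supervisor without
  # linking it to the caller, so a crash in the task comes back as a value
  # instead of taking the caller down with it.
  def run_isolated(sup, fun, timeout \\ 1_000) do
    task = Task.Supervisor.async_nolink(sup, fun)

    # Wait for the result; if none arrives in time, shut the task down.
    case Task.yield(task, timeout) || Task.shutdown(task) do
      {:ok, result} -> {:ok, result}
      {:exit, reason} -> {:error, reason}
      nil -> {:error, :timeout}
    end
  end
end

# An anonymous supervisor stands in for SuperDocs.TaskSup here.
{:ok, sup} = Task.Supervisor.start_link()

IO.inspect(TaskSupSketch.run_isolated(sup, fn -> 1 + 1 end))
# A simulated failure: the task exits, the caller keeps running.
IO.inspect(TaskSupSketch.run_isolated(sup, fn -> exit(:smtp_down) end))
```

The caller decides what to do with {:error, reason}: log it, report it to an error tracker, or try again later.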

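Slide 26 lists exponential backoff next to the "fail fast, restart, try again" theme. As a rough illustration of the technique only, and not the API of db_connection or gen_retry, a retry loop that doubles its delay after each failure might look like the sketch below. BackoffSketch, retry/3 and the {:ok, _} / {:error, _} return convention are assumptions made for this example, and jitter is left out for brevity.

```elixir
defmodule BackoffSketch do
  # Hypothetical helper: calls `fun` until it returns {:ok, value} or the
  # attempts run out, sleeping base_ms, 2 * base_ms, 4 * base_ms, ... between tries.
  def retry(fun, attempts \\ 5, base_ms \\ 100) when attempts > 0 do
    do_retry(fun, attempts, base_ms)
  end

  defp do_retry(fun, attempts_left, delay_ms) do
    case fun.() do
      {:ok, value} ->
        {:ok, value}

      # Failed, but there are attempts left: wait, then retry with double the delay.
      {:error, _reason} when attempts_left > 1 ->
        Process.sleep(delay_ms)
        do_retry(fun, attempts_left - 1, delay_ms * 2)

      # Failed on the last attempt: give up and return the error.
      {:error, reason} ->
        {:error, reason}
    end
  end
end

# A flaky operation that fails twice and then succeeds on the third call.
{:ok, counter} = Agent.start_link(fn -> 0 end)

flaky = fn ->
  n = Agent.get_and_update(counter, fn n -> {n + 1, n + 1} end)
  if n < 3, do: {:error, :unavailable}, else: {:ok, n}
end

IO.inspect(BackoffSketch.retry(flaky, 5, 10))
```

Growing the delay gives a struggling dependency (a database, a mail server) time to recover instead of being hit with immediate retries.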