
Systematic error management - we ported rudder to zio

This talk was given at ScalaIO 2019.
It explains how you can manage errors in a systematic way in your applications, and shows how we did it in Rudder with the functional library ZIO.

It presents 4 big principles which guide my work as a developer:
- 1/ Our work as developers is to discover and assess failure modes.
- 2/ ERRORS are a SOCIAL construction to give AGENCY to the receiver of the error.
- 3/ An application has always at least 3 kinds of users: users; devs; and ops. Don’t forget any.
- 4/ It’s YOUR work to choose the SEMANTIC between nominal case and error and KEEP your PROMISES.

The talk gives 5 guidelines to help you implement these principles. It also gives a brief glimpse of systems thinking, which you can explore in more detail in the related article "Understand things as interacting systems": https://medium.com/@fanf42/understand-things-as-interacting-systems-b273bdba5dec

If you have any questions, please ask: there are several ways to contact me at the end of the deck (slide 87)!

Published in: Software
  • It seems that SlideShare does not allow updating a deck after the first upload, which is a shame for a slide-sharing platform. And you can't add structured comments, which is a double shame. So I'm not sure how the question/answer I want to add will display: What about making impossible states unrepresentable from the beginning? That’s a very good point and you should ALWAYS try to do so. The idea is to change the method’s domain of definition (i.e., the shape of its parameters) so that it only accepts inputs that can’t raise errors. Typically, in my trivial “divide” example, we should have used a “non-zero integer” for the denominator input. Alexis King (@lexi_lambda) wrote a wonderful article on that, so just go read it, she explains it better than I can: “Parse, don’t validate” https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-validate/ We use that technique a lot in Rudder to drive understanding of what is possible. Each time we can restrict the domain of definition, we try to keep that information for later use. Typical examples: parsing a plugin license (we have 4 “xxxLicenses” classes depending on what we know about its state); validating a user policy (again, several “SomethingPolicyDraft” classes with the different hypotheses needed to build the “Something”). The general goal is the same as with error management: assess failure modes, give agency to users so they can react efficiently. There are still plenty of cases where that technique is hard to use (fuzzy business cases…) or is not what you are looking for (you just want to tell users whether something is the nominal case or not, and give them agency to react accordingly).
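
To make the “non-zero integer” idea concrete, here is a minimal sketch in plain Scala. `NonZeroInt` and its `parse` method are hypothetical names invented for this illustration; they are not taken from Rudder or from any library.

```scala
object NonZeroExample {
  // Hypothetical wrapper type: the only way to obtain one is `NonZeroInt.parse`,
  // so a value of this type is known to be different from zero.
  final class NonZeroInt private (val value: Int)

  object NonZeroInt {
    // Parsing happens once, at the boundary; `None` means the input was zero.
    def parse(i: Int): Option[NonZeroInt] =
      if (i == 0) None else Some(new NonZeroInt(i))
  }

  // Total: no input of these types can trigger a division by zero.
  def divide(a: Int, b: NonZeroInt): Int = a / b.value
}
```

The check is pushed to the place where the denominator enters the program, and `divide` itself no longer needs an error case.
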

  1. 1. Systematic error management in application We ported Rudder to ZIO 2019-10-30 francois@rudder.io @fanf42
  2. 2. Hi! François ARMAND, CTO and Founder. Devops automation/compliance app that manages tens of thousands of computers. Free Software Company: “Stay Up”
  3. 3. Hi! François ARMAND, CTO and Founder. Devops automation/compliance app that manages tens of thousands of computers. Free Software Company: “Stay Up”. Developer
  4. 4. Developer ? ● Model the world into code ○ Try to make it useful 4
  5. 5. Developer ? ● Model the world into code ○ Try to make it useful ● Nominal case necessary (of course) 5
  6. 6. Developer ? ● Model the world into code ○ Try to make it useful ● Nominal case necessary (of course) ● But not sufficient (models are false) ○ Bugs ○ Misunderstanding of needs ○ open world ○ Damn users using your app ■ often “me, 3 days in the future” 6
  7. 7. This talk ● systematic management of errors ● caveat emptor: ○ I’m a Scala dev, mainly ■ expect Scala terminology ■ statically typed language with union types, interfaces ○ application, not library ■ closed world (genericity is not the main goal)
  8. 8. This talk ● It's an important talk for me ● Much harder to do than expected ○ based on lots of deeply rooted, fuzzy, experimental knowledge ● Please, please, I beg you: if anything is unclear, come chat with me / ask questions (whatever the medium)
  9. 9. 9 Not so popular opinions - 4 Hills I would die on -
  10. 10. Our work as developers is to discover and assess failure modes 10 Not so popular opinion 1/4
  11. 11. ERRORS are a SOCIAL construction to give AGENCY to the receiver of the error 11 Not so popular opinion 2/4
  12. 12. An application has always at least 3 kinds of users: users ; devs ; and ops. Don’t forget any. 12 Not so popular opinion 3/4
  13. 13. It’s YOUR work to choose the SEMANTIC between nominal case and error and KEEP your PROMISES Not so popular opinion 4/4 13
  14. 14. OK. But in concrete terms?
  15. 15. 15 Assess failure modes. Give agency to your users and don’t forget any of them. You are responsible to keep promises made.
  16. 16. Assess failure modes. Give agency to your users and don’t forget any of them. You are responsible to keep promises made. 1. Pure, total functions 2. Explicit error channel 3. Failures vs Errors 4. Program to strict interfaces and protocols 5. Composition and tooling
  17. 17. Assess failure modes. Give agency to your users and don’t forget any of them. You are responsible to keep promises made. These five points are also important and can be translated at architecture / UX / team / ecosystem levels. But let’s keep it simple with code.
  18. 18. 1. 18 don’t lie about your promises Pure, total functions
  19. 19. Don’t lie! divide(a: Int, b: Int): Int 19
  20. 20. Don’t lie! 20 Divide By Zero ? divide(a: Int, b: Int): Int
  21. 21. Don’t lie! 21 Divide By Zero ? ● non total functions are a lie ○ your promises are unsound ○ your users can’t react appropriately divide(a: Int, b: Int): Int
  22. 22. Don’t lie! 22 getUserFromDB(id: UserId): User
  23. 23. Don’t lie! 23 No such user ? (non total) getUserFromDB(id: UserId): User
  24. 24. Don’t lie! No such user? (non total) DB connection error? getUserFromDB(id: UserId): User
  25. 25. Don’t lie! No such user? (non total) DB connection error? ● non pure functions are a lie ○ your promises are unsound ○ your users can’t react appropriately getUserFromDB(id: UserId): User
  26. 26. Sound promises ● Don’t lie to your users, allow them to react efficiently: ○ use total functions, or make them total with a union return type ○ use pure functions, or make them pure with the IO monad
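
As an illustration of “make them total with a union return type”, the `divide` example from the earlier slides could be sketched as below. `DivideByZero` and `describe` are names assumed for this sketch, using only the standard library `Either`.

```scala
object TotalDivide {
  // Hypothetical error type for this sketch.
  final case class DivideByZero(numerator: Int)

  // The signature now tells the whole truth: every input pair yields a value,
  // either the quotient or a description of why there is none.
  def divide(a: Int, b: Int): Either[DivideByZero, Int] =
    if (b == 0) Left(DivideByZero(a)) else Right(a / b)

  // The caller has agency: it decides what a division by zero means here.
  def describe(a: Int, b: Int): String =
    divide(a, b) match {
      case Right(q)              => s"$a / $b = $q"
      case Left(DivideByZero(n)) => s"cannot divide $n by zero"
    }
}
```
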
  27. 27. 2. 27 make it unambiguous in your types Explicit error channel
  28. 28. 28 It’s a signal make it unambiguous give agency
  29. 29. It’s a signal: make it unambiguous, give agency ● Don’t assume what’s obvious ● It’s an open world out there ● Don’t force users to reverse-engineer possible cases
  30. 30. Which intent is less ambiguous? 30 blobzurg(a: Int, b: Int): Option[Int] blobzurg(a: Int, b: Int): PureResult[DivideByZero, Int] It’s a signal make it unambiguous give agency
  31. 31. 31 It’s a signal make it unambiguous give agency automate it ● Use the type system to automate classification of errors?
  32. 32. 32 A type system is a tractable syntactic method for proving the absence of certain program behaviors by classifying phrases according to the kinds of values they compute. Benjamin Pierce It’s a signal make it unambiguous give agency automate it ● Use the type system to automate classification of errors?
  33. 33. It’s a signal: make it unambiguous, give agency, automate it. “A type system is a tractable syntactic method for proving the absence of certain program behaviors by classifying phrases according to the kinds of values they compute.” (Benjamin Pierce) By definition, a type system automatically categorizes results ⟹ need for a dedicated error channel + a common error trait. Code: def divide(a: Int, b: Int): PureResult[Int]
  34. 34. It’s a signal: make it unambiguous, give agency, automate it. “A type system is a tractable syntactic method for proving the absence of certain program behaviors by classifying phrases according to the kinds of values they compute.” (Benjamin Pierce) By definition, a type system automatically categorizes results ⟹ need for a dedicated error channel + a common error trait. Code: trait MyAppError (common properties of errors); type PureResult[A] = Either[MyAppError, A]; def divide(a: Int, b: Int): PureResult[Int]
  35. 35. It’s a signal: make it unambiguous, give agency, automate it. By definition, a type system automatically categorizes results ⟹ need for a dedicated error channel + a common error trait. Same for effectful functions! Code: def getUser(id: UserId): IOResult[User]
  36. 36. It’s a signal: make it unambiguous, give agency, automate it. By definition, a type system automatically categorizes results ⟹ need for a dedicated error channel + a common error trait. Same for effectful functions! Code: trait MyAppError (common properties of errors); type IOResult[A] = IO[MyAppError, A]; def getUser(id: UserId): IOResult[User]
  37. 37. 37 It’s a signal make it unambiguous give agency automate it ● Use a dedicated error channel ○ ~ Either[E, A] for pure code, ○ else ~ IO[E, A] monad ● use a parent trait for common error properties… ● and for automatic categorization of errors by compiler
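
A small sketch of what a dedicated error channel with a common parent trait can look like for pure code. `MyAppError` and `PureResult` follow the slides; the concrete errors (`DivideByZero`, `NotAnInt`), the `msg` property and the helper functions are assumptions made for the example.

```scala
object ErrorChannel {
  // Common parent: every error of the application carries a message.
  trait MyAppError { def msg: String }

  final case class DivideByZero(numerator: Int) extends MyAppError {
    def msg: String = s"cannot divide $numerator by zero"
  }
  final case class NotAnInt(input: String) extends MyAppError {
    def msg: String = s"'$input' is not an integer"
  }

  // The dedicated error channel for pure code.
  type PureResult[A] = Either[MyAppError, A]

  def parseInt(s: String): PureResult[Int] =
    s.toIntOption.toRight(NotAnInt(s))

  def divide(a: Int, b: Int): PureResult[Int] =
    if (b == 0) Left(DivideByZero(a)) else Right(a / b)

  // All steps share the same error channel, so they compose in a
  // for-comprehension without any manual error conversion.
  def divideStrings(a: String, b: String): PureResult[Int] =
    for {
      x <- parseInt(a)
      y <- parseInt(b)
      q <- divide(x, y)
    } yield q
}
```

Because both error cases extend the same trait, the compiler classifies them into one channel, and callers can still pattern match on the specific case when they need to.
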
  38. 38. 3. 38 models are false by construction Failures vs Errors
  39. 39. Model everything? 39 writeFile(path: String, value: String): IOResult[Unit]
  40. 40. Model everything? 40 java.lang.SecurityException? (jvm perm to access FS) writeFile(path: String, value: String): IOResult[Unit]
  41. 41. Model everything? 41 java.lang.SecurityException? (jvm perm to access FS) ⟹ where do you put the limit? writeFile(path: String, value: String): IOResult[Unit]
  42. 42. Systems? Need for a systematic approach to error management 42 A school of systems
  43. 43. Systems? Need for a systematic approach to error management. A system is: ○ a BOUNDED group of things ○ with a NAME ○ interacting with other systems. A school of systems
  44. 44. Systems have horizon. 44 ○ nothing exists beyond horizon
  45. 45. Systems have a horizon. Horrors lie beyond. ○ nothing exists beyond the horizon ○ Like with Lovecraft: if something from beyond interacts with a system, the system becomes inconsistent
  46. 46. Errors vs Failures. Errors: ● expected non-nominal case ● signal for users ● social construction: you choose alternative or error ● reflected in types. Failures: ● unexpected case: by definition, the application is in an unknown state ● the only choice is to stop as cleanly as possible ● not reflected in types
  47. 47. Horizon limit is your choice - by definition 47 java.lang.SecurityException?
  48. 48. Horizon limit is your choice - by definition 48 java.lang.SecurityException? execScript(js: String): IOResult[String] In Rudder, we have a JS engine (JS from users):
  49. 49. Horizon limit is your choice - by definition 49 java.lang.SecurityException? execScript(js: String): IOResult[String] In Rudder, we have a JS engine (JS from users): ⟹ SecurityException is an expected error case here
  50. 50. Horizon limit is your choice - by definition 50 java.lang.SecurityException? execScript(js: String): IOResult[String] In Rudder, we have a JS engine (JS from users): ⟹ SecurityException is an expected error case here … but nowhere else in Rudder. By our choice.
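
One way to picture the horizon in code: the exception chosen as an expected case is turned into a value in the error channel, and everything else is left alone so it surfaces as a failure. This is only a sketch with the standard library; `ScriptForbidden` and the `run` parameter (standing in for a real script engine call) are hypothetical.

```scala
object Horizon {
  trait MyAppError { def msg: String }
  final case class ScriptForbidden(msg: String) extends MyAppError

  // `run` stands in for a call to a real script engine (hypothetical).
  def execScript(run: () => String): Either[MyAppError, String] =
    try Right(run())
    catch {
      // Inside the horizon: expected here, so it becomes an error value.
      case e: SecurityException =>
        Left(ScriptForbidden(s"script was denied: ${e.getMessage}"))
      // Anything else is beyond the horizon: it is not caught and
      // propagates as a failure.
    }
}
```
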
  51. 51. 4. 51 use systems to materialize promises Program to strict interfaces and protocols
  52. 52. A bit more about systems. Need for a systematic approach to error management. A system is: ○ a BOUNDED group of things ○ with a NAME ○ interacting with other systems. A school of systems
  53. 53. A bit more about systems Need for a systematic approach to error management 53 ○ BOUNDED group of things ○ with a NAME Interacting ○ via INTERFACES ○ by a PROTOCOL with other systems ○ And PROMISING to have a behavior A school of systems
  54. 54. Example? 54 Typical web application.
  55. 55. Example? 55 Typical web application.
  56. 56. Example? 56 Typical web application. How to keep contradictory promises? Promises to third parties about REST behaviour Promises to business and developers about code manageability
  57. 57. Make promises, Keep them 57 ● systems allow to bound responsibilities
  58. 58. Make promises, Keep them 58 ● systems allow to bound responsibilities
  59. 59. Make promises, Keep them 59 ● systems allow to bound responsibilities Business Core sub-system: ● own ADT / logic (mostly pure) ● lifecycle bounded to developers understanding of needs (rapid changes)
  60. 60. Make promises, Keep them 60 ● systems allow to bound responsibilities Business Core sub-system: ● own ADT / logic (mostly pure) ● lifecycle bounded to developers understanding of needs (rapid changes) Pattern: “A pure heart (core) surrounded by side effects”* * works better in French: “un coeur pur encerclé par les effets de bords”
  61. 61. Make promises, Keep them 61 ● systems allow to bound responsibilities Users of the API want stability and to know what errors can happen Business Core sub-system: ● own ADT / logic (mostly pure) ● lifecycle bounded to developers understanding of needs (rapid changes)
  62. 62. Make promises, Keep them 62 ● systems allow to bound responsibilities Business Core sub-system: ● own ADT / logic (mostly pure) ● lifecycle bounded to developers understanding of needs (rapid changes) REST sub-system : ● own ADT / logic (mostly effects) ● lifecycle bounded to REST contract: strict versioning, changes are breaking changes Users of the API want stability and to know what errors can happen
  63. 63. Make promises, Keep them 63 ● systems allow to bound responsibilities Business Core sub-system: ● own ADT / logic (mostly pure) ● lifecycle bounded to developers understanding of needs (rapid changes) REST sub-system : ● own ADT / logic (mostly effects) ● lifecycle bounded to REST contract: strict versioning, changes are breaking changes Stable API : interface, strict protocol & promises (nominal cases + errors) Users of the API have agency (able to react efficiently)
  64. 64. Make promises, Keep them 64 ● systems allow to bound responsibilities Business Core sub-system: ● own ADT / logic (mostly pure) ● lifecycle bounded to developers understanding of needs (rapid changes) REST sub-system : ● own ADT / logic (mostly effects) ● lifecycle bounded to REST contract: strict versioning, changes are breaking changes Stable API : interface, strict protocol & promises (nominal cases + errors) Users of the API have agency (able to react efficiently) Translation between sub-systems: API: interface, protocol & promises!
  65. 65. Make promises, Keep them 65 ● systems allow to bound responsibilities ● translate errors between sub-systems ○ make errors relevant to their users ● It’s a model, it’s false ○ there is NO definitive answer. ○ discuss, share, iterate ● the bigger the promises, the stricter the API
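
Translating errors between sub-systems can be as small as one function at the boundary. The sketch below assumes a business core with its own error ADT and a REST layer with a deliberately narrow one; all names (`CoreError`, `RestError`, `toRest`, `getUserEndpoint`) are invented for the example.

```scala
object Translation {
  // Business core: its own error ADT, free to change quickly.
  sealed trait CoreError
  final case class UserNotFound(id: String)          extends CoreError
  final case class StorageUnavailable(cause: String) extends CoreError

  // REST sub-system: a small, stable, versioned vocabulary.
  final case class RestError(status: Int, message: String)

  // Single translation point at the boundary between the two sub-systems.
  def toRest(e: CoreError): RestError = e match {
    case UserNotFound(id)      => RestError(404, s"user '$id' does not exist")
    // Internal details are not part of the REST promise, so they are not exposed.
    case StorageUnavailable(_) => RestError(503, "service temporarily unavailable")
  }

  // Toy core function: only user "42" exists.
  def getUser(id: String): Either[CoreError, String] =
    if (id == "42") Right("alice") else Left(UserNotFound(id))

  def getUserEndpoint(id: String): Either[RestError, String] =
    getUser(id).left.map(toRest)
}
```

The core can add or rename error cases without breaking API users, as long as `toRest` keeps mapping them onto the promised REST errors.
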
  66. 66. 5. 66 make it extremely convenient to use Composition and tooling
  67. 67. Checked exceptions are a good signal for users 67 Unpopular opinion (sure)
  68. 68. Checked exceptions are a good signal for users Who likes them ? 68 Unpopular opinion (sure)
  69. 69. What’s missing for good error management in code? ● signal must be unambiguous ○ exceptions are a pile of ambiguity ○ Error? ○ Fatal error? ○ Checked? Unchecked?
  70. 70. What’s missing for good error management in code? ● signal must be unambiguous ○ exceptions are a pile of ambiguity ● exceptions are A PAIN to use ○ no tooling, no inference, nothing ■ you need to be able to manipulate errors like normal code ■ where are our higher order functions like map, fold, etc? ○ no composition ■ lose referential transparency* * the single biggest win regarding code comprehension
  71. 71. Make it a joy! ● managing errors should be enjoyable! ○ automatic (in for loop + inference) ○ or as expressive as the nominal case! ● safely, easily managing errors should be the default! ○ composition (referential transparency…) ○ higher level resource management: bracket, etc ● make the code extremely readable ○ add all the combinators you need! ○ it’s cheap with pure, total functions
  72. 72. 72 In Rudder: Why ZIO?
  73. 73. Why ZIO ? 73 ● you still have to think in systems by yourself
  74. 74. Why ZIO ? 74 ● you still have to think in systems by yourself ● then ZIO provides : ○ effect management ○ with an explicit error channel ○ IO[+E, +A] val pureCode = IO.effect(effectfulCode)
  75. 75. Why ZIO ? 75 ● you still have to think in systems by yourself ● then ZIO provides : ○ debuggable failures Complex error composition Async code trace
  76. 76. Why ZIO? ● you still have to think in systems by yourself ● then ZIO provides: ○ tons of convenience to manipulate errors ■ create: from Option, Either, value... ■ transform: mapError, fold, foldM, ... ■ recovery: total, partial, or else ○ composable effects ■ .bracket / Managed, async queues, STM, etc ● safe, composable resource management
  77. 77. Why ZIO? ● you still have to think in systems by yourself ● then ZIO provides: ○ effect management ○ with an explicit error channel ○ debuggable failures ○ tons of convenience to manipulate errors ○ composable
  78. 78. Why ZIO? ● you still have to think in systems by yourself ● then ZIO provides: ○ effect management ○ with an explicit error channel ○ debuggable failures ○ tons of convenience to manipulate errors ○ composable ● Everything works in parallel, asynchronous code too! ● Inference just works!
  79. 79. Why ZIO? ● you still have to think in systems by yourself ● then ZIO provides: ○ effect management ○ with an explicit error channel ○ debuggable failures ○ tons of convenience to manipulate errors ○ composable ● Everything works in parallel, concurrent code too! ● Inference just works! Lots of details: “Error Management: Future vs ZIO” https://www.slideshare.net/jdegoes/error-management-future-vs-zio
  80. 80. 80 In Rudder, with ZIO: we settled on that
  81. 81. One error hierarchy 81 ● One error type (trait) providing common tooling
  82. 82. Unambiguous type 82
  83. 83. Generic, useful errors 83
  84. 84. Specialized error for subsystems 84
  85. 85. Full example ● inference just works ● each sub-system adds relevant information ● error contextualisation between systems. Annotations on the code: (None, msg) => Unexpected(msg) ; PureResult[A] => IOResult[A] ; (err: RudderError[A], msg) => Chained(msg, err)
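
The annotations on this slide can be approximated with the standard library alone. What follows is a sketch of the idea, not Rudder's actual code: `RudderError`, `Unexpected`, `Chained`, `PureResult` and `notOptional` are named in the deck, but their shapes here, the `chainError` helper and the toy `findUser` / `greet` functions are assumptions.

```scala
object Contextualisation {
  trait RudderError { def msg: String }
  final case class Unexpected(msg: String) extends RudderError
  final case class Chained(hint: String, cause: RudderError) extends RudderError {
    def msg: String = s"$hint <- ${cause.msg}"
  }

  type PureResult[A] = Either[RudderError, A]

  implicit class OptionOps[A](opt: Option[A]) {
    // (None, msg) => Unexpected(msg)
    def notOptional(msg: String): PureResult[A] = opt.toRight(Unexpected(msg))
  }

  implicit class ResultOps[A](res: PureResult[A]) {
    // (err, msg) => Chained(msg, err): the calling sub-system adds its own context.
    def chainError(hint: String): PureResult[A] =
      res.left.map(e => Chained(hint, e))
  }

  private val users = Map(1 -> "alice")

  def findUser(id: Int): PureResult[String] =
    users.get(id).notOptional(s"no user with id $id")

  def greet(id: Int): PureResult[String] =
    findUser(id).map(n => s"hello $n").chainError("cannot build greeting")
}
```

Each layer only adds the context it knows about, and the final message reads from the outermost hint down to the root cause.
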
  86. 86. Assess failure modes. Give agency to your users and don’t forget any of them. You are responsible to keep promises made. 1. Pure, total functions: don’t lie about your promises 2. Explicit error channel: make it unambiguous in your types 3. Failures vs Errors: models are false by construction 4. Program to strict interfaces and protocols: use systems to materialize promises 5. Composition and tooling: make it extremely convenient to use
  87. 87. Question? Contact me / Chat with me! https://twitter.com/fanf42 https://github.com/fanf https://keybase.io/fanf42 irc/freenode: fanf francois@rudder.io Resources ○ Error management: future vs ZIO. A much more detailed presentation of ZIO error management capabilities https://www.slideshare.net/jdegoes/error-management-future-vs-zio ○ Understand Things As Interacting Systems. More insights on systems. https://medium.com/@fanf42/understand-things-as-interacting-systems-b273bdba5dec ○ Stay Up! Journey of a Free Software Company. One decade in search for a sustainable model https://medium.com/@fanf42/stay-up-5b780511109d
  88. 88. Some questions asked after the talk ● Is SystemError used to catch / materialize failures? ○ No, SystemError is here to translate errors that need to be dealt with (like a connection error to the DB, FS-related problems, etc.) but are encoded in Java as an Exception. SystemError is not used to catch Java “OutOfMemoryError”. Such exceptions kill Rudder. We use the JVM Thread.setDefaultUncaughtExceptionHandler to try to give more information to devs/ops and clean things up before killing the app.
  89. 89. Some questions asked after the talk ● You have only one parent type for errors. Don’t you lose a lot of details, with all the special errors in subsystems losing their specificities when they are seen as RudderError? ○ This is a very pertinent question, and we spent a lot of time weighing the current design against one where all sub-systems would have their own error type (with no common super type). In the end, we settled on the current design because: ■ no common super type means no automatic inference. You need to guide it with transformers, and even if ZIO provides tooling to map errors, that means a lot of useless boilerplate that pollutes the readability of your code. ■ there is common tooling that you really want to have in all errors (Chained, SystemError, but also “notOptional”, etc). You don’t want to rewrite it. Yes, type classes could be a solution, but you still have to write them, for no clear gain here. ■ you are fighting the automatic categorization done by the compiler instead of leveraging it. ■ the gain (detailed errors) is actually almost never needed. When we switched to “only one super class for all errors”, we saw that “Chained” is sufficient to deal with general trans-system cases, and in some very, very rare cases, you can build ad-hoc combinators when needed, it’s cheap. ○ So all in all, the wins in convenience and the joy of just having everything work without boilerplate clearly outweighed the unclear gain of having different error hierarchies. ○ The problem would have been different if Rudder were not one monolithic app and needed separate compilation between services. I think we would have made an “error” lib in that case.
  90. 90. Some questions asked after the talk ● We use Future[Either[E,A]] + MTL, why should we switch to ZIO? ○ Well, the decision to switch is yours, and I don’t know the specific context of your company well enough to give advice on that. Nonetheless, here is my personal opinion: ■ The ZIO stack seems simpler (fewer concepts) and works perfectly with inference. Thus it may be simpler to teach to new people, and to maintain. YMMV. ■ ZIO performance is excellent, especially regarding concurrent code. Fibers are a very nice abstraction to work with. ■ ZIO enforces pure code, which is generally simpler to compose/refactor. ■ ZIO tooling and related constructs (Managed resources, Async Queues, STM, etc) are a joy to code with. They remove a lot of pain in tedious, boring, complicated tasks (closing resources correctly, synchronizing concurrent access, etc) ■ relevant stack traces in concurrent code are a major win ● But at the end of the day, you decide!
  91. 91. Some questions asked after the talk ● How long did it take to port Rudder to ZIO? ○ It’s complicated :). 1 month of part time (me), plus lots more time for teaching, refactoring, understanding the limits of the new paradigm, etc ■ 1/ we didn’t start from nowhere. We were using Box from liftweb, and a lot of the code in Rudder was already “shaped” to deal with errors as explained in the talk (see https://issues.rudder.io/issues/14870 for context) ■ 2/ we didn’t port all of Rudder to ZIO. I estimate that we ported ~ 40% of the code (60k-70k lines?). ■ 3/ we did some major refactoring along the way, using new combinators and higher level structures (like async queues) ■ 4/ we started at the end of 2018, when the ZIO code was still moving a lot, and we switched to new things when they became available (ZIO 1.0.0 is around the corner and it has been quite stable for months now) ■ we spent quite some time looking for the best choice for errors between sub-systems (see other question)
