Crash Only Web Services


Published on

Crash-Only Web Services: Failure Semantics in an SOA Environment

Published in: Technology, Business
  • Be the first to comment

  • Be the first to like this

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide

Crash Only Web Services

  1. 1. Crash-Only Web Services: Failure Semantics in an SOA Environment Chris Hobbs and Abbie Barbir Nortel OASIS Symposium 2007, San Diego
  2. 2. The crash-only model <ul><li>Software design approach </li></ul><ul><li>Easier to restart quickly in a known state than to clean up and rebuild to recover from an error </li></ul>George Candea and Armando Fox are key proponents of crash-only software
  3. 3. Two themes of this talk <ul><li>Discuss issues of the behaviors of individual and composed services and their part in Web Services Service Level Agreements (WSLA) </li></ul><ul><ul><li>Based on the behaviors of the individual services </li></ul></ul><ul><ul><li>Need a taxonomy or ontology of service behaviors </li></ul></ul><ul><ul><li>Need an approach to calculating behaviors of composed services </li></ul></ul><ul><li>The “crash-only” model of operation as a simple failure behavior for a Web Service </li></ul><ul><ul><li>Failure is one of many identified behaviors </li></ul></ul>
  4. 4. Background: Orchestration as a New Programming Paradigm <ul><li>SOA promotes the concept of combining services through orchestration - invoking services in a defined sequence to implement a business process </li></ul><ul><li>Orchestration compounds the difficulties of testing and managing the quality of the deployed services </li></ul><ul><li>Testing composite services in SOA environment is a discipline which is still at an early stage of study </li></ul><ul><li>Describing and usefully modeling the individual and combined behaviors - needed to offer Service Level Agreements (SLA) - is at an even earlier stage </li></ul><ul><li>We hope to stimulate additional research on these topics </li></ul>
  5. 5. Testing Composed Services <ul><li>It’s fairly straightforward to test the operation of a device or system if we control all the parts. </li></ul><ul><li>When we start offering orchestrated services as a product, the services we are using may be outside our control. </li></ul><ul><li>For example consider well-known components: </li></ul><ul><ul><li>Google mapping service </li></ul></ul><ul><ul><li>Amazon S3 storage service </li></ul></ul><ul><ul><li>Mobile operator’s location service </li></ul></ul>
  6. 6. Testing Composed Services (2) <ul><li>With orchestrated services, there is never a complete “box” we can test </li></ul><ul><li>With orchestration as the new programming paradigm, testing becomes a much bigger problem </li></ul><ul><li>Failures of orchestrated services are often “Heisenbugs” - impervious to conventional debugging, generally non-reproducible </li></ul><ul><li>Offering a WSLA based on testing alone, without reliable knowledge of component service behaviors, may be risky </li></ul>
  7. 7. Web Services SLA (WSLA) <ul><li>Concerned with behaviors of the message flows and services spanning the end-to-end business transaction </li></ul><ul><li>Clients can develop testing strategies that stress the service to ensure that the service provider has met the contracted WSLA commitment </li></ul><ul><li>Composed services make offering a WSLA more risky </li></ul>Service Provider Z Provider X Service X Provider Y Service Y Client WSLA Network Packets Message flows Web Service
  8. 8. How can WSLAs be derived from behaviors of component services? <ul><li>Need to develop a model of the behavioral attributes of the individual component Web Services which contribute to the overall behavior of an orchestrated or composed Web Service. </li></ul><ul><li>Need to model the combination of individual service behavioral models </li></ul>
  9. 9. Web Services behaviors <ul><li>Behaviors may be described and quantified for each Web Service </li></ul><ul><li>May be combined by a “calculus of behaviors” when multiple services are composed </li></ul><ul><li>Behavior parameters may become a part of the service description, perhaps in WSDL. </li></ul><ul><li>Availability and Reliability </li></ul><ul><li>Performance </li></ul><ul><li>Management </li></ul><ul><li>Failure </li></ul><ul><li>Security </li></ul><ul><li>Privacy, confidentiality and integrity </li></ul><ul><li>Scalability </li></ul><ul><li>Execution </li></ul><ul><li>Internationalization </li></ul><ul><li>Synchronization </li></ul><ul><li>Etc., … </li></ul>
  10. 10. Web Services behaviors (2) <ul><li>To develop a Service Level Agreement (SLA) for a composed service (Z), we need to have relevant behavior descriptions for the individual services (X and Y) </li></ul><ul><li>We also need a deep understanding of how to combine the descriptions of X and Y to calculate results for Z </li></ul>Z X Y
  11. 11. Web Services behaviors (3) <ul><li>For each behavior, the challenges include the following: </li></ul><ul><li>1. How may service X’s and service Y’s behavior be characterized? </li></ul><ul><li>2. How may those characterizations be formalized and advertised by X and Y? </li></ul><ul><li>3. How may Z incorporate X’s and Y’s characterizations and then advertise the result? </li></ul><ul><li>Z itself might become a component of an even larger service and therefore needs to advertise its own characteristics. It also needs this characterization to offer an SLA to consumers. </li></ul>
  12. 12. Web Services behaviors (4) <ul><li>Each behavior may have its own ontology, measures, and calculus of combining those measures when services are composed. </li></ul>Local Ontology Local Ontology Local Ontology Abstracted Ontology Abstracted Ontology X Y Z Z – Specific Ontology ? Need this analysis for each behavior of services X, Y and Z
  13. 13. Web Services behaviors (5) <ul><li>Ten behavior examples </li></ul><ul><ul><li>Availability and Reliability </li></ul></ul><ul><ul><li>Performance </li></ul></ul><ul><ul><li>Management </li></ul></ul><ul><ul><li>Failure (Crash-only is one mode) </li></ul></ul><ul><ul><li>Security </li></ul></ul><ul><ul><li>Privacy, confidentiality and integrity </li></ul></ul><ul><ul><li>Scalability </li></ul></ul><ul><ul><li>Execution </li></ul></ul><ul><ul><li>Internationalization </li></ul></ul><ul><ul><li>Synchronization </li></ul></ul><ul><li>Let’s focus on a few of these behaviors… </li></ul>Source: “Advertising Service Properties,” unpublished paper by C. Hobbs, J. Bell, P. Sanchez
  14. 14. Availability and Reliability <ul><li>“ Availability” is the percentage of client requests to which the server responds within the time it advertised. </li></ul><ul><li>“ Reliability” is the percentage of such server responses which return the correct answer. </li></ul><ul><li>In some applications availability is more important than reliability </li></ul><ul><ul><li>Many protocols used within the Internet, for example, are self-correcting and an occasional wrong answer is unimportant. The failure to give any answer, however, can cause a major network upheaval. </li></ul></ul>
  15. 15. Availability and Reliability (2) <ul><li>In other applications reliability is more important than availability </li></ul><ul><ul><li>If the service which calculates a person’s annual tax return does not respond occasionally it’s not a major problem - the user can try again </li></ul></ul><ul><ul><li>If that service does respond but with the wrong answer which is submitted to the tax authorities, then it could be disastrous </li></ul></ul>
  16. 16. Availability and Reliability (3) <ul><li>Services are built with either availability or reliability in mind, with clients accepting that no service can ever be 100% available or 100% reliable. </li></ul><ul><li>In combining services X and Y into a composite service Z, it is necessary to combine the underlying availability and reliability models and predict Z’s model. </li></ul><ul><li>To do so without manual intervention, X’s and Y’s models must be exposed. </li></ul>
  17. 17. Availability and Reliability (4) <ul><li>Availability and reliability models are often expressed as Markov Models or Petri Nets, which are easy to combine in a hierarchical way. </li></ul><ul><li>Major issues: </li></ul><ul><ul><li>Agreeing upon the semantics of the states in the Markov model or places in the Petri nets </li></ul></ul><ul><ul><li>Finding a way for X and Y to publish the models in a standard form. </li></ul></ul>
  18. 18. Availability and Reliability (5) <ul><li>Currently, apart from raw percentage figures, there is no method for describing these models </li></ul><ul><ul><li>Percentage time when the server is unavailable? </li></ul></ul><ul><ul><li>Percentage of requests to which it does not reply? </li></ul></ul><ul><ul><li>Different clients may experience these differently </li></ul></ul><ul><ul><li>A server which is unavailable from 00:00 to 04:00 every day can be 100% available to a client that only tries to access it in the afternoons. </li></ul></ul>
  19. 19. Availability and Reliability (6) <ul><li>If X and Y are distributed, then it is possible, following network failures, that for some customers, Z can access X but not Y and for others Y but not X. </li></ul><ul><li>The assessment of Z’s availability may be hard to quantify, so it may be difficult for Z to offer a meaningful WSLA. </li></ul>
  20. 20. Failure <ul><li>The failure models of X and Y may be very different: </li></ul><ul><ul><li>X fails cleanly and may, because of its idempotency, immediately be called again </li></ul></ul><ul><ul><li>Y has more complex failure modes </li></ul></ul><ul><ul><li>Z will add its own failure modes to those of X and Y </li></ul></ul><ul><ul><li>Predicting the outcome could be very difficult </li></ul></ul><ul><li>The complexity is increased because many developers do not understand failure modeling and, even were models to be published, their combination would be difficult due to their stochastic nature. </li></ul>
  21. 21. Failure (2) <ul><li>One approach to describing a service’s failure model: </li></ul><ul><ul><li>Service publishes the exceptions that it can raise and associates the required consumer behavior with each </li></ul></ul><ul><ul><li>“ Exception D may be thrown when the database is locked by another process. Required action is to try again after a random backoff period of not less than 34ms.” </li></ul></ul><ul><li>“ Crash-only” failure model is a simple starting point for building a taxonomy of failure behavior. This work is just beginning. </li></ul>
  22. 22. Scalability <ul><li>A behavioral description and WSLA for the composite service Z must include its scalability </li></ul><ul><li>How many simultaneous service instances can it support? </li></ul><ul><li>What service request rate does it handle? etc. </li></ul><ul><li>These parameters will almost certainly differ between the component services X and Y, and will need to be published by those services. </li></ul><ul><li>X and Y are presumably not dedicated solely to Z, so the actual load being applied to X and Y at any given time is unknown to the provider of Z, making the scalability of Z even harder to determine. </li></ul>
  23. 23. Web Services behaviors (again) <ul><li>Ten behavior examples </li></ul><ul><ul><li>Availability and Reliability </li></ul></ul><ul><ul><li>Performance </li></ul></ul><ul><ul><li>Management </li></ul></ul><ul><ul><li>Failure (Crash-only is one mode) </li></ul></ul><ul><ul><li>Security </li></ul></ul><ul><ul><li>Privacy, confidentiality and integrity </li></ul></ul><ul><ul><li>Scalability </li></ul></ul><ul><ul><li>Execution </li></ul></ul><ul><ul><li>Internationalization </li></ul></ul><ul><ul><li>Synchronization </li></ul></ul><ul><li>We described a few of these behaviors… </li></ul><ul><li>Can we use them to build WSLAs? </li></ul>
  24. 24. Web Service Level Agreement (WSLA) <ul><li>Based on behaviors and descriptors for these behaviors. </li></ul><ul><li>Example: Failure model </li></ul><ul><ul><li>Is transaction half-performed? </li></ul></ul><ul><ul><li>Is it re-wound? </li></ul></ul><ul><li>These behaviors and descriptors are not available in the WS description, in WSDL </li></ul><ul><ul><li>No performance info </li></ul></ul><ul><ul><li>Not even price! </li></ul></ul>
  25. 25. Web Service Level Agreements (2) <ul><li>Business acceptance of composed services for business-critical operations depends on a service provider’s ability to offer WSLA </li></ul><ul><ul><li>Uptime, response time, etc. </li></ul></ul><ul><ul><li>Offering a WSLA depends on ability to compose the WSLA-related behaviors of the individual services </li></ul></ul><ul><ul><li>This information needs to be available via WSDL or similar source </li></ul></ul><ul><ul><li>Should include test vectors to test the SLA claims </li></ul></ul><ul><li>The ability to determine and offer a WSLA commitment is a limiting factor for widespread acceptance of services based on orchestration </li></ul>
  26. 26. Web Service Level Agreements – conclusions <ul><li>Need a more precise way to express the parameters of behaviors </li></ul><ul><ul><li>Availability – What is 99.97% uptime? </li></ul></ul><ul><ul><ul><li>Several milliseconds outage each minute? </li></ul></ul></ul><ul><ul><ul><li>Several minutes planned downtime each month? </li></ul></ul></ul><ul><ul><li>Failure model – Crash-only as the simplest, lowest layer or level of failure in a future full failure model. </li></ul></ul><ul><ul><li>Eight other SLA-related behaviors listed here – each has a complex semantic for description and composition </li></ul></ul><ul><li>More questions than answers now - many PhDs still to be earned in this area! </li></ul>
  27. 27. Back to the crash-only software model <ul><li>Can it simplify service composition, testing, development of WSLA, and end world hunger? </li></ul>
  28. 28. Crash-only software (1) <ul><li>Historically, developers have spent a lot of effort making software resilient </li></ul><ul><ul><li>Put borders around it so it will not affect other things if it fails </li></ul></ul><ul><ul><li>Try to close it down cleanly </li></ul></ul><ul><ul><li>Save state </li></ul></ul><ul><ul><li>Reload the software component </li></ul></ul><ul><ul><li>Restart and replay </li></ul></ul><ul><li>Trying to keep the client from becoming aware that a failure occurred </li></ul>
  29. 29. Crash-only software (2) <ul><li>Years of work over last ten years on resilient software - which stays up all the time, and recovers from problems </li></ul><ul><ul><li>For example, tutorials by Bev Littlewood </li></ul></ul><ul><li>Crash-only software is the exact opposite </li></ul><ul><ul><li>Client accepts that the server may crash </li></ul></ul><ul><ul><ul><li>Power failure, network down, hardware, etc. </li></ul></ul></ul><ul><ul><ul><li>Client must be able to recover or restart the process by itself </li></ul></ul></ul>
  30. 30. Crash-only software (3) <ul><li>Crash-only principles </li></ul><ul><ul><li>Forget recovery - more trouble than it’s worth </li></ul></ul><ul><ul><li>When the server senses a problem, it will “crash” as cleanly as possible and may perform a “micro-reboot” to return to original state </li></ul></ul><ul><ul><ul><li>Sometimes recover to a well-defined checkpoint </li></ul></ul></ul><ul><ul><ul><li>Client may initiate the crash </li></ul></ul></ul><ul><ul><li>The server is back working sooner than if it tried to recover via logs and journals, etc. </li></ul></ul><ul><li>Principles fit the Web Services paradigm nicely! </li></ul><ul><ul><li>Loose coupling of services </li></ul></ul><ul><ul><li>Little state shared among services </li></ul></ul>
  31. 31. Crash-Only Software (4) <ul><li>Crash-only semantic has several advantages: </li></ul><ul><ul><li>Simpler macroscopic behavior with fewer externally visible states </li></ul></ul><ul><ul><li>Reduces outage time by removing all shutting-down time </li></ul></ul><ul><ul><li>Simplifies failure model by reducing recovery state table size </li></ul></ul><ul><ul><li>Crashing can be invoked from outside the software of the provider </li></ul></ul><ul><ul><ul><li>Recovery from a failed state is notoriously difficult and the crash-only paradigm coerces the system into a known state without attempting recovery </li></ul></ul></ul><ul><ul><ul><li>Reduce the complexity of the provider code </li></ul></ul></ul><ul><ul><li>Simplifies testing by reducing the failure combinations that have to be verified. Consumer is assumed to be able to initiate the crash. </li></ul></ul>
  32. 32. Crash-Only Web Services <ul><li>Candea’s list of properties required for a crash-only system can be abstracted to match properties of Web Services: </li></ul><ul><ul><li>Components have externally enforced boundaries. This is supported by the virtual machine concept used on many Web Service systems </li></ul></ul><ul><ul><li>All interactions between components have a timeout. This is implicit in any loosely-coupled Web Services interaction. </li></ul></ul><ul><ul><li>All resources are leased to the service rather than being permanently allocated. This is particularly useful in Web Services. </li></ul></ul><ul><ul><li>Requests are entirely self-describing. For crash-only services this requires that the request carries information about time-to-live and idempotency – will it return the same result if invoked again?. </li></ul></ul><ul><ul><li>All important non-volatile state is managed by dedicated state stores. </li></ul></ul>
  33. 33. Crash-Only Reliable Web Service <ul><li>For systems with hardware redundancy, by using crash only techniques, SOAP & WS-RM can be extended in order to produce an always available Web Service from the provider’s and consumer’s point of view </li></ul><ul><li>WSLA response time may be at risk if a service is forced to crash </li></ul>Crash-Only Application Server Stall Proxy Web Service Consumer Web Services Endpoint Recovery Agent Crash-Only Backend Crash-Only Backend Crash-only WSM Internet Reliable SOAP Protocol WS-ReliableMessaging
  34. 34. Conclusions <ul><li>Testing Web Services in an SOA environment is a discipline that is still in its infancy </li></ul><ul><li>There are no standard models to describe or combine Web Services behavior information across various services and providers </li></ul><ul><li>Web Services SLAs (WSLAs) for composed services are problematic </li></ul><ul><ul><li>Testing is only a partial solution </li></ul></ul><ul><ul><li>Behavioral composition needs work, but is promising </li></ul></ul><ul><li>Crash-only Web Services can address some of these difficulties </li></ul><ul><li>There are many related areas for further work </li></ul>
  35. 35. Q & A