What is a system?

First lecture in my series, Design of Digital Machines.

Begins with a real-world case study showing how to be a system detective, then steps back to explain how the shared characteristics of all systems help us see the systems around us.

Published in: Design

What is a system?

  1. What is a system? № 1, Design of Digital Machines. Tim Sheiner. 0.5beta 2013. This work by Tim Sheiner is licensed under a Creative Commons Attribution 3.0 United States.
  2. Sections in this presentation: • A System Story • What is a system? • Characteristics of a system
  3. System Story
  4. (no text on this slide)
  5. Huh?
  6. Huh(2x)?
  7. (no text on this slide)
  8. Outage: Christmas Eve, 12:30pm Pacific | Amazon Web Services, Elastic Load Balancers. “Netflix streaming was impacted on Christmas Eve 2012 by problems in the Amazon Web Services (AWS) Elastic Load Balancer (ELB) service that routes network traffic to the Netflix services supporting streaming.”
  9. Outage: Christmas Eve, 12:30pm Pacific | Amazon Web Services, Elastic Load Balancers | Americas only | TV connected devices, primarily. “The outage primarily affected playback on TV connected devices in the US, Canada and Latin America. Our service in the UK, Ireland and Nordic countries was not impacted.”
  10. Outage: Christmas Eve, 12:30pm Pacific | Amazon Web Services, Elastic Load Balancers | Americas only | TV connected devices, primarily | 100’s of ELBs | ~1:1 ELB: Device Type. “Netflix uses hundreds of ELBs. Each one supports a distinct service or a different version of a service and provides a network address that your Web browser or streaming device calls. Netflix streaming has been implemented on over a thousand different streaming devices over the last few years, and groups of similar devices tend to depend on specific ELBs.”
  11. Outage: Christmas Eve, 12:30pm Pacific | Amazon Web Services, Elastic Load Balancers | Americas only | TV connected devices, primarily | 100’s of ELBs | ~1:1 ELB: Device Type | Failure localized to only some ELBs | Issue was requests not passed through. “Out of hundreds of ELBs in use by Netflix, a handful failed, losing their ability to pass requests to the servers behind them. None of the other AWS services failed, so our applications continued to respond normally whenever the requests were able to get through.”
  12. Outage: Christmas Eve, 12:30pm Pacific | Amazon Web Services, Elastic Load Balancers | Americas only | TV connected devices, primarily | 100’s of ELBs | ~1:1 ELB: Device Type | Failure localized to only some ELBs | Issue was requests not passed through | Slight performance impact to Mac/PC | Game consoles impacted 7 hours. “Over-all streaming playback via Macs and PCs was only slightly reduced from normal levels. A few devices also saw no impact at all as those devices have an ELB configuration that kept running throughout the incident, providing normal playback levels. ... game consoles etc. were impacted for about seven hours.”
  13. Outage: Christmas Eve, 12:30pm Pacific | Amazon Web Services, Elastic Load Balancers | Americas only | TV connected devices, primarily | 100’s of ELBs | ~1:1 ELB: Device Type | Failure localized to only some ELBs | Issue was requests not passed through | Slight performance impact to Mac/PC | Game consoles impacted 7 hours. “It is still early days for cloud innovation and there is certainly more to do in terms of building resiliency in the cloud. We have plans to work on this in 2013. It is an interesting and hard problem to solve, since ... the systems involved ... must be extremely reliable and capable of avoiding cascading overload failures.”
  14. (no text on this slide)
  15. (no text on this slide)
  16. US-East Region ELB | Severe but localized interruption. “We would like to share more details with our customers about the event that occurred with the Amazon Elastic Load Balancing Service (“ELB”) earlier this week in the US-East Region. While the service disruption only affected applications using the ELB service (and only a fraction of the ELB load balancers were affected), the impacted load balancers saw significant impact for a prolonged period of time.”
  17. US-East Region ELB | Severe but localized interruption | 12:24 PM PST on December 24 | ELB state data logically deleted. “The service disruption began at 12:24 PM PST on December 24th when a portion of the ELB state data was logically deleted.”
  18. US-East Region ELB | Severe but localized interruption | 12:24 PM PST on December 24 | ELB state data logically deleted | ELB control plane manages configurations | Tracking hosts for traffic routing. “This data is used and maintained by the ELB control plane to manage the configuration of the ELB load balancers in the region (for example tracking all the backend hosts to which traffic should be routed by each load balancer).”
  19. US-East Region ELB | Severe but localized interruption | 12:24 PM PST on December 24 | ELB state data logically deleted | ELB control plane manages configurations | Tracking hosts for traffic routing | Inadvertent maintenance process | production environment access | Unaware of error. “The data was deleted by a maintenance process that was inadvertently run against the production ELB state data. This process was run by one of a very small number of developers who have access to this production environment. Unfortunately, the developer did not realize the mistake at the time.”
  20. US-East Region ELB | Severe but localized interruption | 12:24 PM PST on December 24 | ELB state data logically deleted | ELB control plane manages configurations | Tracking hosts for traffic routing | Inadvertent maintenance process | production environment access | Unaware of error | High latency & error rates | API calls | No impact to running ELBs. “After this data was deleted, the ELB control plane began experiencing high latency and error rates for API calls to manage ELB load balancers. In this initial part of the service disruption, there was no impact to the request handling functionality of running ELB load balancers because the missing ELB state data was not integral to the basic operation of running load balancers.”
  21. US-East Region ELB | Severe but localized interruption | 12:24 PM PST on December 24 | ELB state data logically deleted | ELB control plane manages configurations | Tracking hosts for traffic routing | Inadvertent maintenance process | production environment access | Unaware of error | High latency & error rates | API calls | No impact to running ELBs | Create new, but not manage existing | Failure on attempt to scale. “The team was puzzled as many APIs were succeeding (customers were able to create and manage new load balancers but not manage existing load balancers) and others were failing. As this continued, some customers began to experience performance issues with their running load balancers. These issues only occurred after the ELB control plane attempted to make changes to a running load balancer.”
  22. US-East Region ELB | Severe but localized interruption | 12:24 PM PST on December 24 | ELB state data logically deleted | ELB control plane manages configurations | Tracking hosts for traffic routing | Inadvertent maintenance process | production environment access | Unaware of error | High latency & error rates | API calls | No impact to running ELBs | Create new, but not manage existing | Failure on attempt to scale | 6.8% directly impacted, rest no scaling. “At 5:02 PM PST, the team disabled several of the ELB control plane workflows (including the scaling and descaling workflows) to prevent additional running load balancers from being affected by the missing ELB state data. At the peak of the event, 6.8% of running ELB load balancers were impacted. The rest of the load balancers in the system were unable to scale or be modified by customers, but were operating correctly.”
  23. US-East Region ELB | Severe but localized interruption | 12:24 PM PST on December 24 | ELB state data logically deleted | ELB control plane manages configurations | Tracking hosts for traffic routing | Inadvertent maintenance process | production environment access | Unaware of error | High latency & error rates | API calls | No impact to running ELBs | Create new, but not manage existing | Failure on attempt to scale | 6.8% directly impacted, rest no scaling | Merge old state | Initial recovery plan failed. “The team attempted to restore the ELB state data to a point-in-time just before the event began. By restoring the data to this time, we would be able to merge in events that happened after ... to create an accurate state. ... the initial method used by the team to restore the ELB state data ... failed to provide a usable snapshot of the data. This delayed recovery until an alternate recovery process was found.”
  24. US-East Region ELB | Severe but localized interruption | 12:24 PM PST on December 24 | ELB state data logically deleted | ELB control plane manages configurations | Tracking hosts for traffic routing | Inadvertent maintenance process | production environment access | Unaware of error | High latency & error rates | API calls | No impact to running ELBs | Create new, but not manage existing | Failure on attempt to scale | 6.8% directly impacted, rest no scaling | Merge old state | Initial recovery plan failed | 10:30 am substantial recovery; 20 hours. “The system began recovering the remaining affected load balancers, and by 8:15 AM PST, the team had re-enabled the majority of APIs and backend workflows. By 10:30 AM PST, almost all affected load balancers had been restored to full operation. While the service was substantially recovered at this time, the team continued to closely monitor the service before communicating broadly that it was operating normally at 12:05 PM PST.”
  25. US-East Region ELB | Severe but localized interruption | 12:24 PM PST on December 24 | ELB state data logically deleted | ELB control plane manages configurations | Tracking hosts for traffic routing | Inadvertent maintenance process | production environment access | Unaware of error | High latency & error rates | API calls | No impact to running ELBs | Create new, but not manage existing | Failure on attempt to scale | 6.8% directly impacted, rest no scaling | Merge old state | Initial recovery plan failed | 10:30 am substantial recovery; 20 hours. “We have made a number of changes to protect the ELB service from this sort of disruption in the future. • modified the access controls on our production ELB state data • modified our data recovery process to reflect the learning we went through in this event” “We will also incorporate our learning from this event into our service architecture. We believe that we can reprogram [to] allow the service to recover automatically from logical data loss.”
  26. Netflix: Outage: Christmas Eve, 12:30pm Pacific | Amazon Web Services, Elastic Load Balancers | Americas only | TV connected devices, primarily | 100’s of ELBs | ~1:1 ELB: Device Type | Failure localized to only some ELBs | Issue was requests not passed through | Slight performance impact to Mac/PC | Game consoles impacted 7 hours. Amazon: US-East Region ELB | Severe but localized interruption | 12:24 PM PST on December 24 | ELB state data logically deleted | ELB control plane manages configurations | Tracking hosts for traffic routing | Inadvertent maintenance process | production environment access | Unaware of error | High latency & error rates | API calls | No impact to running ELBs | Create new, but not manage existing | Failure on attempt to scale | 6.8% directly impacted, rest no scaling | Merge old state | Initial recovery plan failed | 10:30 am substantial recovery; 20 hours.
  27. {(Netflix) + (Amazon)}. Events: Outage: Christmas Eve, 12:30pm Pacific | Americas only | TV connected devices, primarily | 12:24 PM PST on December 24 | US-East Region ELB | ELB state data logically deleted | Severe but localized interruption. Structural Explanation (Objects & Relationships): Inadvertent maintenance process | production environment access | Unaware of error | Amazon Web Services, Elastic Load Balancers | ~1:1 ELB: Device Type | 100’s of ELBs | Create new, but not manage existing | Failure on attempt to scale | Merge old state | ELB control plane manages configurations | Tracking hosts for traffic routing | ELB control plane | 10:30 am substantial recovery; 20 hours | Initial recovery plan failed. Patterns: Failure localized to only some ELBs | Issue was requests not passed through | Game consoles impacted 7 hours | Slight performance impact to Mac/PC | High latency & error rates | No impact to running ELBs | API calls | 6.8% directly impacted, rest no scaling.
  28. {Netflix + Amazon}. Events: Outage: Christmas Eve, 12:30pm Pacific | Americas only | TV connected devices, primarily | 12:24 PM PST on December 24 | US-East Region ELB | ELB state data logically deleted | Severe but localized interruption. Structural Explanation (Objects & Relationships): Inadvertent maintenance process | production environment access | Unaware of error | Amazon Web Services, Elastic Load Balancers | ~1:1 ELB: Device Type | 100’s of ELBs | Create new, but not manage existing | Failure on attempt to scale | Merge old state | ELB control plane manages configurations | Tracking hosts for traffic routing | ELB control plane | 10:30 am substantial recovery; 20 hours | Initial recovery plan failed. Patterns: Failure localized to only some ELBs | Issue was requests not passed through | Game consoles impacted 7 hours | Slight performance impact to Mac/PC | High latency & error rates | No impact to running ELBs | API calls | 6.8% directly impacted, rest no scaling.
  29. (no text on this slide)
  30. What is a system?
  31. Bricks
  32. Brick Systems or Brick Collections?
  33. “A system is an interconnected set of elements that is coherently organized in a way that achieves something.” (Donella Meadows, Thinking in Systems)
  34. Operational View of a System: 1. Objects. (Diagram: objects A, B, C, D.)
  35. Operational View of a System: 1. Objects; 2. Relationships. (Diagram: objects A, B, C, D.)
  36. Operational View of a System: 1. Objects; 2. Relationships; 3. Currency. (Diagram: objects A, B, C, D.)
  37. Operational View of a System: 1. Objects; 2. Relationships; 3. Currency; 4. Boundary. (Diagram: objects A, B, C, D.)
  38. Operational View of a System: 1. Objects; 2. Relationships; 3. Currency; 4. Boundary; 5. Purpose. (Diagram: objects A, B, C, D with Input and Output.)
  39. Dynamic View of a System. (Diagram: objects A, B, C, D with Input and Output, then A’, B’, C’, D’ with Input’ and Output’ further along the Time axis.)
  40. Dynamic View of a System: Behavior vs Time. (Plot: Output, scale 0 to 100, against Time, scale 0 to 20.)
  41. The General System: A system is an interconnected set of elements that is coherently organized in a way that achieves something.
  42. The Specific System: These elements. Those connections. This organization. That boundary. This purpose.
  43. Seeing systems
  44. If it looks like a duck... • A system’s parts must all be present for the system to carry out its purpose optimally. • A system’s parts must be arranged in a specific way for the system to carry out its purpose. • Systems have specific purposes within larger systems. • Systems maintain their stability through fluctuations and adjustments. • Systems have feedback.
  45. The nature of systems is that your understanding of a particular one gets more precise over time.
  46. Seeing Systems. EVENTS: Outage: Christmas Eve, 12:30pm Pacific | Severe but localized interruption. Events are what we notice first.
  47. Seeing Systems. EVENTS: Outage: Christmas Eve, 12:30pm Pacific | Severe but localized interruption. PATTERNS: TV connected devices, primarily | Failure on attempt to scale. Patterns = Observation(Events + Time)
  48. Seeing Systems. EVENTS: Outage: Christmas Eve, 12:30pm Pacific | Severe but localized interruption. PATTERNS: TV connected devices, primarily | Failure on attempt to scale. STRUCTURE: Issue was requests not passed through | ELB state data logically deleted. From patterns we deduce structure via ‘black box’ process
  49. Seeing Systems. EVENTS: Outage: Christmas Eve, 12:30pm Pacific | Severe but localized interruption. PATTERNS: TV connected devices, primarily | Failure on attempt to scale. STRUCTURE: Issue was requests not passed through | ELB state data logically deleted. CONTEXT: Amazon Web Services, Elastic Load Balancers | ELB control plane manages configurations. Context helps us discriminate the isomorph
  50. Fin. 1. Objects; 2. Relationships; 3. Currency; 4. Boundary; 5. Purpose. (Diagram: objects A, B, C, D with Input and Output.)
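
The five-part operational view recapped on the closing slide (objects, relationships, currency, boundary, purpose) can be sketched in code. What follows is a minimal, hypothetical Python model and not anything taken from the lecture: the `System` class, its fields, and the device and ELB names are all assumptions made for illustration. It is loosely patterned on the Netflix story, where requests are the currency and a failed ELB blocks only the device types that depend on it.

```python
from dataclasses import dataclass, field


@dataclass
class System:
    """Hypothetical container for the five parts of the operational view."""
    purpose: str    # 5. Purpose: what the system achieves
    currency: str   # 3. Currency: what flows between objects
    objects: set = field(default_factory=set)           # 1. Objects
    relationships: dict = field(default_factory=dict)   # 2. Relationships: object -> next object
    failed: set = field(default_factory=set)            # objects that are currently broken

    def inside_boundary(self, name):
        # 4. Boundary: anything never registered as an object is outside the system.
        return name in self.objects

    def connect(self, source, target):
        # Registering a relationship also places both ends inside the boundary.
        self.objects.update({source, target})
        self.relationships[source] = target

    def route(self, start):
        """Follow relationships from start; return the path, or None if it is blocked."""
        if start in self.failed:
            return None
        path = [start]
        current = start
        while current in self.relationships:
            current = self.relationships[current]
            # A failed object (or a cycle) stops the currency from flowing.
            if current in self.failed or current in path:
                return None
            path.append(current)
        return path


# Assumed example values: two device types, each depending on its own ELB.
streaming = System(purpose="deliver video to viewers", currency="requests")
streaming.connect("game console", "ELB-1")
streaming.connect("ELB-1", "streaming servers")
streaming.connect("Mac/PC", "ELB-2")
streaming.connect("ELB-2", "streaming servers")
streaming.failed.add("ELB-1")

print(streaming.route("game console"))  # None: the failed ELB blocks the requests
print(streaming.route("Mac/PC"))        # ['Mac/PC', 'ELB-2', 'streaming servers']
```

Keeping the relationships as a plain mapping makes the localized failure easy to see: the servers keep working, yet only the paths that pass through the failed object stop carrying the currency.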
