
Production Readiness Strategies in an Automated World

Production Ready. What does it mean? And to whom? Does that term factor in post-launch concerns such as debuggability and ownership? What are the lifecycle phases for moving an idea into a hardened production system?

As the world continues its furious adoption of automation, Foo-as-a-Service, and ever-changing tools, what are the baseline assumptions, risks, checklists, and processes required to support the evolving landscape of "production ready"? In this talk we will deploy a sample application and build both a checklist and a scorecard to evaluate the readiness of a system and an organization's practices.

Published in: Internet

Production Readiness Strategies in an Automated World

  1. 1. Production Readiness Strategies in an Automated World
  2. 2. Sean Chittenden Engineering, HashiCorp @SeanChittenden https://keybase.io/seanc
  3. 3. Dev to Prod
  4. 4. Background
  5. 5. Software Life Cycle
  6. 6. Idea! Software Life Cycle
  8. 8. Software Life Cycle Time Prod 1) Idea! R&D
  9. 9. Software Life Cycle Time Prod 1) Idea! 2) Production Ready R&D
  13. 13. Software Life Cycle Time Readiness 1) Idea! 2) Production Ready 3) End of Life 2.9) "It’ll be time to wind this service down when ___ happens and ___ comes online." R&D
  14. 14. Software Life Cycle Time Production 1) Idea! 2) Production Ready 3) End of Life "Production Supported" "Oops" R&D
  15. 15. Software Life Cycle Time Production 1) Idea! 2) Production Ready 4) End of Life "Production Supported" 3) "Oops" R&D
  16. 16. Software Life Cycle Time Production 1) Idea! N) End of Life "Production Supported" Forced to fix code or docs. R&D
  17. 17. Software Life Cycle Time Production 1) Idea! 2) Production Ready N) End of Life "Production Supported" "Dragged feet to produce docs." [3,M) "Oops" R&D N-1) "That’s it, we’ve had enough…"
  18. 18. Software Life Cycle Time Production 1) Idea! 2) Production Ready N) End of Life "Production Supported" [3,M) "Oops" R&D N-2) "That’s it, we’ve had enough…" N-1) "Just support it until the next version is out"
  19. 19. Operations in the "Real World"
  20. 20. Complexity Abound The Echo Service: Stateless HTTP Echo
      $ go get github.com/hashicorp/http-echo
      $ http-echo -text foo
      $ curl http://127.0.0.1:5678/
      foo
  21. 21. Echo as a Service Components: • Echo Service • Load Balancer • "Hardware" / OS • Metrics Agent • Logs Management • Reproducible Builds
      $ cd $GOPATH/src/github.com/hashicorp/http-echo/
      $ git checkout 87ee38c517094993932bd76b37af03980e8c4151
      $ go build
  22. 22. Complexity In The Simple Case Simple Example: The Echo Service Minimum of 6x dimensions to be concerned about No downstream services: only request + response
  23. 23. Echo as a Service Dimensions of Work to measure: • CPU • RAM usage • Network Usage • TCP accept/connection rate • Disk Capacity • Disk IO (maybe?) • Stability • Request volume • Request Latency
  24. 24. "Can't Escape the Signal, Mal" The Echo Service: Stateless HTTP Echo
      2016/11/18 03:29:58 Server is listening on :5678
      2016/11/18 03:30:00 127.0.0.1:5678 127.0.0.1:61932 "GET / HTTP/1.1" 200 4 "curl/7.51.0" 15.94µs
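Request volume and request latency, two of the dimensions listed on the earlier slide, can be recovered from this access log. The following is a minimal sketch, assuming every request line looks like the single sample above; the regular expression is inferred from that one line and is not an official description of the http-echo log format.

```python
import re
from typing import NamedTuple, Optional

# Pattern inferred from the one sample line on the slide (an assumption).
LOG_RE = re.compile(
    r'^(?P<date>\S+) (?P<time>\S+) (?P<local>\S+) (?P<remote>\S+) '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+) "(?P<agent>[^"]*)" '
    r'(?P<dur>[\d.]+)(?P<unit>ns|[\u00b5\u03bc]s|ms|s)$'
)

# Duration suffixes mapped to seconds; both micro-sign spellings are accepted.
UNIT_SECONDS = {"ns": 1e-9, "\u00b5s": 1e-6, "\u03bcs": 1e-6, "ms": 1e-3, "s": 1.0}


class Request(NamedTuple):
    method: str
    path: str
    status: int
    size: int
    seconds: float


def parse_line(line: str) -> Optional[Request]:
    """Return the parsed request, or None for non-request lines."""
    m = LOG_RE.match(line.strip())
    if m is None:
        return None  # e.g. the "Server is listening" startup line
    seconds = float(m["dur"]) * UNIT_SECONDS[m["unit"]]
    return Request(m["method"], m["path"], int(m["status"]), int(m["size"]), seconds)
```

Counting the parsed records per time window gives request volume; the `seconds` field feeds a latency distribution.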
  25. 25. Echo as a Service Complexity Factor: ~10
  26. 26. Echo's Operational Concerns Loss Aversion • Uptime • Secrets • Planned Failure Modes: failure on a probability curve • Server Uptime (e.g. OS or Hardware) • Unplanned Failure Modes (e.g. DC or AZ fails)
  27. 27. Entropy and Failure: Best Friends
  28. 28. Echo's Operational Concerns Loss Aversion • Uptime • Secrets • Planned Failure Modes: failure on a probability curve • Server Uptime (e.g. OS or Hardware) • Unplanned Failure Modes (e.g. DC or AZ fails in an earthquake) • Success Failure Modes Randall A. Lewis and David H. Reiley. 2013. Down-to-the-minute effects of super bowl advertising on online search behavior.
 http://dx.doi.org/10.1145/2482540.2482600
  29. 29. Echo's Operational Concerns Loss Aversion • Uptime • Secrets • Planned Failure Modes: failure on a probability curve • Server Uptime (e.g. OS or Hardware) • Unplanned Failure Modes (e.g. DC or AZ fails) • Success Failure Modes • Known Architectural Limits • Unknown Architectural Limits
  30. 30. Performance Spelunking Exciting, but not very fun
  31. 31. Lurking Significant Details Imagine a more complex service: • an API server that fans out to ~20 downstream services • Uses async scatter/gather to fan out requests • Transient failures become the norm
  32. 32. Stateful Complexity Database-as-a-Service: PostgreSQL Edition
  33. 33. SQL WAL Files Log Files PostgreSQL as a Service Components: • PostgreSQL • Connection Pooler (pgbouncer) • PITR Manager (WAL-E, omnipitr, pgBackRest) • Logs Analyzer (pgbadger, pgfouine) • Metrics Agent • Failover Manager (Connections, State, Data Continuity/Self-Healing) • SchemaVersioning
  34. 34. SQL WAL Files Log Files PostgreSQL as a Service Dimensions of Work to measure: • CPU • RAM usage • Network Usage • TCP accept/connection rate • Disk Capacity • Maybe disk IO (read, write) • Stability • Request volume • Request Latency • Query performance • Kernel Lock Contention • Userland buffer eviction rate • Cache-miss rate • Size of blast radius • ... etc.
  35. 35. SQL WAL Files Log Files PostgreSQL as a Service Complexity Factor: ~30 x (number of tables x metrics per table)
  36. 36. SQL WAL Files Log Files PostgreSQL as a Service Database PSA Tangent: • Don't confuse complexity with value. • Databases are amazingly useful things because of their productivity and value as a network service. • Databases assume the lion's share of the complexity burden: centralized complexity is easier than distributed complexity.
  37. 37. How do you systematically address inherent, necessary complexity?
  38. 38. Checklists • Identify Problems • Read - Do Checklists • Ensure critical steps hit • Useful in emergencies (plane on fire? Do X,Y, and Z...) • Do - Confirm Checklists • Verify muscle memory • Combats atrophy and fatigue
  39. 39. Building a Modern Operations Checklist
  40. 40. Who uses checklists? Astronauts Surgeons Pilots Inspectors Military IT/Operations?
  41. 41. Good Checklists • Have a clear purpose • Are brief: 10-20 items, fit on a single page • Focus on what's essential/mandatory • Enumerate what must be done (and frequently forgotten) • Don't replace personal judgement or skill • Enforce discipline • Provide tools for collaboration and communication • Establish protocol or enforce a norm
  42. 42. Good Checklists • Have a clear purpose • Are brief: 10-20 items, fit on a single page • Focus on what's essential/mandatory • Enumerate what must be done (and frequently forgotten) • Don't replace personal judgement • Enforce discipline • Provide tools for collaboration and communication • Establish protocol or enforce a norm
  43. 43. Building a Modern Operations Checkli^WAudit
  44. 44. Production Ready SQL WAL Files Log Files
  45. 45. Production Ready SQL WAL Files Log Files Organizational Challenges Technical Challenges
  46. 46. Organizational Prerequisites Standardized Jargon (e.g. SEV1 vs SEV2, client vs consumer) Policy for Unique Service namespaces (app1 vs appN vs dbN)
      # Deny registration access to services prefixed
      # "app1-". Discovery of the service is still
      # allowed in read mode.
      service "app1-" { policy = "read" }
      service "app2-" { policy = "write" }
  47. 47. Organizational Prerequisites Standardized Jargon (e.g. SEV1 vs SEV2, client vs consumer) Policy for Unique Service namespaces (app1 vs appN vs dbN) Naming conventions established within a service (app1-api1 vs app1-dbN) Rules of Engagement outlining how an outage is: 1. Identified 2. Responded to 3. Recovered from 4. Prevented 5. Prepared for 6. GOTO step #1
  48. 48. Organizational Prerequisites Standardized Jargon (e.g. SEV1 vs SEV2, client vs consumer) Policy for Unique Service namespaces (app1 vs appN vs dbN) Naming conventions established within a service (app1-api1 vs app1-dbN) Rules of Engagement outlining how an outage is handled Centralized documentation Establish a culture of systems thinking
  49. 49. Organizational Prerequisites Establish a culture of systems thinking: • a system is composed of parts • a system is greater than the sum of its parts • all the parts of a system must be related (directly or indirectly), else there are really two or more distinct systems • a system is encapsulated (has a boundary) • a system can be nested inside another system • a system can overlap with another system • a system consists of processes that transform inputs into outputs • a system is autonomous in fulfilling its purpose:
      A car is not a system. A car with a driver is a system.
  50. 50. Organizational Prerequisites Standardized Jargon (e.g. SEV1 vs SEV2, client vs consumer) Policy for Unique Service namespaces (app1 vs appN vs dbN) Naming conventions established within a service (app1-api1 vs app1-dbN) Rules of Engagement outlining how an outage is handled Centralized documentation Establish a culture of systems thinking Establish end-to-end ownership Decouple service names from team names
  51. 51. Why do we care? • We aren't always going to be working on our code. • We need to establish a culture of maintenance and the necessary supporting systems, both organizational and technical.
  52. 52. Audit Reduced to a Checklist High-level summary of the service? Stateful or Stateless List of important consumers Release Process On-Call Instructions / Incident Response Health Defined Customer Service Endpoint? Backups Geographic Redundancy
  53. 53. Audit back to Checklist • High-level summary of the service? => Organizational Concern • Stateful or Stateless => Technical Concern • List of important consumers => Tech and Org Concern • Release Process => Organizational Concern • On-Call Instructions / Incident Response => Organizational Concern • Health Defined => Technical Concern • Customer Service Endpoint? => Organizational Concern • Backups => Organizational Concern • Geographic Redundancy => Organizational Concern
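The nine-item checklist and its concern labels can be turned into a simple scorecard. This is a rough sketch only: the item names and labels come from the slide (paired in the order they appear), while the `Item` type, the `scorecard` function, and the idea of scoring as a fraction per concern are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class Item:
    name: str
    concern: str  # "org", "tech", or "both", in the order listed on the slide


CHECKLIST: List[Item] = [
    Item("High-level summary of the service", "org"),
    Item("Stateful or stateless", "tech"),
    Item("List of important consumers", "both"),
    Item("Release process", "org"),
    Item("On-call instructions / incident response", "org"),
    Item("Health defined", "tech"),
    Item("Customer service endpoint", "org"),
    Item("Backups", "org"),
    Item("Geographic redundancy", "org"),
]


def scorecard(answered: Set[str]) -> Dict[str, float]:
    """Fraction of checklist items satisfied, grouped by concern type."""
    totals: Dict[str, int] = {}
    done: Dict[str, int] = {}
    for item in CHECKLIST:
        totals[item.concern] = totals.get(item.concern, 0) + 1
        if item.name in answered:
            done[item.concern] = done.get(item.concern, 0) + 1
    return {concern: done.get(concern, 0) / n for concern, n in totals.items()}
```

A low score in the "org" column points at process gaps rather than code gaps, which is the distinction the slide draws.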
  54. 54. Plan, Doc, Vet, and Decide Starting Here... Time Prod 1) Idea! 2) Production Ready R&D
  55. 55. ... ideally before here... Time Production 1) Idea! N) End of Life "Production Supported" Forced to fix code or docs. R&D
  56. 56. ... but NO later than here!!! Time Production 1) Idea! N) End of Life "Production Supported" Forced to fix code or docs. R&D
  57. 57. (It's good to refine here when this happens) Time Production 1) Idea! N) End of Life "Production Supported" Forced to fix code or docs. R&D
  58. 58. Value from Checklists • High-level summary of the service? => Faster Training / Fungible Skills • Stateful or Stateless => Universal / Consistent / Standard • List of important consumers => Faster Understanding and Training • Release Process => Faster Resolution / Fungible Skills • On-Call Instructions / Incident Response => Larger Pool / Increased Sympathy • Health Defined => Standardized Resolution • Customer Service Endpoint? => One Source of Truth • Backups => Standard Procedures • Geographic Redundancy => Unplanned Disasters Mitigated
  59. 59. How do you build a checklist?
  60. 60. Summary: Vertical Places to Look SQL WAL Files Log Files Organizational Challenges Technical Challenges
  61. 61. Summary: Horizontal Places to Look Time Prod 1) Idea! 2) Production Ready R&D
  62. 62. Questions? Thank the audience for their time. Name: Sean Chittenden Twitter: @SeanChittenden
  63. 63. Recommended Reading
  64. 64. Seed Questions for Checklists
  65. 65. Service Checklist: Overview Service Overview • Description and relevance to the business • Short explanation of how the service fits into the ecosystem of microservices • Pointers to more detailed documentation • Pointers to the current team owners Stateful or Stateless service Does the service employ any internal caching? Dependency management: e.g. embedded libraries that have been vendored (not necessary with Go, this is self-evident)
  66. 66. Service Checklist: Overview Service Overview
      $ head my-service.job
      # This declares a job named "service123". There can be exactly one
      # job declaration per job file.
      job "service123" {
        # Specify this job should run in the region named "us". Regions
        # are defined by the Nomad servers' configuration.
        region = "us"
        # Spread the tasks in this job between us-west-2 and us-east-1.
        datacenters = ["us-west-2", "us-east-1"]
        # Run this job as a "service" type. Each job type has different
        # properties. See the documentation below for more examples.
        type = "service"
  67. 67. Service Checklist: Overview Service Overview
      $ head my-docs.job
      # This declares a job named "docs". There can be exactly one
      # job declaration per job file.
      job "docs" {
        meta {
          owner = "https://github.com/myorg/myproject/blob/master/owners.md"
          docs-url = "https://github.com/myorg/myproject"
          system-summary = "https://github.com/myorg/myproject/blob/master/system-summary.md"
        }
  68. 68. Service Overview • Auditable via the API:
 http://nomad.service.consul:4646/v1/job/<ID> Service Checklist: Overview
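Because the job definition is readable over the API, the presence of the ownership and documentation pointers can be audited mechanically. The sketch below assumes the job JSON returned by the endpoint above carries the `meta` block under a top-level `Meta` key; the required keys are the three shown in the `my-docs.job` example, and the function names are hypothetical.

```python
import json
import urllib.request
from typing import Any, Dict, List

# Keys taken from the `meta` block in the my-docs.job example.
REQUIRED_META = ("owner", "docs-url", "system-summary")


def missing_meta(job: Dict[str, Any]) -> List[str]:
    """List the required meta keys that are absent or empty in a job document."""
    meta = job.get("Meta") or {}  # assumption: meta is exposed as "Meta"
    return [key for key in REQUIRED_META if not meta.get(key)]


def fetch_job(job_id: str, base: str = "http://nomad.service.consul:4646") -> Dict[str, Any]:
    """Fetch one job definition from the endpoint shown on the slide."""
    with urllib.request.urlopen(f"{base}/v1/job/{job_id}", timeout=5) as resp:
        return json.load(resp)
```

An audit run would call `missing_meta(fetch_job("docs"))` for each job and report any non-empty result.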
  69. 69. List of high-level consumers • API consumed by other services within the organization • Public Internet • Marketing (a/b testing?) • Customer Service Service Confidentiality Classification Sales Information • Unofficial docs that can be used by sales or marketing. Authoritative information comes from the team writing the service. Doesn't need to be final copy, but should include useful figures about this service. Service Checklist: Overview
  70. 70. Release Process On-call - what's the fallback strategy for a small service with a team of two? How is the service installed? How is the service configured? How is the service's process managed? • How is it started? • How is it stopped? • Is there a graceful shutdown procedure vs a rapid shutdown procedure? • Can you send a SIGKILL signal to the process? Incident Response
  71. 71. Release Process On-call - what's the fallback strategy for a small service with a team of two? How is the service installed? How is the service configured? How is the service's process managed? Is the process management platform-specific? Is there a table mapping each signal to the effect of the signal? Process Management Is Process Management hooked into the monitoring and alerting framework? Incident Response
  72. 72. Health Health of the Service What is the definition of healthy? TIP: Use Consul Health Checks for Break/Fix
      {
        "service": {
          "name": "redis",
          "tags": ["master"],
          "address": "127.0.0.1",
          "port": 8000,
          "enableTagOverride": false,
          "checks": [
            {
              "script": "/usr/local/bin/check_redis.py",
              "interval": "10s"
            }
          ]
        }
      }
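The contents of `check_redis.py` are not shown in the talk. A script check of this kind might look roughly like the sketch below, which relies on Consul's script-check convention (exit code 0 is passing, 1 is warning, anything else is critical). The latency thresholds and function names are illustrative assumptions; the address and port are the ones in the service definition above.

```python
import socket
import time

PASSING, WARNING, CRITICAL = 0, 1, 2  # exit codes read by a Consul script check


def classify(latency_s, warn_after_s=0.05, crit_after_s=0.5):
    """Map a round-trip time (or None on failure) to an exit code."""
    if latency_s is None or latency_s >= crit_after_s:
        return CRITICAL
    if latency_s >= warn_after_s:
        return WARNING
    return PASSING


def ping_latency(host="127.0.0.1", port=8000, timeout_s=1.0):
    """Seconds taken by a Redis PING round trip, or None if it failed."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout_s) as conn:
            conn.sendall(b"PING\r\n")
            if not conn.recv(16).startswith(b"+PONG"):
                return None
    except OSError:
        return None
    return time.monotonic() - start


def main() -> int:
    return classify(ping_latency())
```

Saved as a script, it would finish with `sys.exit(main())` so that Consul sees the exit code. Returning a warning state before the critical one is what lets "healthy" be more than a binary up/down.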
  73. 73. Health of the Service What is the definition of healthy? Is there any Seasonality to the definition of healthy? How do you observe the service? Is there any automated capacity planning attached to the service? Health
  74. 74. Customer Service How does customer service interact with this service? Does CS have direct access to PII or other sensitive material? Customer Service
  75. 75. Quality Metrics What are the important KPIs coming out of this service? • If you don't measure it, you won't optimize for it. • If you don't measure it, you can't manage it. • You can only succeed at what you can measure. • You can't improve what you don't measure.
  76. 76. Quality Metrics What are the important KPIs coming out of this service? Measuring the number of round-trips between Support and Customers/Users Measuring the number of round-trips between Support and Engineering Measuring the "level of effort" or amount of input a person has to submit in order to receive support. Accuracy of information provided by customers? Measure the "rate of access" to PII information.
  77. 77. Quality Metrics What are the important KPIs coming out of this service? Strategy: Centralize and poll for number of tagged issues out of GitHub.
  78. 78. Organization Prerequisites Define the gradients in an outage • SEV1 - Hard outage, complete loss of service or "major impact to business value/revenue". • SEV2 - Partial outage or impaired service (SLA violation). • SEV3 - Integrity of service issue (bugs). • SEV4 - Non-critical issue that needs to be prioritized 9-5 M-F. • SEV5 - Janitorial work that needs to happen on a routine schedule. Define what it means to follow through with an outage. • What level of follow through is required? • Postmortems? • Who patches it and who receives time to actually fix it permanently?
  79. 79. Outage Consequences Revenue Impact User Impact Systems Impact Escalation SEV1 SEV2 SEV3 SEV4 SEV5
  80. 80. Outage Consequences Define the gradients in an outage Sketch out the direct and indirect consequences on the system
  81. 81. Tracing Is there a tracing token sent by upstream? If not, why not? Is this service at the boundary of HTTP and RPC? Is there an API library available that will automatically inject the tracing token into downstream calls? Can tracing only be used in aggregate or can it be used for individual problems?
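The "automatically inject the tracing token into downstream calls" idea can be sketched as a small helper that every outgoing request passes through. The header name `X-Trace-Token` is a hypothetical choice, as is minting a new token when upstream did not send one.

```python
import uuid
from typing import Dict, Mapping

TRACE_HEADER = "X-Trace-Token"  # hypothetical header name, not from the talk


def downstream_headers(incoming: Mapping[str, str]) -> Dict[str, str]:
    """Headers to attach to every downstream call made on behalf of a request."""
    # HTTP header names are case-insensitive, so normalise before the lookup.
    lowered = {name.lower(): value for name, value in incoming.items()}
    # Reuse the upstream token when present; otherwise this service is the
    # boundary and starts a new trace.
    token = lowered.get(TRACE_HEADER.lower()) or uuid.uuid4().hex
    return {TRACE_HEADER: token}
```

Logging the same token alongside each request is what makes tracing usable for an individual problem rather than only in aggregate.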
  82. 82. Geographic Redundancy Is the service geographically redundant or not? If not, why not? If yes: Does this happen automatically?
  83. 83. Geographic Redundancy
      {
        "Name": "my-query",
        "Session": "adf4238a-882b-9ddc-4a9d-5b6758e4159e",
        "Token": "",
        "Near": "node1",
        "Service": {
          "Service": "redis",
          "Failover": {
            "NearestN": 3,
            "Datacenters": ["dc1", "dc2"]
          },
          "OnlyPassing": false,
          "Tags": ["master", "!experimental"]
        },
        "DNS": {
          "TTL": "10s"
        }
      }
  84. 84. Geographic Redundancy Is the service geographically redundant or not? If not, why not? If yes: Does this happen automatically? What mechanisms handle this? Are there any regulatory concerns that come into play? Is the failover process manual? Does this happen at human timescale or on a machine timescale? Is the geographically redundant path continually tested?
  85. 85. Active-Active Can this service be active-active? If not, why not? If yes, what kind of locking concerns or information sharing concerns need to be factored in?
  86. 86. Data Classification Does the service come in contact with any sensitive data? If yes: What type of data? (PII, passwords, keys, financial information, credit cards, ACH, etc.) What regulatory compliance is applicable to this service? (Safe Harbor, PCI, SOx?) Is the data stored, or just passed in transit? Can any sensitive data end up in log files? Can sensitive, but necessary data use a proxy token instead? Can this information leave the organization and go to a third party?
  87. 87. SPOFs What SPOFs exist, if any? What's the timescale for this SPOF? What's the timescale for transition from leader to follower or follower to leader? If stateful, is "split brain" possible? NOTE: State is a SPOF: failing over state takes time.
  88. 88. Escalation Path What's the escalation path inside of the organization? What's the escalation path outside of the organization? Open Source community or commercial support? Is there semi-regular training on how to triage and escalate? Is there a playbook for relevant low-level debugging tools available for use? TIP: Use automatic escalations within PagerDuty or OpsGenie. TIP: Use standardized service techniques to create fungible support resources.
  89. 89. Quantiles of Health Can health be defined in terms of quantiles vs binary up/down? What are the upper and lower bounds for healthy? What system is authoritative for determining if something is healthy? How can an external actor verify if the system is healthy? Is there a command-line tool or API?
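Defining health by quantiles rather than a single up/down bit could look like the following sketch. The nearest-rank percentile is one common definition among several, and the p50 and p99 bounds are made-up numbers chosen only for illustration.

```python
import math
from typing import Dict, Sequence


def percentile(samples: Sequence[float], p: int) -> float:
    """Nearest-rank percentile: the smallest sample with p% of the data at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical upper bounds for "healthy" latency, in milliseconds.
BOUNDS_MS = {50: 20.0, 99: 250.0}


def health(samples_ms: Sequence[float]) -> Dict[str, bool]:
    """Report, per quantile, whether latency is inside its bound."""
    return {f"p{p}": percentile(samples_ms, p) <= limit for p, limit in BOUNDS_MS.items()}
```

With 98 requests at 10 ms and 2 at 300 ms, the median looks fine while p99 is out of bounds, which a binary check or an average would hide.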
  90. 90. Canary Does the request have a "canary request mode?" Can this be enabled per customer? Is the canary mode used in monitoring to validate end-to-end functionality?
  91. 91. Downstream Services How does this service respond upstream to failures in its downstream dependencies? Is there a metric to indicate timed-out requests? Is there a feature-flag that enables a circuit-breaker? How are connectivity problems retried in the system? Retry the same backend? Retry a different backend? Timeout? Is there a deadline timer passed in? Is a header added to indicate partial failure of downstream services? Are response codes standardized?
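The retry questions above (same backend or a different one, and whether a deadline is passed in) can be made concrete with a small sketch. It assumes a caller-supplied `call(backend, remaining_seconds)` function and treats connection errors and timeouts as the transient failures worth retrying on a different backend; all names here are hypothetical.

```python
import time
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


class Unavailable(Exception):
    """No backend produced an answer before the deadline."""


def call_with_deadline(
    backends: Sequence[str],
    call: Callable[[str, float], T],
    deadline_s: float,
    clock: Callable[[], float] = time.monotonic,
) -> T:
    """Try each backend once, in order, until one succeeds or time runs out."""
    stop_at = clock() + deadline_s
    last_error = None
    for backend in backends:
        remaining = stop_at - clock()
        if remaining <= 0:
            break  # deadline spent: fail now rather than overrun the caller
        try:
            # Pass the remaining budget down so the callee can honour it too.
            return call(backend, remaining)
        except (ConnectionError, TimeoutError) as exc:
            last_error = exc  # transient failure: move on to a different backend
    raise Unavailable("no backend answered in time") from last_error
```

Counting how often `Unavailable` is raised gives the timed-out-requests metric the slide asks about.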
  92. 92. Architectural Limits What are the expected limits of this system? How often is "peak-load" defined? Is there 3x capacity for the service in order to absorb reasonable burstiness? Is the band of nominal resource usage defined? • "At 10K RPS, network utilization should be between 200-300Mbps, using two cores at ~60% utilization, 50MB of RAM, and doing an average of 5-10 disk IOPS. All values are +/- 25%."
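The nominal band quoted on the slide can be written down as data and checked against observed values. In this sketch the numbers are transcribed from the quote, while the metric names, the per-core reading of the CPU figure, and the function names are assumptions.

```python
from typing import Dict, List

# Nominal bands at 10K RPS as (low, high), transcribed from the slide's example.
NOMINAL = {
    "network_mbps": (200.0, 300.0),
    "cpu_util_pct": (60.0, 60.0),  # assumed to mean per core, across two cores
    "ram_mb": (50.0, 50.0),
    "disk_iops": (5.0, 10.0),
}
TOLERANCE = 0.25  # "All values are +/- 25%"


def out_of_band(observed: Dict[str, float]) -> List[str]:
    """Names of metrics that fall outside their nominal band plus tolerance."""
    bad = []
    for name, (low, high) in NOMINAL.items():
        value = observed[name]
        if not (low * (1 - TOLERANCE) <= value <= high * (1 + TOLERANCE)):
            bad.append(name)
    return bad


def has_headroom(peak_rps: float, capacity_rps: float, factor: float = 3.0) -> bool:
    """True when measured capacity covers `factor` times the defined peak load."""
    return capacity_rps >= factor * peak_rps
```

For example, 90 MB of RAM at 10K RPS falls outside the 37.5 to 62.5 MB window and would be flagged even though nothing has failed yet.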
  93. 93. Logging How is logging set up? What gets logged? What is the minimum log retention? How often are logs rotated? By size or by fixed interval? Are logs shipped off box? Are they streamed without hitting disk? Is there any sensitive data in the logs?
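One way to keep sensitive data out of logs is to scrub messages before any handler sees them. The sketch below uses Python's standard `logging.Filter`; the card-number pattern is a deliberately crude, hypothetical example and would need tuning (it can also match other long digit runs).

```python
import logging
import re

# Hypothetical pattern: 13 to 16 digits, optionally separated by spaces or dashes.
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,15}\d\b")


class RedactFilter(logging.Filter):
    """Replace card-like digit runs in the formatted message with a marker."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Format first so values passed as arguments are scrubbed as well.
        record.msg = CARD_RE.sub("[REDACTED]", record.getMessage())
        record.args = ()
        return True  # keep the record; only its text was changed
```

Attaching it with `handler.addFilter(RedactFilter())` on every handler means the scrubbed text is what gets written or shipped off box.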
  94. 94. Load Shedding How can you load-shed? Are there any feature flags that enable circuit breakers that reduce expensive functionality?
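A very small form of load shedding is to cap the number of in-flight requests and reject the excess immediately instead of queueing it. This is an illustrative sketch; the class name and the idea of answering rejected requests with a "try later" response are assumptions.

```python
import threading


class LoadShedder:
    """Admit work up to a concurrency limit and reject the rest immediately."""

    def __init__(self, max_in_flight: int) -> None:
        self._max = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        """Return True if the request may proceed, False if it should be shed."""
        with self._lock:
            if self._in_flight >= self._max:
                return False  # caller answers "try later" instead of queueing
            self._in_flight += 1
            return True

    def release(self) -> None:
        """Mark one admitted request as finished."""
        with self._lock:
            self._in_flight = max(0, self._in_flight - 1)
```

Lowering the limit at runtime, for instance from a feature flag, gives operators a lever to pull during an incident.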
  95. 95. Prepare For the Worst Assume the service can't come back online, what's the impact?
  96. 96. Backup and Restore Does this system have a reproducible build? How often are backups taken? How often are the restores executed? What's the recovery point objective? What's the mean time to recovery? What's the definition of acceptable data loss in the event of failure?
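The recovery point objective question can be checked against the actual backup history. The sketch below assumes backups are point-in-time snapshots, so the worst-case data loss is the longest gap between consecutive backups, counting the gap from the latest backup until now; the function names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Sequence


def worst_case_loss(backup_times: Sequence[datetime], now: datetime) -> timedelta:
    """Largest gap between consecutive backups, including latest backup to now."""
    points = sorted(backup_times) + [now]
    return max(later - earlier for earlier, later in zip(points, points[1:]))


def meets_rpo(backup_times: Sequence[datetime], now: datetime, rpo: timedelta) -> bool:
    """True when no failure in the observed window would have lost more than the RPO."""
    return bool(backup_times) and worst_case_loss(backup_times, now) <= rpo
```

The same history says nothing about whether restores work; that still needs the restore to be executed on a schedule.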
  97. 97. Deployment How is this service tested and deployed? Is the deployment in prod any different than test? How can you roll back? Is the application part of a CI/CD pipeline? How is production data scrubbed and used in staging/UAT in order to simulate production-like loads without using production data?
