The "Holy Grail" of Dev/Ops


  1. 1. The “Holy Grail” of Dev/Ops A practical guide to what we’ve done at Cloud Posse Prepared by Erik Osterman Cloud Posse, LLC June 2017
  2. 2. Democratization of Information
  3. 3. Story Time
  4. 4. About Me ● Former Director of Cloud Architecture, CBS Interactive in San Francisco ● Ran Operations for,, and ● Worked with AWS since 2006 / Private Invite-only Beta ● Advise numerous successful venture backed startups ● Backend Software Developer, Open Source Advocate / Contributor ● Took ~2 years off to travel; visited ~30 countries
  5. 5. This Talk ● ~90 Minutes ● Q&A at the end ● Write question in the chat ● Actionable, practical advice ● Collection of our “Best Practices”
  6. 6. Best Practices (my) definition: An opinionated & proven strategy with specific tactics to help achieve the objectives for some overarching goal.
  7. 7. Emulate Giants Netflix Google Spotify Twitter
  8. 8. Our Best Practices Organizational Software Development, CI/CD, Testing, QA Infrastructure, Automation, Orchestration Logging, Monitoring, Alerting, Escalation, Remediation Security
  9. 9. Organization It all starts here
  10. 10. Realize we’re different. Managers vs Makers - We work differently (Paul Graham - Y Combinator Founder) Makers plan in half-day blocks of time Managers plan to minimize empty 15 minute slots in their calendar Interrupts are costly for developers and therefore the business
  11. 11. HumanOps (i.e. not cyborgs) Humans get tired and stressed, they feel happy and sad. Human issues are system issues. Human health impacts business health. Humans need to switch off and on again (aka sleep). Humans build and fix systems. Humans > systems
  12. 12. Right Tools for the Job Email == external communication (not tasks, threaded conversations, cat pics) Slack == all internal communications; channels for topics #dogs Quip == all documentation for transparency (engineering & business) Zoom == reliable cross-platform conferencing Asana == issue tracking
  13. 13. Technical Debt is Real Tradeoffs are inevitable. Pay the tax now or later. Later usually means bankruptcy & software rewrites Includes upgrades, refactoring, optimizations, etc It’s anything that doesn’t move the product forward But it will hold the product back This is not just a software problem. It’s a business problem too. ...and unavoidable
  14. 14. Software Development
  15. 15. Software Development Cloud Native Design - the “12 Factor” Pattern Stable Code Requires Feature Branching / Pull Requests / Code Reviews Versioning / Version Pinning Logging Local Development Environments
  16. 16. Some Bad Practices Cowboy Coding, committing to master Hardcoding secrets, hostnames, paths, etc “Clever” code is often “complicated” code Writing un-greppable code, terse variable names, Inconsistent naming conventions, long functions, and………… you get the point. Using tabs :P
  17. 17. Some Good Ones…. Strict Linting (e.g. eslint, go lint) Semantic Versioning (semver) .editorconfig (tabs or spaces?) Seed project repositories
  18. 18. Best Practice: Open Source Pattern* Leads to much cleaner code with fewer proprietary dependencies Fewer proprietary dependencies makes it more reusable across projects If you decide to release, it demonstrates the kind of engineering you do It works because a developer’s ego is on the line to write stuff that doesn’t suck Pro tip: follow the conventions of your favorite framework or package system * Does not require that organization releases code as open source
  19. 19. Best Practice: & Use well-formed Markdown syntax (.md) Write “README” files on all your projects. Explain the purpose of the project Show how to get started and where to look for more information Document breaking changes & upgrade path in Pro tip: Use a markdown editor if you’re not familiar with the syntax
  20. 20. Best Practice: Use Makefiles Provide targets for common usage E.g. deps, build, run, clean Include them with all repos Document targets purpose (##)
  21. 21. Makefile Example

        -include .secrets
        DB_HOST ?= localhost

        ## build a docker image
        build:
        	docker build -t cloudposse/test .

        ## run container
        run:
        	docker run -v $$(pwd):/app -e DB_HOST=$(DB_HOST) -e DB_PASS=$(DB_PASS) -p 8080:80 cloudposse/test

        ## test
        test:
        	curl http://localhost:8080/

  22. 22. Best Practice: Local Dev Environments Onboarding new hires should take minutes not hours Use fully automated local dev environments Use same Docker images that will run in staging/production Bind-mount local volumes to speed up iterations for “live editing” Pro Tip: Use docker-compose rather than vagrant which is too heavy
  23. 23. Best Practice: Developers write Dockerfiles Always use alpine:3.5 Base images (be wary of unofficial images) Declare all ENV in Dockerfile (like function arguments to an OS) Write as few layers as possible (chain with && ) Version Pin Everything Use 2-stage build process for thin images (C/C++, Golang)
  24. 24. Best Practice: Branch Protection Essential for security and stability of your codebase Require PR approval to merge to master Force branches to be up-to-date Disallow commits to master Restrict to squash+merge
  25. 25. Best Practice: Branch Protection
  26. 26. Best Practice: Pull Requests The smaller the better; implement exactly 1 feature Milestones Use Labels: Define PULL_REQUEST_TEMPLATE (## what, ## why, ## dependencies) Use checkboxes for TODOs ….for clean commit histories in master
  27. 27. What a PR should look like....
  28. 28. Best Practice: Follow PRs with Trailer
  29. 29. Best Practice: Application Logging Use JSON structured log events Libraries will efficiently generate/parse Human readable, highly consistent Pro tip: use Sentry to aggregate errors+warnings and log them in issue tracker
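
As a rough illustration of JSON structured log events, here is a minimal Python sketch built only on the standard `logging` and `json` modules. The names `JsonFormatter`, `build_logger`, and the `fields` key are assumptions made for this example; they do not come from the talk.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    def format(self, record):
        event = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Extra structured fields are passed as extra={"fields": {...}}
        # (a convention assumed for this sketch).
        event.update(getattr(record, "fields", {}))
        if record.exc_info:
            event["exception"] = self.formatException(record.exc_info)
        return json.dumps(event, sort_keys=True)


def build_logger(name, stream=None):
    """Return a logger that writes JSON events to `stream` (stdout by default)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(sys.stdout if stream is None else stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    log = build_logger("checkout")
    log.info("order placed", extra={"fields": {"order_id": 42}})
```

Because every event is one JSON object per line, a log shipper can parse it without custom regexes, and the same keys show up consistently across services.
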
  30. 30.
  31. 31. Best Practice: Pair Programming Lose: speed (arguably) Gain: fewer bugs, business continuity, education, team building/camaraderie When: implementing complicated features, onboarding, and triaging Pro tip: Use tmate for instant terminal sharing (
  32. 32. QA Developers with a focus on test automation Quality Control Masters of CI/CD
  33. 33. Best Practice: Bug Blowouts Set aside 1 day per week to dog food your own app Prepare test scripts (aka flows) for everyone to follow Get everyone on board, not just QA. That means developers, graphic artists, customer support, etc Monitor logs, submit bugs immediately to issue tracker
  34. 34. Best Practice: Synthetic Testing Continuous Testing of Critical User Paths Uses Browser to Automate Tests of Production Ensure User Registrations, Password Resets, Shopping Carts, and Checkout work 100% of the time Pro Tip: Check out Selenium or PhantomJS
  35. 35. Cloud Native Design Service-Oriented Architectures (SOA) Single-purpose Services (aka micro services) Connected through APIs Highly Decoupled 12 Factor Pattern
  36. 36. “12 Factor” in a Nutshell Use Environment Variables for all configuration (credentials, ports, tuning parameters, etc) Use Backing Services for everything durable Write all services as stateless & disposable Automate all admin tasks (the rest is meh)
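
One way to read "use environment variables for all configuration" is sketched below in Python. `DB_HOST` and `DB_PASS` echo the Makefile example earlier in the deck; `DB_PORT`, `DEBUG`, the `Settings` class, and `load_settings` are hypothetical names added for illustration.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    db_host: str
    db_port: int
    db_pass: str
    debug: bool


def load_settings(env=None):
    """Build Settings from environment variables; fail fast on missing secrets."""
    env = os.environ if env is None else env
    if "DB_PASS" not in env:
        # Secrets have no default: refuse to start rather than guess.
        raise RuntimeError("DB_PASS must be set in the environment")
    return Settings(
        db_host=env.get("DB_HOST", "localhost"),
        db_port=int(env.get("DB_PORT", "5432")),
        db_pass=env["DB_PASS"],
        debug=env.get("DEBUG", "false").lower() in ("1", "true", "yes"),
    )
```

Reading everything in one place at startup keeps the rest of the code free of `os.environ` lookups and makes the full configuration surface easy to document.
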
  37. 37. Best Practice: X509 Client Certificates Use CA to Sign SSL Certificates that perform certain functions Automatic transport & endpoint security for APIs Highly scalable - no API requests to validate tokens Don’t Rely on API tokens which are costly to authenticate and don’t secure the transport layer Examples: Kubernetes APIs, etcd
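
To make the client-certificate idea concrete, here is a small Python sketch using the standard `ssl` module. It configures a server-side context that rejects any client that does not present a certificate signed by your CA, so no per-request token lookup is needed. The function names and the file-path parameters are assumptions for this example.

```python
import ssl


def client_cert_server_context(ca_file=None, cert_file=None, key_file=None):
    """Server-side TLS context that requires a CA-signed client certificate."""
    ctx = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
    # Refuse the handshake unless the client presents a valid certificate.
    ctx.verify_mode = ssl.CERT_REQUIRED
    if ca_file:
        # Only certificates signed by this CA will be accepted.
        ctx.load_verify_locations(cafile=ca_file)
    if cert_file:
        # The server's own certificate and private key.
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return ctx


def peer_common_name(peercert):
    """Extract the commonName from the dict returned by SSLSocket.getpeercert()."""
    for rdn in peercert.get("subject", ()):
        for key, value in rdn:
            if key == "commonName":
                return value
    return None
```

After the handshake, `peer_common_name(conn.getpeercert())` gives the caller's identity, which the service can map to a role.
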
  38. 38. CI/CD Frequency reduces Difficulty. The more you deploy, the easier it gets. Latency between check-in and production is risky. It’s like HFT. Faster delivery improves software development practices Consistency improves confidence
  39. 39. Best Practice: Safe Schema Migrations Ensure applications support same backend schema for adjacent releases Use feature flags to enable new features of backend schemas
  40. 40. Best Practice: Use a Build Harness Write terse .travis.yml, circle.yml, Jenkinsfile Use the same targets in all projects Use Makefile to automate build, test Clone harness repo after git checkout Example:
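
A hedged sketch of how a feature flag can keep adjacent releases compatible with the same schema: the code below assumes a hypothetical column rename from `name` to `full_name`, with a flag named `FEATURE_FULL_NAME_COLUMN`. All of these names are invented for illustration.

```python
import os


def flag_enabled(flag, env=None):
    """Read a boolean feature flag from the environment (off by default)."""
    env = os.environ if env is None else env
    return env.get(flag, "false").lower() in ("1", "true", "yes")


def user_row_for_write(full_name, use_new_schema):
    """Always write the old column; also write the new one once the flag is on."""
    row = {"name": full_name}
    if use_new_schema:
        row["full_name"] = full_name
    return row


def display_name(row, use_new_schema):
    """Prefer the new column when enabled, but fall back to the old one."""
    if use_new_schema and row.get("full_name") is not None:
        return row["full_name"]
    return row["name"]
```

With the flag off, the release behaves exactly like its predecessor, so a rollback never meets a schema it cannot read. The old column is dropped only after every running release reads the new one.
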
  41. 41. Best Practice: Liberal Tagging Tag all docker images with multiple tags, in addition to release tags Let $ref = {branch|tag} Then, tag $ref $ref-$build $git_hash
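
The tagging scheme above (`$ref`, `$ref-$build`, `$git_hash`) could be computed along these lines. This is a sketch: the helper names are hypothetical, and the sanitizing step is an addition, since branch names such as `feature/login` contain characters that Docker tags do not allow.

```python
import re


def image_tags(ref, build, git_hash):
    """Return the tags for one build: $ref, $ref-$build, $git_hash."""
    # Docker tags may only contain letters, digits, '_', '.' and '-'.
    safe_ref = re.sub(r"[^A-Za-z0-9_.-]", "-", ref)
    return [safe_ref, f"{safe_ref}-{build}", git_hash]


def docker_tag_commands(image, source_tag, tags):
    """Build the `docker tag` command lines that apply every tag to one image."""
    return [f"docker tag {image}:{source_tag} {image}:{tag}" for tag in tags]


if __name__ == "__main__":
    tags = image_tags("feature/login", 42, "abc1234")
    for command in docker_tag_commands("cloudposse/test", "latest", tags):
        print(command)
```

The moving `$ref` tag always points at the newest build of a branch, while `$ref-$build` and `$git_hash` stay fixed, so any deployment can be traced back to an exact commit.
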
  42. 42. #DevOps
  43. 43. It is not… a) A dedicated team within the organization b) A job title c) A sysadmin d) A skill e) all the above
  44. 44. The Old Paradigm
  45. 45. What it actually is... A cross-disciplinary engineering culture Infrastructure is Code Automation over toil A path towards “Serverless” (but we’re still far away!) Site Reliability Engineering (“SRE”)
  46. 46. Infrastructure as Code Infrastructure is now 100% API driven “Best Practices” of Development → Infrastructure Versioned Infrastructure Automated Remediations
  47. 47. Best Practice: Automated Orchestration Use Terraform to fully orchestrate environments (e.g. DNS, instances, volumes, AutoScaling Groups, Load Balancers, Databases) S3 remote backends to store state for collaboration and backups Use modules to encapsulate business logic for consistency / manageability Version pin modules and dependencies to ensure stability
  48. 48. Best Practice: Tools as Containers Only local dependency should be docker and maybe make =) Distribute all other local development tools or dependencies as containers (e.g. terraform, aws, kops, helm, etc...) Easier to standardize on one OS Example:
  49. 49. Best Practice: 100% Isolation Use (1) AWS Account per Stage (E.g. production, staging, dev) Use (1) VPC per Cluster Use (1) Dedicated TLD per AWS Account (e.g.,, Use (1) Single Process Containers for all Apps
  50. 50. Best Practice: Identical Environments Environments should only differ in size, not shape “Production”, “Staging”, “Dev” are only labels Run as many parallel environments as we need Only manual action is initiating build E.g. other labels: pentest, loadtest, erik Pro tip: each environment gets its own DNS zone (e.g.
  51. 51. What We Want Reliable - we want things to be online 100% of the time and when things go wrong, we want them to auto-heal. Fast - we want to run a site that can scale horizontally as traffic increases Easy - we shouldn't need rocket scientists to operate it on a day-to-day basis Affordable - we want it to be easy and cost effective to maintain in the long run Maintainable - we want to have a development or staging environment that is identical to production, so we can efficiently work on new versions of the site without it affecting production Secure - we don't want to get hacked
  52. 52. Technically, we need this… “Everything” Horizontal Auto Scaling, Auto Healing, Auto DNS, Auto SSL Automated deployments and rollbacks, Versioned History Service Discovery & Load Balancing Batch Job, Scheduled Job Execution Storage/Volume Orchestration ...out of the box
  53. 53. Best Practice: Use Kubernetes (sometimes) Ideally suited for microservices architectures, larger engineering teams “Infrastructure as Code” - write documents that describe your microservices (Pods ~ VMs, ReplicaSets ~ clusters, Services ~ Load Balancers) Comes with Everything out-of-the-box Cons: more complex to get started, difficult to triage issues, requires SME Pro tip: Use kops to spin up clusters automatically in AWS and GCE
  54. 54. Kubernetes Dashboard
  55. 55. Best Practice: Use Elastic Beanstalk Ideally suited for monolithic architectures Comes with almost Everything out-of-the-box Supports instances inside private VPC with root SSH access Formal process for promoting code to production / automatic rollbacks Pro tip: Use terraform to spin up beanstalk clusters automatically in AWS
  56. 56. Elastic Beanstalk
  57. 57. Configuration Management Immutable vs Mutable Declarative vs Imperative “WYSIWYG”
  58. 58. Best Practice: Immutable Containers/AMIs Like “Burning” a copy of your code in an image Easy to know exactly what is running Fast to deploy and rollback Use Docker containers for applications Use something like CoreOS for underlying host (~dom0)
  59. 59. Best Practice: Imperative Infrastructure “Give me a load balancer, 2 filesystems, 2 GB ram, 4 CPUs, 4 instances” There’s no guess work about what is output Compatible with legacy architectures There’s less magic
  60. 60. Monitoring Application - Synthetic Testing Infrastructure Real-User Monitoring (RUM) SLI Systems don't have feelings. They only have SLAs.
  61. 61. Best Practice: Team Dashboards Display Service Level Indicators (~ KPIs) relevant for specific teams Create dashboards for specific services like Kafka and Zookeeper First place to look when triaging issues Pro tip: Use Datadog dashboards with namespace filtering on clusters
  62. 62. Sample Dashboard Overview
  63. 63. Alerting Alert Fatigue == Human Fatigue Dashboards > Alerts > Email Human health impacts business health. Budgets Metrics driven; not log events Alerts need to be actionable - with links to documentation
  65. 65. Best Practice: Actionable Alerts
  66. 66. Escalation & Remediation Automate as much as possible, escalate to a human as a last resort. KPI~SLI / SLO / SLA On-call Engineers PagerDuty - Manage Calendars and Phone/SMS Escalations
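
As a worked example of the SLI / SLO relationship, the arithmetic below turns an availability SLO into an allowance of downtime, which is one common way to decide when an alert deserves a human. The function names and the 30-day window are assumptions; the talk does not prescribe a formula.

```python
def error_budget_minutes(slo, window_days=30):
    """Minutes of allowed downtime for an availability SLO over the window."""
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo, downtime_minutes, window_days=30):
    """Minutes of budget left after the downtime observed so far."""
    return error_budget_minutes(slo, window_days) - downtime_minutes


def should_escalate(slo, downtime_minutes, window_days=30):
    """Escalate to the on-call engineer once the budget is exhausted."""
    return budget_remaining(slo, downtime_minutes, window_days) <= 0
```

For instance, a 99.9% SLO over 30 days allows roughly 43.2 minutes of downtime (0.001 × 43,200 minutes), so a 10-minute outage still leaves about 33 minutes of budget.
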
  67. 67. Best Practice: #OCE Slack Channel One channel to reach engineers Searchable history of events and conversations Use topic to announce who is on-call Linked Google Calendar with Relevant Events (E.g. Customer Demo Calendar)
  68. 68. Best Practice: Post-Mortems Kill the shame game. Human issues are system issues. 5 Whys - Root Cause Analysis (“RCA”) Use Consistent Template (KISS) Weekly Retrospectives with past OCEs and Stakeholders Documented in Quip → Instantly Searchable Pro Tip: Check out how Google does it:
  69. 69. Security 100% Security Cannot Be Achieved Assume systems are insecure Devalue credentials with MFA
  70. 70. What not to do... 1. Store secrets in git repository 2. Hardcode secrets in configurations 3. Write them in plain-text 4. Manually distribute them 5. Reuse/share keys across users and apps 6. Build homegrown systems to protect secrets (* unless you’re Netflix, Hashicorp or Google) ...but you already knew that!
  71. 71. Best Practice: Beyond Corp Model Enterprise zero-trust security model used by Google Shift access controls from the network perimeter to individual devices/users Allow employees to work more securely from any location Do not rely on traditional VPNs
  72. 72. Best Practice: Identity-Aware Proxy (IAP). Protect internal services using an IAP Integrates cleanly with your SSO provider for MFA Pro tip: Use the Bitly OAuth2 Proxy to add auth layer to any service
  73. 73. Best Practice: Bastion Host Centralized point for accessing systems Session logs, Slack Login Notifications Require MFA to authenticate Disable proxy mode and TCP socket forwarding Use bastion only for triage, not administration (because that’s scripted!) Pro Tip: Use Duo Push Notifications + Geofencing
  74. 74. Best Practice: Login Justifications
  75. 75. Best Practice: SSH Key Management 2 options - Github Public Key API or Signed Certificates ● You can’t protect the private key ● You can add multiple factors (a.k.a. MFA) ● Our Solution ○ Use Github Public Key API to distribute public keys ○ Use Duo for MFA Push Notifications + Geofencing Pro tip: Check out Bless by Netflix
  76. 76. Duo Slack Integration and Dashboard
  77. 77. Best Practice: SSM Scripted Remediations Use SSM to execute commands in parallel across machines (don’t use parallel ssh since that is harder to audit) Full audit logs of command and output Use IAM roles to restrict execution Pro tip: use the aws cli to trigger remediations on the command line
  78. 78. Best Practice: Federated Accounts Reduce the blast radius when things explode Use one account per environment: dev, staging, production Use one account for billing aggregation, IAM federation Assumed Roles (e.g. read-only, admin, dba) MFA required to assume roles - to devalue credentials Pro Tip: Use STS API with MFA to generate short lived AWS credentials Example: AWS
  79. 79. Best Practice: AWS Secrets (Client-side) Client Side (e.g. Terraform, AWS Cli) ● IAM User Account Access Keys (never shared!) ● Access Keys only permit Assume Role+MFA ● Assumed Roles (limit scope) ● Temporary Sessions Tokens with STS (expire after 1 hour) ● MFA (devalue credentials) Solution:
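
The assume-role-with-MFA flow described above might look roughly like this in Python. The sketch takes an STS client as a parameter (for example `boto3.client("sts")`, assuming boto3 is installed) and calls its `assume_role` operation; the helper name, the default session name, and the returned environment-variable mapping are choices made for this example, not the tooling the talk refers to.

```python
def assume_role_with_mfa(sts_client, role_arn, mfa_serial, mfa_code,
                         session_name="cli", duration_seconds=3600):
    """Exchange long-lived access keys plus an MFA code for temporary credentials.

    `sts_client` is expected to behave like boto3.client("sts").
    """
    response = sts_client.assume_role(
        RoleArn=role_arn,
        RoleSessionName=session_name,
        SerialNumber=mfa_serial,            # ARN of the user's MFA device
        TokenCode=mfa_code,                 # current 6-digit code
        DurationSeconds=duration_seconds,   # 3600 = expire after 1 hour
    )
    creds = response["Credentials"]
    # Environment variables understood by the AWS CLI and SDKs.
    return {
        "AWS_ACCESS_KEY_ID": creds["AccessKeyId"],
        "AWS_SECRET_ACCESS_KEY": creds["SecretAccessKey"],
        "AWS_SESSION_TOKEN": creds["SessionToken"],
    }
```

Exporting the returned variables into a subshell gives tools like Terraform a session that expires on its own, so a leaked long-lived key is of little use without the MFA device.
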
  80. 80. Best Practice: AWS Secrets (Server-side) Dynamic, Auto Rotating Credentials for Server Applications Never ever hardcode AWS credentials on EC2 instances Server Side (e.g. EC2 Instance, Docker Container) ● IAM Instance Profiles with Assumed Roles ● Use Kube2IAM with Kubernetes (kops) ○ Temporary AWS credentials ○ Drop-in Compatibility with all official AWS client libraries
  81. 81. Best Practice: Bootstrap Secrets Secrets you need to provision new clusters on AWS... ● Run terraform inside of Container ● Private S3 Configuration Bucket ● Encrypted Bucket Objects ● Mount S3 Bucket inside container (S3FS) ● Use /dev/shm for caching Geodesic:
  82. 82. Best Practice: Password Managers Store Organizational Secrets in Password Manager (webhook urls, master account credentials, shared MFA) Use Vaults specific to some shared objective (e.g. team) Require MFA for decryption Avoid Shared Credentials as much as possible (this is a last resort) SSO > Shared Passwords Pro tip: Use 1Password for Teams. Abandon all other password managers.
  83. 83. Best Practice: Avoid Password Rules They don't work They frustrate average users Penalize people that use real random password generators They are often computationally weaker → vulnerable to brute force attacks
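
One way to see why composition rules can make passwords computationally weaker is to count how many candidates an attacker can rule out. The Python sketch below assumes a typical rule ("at least one lowercase letter, one uppercase letter, one digit, and one symbol") over the 94 printable non-space ASCII characters, and counts compliant passwords by inclusion-exclusion. The rule and the function names are illustrative assumptions.

```python
from itertools import combinations

# Character classes among the 94 printable, non-space ASCII characters.
CLASS_SIZES = {"lower": 26, "upper": 26, "digit": 10, "symbol": 32}


def total_passwords(length):
    """Number of passwords of this length when there are no rules."""
    return sum(CLASS_SIZES.values()) ** length


def rule_compliant_passwords(length):
    """Passwords that contain at least one character from every class."""
    alphabet = sum(CLASS_SIZES.values())
    sizes = list(CLASS_SIZES.values())
    count = 0
    # Inclusion-exclusion over the sets of classes that are entirely missing.
    for k in range(len(sizes) + 1):
        for excluded in combinations(sizes, k):
            count += (-1) ** k * (alphabet - sum(excluded)) ** length
    return count


if __name__ == "__main__":
    n = 8
    share = rule_compliant_passwords(n) / total_passwords(n)
    print(f"{share:.1%} of {n}-character passwords satisfy the rule")
```

Every password outside the compliant set is one an attacker no longer has to try, so the rule shrinks the brute-force search space and can reject passwords that a real random generator produces.
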
  84. 84. Best Practice: Avoid Password Rules
  85. 85. SaaS Cocktails What We Use
  86. 86. The Bible
  87. 87. __EOF__ Erik Osterman, Founder Cloud Posse, LLC