How do you eat a whale? cloud expo 2017

  1. How do you eat a whale? One byte at a time! Cloud Computing Expo 2017 June 8th, New York, NY Kelly Looney, Director of DevOps Consulting Skytap Inc * No whales were harmed in the making of this presentation. Skytap does not promote the eating or harming of whales.
  2. Topics o Where we started o Where we are now o How we got here o Organization o Education o Technology o Parting wisdom
  3. Skytap: Key Stats o Regions o 7 Multi-tenant (3 US, TOR, EMEA, AUS, APAC) o 3 Single-tenant (US) o 18,057,400 VMs deployed o Up to 44,500 per day o 10,356,700 virtual L2 networks deployed o Up to 19,600 per day o 604 petabytes of allocated virtual storage
  4. Starting Situation (circa 2014) o Complex distributed system deployed across several regions o The service was (mostly) reliable and scalable o Deployments once a month, patched as needed - but scary o Heavy involvement from operations o Difficult for devs to develop, test, and deploy
  5. Starting Point
  6. Current Situation o All new services since 1/2016 run in K8S o All proprietary high churn services run in K8S o Integrated CI/CD pipeline o Ops focused on high value projects o Release as needed – with confidence!
  7. Current Situation
  8. [K8s cluster diagram] K8s clusters in Skytap o Production o 11 clusters o 70 nodes o 185 namespaces o ~1K pods at any given time o Staging & Preprod o 9 clusters o 34 nodes o 400 pods at any given time
  9. What We Were Aiming For o Reduce the unit of deployment o Micro-services o Complexity will only increase o Comprehensive monitoring, service discovery, and orchestration o Easy stuff first o Stateless and immutable services
  10. First Steps… Guiding Principles o Change as little as possible o New tools harmonize with existing tools o New stuff in the new framework Actions o Get key players on board o Inventory and categorize services o Determine how to concurrently run old/new
  11. Organization o Recruit a dedicated tools team o Not a part-time job o Ideally members have o Deep technical ability o Architectural knowledge of o Major system components o CI/CD Tools o RM & Deployment Practices o Ability to teach
  12. The SRE Role • An alternative top of the tech ladder • Start reactive with goal of being mostly proactive • Fire Chief to SRE story • Let this be your primary means of improvement • Make the system easy to change first • Goal to be unafraid to replace or re-implement when needed • Be an educator and mentor
  13. Education o Buddy system o Documentation o Support channels o Reusing existing tools
  14. Our World o Devs own image generation & deployment config o Prebaked templates and custom builds o Educational Areas o Dependency management o Dockerfile authoring and image caching o Implementing K8S health checks o Estimating resource usage
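The dependency-management and image-caching points above can be sketched as a minimal Dockerfile. Everything here is hypothetical (base image, app layout, commands) and only illustrates the two practices the slide names: pinning versions instead of using `:latest`, and ordering layers so the dependency layer stays cached across source-only changes.

```dockerfile
# Pin the base image to an exact tag, never :latest.
FROM ruby:2.4.1-slim

WORKDIR /app

# Copy only the dependency manifests first, so this expensive layer
# is cached until the dependencies themselves change.
COPY Gemfile Gemfile.lock ./
RUN bundle install --deployment

# Source changes invalidate only the layers from here down.
COPY . .

EXPOSE 8080
CMD ["bundle", "exec", "rackup", "-p", "8080"]
```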
  15. Technology o Developers are human o Release management process o Kube-native vs. traditional CI/CD o Which services to move to Kubernetes first?
  16. Service Categorization o Application Tier • Services (Web) • Platform • Infra o Communication model • Socket based • Message passing (MQ) o Application type: • Stateful • Stateless
  17. Our CI/CD & Kube o Deploy ~150 3rd party and proprietary services to ~1,000 machines in 10+ regions o Custom CI/CD tooling built on Capistrano and Jenkins o Kube integrated with the existing CI/CD framework
  18. Testability is a P-Zero o Deployment tools are hard to test o Failed deployments == Dirty test environments o Automated multi-fidelity environment builds
  19. Fatal Mistakes to Avoid o Underestimating what you have o Not considering code, state, & data o Transient technology choices o Trying to deliver too much at once
  20. Parting Wisdom o A customer first attitude will drive adoption o Start with compute o Pick up networking & storage later o Consider your existing toolchain o Ability to reset environments will keep you moving fast o Much easier for container services to talk to legacy than vice versa
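The "container services can talk to legacy" direction can be wired up in Kubernetes with a selector-less Service plus a manually managed Endpoints object, so pods resolve the legacy VM by name through normal service discovery. The service name, port, and IP below are placeholders, not Skytap's actual configuration:

```yaml
# Selector-less Service: kube-dns resolves "legacy-mq" for pods...
apiVersion: v1
kind: Service
metadata:
  name: legacy-mq        # hypothetical name for a legacy RabbitMQ VM
spec:
  ports:
  - port: 5672
---
# ...while this hand-maintained Endpoints object (name must match
# the Service) points the traffic at the VM outside the cluster.
apiVersion: v1
kind: Endpoints
metadata:
  name: legacy-mq
subsets:
- addresses:
  - ip: 10.0.12.34       # placeholder legacy VM address
  ports:
  - port: 5672
```

The reverse direction (legacy VMs reaching pods) needs an ingress point or node-port exposure, which is part of why it is the harder one.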
  21. Thank You!
  22. Contain Yourself: Incremental Adoption for Modernization CoreOS Fest, 2017 San Francisco, CA Petr Novodvorskiy, Development Lead Dan Jones, Director of Product Management

Editor's Notes

  1. Speaker: Dan
  2. Speaker: Dan
  3. Speaker: Dan
  4. Speaker: Petr Skytap, like any big public cloud, is a big distributed system. We run on vSphere clusters, not GCE/AWS/Azure. We have around 150 microservices. Because of the ties between two management systems (source code in Mercurial, binaries managed by Puppet), we were unable to roll back. Deployments required a lot of orchestration between different teams in the company and happened in big chunks. Developers only partially owned the systems they were working on, making it harder to develop and to use newer tools to test.
  5. Speaker: Petr This is an extremely simplified diagram of our system circa 2014. Everything is running in VMs. All connections come in through an F5 load balancer and go to the web nodes. All other services communicate over RabbitMQ.
  6. Speaker: Petr It was a long way, with tons of mistakes and different organizations pushing back on our agenda, but we pulled through. We worked with dev teams to understand which services had the highest release churn and prioritized them first. Developers on those teams usually had the highest level of frustration with the current deployment tools, so convincing them to move to Kubernetes wasn't a big problem. Ops are not maintaining anything inside developers' VMs anymore, which lets them focus on high-value projects. QA doesn't get obsessed with discrepancies between the state provisioned with Puppet and the deployed source code. QA is confident they can roll back a broken build on the staging environment without involving developers.
  7. Speaker: Petr Highly simplified version of our current state. High release churn services have moved to Kubernetes. Some proprietary services are still running in VMs, and we have no short-term plan to move them. We are considering moving MQ and MySQL Galera to Kubernetes next.
  8. Speaker: Petr
  9. Speaker: Dan
  10. Speaker: Dan
  11. Speaker: Dan
  12. Speaker: Dan
  13. Speaker: Petr In the new world developers have far more power. However, with power comes responsibility. As with any transfer of responsibility, we needed to educate developers and explain the advantages of the approach. We needed to explain what immutable builds are, and why it is important to track and pin the versions of the packages they are installing. Image caching and faster builds. Explaining why health checks and readiness checks are important and useful. Working with developers to teach them how to profile their systems and estimate resource usage.
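The health-check and resource-estimation topics developers were taught can be sketched as a Deployment fragment. The service name, image, ports, probe paths, and resource numbers below are all hypothetical, chosen only to show the shape of the configuration:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: workflow-service            # hypothetical service name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: workflow-service
  template:
    metadata:
      labels:
        app: workflow-service
    spec:
      containers:
      - name: workflow-service
        # Pinned image tag, never :latest.
        image: registry.example.com/workflow-service:1.4.2
        ports:
        - containerPort: 8080
        livenessProbe:              # restart the container if it hangs
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 15
        readinessProbe:             # keep it out of rotation until ready
          httpGet:
            path: /ready
            port: 8080
          periodSeconds: 5
        resources:                  # estimates from profiling under load
          requests:
            cpu: 250m
            memory: 256Mi
          limits:
            cpu: "1"
            memory: 512Mi
```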
  14. Speaker: Petr Another problem we experienced introducing Kubernetes to the company is fear of change. Kubernetes is a very opinionated system, for good reasons; however, people usually have their own opinions too. While people are happy to take advantage of building their own images, they already have release-management and deployment tools that they know and use. Instead of adopting Kubernetes-native CI/CD tools, we decided to take our existing tools and adapt them to Kubernetes. We also tried to port first the services that require the least amount of change to start running in Kubernetes and whose developers are most interested in the Kubernetes feature set (churn).
  15. Speaker: Petr Split services into several categories: high release churn, communication model, stateful/stateless. I don't want to spread the myth that it's impossible to run stateful services, and network policies and ingress rules are really helping us now. However, it's harder to run stateful services than stateless ones, and setting up efficient direct network communication between non-Kube services and Kube services in the absence of cloud-provided load balancers is hard. So: MQ-based, high-churn, stateless services first. First candidate: the workflow service, then web workers.
  16. Speaker: Petr Instead of adopting Kubernetes-native CI/CD tools, we decided to take our existing tools and adapt them to Kubernetes. Our deployment tool is based on a heavily modified Capistrano; it is fairly archaic, and we considered throwing it away and replacing it with something better. However, we realized that it would be too much change alongside the introduction of Kubernetes, and that there is knowledge built into these tools that is not explicit but was accumulated over years of use and fixing. We integrated Kubernetes with these custom tools in a manner that lets us later transition parts of the product to Helm/Tiller.
  17. Speaker: Petr Deployment tools and deployment processes are hard to test. Transitioning to new deployment processes is even harder to test. Without testing you'll have more problems in the transition and can lose the confidence of developers, which can compromise the whole project. While inside Kubernetes there are nice deployment objects that you can roll back, there was nothing like that for maintaining a Kubernetes cluster as a whole (until Tectonic came out, and in the case of Tectonic that covers only the part of the system that has already migrated to CoreOS/Kube). We ended up creating a tool that allows us to build fully functional copies of production environments on demand, from high fidelity to low fidelity. Without it, a confident transition to Kubernetes wouldn't have been possible. Each developer gets a Kubernetes environment as it is deployed in production.
  18. Speaker: Dan Tech Choices: F5 Mesos decision – just went to a conference.
  19. Speaker: Dan
  20. Speaker: Dan