How do you eat a whale?
One byte at a time!
Cloud Computing Expo 2017
June 8th, New York, NY
Kelly Looney,
Director of DevOps Consulting
Skytap Inc
* No whales were harmed in the making of this presentation. Skytap does not promote the eating or harming of whales.
Topics
o Where we started
o Where we are now
o How we got here
o Organization
o Education
o Technology
o Parting wisdom
Skytap: Key Stats
o Regions
o 7 Multi-tenant (3 US, TOR, EMEA,
AUS, APAC)
o 3 Single-tenant (US)
o 18,057,400 VMs deployed
o Up to 44,500 / day
o 10,356,700 virtual L2 networks deployed
o Up to 19,600 per day
o 604 petabytes of allocated virtual storage
Starting Situation (circa 2014)
o Complex distributed system deployed across several regions
o The service was (mostly) reliable and scalable
o Deployments once a month, patched as needed – but scary
o Heavy involvement from operations
o Difficult for devs to develop, test, and deploy
Current Situation
o All new services since 1/2016 run in K8S
o All proprietary high churn
services run in K8S
o Integrated CI/CD pipeline
o Ops focused on high value projects
o Release as needed – with confidence!
[Kubernetes architecture diagram]
K8s clusters in Skytap
o Production
o 11 clusters
o 70 nodes
o 185 namespaces
o ~1K pods at any given time
o Staging & Preprod
o 9 clusters
o 34 nodes
o 400 pods at any given time
What We Were Aiming For
o Reduce the unit of deployment
o Micro-services
o Complexity will only increase
o Comprehensive monitoring,
service discovery, and orchestration
o Easy stuff first
o Stateless and immutable services
First Steps…
Guiding Principles
o Change as little as possible
o New tools harmonize with
existing tools
o New stuff in the new framework
Actions
o Get key players on board
o Inventory and categorize services
o Determine how to concurrently
run old/new
Organization
o Recruit a dedicated tools team
o Not a part-time job
o Ideally members have
o Deep technical ability
o Architectural knowledge of
o Major system components
o CI/CD Tools
o RM & Deployment Practices
o Ability to teach
The SRE Role
• An alternative top of the tech ladder
• Start reactive with goal of being mostly
proactive
• Fire Chief to SRE story
• Let this be your primary means of
improvement
• Make the system easy to change first
• Goal to be unafraid to replace or re-
implement when needed
• Be an educator and mentor
Our World
o Devs own image generation &
deployment config
o Prebaked templates and custom
builds
o Educational Areas
o Dependency management
o Dockerfile authoring and image
caching
o Implementing K8S health checks
o Estimating resource usage
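The last two educational areas can be sketched as a Deployment fragment (a hedged example: the service name, paths, ports, and numbers are hypothetical, not Skytap's actual configuration):

```yaml
# Hypothetical Deployment fragment illustrating health checks and
# resource estimates; all names and values are examples.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: workflow-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: workflow-service
  template:
    metadata:
      labels:
        app: workflow-service
    spec:
      containers:
      - name: workflow-service
        image: registry.example.com/workflow-service:1.4.2
        # Liveness probe: restart the container if the process wedges.
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 15
          periodSeconds: 10
        # Readiness probe: keep the pod out of the Service endpoints
        # until it can actually serve traffic.
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          periodSeconds: 5
        # Resource estimates drive scheduling; start from profiling data.
        resources:
          requests:
            cpu: 250m
            memory: 256Mi
          limits:
            cpu: "1"
            memory: 512Mi
```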
Technology
o Developers are human
o Release management process
o Kube-native vs. traditional CI/CD
o Which services to move to
Kubernetes first?
Service Categorization
o Application Tier
• Services (Web)
• Platform
• Infra
o Communication model
• Socket based
• Message passing (MQ)
o Application type:
• Stateful
• Stateless
Our CI/CD & Kube
o Deploy ~150 3rd party and
proprietary services to ~1,000
machines in 10+ regions
o Custom CI/CD tools on
Capistrano and Jenkins
o Kube integrated with existing
CI/CD framework
Testability is a P-Zero
o Deployment tools are hard to test
o Failed deployments == Dirty test
environments
o Automated multi-fidelity
environment builds
Fatal Mistakes to Avoid
o Underestimating what you have
o Not considering code, state, & data
o Transient technology choices
o Trying to deliver too much at once
Parting Wisdom
o A customer first attitude will drive
adoption
o Start with compute
o Pick up networking & storage later
o Consider your existing toolchain
o Ability to reset environments will keep
you moving fast
o Much easier for container services to talk
to legacy than vice versa
Speaker: Petr
Skytap, like any big public cloud, is a big distributed system
We run on vSphere clusters, not GCE/AWS/Azure
We have around 150 microservices
Because of ties between two management systems –
source code in Mercurial and binaries managed by Puppet –
we were unable to roll back
Deployments require a lot of orchestration between different teams in the company and happen in big chunks
Developers only partially own the system they are working on, making it harder to develop and to use newer tools for testing
Speaker: Petr
This is an extremely simplified diagram of our system circa 2014
Everything is running in VMs
All connections come in through an F5 load balancer and go to the web nodes
All other services communicate over RabbitMQ
Speaker: Petr
It was a long road, with tons of mistakes and different organizations pushing back on our agenda, but we pulled through
We worked with dev teams to understand which services have highest release churn and prioritized them first
Developers of those teams usually had highest level of frustration with current deployment tools, so convincing them to move to kubernetes wasn’t a big problem
Ops are no longer maintaining anything inside developers' VMs, which lets them focus on high-value projects
QA no longer gets obsessed with discrepancies between the state provisioned by Puppet and the deployed source code
QA is confident they can roll back a broken build on the staging environment without involving developers
Speaker: Petr
Highly simplified version of our current state
High release churn services have moved to Kubernetes
Some proprietary services are still running in VMs, and we don't have a short-term plan to move them
We are considering moving MQ and MySQL Galera to Kubernetes next
Speaker: Petr
Speaker: Dan
Speaker: Petr
In the new world, developers have far more power
However, with power comes responsibility
As with any transfer of responsibility, we needed to educate developers and explain the advantages of the approach
We needed to explain what immutable builds are, and why it is important to track and pin the versions of the packages they install
Image caching and faster builds
Explaining why health checks and readiness checks are important and useful
Working with developers to teach them how to profile their systems and estimate resource usage
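The pinning and caching points above can be sketched as a Dockerfile (a hedged example: the base image, file names, and commands are illustrative, not Skytap's actual build):

```dockerfile
# Pin the base image to an exact tag so builds are reproducible.
FROM python:3.6.1-slim

# Install dependencies first: this layer is cached and only rebuilt
# when requirements.txt changes (versions pinned inside it), so
# day-to-day image builds stay fast.
COPY requirements.txt /app/requirements.txt
RUN pip install --no-cache-dir -r /app/requirements.txt

# Application code changes often; keep it in a later layer so code
# edits don't invalidate the dependency cache.
COPY . /app
WORKDIR /app

CMD ["python", "service.py"]
```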
Speaker: Petr
Another problem we experienced when introducing Kubernetes to the company is fear of change
Kubernetes is a very opinionated system, for good reasons; however, people usually have their own opinions too
While people are happy to take advantage of building their own images, they already have release management and deployment tools that they know and use
Instead of adopting Kubernetes-native CI/CD tools, we decided to take our existing tools and adapt them to Kubernetes
We tried to choose to port services that
require the least amount of change to start running in Kubernetes
have developers who are most interested in the Kubernetes feature set (churn)
Speaker: Petr
We split services into several categories
high release churn
communication model
stateful/stateless
I don't want to spread the myth that it's impossible to run stateful services, and network policies and ingress rules are really helping us now. However,
it's harder to run stateful services than stateless ones
setting up efficient direct network communication between non-Kube services and Kube services, in the absence of cloud-provided load balancers, is hard
So:
MQ-based, high-churn, stateless services
First candidates: the workflow service, then the web workers
Speaker: Petr
Instead of adopting Kubernetes-native CI/CD tools, we decided to take our existing tools and adapt them to Kubernetes
Our deployment tool is based on a heavily modified Capistrano; it is fairly archaic, and we considered throwing it away and replacing it with something better
However, we realized that:
it would be too much change alongside the introduction of Kubernetes
there is knowledge built into these tools that is not explicit, but was accumulated over years of usage and fixing
We integrated Kubernetes with these custom tools
in a manner that lets us later transition parts of the product to Helm/Tiller
Speaker: Petr
Deployment tools and deployment processes are hard to test
Transitioning to new deployment processes is even harder to test
Without testing you'll have more problems during the transition and can lose the confidence of developers, which can compromise the whole project
While Kubernetes has nice Deployment objects that you can roll back, there was nothing like that for maintaining a Kubernetes cluster as a whole (until Tectonic came out, and in Tectonic's case that covers only the part of the system already migrated to CoreOS/Kube)
We ended up creating a tool that allows us to build fully functional copies of production environments on demand, from high fidelity to low fidelity
Without it, a confident transition to Kubernetes wouldn't have been possible
Each developer gets a Kubernetes environment as it is deployed in production
Speaker: Dan
Tech choices: the F5 / Mesos decision – just went to a conference.