MoneySuperKubernetes
Navigating K8s at Moneysupermarket
Kubernetes Manchester, December 2018
David Stockton, DevOps Tech Lead for Core Platform
Jim Davies, Head of DevOps
Why do we need Kubernetes?
< 2016 2017 2018 2019 >
AWS
• Rightscale
• Masterless Puppet
• Jenkins on Mesos
• Single-container hosts
• Docker Swarm spikes
• Nomad spikes
• Kubernetes full trial
• AWS and industry say “Go K8s”
• EKS preview (not impressed)
• Kubernetes full development
• Continued migration
• Test and learn
• DEPLOY ON DAY ONE!
Road to Kubernetes
• AWS EKS
• First evaluated Jan 2018 (beta) – only available in US regions (we want customer data in the EU)
• At the time, required custom kubectl binaries & was well behind in version support (much better now) – we plan to re-evaluate EKS next year.
• No automation capability – Fargate for EKS wasn’t ready, Terraform not ready.
• Would need better interoperability with our existing EC2 VPC estate.
• GKE
• Have data analytics platform in GCP; but all SoA/front-end estate is in AWS + direct connect
Technology Choices
Platform Deployment
• Kops
• K8s deployment automation (AWS first-class citizen)
• Can output Terraform files for IaC management - we keep Terraform state in an S3 bucket BUT we throw away the Terraform files and re-generate them on each run.
• + Not tempted to edit Terraform files (Don’t do this! We did and it’s a nightmare to use any kops function again afterwards)
• + Can turn off a cluster (we shut down environments at night - £££) without ’destroying’ it.
• - Hasn’t happened yet, but a theoretical worry if kops re-architects the Terraform files (in theory we would just rebuild the whole cluster anyway)
• - Launch configurations only (no launch template multi instance-type support yet)
• Puppet – Invoked via kops hook
• Small (master-less) puppet run to add some instance level goodness (e.g. security tooling!)
• Route 53
• Delegated sub-domain to Route53 management (controlled by ‘external-dns’)
• ELB
• Route53 entries point to ELBs
• More on this when I talk about Istio…
Traffic Ingress
Application Deployment
• Jenkins
• Orchestrates & provides feedback
• e.g. PR merged → Webhook → Helm-Deploy (Jenkinsfile)
• Helm
• THE Kubernetes package manager
• YAML + golang template = k8s YAML
• ProTip: ’Sprig’ for free - http://masterminds.github.io/sprig/
• ‘generic-service’ helm chart
• helm-deploy.rb
• Proprietary Ruby script
• Defines list of helm releases to run per environment
• helm chart + chart version + values.yaml path
• Remember, environments restarted each day from code (disaster recovery = normal day!)
• Artifactory
• Docker repository (plus virtual aggregation)
• Helm repository (plus virtual aggregation)
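The per-environment release list described for helm-deploy.rb can be sketched roughly as follows. This is an illustration in Python, not the proprietary Ruby script: the release names, chart reference, versions and file paths are all hypothetical. It only shows how a (chart, chart version, values.yaml path) triple might turn into a `helm upgrade --install` command line.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Release:
    name: str         # helm release name (hypothetical)
    chart: str        # chart reference, e.g. "<repo>/generic-service"
    version: str      # pinned chart version
    values_path: str  # path to the values.yaml for this environment


# Hypothetical release lists, keyed by environment name.
RELEASES: Dict[str, List[Release]] = {
    "dev": [
        Release("search-api", "example-repo/generic-service", "1.4.0",
                "values/dev/search-api.yaml"),
        Release("quotes", "example-repo/generic-service", "1.4.0",
                "values/dev/quotes.yaml"),
    ],
}


def helm_commands(environment: str) -> List[List[str]]:
    """Return one `helm upgrade --install` argument list per release."""
    return [
        ["helm", "upgrade", "--install", r.name, r.chart,
         "--version", r.version, "-f", r.values_path]
        for r in RELEASES[environment]
    ]


if __name__ == "__main__":
    # Print the commands rather than running them.
    for cmd in helm_commands("dev"):
        print(" ".join(cmd))
```

Because `helm upgrade --install` is idempotent for an unchanged chart and values file, re-running the whole list each morning is one plausible way to rebuild an environment from code.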
• Service Mesh
• Connect – control flow of traffic (e.g. blue/green deployments, testing)
• Secure – automatically secure services (authentication and authorisation + encryption) – solves our internal network requirements; not cloud-vendor specific (i.e. could span multiple clouds/datacenters)
• Control – enforce policies (e.g. circuit breaker)
• Observe – automatic tracing, monitoring & logging
• Traffic Ingress
• k8s service → Istio Ingress Gateway → Gateway → VirtualService → <upstream>
• Egress
• ServiceEntries – can use to white-list outbound traffic or create rules (e.g. no more than X connections to external service Y)
• Squads in Control
• Central point of control (e.g. /endpoint was ‘serviceA’, now it’s ‘serviceB’ – only 1 piece of configuration)
• Squads can ’wire-up’ their service – in their control
• Limits / Service Protection - no more than X concurrent connections
• Fault Testing - delay injection / fail X% of requests
• Canary Deployments – maybe next talk!?
Istio
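The fault-testing bullet (delay injection / fail X% of requests) could look roughly like the sketch below, which builds a VirtualService manifest as a Python dictionary and prints it as JSON. The service name and percentages are hypothetical, and the field names follow the `networking.istio.io/v1alpha3` schema as best understood here; they should be checked against the Istio release in use, since older releases used a different percentage field.

```python
import json
from typing import Any, Dict


def fault_injection_virtual_service(
    host: str,
    delay_percent: float,
    fixed_delay: str,
    abort_percent: float,
    abort_status: int,
) -> Dict[str, Any]:
    """Build a VirtualService that delays and aborts a share of requests."""
    return {
        "apiVersion": "networking.istio.io/v1alpha3",
        "kind": "VirtualService",
        "metadata": {"name": f"{host}-fault-test"},
        "spec": {
            "hosts": [host],
            "http": [
                {
                    "fault": {
                        # Delay a percentage of requests by a fixed duration.
                        "delay": {
                            "percentage": {"value": delay_percent},
                            "fixedDelay": fixed_delay,
                        },
                        # Fail a percentage of requests with an HTTP status.
                        "abort": {
                            "percentage": {"value": abort_percent},
                            "httpStatus": abort_status,
                        },
                    },
                    "route": [{"destination": {"host": host}}],
                }
            ],
        },
    }


if __name__ == "__main__":
    # "service-a" and the numbers below are illustrative only.
    manifest = fault_injection_virtual_service("service-a", 10.0, "5s", 5.0, 503)
    print(json.dumps(manifest, indent=2))
```

Keeping the manifest as data makes it easy for a squad to generate one per service and to remove it again once the test run is finished.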
• Docker Images
• Linted (docker lint)
• Scanned (klar → clair → clair-db)
• On build (existing vuln. at build time)
• On schedule (new vulns)
• Corporate white-list YAML from git repo
• IGNORE_UNFIXED is a better strategy – but you still want to know!
• Lots of vendor solutions available too
• Hosts
• Standard Linux tooling / SaaS vendors available
• Many have profiling capabilities
• Many offer container (or even helm chart); beware they typically need to be a privileged container – only appropriate if you manage your own hosts
• k8s
• CVE-2018-1002105 (9.8!)
• https://access.redhat.com/security/cve/cve-2018-1002105
• https://github.com/kubernetes/kubernetes/issues/71411
• Affected versions:
• Kubernetes v1.0.x-1.9.x
• Kubernetes v1.10.0-1.10.10 (fixed in v1.10.11)
• Kubernetes v1.11.0-1.11.4 (fixed in v1.11.5)
• Kubernetes v1.12.0-1.12.2 (fixed in v1.12.3)
• https://kubernetes.io/docs/tasks/administer-cluster/securing-a-cluster/
• https://github.com/neuvector/kubernetes-cis-benchmark
Security
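The affected-version list for CVE-2018-1002105 above can be turned into a small check. This is a minimal sketch that encodes only the ranges quoted on the slide and assumes plain `MAJOR.MINOR.PATCH` version strings (an optional leading `v` is accepted; vendor suffixes are not handled).

```python
from typing import Tuple

# First fixed patch release per affected minor line, from the list above.
FIXED_IN = {(1, 10): 11, (1, 11): 5, (1, 12): 3}


def parse_version(version: str) -> Tuple[int, int, int]:
    """Parse 'v1.11.4' or '1.11.4' into (1, 11, 4)."""
    major, minor, patch = version.lstrip("v").split(".")[:3]
    return int(major), int(minor), int(patch)


def is_affected(version: str) -> bool:
    """Return True if the version falls in one of the listed affected ranges."""
    major, minor, patch = parse_version(version)
    if major == 1 and minor <= 9:
        return True  # v1.0.x-1.9.x are listed as affected
    fixed_patch = FIXED_IN.get((major, minor))
    if fixed_patch is None:
        return False  # not in the list above
    return patch < fixed_patch


if __name__ == "__main__":
    for v in ["v1.9.7", "v1.10.10", "v1.10.11", "v1.11.4", "v1.12.3"]:
        print(v, "affected" if is_affected(v) else "not in affected list")
```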
• Integrate with SSO
• kuberos → dex → Okta (LDAPS) → ldaps[ ]
• kuberos = UI to OIDC provider
• https://github.com/helm/charts/tree/master/stable/kuberos
• dex = OIDC provider which aggregates back-end auth providers
• https://github.com/helm/charts/tree/master/stable/dex
• Okta = SSO solution; provides SAML, OIDC, LDAPS front-ends
• Wait, what? Okta OIDC doesn’t by default present group names in the refresh token response – additional feature = £££
• dex talks to the (no extra charge) Okta LDAPS directory (and/or others – e.g. static passwords for out-of-cluster service accounts)
• LDAPS[ ] = Active Directory / Samba / Whatever
• Why?
• Kuberos provides ~/.kube credentials UI
• Thereafter kubectl talks to dex and refreshes token
• Joiner / Mover / Leaver process = someone else’s problem
• Standard kubectl tooling
• Group mappings in code
Security – API Creds / RBAC
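"Group mappings in code" might look something like the sketch below: a table of directory groups mapped to a namespace and a role, rendered into Kubernetes RoleBinding manifests. The group and namespace names are hypothetical; `edit` and `admin` are the built-in user-facing ClusterRoles, and the group names are assumed to be whatever the OIDC provider presents in the token's groups claim.

```python
import json
from typing import Any, Dict, List, Tuple

RBAC_API_GROUP = "rbac.authorization.k8s.io"

# Hypothetical mapping: directory group -> (namespace, ClusterRole name).
GROUP_MAPPINGS: Dict[str, Tuple[str, str]] = {
    "squad-search": ("search", "edit"),
    "platform-team": ("kube-system", "admin"),
}


def role_binding(group: str, namespace: str, cluster_role: str) -> Dict[str, Any]:
    """Bind a directory group to a ClusterRole within one namespace."""
    return {
        "apiVersion": f"{RBAC_API_GROUP}/v1",
        "kind": "RoleBinding",
        "metadata": {"name": f"{group}-{cluster_role}", "namespace": namespace},
        "subjects": [
            {"kind": "Group", "name": group, "apiGroup": RBAC_API_GROUP}
        ],
        "roleRef": {
            "kind": "ClusterRole",
            "name": cluster_role,
            "apiGroup": RBAC_API_GROUP,
        },
    }


def all_bindings() -> List[Dict[str, Any]]:
    """Render every mapping into a RoleBinding manifest."""
    return [role_binding(g, ns, role) for g, (ns, role) in GROUP_MAPPINGS.items()]


if __name__ == "__main__":
    print(json.dumps(all_bindings(), indent=2))
```

With the mapping under version control, joiners and leavers are handled in the directory while access changes go through normal code review.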
• Most charts allow specifying attributes of storage; but critically NOT the volume ID :face-palm:
• → Need reclaim policy = retain (or you’ll lose the volume!)
• → Cannot delete and re-deploy helm chart without manual intervention
• Need to take PV out of retained state
• If you’re not making changes to the PV definition (you shouldn’t be), then at least helm upgrades work
• Environments up forever = 
• Working through stable helm charts with PRs to support volume IDs
• Not as simple as it sounds; often need to split replica-sets into multiples or similar approaches to allow specific IDs to be used
Storage
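The two manual steps mentioned above (setting the reclaim policy to retain, and taking a PV out of its retained state so it can be bound again) are commonly done with `kubectl patch pv`. The sketch below only builds the command lines; the PV name is hypothetical, and clearing `claimRef` is one widely used way to make a released volume available again, not necessarily the procedure the team used.

```python
import json
from typing import Any, Dict, List


def retain_patch() -> Dict[str, Any]:
    """Patch body that keeps the underlying volume when the claim is deleted."""
    return {"spec": {"persistentVolumeReclaimPolicy": "Retain"}}


def release_patch() -> Dict[str, Any]:
    """Patch body that clears the old claim reference so the PV can be re-bound."""
    return {"spec": {"claimRef": None}}


def kubectl_patch_pv(pv_name: str, patch: Dict[str, Any]) -> List[str]:
    """Return the kubectl argument list for patching a PersistentVolume."""
    return ["kubectl", "patch", "pv", pv_name, "-p", json.dumps(patch)]


if __name__ == "__main__":
    # "pv-example" is a placeholder name; commands are printed, not executed.
    for patch in (retain_patch(), release_patch()):
        print(" ".join(kubectl_patch_pv("pv-example", patch)))
```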
• Horizontal Pod Auto-scaler
• Fancy way of saying ‘look at this Prometheus metric and scale if above threshold’
• Business metric scaling = cool… ‘if average user search time > 0.5s then scale up search API’
• TODO: OR rules only at the moment
• metrics-server (Pod CPU & memory API) & prometheus-adapter (custom metrics API)
• Cluster Auto Scaler
• Increase instances running in AWS ASG if nodes can’t satisfy request
• TODO: Add option for headroom to allow faster pod scaling
• Kops rolling update cluster
• It just works!
• Pod disruption budget if required = don’t move this pod (e.g. jenkins build slave)
Scaling / Updating
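The "if average user search time > 0.5s then scale up" rule follows the replica calculation documented for the Horizontal Pod Autoscaler: desired = ceil(current replicas × current metric / target metric), with no change when the ratio is within a tolerance (0.1 by default). A minimal sketch of that arithmetic, with illustrative numbers:

```python
import math


def desired_replicas(
    current_replicas: int,
    current_value: float,
    target_value: float,
    tolerance: float = 0.1,
) -> int:
    """Apply the HPA scaling formula to a single metric."""
    ratio = current_value / target_value
    # Within tolerance of the target: leave the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


if __name__ == "__main__":
    # 4 search API pods, average search time 0.8s against a 0.5s target.
    print(desired_replicas(4, 0.8, 0.5))
```

In this example the ratio is 1.6, so four pods become seven. The real controller also applies min/max replica bounds and stabilisation behaviour that are not modelled here.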
• Docker Logs (StdOut, JSON codec please!) → Logspout → Logstash → <log stack of your choice>
• Logspout slurps docker logs and spits them out to logstash
• Metrics
• Prometheus – Who ISN’T using this?!
• Kubernetes 1st class citizen
• Graphite – to – Prometheus
• Majority of current estate is AWS native & uses graphite
• https://github.com/prometheus/graphite_exporter
• We collect approx. 1.5 million metrics every 30s
• Ruby graphite-to-prometheus exporter
• (Golang implementation was go-slow at-scale and Ruby impl. is easier to maintain in-house)
• Scrape metrics from pod
• Observability
• Weave Scope – Great but beware as it’s a cluster admin (root SSH to nodes!)
• Kiali – Istio specific. Nice but unclear how actively developed it is
• Jaeger – Like Zipkin (distributed tracing); auto-deployed with Kiali – nice for free (enabled: true)
• Cockpit – Nice UI and uses kubectl creds
• Kube-ops-view – Read only; ugly
• Kubernetic – Desktop UI (beta – free!)
• Pretty pictures incoming…
Observability
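The "JSON codec please!" request for stdout logs means each log line should be a single JSON object, so the downstream pipeline can parse it without custom patterns. A minimal sketch using only the Python standard library; the field names are an assumption, not a schema from the talk.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_logger(name: str) -> logging.Logger:
    """Create a logger that writes JSON lines to stdout."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    build_logger("search-api").info("search completed in %.2fs", 0.42)
```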
Weave Scope
Jaeger
Cockpit
Kube Ops View
Desktop – Kubernetic (beta = free)
• Excessive Istio logging - Remove stdout rule!
• Java Heap memory in containers
• Heap = ¼ host RAM by default
• Chicken/egg design of hosting orchestration tool
• Tooling sits inside cloud?!
• Deployment ordering – 1st deploy = no istio; 2nd deploy/scales with Istio!
• Ruby v2.4+ ndots limitation for DNS lookups – WTF!
• Istio requires ServiceEntry for headless services
• Early Istio maturity (reliability and e.g. max secgroups) v1.0.2+ = 
• ‘Finding’ the default container limits per NS
• Unreliable Jenkins K8s slave plugin (1.13.2 = )
• Kops clashing with existing VPC setup (please don’t delete those subnets!)
• AWS limit on ELBs (8 IPs per subnet!! Solved with Istio ingress gateway service)
• StatefulSets and EBS mapping (earlier)
• RBAC with Okta (atypical OIDC)
• No UDP support in Istio (don’t pick syslog as your first service to move!)
Jim: “Dave, why’s it taking so long?”
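The Java heap gotcha above comes down to simple arithmetic: if the JVM sizes its default maximum heap as a quarter of the host's RAM rather than of the container's memory limit, the heap can be far larger than the container is allowed to use. A minimal worked example under that assumption (host and limit sizes are illustrative; actual behaviour depends on the JVM version and flags):

```python
GIB = 1024 ** 3


def default_max_heap_bytes(visible_ram_bytes: int) -> int:
    """Default max heap under the 'quarter of visible RAM' rule from the slide."""
    return visible_ram_bytes // 4


def heap_exceeds_limit(host_ram_bytes: int, container_limit_bytes: int) -> bool:
    """True if a heap sized from host RAM would not fit in the container limit."""
    return default_max_heap_bytes(host_ram_bytes) > container_limit_bytes


if __name__ == "__main__":
    host_ram = 32 * GIB  # illustrative node size
    limit = 2 * GIB      # illustrative container memory limit
    heap = default_max_heap_bytes(host_ram)
    print(f"default max heap: {heap // GIB} GiB, container limit: {limit // GIB} GiB")
    print("heap exceeds limit:", heap_exceeds_limit(host_ram, limit))
```

Setting an explicit maximum heap (for example with `-Xmx`) below the container limit avoids relying on the default.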
• Two customer facing services in production
• 50+ Jenkins slave agent images moved from Mesos to k8s
• Groovy script FTW
• Around a dozen services in progress
• 80% of central infrastructure services are migrated (sonar, dashboards, jenkins slaves, etc.)
• Benefits are being sold to all squads; the platform team is keen for this to be a “pull” action. Squads WANT to move to k8s.
• Success = developer led
Where are we today?
• Canary deployments
• Working on automated canary deployments & roll-back
• Helm already gives us this with readiness/liveness checks – extending coverage to include automated acceptance testing.
• Automated fault tolerance acceptance testing over multiple services.
• Improved performance testing plan – auto-scaling makes it harder to test to breaking point; new patterns emerging.
• Cost savings through consolidation (merge VPCs & compute)
• Improved spot price ‘storm’ tolerance (prod = spot)
Roadmap
Thanks to Tom, Tristan and Booking.com
Questions

The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 

Kubernetes Manchester - 6th December 2018

  • 1. MoneySuperKubernetes Navigating K8s at Moneysupermarket Kubernetes Manchester, December 2018 David Stockton, DevOps Tech Lead for Core Platform Jim Davies, Head of DevOps
  • 2. Jim Davies David Stockton
  • 3.
  • 4. Why do we need Kubernetes?
  • 5. < 2016 2017 2018 2019 > AWS • Rightscale • Masterless Puppet • Jenkins on Mesos • Single-container hosts • Docker Swarm spikes • Nomad spikes • Kubernetes full trial • AWS and industry say ”Go K8s” • EKS preview (not impressed) • Kubernetes full development • Continued migration • Test and learn • DEPLOY ON DAY ONE! Road to Kubernetes
  • 6. • AWS EKS • First evaluated Jan 2018 (beta) – only available in US AZ’s (we want customer data in EU) • At the time, required custom kubectl binaries & well behind in version support (much better now) – we plan to re-evaluate EKS next year. • No automation capability – Fargate for EKS wasn’t ready, Terraform not ready. • Would need better interoperability with our existing EC2 VPC estate. • GKE • Have data analytics platform in GCP; but all SoA/front-end estate is in AWS + direct connect Technology Choices Platform Deployment • Kops • K8s deployment automation (AWS first-class citizen) • Can output terraform files for IaC management - we keep terraform state in S3 bucket BUT we throw-away the Terraform files and re-generate each run. • + Not tempted to edit Terraform files (Don’t do this! We did and it’s a nightmare to use any kops function again afterwards) • + Can turn-off cluster (we shut-down environments at night - £££ ) without ’destroying’ it. • - Hasn’t happened yet but theoretical worry if kops re-architects terraform files (theoretically just rebuild the whole cluster anyway) • - Launch configurations only (no launch template multi instance-type support yet) • Puppet – Invoked via kops hook • Small (master-less) puppet run to add some instance level goodness (e.g. security tooling!)
  • 7. • Route 53 • Delegated sub-domain to Route53 management (controlled by ‘external-dns’) • ELB • Route53 entries point to ELBs • More on this when I talk about Istio… Traffic Ingress Application Deployment • Jenkins • Orchestrates & provides feedback • e.g. PR merged → Webhook → Helm-Deploy (Jenkinsfile) • Helm • THE kubernetes package manager • YAML + golang template = k8s YAML • ProTip: ’Sprig’ for free - http://masterminds.github.io/sprig/ • ‘generic-service’ helm chart • helm-deploy.rb • Proprietary Ruby script • Defines list of helm releases to run per environment • helm chart + chart version + values.yaml path • Remember, environments restarted each day from code (disaster recovery = normal day!) • Artifactory • Docker repository (plus virtual aggregation) • Helm repository (plus virtual aggregation)
  • 8. • Service Mesh • Connect – control flow of traffic (e.g. blue/green deployments, testing) • Secure – automatically secure services (auth ’N’ auth + encryption) – solves our internal network requirements; not cloud vendor specific (i.e. could span over multiple clouds/datacenters) • Control – enforce policies (e.g. circuit breaker) • Observe – automatic tracing, monitoring & logging • Traffic Ingress • k8s service → Istio Ingress Gateway → Gateway → VirtualService → <upstream> • Egress • ServiceEntries – can use to white-list outbound traffic or create rules (e.g. no more than X connections to external service Y) • Squads in Control • Central point of control (e.g. /endpoint was ‘serviceA’, now it’s ‘serviceB’ – only 1 piece of configuration) • Squads can ’wire-up’ their service – in their control • Limits / Service Protection - no more than X concurrent connections • Fault Testing - delay injection / fail X% of requests • Canary Deployments – maybe next talk!? Istio
  • 9. • Docker Images • Linted (docker lint) • Scanned (klar → clair → clair-db) • On build (existing vuln. at build time) • On schedule (new vulns) • Corporate white-list YAML from git repo • IGNORE_UNFIXED is a better strategy – but you still want to know! • Lots of vendor solutions available too • Hosts • Standard Linux tooling / SaaS vendors available • Many have profiling capabilities • Many offer container (or even helm chart); beware they typically need to be a privileged container – only appropriate if you manage your own hosts • k8s • CVE-2018-1002105 (9.8!) • https://access.redhat.com/security/cve/cve-2018-1002105 • https://github.com/kubernetes/kubernetes/issues/71411 • Affected versions: • Kubernetes v1.0.x-1.9.x • Kubernetes v1.10.0-1.10.10 (fixed in v1.10.11) • Kubernetes v1.11.0-1.11.4 (fixed in v1.11.5) • Kubernetes v1.12.0-1.12.2 (fixed in v1.12.3) • https://kubernetes.io/docs/tasks/administer-cluster/securing-a-cluster/ • https://github.com/neuvector/kubernetes-cis-benchmark Security
  • 10. • Integrate with SSO • kuberos → dex → Okta (LDAPS) → ldaps[ ] • kuberos = UI to OIDC provider • https://github.com/helm/charts/tree/master/stable/kuberos • dex = OIDC provider which aggregates back-end auth providers • https://github.com/helm/charts/tree/master/stable/dex • Okta = SSO solution; provides SAML, OIDC, LDAPS front-ends • Wait, what? Okta OIDC doesn’t by default present group names in refresh token response – additional feature = £££ • dex talks to (no extra charge) Okta LDAPS directory (and/or others – e.g. static passwords for out-of-cluster service accounts) • LDAPS[ ] = Active Directory / Samba / Whatever • Why? • Kuberos provides ~/.kube credentials UI • Thereafter kubectl talks to dex and refreshes token • Joiner / Mover / Leaver process = someone else’s problem • Standard kubectl tooling • Group mappings in code Security – API Creds / RBAC
  • 11. • Most charts allow specifying attributes of storage; but critically NOT the volume ID :face-palm: • Need reclaim policy = retain (or you’ll lose the volume!) • Cannot delete and re-deploy helm chart without manual intervention • Need to take PV out of retained state • If you’re not making changes to the PV definition (you shouldn’t be), then at least helm upgrades work • Environments up forever = • Working through stable helm charts with PRs to support volume IDs • Not as simple as it sounds; often need to split replica-sets into multiples or similar approaches to allow specific IDs to be used Storage • Horizontal Pod Auto-scaler • Fancy way of saying ‘look at this Prometheus metric and scale if above threshold’ • Business metric scaling = cool… ‘if average user search time > 0.5s then scale up search API’ • TODO: OR rules only at the moment • metrics-server (Pod CPU & mem API) & prometheus-adapter (custom API) • Cluster Auto Scaler • Increase instances running in AWS ASG if nodes can’t satisfy request • TODO: Add option for headroom to allow faster pod scaling • Kops rolling update cluster • It just works! • Pod disruption budget if required = don’t move this pod (e.g. jenkins build slave) Scaling / Updating
  • 12. • Docker Logs (StdOut, JSON codec please!) → Logspout → Logstash → <log stack of your choice> • Logspout slurps docker logs and spits them out to logstash • Metrics • Prometheus – Who ISN’T using this?! • Kubernetes 1st class citizen • Graphite – to – Prometheus • Majority of current estate is AWS native & uses graphite • https://github.com/prometheus/graphite_exporter • We collect approx 1.5million metrics every 30s • Ruby graphite-to-prometheus exporter • (Golang implementation was go-slow at-scale and Ruby impl. is easier to maintain in-house) • Scrape metrics from pod • Observability • Weave Scope – Great but beware as it’s a cluster admin (root SSH to nodes!) • Kiali – Istio specific. Nice but unclear how actively developed it is • Jaeger – Like Zipkin (distributed tracing); auto-deployed with Kiali – nice for free (enabled: true) • Cockpit – Nice UI and uses kubectl creds • Kube-ops-view – Read only; Ugly • Kubernetic – Desktop UI (beta – free!) • Pretty pictures incoming… Observability
  • 18. Desktop – Kubernetic (beta = free)
  • 19. • Excessive Istio logging - Remove stdout rule! • Java Heap memory in containers • Heap = ¼ host RAM by default • Chicken/egg design of hosting orchestration tool • Tooling sits inside cloud?! • Deployment ordering – 1st deploy = no istio; 2nd deploy/scales with Istio! • Ruby v2.4+ ndots limitation for DNS lookups – WTF! • Istio requires ServiceEntry for headless services • Early Istio maturity (reliability and e.g. max secgroups) v1.0.2+ = • ‘Finding’ the default container limits per NS • Unreliable Jenkins K8s slave plugin (1.13.2 = ) • Kops clashing with existing VPC setup (please don’t delete those subnets!) • AWS limit on ELBs (8 IP per subnet!! solved with Istio ingress gateway service) • StatefulSets and EBS mapping (earlier) • RBAC with Okta (atypical OIDC) • No UDP support in Istio (don’t pick syslog as your first service to move!) Jim: ”Dave, why’s it taking so long?”
  • 20. • Two customer facing services in production • 50+ Jenkins slave agent images moved from Mesos to k8s • Groovy script FTW • Around a dozen services in progress • 80% of central infrastructure services are migrated (sonar, dashboards, jenkins slaves, etc) • Benefits are being sold to all squads, platform team keen for this to be “pull” action. Squads WANT to move to k8s. • Success = developer led Where are we today?
  • 21. • Canary deployments • Working on automated canary deployments & roll-back • Helm already gives us this with readiness/liveness checks – extending coverage to include automated acceptance testing. • Automated fault tolerance acceptance testing over multiple services. • Improved performance testing plan – auto-scaling so harder to test to break; new patterns emerging. • Cost savings through consolidation (merge VPCs & compute) • Improved spot price ‘storm’ tolerance (prod = spot) Roadmap
  • 22. Thanks to Tom, Tristan and Booking.com Questions
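
The affected-version ranges for CVE-2018-1002105 on slide 9 can be turned into a quick check. What follows is a minimal sketch in Python, not tooling from the talk: the function name `is_affected` and the `FIXED_PATCH` table are illustrative, and treating any version outside the ranges on the slide (for example 1.13 and later) as "not listed as affected" is an assumption.

```python
import re

# First patched release per minor line, as listed on slide 9.
FIXED_PATCH = {(1, 10): 11, (1, 11): 5, (1, 12): 3}


def is_affected(version: str) -> bool:
    """Return True if `version` falls in a range the slide lists as affected."""
    match = re.fullmatch(r"v?(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    major, minor, patch = (int(part) for part in match.groups())

    # v1.0.x to v1.9.x: every patch release is listed as affected.
    if major == 1 and minor <= 9:
        return True

    # v1.10 to v1.12: affected until the fixed patch release.
    fixed = FIXED_PATCH.get((major, minor))
    if fixed is not None:
        return patch < fixed

    # Assumption: anything else is not covered by the slide.
    return False


if __name__ == "__main__":
    for v in ("v1.9.7", "v1.10.10", "v1.10.11", "v1.12.2", "v1.12.3"):
        print(v, "affected" if is_affected(v) else "not listed as affected")
```

The input could be the server version reported by `kubectl version`, which keeps the check independent of how the cluster was built.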

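The business-metric scaling rule on slide 11 ("if average user search time > 0.5s then scale up search API") follows the replica calculation described in the Kubernetes Horizontal Pod Autoscaler documentation: desired = ceil(current replicas × current metric / target metric). The Python sketch below only illustrates that arithmetic. The function name, the replica bounds and the sample numbers are assumptions, and it ignores the autoscaler's tolerance band and stabilisation behaviour.

```python
import math


def desired_replicas(current_replicas: int,
                     current_value: float,
                     target_value: float,
                     min_replicas: int = 1,
                     max_replicas: int = 10) -> int:
    """Sketch of the HPA replica calculation for a single metric."""
    if target_value <= 0:
        raise ValueError("target_value must be positive")
    # Scale the replica count by how far the metric is from its target.
    raw = math.ceil(current_replicas * current_value / target_value)
    # Clamp to the configured bounds (hypothetical defaults here).
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # Hypothetical: 4 search API pods, average search time 0.75s, target 0.5s.
    print(desired_replicas(4, 0.75, 0.5))
```

With the sample numbers the metric is 1.5 times its target, so the replica count grows by the same factor; a metric below target shrinks it in the same way.
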
Editor's Notes

  1. Intro
     - Me
     - Company
     People/org
     - Core Platform team
       - Software put on shelf
     - Embedded Ops Engineers/Specialists in delivery teams
       - If anyone has seen our job postings, yes we call them DevOps Engineers…
       - Teams should have the power
     Why?
     - Deploy on day one
     - Challenges with current platform
     Constraints
     - Regulated
     - Multiple teams/brands
     Timeline
     - Current architecture
       - Environments
       - Network
       - AWS/Rightscale/Masterless Puppet/IaC
       - Single container host
       - Mesos
     - Many smaller spikes with Docker Swarm, Nomad and K8s
     - Initially put off by K8s complexity
     - Gained support for moving on this after focussing on a different area during 2017
     - End-2017 did successful spike with real services
     - Our AWS architect tipped us the wink before EKS to stick with K8s
     - We took part in the preview but weren’t impressed
     - As we now know, when AWS released EKS the world went to K8s
     - From Spring this year, we developed the current platform
     Tech choices/design
     - Why not EKS (subnet, integration with existing, Ireland availability [now in])
     - Why not GKE (already in AWS, integration challenges, however we are using it in our data analytics platform)
     Platform deployment
     - Kops (Terraform managing ELBs/ASGs importantly steps outside of our current IaC)
     - Route53 to manage the internal private domain
     - ELB for ingress
     Application deployment
     - Helmdeploy (this is our secret sauce, a little Ruby wrapper enabling us to template helm files and deploy and handle pod configuration in a consistent way)
     - Helm and our ‘generic-service’ chart
     - Jenkins orchestrates, moving towards ‘read-only’ Jenkins and K8s operations are essentially done by pull request
     - Artifactory hosts private helm repo and our Docker registry
     Other tech choices
     - Istio (for service mesh)
       - Feels like this more than K8s is the game-changer
       - Allowing service teams to not only deploy but ‘wire up’ their services and dependencies
       - Also solves our internal network encryption requirements
       - It gives us canary deployments and fault injection too but that’s for next time
     - Storage
       - ????
     - Image layer scanning with klar
       - Done on image build and on a scheduled basis
     - RBAC on the K8s API
       - This is a work in progress integrating with our internal identity system, Okta.
     - Prometheus for monitoring (just starting out on that and looking to get someone else to host storage)
     - Horizontal pod autoscaling on custom scaling metrics read from Prometheus
     - Currently ‘teeing’ to Prom and existing metric storage Graphite and Grafana
     - Docker logs > Logspout > Logstash (for options) > Loggly
     - Weave (???)
     - Jaeger (tracing)
     - Kiali (traffic flow)
     Where we are today
     - Two customer-facing services in production
     - Around a dozen in progress
     - 80% of the central infrastructure services are migrated
     - The benefits are being sold to all teams but the platform team are keen for this to be a pull
     - This is a developer-led platform
     War stories
     “The juicy bits. This is what you’re all interested in, right?”
     - Excessive Istio logging
     - Java Heap memory in containers
     - Chicken/egg design of hosting orchestration tool
     - Cluster scales without Istio sidecar
     - Ruby ndots limitation for DNS lookups
     - Istio and headless services (??)
     - Early Istio maturity (reliability and e.g. max secgroups)
     - ‘Finding’ the default container limits
     - Unreliable Jenkins K8s slave plugin
     - Kops clashing with existing VPC setup
     - AWS limit on ELBs (solved with Istio single ingress)
     - StatefulSets and EBS mapping
     - RBAC with Okta (atypical OIDC)
     - No UDP in Istio (a surprise but quickly worked around)
     What’s next/roadmap
     - Canary
     - Fault injection
  2. Intro Not quite Holly and Dec but we’ll give it a go Jim (Head Of DevOps, MSM for 10 years, look after Ops specialists embedded in Product Delivery teams across the group and help the teams achieve their delivery and availability goals) David (I’ll hand you over in a minute but David looks after the DevOps Core team based in Ewloe near Chester. The Core team develop and maintain the central infrastructure services) We’ve been building out our own Kubernetes clusters for around 12 months now and we wanted to share our successes and failures. We’re still really at the beginning of our journey so also hoping to get some info from you too.
  3. So a quick scene-setter… The whole group is dedicated to saving UK households money every single day. The estimated savings for customers in 2017 were over £2bn, operating over the three brands you see here: MSM, TSM and MSE… We have delivery teams across all three brands and they are at different stages with their platforms. For example, we have Kubernetes clusters running in GKE on Google Cloud running our Data analytics platform, AWS Fargate in test under TSM and AWS Lambda running under the MSE website. Today though, we’re going to talk about what we’re doing for the Moneysupermarket or MSM teams, as we have rolled our own. MSM is sub-divided into the Product Engineering teams: the Insure and Home Services channels (e.g. Car Insurance, Gas and Electricity) and Money (e.g. Savings, Current Accounts). These teams operate their own front-end and back-end services. The website also has a backend service layer providing shared capabilities that we can also expose to third-party front-ends and the mobile app. The Group Services delivery team look after this microservices architecture. All of these teams have dedicated change, infrastructure and automation specialists embedded in the team providing that local development and support. Yes. We call them DevOps Engineers, which isn’t right but it does seem to work when recruiting the right people. We give as much autonomy to the teams as possible but there’s more that can be done, which I’ll come to in a moment. Alongside the brands, there is a Group Technology Operations function that houses central incident and release management, information security, the workspace infrastructure team and also David’s Core team. The function of the Core team is developing and maintaining central infra services and creating those shared deployment patterns. Almost creating patterns and putting them on the shelf for the teams to pull.
  4. Why? More specifically, why do we need a container orchestration platform? Deploy on Day One: We must be able to release to production with as few obstacles as possible. We think the best test of this is that a new developer creates some code, runs it through a minimal workflow and deploys to production on her first day. We are signing up teams to get to this point for new team members this year. It will be a measure of the team’s autonomy and we are developing the new platform for this purpose. I know this is common practice in many organisations but we can’t right now because… Dependent platform: We designed and built the current platform back in 2014 as part of the Group cloud migration project. It was built by Ops for Ops to solve a number of infrastructure problems we knew all too well from running out of a datacenter. It proved OK and we also built out the DevOps ways of working, democratizing Operations to Delivery teams. However, our toolsets were all-powerful, so admin meant admin everywhere across the stack and, being a regulated company, this will not fly. This resulted in bottlenecks in teams where their DevOps Engineers are needed to do certain things. Most of the current services are bootstrapped in AWS using a tool called Rightscale and a masterless Puppet system. This infrastructure as code solution has been very flexible but we have a very complicated config base that is a nightmare for new engineers. We also shut down our development environments overnight to stop config drift and save a few quid, but the morning rebuilds still have a less than perfect record, with failures causing problems for the teams. Operations by pull request: To help our deploy on day one goal and allow developers to deploy their own services with little friction, we need to enable anyone to make changes to the service configuration, monitoring configuration, network ingress points and integrations with other services. When I say anyone, that’s anyone with the right automated approval. Of course, those automated approval workflows need to be changed too. We already have the majority of infrastructure configuration as code but what’s left is not generally accessible and causes obstacles for the developer. Kubernetes and a service mesh, which David will talk about in a bit, bring everything under source control and we can then put git merge workflows around any changes. So with all that in mind, I’ll hand over to David to talk more about where we have come from, where we are and where we’re going. Oh, and the interesting bits of where we went wrong!
  5. Timeline
     - Current architecture
       - Environments
       - Network
       - AWS/Rightscale/Masterless Puppet/IaC
       - Single container host
       - Mesos
     - Many smaller spikes with Docker Swarm, Nomad and K8s
     - Initially put off by K8s complexity
     - Gained support for moving on this after focussing on a different area during 2017
     - End-2017 did successful spike with real services
     - Our AWS architect tipped us the wink before EKS to stick with K8s
     - We took part in the preview but weren’t impressed
     - As we now know, when AWS released EKS the world went to K8s
     - From Spring this year, we developed the current platform
  6. Questions: K8s in prod? Keep hand-up if Istio in prod? Who’s using EKS? / Who’s using GKE? Who’s using kops? Kubeadm? rke? Other?
  7. Who’s using helm? Anyone automating helm (e.g. multiple helm-runs? / keel)? Artifactory – Used for Java artifacts anyway; support (such as helm) provided by JFrog, not community