Hashicorp Meetup 26th June 2018
Building Congruent infrastructure @Zopa
Ben Coughlan
Ben Coughlan
Senior Zopa Reliability Engineer
• Version 1 (software consultancy), SSE Airtricity, Allianz, Qualcomm.
• Docker Everywhere! TLS Everywhere!
• Joined Zopa in February 2017.
• Ben.Coughlan@zopa.com
What is Congruent infrastructure?
Divergence:
• Manual administration.
• Scripts, ssh, rsync, manually mounting and extending disks.
• Constantly creeping from desired state.
• Snowflake Servers.
Convergence:
• Configuration Management used to spin up infrastructure.
• Ansible, chef, puppet, salt.
• CM tools are unreliable; it is impossible to keep the state of all
servers the same.
• SSH allows for configuration drift.
Congruence:
• No configuration management.
• Immutable infrastructure.
• No state creep, because there is no SSH.
How to build congruent infrastructure.
• Build software, not servers.
• The first platform will take weeks or months, the second one will take minutes.
• Instead of updating servers, you completely replace the servers. Think cattle, not pets.
• Build hardened systems, with lots of monitoring.
• Everything needs to be immutable.
• Manage your state.
• Standardise your tools, languages, practices, add an order to everything, and build it
into your culture.
The tools for SREs @Zopa
Tools
• Terraform
• Vault
• Consul
• CoreOS (Ignition)
• Docker
Telemetry
• Prometheus
• Grafana
• Splunk
Languages
• Golang
Platforms
• AWS
• vSphere
The ethos for SREs @Zopa
• Try to remain vendor agnostic
• Use open source where possible
• Everything needs to be immutable
• Everything runs in a Container
• TLS everywhere
• Telemetry everywhere
• Perfect is the enemy of good
The Challenge – Infrastructure at Zopa.
The Monolith
• Overloaded DCs.
• Many single points of failure.
• Many snowflake servers.
• Every transaction goes through the monolith.
• Single digit production releases per week.
• Huge range of databases sitting on three servers.
• Divergent infrastructure, very little configuration
management.
Tech Stack
• Ubuntu VMs (some running applications in Docker).
• Windows Server running C# and .NET Services.
The Solution – Infrastructure at Zopa.
Microservices
• Multi Cloud.
• Well defined, and split pipelines.
• Hundreds of deployments per day.
• Distributed databases.
• Immutable infrastructure.
Tech Stack
• Kubernetes!!!!!!1 
• Kafka
• Containerize everything.
• Windows Services to move to .NET Core.
• Move as much as possible into Kubernetes.
Use Case: Building Kafka with congruence in mind.
Why is it hard to automate a deployment of Kafka?
1. Kafka is far from cloud native.
2. It can be unstable, even when built manually.
3. Zopa’s use case of using Kafka for critical transactions, and using it as a persistent database,
meant that we couldn’t ever afford to lose data.
4. As it’s open source, it had a lot of inconsistent behaviours: it runs different versions of the JVM across
products of the same version, has different jars on each product, etc. This took a lot of pull
requests to resolve over the past 12 months.
5. Logging to file, SSL/TLS/AAA/ACLs or “security”, dockerised images, rebalancing a cluster
without taking it offline, online operations, etc., were all an afterthought with this product.
6. Initially built outside of Kubernetes, due to the complexities of Kafka.
Infrastructure Orchestration - spinning up the machines.
TERRAFORM – Infrastructure Orchestration.
IGNITION – User Data.
Remote Storage is attached – Managing state.
Certificates are issued to each server.
Bootstrapping takes place.
Service is up and running – telemetry.
Terraform
1. Pulls in secrets from Vault to access AWS. (vault provider)
2. Pulls in Values from remote state files. (for the VPC, subnets, etc)
3. Utilises reusable modules from the Zopa stack (spot pricing, service discovery, re-attachable storage,
etc.).
4. Creates launch configurations, autoscaling groups, Route53 addresses, etc.
5. Passes the Ignition user data into the launch configuration.
Orchestration/Config management: Rationale
Ansible : Configuration Management
• Agentless in push mode.
• In push mode, requires a solid SSH Connection.
• Ansible's reliance on SSH makes it difficult to use securely in an environment with a lot of churn.
• Ansible has no native protection against concurrent runs.
• Ansible in push mode is really slow.
• Its workaround for its slow performance -- tags -- encourages config drift.
IGNITION – User Data
• Kicks in at runtime.
• Templates dynamic “drop-in” files instead of carrying out configuration management.
• No Configuration drift, you kill the server every time you want to update. (clean slate)
• Adding/changing configurations is much easier.
• Far easier to use than legacy CM tools like Ansible, Puppet, Chef, etc.
• No Race conditions, no lag, no SSH.
• Secure! Specific IAM permissions to retrieve its own user data.
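
The "drop-in" templating described above can be sketched in Go as follows. This is a minimal illustration, not Zopa's code: it assumes an Ignition v2-style config with a single systemd unit, and the unit name `kafka.service`, the drop-in name `10-node.conf`, the environment variable names and the `Node` type are all hypothetical.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"text/template"
)

// Dropin, Unit and Config model only the small part of an Ignition v2
// config used here: systemd units carrying drop-in files.
type Dropin struct {
	Name     string `json:"name"`
	Contents string `json:"contents"`
}

type Unit struct {
	Name    string   `json:"name"`
	Dropins []Dropin `json:"dropins"`
}

type Config struct {
	Ignition struct {
		Version string `json:"version"`
	} `json:"ignition"`
	Systemd struct {
		Units []Unit `json:"units"`
	} `json:"systemd"`
}

// Node holds the per-machine values that get templated in (hypothetical).
type Node struct {
	BrokerID int
	Rack     string
}

// dropinTmpl is an illustrative drop-in; the variable names are assumptions.
const dropinTmpl = "[Service]\n" +
	"Environment=KAFKA_BROKER_ID={{.BrokerID}}\n" +
	"Environment=KAFKA_BROKER_RACK={{.Rack}}\n"

// RenderDropin fills the drop-in template for one node.
func RenderDropin(n Node) (string, error) {
	t, err := template.New("dropin").Parse(dropinTmpl)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, n); err != nil {
		return "", err
	}
	return buf.String(), nil
}

// RenderUserData wraps the rendered drop-in in a JSON config that could be
// passed to a machine as user data.
func RenderUserData(n Node) (string, error) {
	contents, err := RenderDropin(n)
	if err != nil {
		return "", err
	}
	var c Config
	c.Ignition.Version = "2.2.0"
	c.Systemd.Units = []Unit{{
		Name:    "kafka.service",
		Dropins: []Dropin{{Name: "10-node.conf", Contents: contents}},
	}}
	out, err := json.Marshal(c)
	if err != nil {
		return "", err
	}
	return string(out), nil
}

func main() {
	userData, err := RenderUserData(Node{BrokerID: 1, Rack: "eu-west-1a"})
	if err != nil {
		panic(err)
	}
	fmt.Println(userData)
}
```

Because the rendered config is only read when a machine first boots, changing a value means rendering new user data and replacing the server, which is the "clean slate" behaviour the slide describes.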
Infrastructure Orchestration – Rack awareness and Persistence.
1. “Smilodon” is used to attach a remote EBS volume.
2. Once the volume is attached, a file that Terraform has placed on the
volume provides the “rack” (the AWS site)
and the node number.
3. The node number is then used to attach the correct network
interface.
4. The network interface gives the node a second IP address, which
is tied to the Route53 record.
This allows us to ensure that the correct data is used for each node,
which prevents rebalancing, and ensures we can lose any AWS
availability zone, without losing any data.
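
As a rough sketch of step 2, the Go snippet below reads a node identity from the contents of such a file and turns it into Kafka broker settings. The `key=value` file layout and the `Identity` type are assumptions made for illustration; the slides do not show the real format. `broker.id` and `broker.rack` are standard Kafka broker properties.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// Identity is what a node learns from the file on its re-attached volume.
type Identity struct {
	Rack string
	Node int
}

// ParseIdentity reads "rack=<site>" and "node=<number>" lines.
// The file layout is hypothetical.
func ParseIdentity(text string) (Identity, error) {
	id := Identity{Node: -1}
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip blank lines and comments
		}
		parts := strings.SplitN(line, "=", 2)
		if len(parts) != 2 {
			return Identity{}, fmt.Errorf("malformed line %q", line)
		}
		value := strings.TrimSpace(parts[1])
		switch strings.TrimSpace(parts[0]) {
		case "rack":
			id.Rack = value
		case "node":
			n, err := strconv.Atoi(value)
			if err != nil {
				return Identity{}, fmt.Errorf("bad node number %q: %v", value, err)
			}
			id.Node = n
		}
	}
	if err := sc.Err(); err != nil {
		return Identity{}, err
	}
	if id.Rack == "" || id.Node < 0 {
		return Identity{}, fmt.Errorf("identity must set both rack and node")
	}
	return id, nil
}

// BrokerProperties renders the identity as Kafka broker settings, so a
// replacement server comes back as the same broker in the same rack.
func BrokerProperties(id Identity) string {
	return fmt.Sprintf("broker.id=%d\nbroker.rack=%s\n", id.Node, id.Rack)
}

func main() {
	id, err := ParseIdentity("rack=eu-west-1a\nnode=2\n")
	if err != nil {
		panic(err)
	}
	fmt.Print(BrokerProperties(id))
}
```

Keeping the identity on the volume rather than on the server is what lets a freshly built machine pick up the same broker ID and data as the one it replaces.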
SSL Certificates – Issuing and renewal.
• Vault PKI endpoint generated for each platform.
• Consul configuration in place to allow for distributed locks, so that certs are issued
one at a time.
• “Certificated”, an in-house tool, contacts Vault and Consul to issue certs to each
machine.
• Attempts to get a “lock” to issue a cert for itself.
• Gets the lock.
• Issues certs, and generates keystores and truststores.
• Mounts those cert stores into Kafka’s docker container.
• Starts/Restarts the Kafka service.
• Releases the lock.
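
The lock, issue, restart, release sequence can be sketched in Go as below. This is not the "Certificated" tool: the `Locker` and `Issuer` interfaces, the in-memory `MemLocker` and `FakeIssuer`, and the lock key are hypothetical stand-ins, where a real implementation would presumably sit on top of Consul and Vault.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Locker is a stand-in for a distributed lock (hypothetical interface).
type Locker interface {
	TryLock(key, owner string) bool
	Unlock(key, owner string)
}

// Issuer is a stand-in for a PKI endpoint (hypothetical interface).
type Issuer interface {
	Issue(host string) (string, error)
}

// LockKey is an illustrative name for the per-platform lock.
const LockKey = "platform/kafka/cert-issue"

// ErrLockHeld is returned when another node is currently issuing a cert.
var ErrLockHeld = errors.New("lock held by another node")

// MemLocker is an in-memory Locker used only to make the sketch runnable.
type MemLocker struct {
	mu   sync.Mutex
	held map[string]string
}

func NewMemLocker() *MemLocker {
	return &MemLocker{held: map[string]string{}}
}

func (l *MemLocker) TryLock(key, owner string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if _, taken := l.held[key]; taken {
		return false
	}
	l.held[key] = owner
	return true
}

func (l *MemLocker) Unlock(key, owner string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.held[key] == owner {
		delete(l.held, key)
	}
}

// FakeIssuer returns a placeholder string instead of a real certificate.
type FakeIssuer struct{}

func (FakeIssuer) Issue(host string) (string, error) {
	return "cert-for-" + host, nil
}

// IssueAndRestart takes the lock, issues a cert, hands it to restart
// (which would build the stores and restart the service), and releases
// the lock only after restart returns.
func IssueAndRestart(l Locker, i Issuer, host string, restart func(cert string) error) error {
	if !l.TryLock(LockKey, host) {
		return ErrLockHeld
	}
	defer l.Unlock(LockKey, host)

	cert, err := i.Issue(host)
	if err != nil {
		return fmt.Errorf("issue cert for %s: %v", host, err)
	}
	return restart(cert)
}

func main() {
	err := IssueAndRestart(NewMemLocker(), FakeIssuer{}, "kafka-1", func(cert string) error {
		fmt.Println("restarting with", cert)
		return nil
	})
	if err != nil {
		panic(err)
	}
}
```

Holding the lock until the restart has finished means only one broker is restarting at any moment, so issuing or renewing certs should not take more than one node out of the cluster at a time.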
Bootstrapping takes place
• Services Register with each other
independently.
• Zero manual work done to carry out the required
setup steps in Kafka.
• Based off dynamic “counts” in Ignition code.
• Dynamic and easily scalable; use of the
autoscaling function in AWS makes horizontal
scaling trivial.
Service up and running
• Prometheus – built-in application exporter and server
node exporter used extensively.
• Custom checks used extensively for service-specific
deliverables (consumes from and produces to the
cluster regularly; checks disk, IO and network; uses
probes for services; checks the JVM, etc.).
• Grafana used to visualise each cluster.
• Splunk used for log aggregation, and for
monitoring ACLs on each interaction/operation.
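
A custom check of the "produces to and consumes from the cluster" kind might look like the Go sketch below. The `Cluster` interface, the in-memory `MemCluster` and the metric name `kafka_roundtrip_ok` are assumptions for illustration; the output line follows the Prometheus text exposition format.

```go
package main

import "fmt"

// Cluster is a stand-in for a Kafka client (hypothetical interface).
type Cluster interface {
	Produce(topic, msg string) error
	Consume(topic string) (string, error)
}

// MemCluster is an in-memory Cluster used only to make the sketch runnable.
type MemCluster struct {
	topics map[string][]string
}

func NewMemCluster() *MemCluster {
	return &MemCluster{topics: map[string][]string{}}
}

func (m *MemCluster) Produce(topic, msg string) error {
	m.topics[topic] = append(m.topics[topic], msg)
	return nil
}

func (m *MemCluster) Consume(topic string) (string, error) {
	queue := m.topics[topic]
	if len(queue) == 0 {
		return "", fmt.Errorf("topic %q is empty", topic)
	}
	m.topics[topic] = queue[1:]
	return queue[0], nil
}

// RoundTripOK produces one message and checks the same message comes back.
func RoundTripOK(c Cluster, topic, payload string) bool {
	if err := c.Produce(topic, payload); err != nil {
		return false
	}
	got, err := c.Consume(topic)
	return err == nil && got == payload
}

// Metric renders the check result as one line of Prometheus text format:
// 1 for a healthy round trip, 0 otherwise.
func Metric(cluster string, ok bool) string {
	value := 0
	if ok {
		value = 1
	}
	return fmt.Sprintf("kafka_roundtrip_ok{cluster=%q} %d\n", cluster, value)
}

func main() {
	ok := RoundTripOK(NewMemCluster(), "healthcheck", "ping")
	fmt.Print(Metric("dev", ok))
}
```

A check like this exercises the whole path (network, TLS, ACLs, broker, disk) in one go, which is why it complements the per-host exporters rather than replacing them.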
The Terraform lifecycle.
1. Merge changes in Git.
2. Terraform apply.
3. Destroy Servers.
4. Beer.
The Challenges of going Congruent.
1. Time
It takes a lot longer to produce a hardened, stable platform.
2. Complexity
Many products don’t work well in an immutable design pattern.
3. Culture
Trying to maintain the same development pattern across a team gets complicated as a team grows.
Trying to convince Puppet/Ansible and Debian fans to adopt CoreOS and Ignition instead.
4. Windows
Doesn’t handle configuration management well.
.NET Core and dockerised services seem to be the only answer.
5. Finding people to help! 
This is another shameless plug to remind you that we are hiring.
Questions?

London Hashicorp Meetup #22 - Congruent infrastructure @zopa by Ben Coughlan

  • 1. Hashicorp Meetup 26th June 2018 Building Congruent infrastructure @Zopa Ben Coughlan
  • 2. Ben Coughlan, Senior Zopa Reliability Engineer
      • Version 1 (software consultancy), SSE Airtricity, Allianz, Qualcomm.
      • Docker Everywhere! TLS Everywhere!
      • Joined Zopa in February 2017.
      • Ben.Coughlan@zopa.com
  • 3. What is Congruent infrastructure?
      Divergence:
      • Manual administration.
      • Scripts, ssh, rsync, manually mounting and extending disks.
      • Constantly creeping from desired state.
      • Snowflake servers.
      Convergence:
      • Configuration management used to spin up infrastructure.
      • Ansible, Chef, Puppet, Salt.
      • CM tools are unreliable; it is impossible to keep the state of all servers the same.
      • SSH allows for configuration drift.
      Congruence:
      • No configuration management.
      • Immutable infrastructure.
      • No state creep, because there is no SSH.
  • 4. How to build congruent infrastructure.
      • Build software, not servers.
      • The first platform will take weeks or months; the second one will take minutes.
      • Instead of updating servers, you completely replace them. Think cattle, not pets.
      • Build hardened systems, with lots of monitoring.
      • Everything needs to be immutable.
      • Manage your state.
      • Standardise your tools, languages and practices, add an order to everything, and build it into your culture.
  • 5. The tools for SREs @Zopa
      Tools
      • Terraform
      • Vault
      • Consul
      • CoreOS (Ignition)
      • Docker
      Telemetry
      • Prometheus
      • Grafana
      • Splunk
      Languages
      • Golang
      Platforms
      • AWS
      • vSphere
  • 6. The ethos for SREs @Zopa
      • Try to remain vendor agnostic
      • Use open source where possible
      • Everything needs to be immutable
      • Everything runs in a container
      • TLS everywhere
      • Telemetry everywhere
      • Perfect is the enemy of good
  • 7. The Challenge – Infrastructure at Zopa.
      The Monolith
      • Overloaded DCs.
      • Many single points of failure.
      • Many snowflake servers.
      • Every transaction goes through the monolith.
      • Single-digit production releases per week.
      • Huge range of databases sitting on three servers.
      • Divergent infrastructure, very little configuration management.
      Tech Stack
      • Ubuntu VMs (some running applications in Docker).
      • Windows Server running C# and .NET services.
  • 8. The Solution – Infrastructure at Zopa.
      Microservices
      • Multi-cloud.
      • Well-defined, split pipelines.
      • Hundreds of deployments per day.
      • Distributed databases.
      • Immutable infrastructure.
      Tech Stack
      • Kubernetes!
      • Kafka
      • Containerize everything.
      • Windows services to move to .NET Core.
      • Move as much as possible into Kubernetes.
  • 9. Use Case: Building Kafka with congruence in mind.
      Why is it hard to automate a deployment of Kafka?
      1. Kafka is far from cloud native.
      2. It can be unstable, even when built manually.
      3. Zopa uses Kafka for critical transactions and as a persistent database, which meant that we could never afford to lose data.
      4. As it’s open source, it had a lot of inconsistent behaviours: it runs different versions of a JVM across products of the same version, has different JARs on each product, etc. This took a lot of pull requests to resolve over the past 12 months.
      5. Logging to file, SSL/TLS/AAA/ACLs or “security”, dockerised images, rebalancing a cluster without taking it offline, online operations, etc., were all an afterthought with this product.
      6. Initially built outside of Kubernetes, due to the complexities of Kafka.
  • 10. Infrastructure Orchestration – spinning up the machines.
      1. TERRAFORM – Infrastructure Orchestration.
      2. IGNITION – User Data.
      3. Remote storage is attached – managing state.
      4. Certificates are issued to each server.
      5. Bootstrapping takes place.
      6. Service is up and running – telemetry.
  • 11. Terraform
      1. Pulls in secrets from Vault to access AWS (Vault provider).
      2. Pulls in values from remote state files (for the VPC, subnets, etc.).
      3. Utilises reusable modules from the Zopa stack (spot pricing, service discovery, re-attachable storage, etc.).
      4. Creates launch configurations, autoscaling groups, Route53 addresses, etc.
      5. Passes the Ignition user data into the launch configuration.
  • 13. Ansible: Configuration Management
      • Agentless in push mode.
      • In push mode, requires a solid SSH connection.
      • Ansible's reliance on SSH makes it difficult to use securely in an environment with a lot of churn.
      • Ansible has no native protection against concurrent runs.
      • Ansible in push mode is really slow.
      • Its workaround for slow performance (tags) encourages config drift.
  • 15. IGNITION – User Data
      • Kicks in at runtime.
      • Templates dynamic “drop-in” files instead of carrying out configuration management.
      • No configuration drift: you kill the server every time you want to update (clean slate).
      • Adding or changing configurations is much easier.
      • Far easier to use than legacy CM tools like Ansible, Puppet, Chef, etc.
      • No race conditions, no lag, no SSH.
      • Secure! Each server has specific IAM permissions to retrieve its own user data.
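The "drop-in" templating described on this slide can be sketched roughly as follows. This is a minimal illustration, not Zopa's actual templates: the unit name, drop-in name, environment variables and the Ignition spec version string are all assumptions, and only the `systemd.units[].dropins[]` layout is taken from the Ignition v2 config format.

```python
import json


def render_ignition(unit_name, dropin_name, env):
    """Build an Ignition-style config holding one systemd drop-in.

    The drop-in sets environment variables for the unit; every name
    passed in here is a hypothetical example value.
    """
    lines = ["[Service]"]
    for key in sorted(env):
        lines.append(f'Environment="{key}={env[key]}"')
    dropin = {"name": dropin_name, "contents": "\n".join(lines) + "\n"}
    return {
        "ignition": {"version": "2.2.0"},  # assumed spec version
        "systemd": {"units": [{"name": unit_name, "dropins": [dropin]}]},
    }


# Hypothetical values for a single broker.
config = render_ignition(
    "kafka.service",
    "10-broker.conf",
    {"BROKER_ID": "1", "BROKER_RACK": "eu-west-1a"},
)

# The serialised JSON is what would be handed over as instance user data.
user_data = json.dumps(config, indent=2)
```

Because the whole file is regenerated for every new server, changing a setting means changing the template and replacing the machine, rather than editing a running host.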
  • 17. Infrastructure Orchestration – Rack awareness and Persistence.
      1. “Smilodon” is used to attach a remote EBS volume.
      2. Once the volume is attached, a file that Terraform has placed on the volume provides the “rack” (AWS site) and the node number.
      3. The node number is then used to attach the correct network interface.
      4. The network interface gives the node a second IP address, which is tied to the Route53 record.
      This allows us to ensure that the correct data is used for each node, which prevents rebalancing, and ensures we can lose any AWS availability zone without losing any data.
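Steps 2 and 3 above can be sketched as below. The slides do not show the file format or how interfaces are looked up, so the `key=value` layout, the field names and the interface IDs are all hypothetical; the point is only that the node number read from the volume selects the interface.

```python
def parse_node_file(text):
    """Parse 'key=value' lines into a dict (the file format is an assumption)."""
    meta = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        meta[key.strip()] = value.strip()
    return meta


def pick_interface(meta, interfaces):
    """Return the interface ID reserved for this node number.

    Raises ValueError if the volume and the interface disagree on the rack.
    """
    node = int(meta["node"])
    eni = interfaces[node]
    if eni["rack"] != meta["rack"]:
        raise ValueError(
            f"node {node}: volume is in {meta['rack']}, interface is in {eni['rack']}"
        )
    return eni["id"]


# Hypothetical contents of the file placed on the volume.
node_file = "# written at provisioning time\nrack=eu-west-1b\nnode=1\n"

# Hypothetical interface inventory, keyed by node number.
interfaces = {
    0: {"id": "eni-node-0", "rack": "eu-west-1a"},
    1: {"id": "eni-node-1", "rack": "eu-west-1b"},
}

meta = parse_node_file(node_file)
chosen = pick_interface(meta, interfaces)
```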
  • 18. SSL Certificates – Issuing and renewal.
      • Vault PKI endpoint generated for each platform.
      • Consul configuration in place to allow for distributed locks, so that certs are issued one at a time.
      • “Certificated”, an in-house tool, contacts Vault and Consul to issue certs to each machine. On each machine it:
          • Attempts to get a “lock” to issue a cert for itself.
          • Gets the lock.
          • Issues certs, and generates keystores and truststores.
          • Mounts those cert stores into Kafka’s Docker container.
          • Starts/restarts the Kafka service.
          • Releases the lock.
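The lock-then-issue sequence can be illustrated with a small simulation. This is not the in-house tool: an in-process `threading.Lock` stands in for the Consul distributed lock, the steps only append to a log instead of calling Vault or Docker, and the node names are hypothetical. It shows why holding the lock for the whole sequence means only one node is ever mid-restart.

```python
import threading
from contextlib import contextmanager


class InMemoryLock:
    """In-process stand-in for a distributed lock (an assumption for this sketch)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.holder = None

    @contextmanager
    def hold(self, owner):
        self._lock.acquire()  # blocks until no other node holds the lock
        self.holder = owner
        try:
            yield
        finally:
            self.holder = None
            self._lock.release()


STEPS = ["issue certs", "build keystore and truststore", "restart kafka"]


def issue_and_restart(node, lock, log):
    """Run every step while holding the lock, so nodes never interleave."""
    with lock.hold(node):
        for step in STEPS:
            log.append((node, step))


lock = InMemoryLock()
log = []
threads = [
    threading.Thread(target=issue_and_restart, args=(f"kafka-{i}", lock, log))
    for i in range(3)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Whatever order the threads run in, the log always consists of three uninterrupted blocks, one per node.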
  • 19. Bootstrapping takes place
      • Services register with each other independently.
      • Zero manual work is needed to carry out the required setup steps in Kafka.
      • Based on dynamic “counts” in the Ignition code.
      • Dynamic and easily scalable: AWS autoscaling makes horizontal scaling trivial.
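One way count-driven bootstrapping could work is to derive each node's settings from nothing more than its node number and the cluster size. The sketch below is an assumption about the approach, not the deck's code: the hostname pattern, domain and port are hypothetical, while `broker.id` and `zookeeper.connect` are standard Kafka broker settings.

```python
def broker_settings(node, count, domain, port=2181):
    """Derive per-node settings from a node number and a cluster size.

    The 'zookeeper-<n>.<domain>' hostname pattern is hypothetical.
    """
    if not 0 <= node < count:
        raise ValueError(f"node {node} is outside a cluster of size {count}")
    peers = [f"zookeeper-{i}.{domain}:{port}" for i in range(count)]
    return {"broker.id": node, "zookeeper.connect": ",".join(peers)}


# Scaling out is then a matter of raising the count and launching more nodes.
settings = broker_settings(1, 3, "example.internal")
```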
  • 20. Service up and running
      • Prometheus – built-in application and server node exporters used extensively.
      • Custom checks used extensively for service-specific deliverables (regularly consume from and produce to the cluster; check disk, I/O, network and the JVM; use probes for services; etc.).
      • Grafana used to visualise each cluster.
      • Splunk used for log aggregation, and for monitoring ACLs on each interaction/operation.
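A custom check of the kind mentioned above might be exposed to Prometheus roughly like this. The metric name, label names and the check functions are hypothetical; only the output layout follows the Prometheus text exposition format. A real check would produce to and consume from the cluster rather than return a constant.

```python
def run_checks(checks):
    """Run each check function; a raised exception counts as a failure."""
    results = {}
    for name, fn in checks.items():
        try:
            results[name] = bool(fn())
        except Exception:
            results[name] = False
    return results


def render_metrics(results, cluster):
    """Render check results as one Prometheus gauge with a label per check."""
    lines = ["# TYPE kafka_custom_check_ok gauge"]
    for name in sorted(results):
        value = 1 if results[name] else 0
        lines.append(
            f'kafka_custom_check_ok{{check="{name}",cluster="{cluster}"}} {value}'
        )
    return "\n".join(lines) + "\n"


def failing_disk_check():
    # Simulates a check that blows up instead of returning a result.
    raise OSError("simulated disk error")


# Hypothetical checks for a cluster called "demo".
checks = {
    "produce_consume": lambda: True,
    "disk_free": failing_disk_check,
}
output = render_metrics(run_checks(checks), "demo")
```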
  • 21. The Terraform lifecycle.
      1. Merge changes in Git.
      2. Terraform apply.
      3. Destroy servers.
      4. Beer.
  • 22. The Challenges of going Congruent.
      1. Time: It takes a lot longer to produce a hardened, stable platform.
      2. Complexity: Many products don’t work well in an immutable design pattern.
      3. Culture: Maintaining the same development pattern across a team gets complicated as the team grows, as does convincing Puppet/Ansible and Debian fans to adopt CoreOS and Ignition instead.
      4. Windows: Doesn’t handle configuration management well. .NET Core and dockerised services seem to be the only answer.
      5. Finding people to help! This is another shameless plug to remind you that we are hiring.