SlideShare a Scribd company logo
1 of 42
Introducing Envoy-based
service mesh at
Booking.com
Ivan Kruglov
KubeCon Europe 2018
02.05.2018
based in Amsterdam
28M listings
1,5M room nights per day
1700 IT staff
Agenda
• what is service mesh?
• why did we start the project?
• our setup
• learnings
• conclusion
What is
service mesh*?
* the way I understand it
application provider
service consumer service provider
application provider
service consumer service provider
communication infra
application provider
service consumer service provider
proxy proxy
control plane
Why did we start
service mesh
project?
monolith
service
service
service service service
service
service
service
service
service
service
Consistency & Visibility
in communications
Our setup
application
provider
proxy
service consumer service provider
data plane
HTTP(S)
HTTP
• graceful restart
• TCP proxy
application
provider
envoy
service consumer service provider
data plane
HTTP(S)
HTTP
control plane
control plane
• routing rules
• timeout/retry/etc policies
• service discovery
control plane
• in-house (v1/v2 API)
• started in August 2017; Istio is around 0.2
• one complex system at a time
• start with bare-metal support
• minimal abstraction
• yup, just write Envoy config (partially)
• fine with Envoy’s set of features
• self-service for service owners
application
provider
envoy
service consumer service provider
data plane
HTTP(S)
HTTP
control plane
control plane
• routing rules
• timeout/retry/etc policies
• service discovery
ZooKeeper
(service discovery)
Kubernetes
(service discovery)
ZooKeeper
(configuration)
configuration
• Envoy configuration is quite rich
• power vs. usability:
• cluster specification
• virtual host specification
kind: VirtualHostSpec
spec:
name: api.vhost
domains: ["api.service"]
routes:
- match:
prefix: /
route:
cluster: api.prod.cluster
timeout: 10s
retry_policy:
retry_on: 5xx
num_retries: 3
per_try_timeout: 3s
kind: ClusterSpec
metadata:
annotations:
watcher_name: zookeeper.watcher
paths: ["/pools/api/dc"]
spec:
name: api.prod.cluster
lb_policy: LEAST_REQUEST
connect_timeout: 1s
lb_subset_config:
subset_selectors:
- keys: ["DC"]
- keys: ["IsLocal"]
fallback_policy: DEFAULT_SUBSET
default_subset:
IsLocal: true
kind: VirtualHostSpec
spec:
name: api.vhost
domains: ["api.service"]
routes:
- match:
prefix: /
route:
cluster: api.prod.cluster
timeout: 10s
retry_policy:
retry_on: 5xx
num_retries: 3
per_try_timeout: 3s
kind: ClusterSpec
metadata:
annotations:
watcher_name: zookeeper.watcher
paths: ["/pools/api/dc"]
spec:
name: api.prod.cluster
lb_policy: LEAST_REQUEST
connect_timeout: 1s
lb_subset_config:
subset_selectors:
- keys: ["DC"]
- keys: ["IsLocal"]
fallback_policy: DEFAULT_SUBSET
default_subset:
IsLocal: true
kind: VirtualHostSpec
spec:
name: api.vhost
domains: ["api.service"]
routes:
- match:
prefix: /
route:
cluster: api.prod.cluster
timeout: 10s
retry_policy:
retry_on: 5xx
num_retries: 3
per_try_timeout: 3s
kind: ClusterSpec
metadata:
annotations:
watcher_name: zookeeper.watcher
paths: ["/pools/api/dc"]
spec:
name: api.prod.cluster
lb_policy: LEAST_REQUEST
connect_timeout: 1s
lb_subset_config:
subset_selectors:
- keys: ["DC"]
- keys: ["IsLocal"]
fallback_policy: DEFAULT_SUBSET
default_subset:
IsLocal: true
kind: VirtualHostSpec
spec:
name: api.vhost
domains: ["api.service"]
routes:
- match:
prefix: /
route:
cluster: api.prod.cluster
timeout: 10s
retry_policy:
retry_on: 5xx
num_retries: 3
per_try_timeout: 3s
kind: ClusterSpec
metadata:
annotations:
watcher_name: zookeeper.watcher
paths: ["/pools/api/dc"]
spec:
name: api.prod.cluster
lb_policy: LEAST_REQUEST
connect_timeout: 1s
lb_subset_config:
subset_selectors:
- keys: ["DC"]
- keys: ["IsLocal"]
fallback_policy: DEFAULT_SUBSET
default_subset:
IsLocal: true
kind: VirtualHostSpec
spec:
name: api.vhost
domains: ["api.service"]
routes:
- match:
prefix: /
route:
cluster: api.prod.cluster
timeout: 10s
retry_policy:
retry_on: 5xx
num_retries: 3
per_try_timeout: 3s
kind: ClusterSpec
metadata:
annotations:
watcher_name: zookeeper.watcher
paths: ["/pools/api/dc"]
spec:
name: api.prod.cluster
lb_policy: LEAST_REQUEST
connect_timeout: 1s
lb_subset_config:
subset_selectors:
- keys: ["DC"]
- keys: ["IsLocal"]
fallback_policy: DEFAULT_SUBSET
default_subset:
IsLocal: true
envoy cluster spec envoy virtual host spec
kind: VirtualHostSpec
spec:
name: api.vhost
domains: ["api.service"]
routes:
- match:
prefix: /
route:
cluster: api.prod.cluster
timeout: 10s
retry_policy:
retry_on: 5xx
num_retries: 3
per_try_timeout: 3s
kind: ClusterSpec
metadata:
annotations:
watcher_name: zookeeper.watcher
paths: ["/pools/api/dc"]
spec:
name: api.prod.cluster
lb_policy: LEAST_REQUEST
connect_timeout: 1s
lb_subset_config:
subset_selectors:
- keys: ["DC"]
- keys: ["IsLocal"]
fallback_policy: DEFAULT_SUBSET
default_subset:
IsLocal: true
envoy cluster spec envoy virtual host spec
bootstrap wizard
control plane
ZooKeeper
VC VC VC
service A
expose to service D
service C
expose to service Dservice B
envoy
I’m service D
here is service A
and service C specs
application
provider
envoy
service consumer service provider
data plane
HTTP(S)
HTTP
control plane
control plane
• routing rules
• timeout/retry/etc policies
• service discovery
ZooKeeper
(service discovery)
Kubernetes
(service discovery)
ZooKeeper
(configuration)
configuration
changes
Deploying changes
• integration with existent infrastructure
• git-deploy
• vanguard (canary) deployment
• syntax and semantic validation
• fast rollback
application
provider
envoy
service consumer service provider
data plane
HTTP(S)
HTTP
control plane
control plane
• routing rules
• timeout/retry/etc policies
• service discovery
ZooKeeper
(service discovery)
ZooKeeper
(configuration)
configuration
changes
metrics
Kubernetes
(service discovery)
Monitoring
• approach #1 – graphite
• excellent in-house facilities
• aggregations is too heavy at scale
• approach #2 – prometheus
• statsd support
• statsd_exporter
• still iterating
• standard dashboard
envoy
statsd_exporter
envoy…
application
provider
envoy
service consumer service provider
data plane
HTTP(S)
HTTP
control plane
control plane
• routing rules
• timeout/retry/etc policies
• service discovery
ZooKeeper
(service discovery)
ZooKeeper
(configuration)
configuration
changes
metrics
Kubernetes
(service discovery)
Production
Some numbers
• ~ 6 months in production
• ~ 30 projects
• ~ 10K servers
• hundreds of thousands RPS
• overheads…
p50
p75 p90
p95
p99
HTTP
perceived latency
0ms
5ms
10ms
15ms
+ 1ms
HTTPS
perceived latency
p50
p75 p90
p95
p99
10ms
+ 1ms
15ms
20ms
25ms
30ms
40ms
Some findings
• graceful restart works, but not always!
• (major) Envoy upgrades may not be graceful
• ”x-envoy-retry-on = connect-failure” – too few
“x-envoy-retry-on = 5xx” – too much
• gateway-error – HTTP 502, 503, 504
• absence of TCP keepalive => stale connections
• coming in 1.7
• idle_timeout in TCP proxy in 1.6
• cluster_name (1.5) -> cluster_names (1.6)
Conclusion
service mesh* is
a building block on the path to SOA
technically & organizationally
* the way I understand it
Andrei Vereha
Envoy production stories
Friday, May 4th 14:00
Envoy Deep Dive
Thank you!
Ivan Kruglov
ivan.kruglov@booking.com

More Related Content

What's hot

Java MySQL Connector & Connection Pool Features & Optimization
Java MySQL Connector & Connection Pool Features & OptimizationJava MySQL Connector & Connection Pool Features & Optimization
Java MySQL Connector & Connection Pool Features & OptimizationKenny Gryp
 
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013mumrah
 
Bloat and Fragmentation in PostgreSQL
Bloat and Fragmentation in PostgreSQLBloat and Fragmentation in PostgreSQL
Bloat and Fragmentation in PostgreSQLMasahiko Sawada
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm Chandler Huang
 
Towards Flink 2.0: Unified Batch & Stream Processing - Aljoscha Krettek, Verv...
Towards Flink 2.0: Unified Batch & Stream Processing - Aljoscha Krettek, Verv...Towards Flink 2.0: Unified Batch & Stream Processing - Aljoscha Krettek, Verv...
Towards Flink 2.0: Unified Batch & Stream Processing - Aljoscha Krettek, Verv...Flink Forward
 
Hive join optimizations
Hive join optimizationsHive join optimizations
Hive join optimizationsSzehon Ho
 
Introduction to Apache Calcite
Introduction to Apache CalciteIntroduction to Apache Calcite
Introduction to Apache CalciteJordan Halterman
 
From my sql to postgresql using kafka+debezium
From my sql to postgresql using kafka+debeziumFrom my sql to postgresql using kafka+debezium
From my sql to postgresql using kafka+debeziumClement Demonchy
 
SpringBoot 3 Observability
SpringBoot 3 ObservabilitySpringBoot 3 Observability
SpringBoot 3 ObservabilityKnoldus Inc.
 
RabbitMQ + CouchDB = Awesome
RabbitMQ + CouchDB = AwesomeRabbitMQ + CouchDB = Awesome
RabbitMQ + CouchDB = AwesomeLenz Gschwendtner
 
Microservices Architecture Part 2 Event Sourcing and Saga
Microservices Architecture Part 2 Event Sourcing and SagaMicroservices Architecture Part 2 Event Sourcing and Saga
Microservices Architecture Part 2 Event Sourcing and SagaAraf Karsh Hamid
 
API Design, A Quick Guide to REST, SOAP, gRPC, and GraphQL, By Vahid Rahimian
API Design, A Quick Guide to REST, SOAP, gRPC, and GraphQL, By Vahid RahimianAPI Design, A Quick Guide to REST, SOAP, gRPC, and GraphQL, By Vahid Rahimian
API Design, A Quick Guide to REST, SOAP, gRPC, and GraphQL, By Vahid RahimianVahid Rahimian
 
Stability Patterns for Microservices
Stability Patterns for MicroservicesStability Patterns for Microservices
Stability Patterns for Microservicespflueras
 
OpenTelemetry For Operators
OpenTelemetry For OperatorsOpenTelemetry For Operators
OpenTelemetry For OperatorsKevin Brockhoff
 
Cassandra Introduction & Features
Cassandra Introduction & FeaturesCassandra Introduction & Features
Cassandra Introduction & FeaturesDataStax Academy
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing Saga Pattern in Microservices with Spring Statemachine
Introducing Saga Pattern in Microservices with Spring StatemachineIntroducing Saga Pattern in Microservices with Spring Statemachine
Introducing Saga Pattern in Microservices with Spring StatemachineVMware Tanzu
 
Scalability, Availability & Stability Patterns
Scalability, Availability & Stability PatternsScalability, Availability & Stability Patterns
Scalability, Availability & Stability PatternsJonas Bonér
 

What's hot (20)

Java MySQL Connector & Connection Pool Features & Optimization
Java MySQL Connector & Connection Pool Features & OptimizationJava MySQL Connector & Connection Pool Features & Optimization
Java MySQL Connector & Connection Pool Features & Optimization
 
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
 
Bloat and Fragmentation in PostgreSQL
Bloat and Fragmentation in PostgreSQLBloat and Fragmentation in PostgreSQL
Bloat and Fragmentation in PostgreSQL
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm
 
Towards Flink 2.0: Unified Batch & Stream Processing - Aljoscha Krettek, Verv...
Towards Flink 2.0: Unified Batch & Stream Processing - Aljoscha Krettek, Verv...Towards Flink 2.0: Unified Batch & Stream Processing - Aljoscha Krettek, Verv...
Towards Flink 2.0: Unified Batch & Stream Processing - Aljoscha Krettek, Verv...
 
Hive join optimizations
Hive join optimizationsHive join optimizations
Hive join optimizations
 
Introduction to Apache Calcite
Introduction to Apache CalciteIntroduction to Apache Calcite
Introduction to Apache Calcite
 
From my sql to postgresql using kafka+debezium
From my sql to postgresql using kafka+debeziumFrom my sql to postgresql using kafka+debezium
From my sql to postgresql using kafka+debezium
 
SpringBoot 3 Observability
SpringBoot 3 ObservabilitySpringBoot 3 Observability
SpringBoot 3 Observability
 
Domain Driven Design
Domain Driven DesignDomain Driven Design
Domain Driven Design
 
RabbitMQ + CouchDB = Awesome
RabbitMQ + CouchDB = AwesomeRabbitMQ + CouchDB = Awesome
RabbitMQ + CouchDB = Awesome
 
Microservices Architecture Part 2 Event Sourcing and Saga
Microservices Architecture Part 2 Event Sourcing and SagaMicroservices Architecture Part 2 Event Sourcing and Saga
Microservices Architecture Part 2 Event Sourcing and Saga
 
API Design, A Quick Guide to REST, SOAP, gRPC, and GraphQL, By Vahid Rahimian
API Design, A Quick Guide to REST, SOAP, gRPC, and GraphQL, By Vahid RahimianAPI Design, A Quick Guide to REST, SOAP, gRPC, and GraphQL, By Vahid Rahimian
API Design, A Quick Guide to REST, SOAP, gRPC, and GraphQL, By Vahid Rahimian
 
Domain Driven Design
Domain Driven Design Domain Driven Design
Domain Driven Design
 
Stability Patterns for Microservices
Stability Patterns for MicroservicesStability Patterns for Microservices
Stability Patterns for Microservices
 
OpenTelemetry For Operators
OpenTelemetry For OperatorsOpenTelemetry For Operators
OpenTelemetry For Operators
 
Cassandra Introduction & Features
Cassandra Introduction & FeaturesCassandra Introduction & Features
Cassandra Introduction & Features
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing Saga Pattern in Microservices with Spring Statemachine
Introducing Saga Pattern in Microservices with Spring StatemachineIntroducing Saga Pattern in Microservices with Spring Statemachine
Introducing Saga Pattern in Microservices with Spring Statemachine
 
Scalability, Availability & Stability Patterns
Scalability, Availability & Stability PatternsScalability, Availability & Stability Patterns
Scalability, Availability & Stability Patterns
 

Similar to Introducing envoy-based service mesh at Booking.com

Managing Microservices With The Istio Service Mesh on Kubernetes
Managing Microservices With The Istio Service Mesh on KubernetesManaging Microservices With The Istio Service Mesh on Kubernetes
Managing Microservices With The Istio Service Mesh on KubernetesIftach Schonbaum
 
Canadian CNCF: "Emissary-ingress 101: An introduction to the CNCF incubation-...
Canadian CNCF: "Emissary-ingress 101: An introduction to the CNCF incubation-...Canadian CNCF: "Emissary-ingress 101: An introduction to the CNCF incubation-...
Canadian CNCF: "Emissary-ingress 101: An introduction to the CNCF incubation-...Daniel Bryant
 
IVS CTO Night And Day 2018 Winter - [re:Cap] Serverless & Mobile
IVS CTO Night And Day 2018 Winter - [re:Cap] Serverless & MobileIVS CTO Night And Day 2018 Winter - [re:Cap] Serverless & Mobile
IVS CTO Night And Day 2018 Winter - [re:Cap] Serverless & MobileAmazon Web Services Japan
 
Integrating Infrastructure as Code into a Continuous Delivery Pipeline | AWS ...
Integrating Infrastructure as Code into a Continuous Delivery Pipeline | AWS ...Integrating Infrastructure as Code into a Continuous Delivery Pipeline | AWS ...
Integrating Infrastructure as Code into a Continuous Delivery Pipeline | AWS ...Amazon Web Services
 
F5 Meetup presentation automation 2017
F5 Meetup presentation automation 2017F5 Meetup presentation automation 2017
F5 Meetup presentation automation 2017Guy Brown
 
Load Balancing in the Cloud using Nginx & Kubernetes
Load Balancing in the Cloud using Nginx & KubernetesLoad Balancing in the Cloud using Nginx & Kubernetes
Load Balancing in the Cloud using Nginx & KubernetesLee Calcote
 
DCEU 18: Docker Container Networking
DCEU 18: Docker Container NetworkingDCEU 18: Docker Container Networking
DCEU 18: Docker Container NetworkingDocker, Inc.
 
Exploring Twitter's Finagle technology stack for microservices
Exploring Twitter's Finagle technology stack for microservicesExploring Twitter's Finagle technology stack for microservices
Exploring Twitter's Finagle technology stack for microservices💡 Tomasz Kogut
 
Working with PowerVC via its REST APIs
Working with PowerVC via its REST APIsWorking with PowerVC via its REST APIs
Working with PowerVC via its REST APIsJoe Cropper
 
OpenShift Meetup - Tokyo - Service Mesh and Serverless Overview
OpenShift Meetup - Tokyo - Service Mesh and Serverless OverviewOpenShift Meetup - Tokyo - Service Mesh and Serverless Overview
OpenShift Meetup - Tokyo - Service Mesh and Serverless OverviewMaría Angélica Bracho
 
Microservice bus tutorial
Microservice bus tutorialMicroservice bus tutorial
Microservice bus tutorialHuabing Zhao
 
Monitoring in Motion: Monitoring Containers and Amazon ECS
Monitoring in Motion: Monitoring Containers and Amazon ECSMonitoring in Motion: Monitoring Containers and Amazon ECS
Monitoring in Motion: Monitoring Containers and Amazon ECSAmazon Web Services
 
DEVNET-1128 Cisco Intercloud Fabric NB Api's for Business & Providers
DEVNET-1128	Cisco Intercloud Fabric NB Api's for Business & ProvidersDEVNET-1128	Cisco Intercloud Fabric NB Api's for Business & Providers
DEVNET-1128 Cisco Intercloud Fabric NB Api's for Business & ProvidersCisco DevNet
 
Service Discovery using etcd, Consul and Kubernetes
Service Discovery using etcd, Consul and KubernetesService Discovery using etcd, Consul and Kubernetes
Service Discovery using etcd, Consul and KubernetesSreenivas Makam
 
DCUS17 : Docker networking deep dive
DCUS17 : Docker networking deep diveDCUS17 : Docker networking deep dive
DCUS17 : Docker networking deep diveMadhu Venugopal
 
Web services - A Practical Approach
Web services - A Practical ApproachWeb services - A Practical Approach
Web services - A Practical ApproachMadhaiyan Muthu
 
The Good, the Bad and the Ugly of Migrating Hundreds of Legacy Applications ...
 The Good, the Bad and the Ugly of Migrating Hundreds of Legacy Applications ... The Good, the Bad and the Ugly of Migrating Hundreds of Legacy Applications ...
The Good, the Bad and the Ugly of Migrating Hundreds of Legacy Applications ...Josef Adersberger
 
Migrating Hundreds of Legacy Applications to Kubernetes - The Good, the Bad, ...
Migrating Hundreds of Legacy Applications to Kubernetes - The Good, the Bad, ...Migrating Hundreds of Legacy Applications to Kubernetes - The Good, the Bad, ...
Migrating Hundreds of Legacy Applications to Kubernetes - The Good, the Bad, ...QAware GmbH
 
Ansible benelux meetup - Amsterdam 27-5-2015
Ansible benelux meetup - Amsterdam 27-5-2015Ansible benelux meetup - Amsterdam 27-5-2015
Ansible benelux meetup - Amsterdam 27-5-2015Pavel Chunyayev
 

Similar to Introducing envoy-based service mesh at Booking.com (20)

Managing Microservices With The Istio Service Mesh on Kubernetes
Managing Microservices With The Istio Service Mesh on KubernetesManaging Microservices With The Istio Service Mesh on Kubernetes
Managing Microservices With The Istio Service Mesh on Kubernetes
 
Canadian CNCF: "Emissary-ingress 101: An introduction to the CNCF incubation-...
Canadian CNCF: "Emissary-ingress 101: An introduction to the CNCF incubation-...Canadian CNCF: "Emissary-ingress 101: An introduction to the CNCF incubation-...
Canadian CNCF: "Emissary-ingress 101: An introduction to the CNCF incubation-...
 
IVS CTO Night And Day 2018 Winter - [re:Cap] Serverless & Mobile
IVS CTO Night And Day 2018 Winter - [re:Cap] Serverless & MobileIVS CTO Night And Day 2018 Winter - [re:Cap] Serverless & Mobile
IVS CTO Night And Day 2018 Winter - [re:Cap] Serverless & Mobile
 
Integrating Infrastructure as Code into a Continuous Delivery Pipeline | AWS ...
Integrating Infrastructure as Code into a Continuous Delivery Pipeline | AWS ...Integrating Infrastructure as Code into a Continuous Delivery Pipeline | AWS ...
Integrating Infrastructure as Code into a Continuous Delivery Pipeline | AWS ...
 
F5 Meetup presentation automation 2017
F5 Meetup presentation automation 2017F5 Meetup presentation automation 2017
F5 Meetup presentation automation 2017
 
Load Balancing in the Cloud using Nginx & Kubernetes
Load Balancing in the Cloud using Nginx & KubernetesLoad Balancing in the Cloud using Nginx & Kubernetes
Load Balancing in the Cloud using Nginx & Kubernetes
 
DCEU 18: Docker Container Networking
DCEU 18: Docker Container NetworkingDCEU 18: Docker Container Networking
DCEU 18: Docker Container Networking
 
Exploring Twitter's Finagle technology stack for microservices
Exploring Twitter's Finagle technology stack for microservicesExploring Twitter's Finagle technology stack for microservices
Exploring Twitter's Finagle technology stack for microservices
 
Working with PowerVC via its REST APIs
Working with PowerVC via its REST APIsWorking with PowerVC via its REST APIs
Working with PowerVC via its REST APIs
 
OpenShift Meetup - Tokyo - Service Mesh and Serverless Overview
OpenShift Meetup - Tokyo - Service Mesh and Serverless OverviewOpenShift Meetup - Tokyo - Service Mesh and Serverless Overview
OpenShift Meetup - Tokyo - Service Mesh and Serverless Overview
 
REST APIs
REST APIsREST APIs
REST APIs
 
Microservice bus tutorial
Microservice bus tutorialMicroservice bus tutorial
Microservice bus tutorial
 
Monitoring in Motion: Monitoring Containers and Amazon ECS
Monitoring in Motion: Monitoring Containers and Amazon ECSMonitoring in Motion: Monitoring Containers and Amazon ECS
Monitoring in Motion: Monitoring Containers and Amazon ECS
 
DEVNET-1128 Cisco Intercloud Fabric NB Api's for Business & Providers
DEVNET-1128	Cisco Intercloud Fabric NB Api's for Business & ProvidersDEVNET-1128	Cisco Intercloud Fabric NB Api's for Business & Providers
DEVNET-1128 Cisco Intercloud Fabric NB Api's for Business & Providers
 
Service Discovery using etcd, Consul and Kubernetes
Service Discovery using etcd, Consul and KubernetesService Discovery using etcd, Consul and Kubernetes
Service Discovery using etcd, Consul and Kubernetes
 
DCUS17 : Docker networking deep dive
DCUS17 : Docker networking deep diveDCUS17 : Docker networking deep dive
DCUS17 : Docker networking deep dive
 
Web services - A Practical Approach
Web services - A Practical ApproachWeb services - A Practical Approach
Web services - A Practical Approach
 
The Good, the Bad and the Ugly of Migrating Hundreds of Legacy Applications ...
 The Good, the Bad and the Ugly of Migrating Hundreds of Legacy Applications ... The Good, the Bad and the Ugly of Migrating Hundreds of Legacy Applications ...
The Good, the Bad and the Ugly of Migrating Hundreds of Legacy Applications ...
 
Migrating Hundreds of Legacy Applications to Kubernetes - The Good, the Bad, ...
Migrating Hundreds of Legacy Applications to Kubernetes - The Good, the Bad, ...Migrating Hundreds of Legacy Applications to Kubernetes - The Good, the Bad, ...
Migrating Hundreds of Legacy Applications to Kubernetes - The Good, the Bad, ...
 
Ansible benelux meetup - Amsterdam 27-5-2015
Ansible benelux meetup - Amsterdam 27-5-2015Ansible benelux meetup - Amsterdam 27-5-2015
Ansible benelux meetup - Amsterdam 27-5-2015
 

More from Ivan Kruglov

SRE: Site Reliability Engineering
SRE: Site Reliability EngineeringSRE: Site Reliability Engineering
SRE: Site Reliability EngineeringIvan Kruglov
 
Blue-green & canary deployments
Blue-green & canary deploymentsBlue-green & canary deployments
Blue-green & canary deploymentsIvan Kruglov
 
Обратная сторона сервис-ориентированной архитектуры
Обратная сторона сервис-ориентированной архитектурыОбратная сторона сервис-ориентированной архитектуры
Обратная сторона сервис-ориентированной архитектурыIvan Kruglov
 
Kubernetes в Booking.com
Kubernetes в Booking.comKubernetes в Booking.com
Kubernetes в Booking.comIvan Kruglov
 
Тернии контейнеризованных приложений и микросервисов
Тернии контейнеризованных приложений и микросервисовТернии контейнеризованных приложений и микросервисов
Тернии контейнеризованных приложений и микросервисовIvan Kruglov
 
Service mesh для микросервисов
Service mesh для микросервисовService mesh для микросервисов
Service mesh для микросервисовIvan Kruglov
 
SOA: Строим свой service mesh
SOA: Строим свой service meshSOA: Строим свой service mesh
SOA: Строим свой service meshIvan Kruglov
 
Solving some of the scalability problems at booking.com
Solving some of the scalability problems at booking.comSolving some of the scalability problems at booking.com
Solving some of the scalability problems at booking.comIvan Kruglov
 
Sereal: a view from inside
Sereal: a view from insideSereal: a view from inside
Sereal: a view from insideIvan Kruglov
 
SOA: послать запрос на сервер? Что может быть проще?!
SOA: послать запрос на сервер? Что может быть проще?!SOA: послать запрос на сервер? Что может быть проще?!
SOA: послать запрос на сервер? Что может быть проще?!Ivan Kruglov
 
Мониторинг, когда не тестируешь
Мониторинг, когда не тестируешьМониторинг, когда не тестируешь
Мониторинг, когда не тестируешьIvan Kruglov
 
Архитектура поиска в Booking.com
Архитектура поиска в Booking.comАрхитектура поиска в Booking.com
Архитектура поиска в Booking.comIvan Kruglov
 
Processing JSON messages in highspeed
Processing JSON messages in highspeedProcessing JSON messages in highspeed
Processing JSON messages in highspeedIvan Kruglov
 
Bringing code to the data: from MySQL to RocksDB for high volume searches
Bringing code to the data: from MySQL to RocksDB for high volume searchesBringing code to the data: from MySQL to RocksDB for high volume searches
Bringing code to the data: from MySQL to RocksDB for high volume searchesIvan Kruglov
 
Sereal and its tooling
Sereal and its toolingSereal and its tooling
Sereal and its toolingIvan Kruglov
 

More from Ivan Kruglov (16)

SRE: Site Reliability Engineering
SRE: Site Reliability EngineeringSRE: Site Reliability Engineering
SRE: Site Reliability Engineering
 
Blue-green & canary deployments
Blue-green & canary deploymentsBlue-green & canary deployments
Blue-green & canary deployments
 
Обратная сторона сервис-ориентированной архитектуры
Обратная сторона сервис-ориентированной архитектурыОбратная сторона сервис-ориентированной архитектуры
Обратная сторона сервис-ориентированной архитектуры
 
Kubernetes в Booking.com
Kubernetes в Booking.comKubernetes в Booking.com
Kubernetes в Booking.com
 
Тернии контейнеризованных приложений и микросервисов
Тернии контейнеризованных приложений и микросервисовТернии контейнеризованных приложений и микросервисов
Тернии контейнеризованных приложений и микросервисов
 
Service mesh для микросервисов
Service mesh для микросервисовService mesh для микросервисов
Service mesh для микросервисов
 
SOA: Строим свой service mesh
SOA: Строим свой service meshSOA: Строим свой service mesh
SOA: Строим свой service mesh
 
Solving some of the scalability problems at booking.com
Solving some of the scalability problems at booking.comSolving some of the scalability problems at booking.com
Solving some of the scalability problems at booking.com
 
Sereal: a view from inside
Sereal: a view from insideSereal: a view from inside
Sereal: a view from inside
 
SOA: послать запрос на сервер? Что может быть проще?!
SOA: послать запрос на сервер? Что может быть проще?!SOA: послать запрос на сервер? Что может быть проще?!
SOA: послать запрос на сервер? Что может быть проще?!
 
Мониторинг, когда не тестируешь
Мониторинг, когда не тестируешьМониторинг, когда не тестируешь
Мониторинг, когда не тестируешь
 
Архитектура поиска в Booking.com
Архитектура поиска в Booking.comАрхитектура поиска в Booking.com
Архитектура поиска в Booking.com
 
Processing JSON messages in highspeed
Processing JSON messages in highspeedProcessing JSON messages in highspeed
Processing JSON messages in highspeed
 
Bringing code to the data: from MySQL to RocksDB for high volume searches
Bringing code to the data: from MySQL to RocksDB for high volume searchesBringing code to the data: from MySQL to RocksDB for high volume searches
Bringing code to the data: from MySQL to RocksDB for high volume searches
 
Optimize sereal
Optimize serealOptimize sereal
Optimize sereal
 
Sereal and its tooling
Sereal and its toolingSereal and its tooling
Sereal and its tooling
 

Recently uploaded

Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...panagenda
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesThousandEyes
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 

Recently uploaded (20)

Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 

Introducing envoy-based service mesh at Booking.com

  • 1. Introducing Envoy-based service mesh at Booking.com Ivan Kruglov KubeCon Europe 2018 02.05.2018
  • 2. based in Amsterdam 28M listings 1,5M room nights per day 1700 IT staff
  • 3. Agenda • what is service mesh? • why did we start the project? • our setup • learnings • conclusion
  • 4. What is service mesh*? * the way I understand it
  • 6. application provider service consumer service provider communication infra
  • 7. application provider service consumer service provider proxy proxy control plane
  • 8. Why did we start service mesh project?
  • 11.
  • 12.
  • 13. Consistency & Visibility in communications
  • 15. application provider proxy service consumer service provider data plane HTTP(S) HTTP
  • 17. application provider envoy service consumer service provider data plane HTTP(S) HTTP control plane control plane • routing rules • timeout/retry/etc policies • service discovery
  • 18. control plane • in-house (v1/v2 API) • started in August 2017; Istio is around 0.2 • one complex system at a time • start with bare-metal support • minimal abstraction • yup, just write Envoy config (partially) • fine with Envoy’s set of features • self-service for service owners
  • 19. application provider envoy service consumer service provider data plane HTTP(S) HTTP control plane control plane • routing rules • timeout/retry/etc policies • service discovery ZooKeeper (service discovery) Kubernetes (service discovery) ZooKeeper (configuration)
  • 20. configuration • Envoy configuration is quite rich • power vs. usability: • cluster specification • virtual host specification
  • 21. kind: VirtualHostSpec spec: name: api.vhost domains: ["api.service"] routes: - match: prefix: / route: cluster: api.prod.cluster timeout: 10s retry_policy: retry_on: 5xx num_retries: 3 per_try_timeout: 3s kind: ClusterSpec metadata: annotations: watcher_name: zookeeper.watcher paths: ["/pools/api/dc"] spec: name: api.prod.cluster lb_policy: LEAST_REQUEST connect_timeout: 1s lb_subset_config: subset_selectors: - keys: ["DC"] - keys: ["IsLocal"] fallback_policy: DEFAULT_SUBSET default_subset: IsLocal: true
  • 22. kind: VirtualHostSpec spec: name: api.vhost domains: ["api.service"] routes: - match: prefix: / route: cluster: api.prod.cluster timeout: 10s retry_policy: retry_on: 5xx num_retries: 3 per_try_timeout: 3s kind: ClusterSpec metadata: annotations: watcher_name: zookeeper.watcher paths: ["/pools/api/dc"] spec: name: api.prod.cluster lb_policy: LEAST_REQUEST connect_timeout: 1s lb_subset_config: subset_selectors: - keys: ["DC"] - keys: ["IsLocal"] fallback_policy: DEFAULT_SUBSET default_subset: IsLocal: true
  • 23. kind: VirtualHostSpec spec: name: api.vhost domains: ["api.service"] routes: - match: prefix: / route: cluster: api.prod.cluster timeout: 10s retry_policy: retry_on: 5xx num_retries: 3 per_try_timeout: 3s kind: ClusterSpec metadata: annotations: watcher_name: zookeeper.watcher paths: ["/pools/api/dc"] spec: name: api.prod.cluster lb_policy: LEAST_REQUEST connect_timeout: 1s lb_subset_config: subset_selectors: - keys: ["DC"] - keys: ["IsLocal"] fallback_policy: DEFAULT_SUBSET default_subset: IsLocal: true
  • 24. kind: VirtualHostSpec spec: name: api.vhost domains: ["api.service"] routes: - match: prefix: / route: cluster: api.prod.cluster timeout: 10s retry_policy: retry_on: 5xx num_retries: 3 per_try_timeout: 3s kind: ClusterSpec metadata: annotations: watcher_name: zookeeper.watcher paths: ["/pools/api/dc"] spec: name: api.prod.cluster lb_policy: LEAST_REQUEST connect_timeout: 1s lb_subset_config: subset_selectors: - keys: ["DC"] - keys: ["IsLocal"] fallback_policy: DEFAULT_SUBSET default_subset: IsLocal: true
  • 25. kind: VirtualHostSpec spec: name: api.vhost domains: ["api.service"] routes: - match: prefix: / route: cluster: api.prod.cluster timeout: 10s retry_policy: retry_on: 5xx num_retries: 3 per_try_timeout: 3s kind: ClusterSpec metadata: annotations: watcher_name: zookeeper.watcher paths: ["/pools/api/dc"] spec: name: api.prod.cluster lb_policy: LEAST_REQUEST connect_timeout: 1s lb_subset_config: subset_selectors: - keys: ["DC"] - keys: ["IsLocal"] fallback_policy: DEFAULT_SUBSET default_subset: IsLocal: true envoy cluster spec envoy virtual host spec
  • 26. kind: VirtualHostSpec spec: name: api.vhost domains: ["api.service"] routes: - match: prefix: / route: cluster: api.prod.cluster timeout: 10s retry_policy: retry_on: 5xx num_retries: 3 per_try_timeout: 3s kind: ClusterSpec metadata: annotations: watcher_name: zookeeper.watcher paths: ["/pools/api/dc"] spec: name: api.prod.cluster lb_policy: LEAST_REQUEST connect_timeout: 1s lb_subset_config: subset_selectors: - keys: ["DC"] - keys: ["IsLocal"] fallback_policy: DEFAULT_SUBSET default_subset: IsLocal: true envoy cluster spec envoy virtual host spec bootstrap wizard
  • 27. control plane ZooKeeper VC VC VC service A expose to service D service C expose to service Dservice B envoy I’m service D here is service A and service C specs
  • 28. application provider envoy service consumer service provider data plane HTTP(S) HTTP control plane control plane • routing rules • timeout/retry/etc policies • service discovery ZooKeeper (service discovery) Kubernetes (service discovery) ZooKeeper (configuration) configuration changes
  • 29. Deploying changes • integration with existent infrastructure • git-deploy • vanguard (canary) deployment • syntax and semantic validation • fast rollback
  • 30. application provider envoy service consumer service provider data plane HTTP(S) HTTP control plane control plane • routing rules • timeout/retry/etc policies • service discovery ZooKeeper (service discovery) ZooKeeper (configuration) configuration changes metrics Kubernetes (service discovery)
  • 31. Monitoring • approach #1 – graphite • excellent in-house facilities • aggregations is too heavy at scale • approach #2 – prometheus • statsd support • statsd_exporter • still iterating • standard dashboard envoy statsd_exporter envoy…
  • 32.
  • 33. application provider envoy service consumer service provider data plane HTTP(S) HTTP control plane control plane • routing rules • timeout/retry/etc policies • service discovery ZooKeeper (service discovery) ZooKeeper (configuration) configuration changes metrics Kubernetes (service discovery)
  • 35. Some numbers • ~ 6 months in production • ~ 30 projects • ~ 10K servers • hundreds of thousands RPS • overheads…
  • 38. Some findings • graceful restart works, but not always! • (major) Envoy upgrades may not be graceful • ”x-envoy-retry-on = connect-failure” – too few “x-envoy-retry-on = 5xx” – too much • gateway-error – HTTP 502, 503, 504 • absence of TCP keepalive => stale connections • coming in 1.7 • idle_timeout in TCP proxy in 1.6 • cluster_name (1.5) -> cluster_names (1.6)
  • 40. service mesh* is a building block on the path to SOA technically & organizationally * the way I understand it
  • 41. Andrei Vereha Envoy production stories Friday, May 4th 14:00 Envoy Deep Dive