SlideShare a Scribd company logo
Debugging your debugging tools;
What to do when your service mesh goes down in production?
Neeraj Poddar
Co-founder & Chief Architect, Aspen Mesh
John Howard
Software Engineer, Google
What Is a Service Mesh
A service mesh is…
a transparent infrastructure layer that manages communication
between microservices
so that developers can focus on business logic
while operators work independent of dev cycles to provide a
more resilient environment
Key Benefits of a Service Mesh
Security ObservabilityTraffic
Management
Istio Architecture Overview (without Istiod)
Istio Architecture Overview (with Istiod)
Debugging Istio
Control plane
● istiod
Data plane
○ Envoy sidecar
○ Istio Agent
Debugging Envoy
● Connectivity to istiod
● Enabling & understanding access logs
● Debug logging
● Configuration dump
Connectivity to Istiod
Application Access Logs
● Useful for diagnosing traffic flows and failures
● End-to-end request routing across all microservices
● Default “turned off” in Istio
○ Enabled only in “demo” profile
● First tool in your debugging toolkit
Globally Enabling Access Logs
● Globally enabling using “istioctl install”
● Globally customizing encoding which defaults to “TEXT”
● Reverting back to no access log
● This configuration is stored in “istio” configmap as default “mesh” configuration
Enabling Access Logs for a Namespace
Use Namespace scope EnvoyFilter resource (Shouldn’t be used if enabled globally)
Understanding Access Logs
Understanding Response Flags
Envoy Debug Logging
● Debug logging is verbose and expensive
○ Not recommended for permanent production usage
○ Defaults to “warning”
● Enable debug logging for a workload using istioctl
○ Doesn’t require restarting the pod
● Enable debug logging for a workload using annotations
○ Add pod annotation “sidecar.istio.io/logLevel: debug”
○ Requires pod restarts
● Enable debug logging globally
○ Requires pod restarts
Envoy Configuration Overview
Listeners Filter Chains
HTTP Filters
TCP Filters
Routes
Clusters Endpoints
Most specific matched Specific match order
Runs in order
Most specific matched
Envoy Listener Dump
● Virtual inbound listener
● Virtual outbound listener
● Outbound listeners
● All listeners
Envoy HTTP Listener Configuration
Envoy TCP Listener Configuration
Mapping Resources to Envoy Configuration
Resource Type Envoy Configuration Notes
Kubernetes Services Listeners
Routes
Clusters
New listeners if port/protocol combo is unique
Add virtual hosts for existing routes
One cluster per Service/Port/Subset
Kubernetes Endpoints Endpoints
Istio Gateway Listeners Apply to Ingress/Egress Gateways
Istio VirtualService Listeners
Routes
Client side proxies
TLS/TCP affect listeners
HTTP match blocks affect routes
Istio DestinationRule Clusters
Endpoints
Client side proxies
Connection/HTTP/TLS settings
Istio ServiceEntry Clusters
Endpoints
Client side proxies
Istio PeerAuthentication Listeners
Clusters
Server side proxies
Istio RequestAuthentication Listeners Server side proxies
Istio Authorization Policies Listeners Server side proxies
Istio EnvoyFilter All Break glass API to directly manipulate Envoy
Istio Sidecar All Client or server side proxies
Sidecar scope sets config visibility
Istio Agent Bootstrapping
Istiod Istio Agent Envoy
1. Start Envoy
2. Fetch Cert (SDS)
Auth: None (local)
3. Fetch Cert (CSR)
Auth: JWT
4. Fetch Config (XDS)
Auth: Cert
Istio Agent Lifecycle
Istiod Istio Agent Envoy
When cert is near expired:
Refresh Cert
When config/endpoints change:
Push updated config
Push new certs
Periodic reconnect based on
max conn age/keep alives
Istio Agent Debugging
● Viewing secrets/certs using istioctl
○ istioctl proxy-config secret <POD>
○ default secret is the workload certificate
○ ROOTCA is the workload root certificate
○ Other secrets are certificates from DestinationRule/Gateway
○ Describe the full certificate information with openssl
■ istioctl proxy-config secret <POD> -o json | jq
'.dynamicActiveSecrets[0].secret.tlsCertificate.certificateChain.inlineBytes'
-r | base64 -d | openssl x509 -noout -text -in -
● "Warming" secrets
○ If a secret isn't found it will be stuck warming. Traffic that relies on that secret will fail.
○ For gateways, may indicate the referenced Secret is not present. Check logs or istioctl analyze.
○ For workload certificate, may indicate issues connecting to the CA (Istiod).
Istio Agent Readiness
● Probe: every 2s, check readiness status at localhost:15021/healthz/ready
○ On success, mark container ready
○ On 30 consecutive failures, mark container not ready
○ Runs for entire lifetime of the pod
● Envoy proxy is NOT ready: config not received from Pilot
○ Indicates we don't have the correct XDS configuration
○ Can be from rejected config
■ Envoy log: gRPC config for type.googleapis.com/... rejected
■ istiod log: ADS: ACK ERROR ...
■ Metrics: pilot_xds_{cds,lds,eds,rds}_reject
■ Generally a bug cause by non-standard configuration
○ Can be from connectivity issues to istiod
■ StreamAggregatedResources gRPC config stream closed: 14, no healthy upstream
■ Check certificates are ready
■ Check istiod is ready
■ Check Network Policies if applicable
● Envoy proxy is ready - Output on first successful probe
Istiod Debugging
● Enabling debug logging
● ControlZ dashboard
● Istiod metrics for config synchronization/pushes
Istiod Debug Logging
● Default level is “info”.Available options: “debug”, “info”, “warn”, “error”, “fatal”, “none”
● Enable debug logging without restarting “istiod”
○ Toggle logging for a scope or for all scopes
● Enable debug logging via “istioctl”
○ Requires re-deploying istiod
Istiod ControlZ Dashboard
Istiod Metrics
● Grafana Dashboard comes prepopulated with useful metrics to track.
● Datadog has an excellent deep dive on Istio metrics.
● Useful metrics
○ Total number of invalid config: pilot_total_xds_rejects
■ Rejections generally indicate misconfiguration or bugs
○ Time to push config update to proxy: pilot_proxy_convergence_time
■ Slow times may lead to delays in config or endpoints updates. May indicate under-scaled
control plane.
○ Number of XDS clients: pilot_xds
■ Useful to spot unbalanced load distribution
Profiling
● Both control plane and data plane support profiling which can help identify CPU and
memory usages.
● go tool pprof -http=: localhost:8080/debug/pprof/<ENDPOINT>
○ profile: Captures 30s of CPU usage
○ heap: Captures in use and lifetime memory allocations
○ goroutine: Capture all goroutines in use
Istioctl Debugging Capabilities
● Proxy status cluster-wide (Per-proxy view gives detailed config “diffs”)
● Wait for config distribution (experimental)
Istioctl Configuration Analysis
● Provides multi resource validation e.g. “Is a VirtualService tied to a Gateway that doesn’t
exist”
● Complete list of analyzers
● Enable analyzer reporting via CRD status field
Common Problems
Scaling Istiod
● Istiod can scale horizontally and vertically
● Ensure CPU limit is disabled or give sufficient CPU
○ Throttling may lead to slow configuration updates
● Use the latest Istio version!
○ Each release has substantial performance improvements
Factors for Istiod Scaling
● Size of config to generate
○ Impacted by total number of services/pods & Istio resources
○ For large scale, Sidecar scoping is recommended to bring down size of config
● Rate of change of environment
○ Every time a new Service is created or Istio configuration is changed full updates are sent to proxies
○ Adding new endpoints are cheap as only incremental updates are sent
● Number of proxies for which configuration needs to be generated
○ Impacted by number of pods with a sidecar, and gateways
Unbalanced Istiod Load
● xDS connection is a long lived gRPC stream
○ Makes load balancing challenging
● Connection will close every 30m to slowly rebalance load
● Quick scaling events may cause new instances to have small load
○ This can trigger HPA to scale up and down repeatedly
○ Running at least 2 replicas can help alleviate this issue
Understanding mTLS
Client Server
DestinationRule: what type of traffic is sent
● mode: DISABLE sends plain text. Common for services outside
of the mesh
● mode: ISTIO_MUTUAL sends mTLS
● mode: SIMPLE/MUTUAL can be used to originate TLS
PeerAuthentication: what type of traffic is accepted
● mode: DISABLE accept only plain text
● mode: STRICT accept only mTLS
● mode: PERMISSIVE accept mTLS or plain text
Envoy Envoy
Understanding Auto mTLS
Client Server
Auto mTLS: what type of traffic is sent
● If DestinationRule is configured, use DestinationRule settings
● If server has a sidecar and PeerAuthentication allows mTLS,
send mTLS
● Otherwise, send plain text
PeerAuthentication: what type of traffic is accepted
● mode: DISABLE accept only plain text
● mode: STRICT accept only mTLS.
● mode: PERMISSIVE accept mTLS or plain text.
Envoy Envoy
Troubleshooting Guide
Step 4
Access logs
Resp flags
Step 5
Envoy
config dump
Step 6
“istiod”
metrics
Step 8
Profiling
Step 7
“debug” logging
Step 3
“istiod” logs
Step 2
“istioctl proxy-
status”
Step 1
“istioctl
analyze”
Production Istio Installation
● Metrics & logs from control & data plane
○ Setup alerts
● Enable Access logs
● Outbound traffic control
● Strict mTLS instead of “auto”
● Scale out control plane
○ Configure HPA
○ Configure pod anti-affinity
● Non self signed CA certificates
● Locking down Ingress GW ports
● Auto sidecar injection
● Production grade Prometheus & Jaeger
Questions?
Neeraj Poddar
@nrjpoddar
neeraj@aspenmesh.io

More Related Content

What's hot

Kubernetes Security with Calico and Open Policy Agent
Kubernetes Security with Calico and Open Policy AgentKubernetes Security with Calico and Open Policy Agent
Kubernetes Security with Calico and Open Policy Agent
CloudOps2005
 
HTTP2 and gRPC
HTTP2 and gRPCHTTP2 and gRPC
HTTP2 and gRPC
Guo Jing
 
Nginx
NginxNginx
Get Hands-On with NGINX and QUIC+HTTP/3
Get Hands-On with NGINX and QUIC+HTTP/3Get Hands-On with NGINX and QUIC+HTTP/3
Get Hands-On with NGINX and QUIC+HTTP/3
NGINX, Inc.
 
Docker swarm
Docker swarmDocker swarm
Nginx Internals
Nginx InternalsNginx Internals
Nginx Internals
Joshua Zhu
 
Patroni: Kubernetes-native PostgreSQL companion
Patroni: Kubernetes-native PostgreSQL companionPatroni: Kubernetes-native PostgreSQL companion
Patroni: Kubernetes-native PostgreSQL companion
Alexander Kukushkin
 
Observability in Java: Getting Started with OpenTelemetry
Observability in Java: Getting Started with OpenTelemetryObservability in Java: Getting Started with OpenTelemetry
Observability in Java: Getting Started with OpenTelemetry
DevOps.com
 
Microservice 4.0 Journey - From Spring NetFlix OSS to Istio Service Mesh and ...
Microservice 4.0 Journey - From Spring NetFlix OSS to Istio Service Mesh and ...Microservice 4.0 Journey - From Spring NetFlix OSS to Istio Service Mesh and ...
Microservice 4.0 Journey - From Spring NetFlix OSS to Istio Service Mesh and ...
Daniel Oh
 
Kubecon US 2019: Kubernetes Multitenancy WG Deep Dive
Kubecon US 2019: Kubernetes Multitenancy WG Deep DiveKubecon US 2019: Kubernetes Multitenancy WG Deep Dive
Kubecon US 2019: Kubernetes Multitenancy WG Deep Dive
Sanjeev Rampal
 
Infrastructure-as-Code with Pulumi - Better than all the others (like Ansible)?
Infrastructure-as-Code with Pulumi- Better than all the others (like Ansible)?Infrastructure-as-Code with Pulumi- Better than all the others (like Ansible)?
Infrastructure-as-Code with Pulumi - Better than all the others (like Ansible)?
Jonas Hecht
 
NGINX Installation and Tuning
NGINX Installation and TuningNGINX Installation and Tuning
NGINX Installation and Tuning
NGINX, Inc.
 
Security and Multi-Tenancy with Apache Pulsar in Yahoo! (Verizon Media) - Pul...
Security and Multi-Tenancy with Apache Pulsar in Yahoo! (Verizon Media) - Pul...Security and Multi-Tenancy with Apache Pulsar in Yahoo! (Verizon Media) - Pul...
Security and Multi-Tenancy with Apache Pulsar in Yahoo! (Verizon Media) - Pul...
StreamNative
 
Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barce...
Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barce...Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barce...
Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barce...
NoSQLmatters
 
NGINX: High Performance Load Balancing
NGINX: High Performance Load BalancingNGINX: High Performance Load Balancing
NGINX: High Performance Load Balancing
NGINX, Inc.
 
Infrastructure as "Code" with Pulumi
Infrastructure as "Code" with PulumiInfrastructure as "Code" with Pulumi
Infrastructure as "Code" with Pulumi
Venura Athukorala
 
Unlocking the Power of Lakehouse Architectures with Apache Pulsar and Apache ...
Unlocking the Power of Lakehouse Architectures with Apache Pulsar and Apache ...Unlocking the Power of Lakehouse Architectures with Apache Pulsar and Apache ...
Unlocking the Power of Lakehouse Architectures with Apache Pulsar and Apache ...
StreamNative
 
Comparison of Current Service Mesh Architectures
Comparison of Current Service Mesh ArchitecturesComparison of Current Service Mesh Architectures
Comparison of Current Service Mesh Architectures
Mirantis
 
Deep Dive into Building a Secure & Multi-tenant SaaS Solution with NATS
Deep Dive into Building a Secure & Multi-tenant SaaS Solution with NATSDeep Dive into Building a Secure & Multi-tenant SaaS Solution with NATS
Deep Dive into Building a Secure & Multi-tenant SaaS Solution with NATS
NATS
 
OVS v OVS-DPDK
OVS v OVS-DPDKOVS v OVS-DPDK
OVS v OVS-DPDK
Md Safiyat Reza
 

What's hot (20)

Kubernetes Security with Calico and Open Policy Agent
Kubernetes Security with Calico and Open Policy AgentKubernetes Security with Calico and Open Policy Agent
Kubernetes Security with Calico and Open Policy Agent
 
HTTP2 and gRPC
HTTP2 and gRPCHTTP2 and gRPC
HTTP2 and gRPC
 
Nginx
NginxNginx
Nginx
 
Get Hands-On with NGINX and QUIC+HTTP/3
Get Hands-On with NGINX and QUIC+HTTP/3Get Hands-On with NGINX and QUIC+HTTP/3
Get Hands-On with NGINX and QUIC+HTTP/3
 
Docker swarm
Docker swarmDocker swarm
Docker swarm
 
Nginx Internals
Nginx InternalsNginx Internals
Nginx Internals
 
Patroni: Kubernetes-native PostgreSQL companion
Patroni: Kubernetes-native PostgreSQL companionPatroni: Kubernetes-native PostgreSQL companion
Patroni: Kubernetes-native PostgreSQL companion
 
Observability in Java: Getting Started with OpenTelemetry
Observability in Java: Getting Started with OpenTelemetryObservability in Java: Getting Started with OpenTelemetry
Observability in Java: Getting Started with OpenTelemetry
 
Microservice 4.0 Journey - From Spring NetFlix OSS to Istio Service Mesh and ...
Microservice 4.0 Journey - From Spring NetFlix OSS to Istio Service Mesh and ...Microservice 4.0 Journey - From Spring NetFlix OSS to Istio Service Mesh and ...
Microservice 4.0 Journey - From Spring NetFlix OSS to Istio Service Mesh and ...
 
Kubecon US 2019: Kubernetes Multitenancy WG Deep Dive
Kubecon US 2019: Kubernetes Multitenancy WG Deep DiveKubecon US 2019: Kubernetes Multitenancy WG Deep Dive
Kubecon US 2019: Kubernetes Multitenancy WG Deep Dive
 
Infrastructure-as-Code with Pulumi - Better than all the others (like Ansible)?
Infrastructure-as-Code with Pulumi- Better than all the others (like Ansible)?Infrastructure-as-Code with Pulumi- Better than all the others (like Ansible)?
Infrastructure-as-Code with Pulumi - Better than all the others (like Ansible)?
 
NGINX Installation and Tuning
NGINX Installation and TuningNGINX Installation and Tuning
NGINX Installation and Tuning
 
Security and Multi-Tenancy with Apache Pulsar in Yahoo! (Verizon Media) - Pul...
Security and Multi-Tenancy with Apache Pulsar in Yahoo! (Verizon Media) - Pul...Security and Multi-Tenancy with Apache Pulsar in Yahoo! (Verizon Media) - Pul...
Security and Multi-Tenancy with Apache Pulsar in Yahoo! (Verizon Media) - Pul...
 
Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barce...
Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barce...Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barce...
Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barce...
 
NGINX: High Performance Load Balancing
NGINX: High Performance Load BalancingNGINX: High Performance Load Balancing
NGINX: High Performance Load Balancing
 
Infrastructure as "Code" with Pulumi
Infrastructure as "Code" with PulumiInfrastructure as "Code" with Pulumi
Infrastructure as "Code" with Pulumi
 
Unlocking the Power of Lakehouse Architectures with Apache Pulsar and Apache ...
Unlocking the Power of Lakehouse Architectures with Apache Pulsar and Apache ...Unlocking the Power of Lakehouse Architectures with Apache Pulsar and Apache ...
Unlocking the Power of Lakehouse Architectures with Apache Pulsar and Apache ...
 
Comparison of Current Service Mesh Architectures
Comparison of Current Service Mesh ArchitecturesComparison of Current Service Mesh Architectures
Comparison of Current Service Mesh Architectures
 
Deep Dive into Building a Secure & Multi-tenant SaaS Solution with NATS
Deep Dive into Building a Secure & Multi-tenant SaaS Solution with NATSDeep Dive into Building a Secure & Multi-tenant SaaS Solution with NATS
Deep Dive into Building a Secure & Multi-tenant SaaS Solution with NATS
 
OVS v OVS-DPDK
OVS v OVS-DPDKOVS v OVS-DPDK
OVS v OVS-DPDK
 

Similar to Debugging Your Debugging Tools: What to do When Your Service Mesh Goes Down

Secure Developer Access at Decisiv
Secure Developer Access at DecisivSecure Developer Access at Decisiv
Secure Developer Access at Decisiv
Teleport
 
Docker Monitoring Webinar
Docker Monitoring  WebinarDocker Monitoring  Webinar
Docker Monitoring Webinar
Sematext Group, Inc.
 
CNCF Singapore - Introduction to Envoy
CNCF Singapore - Introduction to EnvoyCNCF Singapore - Introduction to Envoy
CNCF Singapore - Introduction to Envoy
Harish
 
Zero Downtime JEE Architectures
Zero Downtime JEE ArchitecturesZero Downtime JEE Architectures
Zero Downtime JEE Architectures
Alexander Penev
 
Tech Talk: ONOS- A Distributed SDN Network Operating System
Tech Talk: ONOS- A Distributed SDN Network Operating SystemTech Talk: ONOS- A Distributed SDN Network Operating System
Tech Talk: ONOS- A Distributed SDN Network Operating System
nvirters
 
KrakenD API Gateway
KrakenD API GatewayKrakenD API Gateway
KrakenD API Gateway
Albert Lombarte
 
Distributed fun with etcd
Distributed fun with etcdDistributed fun with etcd
Distributed fun with etcd
Abdulaziz AlMalki
 
From nothing to Prometheus : one year after
From nothing to Prometheus : one year afterFrom nothing to Prometheus : one year after
From nothing to Prometheus : one year after
Antoine Leroyer
 
Gluster dev session #6 understanding gluster's network communication layer
Gluster dev session #6  understanding gluster's network   communication layerGluster dev session #6  understanding gluster's network   communication layer
Gluster dev session #6 understanding gluster's network communication layer
Pranith Karampuri
 
UltraESB - Advanced services
UltraESB - Advanced servicesUltraESB - Advanced services
UltraESB - Advanced services
AdroitLogic
 
Building a Unified Logging Layer with Fluentd, Elasticsearch and Kibana
Building a Unified Logging Layer with Fluentd, Elasticsearch and KibanaBuilding a Unified Logging Layer with Fluentd, Elasticsearch and Kibana
Building a Unified Logging Layer with Fluentd, Elasticsearch and Kibana
Mushfekur Rahman
 
Introduction-to-Service-Mesh-with-Istio-and-Kiali-OSS-Japan-July-2019.pdf
Introduction-to-Service-Mesh-with-Istio-and-Kiali-OSS-Japan-July-2019.pdfIntroduction-to-Service-Mesh-with-Istio-and-Kiali-OSS-Japan-July-2019.pdf
Introduction-to-Service-Mesh-with-Istio-and-Kiali-OSS-Japan-July-2019.pdf
ALVAROEMMANUELSOCOPP
 
Comparison between zookeeper, etcd 3 and other distributed coordination systems
Comparison between zookeeper, etcd 3 and other distributed coordination systemsComparison between zookeeper, etcd 3 and other distributed coordination systems
Comparison between zookeeper, etcd 3 and other distributed coordination systems
Imesha Sudasingha
 
如何透過 Go-kit 快速搭建微服務架構應用程式實戰
如何透過 Go-kit 快速搭建微服務架構應用程式實戰如何透過 Go-kit 快速搭建微服務架構應用程式實戰
如何透過 Go-kit 快速搭建微服務架構應用程式實戰
KAI CHU CHUNG
 
EuroBSDCon 2013 - Mitigating DDoS Attacks at Layer 7
EuroBSDCon 2013 - Mitigating DDoS Attacks at Layer 7EuroBSDCon 2013 - Mitigating DDoS Attacks at Layer 7
EuroBSDCon 2013 - Mitigating DDoS Attacks at Layer 7
allanjude
 
Turbo charge your logs
Turbo charge your logsTurbo charge your logs
Turbo charge your logs
Jeremy Cook
 
University of Oslo's TSD service - storing sensitive & restricted data by D...
  University of Oslo's TSD service - storing sensitive & restricted data by D...  University of Oslo's TSD service - storing sensitive & restricted data by D...
University of Oslo's TSD service - storing sensitive & restricted data by D...
eurobsdcon
 
Monitoring at scale: Migrating to Prometheus at Fastly
Monitoring at scale: Migrating to Prometheus at FastlyMonitoring at scale: Migrating to Prometheus at Fastly
Monitoring at scale: Migrating to Prometheus at Fastly
Marcus Barczak
 
GrayLog for Java developers FOSDEM 2018
GrayLog for Java developers FOSDEM 2018GrayLog for Java developers FOSDEM 2018
GrayLog for Java developers FOSDEM 2018
Jose Manuel Ortega Candel
 

Similar to Debugging Your Debugging Tools: What to do When Your Service Mesh Goes Down (20)

Secure Developer Access at Decisiv
Secure Developer Access at DecisivSecure Developer Access at Decisiv
Secure Developer Access at Decisiv
 
Docker Monitoring Webinar
Docker Monitoring  WebinarDocker Monitoring  Webinar
Docker Monitoring Webinar
 
CNCF Singapore - Introduction to Envoy
CNCF Singapore - Introduction to EnvoyCNCF Singapore - Introduction to Envoy
CNCF Singapore - Introduction to Envoy
 
Zero Downtime JEE Architectures
Zero Downtime JEE ArchitecturesZero Downtime JEE Architectures
Zero Downtime JEE Architectures
 
Tech Talk: ONOS- A Distributed SDN Network Operating System
Tech Talk: ONOS- A Distributed SDN Network Operating SystemTech Talk: ONOS- A Distributed SDN Network Operating System
Tech Talk: ONOS- A Distributed SDN Network Operating System
 
KrakenD API Gateway
KrakenD API GatewayKrakenD API Gateway
KrakenD API Gateway
 
Distributed fun with etcd
Distributed fun with etcdDistributed fun with etcd
Distributed fun with etcd
 
From nothing to Prometheus : one year after
From nothing to Prometheus : one year afterFrom nothing to Prometheus : one year after
From nothing to Prometheus : one year after
 
Gluster dev session #6 understanding gluster's network communication layer
Gluster dev session #6  understanding gluster's network   communication layerGluster dev session #6  understanding gluster's network   communication layer
Gluster dev session #6 understanding gluster's network communication layer
 
UltraESB - Advanced services
UltraESB - Advanced servicesUltraESB - Advanced services
UltraESB - Advanced services
 
Building a Unified Logging Layer with Fluentd, Elasticsearch and Kibana
Building a Unified Logging Layer with Fluentd, Elasticsearch and KibanaBuilding a Unified Logging Layer with Fluentd, Elasticsearch and Kibana
Building a Unified Logging Layer with Fluentd, Elasticsearch and Kibana
 
Introduction-to-Service-Mesh-with-Istio-and-Kiali-OSS-Japan-July-2019.pdf
Introduction-to-Service-Mesh-with-Istio-and-Kiali-OSS-Japan-July-2019.pdfIntroduction-to-Service-Mesh-with-Istio-and-Kiali-OSS-Japan-July-2019.pdf
Introduction-to-Service-Mesh-with-Istio-and-Kiali-OSS-Japan-July-2019.pdf
 
Comparison between zookeeper, etcd 3 and other distributed coordination systems
Comparison between zookeeper, etcd 3 and other distributed coordination systemsComparison between zookeeper, etcd 3 and other distributed coordination systems
Comparison between zookeeper, etcd 3 and other distributed coordination systems
 
如何透過 Go-kit 快速搭建微服務架構應用程式實戰
如何透過 Go-kit 快速搭建微服務架構應用程式實戰如何透過 Go-kit 快速搭建微服務架構應用程式實戰
如何透過 Go-kit 快速搭建微服務架構應用程式實戰
 
EuroBSDCon 2013 - Mitigating DDoS Attacks at Layer 7
EuroBSDCon 2013 - Mitigating DDoS Attacks at Layer 7EuroBSDCon 2013 - Mitigating DDoS Attacks at Layer 7
EuroBSDCon 2013 - Mitigating DDoS Attacks at Layer 7
 
Turbo charge your logs
Turbo charge your logsTurbo charge your logs
Turbo charge your logs
 
University of Oslo's TSD service - storing sensitive & restricted data by D...
  University of Oslo's TSD service - storing sensitive & restricted data by D...  University of Oslo's TSD service - storing sensitive & restricted data by D...
University of Oslo's TSD service - storing sensitive & restricted data by D...
 
Monitoring at scale: Migrating to Prometheus at Fastly
Monitoring at scale: Migrating to Prometheus at FastlyMonitoring at scale: Migrating to Prometheus at Fastly
Monitoring at scale: Migrating to Prometheus at Fastly
 
Pcp
PcpPcp
Pcp
 
GrayLog for Java developers FOSDEM 2018
GrayLog for Java developers FOSDEM 2018GrayLog for Java developers FOSDEM 2018
GrayLog for Java developers FOSDEM 2018
 

Recently uploaded

A Comprehensive Look at Generative AI in Retail App Testing.pdf
A Comprehensive Look at Generative AI in Retail App Testing.pdfA Comprehensive Look at Generative AI in Retail App Testing.pdf
A Comprehensive Look at Generative AI in Retail App Testing.pdf
kalichargn70th171
 
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Globus
 
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data AnalysisProviding Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
Globus
 
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus
 
Enhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdfEnhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdf
Globus
 
RISE with SAP and Journey to the Intelligent Enterprise
RISE with SAP and Journey to the Intelligent EnterpriseRISE with SAP and Journey to the Intelligent Enterprise
RISE with SAP and Journey to the Intelligent Enterprise
Srikant77
 
Lecture 1 Introduction to games development
Lecture 1 Introduction to games developmentLecture 1 Introduction to games development
Lecture 1 Introduction to games development
abdulrafaychaudhry
 
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, BetterWebinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
XfilesPro
 
Enterprise Resource Planning System in Telangana
Enterprise Resource Planning System in TelanganaEnterprise Resource Planning System in Telangana
Enterprise Resource Planning System in Telangana
NYGGS Automation Suite
 
Understanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSageUnderstanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSage
Globus
 
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
Anthony Dahanne
 
Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604
Fermin Galan
 
Into the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdfInto the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdf
Ortus Solutions, Corp
 
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
informapgpstrackings
 
BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024
Ortus Solutions, Corp
 
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.ILBeyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Natan Silnitsky
 
Enhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdf
Enhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdfEnhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdf
Enhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdf
Jay Das
 
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus
 
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdfDominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
AMB-Review
 
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Globus
 

Recently uploaded (20)

A Comprehensive Look at Generative AI in Retail App Testing.pdf
A Comprehensive Look at Generative AI in Retail App Testing.pdfA Comprehensive Look at Generative AI in Retail App Testing.pdf
A Comprehensive Look at Generative AI in Retail App Testing.pdf
 
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
 
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data AnalysisProviding Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
 
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024
 
Enhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdfEnhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdf
 
RISE with SAP and Journey to the Intelligent Enterprise
RISE with SAP and Journey to the Intelligent EnterpriseRISE with SAP and Journey to the Intelligent Enterprise
RISE with SAP and Journey to the Intelligent Enterprise
 
Lecture 1 Introduction to games development
Lecture 1 Introduction to games developmentLecture 1 Introduction to games development
Lecture 1 Introduction to games development
 
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, BetterWebinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
 
Enterprise Resource Planning System in Telangana
Enterprise Resource Planning System in TelanganaEnterprise Resource Planning System in Telangana
Enterprise Resource Planning System in Telangana
 
Understanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSageUnderstanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSage
 
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
 
Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604
 
Into the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdfInto the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdf
 
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
 
BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024
 
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.ILBeyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
 
Enhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdf
Enhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdfEnhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdf
Enhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdf
 
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024
 
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdfDominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
 
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
 

Debugging Your Debugging Tools: What to do When Your Service Mesh Goes Down

  • 1. Debugging your debugging tools; What to do when your service mesh goes down in production? Neeraj Poddar Co-founder & Chief Architect, Aspen Mesh John Howard Software Engineer, Google
  • 2. What Is a Service Mesh A service mesh is… a transparent infrastructure layer that manages communication between microservices so that developers can focus on business logic while operators work independent of dev cycles to provide a more resilient environment
  • 3. Key Benefits of a Service Mesh Security ObservabilityTraffic Management
  • 4. Istio Architecture Overview (without Istiod)
  • 6. Debugging Istio Control plane ● istiod Data plane ○ Envoy sidecar ○ Istio Agent
  • 7. Debugging Envoy ● Connectivity to istiod ● Enabling & understanding access logs ● Debug logging ● Configuration dump
  • 9. Application Access Logs ● Useful for diagnosing traffic flows and failures ● End-to-end request routing across all microservices ● Default “turned off” in Istio ○ Enabled only in “demo” profile ● First tool in your debugging toolkit
  • 10. Globally Enabling Access Logs ● Globally enabling using “istioctl install” ● Globally customizing encoding which defaults to “TEXT” ● Reverting back to no access log ● This configuration is stored in “istio” configmap as default “mesh” configuration
  • 11. Enabling Access Logs for a Namespace Use Namespace scope EnvoyFilter resource (Shouldn’t be used if enabled globally)
  • 14. Envoy Debug Logging ● Debug logging is verbose and expensive ○ Not recommended for permanent production usage ○ Defaults to “warning” ● Enable debug logging for a workload using istioctl ○ Doesn’t require restarting the pod ● Enable debug logging for a workload using annotations ○ Add pod annotation “sidecar.istio.io/logLevel: debug” ○ Requires pod restarts ● Enable debug logging globally ○ Requires pod restarts
  • 15. Envoy Configuration Overview Listeners Filter Chains HTTP Filters TCP Filters Routes Clusters Endpoints Most specific matched Specific match order Runs in order Most specific matched
  • 16. Envoy Listener Dump ● Virtual inbound listener ● Virtual outbound listener ● Outbound listeners ● All listeners
  • 17. Envoy HTTP Listener Configuration
  • 18. Envoy TCP Listener Configuration
  • 19. Mapping Resources to Envoy Configuration Resource Type Envoy Configuration Notes Kubernetes Services Listeners Routes Clusters New listeners if port/protocol combo is unique Add virtual hosts for existing routes One cluster per Service/Port/Subset Kubernetes Endpoints Endpoints Istio Gateway Listeners Apply to Ingress/Egress Gateways Istio VirtualService Listeners Routes Client side proxies TLS/TCP affect listeners HTTP match blocks affect routes Istio DestinationRule Clusters Endpoints Client side proxies Connection/HTTP/TLS settings Istio ServiceEntry Clusters Endpoints Client side proxies Istio PeerAuthentication Listeners Clusters Server side proxies Istio RequestAuthentication Listeners Server side proxies Istio Authorization Policies Listeners Server side proxies Istio EnvoyFilter All Break glass API to directly manipulate Envoy Istio Sidecar All Client or server side proxies Sidecar scope sets config visibility
  • 20. Istio Agent Bootstrapping Istiod Istio Agent Envoy 1. Start Envoy 2. Fetch Cert (SDS) Auth: None (local) 3. Fetch Cert (CSR) Auth: JWT 4. Fetch Config (XDS) Auth: Cert
  • 21. Istio Agent Lifecycle Istiod Istio Agent Envoy When cert is near expired: Refresh Cert When config/endpoints change: Push updated config Push new certs Periodic reconnect based on max conn age/keep alives
  • 22. Istio Agent Debugging ● Viewing secrets/certs using istioctl ○ istioctl proxy-config secret <POD> ○ default secret is the workload certificate ○ ROOTCA is the workload root certificate ○ Other secrets are certificates from DestinationRule/Gateway ○ Describe the full certificate information with openssl ■ istioctl proxy-config secret <POD> -o json | jq '.dynamicActiveSecrets[0].secret.tlsCertificate.certificateChain.inlineBytes' -r | base64 -d | openssl x509 -noout -text -in - ● "Warming" secrets ○ If a secret isn't found it will be stuck warming. Traffic that relies on that secret will fail. ○ For gateways, may indicate the referenced Secret is not present. Check logs or istioctl analyze. ○ For workload certificate, may indicate issues connecting to the CA (Istiod).
  • 23. Istio Agent Readiness ● Probe: every 2s, check readiness status at localhost:15021/healthz/ready ○ On success, mark container ready ○ On 30 consecutive failures, mark container not ready ○ Runs for entire lifetime of the pod ● Envoy proxy is NOT ready: config not received from Pilot ○ Indicates we don't have the correct XDS configuration ○ Can be from rejected config ■ Envoy log: gRPC config for type.googleapis.com/... rejected ■ istiod log: ADS: ACK ERROR ... ■ Metrics: pilot_xds_{cds,lds,eds,rds}_reject ■ Generally a bug cause by non-standard configuration ○ Can be from connectivity issues to istiod ■ StreamAggregatedResources gRPC config stream closed: 14, no healthy upstream ■ Check certificates are ready ■ Check istiod is ready ■ Check Network Policies if applicable ● Envoy proxy is ready - Output on first successful probe
  • 24. Istiod Debugging ● Enabling debug logging ● ControlZ dashboard ● Istiod metrics for config synchronization/pushes
  • 25. Istiod Debug Logging ● Default level is “info”.Available options: “debug”, “info”, “warn”, “error”, “fatal”, “none” ● Enable debug logging without restarting “istiod” ○ Toggle logging for a scope or for all scopes ● Enable debug logging via “istioctl” ○ Requires re-deploying istiod
  • 27. Istiod Metrics ● Grafana Dashboard comes prepopulated with useful metrics to track. ● Datadog has an excellent deep dive on Istio metrics. ● Useful metrics ○ Total number of invalid config: pilot_total_xds_rejects ■ Rejections generally indicate misconfiguration or bugs ○ Time to push config update to proxy: pilot_proxy_convergence_time ■ Slow times may lead to delays in config or endpoints updates. May indicate under-scaled control plane. ○ Number of XDS clients: pilot_xds ■ Useful to spot unbalanced load distribution
  • 28. Profiling ● Both control plane and data plane support profiling which can help identify CPU and memory usages. ● go tool pprof -http=: localhost:8080/debug/pprof/<ENDPOINT> ○ profile: Captures 30s of CPU usage ○ heap: Captures in use and lifetime memory allocations ○ goroutine: Capture all goroutines in use
  • 29. Istioctl Debugging Capabilities ● Proxy status cluster-wide (Per-proxy view gives detailed config “diffs”) ● Wait for config distribution (experimental)
  • 30. Istioctl Configuration Analysis ● Provides multi resource validation e.g. “Is a VirtualService tied to a Gateway that doesn’t exist” ● Complete list of analyzers ● Enable analyzer reporting via CRD status field
  • 32. Scaling Istiod ● Istiod can scale horizontally and vertically ● Ensure CPU limit is disabled or give sufficient CPU ○ Throttling may lead to slow configuration updates ● Use the latest Istio version! ○ Each release has substantial performance improvements
  • 33. Factors for Istiod Scaling ● Size of config to generate ○ Impacted by total number of services/pods & Istio resources ○ For large scale, Sidecar scoping is recommended to bring down size of config ● Rate of change of environment ○ Every time a new Service is created or Istio configuration is changed full updates are sent to proxies ○ Adding new endpoints are cheap as only incremental updates are sent ● Number of proxies for which configuration needs to be generated ○ Impacted by number of pods with a sidecar, and gateways
  • 34. Unbalanced Istiod Load ● xDS connection is a long lived gRPC stream ○ Makes load balancing challenging ● Connection will close every 30m to slowly rebalance load ● Quick scaling events may cause new instances to have small load ○ This can trigger HPA to scale up and down repeatedly ○ Running at least 2 replicas can help alleviate this issue
  • 35. Understanding mTLS Client Server DestinationRule: what type of traffic is sent ● mode: DISABLE sends plain text. Common for services outside of the mesh ● mode: ISTIO_MUTUAL sends mTLS ● mode: SIMPLE/MUTUAL can be used to originate TLS PeerAuthentication: what type of traffic is accepted ● mode: DISABLE accept only plain text ● mode: STRICT accept only mTLS ● mode: PERMISSIVE accept mTLS or plain text Envoy Envoy
  • 36. Understanding Auto mTLS Client Server Auto mTLS: what type of traffic is sent ● If DestinationRule is configured, use DestinationRule settings ● If server has a sidecar and PeerAuthentication allows mTLS, send mTLS ● Otherwise, send plain text PeerAuthentication: what type of traffic is accepted ● mode: DISABLE accept only plain text ● mode: STRICT accept only mTLS. ● mode: PERMISSIVE accept mTLS or plain text. Envoy Envoy
  • 37. Troubleshooting Guide Step 4 Access logs Resp flags Step 5 Envoy config dump Step 6 “istiod” metrics Step 8 Profiling Step 7 “debug” logging Step 3 “istiod” logs Step 2 “istioctl proxy- status” Step 1 “istioctl analyze”
  • 38. Production Istio Installation ● Metrics & logs from control & data plane ○ Setup alerts ● Enable Access logs ● Outbound traffic control ● Strict mTLS instead of “auto” ● Scale out control plane ○ Configure HPA ○ Configure pod anti-affinity ● Non self signed CA certificates ● Locking down Ingress GW ports ● Auto sidecar injection ● Production grade Prometheus & Jaeger