Comprehensive container based service monitoring with kubernetes and istio

The document provides an overview of using Kubernetes and Istio to monitor microservices. It discusses using Istio to collect telemetry data on requests, including rate, errors, and duration. This data can be visualized in Grafana dashboards to monitor key performance indicators. Histograms are recommended to capture request durations as they allow calculating percentiles over time for service level indicators. An Istio metrics adapter is also described that sends telemetry data to Circonus for long-term storage and alerting.

Comprehensive Container-Based Service
Monitoring with Kubernetes and Istio
SREcon Asia Australia 2018-06-06
Fred Moyer

Monitoring Nerd
@phredmoyer
Developer Evangelist
@Circonus / @IRONdb
@IstioMesh Geek
Observability and
Statistics Dork

Talk Agenda
❏ Istio Overview
❏ Service Level Objectives
❏ RED Dashboard
❏ Histogram Telemetry
❏ Istio Metrics Adapter
❏ Asking the Right Questions

Istio.io
“An open platform to connect,
manage, and secure
microservices”

K8S + Istio
● Orchestration
● Deployment
● Scaling
● Data Plane
● Policy Enforcement
● Traffic Management
● Telemetry
● Control Plane

Istio Sample App
$ istioctl create -f apps/bookinfo.yaml

Istio Sample App
kind: Deployment
metadata:
  name: ratings-v1
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: ratings
        version: v1
    spec:
      containers:
      - name: ratings
        image: istio/examples-bookinfo-ratings-v1
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 9080

Istio Sample App
$ istioctl create -f apps/bookinfo/route-rule-reviews-v2-v3.yaml
type: route-rule
spec:
  name: reviews-default
  destination: reviews.default.svc.cluster.local
  precedence: 1
  route:
  - tags:
      version: v2
    weight: 80
  - tags:
      version: v3
    weight: 20

Istio K8s Services
> kubectl get services
NAME          CLUSTER-IP   EXTERNAL-IP   PORT(S)    AGE
details       10.0.0.31    <none>        9080/TCP   6m
kubernetes    10.0.0.1     <none>        443/TCP    7d
productpage   10.0.0.120   <none>        9080/TCP   6m
ratings       10.0.0.15    <none>        9080/TCP   6m
reviews       10.0.0.170   <none>        9080/TCP   6m

Istio K8s App Pods
> kubectl get pods
NAME                       READY   STATUS    RESTARTS   AGE
details-v1-1520924117      2/2     Running   0          6m
productpage-v1-560495357   2/2     Running   0          6m
ratings-v1-734492171       2/2     Running   0          6m
reviews-v1-874083890       2/2     Running   0          6m
reviews-v2-1343845940      2/2     Running   0          6m
reviews-v3-1813607990      2/2     Running   0          6m

Istio K8s System Pods
> kubectl get pods -n istio-system
NAME                       READY   STATUS    RESTARTS   AGE
istio-ca-797dfb66c5        1/1     Running   0          2m
istio-ingress-84f75844c4   1/1     Running   0          2m
istio-egress-29a16321d3    1/1     Running   0          2m
istio-mixer-9bf85fc68      3/3     Running   0          2m
istio-pilot-575679c565     2/2     Running   0          2m
grafana-182346ba12         2/2     Running   0          2m
prometheus-837521fe34      2/2     Running   0          2m

Service Level Objectives
● SLI - Service Level Indicator
● SLO - Service Level Objective
● SLA - Service Level Agreement

“SLIs drive SLOs which
inform SLAs”
SLI - Service Level Indicator, a measure
of the service that can be quantified
“95th percentile latency of homepage
requests over past 5 minutes < 300ms”
Excerpted from
“SLIs, SLOs, SLAs, oh my!”
@sethvargo @lizthegrey
https://youtu.be/tEylFyxbDLE

“SLIs drive SLOs which
inform SLAs”
SLO - Service Level Objective, a target
for Service Level Indicators
“95th percentile homepage SLI will
succeed 99.9% over trailing year”
Excerpted from
“SLIs, SLOs, SLAs, oh my!”
@sethvargo @lizthegrey
https://youtu.be/tEylFyxbDLE

“SLIs drive SLOs which
inform SLAs”
SLA - Service Level Agreement, a legal
agreement between a customer and a
service provider based on SLOs
“Service credits if 95th percentile
homepage SLI succeeds less than
99.5% over trailing year”
Excerpted from
“SLIs, SLOs, SLAs, oh my!”
@sethvargo @lizthegrey
https://youtu.be/tEylFyxbDLE
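
The SLO and SLA examples both reduce to comparing a success ratio against a target. A small sketch, assuming the SLI is evaluated once per 5-minute window over a trailing year; the helper `meets` and the per-mille constants are hypothetical names chosen for illustration.

```go
package main

import "fmt"

const (
	sloPerMille = 999 // SLO: the SLI succeeds 99.9% of the time
	slaPerMille = 995 // SLA: service credits are owed below 99.5%
)

// meets reports whether good/total is at least targetPerMille/1000.
// Integer math avoids floating-point rounding right at the boundary.
func meets(good, total, targetPerMille int) bool {
	return good*1000 >= total*targetPerMille
}

func main() {
	// A trailing year of 5-minute windows: 365 days * 24 hours * 12 per hour.
	total := 365 * 24 * 12
	for _, bad := range []int{100, 200, 600} {
		good := total - bad
		fmt.Printf("bad=%d sloMet=%t creditsOwed=%t\n",
			bad, meets(good, total, sloPerMille), !meets(good, total, slaPerMille))
	}
}
```

The gap between the two targets is deliberate in the talk's examples: the SLO can be missed (200 bad windows here) without the SLA being breached.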

Talk Agenda
❏ Istio Overview
❏ Service Level Objectives
❏ RED Dashboard
❏ Histogram Telemetry Collection
❏ Istio Metrics Adapter
❏ Asking the Right Questions

Emerging Standards
● USE
○ Utilization, Saturation, Errors
○ Introduced by Brendan Gregg @brendangregg
○ KPIs for host based health
● The Four Golden Signals
○ Latency, Traffic, Errors, Saturation
○ Covered in the Google SRE Book
○ Extended version of RED
● RED
○ Rate, Errors, Duration
○ Introduced by Tom Wilkie @tom_wilkie
○ KPIs for API based health, SLI focused

Containers?
● Ephemeral
● High Cardinality
● Difficult to Instrument
● Instrument Services, Not Containers

Istio Mixer Provided Telemetry
● Request Count by Response Code
● Request Duration
● Request Size
● Response Size
● Connection Received Bytes
● Connection Sent Bytes
● Connection Duration
● Template Based MetaData (Metric Tags)

RED
● Rate
○ Requests per second
○ First derivative of request count provided by Istio
● Errors
○ Unsuccessful requests per second
○ First derivative of failed request count provided by Istio
● Duration
○ Request latency provided by Istio

Duration
Problems:
● Percentiles > averages, but have limitations
○ Aggregated metric, fixed time window
○ Cannot be re-aggregated for cluster health
○ Cannot be averaged (common mistake)
● Stored aggregates are outputs, not inputs
● Difficult to measure cluster SLIs
● Leave a lot to be desired for
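
The "cannot be averaged" point is easy to demonstrate with made-up numbers. This sketch assumes two nodes with very different traffic volumes; all values are hypothetical, and `percentile` and `repeat` are illustrative helpers.

```go
package main

import (
	"fmt"
	"sort"
)

// percentile returns the q-th percentile (1..100) using the nearest-rank
// method, or 0 for an empty input.
func percentile(samples []float64, q int) float64 {
	if len(samples) == 0 {
		return 0
	}
	sorted := append([]float64(nil), samples...)
	sort.Float64s(sorted)
	rank := (q*len(sorted) + 99) / 100 // integer ceil(q/100 * n)
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

// repeat returns n copies of v.
func repeat(v float64, n int) []float64 {
	out := make([]float64, n)
	for i := range out {
		out[i] = v
	}
	return out
}

func main() {
	nodeA := repeat(10, 1000) // busy node: 1000 requests at 10 ms
	nodeB := repeat(1000, 10) // quiet node: 10 requests at 1000 ms

	// Common mistake: average the per-node percentiles.
	averaged := (percentile(nodeA, 95) + percentile(nodeB, 95)) / 2

	// Correct: compute the percentile over all requests in the cluster.
	merged := append(append([]float64(nil), nodeA...), nodeB...)
	actual := percentile(merged, 95)

	fmt.Printf("averaged p95=%.0f ms, cluster p95=%.0f ms\n", averaged, actual)
}
```

The averaged figure gives each node equal weight regardless of how many requests it served, which is why it diverges from the percentile of the combined data.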

WE CAN DO BETTER
WE HAVE THE
TECHNOLOGY^W^W
MATH
https://youtu.be/yCX1Ze3OcKo

Histogram
https://github.com/circonus-labs/circonusllhist

Log linear histogram
https://github.com/circonus-labs/circonusllhist
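
A log-linear histogram keeps a fixed number of significant digits per bin, so bin width grows with magnitude. The following is a simplified sketch of that idea on non-negative integers (for example, latencies in microseconds), keeping two significant decimal digits. It is not the circonusllhist API; `bin` and `record` are hypothetical names.

```go
package main

import "fmt"

// bin returns the lower bound and width of the bucket that holds v,
// keeping two significant decimal digits. Values below 100 get their
// own width-1 bucket.
func bin(v uint64) (lower, width uint64) {
	width = 1
	for v >= 100 {
		v /= 10
		width *= 10
	}
	return v * width, width
}

// record increments the count of the bucket that holds v.
// The histogram is keyed by bucket lower bound.
func record(h map[uint64]uint64, v uint64) {
	lower, _ := bin(v)
	h[lower]++
}

func main() {
	h := map[uint64]uint64{}
	for _, latencyMicros := range []uint64{87, 1234, 1299, 215000} {
		record(h, latencyMicros)
	}
	fmt.Println(h[87], h[1200], h[210000])
}
```

Because every bucket is identified only by its bounds, two such histograms can be merged by adding counts bucket by bucket.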

Duration - Histogram
https://github.com/circonus-labs/circonusllhist

Duration - SLI
SLI - “90th percentile latency of requests over
past 5 minutes < 1,000ms”
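
Because histograms can be merged, a cluster-wide SLI can be evaluated from per-node data. This sketch assumes fixed-width 100 ms bins for brevity (rather than log-linear bins) and hypothetical per-node counts; `hist`, `merge`, `percentileBin`, and `sliMet` are illustrative names.

```go
package main

import (
	"fmt"
	"sort"
)

// binWidthMs is the assumed fixed bin width in milliseconds.
const binWidthMs = 100

// hist maps a bin's lower bound (ms) to its request count.
type hist map[uint64]uint64

// merge adds the counts of all histograms bin by bin.
func merge(hs ...hist) hist {
	out := hist{}
	for _, h := range hs {
		for lo, n := range h {
			out[lo] += n
		}
	}
	return out
}

// percentileBin returns the lower bound of the bin that contains the q-th
// percentile (nearest rank). ok is false when the histogram is empty.
func percentileBin(h hist, q uint64) (lower uint64, ok bool) {
	var total uint64
	bounds := make([]uint64, 0, len(h))
	for lo, n := range h {
		bounds = append(bounds, lo)
		total += n
	}
	if total == 0 {
		return 0, false
	}
	sort.Slice(bounds, func(i, j int) bool { return bounds[i] < bounds[j] })
	rank := (q*total + 99) / 100 // integer ceil(q/100 * total)
	var seen uint64
	for _, lo := range bounds {
		seen += h[lo]
		if seen >= rank {
			return lo, true
		}
	}
	return bounds[len(bounds)-1], true
}

// sliMet is conservative: the whole percentile bin must sit below the threshold.
func sliMet(h hist, q, thresholdMs uint64) bool {
	lo, ok := percentileBin(h, q)
	return ok && lo+binWidthMs <= thresholdMs
}

func main() {
	nodeA := hist{100: 800, 200: 150, 1200: 50}
	nodeB := hist{100: 50, 900: 30, 1500: 20}
	cluster := merge(nodeA, nodeB)

	p90, _ := percentileBin(cluster, 90)
	fmt.Println(p90, sliMet(cluster, 90, 1000))
}
```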

Istio Metrics Adapter
● Golang based adapter API
● In process (built into the Mixer executable)
○ Out of process for new adapter dev
● Set of handler hooks and YAML files

Istio Metrics Adapter
“SHOW ME THE CODE”
https://github.com/istio/istio/
https://github.com/istio/istio/blob/master/mixer/adapter/circonus

Istio Metrics Adapter
// HandleMetric submits metrics to Circonus via circonus-gometrics
func (h *handler) HandleMetric(ctx context.Context, insts []*metric.Instance) error {
    for _, inst := range insts {
        metricName := inst.Name
        metricType := h.metrics[metricName]
        switch metricType {
        case config.GAUGE:
            value, _ := inst.Value.(int64)
            h.cm.Gauge(metricName, value)
        case config.COUNTER:
            h.cm.Increment(metricName)
        case config.DISTRIBUTION:
            value, _ := inst.Value.(time.Duration)
            h.cm.Timing(metricName, float64(value))
        }
    }
    return nil
}

Istio Metrics Adapter
type handler struct {
    cm      *cgm.CirconusMetrics
    env     adapter.Env
    metrics map[string]config.Params_MetricInfo_Type
    cancel  context.CancelFunc
}

And some YAML
metrics:
- name: requestcount.metric.istio-system
  type: COUNTER
- name: requestduration.metric.istio-system
  type: DISTRIBUTION
- name: requestsize.metric.istio-system
  type: GAUGE
- name: responsesize.metric.istio-system
  type: GAUGE

Buffer metrics, then report
env.ScheduleDaemon(
    func() {
        ticker := time.NewTicker(b.adpCfg.SubmissionInterval)
        for {
            select {
            case <-ticker.C:
                cm.Flush()
            case <-adapterContext.Done():
                ticker.Stop()
                cm.Flush()
                return
            }
        }
    })

Your boss wants to know
● How many users got angry on the Tuesday slowdown after the big marketing promotion?
● Are we over-provisioned or under-provisioned on our purchasing checkout service?
● Other business centric questions

The Slowdown
● Marketing launched a new product
● Users complained the site was slow
● Median human reaction time is 215 ms [1]
● If users get angry (rage clicks) when requests take
more than 500 ms, how many users got angry?
[1] - https://www.humanbenchmark.com/tests/reactiontime/

The Slowdown
1. Record all service request latencies as a distribution
2. Plot as a heatmap
3. Calculate the percentage of requests that exceed the 500 ms SLI using inverse percentiles
4. Multiply the result by the total number of requests, then integrate over time
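
Steps 3 and 4 can be sketched as follows, assuming one latency histogram per time window with bins keyed by their lower bound in milliseconds, and assuming the 500 ms threshold falls on a bin boundary. The data and the names `hist`, `over`, and `slowRequests` are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// hist maps a bin's lower bound (ms) to its request count.
type hist map[uint64]uint64

// over returns the number of requests at or above thresholdMs and the
// total number of requests in the window.
func over(h hist, thresholdMs uint64) (slow, total uint64) {
	for lo, n := range h {
		total += n
		if lo >= thresholdMs {
			slow += n
		}
	}
	return slow, total
}

// slowRequests sums, over all windows, the inverse percentile at the
// threshold multiplied by that window's request count.
func slowRequests(windows []hist, thresholdMs uint64) uint64 {
	var sum uint64
	for _, h := range windows {
		slow, total := over(h, thresholdMs)
		if total == 0 {
			continue
		}
		fraction := float64(slow) / float64(total) // step 3: inverse percentile
		sum += uint64(math.Round(fraction * float64(total))) // step 4
	}
	return sum
}

func main() {
	windows := []hist{
		{100: 900, 600: 100}, // before the promotion
		{100: 500, 700: 500}, // during the slowdown
	}
	fmt.Println(slowRequests(windows, 500))
}
```

This counts slow requests rather than distinct users; mapping requests to users would need additional data that the slides do not describe.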

Under or Over Provisioned?
● “It depends”
● Time of day, day of week
● Special events
● Behavior under load
● Latency bands shed some light

Conclusions
● Monitor services, not containers
● Record distributions, not aggregates
● Istio gives you RED metrics for free
● Use math to ask the right questions

Thank you! Questions?
Tweet me
@phredmoyer


Feature drift monitoring as a service for machine learning models at scale

Noriaki Tatsumi

Use Custom Metadata Types for Easy ALM & Compliance for Your Custom Apps

Salesforce Developers

The document discusses custom metadata types, which allow organizations to easily manage application-level metadata (ALM) and comply with IT controls. Custom metadata types can be deployed and retrieved via metadata APIs and packages, providing traceability of changes. They help reduce errors by validating changes before deployment to production and tracking all metadata changes in a single transaction.

Sprint 73

ManageIQ

Flink Forward San Francisco 2018: David Reniz & Dahyr Vergara - "Real-time m...

Flink Forward

“Customer experience is the next big battle ground for telcos,” proclaimed recently Amit Akhelikar, Global Director of Lynx Analytics at TM Forum Live! Asia in Singapore. But, how to fight in this battle? A common approach has been to keep “under control” some well-known network quality indicators, like dropped calls, radio access congestion, availability, and so on; but this has proven not to be enough to keep customers happy, like a siege weapon is not enough to conquer a city. But, what if it were possible to know how customers perceive services, at least most demanded ones, like web browsing or video streaming? That would be like a squad of archers ready to battle. And even having that, how to extract value of it and take actions in no time, giving our skilled archers the right targets? Meet CANVAS (Customer And Network Visualization and AnaltyticS), one of the first LATAM implementations of a Flink-based stream processing use case for a telco, which successfully combines leading and innovative technologies like Apache Hadoop, YARN, Kafka, Nifi, Druid and advanced visualizations with Flink core features like non-trivial stateful stream processing (joins, windows and aggregations on event time) and CEP capabilities for alarm generation, delivering a next-generation tool for SOC (Service Operation Center) teams.

CA Performance Management 2.6 Deep Dive

CA Technologies

CA Performance Management 2.6 is a next-generation tool for monitoring mega-sized networks. This session, led by CA network monitoring experts, is designed to help new and existing users expand their knowledge of key capabilities and maximize the value of their performance data. The session will focus on foundational features, including understanding the architecture (data collectors, data repository/database, data aggregator, user interface and integration with CA Mediation Manager), leveraging the predefined dashboards and reports, understanding metric families and vendor-specific device certification, creating and deploying discovery and monitoring profiles and eventing. You'll also learn about advanced features, such as customizing dashboards and reports, automating custom groups creation and device population, using the application program interface (API) to integrate CA Performance Management with basic service set (BSS)/configuration management systems and create a zero-touch, automated process flow to on-board monitoring and self-certification procedures for device monitoring. For more information, please visit http://cainc.to/Nv2VOe

Similar to Comprehensive container based service monitoring with kubernetes and istio (20)

Comprehensive Container Based Service Monitoring with Kubernetes and Istio

Observability foundations in dynamically evolving architectures

Introducing MongoDB Stitch, Backend-as-a-Service from MongoDB

Scaling Experimentation & Data Capture at Grab

Sprint 45 review

Dynamic Authorization & Policy Control for Docker Environments

Istio Triangle Kubernetes Meetup Aug 2019

Architecting for Real-Time Insights with Amazon Kinesis (ANT310) - AWS re:Inv...

Everything you want to know about microservices

A Practical Deep Dive into Observability of Streaming Applications with Kosta...

Elastic Stack: Using data for insight and action

Introduction to Time Series Analytics with Microsoft Azure

Real-time Analytics with Trino and Apache Pinot

MongoDB Stich Overview

AWS Public Sector Symposium 2014 Canberra | Putting the "Crowd" to work in th...

Feature drift monitoring as a service for machine learning models at scale

Use Custom Metadata Types for Easy ALM & Compliance for Your Custom Apps

Sprint 73

Flink Forward San Francisco 2018: David Reniz & Dahyr Vergara - "Real-time m...

CA Performance Management 2.6 Deep Dive

Comprehensive container based service monitoring with kubernetes and istio

1. Comprehensive Container-Based Service Monitoring with Kubernetes and Istio SREcon Asia Australia 2018-06-06 Fred Moyer

2. Monitoring Nerd @phredmoyer Developer Evangelist @Circonus / @IRONdb @IstioMesh Geek Observability and Statistics Dork

3. Talk Agenda ❏ Istio Overview ❏ Service Level Objectives ❏ RED Dashboard ❏ Histogram Telemetry ❏ Istio Metrics Adapter ❏ Asking the Right Questions

4. Istio.io “An open platform to connect, manage, and secure microservices”

5. Happy Birthday!

6. K8S + Istio ● Orchestration ● Deployment ● Scaling ● Data Plane ● Policy Enforcement ● Traffic Management ● Telemetry ● Control Plane

7. Istio Architecture

8. Istio GCP Deployment

9. Istio Sample App $ istioctl create -f apps/bookinfo.yaml

10. Istio Sample App

11. Istio Sample App

12. Istio Sample App
kind: Deployment
metadata:
  name: ratings-v1
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: ratings
        version: v1
    spec:
      containers:
      - name: ratings
        image: istio/examples-bookinfo-ratings-v1
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 9080

13. Istio Sample App
$ istioctl create -f apps/bookinfo/route-rule-reviews-v2-v3.yaml
type: route-rule
spec:
  name: reviews-default
  destination: reviews.default.svc.cluster.local
  precedence: 1
  route:
  - tags:
      version: v2
    weight: 80
  - tags:
      version: v3
    weight: 20

14. Istio K8s Services
> kubectl get services
NAME          CLUSTER-IP   EXTERNAL-IP   PORT(S)    AGE
details       10.0.0.31    <none>        9080/TCP   6m
kubernetes    10.0.0.1     <none>        443/TCP    7d
productpage   10.0.0.120   <none>        9080/TCP   6m
ratings       10.0.0.15    <none>        9080/TCP   6m
reviews       10.0.0.170   <none>        9080/TCP   6m

15. Istio K8s App Pods
> kubectl get pods
NAME                       READY   STATUS    RESTARTS   AGE
details-v1-1520924117      2/2     Running   0          6m
productpage-v1-560495357   2/2     Running   0          6m
ratings-v1-734492171       2/2     Running   0          6m
reviews-v1-874083890       2/2     Running   0          6m
reviews-v2-1343845940      2/2     Running   0          6m
reviews-v3-1813607990      2/2     Running   0          6m

16. Istio K8s System Pods
> kubectl get pods -n istio-system
NAME                       READY   STATUS    RESTARTS   AGE
istio-ca-797dfb66c5        1/1     Running   0          2m
istio-ingress-84f75844c4   1/1     Running   0          2m
istio-egress-29a16321d3    1/1     Running   0          2m
istio-mixer-9bf85fc68      3/3     Running   0          2m
istio-pilot-575679c565     2/2     Running   0          2m
grafana-182346ba12         2/2     Running   0          2m
prometheus-837521fe34      2/2     Running   0          2m

17. Talk Agenda ❏ Istio Overview ❏ Service Level Objectives ❏ RED Dashboard ❏ Histogram Telemetry ❏ Istio Metrics Adapter ❏ Asking the Right Questions

18. Service Level Objectives ● SLI - Service Level Indicator ● SLO - Service Level Objective ● SLA - Service Level Agreement

19. Service Level Objectives

20. “SLIs drive SLOs which inform SLAs” SLI - Service Level Indicator, a measure of the service that can be quantified “95th percentile latency of homepage requests over past 5 minutes < 300ms” Excerpted from “SLIs, SLOs, SLAs, oh my!” @sethvargo @lizthegrey https://youtu.be/tEylFyxbDLE

21. “SLIs drive SLOs which inform SLAs” SLO - Service Level Objective, a target for Service Level Indicators “95th percentile homepage SLI will succeed 99.9% over trailing year” Excerpted from “SLIs, SLOs, SLAs, oh my!” @sethvargo @lizthegrey https://youtu.be/tEylFyxbDLE

22. “SLIs drive SLOs which inform SLAs” SLA - Service Level Agreement, a legal agreement between a customer and a service provider based on SLOs “Service credits if 95th percentile homepage SLI succeeds less than 99.5% over trailing year” Excerpted from “SLIs, SLOs, SLAs, oh my!” @sethvargo @lizthegrey https://youtu.be/tEylFyxbDLE
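As a rough illustration of how the three terms on these slides relate, the Go sketch below evaluates the example SLI (95th percentile latency under 300 ms per 5-minute window) across a set of windows, then compares the success ratio with the 99.9% SLO and the 99.5% SLA threshold. The `window` type, the function names, and the sample data are hypothetical and are not part of Istio or Circonus.

```go
package main

import "fmt"

// window holds the 95th percentile latency (in ms) measured over one
// 5-minute window. The type is hypothetical, for illustration only.
type window struct {
	p95Millis float64
}

// sliGood reports whether one window satisfies the SLI "p95 latency < 300 ms".
func sliGood(w window) bool {
	return w.p95Millis < 300
}

// successRatio returns the fraction of windows in which the SLI succeeded.
func successRatio(ws []window) float64 {
	if len(ws) == 0 {
		return 1 // no data: treated as compliant in this sketch
	}
	good := 0
	for _, w := range ws {
		if sliGood(w) {
			good++
		}
	}
	return float64(good) / float64(len(ws))
}

func main() {
	// Made-up data: 1000 windows, 3 of which breach the 300 ms threshold.
	ws := make([]window, 1000)
	for i := range ws {
		ws[i] = window{p95Millis: 120}
	}
	ws[10].p95Millis, ws[500].p95Millis, ws[900].p95Millis = 450, 310, 800

	ratio := successRatio(ws)
	fmt.Printf("SLI success ratio: %.3f\n", ratio)
	fmt.Println("SLO (99.9%) met:", ratio >= 0.999)
	fmt.Println("SLA (99.5%) met:", ratio >= 0.995)
}
```

With this data the SLO is missed while the SLA still holds, which is the gap between the two targets that the slides describe.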

23. Talk Agenda ❏ Istio Overview ❏ Service Level Objectives ❏ RED Dashboard ❏ Histogram Telemetry Collection ❏ Istio Metrics Adapter ❏ Asking the Right Questions

24. Emerging Standards
● USE
  ○ Utilization, Saturation, Errors
  ○ Introduced by Brendan Gregg @brendangregg
  ○ KPIs for host based health
● The Four Golden Signals
  ○ Latency, Traffic, Errors, Saturation
  ○ Covered in the Google SRE Book
  ○ Extended version of RED
● RED
  ○ Rate, Errors, Duration
  ○ Introduced by Tom Wilkie @tom_wilkie
  ○ KPIs for API based health, SLI focused

25. Containers? ● Ephemeral ● High Cardinality ● Difficult to Instrument ● Instrument Services, Not Containers

26. Istio Mixer Provided Telemetry ● Request Count by Response Code ● Request Duration ● Request Size ● Response Size ● Connection Received Bytes ● Connection Sent Bytes ● Connection Duration ● Template Based MetaData (Metric Tags)

27. RED
● Rate
  ○ Requests per second
  ○ First derivative of request count provided by Istio
● Errors
  ○ Unsuccessful requests per second
  ○ First derivative of failed request count provided by Istio
● Duration
  ○ Request latency provided by Istio

28. Istio Grafana Dashboard

29. Istio Grafana Dashboard

30. Rate

31. Errors

32. Duration

33. Duration
Problems:
● Percentiles > averages, but have limitations
  ○ Aggregated metric, fixed time window
  ○ Cannot be re-aggregated for cluster health
  ○ Cannot be averaged (common mistake)
● Stored aggregates are outputs, not inputs
● Difficult to measure cluster SLIs
● Leave a lot to be desired

34. WE CAN DO BETTER WE HAVE THE TECHNOLOGY^W^W MATH https://youtu.be/yCX1Ze3OcKo

35. Talk Agenda ❏ Istio Overview ❏ Service Level Objectives ❏ RED Dashboard ❏ Histogram Telemetry ❏ Istio Metrics Adapter ❏ Asking the Right Questions

36. Histogram https://github.com/circonus-labs/circonusllhist

37. Log linear histogram https://github.com/circonus-labs/circonusllhist

38. Duration - Histogram https://github.com/circonus-labs/circonusllhist

39. Duration - SLI SLI - “90th percentile latency of requests over past 5 minutes < 1,000ms”
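The sketch below shows one way a log-linear histogram can answer this SLI. It is a simplified illustration in Go and deliberately does not use the circonusllhist API. The assumption here is that each bucket keeps two significant decimal digits of a latency in whole milliseconds, and all type and method names are hypothetical.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// bin is one histogram bucket covering [lower, lower+width) milliseconds.
type bin struct {
	lower, width uint64
}

// binFor maps a latency in whole milliseconds to a bucket that keeps two
// significant decimal digits, so buckets are linear within each power of ten.
func binFor(ms uint64) bin {
	scale := uint64(1)
	for ms/scale >= 100 {
		scale *= 10
	}
	return bin{lower: ms / scale * scale, width: scale}
}

// histogram counts samples per bucket.
type histogram map[bin]uint64

func (h histogram) record(ms uint64) { h[binFor(ms)]++ }

// quantileUpperBound returns the upper edge of the bucket that contains the
// q-th quantile (0 < q <= 1), a conservative estimate of that quantile.
// It returns 0 for an empty histogram.
func (h histogram) quantileUpperBound(q float64) uint64 {
	bins := make([]bin, 0, len(h))
	var total uint64
	for b, c := range h {
		bins = append(bins, b)
		total += c
	}
	sort.Slice(bins, func(i, j int) bool { return bins[i].lower < bins[j].lower })

	target := uint64(math.Ceil(q * float64(total)))
	var seen uint64
	for _, b := range bins {
		seen += h[b]
		if seen >= target {
			return b.lower + b.width
		}
	}
	return 0
}

func main() {
	h := histogram{}
	// Made-up request latencies in ms for one 5-minute window.
	for _, ms := range []uint64{12, 15, 20, 35, 48, 70, 110, 250, 640, 1234} {
		h.record(ms)
	}
	p90 := h.quantileUpperBound(0.90)
	fmt.Println("p90 upper bound (ms):", p90) // 650
	fmt.Println("SLI met (p90 < 1000 ms):", p90 < 1000)
}
```

Because the buckets are just counts, histograms from several hosts or time windows can be added together before the quantile is taken, which stored percentiles do not allow.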

40. Duration - Modes

41. Duration - Modes

42. Duration - Modes

43. Duration - Heatmap

44. Duration - Heatmap

45. RED

46. RED - SLI Alerting

47. Talk Agenda ❏ Istio Overview ❏ Service Level Objectives ❏ RED Dashboard ❏ Histogram Telemetry ❏ Istio Metrics Adapter ❏ Asking the Right Questions

48. Istio Metrics Adapter
● Golang based adapter API
● In process (built into the Mixer executable)
  ○ Out of process for new adapter dev
● Set of handler hooks and YAML files

49. Istio Metrics Adapter “SHOW ME THE CODE” https://github.com/istio/istio/ https://github.com/istio/istio/blob/master/mixer/adapter/circonus

50. Istio Metrics Adapter

51. Istio Metrics Adapter

52. Istio Metrics Adapter
// HandleMetric submits metrics to Circonus via circonus-gometrics
func (h *handler) HandleMetric(ctx context.Context, insts []*metric.Instance) error {
    for _, inst := range insts {
        metricName := inst.Name
        metricType := h.metrics[metricName]
        switch metricType {
        case config.GAUGE:
            value, _ := inst.Value.(int64)
            h.cm.Gauge(metricName, value)
        case config.COUNTER:
            h.cm.Increment(metricName)

53. Istio Metrics Adapter
        case config.DISTRIBUTION:
            value, _ := inst.Value.(time.Duration)
            h.cm.Timing(metricName, float64(value))
        }
    }
    return nil
}

54. Istio Metrics Adapter
handler struct {
    cm      *cgm.CirconusMetrics
    env     adapter.Env
    metrics map[string]config.Params_MetricInfo_Type
    cancel  context.CancelFunc
}

55. And some YAML metrics: - name: requestcount.metric.istio-system type: COUNTER - name: requestduration.metric.istio-system type: DISTRIBUTION - name: requestsize.metric.istio-system type: GAUGE - name: responsesize.metric.istio-system type: GAUGE

56. Buffer metrics, then report env.ScheduleDaemon( func() { ticker := time.NewTicker(b.adpCfg.SubmissionInterval) for { select { case <-ticker.C: cm.Flush() case <-adapterContext.Done(): ticker.Stop() cm.Flush() return } } })

57. Talk Agenda ❏ Istio Overview ❏ Service Level Objectives ❏ RED Dashboard ❏ Histogram Telemetry ❏ Istio Metrics Adapter ❏ Asking the Right Questions

58. Your boss wants to know ● How many users got angry on the Tuesday slowdown after the big marketing promotion? ● Are we over-provisioned or under-provisioned on our purchasing checkout service? ● Other business centric questions

59. The Slowdown ● Marketing launched a new product ● Users complained the site was slow ● Median human reaction time is 215 ms [1] ● If users get angry (rage clicks) when requests take more than 500 ms, how many users got angry? [1] - https://www.humanbenchmark.com/tests/reactiontime/

60. The Slowdown 1. Record all service request latencies as distribution 2. Plot as a heatmap 3. Calculate percentage of requests that exceed 500ms SLI using inverse percentiles 4. Multiply result by total number of requests, integrate over time

61. The Slowdown 4 million slow requests

62. Under or Over Provisioned? ● “It depends” ● Time of day, day of week ● Special events ● Behavior under load ● Latency bands shed some light

63. Latency Bands

64. Conclusions ● Monitor services, not containers ● Record distributions, not aggregates ● Istio gives you RED metrics for free ● Use math to ask the right questions

65. Thank you! Questions? Tweet me @phredmoyer

Editor's Notes

Hello, I’m Fred Moyer, and welcome to my talk. I’m here to speak about monitoring kubernetes container based services with Istio
I’m a monitoring nerd and developer evangelist at Circonus. I’m an Istio geek, I created the first Istio community adapter for monitoring Istio based services with Circonus. The Istio developers gave me that ship in a bottle as a thank you for being the first person to implement a third party adapter. I also like to dork out on observability and statistics.
So just a quick overview of the talk roadmap. I’ll first give an overview of Istio and why it is awesome, then I’ll do a brief refresher on service level objectives to set a baseline. We’ll look at what a RED dashboard is and how we can make it more awesome with histogram based telemetry. Then I’ll do a brief walkthrough of the Istio metrics adapter I wrote, and finish up with a couple of examples of how we can ask the right questions of the services we are monitoring.
Let’s start off with Istio. Istio is a service mesh for Kubernetes, which means that it takes care of all of the intercommunication and facilitation between services, kind of like network routing software does for TCP/IP traffic. It can also interact with Docker and Consul based services in addition to Kubernetes based apps. It is similar to Linkerd, which has been around for a while.
Istio is developed as an open source project by teams at Google, IBM, Cisco, and Lyft’s Envoy. The project just turned one year old a few weeks ago, and is rapidly approaching version 1.0. Version 0.8 was released just last week, and Istio has found its way into a couple of production environments at scale.
So how does Istio fit into the Kubernetes ecosystem? Kubernetes handles container orchestration, deployment, and scaling. Kubernetes is the data plane, the part of the infrastructure that carries the application traffic. Istio is the control plane, the part of the infrastructure that tells the application traffic where it can go. It handles policy enforcement such as role based access control, traffic management such as rate limiting, load balancing (think 5% of traffic can be routed to a canary instance), and telemetry syndication such as metrics, logs, and tracing. Istio is the crossing guard and reporting piece of the container based infrastructure.
Let’s take a quick look at the service mesh architecture. Istio uses an Envoy sidecar proxy for each service. Envoy proxies inbound requests to the Istio Mixer service via a gRPC call, where Mixer applies rules for traffic management and syndicates request telemetry. Mixer is the brains of Istio; operators can write YAML files that specify how Envoy should redirect traffic, and what telemetry to push to monitoring and observability systems. Rules can be applied as needed at runtime without needing to restart any Istio components. Istio supports a number of adapters to send telemetry to Prometheus, Circonus, statsd, and others. You can also enable both Zipkin and Jaeger tracing, as well as generate service graphs to visualize the services involved.
Istio is easy to deploy. Way back when, around 7 to 8 months ago, you had to install Istio onto a Kubernetes cluster with a series of kubectl commands. Now you can just hop into Google Cloud platform, and deploy an Istio enabled Kubernetes cluster with a few clicks, including monitoring, tracing, and a sample application. You can get up and running very quickly, and then use the istioctl command to start having fun. Let’s take a look at how we can get going with istioctl.
Istio is managed with a command line executable called istio control (or is it istio cuddle?). Istioctl operates on YAML files which specify resource deployments similar to how kubectl operates with Kubernetes. I’ll show a couple of examples of these yaml files in a few slides, but first let’s see what we get when we run this command against our Istio enabled Kubernetes cluster.
And boom, we have the sample application up and running. This sample app is basically a book review application which allows us to demonstrate a lot of the Istio functionality. At the bottom of the screen, you can see where Istio can provide information about app routing weight, app version, and other metadata.
Here’s the sample bookinfo application that ships with Istio. It has several different microservices written in different languages. So instead of listening to folks complain about why their language is better than what your organization is using, you can make those tactics part of your strategic plan, let folks use their language of choice, and be rid of the programming language wars in your group. Instead, you can get everyone back to arguing about tabs vs spaces. This layout shows a product page service, a book details page service, three different versions of a reviews service, and a ratings service. Istio allows you to abstract a service such as the reviews service behind different versions of container deployments. You can configure Istio to route different levels of traffic to each set of pods for functionality such as canary deployments. You’ll notice that there’s an Envoy sidecar present in each service. Envoy is very high performance, so this doesn’t actually present noticeable performance problems. Another significant benefit of this architecture is that it allows us to gather telemetry from services without requiring that developers instrument their services to provide that telemetry. This has a multitude of benefits, from eliminating points of maintenance and failure in the code, to providing vendor agnostic interfaces which reduce lock-in.
The bookinfo.yaml file contains sections which specify different services. This particular section contains the information to set up the v1 ratings service. We specify a name for the service, version, container port, and the Docker image to pull. This type of approach makes it fairly easy to codify a microservice based application.
We can apply yaml based rules at runtime to tell Istio that we want to change the traffic distribution between different versions of services. In this example, we direct Istio to send 80% of the requests to version 2 of the reviews service, and 20% of the requests to version 3 of the reviews service. This makes canary deployments straightforward to execute. You can send a small portion of your traffic to the new version of your service, apply a routing rule like this, and examine your telemetry and logs to determine if your new version is behaving as expected. This does mean that you are in effect testing in production, but often there is no substitute for this approach to verify code is working as expected.
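To make the 80/20 split concrete, here is a minimal Go sketch of weighted selection. It is an illustration only, not how Envoy or Istio implement routing; the `Route` type and `pick` function are hypothetical names, and the weights are assumed to sum to 100.

```go
package main

import "fmt"

// Route is a hypothetical weighted destination, mirroring the
// version/weight pairs in the route rule.
type Route struct {
	Version string
	Weight  int // percentage points; weights are assumed to sum to 100
}

// pick maps a number n in [0, 100) onto a route by walking the
// cumulative weights. A real proxy would draw n at random per request.
func pick(routes []Route, n int) string {
	cumulative := 0
	for _, r := range routes {
		cumulative += r.Weight
		if n < cumulative {
			return r.Version
		}
	}
	return routes[len(routes)-1].Version
}

func main() {
	routes := []Route{
		{Version: "v2", Weight: 80},
		{Version: "v3", Weight: 20},
	}
	counts := map[string]int{}
	for n := 0; n < 100; n++ {
		counts[pick(routes, n)]++
	}
	fmt.Println(counts["v2"], counts["v3"]) // prints: 80 20
}
```

Feeding every value from 0 to 99 through `pick` shows the split exactly; with random draws the split only converges on 80/20 as request volume grows, which is worth remembering when reading canary telemetry at low traffic.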
Here’s a look inside at what Kubernetes services are exposed by this sample application. Our sample application is composed of four total services, excluding the kubernetes system service. We have separate services for the book details page, the product page, the ratings page, and the reviews page. Let’s take a look at the pods that compose these services.
Here are the individual pods deployed which compose those services. Notice the three different versions of the reviews pods which make up the reviews service. As we just saw, we can deploy different versions of individual services and weight the traffic between them.
Istio itself makes use of a number of different pods to operate. I didn’t say that running Istio is lightweight; the power and flexibility come with some overhead for operation. If you have more than a few microservices in your application, though, your application containers will soon surpass the system provisioned containers.
Now that we’ve taken a look at what Istio is, let’s do a brief overview of Service Level Objectives to set the stage for how we should measure our service health.
Most people here are probably familiar with the Google SRE book. The concept of SLAs has been around for at least a decade, but service level objectives and service level indicators are also becoming ubiquitous amongst site reliability engineers. The amount of online content around these terms has been increasing rapidly. I have had to restructure this talk over the past few weeks to include some of the most recent content.
In addition to the Google SRE book, two new books that talk about service level objectives are due for publication soon. The site reliability workbook has a dedicated chapter on service level objectives. Seeking SRE has a chapter on defining SLOs by Theo Schlossnagle, the Circonus CEO (shout out there as they are paying for my trip here).
I’m pulling in a few excerpts from a great YouTube video by Seth Vargo and Liz Fong-Jones that I recently watched, which I think explains these concepts really well. I recommend watching the video to get an in depth understanding. The gist of that video is that SLIs drive SLOs which inform SLAs. A service level indicator is basically a metric derived measure of health for a service. For example, I could have an SLI that says my 95th percentile latency of homepage requests over the last 5 minutes should be less than 300 milliseconds.
A service level objective is basically how we take a service level indicator, and extend the scope of it to quantify how we expect our service to perform over a strategic time interval. Drawing on the SLI we talked about in the previous slide, we could say that we want to meet the criteria set by that SLI for three nines over a trailing year window.
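To put a number on "three nines over a trailing year", here is a small Go sketch of the arithmetic. It assumes a 365-day year and that the budget is counted either in hours or in 5-minute SLI windows; `errorBudget` is a hypothetical helper name.

```go
package main

import "fmt"

// errorBudget returns how much of a window may miss the SLI while the
// SLO target is still met. The unit of the result is the unit of window.
func errorBudget(target, window float64) float64 {
	return (1 - target) * window
}

func main() {
	const hoursPerYear = 365 * 24             // 8760 hours
	const fiveMinWindows = hoursPerYear * 12  // 105120 five-minute windows

	fmt.Printf("%.2f hours\n", errorBudget(0.999, hoursPerYear))
	fmt.Printf("%.2f five-minute windows\n", errorBudget(0.999, fiveMinWindows))
}
```

Under these assumptions the budget works out to roughly 8.76 hours, or about 105 five-minute windows, per year in which the SLI may be missed.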
Service level agreements have been around for a long time. They are generally crafted by lawyers to limit the possibility of having to give customers money for those times when our service doesn’t perform like we committed to. SLAs are similar to SLOs, but the commitment level is relaxed, as we want our internal facing targets to be more strict than our external facing targets.
We’ve talked about Istio and SLOs, so we’ve set the stage to move onto talking about RED dashboards, which is where the fun begins.
Over the past several years there have been a number of emerging standards regarding combinations of indicators for quantifying both host and service health. Brendan Gregg introduced the USE method, which seeks to quantify health of a system host based on utilization, saturation, and errors metrics. For something like a CPU, we can use common utilization metrics for user, system, and idle percentages, and load average and run queue for saturation. The UNIX perf profiler is a good tool for measuring CPU error events. The Google SRE book talks about using latency, traffic, errors, and saturation metrics. It is targeted at service health, and is similar to the RED method below, but extends it with saturation. In practice, it can be difficult to quantify service saturation. Tom Wilkie introduced the RED method a few years ago. I talked to him about it in March, and he said it was something he just came up with and didn’t expect it to take off. With RED we monitor request rate, request errors, and request duration. Service Level Indicators can be integrated as part of a RED dashboard, as we’ll see in a bit.
But wait, the title of this talk mentioned containers. Why aren’t we looking at monitoring the containers? Containers are ephemeral, short lived entities, and monitoring them directly to discern our service health presents a number of complex problems, such as the dreaded high cardinality issue. It is easier and more effective to monitor the service outputs of those containers in aggregate. We don’t care if one container is misbehaving if the service is healthy; chances are that our orchestration framework will reap that container anyway and replace it with a new one.
To compose our RED dashboard, let’s look at what telemetry is provided by Istio. I’ve given part of it away here with the font color. Istio provides several metrics about the requests it receives, the latency to generate a response, and connection level data. It also gives us the ability to add metric tags, called dimensions in Istio, to this telemetry. So we can break down the telemetry by host, cluster, etc.
So Istio provides us with the component metrics we need to construct a RED dashboard. We can get the rate in requests per second by taking the first derivative of the request count. We can get the error rate by taking the derivative of the request count of unsuccessful requests. And Istio provides us with the request latency of each request, so we can record how long each service request took to complete. These three key metrics provide us with the building blocks to construct a fairly complete view of the health of a service.
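As a rough illustration of "taking the first derivative of the request count", here is a Go sketch that turns successive counter readings into per-second rates. The `Sample` type and `rates` function are hypothetical; the same calculation applied to a counter of unsuccessful requests gives the error rate.

```go
package main

import "fmt"

// Sample is one reading of a monotonically increasing request counter.
type Sample struct {
	Seconds float64 // timestamp of the reading, in seconds
	Count   float64 // cumulative request count at that time
}

// rates returns the per-second rate between each pair of consecutive samples.
func rates(samples []Sample) []float64 {
	out := make([]float64, 0, len(samples))
	for i := 1; i < len(samples); i++ {
		dt := samples[i].Seconds - samples[i-1].Seconds
		dc := samples[i].Count - samples[i-1].Count
		if dt <= 0 || dc < 0 {
			// Skip out-of-order readings and counter resets.
			continue
		}
		out = append(out, dc/dt)
	}
	return out
}

func main() {
	readings := []Sample{
		{Seconds: 0, Count: 100},
		{Seconds: 10, Count: 150},
		{Seconds: 20, Count: 250},
	}
	fmt.Println(rates(readings)) // prints: [5 10]
}
```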
Istio actually provides us with a Grafana dashboard out of the box that contains the pieces we need to compose a RED dashboard.
Here I’ve circled those components in red. We have the request rate in operations per second in the upper left, the number of failed requests per second in the upper right, and a graph of response time in the bottom. There are several other indicators on this graph, but let’s take a closer look at the ones I’ve circled.
Here’s the rate component of the dashboard, pretty straightforward. We count the number of requests which returned a 200 response code and graph the rate over time.
The Istio dashboard does something similar for responses that return a 5xx error code. You can see here how it breaks down the errors by either the ingress controller, or errors from the application product page itself.
Here’s the request duration graph; this is probably the most informative as to the health of our service. This data is provided by a Prometheus monitoring system, so we see request time percentiles graphed here: the median, 90th, 95th, and 99th percentiles. While these percentiles give us some overall indication of how the service is performing, there are a number of deficiencies with this approach that are worth examining. For example, during times of low activity, these percentiles can skew wildly because of limited numbers of samples, and you can misinterpret the system performance in those situations. Let’s take a look at the other issues that can arise with this approach.
Percentiles are generally better metrics than averages, as you are expressing the range of values with multiple datapoints instead of just one. However, like an average, percentiles are an aggregated metric, calculated over a fixed time window for a fixed data set. If we calculate a duration percentile for one cluster member, we cannot merge that with another cluster member to get an aggregate performance metric for the whole cluster. It is a common misconception that percentiles can be averaged; they cannot, except in rare cases where the distributions that generated them are nearly identical. If you only have the percentile, and not the source data, you cannot know whether that is the case. It is a chicken and egg problem. This also means that you cannot set service level indicators for an entire service due to the lack of mergeability, if you are measuring percentile based performance only for individual cluster members. Our ability to set meaningful Service Level Indicators (and as a result SLOs) here is limited due to only having four latency data points over fixed time windows. So when you are working with percentile based duration metrics, you have to ask yourself, are my SLIs really good SLIs?
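A small Go sketch shows why averaging percentiles goes wrong. The latencies are made-up numbers for two hypothetical cluster members, one busy and fast, one quiet and slow; `percentile` uses the nearest-rank method and assumes a non-empty input.

```go
package main

import (
	"fmt"
	"sort"
)

// percentile returns the p-th percentile (1..100) of samples using the
// nearest-rank method. It assumes samples is non-empty.
func percentile(samples []float64, p int) float64 {
	sorted := append([]float64(nil), samples...)
	sort.Float64s(sorted)
	rank := (p*len(sorted) + 99) / 100 // integer ceiling of p/100 * n
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

func main() {
	// Made-up latencies in milliseconds.
	nodeA := []float64{10, 11, 12, 13, 14, 15, 16, 17} // busy, fast node
	nodeB := []float64{480, 520}                       // quiet, slow node

	averaged := (percentile(nodeA, 90) + percentile(nodeB, 90)) / 2
	merged := append(append([]float64(nil), nodeA...), nodeB...)

	fmt.Println(averaged, percentile(merged, 90)) // prints: 268.5 480
}
```

The averaged figure (268.5 ms) describes no request that either node actually served, while the percentile of the combined samples (480 ms) is what the cluster's users experienced. Getting the second number requires the underlying data, or a mergeable summary of it.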
I believe that we can do better than this, and the next part of the talk will focus on how we can use math to generate the service level indicators that we need to give us a comprehensive view of our service’s performance and health. Has anyone here seen the Donald Duck math video? Maybe I’m showing my age; we watched this in grade school. So let’s push on and have some fun with math.
And we’ll start off with histogram based telemetry.
This is a visualization of latency data for a service in microseconds using a histogram. The number of samples is on the Y-Axis, and the sample value, in this case microsecond latency, is on the X-Axis. This is the open source histogram we developed at Circonus. There are a few other histogram implementations out there that are open source, such as Ted Dunning’s t-digest histogram and the HDR histogram. The Envoy project recently adopted the C implementation of this log linear histogram library to collect Envoy telemetry as distributions. They actually found a very minor bug in our implementation. We were quite happy to fix it. That’s the beauty of open source, the more eyes on the code, the better it gets over time. Histograms have a property of mergeability; I can take this distribution, and combine it with other distributions as long as the bin boundaries are identical. Mergeable metrics are awesome for monitoring and observability, as they allow us to combine outputs from similar sources such as service members, and calculate aggregates to obtain aggregate service metrics.
This log linear histogram contains 90 bins for each power of 10. You can see 90 bins between 100,000 and 1M. At each power of 10, the bin size increases by a factor of 10. This allows us to record a wide range of values with high relative accuracy without needing to know the data distribution ahead of time.
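The binning scheme can be sketched in a few lines of Go. This is an illustration of the idea for positive integer values such as microsecond latencies, not the circonusllhist code; `bin` and `binsInRange` are hypothetical names.

```go
package main

import "fmt"

// bin returns the lower bound and width of the log-linear bin holding v.
// It keeps the two most significant decimal digits of v, which yields 90
// bins (10..99) per power of 10. It assumes v > 0.
func bin(v int64) (lower, width int64) {
	width = 1
	for v/width >= 100 {
		width *= 10
	}
	return (v / width) * width, width
}

// binsInRange counts the distinct bins touched by the integers in [lo, hi].
func binsInRange(lo, hi int64) int {
	seen := map[int64]bool{}
	for v := lo; v <= hi; v++ {
		lower, _ := bin(v)
		seen[lower] = true
	}
	return len(seen)
}

func main() {
	lower, width := bin(123456)
	fmt.Println(lower, width)                 // prints: 120000 10000
	fmt.Println(binsInRange(100000, 999999)) // prints: 90
}
```

In this sketch a bin's width is never more than a tenth of its lower bound, so a value is always located to within about 10% no matter how large it is. That is the sense in which the range can be wide without knowing the distribution in advance.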
Here you can see where we have overlaid the average, the 50th percentile (also known as the median), and the 90th percentile. The 90th percentile is the value below which 90% of the samples fall.
Let’s bring up the SLI we talked about earlier. With latency data displayed in this format, we can easily calculate our SLI for a service by merging histograms together to get a 5 minute view of data, and then calculating the 90th percentile value for that distribution. If it is less than 1,000 milliseconds, we met our SLI. Previously we showed the RED dashboard duration graph with four percentiles, the 50th, 90th, 95th, and 99th. We can overlay those percentiles on this distribution, and likely get a coarse guess of what the request distribution might be without the data, but as you can see we would be making a lot of assumptions. What about if the distribution had additional modes?
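Here is a minimal Go sketch of that SLI calculation: merge histograms that share bin boundaries, then read a percentile off the merged distribution. The `Histogram` type, the bin values, and the function names are assumptions made for illustration; a real library would also interpolate within a bin rather than return its lower bound.

```go
package main

import (
	"fmt"
	"sort"
)

// Histogram maps a bin's lower bound (in ms) to the number of samples in
// that bin. All histograms are assumed to share the same bin boundaries.
type Histogram map[int64]uint64

// merge sums per-bin counts across histograms.
func merge(hs ...Histogram) Histogram {
	out := Histogram{}
	for _, h := range hs {
		for lower, count := range h {
			out[lower] += count
		}
	}
	return out
}

// percentile returns the lower bound of the bin containing the p-th
// percentile (1..100), so the answer is only as precise as the bin width.
func percentile(h Histogram, p uint64) int64 {
	bins := make([]int64, 0, len(h))
	var total uint64
	for lower, count := range h {
		bins = append(bins, lower)
		total += count
	}
	sort.Slice(bins, func(i, j int) bool { return bins[i] < bins[j] })

	target := (total*p + 99) / 100 // integer ceiling of p/100 * total
	var seen uint64
	for _, lower := range bins {
		seen += h[lower]
		if seen >= target {
			return lower
		}
	}
	return 0 // empty histogram
}

func main() {
	// Made-up histograms from two pods covering the same 5 minutes.
	podA := Histogram{100: 50, 200: 30, 900: 5}
	podB := Histogram{100: 10, 1200: 5}

	window := merge(podA, podB)
	p90 := percentile(window, 90)
	fmt.Println(p90, p90 < 1000) // prints: 200 true
}
```

The same `merge` works across time (one-minute histograms into a five-minute window) and across cluster members, which is what makes a service-wide SLI possible.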
This histogram shows a distribution with two distinct modes. The leftmost mode could be fast responses due to serving from a cache, and the right mode from serving from disk. Just measuring latency using four percentiles would make it nearly impossible to discern a distribution like this. This gives us a sense of the complexity that percentiles can mask. What if we had more than two modes?
This distribution has at least four visible modes. If we do the math on the full distribution, we will actually find 20+ modes here. How many percentiles would you need to record to approximate a latency distribution like this?
How about a distribution like this? Complex systems composed of many services will generate latency distributions that simply cannot be accurately represented by using percentiles; you have to record the entire latency distribution to be able to fully represent it. This type of histogram visualization shows a distribution over a fixed time window. We can store multiple distributions to get a sense of how it changes over time.
This is a heatmap, a set of histograms over time. This is a Grafana visualization of the response latency from a cluster of 10 load balancers. This gives us deep insight into the system behavior of the entire cluster over a week; there are over 1 million data samples here. The median here centers around 500 microseconds, shown in the red colored bands.
This is another type of heatmap. You can see where we’ve overlaid percentile calculations over time on top of the heatmap. Percentiles are robust metrics and very useful, but not by themselves. We can see here how the 90+ percentiles increase as the latency distribution shifts upwards. Let’s take these distribution based duration maps and see if we can generate something more informative than the Istio dashboard.
This is a RED dashboard revised to show distribution based latency data. Here in the lower left we show a histogram of latencies over a fixed time window. And to the right of it, we use a heat map to break that distribution down into smaller time windows. With this layout of RED dashboard, we can get a complete view of how our service is behaving with only a few panels of information. This particular dashboard was implemented using Grafana served from an IRONdb time series database which stores the latency data natively as log linear histograms.
We can extend this RED dashboard a bit further and overlay our service level indicators onto the graphs as well. For the rate panel, our SLI might be to maintain a minimum level of requests per second. For the errors panel, our SLI might be to stay under a certain number of errors per second. And as we have previously examined duration SLIs, we might want our 99th percentile for our entire service, which is composed of several pods, to stay under a certain latency over a fixed window. Using Istio telemetry stored as histograms enables us to set these meaningful service wide SLIs. Now we have a lot more to work with and, as we’ll see in a bit, to ask questions of.
Let’s move on to take a look at how you can get distribution data out of Istio.
Istio provides developers with an adapter API that they can use to create their own custom adapters for telemetry syndication, as well as traffic management and policy enforcement. Mixer is written in Go, and provides a set of predefined handler endpoints; you write a Go package that implements them to receive request and response metadata and act on it. The development team is currently exploring an option for out of process Mixer adapters using gRPC, but we’ll be looking at an in process implementation. Along with Go code, Mixer adapter developers also create a set of YAML files to tell the adapter what data to receive and what to do with it.
So let’s dive right in and take a brief look at the code. I’ll give folks a few seconds to bring up the code repo on Microsoft, uh whoops I meant GitHub.
Here’s the root level directory for the circonus adapter I created. The adapter code is in circonus.go, and the tests in circonus_test.go. The config directory contains YAML configuration needed to tell Istio what telemetry to send to the adapter, and the operatorconfig directory gives a sample configuration file. You can cd up to the adapter directory also to see all of the other adapters that folks have created.
Here’s a list of all of the adapters that Mixer has. There are also third party adapters for Cloudwatch, Datadog, Solarwinds, and Stackdriver. It was fun to see more telemetry adapters get added after I finished the Circonus adapter and was able to pass along feedback to the Istio development team about what was fun about it and what wasn’t fun. Initially the Istio project was using the Bazel build system to build the Istio executable. I attended a few of the community meetings where folks were grumbling about some of its difficulties, and I was able to chime in with my experiences as an outside developer as well.
Here’s the meat of the adapter. The HandleMetric function takes a context object and a slice of metric instances. The handler object already contains an instantiation of the circonus-gometrics library which is the glue to interface with Circonus. The logic here is pretty simple; we loop over each metric instance, and then tell circonus gometrics to log it as either a gauge, a counter (the request and error rate metrics), or a distribution.
I’ve highlighted both the counter and distribution cases in red, since that’s where we are recording the telemetry needed for our RED dashboard. The log-linear histogram implementation is abstracted by circonus-gometrics, so when we record a distribution metric, it adds that value to an existing histogram data structure. If this all looks pretty simple, that’s because most of the work is abstracted into the circonus client.
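For anyone who wants to run the dispatch logic without pulling in Istio, here is a self-contained Go sketch of the same pattern. Everything in it is a stand-in: `Instance` replaces Mixer's `metric.Instance`, `recorder` replaces the circonus-gometrics client, and `demo` is a hypothetical driver. Unlike the slide code, it returns an error on an unconfigured metric or a failed type assertion, which is my choice for the sketch and not the adapter's behavior.

```go
package main

import (
	"fmt"
	"time"
)

// MetricType mirrors the three kinds of metric the adapter is configured with.
type MetricType int

const (
	Gauge MetricType = iota
	Counter
	Distribution
)

// Instance is a stand-in for Mixer's metric.Instance.
type Instance struct {
	Name  string
	Value interface{}
}

// recorder is a stand-in for the circonus-gometrics client.
type recorder struct {
	gauges   map[string]int64
	counters map[string]uint64
	timings  map[string][]float64 // raw durations in nanoseconds
}

type handler struct {
	rec     *recorder
	metrics map[string]MetricType // metric name -> configured type
}

// HandleMetric records each instance according to its configured type.
func (h *handler) HandleMetric(insts []*Instance) error {
	for _, inst := range insts {
		metricType, ok := h.metrics[inst.Name]
		if !ok {
			return fmt.Errorf("metric %q is not configured", inst.Name)
		}
		switch metricType {
		case Gauge:
			v, ok := inst.Value.(int64)
			if !ok {
				return fmt.Errorf("gauge %q: want int64", inst.Name)
			}
			h.rec.gauges[inst.Name] = v
		case Counter:
			h.rec.counters[inst.Name]++
		case Distribution:
			d, ok := inst.Value.(time.Duration)
			if !ok {
				return fmt.Errorf("distribution %q: want time.Duration", inst.Name)
			}
			h.rec.timings[inst.Name] = append(h.rec.timings[inst.Name], float64(d))
		}
	}
	return nil
}

// demo feeds a few made-up instances through the handler.
func demo() (*recorder, error) {
	rec := &recorder{
		gauges:   map[string]int64{},
		counters: map[string]uint64{},
		timings:  map[string][]float64{},
	}
	h := &handler{rec: rec, metrics: map[string]MetricType{
		"requestcount":    Counter,
		"requestduration": Distribution,
		"responsesize":    Gauge,
	}}
	err := h.HandleMetric([]*Instance{
		{Name: "requestcount", Value: int64(1)},
		{Name: "requestcount", Value: int64(1)},
		{Name: "requestduration", Value: 250 * time.Millisecond},
		{Name: "responsesize", Value: int64(512)},
	})
	return rec, err
}

func main() {
	rec, err := demo()
	fmt.Println(rec.counters["requestcount"], rec.gauges["responsesize"], err)
}
```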
The handler object shown in the previous slide is a struct composed of the circonus-gometrics client, an environment struct, a metrics config map, and a context cancel function. The environment element encapsulates some common functionality such as logging, an interface to a job scheduling daemon, and some other Mixer utilities. The metrics map associates each metric name that we configured in a YAML file with its metric type. The cancel struct member contains a context cancel function. This is a really cool feature of Go; it allows us to set up a context that is available to other callbacks. So when Istio shuts down, we can call handler.cancel, and it triggers a metrics flush and any needed destructors.
Here we write some YAML to tell Istio which metrics to map to which metric type. We map request count (along with the request metadata) to a counter type. We map request duration to a distribution type. We map both the request size and the response size to gauge data types.
Because Istio is designed to run at high QPS rates, we don’t syndicate metrics for each request. We tell Istio to run a daemon which flushes the circonus client at regular intervals, and also when Istio shuts down. Writing an Istio adapter is fairly straightforward. I’m giving a tutorial on it at OSCON in Portland in July. I’m happy to walk through the code in depth here also with anyone who is interested.
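The buffer-then-flush pattern, together with the context cancel function from the handler struct, can be tried outside of Mixer with the sketch below. It uses only the Go standard library; `runFlusher` and `demo` are hypothetical names, and the flush callback stands in for the Circonus client's flush call.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// runFlusher calls flush on every tick, and once more when ctx is
// cancelled so that buffered metrics are not lost at shutdown.
func runFlusher(ctx context.Context, interval time.Duration, flush func()) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			flush()
		case <-ctx.Done():
			flush() // final flush on shutdown
			return
		}
	}
}

// demo runs the flusher briefly, cancels it, and reports how many
// flushes happened (periodic flushes plus the final one).
func demo() int {
	ctx, cancel := context.WithCancel(context.Background())
	flushes := 0
	done := make(chan struct{})
	go func() {
		defer close(done)
		runFlusher(ctx, 10*time.Millisecond, func() { flushes++ })
	}()
	time.Sleep(35 * time.Millisecond)
	cancel() // plays the role of handler.cancel at shutdown
	<-done   // wait for the final flush before reading the count
	return flushes
}

func main() {
	fmt.Println("flushes:", demo())
}
```

The exact number of flushes depends on timer scheduling, but the final flush on cancellation guarantees at least one.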
So now that we’ve put the pieces together of how to use Istio to get meaningful data from our services, let’s see what kinds of questions we can answer with it.
We all love being able to solve technical problems, but not everyone has that same focus. The folks on the business side want to answer questions on how the business is doing. Let’s take the toolset we’ve assembled over the past 50 slides and align the capabilities with a couple of questions that folks here will likely be asked as SREs.
First, the big slowdown. Everyone has been through this. Marketing does a big push, and users complained the site got slow. How can we quantify the extent of how slow it was for everyone? How many users got angry? Marketing wants to know this so that they can send out a 10% discount email to the users affected (but hopefully not create a recurrence of the same problem). Let’s craft an SLI and assume that users got angry if requests took more than 500 milliseconds.
How can we calculate how many users got angry with this SLI of 500 ms? First (well, we should have already done this) is to record the request latencies as a distribution. Then we can plot them as a heatmap. We can use the distribution data to calculate the percentage of requests that exceeded our 500ms SLI by using inverse percentiles. We take that answer, multiply it by the total number of requests in that time window, and integrate over time.
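The steps above can be sketched in Go as follows. An inverse percentile answers "what fraction of requests were slower than X?" instead of "how slow was the Nth percentile?". The `Window` type, function names, and numbers are made up for illustration; because the data is binned, a request counts as slow when its bin's lower bound is at or above the threshold, so the answer is accurate to the bin width.

```go
package main

import "fmt"

// Window is one time slice of the heatmap: bin lower bound (ms) -> count.
type Window map[int64]uint64

// fractionAbove is the inverse percentile: the share of requests in w
// whose bin starts at or above thresholdMs, plus the total request count.
func fractionAbove(w Window, thresholdMs int64) (fraction float64, total uint64) {
	var above uint64
	for lower, count := range w {
		total += count
		if lower >= thresholdMs {
			above += count
		}
	}
	if total == 0 {
		return 0, 0
	}
	return float64(above) / float64(total), total
}

// slowRequests multiplies each window's slow fraction by its request
// count and sums the result over all windows (the "integrate" step).
func slowRequests(windows []Window, thresholdMs int64) float64 {
	var sum float64
	for _, w := range windows {
		fraction, total := fractionAbove(w, thresholdMs)
		sum += fraction * float64(total)
	}
	return sum
}

func main() {
	windows := []Window{
		{100: 900, 600: 100}, // 10% of 1000 requests were slow
		{100: 500, 700: 500}, // 50% of 1000 requests were slow
	}
	fmt.Printf("%.0f slow requests\n", slowRequests(windows, 500))
}
```

With these made-up windows the total comes to 600 slow requests. Note that this counts slow requests, not distinct users; turning that into "angry users" needs an extra assumption about requests per user.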
Then we can plot the result overlaid on the heatmap. I circled the slowdown part of the heatmap - the increased latency distribution is fairly indicative of a slowdown. The line on the graph indicates the total number of requests affected over time. We managed to miss our SLI for 4 million requests. Whoops. What isn’t obvious are the two additional slowdowns on the right because they are smaller in magnitude. Each of those cost us an additional 2 million SLI violations. Whoops. We can do these kinds of mathematical analyses because we are storing data as distributions, not aggregations like percentiles.
Here’s another common question. Is my service under provisioned, or over provisioned? The answer is often, it depends. Loads vary by time of day and day of week, in addition to special events. Not to mention how the system behaves under load. Let’s put some math to work and use things called latency bands to visualize how our system can perform.
This visualization shows latency distribution broken down by latency bands over time. The bands here show the number of requests that took under 25ms, between 25 and 100 ms, 100-250ms, 250-1000, and over 1000ms. The colors are grouped by fast requests as shown in green, to slow requests shown in red. What does this visualization tell us? It shows that requests to our service started off very quickly, and then the percentage of fast requests dropped off after a few minutes, and the percentage of slow requests increased after about 10 minutes. This pattern repeated itself for two traffic sessions. What does that tell us about provisioning? Well, it suggests that initially the service was over provisioned, but then became under provisioned over the course of 10-20 minutes. Sounds like a good candidate for autoscaling. We can add this type of visualization to our RED dashboard also. This type of visualization is excellent for communicating with business stakeholders, as it doesn’t require a lot of technical knowledge investment by them to understand the impact on the business.
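Counting requests per latency band is simple enough to sketch in Go. The band edges come from the description above; which band a value exactly on an edge falls into is my assumption (it goes to the slower band), and `bandIndex` and `bandCounts` are hypothetical names.

```go
package main

import "fmt"

// bandEdges are the upper limits (ms) of the first four bands; anything
// at or above the last edge lands in the fifth band.
var bandEdges = [4]float64{25, 100, 250, 1000}

var bandNames = [5]string{"under 25ms", "25-100ms", "100-250ms", "250-1000ms", "1000ms and over"}

// bandIndex returns which of the five bands a latency falls into.
func bandIndex(ms float64) int {
	for i, edge := range bandEdges {
		if ms < edge {
			return i
		}
	}
	return len(bandEdges)
}

// bandCounts tallies latencies per band for one time window.
func bandCounts(latenciesMs []float64) [5]int {
	var counts [5]int
	for _, ms := range latenciesMs {
		counts[bandIndex(ms)]++
	}
	return counts
}

func main() {
	counts := bandCounts([]float64{5, 10, 30, 120, 300, 2000})
	for i, name := range bandNames {
		fmt.Printf("%-16s %d\n", name, counts[i])
	}
}
```

Running `bandCounts` once per time window and stacking the results gives the banded view; dividing each count by the window total turns it into the percentages discussed above.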
Conclusions: We should monitor services, not containers. Services are long lived entities; containers are not. Your boss doesn’t care how your containers are performing; they care about how your services are performing. Record distributions, not aggregates. But generate aggregates from those distributions. Aggregates are very valuable sources of information, but unmergeable and not well suited to statistical analysis. Istio gives you a lot of stuff for free. You don’t have to instrument your code either. You don’t need to go and build a high quality application framework from scratch. Use math to ask questions about your services that are important to the business. That’s what this is all about, right? When we can make systems reliable by answering questions that the business values, we achieve the goals of the organization.
Thanks folks, I saved my obligatory proud parent photo for the end. Happy to answer any questions now.
