SlideShare a Scribd company logo
1 of 37
Observability for Application
Developers
About me
Abhinav Kapoor
• Principal Engineer, Nagarro
• Work as a Technical Architect for .NET Solutions.
• 15+ years of technical experience in architecting & developing solutions for Banking, Supply chain &
Embedded domains.
• I enjoy writing about software architecture, communication patterns, modernization & more on
Medium.com https://medium.com/@kapoorabhinav.
Accreditations
I‘m available on Linked at
https://www.linkedin.com/in/abhinav-kapoor
0394456/
What we’ll cover
1. Why do we need Observability?
2. How does it fit in a cloud native landscape?
3. Application Instrumentation
1. Logs
2. Traces
3. metrics
4. Health Check
4. OpenTelemetry
1. What is it & why is it needed?
2. Its components.
3. Its signals
Why do we need Observability?
● We want to avoid these reactions
Why do we need Observability?
● Issue Timeline in Operation Support
Detect Identify Fix Deploy
Issue
Start
Issue
End
Mean Time To Repair (MTTR)
Mean Time To Identify
(MTTI)
Mean Time To Detect (MTTD)
Why is Observability more critical now?
● Evolution to smaller but more services - high cohesion and low coupling.
“Complexities arises when the dependencies among the elements become important”
- We have reduced code complexity but introduced much more moving parts
Why is Observability more critical now?
Technology transformation trends
Microservices Polyglot Applications Ephemeral & cloud native environments
How does it fit in a cloud native
landscape?
How does it fit in a cloud native landscape?
Instrument Capture Transport Analyze Act
SDK, Agent
Scrapping/
Agent
Dashboard Alerts / Notifications
How does it fit in a cloud native landscape?
Active Directory / IAM
Certificate Manager
Resources – DB, Broker, API Gateway
Service A
Sidecar
Platform / Orchestrator
Service B
Sidecar
Shared
Services,
Infrastructure
Applications
Audit logs,
Activity logs,
Metrics
Central Aggregation
& Collection
Application logs,
metrics, traces Monitoring & Alerts
Access Control
Application Instrumentation –
What should I know & do as a
developer ?
Application Instrumentation
● Logs – What happened & who did it ?
● Traces – How did it happen ?
● Metrics – How much happened ?
● Health – How is the system doing ?
Big picture
● We don’t just want to have connected black boxes.
● While infrastructure & sidecars can give a good overview, there are always
blind spots.
● It speeds up development.
Logs, metrics,
Traces, Health
Logs, metrics,
Traces, Health
Logs, metrics,
Traces, Health
Logs, metrics,
Traces, Health
Logs – What Happened ?
Logs – Components of Logging
Logging Framework
Log Message
- Severity, tags
Application
SDK
Filter based on Severity
- Destination Specific
Log Providers
ILogger
Types of Logs
● Based on Information they contain – Governs where they are stored, how long
they are stored, who can access them
○ Technical – Startup configurations, system event that the application receives
○ Business - Business and User Information
● Based on the schema
○ Unstructured – Simple statements with severity.
○ Structured / Schematic logs - Logs with Metadata to provide context. Pod, Service, IP address
& more.
Duality of Logs - Readable messages &
Quarriable metrics
Traces - How did it happen ?
Traces – Types of traces
● Tracing – This is similar to logging. But noisier and collects more information
from deeper parts of the application.
● Distributed Trancing – It’s the method to observe requests as they propagate
through different services.
○ Its structured log which comes from different processes, nodes and can be stitched together
to give an end to end view.
Distributed Tracing
Trace Collector
Services Web API Broker Background Services
Traces
HTTP Request
with tracing
context
TCP message
with tracing
context
TCP message
with tracing
context
DB
Store tracing
context with
data in db
Services
Read tracing
context
• Correlation IDs are passed in the remote calls/messages.
• The receiver regenerates context proceeds with its unit of work.
• Depending upon business transaction, traces can be of few seconds to hours or days.
Metrics–How much is happening?
Metrics – Components
Vendor specific
Instrumentation &
collection
Metrics-
Name, description, Labels &
Value
Application
Backend – time series database
- Adds timestamp
- Supports aggregation (eg PromQL)
SDK
/Metrics
scrape
Visualization
PromQL expressions
Alerts
Types of Prometheus Metrics - Counters
Track the occurrence of an event in application.
● The absolute value may not be helpful.
● The delta between two timestamps or the rate of change over time is helpful. Functions like Increase(),
Rate().
● Consider the case when applications restart.
● Examples - Orders processed counter, unsuccessful counters, business error, technical errors.
● Sample output of meta-endpoint (/metrics)
# HELP http_requests_total Total number of http api requests
# TYPE http_requests_total
counter http_requests_total{api=“add_Product“, company=“ITC”} 3433
counter http_requests_total{api=“add_Product“, company =“TATA”} 2000
PromQL
increase(http_requests_total{api="add_product“, company=“ITC”}[5m])
Types of Prometheus Metrics - Gauges
Snapshots of a metric at a single point in time. Not an event.
● Example is message queue size, CPU utilization at a given time.
● Useful functions - avg_over_time, max_over_time,
min_over_time, and quantile_over_time
Types of Prometheus Metrics - Histograms
● Histograms divide the entire range of measurements into a set of intervals—
named buckets—and count how many measurements fall into each bucket.
● These buckets are defined at compile time, they are upper inclusive (le).
● Histogram looks like the following
http_request_duration_seconds_sum{api="add_product" instance=" host1 "} 8953.332
http_request_duration_seconds_count{api="add_product" instance=" host1"} 27892
http_request_duration_seconds_bucket{api="add_product", instance=" host1", le="0.05"}
1672
http_request_duration_seconds_bucket{api="add_product", instance=" host1", le="0.1"} 8954
http_request_duration_seconds_bucket{api="add_product", instance=" host1", le="0.25"}
14251
sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
Types of Prometheus Metrics - Summaries
● Similar to Histogram but they add quantile label to bound the summary.
● OpenTelemetry has been marked it legacy.
● Summary looks like the following
http_request_duration_seconds_sum{api="add_product" instance="host1.domain.com"} 8953.332
http_request_duration_seconds_count{api="add_product" instance="host1.domain.com"} 27892
http_request_duration_seconds{api="add_product" instance=" host1 " quantile="0"}
http_request_duration_seconds{api="add_product" instance="host1" quantile="0.5"} 0.232227334
● Considerations due to quantile.
Metrics - What happens after instrumentation?
● Enrichthe application instrumentation with infrastructure information such as
Service Name, Pod, IP address.
● The backend saves it with timestamp
● Add Alerts & dashboards. Help available at
https://grafana.com/grafana/dashboards/
Health Check– How is the system
doing ?
Health check
● Basic use is liveness probe.
● More useful when we add dependencies.
● Prebuilt libraries depending upon the language
● Considerations
○ Orchestrator may restart the service while the issue is in a
dependency.
○ Checking all downstream services could be time consuming.
○ Meta-end points (/health) may not be exposed. Instead of
writing to the response metrics may be used to indicate
issue.
OpenTelemetry
What is OpenTelemetry ?
● It is an open-source observability framework for infrastructure & application
instrumentation.
● It’s the second-highest ranked CNCF project by activity and contributors after
K8s
● It aims to be vendor agnostic & the SDK is produced by OpenTelemetry.
● Its still young so not all features are available for all languages.
● As Open Telemetry doesn’t provide a backend implementation (its concern is
creating, collecting, and sending signals), the data will flow to another system
or systems for storage and querying.
Why OpenTelemetry?
● Before Open Telemetry
○ APM vendors had their own things – SDKs,
Agents, Grammer
○ Vendor lock-in.
○ Application developers, built wrappers,
event hooks to keep application isolated.
○ Competing Open standards – OpenTracing
(Traces), OpenCensus (Metrics, Traces)
OpenTelemetry Library -
Instrumentation (Automatic & Manual)
Exporters
Exporter
- Backends/APM
Application
Trace, metrics, Logs
SDK
Components of OpenTelemetry
Collector
Multiple receivers
Multiple exporters
Signals of OpenTelemetry - Metrics
● Differences from Prometheus metrics
○ Summary is marked deprecated
○ Some more functions & data type flexibility
Signals of OpenTelemetry -Tracing
Data model of an OpenTelemetry Span
Example tracing for a news article.
Source - https://www.timescale.com/blog/what-are-traces-and-how-sql-yes-sql-
and-opentelemetry-can-help-us-get-more-value-out-of-traces-to-build-better-
software/
Questions ?
Thank you!

More Related Content

Similar to Observability for Application Developers (1)-1.pptx

Major Project Report on Designing an Android Application for Electrical Maint...
Major Project Report on Designing an Android Application for Electrical Maint...Major Project Report on Designing an Android Application for Electrical Maint...
Major Project Report on Designing an Android Application for Electrical Maint...
Amit Kumar
 

Similar to Observability for Application Developers (1)-1.pptx (20)

Go Observability (in practice)
Go Observability (in practice)Go Observability (in practice)
Go Observability (in practice)
 
Dot Net performance monitoring
 Dot Net performance monitoring Dot Net performance monitoring
Dot Net performance monitoring
 
Prometheus-Grafana-RahulSoni1584KnolX.pptx.pdf
Prometheus-Grafana-RahulSoni1584KnolX.pptx.pdfPrometheus-Grafana-RahulSoni1584KnolX.pptx.pdf
Prometheus-Grafana-RahulSoni1584KnolX.pptx.pdf
 
Major Project Report on Designing an Android Application for Electrical Maint...
Major Project Report on Designing an Android Application for Electrical Maint...Major Project Report on Designing an Android Application for Electrical Maint...
Major Project Report on Designing an Android Application for Electrical Maint...
 
SplunkLive! Zurich 2018: Monitoring the End User Experience with Splunk
SplunkLive! Zurich 2018: Monitoring the End User Experience with SplunkSplunkLive! Zurich 2018: Monitoring the End User Experience with Splunk
SplunkLive! Zurich 2018: Monitoring the End User Experience with Splunk
 
Why we decided on RSA Security Analytics for network visibility
Why we decided on RSA Security Analytics for network visibilityWhy we decided on RSA Security Analytics for network visibility
Why we decided on RSA Security Analytics for network visibility
 
SplunkLive! Munich 2018: Monitoring the End-User Experience with Splunk
SplunkLive! Munich 2018: Monitoring the End-User Experience with SplunkSplunkLive! Munich 2018: Monitoring the End-User Experience with Splunk
SplunkLive! Munich 2018: Monitoring the End-User Experience with Splunk
 
Modern Monitoring
Modern MonitoringModern Monitoring
Modern Monitoring
 
Importance of ‘Centralized Event collection’ and BigData platform for Analysis !
Importance of ‘Centralized Event collection’ and BigData platform for Analysis !Importance of ‘Centralized Event collection’ and BigData platform for Analysis !
Importance of ‘Centralized Event collection’ and BigData platform for Analysis !
 
Splunk MINT for Mobile Intelligence and Splunk App for Stream for Enhanced Op...
Splunk MINT for Mobile Intelligence and Splunk App for Stream for Enhanced Op...Splunk MINT for Mobile Intelligence and Splunk App for Stream for Enhanced Op...
Splunk MINT for Mobile Intelligence and Splunk App for Stream for Enhanced Op...
 
sagar
sagarsagar
sagar
 
sat_presentation
sat_presentationsat_presentation
sat_presentation
 
3 types of monitoring for 2020
3 types of monitoring for 20203 types of monitoring for 2020
3 types of monitoring for 2020
 
Arunprakash Alagesan
Arunprakash AlagesanArunprakash Alagesan
Arunprakash Alagesan
 
Data Ops at TripActions
Data Ops at TripActionsData Ops at TripActions
Data Ops at TripActions
 
SplunkLive! Frankfurt 2018 - Monitoring the End User Experience with Splunk
SplunkLive! Frankfurt 2018 - Monitoring the End User Experience with SplunkSplunkLive! Frankfurt 2018 - Monitoring the End User Experience with Splunk
SplunkLive! Frankfurt 2018 - Monitoring the End User Experience with Splunk
 
Presentation tritan erp service
Presentation tritan erp servicePresentation tritan erp service
Presentation tritan erp service
 
OSMC 2023 | Journey to observability: tracking every function execution in pr...
OSMC 2023 | Journey to observability: tracking every function execution in pr...OSMC 2023 | Journey to observability: tracking every function execution in pr...
OSMC 2023 | Journey to observability: tracking every function execution in pr...
 
SANJAY_SINGH
SANJAY_SINGHSANJAY_SINGH
SANJAY_SINGH
 
Requirement analysis
Requirement analysisRequirement analysis
Requirement analysis
 

More from OpsTree solutions

More from OpsTree solutions (19)

Deliver Latency Free Customer Experience
Deliver Latency Free Customer ExperienceDeliver Latency Free Customer Experience
Deliver Latency Free Customer Experience
 
How can Enterprises benefit from GitOps.pdf
How can  Enterprises benefit from GitOps.pdfHow can  Enterprises benefit from GitOps.pdf
How can Enterprises benefit from GitOps.pdf
 
OpenTelemetry - A powerful new standard for observability
OpenTelemetry - A powerful new standard for observabilityOpenTelemetry - A powerful new standard for observability
OpenTelemetry - A powerful new standard for observability
 
Getting started with Loki on GKE
Getting started with Loki on GKEGetting started with Loki on GKE
Getting started with Loki on GKE
 
Graviton Migration on AWS
Graviton Migration on AWSGraviton Migration on AWS
Graviton Migration on AWS
 
Graviton Migration on AWS
Graviton Migration on AWSGraviton Migration on AWS
Graviton Migration on AWS
 
Sonar use case (4).pdf
Sonar use case  (4).pdfSonar use case  (4).pdf
Sonar use case (4).pdf
 
Enabling Governance with Jira For Change Management - BuildPiper
Enabling Governance with Jira For Change Management - BuildPiperEnabling Governance with Jira For Change Management - BuildPiper
Enabling Governance with Jira For Change Management - BuildPiper
 
Application Delivery & Change Management - BuildPiper
Application Delivery & Change Management - BuildPiperApplication Delivery & Change Management - BuildPiper
Application Delivery & Change Management - BuildPiper
 
Graviton Migration on AWS - Achieve cost efficiency
Graviton Migration on AWS - Achieve cost efficiency Graviton Migration on AWS - Achieve cost efficiency
Graviton Migration on AWS - Achieve cost efficiency
 
Everything about Blue-Green Deployment Strategy!
Everything about Blue-Green Deployment Strategy!Everything about Blue-Green Deployment Strategy!
Everything about Blue-Green Deployment Strategy!
 
SRE Fundamentals
SRE FundamentalsSRE Fundamentals
SRE Fundamentals
 
CANARY DEPLOYMENT
CANARY DEPLOYMENTCANARY DEPLOYMENT
CANARY DEPLOYMENT
 
Managed Devsecops Approach for Tech Startups
Managed Devsecops Approach for Tech StartupsManaged Devsecops Approach for Tech Startups
Managed Devsecops Approach for Tech Startups
 
Secure continuous integration and deployment features
Secure continuous integration and deployment featuresSecure continuous integration and deployment features
Secure continuous integration and deployment features
 
BuildPiper - Delivering Value at Scale
BuildPiper - Delivering Value at ScaleBuildPiper - Delivering Value at Scale
BuildPiper - Delivering Value at Scale
 
Infographics - Microservices to grow your buisness
Infographics - Microservices to grow your buisnessInfographics - Microservices to grow your buisness
Infographics - Microservices to grow your buisness
 
Prometheus workshop
Prometheus workshopPrometheus workshop
Prometheus workshop
 
Managed Kubernetes - 5 Reasons Why your business needs it!"
Managed Kubernetes - 5 Reasons Why your business needs it!"Managed Kubernetes - 5 Reasons Why your business needs it!"
Managed Kubernetes - 5 Reasons Why your business needs it!"
 

Recently uploaded

Recently uploaded (20)

IESVE for Early Stage Design and Planning
IESVE for Early Stage Design and PlanningIESVE for Early Stage Design and Planning
IESVE for Early Stage Design and Planning
 
Demystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John StaveleyDemystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John Staveley
 
Syngulon - Selection technology May 2024.pdf
Syngulon - Selection technology May 2024.pdfSyngulon - Selection technology May 2024.pdf
Syngulon - Selection technology May 2024.pdf
 
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
 
AI revolution and Salesforce, Jiří Karpíšek
AI revolution and Salesforce, Jiří KarpíšekAI revolution and Salesforce, Jiří Karpíšek
AI revolution and Salesforce, Jiří Karpíšek
 
Strategic AI Integration in Engineering Teams
Strategic AI Integration in Engineering TeamsStrategic AI Integration in Engineering Teams
Strategic AI Integration in Engineering Teams
 
What's New in Teams Calling, Meetings and Devices April 2024
What's New in Teams Calling, Meetings and Devices April 2024What's New in Teams Calling, Meetings and Devices April 2024
What's New in Teams Calling, Meetings and Devices April 2024
 
Powerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara LaskowskaPowerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara Laskowska
 
Connecting the Dots in Product Design at KAYAK
Connecting the Dots in Product Design at KAYAKConnecting the Dots in Product Design at KAYAK
Connecting the Dots in Product Design at KAYAK
 
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya HalderCustom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
Custom Approval Process: A New Perspective, Pavel Hrbacek & Anindya Halder
 
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdfIntroduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
 
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
 
Oauth 2.0 Introduction and Flows with MuleSoft
Oauth 2.0 Introduction and Flows with MuleSoftOauth 2.0 Introduction and Flows with MuleSoft
Oauth 2.0 Introduction and Flows with MuleSoft
 
TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...
TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...
TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...
 
Speed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in MinutesSpeed Wins: From Kafka to APIs in Minutes
Speed Wins: From Kafka to APIs in Minutes
 
IoT Analytics Company Presentation May 2024
IoT Analytics Company Presentation May 2024IoT Analytics Company Presentation May 2024
IoT Analytics Company Presentation May 2024
 
Designing for Hardware Accessibility at Comcast
Designing for Hardware Accessibility at ComcastDesigning for Hardware Accessibility at Comcast
Designing for Hardware Accessibility at Comcast
 
WSO2CONMay2024OpenSourceConferenceDebrief.pptx
WSO2CONMay2024OpenSourceConferenceDebrief.pptxWSO2CONMay2024OpenSourceConferenceDebrief.pptx
WSO2CONMay2024OpenSourceConferenceDebrief.pptx
 
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptxUnpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
 
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
 

Observability for Application Developers (1)-1.pptx

  • 2. About me Abhinav Kapoor • Principal Engineer, Nagarro • Work as a Technical Architect for .NET Solutions. • 15+ years of technical experience in architecting & developing solutions for Banking, Supply chain & Embedded domains. • I enjoy writing about software architecture, communication patterns, modernization & more on Medium.com https://medium.com/@kapoorabhinav. Accreditations I‘m available on Linked at https://www.linkedin.com/in/abhinav-kapoor 0394456/
  • 3. What we’ll cover 1. Why do we need Observability? 2. How does it fit in a cloud native landscape? 3. Application Instrumentation 1. Logs 2. Traces 3. metrics 4. Health Check 4. OpenTelemetry 1. What is it & why is it needed? 2. Its components. 3. Its signals
  • 4. Why do we need Observability? ● We want to avoid these reactions
  • 5. Why do we need Observability? ● Issue Timeline in Operation Support Detect Identify Fix Deploy Issue Start Issue End Mean Time To Repair (MTTR) Mean Time To Identify (MTTI) Mean Time To Detect (MTTD)
  • 6. Why is Observability more critical now? ● Evolution to smaller but more services - high cohesion and low coupling. “Complexities arises when the dependencies among the elements become important” - We have reduced code complexity but introduced much more moving parts
  • 7. Why is Observability more critical now? Technology transformation trends Microservices Polyglot Applications Ephemeral & cloud native environments
  • 8. How does it fit in a cloud native landscape?
  • 9. How does it fit in a cloud native landscape? Instrument Capture Transport Analyze Act SDK, Agent Scrapping/ Agent Dashboard Alerts / Notifications
  • 10. How does it fit in a cloud native landscape? Active Directory / IAM Certificate Manager Resources – DB, Broker, API Gateway Service A Sidecar Platform / Orchestrator Service B Sidecar Shared Services, Infrastructure Applications Audit logs, Activity logs, Metrics Central Aggregation & Collection Application logs, metrics, traces Monitoring & Alerts Access Control
  • 11. Application Instrumentation – What should I know & do as a developer ?
  • 12. Application Instrumentation ● Logs – What happened & who did it ? ● Traces – How did it happen ? ● Metrics – How much happened ? ● Health – How is the system doing ?
  • 13. Big picture ● We don’t just want to have connected black boxes. ● While infrastructure & sidecars can give a good overview, there are always blind spots. ● It speeds up development. Logs, metrics, Traces, Health Logs, metrics, Traces, Health Logs, metrics, Traces, Health Logs, metrics, Traces, Health
  • 14. Logs – What Happened ?
  • 15. Logs – Components of Logging Logging Framework Log Message - Severity, tags Application SDK Filter based on Severity - Destination Specific Log Providers ILogger
  • 16. Types of Logs ● Based on Information they contain – Governs where they are stored, how long they are stored, who can access them ○ Technical – Startup configurations, system event that the application receives ○ Business - Business and User Information ● Based on the schema ○ Unstructured – Simple statements with severity. ○ Structured / Schematic logs - Logs with Metadata to provide context. Pod, Service, IP address & more.
  • 17. Duality of Logs - Readable messages & Quarriable metrics
  • 18. Traces - How did it happen ?
  • 19. Traces – Types of traces ● Tracing – This is similar to logging. But noisier and collects more information from deeper parts of the application. ● Distributed Trancing – It’s the method to observe requests as they propagate through different services. ○ Its structured log which comes from different processes, nodes and can be stitched together to give an end to end view.
  • 20. Distributed Tracing Trace Collector Services Web API Broker Background Services Traces HTTP Request with tracing context TCP message with tracing context TCP message with tracing context DB Store tracing context with data in db Services Read tracing context • Correlation IDs are passed in the remote calls/messages. • The receiver regenerates context proceeds with its unit of work. • Depending upon business transaction, traces can be of few seconds to hours or days.
  • 21. Metrics–How much is happening?
  • 22. Metrics – Components Vendor specific Instrumentation & collection Metrics- Name, description, Labels & Value Application Backend – time series database - Adds timestamp - Supports aggregation (eg PromQL) SDK /Metrics scrape Visualization PromQL expressions Alerts
  • 23. Types of Prometheus Metrics - Counters Track the occurrence of an event in application. ● The absolute value may not be helpful. ● The delta between two timestamps or the rate of change over time is helpful. Functions like Increase(), Rate(). ● Consider the case when applications restart. ● Examples - Orders processed counter, unsuccessful counters, business error, technical errors. ● Sample output of meta-endpoint (/metrics) # HELP http_requests_total Total number of http api requests # TYPE http_requests_total counter http_requests_total{api=“add_Product“, company=“ITC”} 3433 counter http_requests_total{api=“add_Product“, company =“TATA”} 2000 PromQL increase(http_requests_total{api="add_product“, company=“ITC”}[5m])
  • 24. Types of Prometheus Metrics - Gauges Snapshots of a metric at a single point in time. Not an event. ● Example is message queue size, CPU utilization at a given time. ● Useful functions - avg_over_time, max_over_time, min_over_time, and quantile_over_time
  • 25. Types of Prometheus Metrics - Histograms ● Histograms divide the entire range of measurements into a set of intervals— named buckets—and count how many measurements fall into each bucket. ● These buckets are defined at compile time, they are upper inclusive (le). ● Histogram looks like the following http_request_duration_seconds_sum{api="add_product" instance=" host1 "} 8953.332 http_request_duration_seconds_count{api="add_product" instance=" host1"} 27892 http_request_duration_seconds_bucket{api="add_product", instance=" host1", le="0.05"} 1672 http_request_duration_seconds_bucket{api="add_product", instance=" host1", le="0.1"} 8954 http_request_duration_seconds_bucket{api="add_product", instance=" host1", le="0.25"} 14251 sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
  • 26. Types of Prometheus Metrics - Summaries ● Similar to Histogram but they add quantile label to bound the summary. ● OpenTelemetry has been marked it legacy. ● Summary looks like the following http_request_duration_seconds_sum{api="add_product" instance="host1.domain.com"} 8953.332 http_request_duration_seconds_count{api="add_product" instance="host1.domain.com"} 27892 http_request_duration_seconds{api="add_product" instance=" host1 " quantile="0"} http_request_duration_seconds{api="add_product" instance="host1" quantile="0.5"} 0.232227334 ● Considerations due to quantile.
  • 27. Metrics - What happens after instrumentation? ● Enrichthe application instrumentation with infrastructure information such as Service Name, Pod, IP address. ● The backend saves it with timestamp ● Add Alerts & dashboards. Help available at https://grafana.com/grafana/dashboards/
  • 28. Health Check– How is the system doing ?
  • 29. Health check ● Basic use is liveness probe. ● More useful when we add dependencies. ● Prebuilt libraries depending upon the language ● Considerations ○ Orchestrator may restart the service while the issue is in a dependency. ○ Checking all downstream services could be time consuming. ○ Meta-end points (/health) may not be exposed. Instead of writing to the response metrics may be used to indicate issue.
  • 31. What is OpenTelemetry ? ● It is an open-source observability framework for infrastructure & application instrumentation. ● It’s the second-highest ranked CNCF project by activity and contributors after K8s ● It aims to be vendor agnostic & the SDK is produced by OpenTelemetry. ● Its still young so not all features are available for all languages. ● As Open Telemetry doesn’t provide a backend implementation (its concern is creating, collecting, and sending signals), the data will flow to another system or systems for storage and querying.
  • 32. Why OpenTelemetry? ● Before Open Telemetry ○ APM vendors had their own things – SDKs, Agents, Grammer ○ Vendor lock-in. ○ Application developers, built wrappers, event hooks to keep application isolated. ○ Competing Open standards – OpenTracing (Traces), OpenCensus (Metrics, Traces)
  • 33. OpenTelemetry Library - Instrumentation (Automatic & Manual) Exporters Exporter - Backends/APM Application Trace, metrics, Logs SDK Components of OpenTelemetry Collector Multiple receivers Multiple exporters
  • 34. Signals of OpenTelemetry - Metrics ● Differences from Prometheus metrics ○ Summary is marked deprecated ○ Some more functions & data type flexibility
  • 35. Signals of OpenTelemetry -Tracing Data model of an OpenTelemetry Span Example tracing for a news article. Source - https://www.timescale.com/blog/what-are-traces-and-how-sql-yes-sql- and-opentelemetry-can-help-us-get-more-value-out-of-traces-to-build-better- software/

Editor's Notes

  1. We’ll take the questions at the end.
  2. Monitoring is collecting metrics & logs Observability is the ability to observe what is going on inside the system. Consider it an NFR.
  3. MTTD -> metrics or synthetic tests MTTI -> Traces
  4. Emergence for micro-services – Lot more Story of a certificate expiring somewhere instead of a single server.
  5. Everything writing to one file. Or few files is lot more simple. Lot more moving parts. Speed of development
  6. Conceptual View
  7. Its happening at multiple levels
  8. Logs – important events. State changes, who did. Trances – Are flow within the program or outside the program. Matrics – are numbers which can be aggregated to have formulas – SLOs, KPI & alerts Health – current state of health
  9. Automatic & manual – instrumentation
  10. Cost!!
  11. All services should have it.
  12. Do consider the situation when it turns to zero. https://grafana.com/grafana/dashboards/ including ones from OPSTREE
  13. May be used in cases where strict KPIs are enforced on duration for example
  14. https://grafana.com/grafana/dashboards/ Which help me detect issue & fix the issues faster. Which help management find datapoints. Which can report compliance & SLO.
  15. Alternatives 1. Have a monitoring on solution 2. cache the results
  16. Receivers can be used in migration
  17. As a developer, you don’t really need to worry about the models—but having a basic understanding helps.