You have a mature development process, and you also embrace DevOps. Your development team uses an agile methodology. You use Git, and you have continuous everything: development, testing, and deployment. But do you sleep well at night? Do you know that your services are up and running? That there are no availability, performance, or stability problems? Do you know if your customers are happy?
The answers to all of those questions are precisely what APM systems provide.
Application Performance Monitoring systems have become the IDE of Site Reliability Engineers (SREs) and, as a matter of fact, of the whole DevOps team, including the Dev part. In this lecture, you will get to know the essence of APM systems: the good, the bad, and the vision for their future.
3. About Me
Alon Fliess:
Chief Software Architect & Co-Founder at OzCode & CodeValue
More than 30 years of hands-on experience
Microsoft Regional Director & Microsoft Azure MVP
Spend most of my time in project analysis, architecture, design
Code at night
5. Agenda
DevOps, the true story
Microservice Architecture, the complexity shift
Ops & Monitoring
Site Reliability Managers
Developers & Observability
Business (marketing, sales, management) and observability
Application Performance Monitoring
How does it work?
Distributed Tracing
Production problem solving
6. The Essence of DevOps
Better Software, Faster! When Development and Operations Synergize
Covers the *entire* Application Lifecycle
13. Gartner
Critical Capabilities for APM (May 2019)
Capabilities: Business Analysis, Anomaly Detection
Use Cases: IT Operations, DevOps Release, Application Support, Application Development, Application Owner
16. How Does Monitoring & Tracing Work?
Operating Systems
    APM system tracking agent installed on the machine
    CPU, Memory, I/O, Network
Code Tracing
    Instrumentation
        Manual
        Auto
    Runtime data collection
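To make the difference between manual and auto instrumentation concrete, here is a minimal Python sketch of the "Auto" idea: a wrapper adds counting and timing around a function without touching its body. The names `instrumented`, `call_counts`, `error_counts`, and `latencies` are hypothetical, and the in-memory dictionaries only stand in for whatever a real APM agent would send to its backend.

```python
import functools
import time
from collections import defaultdict

# Hypothetical in-memory stores; a real agent would ship these to a backend.
call_counts = defaultdict(int)
error_counts = defaultdict(int)
latencies = defaultdict(list)


def instrumented(func):
    """Wrap func so every call is counted and timed without editing its body."""
    name = func.__name__

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        call_counts[name] += 1
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        except Exception:
            error_counts[name] += 1
            raise
        finally:
            # Runs on success and on failure, so every call gets a latency sample.
            latencies[name].append(time.perf_counter() - start)

    return wrapper


@instrumented
def add_to_basket(product_id, quantity):
    # Business code only; no logging or metrics calls appear here.
    if quantity < 0:
        raise ValueError("Negative quantity value")
    return True
```

Real agents usually achieve the same effect by hooking the runtime or rewriting code at load time, so not even the decorator line has to be written by hand.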
17. Instrumentation – Original Pseudo Code
Function AddToBasket(var productId, var quantity)
    if (quantity < 0)
        return false
    var product = Dal.GetProductById(productId)
    BasketService.Add(product, quantity)
    return true
18. Instrumentation – Add Logging on Errors
Function AddToBasket(var productId, var quantity)
    if (quantity < 0)
        Log("Error: Negative quantity value")
        return false
    var product = Dal.GetProductById(productId)
    BasketService.Add(product, quantity)
    return true
19. Instrumentation – Add Metrics of Usage and Errors
Function AddToBasket(var productId, var quantity)
    metrics.Count("AddToBasket", 1)
    if (quantity < 0)
        Log("Error: Negative quantity value")
        metrics.Count("AddToBasketFailure", 1)
        return false
    var product = Dal.GetProductById(productId)
    BasketService.Add(product, quantity)
    return true
20. Instrumentation – Measure Latency
Function AddToBasket(var productId, var quantity)
    metrics.Count("AddToBasket", 1)
    start = time()
    if (quantity < 0)
        Log("Error: Negative quantity value")
        metrics.Count("AddToBasketFailure", 1)
        return false
    var product = Dal.GetProductById(productId)
    BasketService.Add(product, quantity)
    metrics.Measure("AddToBasket", time() - start)
    return true
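The pseudocode above can be turned into something runnable. The following is a sketch in Python, not the code behind any particular APM product: `counters`, `timings`, `log_lines`, `basket`, and `get_product_by_id` are hypothetical stand-ins for `metrics.Count`, `metrics.Measure`, `Log`, `BasketService`, and `Dal.GetProductById`. One deliberate difference from the slide: the measurement sits in a `finally` block, so the failure path also records its latency.

```python
import time
from collections import Counter

counters = Counter()  # hypothetical stand-in for metrics.Count
timings = {}          # hypothetical stand-in for metrics.Measure
log_lines = []        # hypothetical stand-in for Log
basket = []           # hypothetical stand-in for BasketService


def get_product_by_id(product_id):
    """Hypothetical stand-in for Dal.GetProductById."""
    return {"id": product_id}


def add_to_basket(product_id, quantity):
    counters["AddToBasket"] += 1
    start = time.perf_counter()
    try:
        if quantity < 0:
            log_lines.append("Error: Negative quantity value")
            counters["AddToBasketFailure"] += 1
            return False
        product = get_product_by_id(product_id)
        basket.append((product, quantity))
        return True
    finally:
        # Executed on both return paths, unlike the early return in the slide.
        elapsed = time.perf_counter() - start
        timings.setdefault("AddToBasket", []).append(elapsed)
```

Notice how three lines of business logic are now surrounded by monitoring code, which is the cost of manual instrumentation.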
26. Instrumentation – Call Context
Function AddToBasket(var productId, var quantity, var context)
    debug.AddParameters(context, "AddToBasket", [["productId", productId], ["quantity", quantity]])
    metrics.Count(context, "AddToBasket", 1)
    start = time()
    if (quantity < 0)
        Log(context, "Error: Negative quantity value")
        metrics.Count(context, "AddToBasketFailure", 1)
        debug.AddError(context, "AddToBasket", GetErrorData())
        return false
    var product = Dal.GetProductById(context, productId)
    debug.AddValue(context, "AddToBasket", [["product", product]])
    metrics.Measure(context, "AddToBasket_GetProductById", time() - start)
    BasketService.Add(context, product, quantity)
    metrics.Measure(context, "AddToBasket", time() - start)
    return true
Context:
Call Id
URL
HTTP Method
DB Host
User Info
Timing Info
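Passing `context` through every call is noisy. As one possible alternative, here is a sketch that uses Python's standard `contextvars` module to carry the call context implicitly. The field names (`call_id`, `url`, `http_method`) mirror a few of the context items listed above; `handle_request` and `log` are hypothetical helpers, not part of any APM SDK.

```python
import contextvars
import uuid

# Holds the context of the request currently being handled.
request_context = contextvars.ContextVar("request_context", default=None)
log_lines = []  # hypothetical log sink


def log(message):
    """Prefix every log line with the call id and URL from the current context."""
    ctx = request_context.get() or {}
    prefix = f"[call_id={ctx.get('call_id')} url={ctx.get('url')}]"
    log_lines.append(f"{prefix} {message}")


def add_to_basket(product_id, quantity):
    # No context parameter: log() finds it through the ContextVar.
    if quantity < 0:
        log("Error: Negative quantity value")
        return False
    return True


def handle_request(url, http_method, product_id, quantity):
    """Hypothetical entry point: set the context once, clear it when done."""
    token = request_context.set(
        {"call_id": uuid.uuid4().hex, "url": url, "http_method": http_method}
    )
    try:
        return add_to_basket(product_id, quantity)
    finally:
        request_context.reset(token)
```

Calling `handle_request("/basket", "POST", 1, -5)` returns `False` and leaves one log line tagged with the call id and URL, even though `add_to_basket` never received a context argument.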
27. Instrumentation – Using Span
Function AddToBasket(var productId, var quantity, var context)
    span = trace.BeginSpan(context, {"AddToBasket", productId, quantity})
    if (quantity < 0)
        span.Error("Negative quantity value")
        return false
    var product = Dal.GetProductById(context, productId)
    span.AddValue("product", product)
    BasketService.Add(context, product, quantity)
    span.End()
    return true
Span:
Call Id
URL
HTTP Method
DB Host
User Info
Timing Info
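To show how a span folds logging, metrics, and timing into one object, here is a minimal Python sketch. The `Span` class, `begin_span`, and `finished_spans` are hypothetical and only imitate the shape of the pseudocode; they are not the API of any tracing library. A context manager is used so that the span ends on every exit path, including the early return.

```python
import time
from contextlib import contextmanager

finished_spans = []  # hypothetical exporter: ended spans are collected here


class Span:
    def __init__(self, name, **values):
        self.name = name
        self.values = dict(values)
        self.error = None
        self.start = time.perf_counter()
        self.duration = None

    def add_value(self, key, value):
        self.values[key] = value

    def set_error(self, message):
        self.error = message

    def end(self):
        self.duration = time.perf_counter() - self.start
        finished_spans.append(self)


@contextmanager
def begin_span(name, **values):
    span = Span(name, **values)
    try:
        yield span
    except Exception as exc:
        span.set_error(repr(exc))
        raise
    finally:
        span.end()  # timing is recorded on success, error, and early return


def add_to_basket(product_id, quantity):
    with begin_span("AddToBasket", product_id=product_id, quantity=quantity) as span:
        if quantity < 0:
            span.set_error("Negative quantity value")
            return False
        product = {"id": product_id}  # stand-in for Dal.GetProductById
        span.add_value("product", product)
        return True
```

Compared with the call-context version, the business function carries one `with` line and two span calls instead of separate log, metric, and debug calls.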
31. APM Error Analysis – Not Enough Information
Error Rate
Request information
Stack trace
APM systems can assist in health monitoring and fault first aid
32. Production Problem Solving Challenges
Can’t mess with data
No debugging tools
Code is optimized
Older source code version
Can’t impact performance
Data must stay in a secure environment
Data is private and contains PII
Very hard to reproduce the bug
MSA – many small parts that are deployed and communicate with each other
Simple components, complex combination
Very hard to follow a request that spans many services
Must have an automated process to overcome the complexity
Must have health monitoring, performance monitoring, and cross-service error handling
TOOLS!!!
More than CI/CD
Ops: the first-aid medic who takes the vital signs
CPU, Network, IO, Memory
Request throughput and latency
Wants an easy life.
Eats the meal that the Dev team cooked.
The customer of the Dev team
Bugs, Problem Solving
Need to know the current situation and the current problems
For example, they can roll back to a previous version, but they need to know the status of the bug fix
Information,
Debuggability
Reproduce the problem
Analytics
Business Insights
Usage
As Twitter has moved from a monolithic to a distributed architecture, our scalability has increased dramatically.
Zipkin – a distributed tracing system (https://zipkin.io/)
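A distributed tracing system can follow one request across services because every hop forwards a few identifiers. The sketch below illustrates the idea in Python using the header names of Zipkin's B3 propagation format (`X-B3-TraceId`, `X-B3-SpanId`, `X-B3-ParentSpanId`, `X-B3-Sampled`). The helper functions and the three-hop scenario are hypothetical; a real service would rely on a tracing library rather than build these headers by hand.

```python
import secrets


def new_id():
    """Return a random 64-bit id as 16 lower-hex characters."""
    return secrets.token_hex(8)


def start_trace():
    """Headers for the first service that receives a request."""
    return {
        "X-B3-TraceId": new_id(),
        "X-B3-SpanId": new_id(),
        "X-B3-Sampled": "1",
    }


def child_headers(incoming):
    """Headers a service attaches to a call it makes while handling `incoming`."""
    return {
        "X-B3-TraceId": incoming["X-B3-TraceId"],      # same trace on every hop
        "X-B3-ParentSpanId": incoming["X-B3-SpanId"],  # the caller becomes the parent
        "X-B3-SpanId": new_id(),                       # fresh id for this hop
        "X-B3-Sampled": incoming.get("X-B3-Sampled", "1"),
    }


# Hypothetical request path: frontend -> basket service -> database gateway.
frontend = start_trace()
basket_call = child_headers(frontend)
db_call = child_headers(basket_call)
```

All three hops share one trace id, and the parent ids link them into a tree, which is what lets the tracing backend draw the request as a single timeline.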
Business Analysis - business related KPI
IT Service Monitoring - health of Key Services
Root Cause Analysis - a failure or degradation
Anomaly Detection - identifying system observations that do not conform to expected behavior
Distributed Profiling - tracking transactions across a mesh of interconnected nodes, then detecting where along the path the degradation appears to be happening
Application Debugging - production debugging capabilities, based on distributed data collection
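As an illustration of the anomaly detection capability in the list above, here is a deliberately simple Python sketch: a sample is flagged when it is more than a few standard deviations away from the mean of the samples just before it. The function name, the window size, the threshold, and the latency numbers are all assumptions made for the example; commercial APM products use far more elaborate models.

```python
from statistics import mean, stdev


def find_anomalies(samples, window=10, threshold=3.0):
    """Return the indexes of samples that deviate from the preceding
    `window` samples by more than `threshold` standard deviations."""
    anomalies = []
    for i in range(window, len(samples)):
        baseline = samples[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        # A flat baseline (sigma == 0) gives no scale to compare against.
        if sigma > 0 and abs(samples[i] - mu) > threshold * sigma:
            anomalies.append(i)
    return anomalies


# Made-up request latencies in milliseconds, with one obvious spike.
latencies_ms = [100, 102, 98, 101, 99, 103, 97, 100, 102, 98, 101, 450, 99, 100]
print(find_anomalies(latencies_ms))  # prints [11]
```

Only the spike is reported. The normal samples that follow it are not flagged, because the spike itself inflates the baseline for a while, which is one of the weaknesses of such a simple rule.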
Enables statements such as: 15% of our requests fail
Errors & problems root cause may be the result of
The problem happens only with specific users or URLs
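Both notes above come down to simple arithmetic over request records. Here is a small Python sketch, with made-up data and hypothetical field names (`url`, `user`, `failed`), that computes the overall error rate and then breaks it down by URL and by user to see where the failures concentrate.

```python
from collections import defaultdict


def error_rate(requests):
    """Fraction of requests that failed (0.0 when there are no requests)."""
    if not requests:
        return 0.0
    return sum(1 for r in requests if r["failed"]) / len(requests)


def error_rate_by(requests, key):
    """Error rate per distinct value of `key`, for example per URL or per user."""
    groups = defaultdict(list)
    for r in requests:
        groups[r[key]].append(r)
    return {value: error_rate(group) for value, group in groups.items()}


# Made-up sample: 20 requests, 3 of them failed.
requests = (
    [{"url": "/basket", "user": "u1", "failed": False}] * 12
    + [{"url": "/checkout", "user": "u2", "failed": True}] * 3
    + [{"url": "/checkout", "user": "u1", "failed": False}] * 5
)

overall = error_rate(requests)              # 3 of 20 requests failed
by_url = error_rate_by(requests, "url")     # failures only under /checkout
by_user = error_rate_by(requests, "user")   # failures only for user u2
```

In this sample the overall rate is 15%, but the breakdown shows that every failure belongs to one user on one URL, which is a very different problem from 15% of all traffic failing at random.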
OpenTelemetry makes robust, portable telemetry a built-in feature of cloud-native software.
OpenTelemetry provides a single set of APIs, libraries, agents, and collector services to capture distributed traces and metrics from your application. You can analyze them using Prometheus, Jaeger, and other observability tools.