
Observability in a̶c̶t̶i̶o̶n̶! the wild!

ContainerDays 2018, Hamburg: Talk by Florian Lautenschlager (@flolaut, Senior Software Engineer at QAware) and Josef Fuchshuber (@fuchshuber, Principal Software Architect at QAware)

Abstract:
This is not just another “Oh look! I'll show you how to use OpenTracing, Prometheus and EFK for distributed Hello World projects” talk. There are tons of great talks on this out there. Instead, we present a case study of an observable, large, real-world cloud native application and share our key findings from a technical, functional and collaborative point of view. For typical monitoring / observability, Sleuth, Prometheus and the EFK stack are perfect bulletproof tools. They are means to collect, store and analyze traces, metrics and logs. For technical monitoring of resources, e.g. memory and CPU consumption, we use the USE method described by Brendan Gregg [1], and for functional monitoring, e.g. use cases and business services, we use the RED method described by Tom Wilkie [2]. Continuous end-to-end tests deployed along with the software system give us constant feedback about it. All relevant metrics are checked by automated alerts, defined in Grafana, which keep us up to date. In addition, we link all information (traces, logs, metrics) in order to gain as much knowledge as possible, e.g. by adding the trace id to every log event (called contextualized logging [3] or log correlation [4]). On top of our technical and functional monitoring we designed what we call collaborative monitoring. This means that our observability tools are integrated into the standard tools of our audience, which is highly heterogeneous: Engineers, QA, Managers, Operations, Help Desk. The big benefit of such collaborative monitoring is better collaboration between the people around the project, and also with the machines. For example, it allows us to build chatbots to easily interact with the software system, and everyone can jump directly to the traces, logs and metrics of a request and send them to a person who can provide help if something bad happens.
With these opportunities, observability leads to an improvement of documentation, tickets, bug fix processes and communication all across the project. It was never easier to talk about a software system (Ok - This was just fun.). We show you our solution (also at code level) and talk about pros and cons.
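
To make the RED method from the abstract a little more concrete (Rate, Errors, Duration per use case), here is a minimal sketch in Python. It is not code from the talk: the `Request` record, the `red_summary` function and the use case names are hypothetical, and a real setup would most likely derive these numbers from Prometheus metrics rather than from an in-memory list.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Request:
    """One observed request (hypothetical record, for illustration only)."""
    use_case: str       # assumed use case name, e.g. "login"
    duration_ms: float  # how long the request took
    failed: bool        # whether the request ended in an error


def red_summary(requests, window_seconds):
    """Compute Rate, Errors and Duration per use case for one time window."""
    by_use_case = {}
    for req in requests:
        by_use_case.setdefault(req.use_case, []).append(req)

    summary = {}
    for use_case, reqs in by_use_case.items():
        summary[use_case] = {
            # Rate: requests per second in the window
            "rate_per_s": len(reqs) / window_seconds,
            # Errors: share of failed requests
            "error_ratio": sum(r.failed for r in reqs) / len(reqs),
            # Duration: mean latency (real dashboards often use percentiles)
            "mean_duration_ms": mean(r.duration_ms for r in reqs),
        }
    return summary


if __name__ == "__main__":
    sample = [
        Request("login", 100.0, False),
        Request("login", 300.0, True),
        Request("search", 50.0, False),
    ]
    for use_case, values in red_summary(sample, window_seconds=2).items():
        print(use_case, values)
```

The sketch uses a mean for the duration only to stay short; which aggregation fits best depends on the service level agreements being monitored.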

[1] http://www.brendangregg.com/usemethod.html
[2] https://www.weave.works/blog/of-metrics-and-middleware/
[3] https://medium.com/opentracing/take-opentracing-for-a-hotrod-ride-f6e3141f7941
[4] http://cloud.spring.io/spring-cloud-static/spring-cloud-sleuth/2.0.0.M8/single/spring-cloud-sleuth.html#_log_correlation
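
References [3] and [4] describe the idea of attaching the trace id to every log event. The talk's services rely on Sleuth for this; the following is only a rough Python sketch of the same idea using the standard `logging` and `contextvars` modules. The names `current_trace_id`, `TraceIdFilter`, `build_logger` and `handle_request` are made up for this example.

```python
import contextvars
import io
import logging
import uuid

# Trace id of the request currently being handled (hypothetical name).
# "-" marks log events that are written outside of any request.
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Copy the current trace id onto every log record."""

    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True  # never drop a record, only enrich it


def build_logger(stream):
    """Create a logger whose output carries the trace id in each line."""
    logger = logging.getLogger("contextual-demo")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s [trace=%(trace_id)s] %(message)s")
    )
    handler.addFilter(TraceIdFilter())
    logger.addHandler(handler)
    return logger


def handle_request(logger, trace_id=None):
    """Simulate one request: set the trace id, log, then restore the context."""
    token = current_trace_id.set(trace_id or uuid.uuid4().hex)
    try:
        logger.info("loading user profile")
    finally:
        current_trace_id.reset(token)


if __name__ == "__main__":
    buffer = io.StringIO()
    log = build_logger(buffer)
    handle_request(log, trace_id="abc123")
    log.info("outside any request")
    print(buffer.getvalue(), end="")
```

With the trace id in every line, a log store such as Elasticsearch can be filtered by that id, which is what makes "show me the logs for this trace" links possible.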

Published in: Data & Analytics

  1. Florian Lautenschlager, florian.lautenschlager@qaware.de, @flolaut. Observability in a̶c̶t̶i̶o̶n̶! the wild! Hamburg, June 20, 2018. Josef Fuchshuber, josef.fuchshuber@qaware.de, @fuchshuber
  2. Florian Lautenschlager
  3. Josef Fuchshuber
  4. In our cloud backend we have a vital microservice ecosystem.
  5. Our team is just as vital and heterogeneous as our software: Platform Developer, App Developer, Skill Developer, Client Developer, Tester, Ops, Help Desk, Product Management, Data Scientist, UX Designer.
  6. Observability isn't just for operations.
  7. What is the hardest step in the DevOps process? Diagram labels: DEV, OPS.
  8. Much better: The 6 Cs of the DevOps Cycle. Source: https://dzone.com/articles/6-cs-of-devops-adoption
  9. Observability in the wild! A case study… and how we found collaborative monitoring.
  10. Monitoring Toolchain: Simply Cloud Native Standard. Metrics, Events, Traces. Java (Spring Boot) or Python on Azure / Kubernetes / OpenShift / Docker.
  11. Monitoring: Technical and Functional. Layers: Hardware, Hypervisor, Operating System, Kubernetes, Docker, Runtime, Application. Generic monitoring that does not need knowledge about the application. Monitoring that does need knowledge about the application. Health of platform and application; telemetry data. Infrastructure-Monitoring; Application-Monitoring.
  12. Monitoring: Technical and Functional. Questions: Services are up and running; services can accept traffic. Sources: Kubestate-Exporter, Prometheus-Node-Exporter, JMX, top, iostat, etc. Questions: Use-case runtimes; service level agreements. Sources: Specific instrumentation (around use cases, etc.). Health of platform and application; telemetry data. Layers: Hardware, Hypervisor, Operating System, Kubernetes, Docker, Runtime, Application. Infrastructure-Monitoring; Application-Monitoring.
  13. USE Dashboard
  14. RED Dashboard
  15. I know. Most of you do this already. But what about... Collaborative Monitoring!?!?
  16. An example is the best explanation. Once there was a little tiny application… and a chatbot… and a monitoring toolchain…
  17. [No slide text]
  18. Snip Snap. Links request with trace and logs. Verbose.
  19. Or in case of an error
  20. Total duration. Involved services. <click>. Standard Zipkin Features.
  21. Code-Slide: Standardize tracing and metrics. Traces and metrics for every database call with standardized names and trace tags: database_call_duration{repository=yy, Call=zz}
  22. Code-Slide: Standardize tracing logs and tags. Span logs: We model database calls as well as other expensive calls as logs, using a template to reduce the size of traces: db:<Repo>.<Call> took: xx ms. call:<Class>.<Method> took: xx ms. Span tags: Used to model values that are valid for a span. We use a template to standardize tags: span.tag. (to mark our tags); Environment (staging, integration, etc.); db (to mark spans with db calls); param.<name>=value (call parameters).
  23. Logs for a given trace. Involved Services. Standard EFK + Contextual Logging.
  24. Code-Slide: Contextual logging. Context of a log event. Everyone can easily see the logs for a specific context (trace etc.).
  25. Or for checking the health of the services
  26. Or for checking the status of e2e tests
  27. End-2-end tests are also integrated in our observability stack. See the logs. VIDEO =) They run in their own Docker containers, execute Spock tests periodically and export Prometheus metrics.
  28. Our current setup: A chatbot as generic interface. Development Setup!
  29. and even our help desk / first level support. Production Setup!
  30. Early prototype of the Customer Care Observability Tool. Activate tracing for a user. Health Checks + e2e. Logs.
  31. Ease of communication within bug tickets.
  32. Happy end.
  33. Summary
  34. Collaborative Monitoring. Monitoring that everyone can benefit from without needing expert knowledge.
  35. Three steps to enable collaborative monitoring. Standardize metrics, logs and traces (start here): structured logging + context, metric names, etc. Link and combine them as far as possible: correlate events and traces by context; metrics with events and traces by time. Integrate them into everyone's tools: the tools of your team.
  36. Did we create an uncontrollable observability monster?
  37. There’s No Such Thing as a Free Lunch. • The more complex a microservice architecture is, the more sophisticated the observability solution must be. • For Collaborative Observability there is no out-of-the-box solution.
  38. Collaborative Monitoring by everyone. Ease of use: Simple general interface to access various monitoring tools. Integrated into everyone's daily tools (ChatBots, E-Mail, etc.). Support all kinds of teams: Operations / Dev-Ops / Developers / QA-Team / My mum =) Allow everyone to get superman insights: Decrease Mean Time To Recovery (MTTR) with a fast analysis. Integrates different kinds of monitoring data (traces, metrics and logs) from different monitoring layers. The right information: Provide relevant information for different teams, e.g. runtimes for the perf. engineer. Level of detail: abstract (use case level) for management vs. details (database calls) for developers. The behavior of a system is not just a single metric.
  39. Lessons Learned. Tool stack is awesome: Prometheus, Sleuth / Zipkin, Logging (fluentD, elastic) is stable with good documentation. Maximum flexibility compared to commercial products. But: Effort for concepts, implementation and quality checks. Conventions and rulesets are important! Mindset: We found that we had to convince people first. But we have seen a high level of acceptance. Example: The chatbot with trace links is the standard tool for discussing possible bugs between all project roles. Development and system understanding: No need for “cloudy” conversations. Just provide the context, e.g. a trace id. Example: Issues typically contain the context (trace id) that points the developer to the logs and the trace.
  40. Any Questions? Come to our booth. We’re hiring! #CloudNativeNerd #CloudKoffer chatbot: cvi scale up team
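
Slides 21 and 22 describe naming templates for database metrics, span logs and span tags. The slides do not show the actual code, so the following Python sketch only illustrates how such templates could be enforced in one place. All function names are hypothetical, the lowercase `call` label (the slide writes `Call`) and the exact tag keys `environment` and `db` are assumptions, and the `span.tag.` marker from the slide is left out because its exact use is not clear from the transcript.

```python
def db_metric_series(repository, call):
    """Series name for a database call, following the template on slide 21.

    Assumption: label values are quoted as in the Prometheus text format.
    """
    return f'database_call_duration{{repository="{repository}",call="{call}"}}'


def db_span_log(repository, call, duration_ms):
    """Span log for a database call: db:<Repo>.<Call> took: xx ms."""
    return f"db:{repository}.{call} took: {duration_ms} ms."


def call_span_log(class_name, method, duration_ms):
    """Span log for another expensive call: call:<Class>.<Method> took: xx ms."""
    return f"call:{class_name}.{method} took: {duration_ms} ms."


def span_tags(environment, params, is_db_call=False):
    """Build standardized span tags; the key names are assumptions."""
    tags = {"environment": environment}
    if is_db_call:
        tags["db"] = "true"  # marks spans that contain database calls
    for name, value in params.items():
        tags[f"param.{name}"] = str(value)  # param.<name>=value
    return tags


if __name__ == "__main__":
    print(db_metric_series("UserRepository", "findById"))
    print(db_span_log("UserRepository", "findById", 12))
    print(call_span_log("PricingService", "calculate", 48))
    print(span_tags("staging", {"userId": 42}, is_db_call=True))
```

Keeping the templates in shared helpers like these is one way to follow the "conventions and rulesets are important" lesson, since dashboards and chatbot links can then rely on predictable names.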