Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

From Zero to Useless to Hero – Make Runtime Data Useful in Teams (Observability)

Cloud Native Night, July 2020, online: Talk of Florian Lautenschlager (@flolaut, QAware) & Robert Hoffmann (@robhoffmax, AWS)

== Please download slides if blurred! ==

Abstract:
We introduced distributed tracing, central logging with trace correlation and monitoring with Prometheus and Grafana in a large internationally distributed software development project from the beginning. The result: Nobody used it.
In this talk we show the good and not so good sides we have learned while introducing and operating the observability tools. We show which extensions and conventions were necessary in order to carry out a cultural change and to awaken enthusiasm for these tools. Today the tools are a first-class citizen and people are shouting when they are not available.

________________________________________________

Follow us on:
https://twitter.com/qaware
https://www.linkedin.com/company/qaware-gmbh
https://www.youtube.com/channel/UCmNf72xADnO57idipq9jvMg
https://github.com/qaware

www.qaware.de

  • Be the first to comment

  • Be the first to like this

From Zero to Useless to Hero – Make Runtime Data Useful in Teams (Observability)

  1. 1. From Zero to Useless to Hero Making Runtime Data Useful in Teams Robert Hoffmann @robhoffmax Florian Lautenschlager @flolaut
  2. 2. Dr. Florian Lautenschlager Software Architect {name.surname}@qaware.de Robert Hoffmann Senior Solutions Architect @ AWS rho@amazon.de @robhoffmax Contact us if you want. =) @flolaut
  3. 3. 3 “Hallo Magenta” Building a European Voice Assistant Platform
  4. 4. 4 From Zero to 1 international co-development > 900 collaborators > 500 active git repos > 100 services
  5. 5. T=Zero
  6. 6. Some Cloud SQL Databases Storage NoSQL Databases Proxy http http Kubernetes <Pod> Admin Gateway <Pod> API Gateway Voice Services <Pod> Service <Pod> Service <Pod>Service API Admin https https External Services IDM CDN … Skills Weather … Radio Device Services <Pod> Service <Pod> Service <Pod>Service API Admin 6 Complex Architecture. Complex Software System. Complex Analysis.
  7. 7. Exploration Probes, Collection and Storage Storage, Exploration Transport Storage, Exploration Probes and Collection Metric sampling-based Textual event-based Span event-based 7 Advanced toolchain needed. Standard used. NEW: Humio NEW: Jaeger NEW: Grafana Cloud
  8. 8. Generic-Standard-Runtime-Data-Smarthub-Service-Data-Model 9 Logging Concept Tracing Best- Practices Standard Metrics for incoming and outgoing Requests Standard Database Metrics and specific ones. Standard Readiness and Liveness checks with Metrics TraceId in every Response
  9. 9. – We, the ignorant ones “Done. This solves all our problems. They will it!“
  10. 10. 11 Our team: Colorful. Platform Developers Skill Developers Operation Heros First Level Support Data Scientists Production Management Tester Mobile Developers
  11. 11. 12 Our solution: Monochrome. Platform Developers Skill Developers Operation Heros First Level Support Data Scientists Production Management Tester Mobile Developers
  12. 12. T=Useless Because we are monochrome
  13. 13. Nobody wants to be a Beginner. Optimize for Intermediate. About Face - Alan Cooper Intermediate ExpertBeginner Toolchain 14
  14. 14. Useful = Utility + Usability. 🤔 Utility: whether it provides the features you need. ✅ You can find all the information... Usability: how easy & pleasant these features are to use: Learnability, Efficiency, Memorability, Error Handling, Satisfaction. ❌ ... if you really know how and where to look (as an Expert). Usability 101 - Jakob Nielsen https://www.nngroup.com/articles/usability-101-introduction-to-usability/ What we did to move our solution from expert to intermediate. 15
  15. 15. Close Gaps: Link data and tools as much as possible. 16 Developer-, Tester-
 & Operations-oriented Dashboards with links to logs and e2e test runs
  16. 16. Close Gaps: Link data and tools as much as possible. 17 Developer-, Tester-, 
 & Operations-oriented Gangway landing page to access k8s, logs, traces, metrics
  17. 17. Make functional use: First-level support integration. 18 Customer First Level Support First-Level- & 
 Operations-oriented GDPR-aware debugging in production: Token-based user-specific debug logging and tracing
  18. 18. Referencing Trace IDs as a common base to discuss and find relevant data Make functional use: Resolving Tickets more easily. 19 Developer-, Tester-
 First-Level-, &
 Operations-oriented
  19. 19. Close Gaps: Link data and tools as much as possible. 20 Developer-
 oriented Pipeline UI - promote software and get runtime data
  20. 20. Close Gaps: Link data and tools as much as possible. 21 Developer-
 oriented Pipeline dashboards with logs, traces
  21. 21. 22 Any project member has easy access. Just open your chat. Anyone can learn by example.
 See how others use the service. Support in case of an error.
 By others or technical: • Trace: Request Trace • Logs: Request Application log Lower the access hurdle: CLI & Chatbot integration. Everybody-oriented
  22. 22. Visibility and Increased Trust: Toolchain acts as a safety-net as it shows the runtime behavior. People can be sure to understand their services, e.g. in case of an error. Self-Awareness: Accept and understand that software has a runtime behavior. Not all developers feel comfortable with dynamic analysis, but now they have means to see and understand. Clear Communication: Inner & cross-team communication is easier. Different people can easily share the same context, e.g. trace-Id, log messages, request flow. Error Culture: Failures are more easily accepted. As the software system is visible and the cross-team communication is clear, people tend to accept failures and work together on solutions. Ownership: Increased acceptance is the foundation for end-to-end responsibility. Due the visibility and increased trust, clear communication and error culture, people are more inclined to take ownership for their services. Changes in the culture that we have recognized. 23 Everybody-oriented
  23. 23. T=Hero Because we are a little bit colorful
  24. 24. Recognise that modern observability may challenge your team structure: Developers are instrumenting their applications, not some distant operations team. This works best when self- sufficient feature teams build and run their software on their own. Instrument early, instrument everything: Yields early value and helps growing the understanding in your org. Otherwise, you might have to do a large batch before it is useful. After you have the some basic visualisation, immediately focus on your users (other teams): By understanding their needs, you can build observability that they actually use. Plan ahead for the reliability of your observability tools: Once teams use the tools on a day- to-day basis, any service interruption will slow down or even block their work. At this point, the tools need to be as reliable and available as other developer services. What we learned along the way. 25 Everybody-oriented
  25. 25. Exploration Probes, Collection and Storage Storage, Exploration Transport Storage, Exploration Probes and Collection Metric sampling-based Textual event-based Span event-based 26 Advanced toolchain needed. Standard used. NEW: Humio NEW: Jaeger NEW: Grafana Cloud
  26. 26. x 3 (Metrics, Logging, Tracing) ? Scaling ? Operations ? Updates ? Linking ? Cloud Services Don’t build the basics if you don’t have to. Then you can focus on adding specific integrations for your project.
  27. 27. Look for integrated, managed observability to get going quickly. ✔ Scaling ✔ Operations ✔ Updates ✔ Linking ✔ Cloud Services ✔ Prometheus Support ✔ OpenTelemetry Support
  28. 28. Start here Select Toolchain & Standardize Metrics, Logs, Traces Tools your team Link and combine them as far as possible Integrate them into everyday tools & Processes

×