I will describe our journey implementing and integrating an open-source Application Performance Monitoring solution based on:
– inspect-IT (http://inspectit.rocks)
– OpenCensus (http://opencensus.io)
– Jaeger (http://jaegertracing.io)
– InfluxDB (http://influxdata.com)
– Grafana (http://grafana.com)
We are instrumenting more than 1,000 JVMs and more than 100 applications. We use JVM instrumentation and JS/browser end-user monitoring to measure the performance of our applications. I will cover the pitfalls and successes of the implementation, and how it can help with application performance monitoring.
OSMC 2021 || Open Source Application Performance Monitoring in the Enterprise
1. Distributed Tracing in the enterprise,
Nov. 09-11, 2021 / OSMC
Open Source Application Performance Monitoring
OSS APM
2. Manifest
1. Who am I
2. Who is “VHV Versicherung”
3. What is OSS APM / Distributed Tracing
• Generic instructions
• Challenges at VHV
• Concept of the solution
• Solution
3. / /
Hi, I am Sascha Brechmann.
I have been working for “VHV Versicherung” for more than 7 years.
My main task in the Monitoring team is Application Performance Monitoring.
• My Main Skills are:
• Linux-System-Administration
• Monitoring-Administration
• Availability-Monitoring (CheckMK)
• Performance-Monitoring (inspect-IT)
• Consulting on everything around monitoring
• Event-/Log-Monitoring (ELK-Stack)
OSS APM / OSMC2021 19.11.2021 3
OSS APM
Who am I
4. / /
The VHV Group, located in Hannover, is a growing group of companies specializing in insurance and
provision.
Together, the brands “VHV Versicherung”, “Hannoversche” and “VHV Versicherung (Austria)”
form the “VHV Gruppe”, a professional and forward-looking partner for insurance.
The business areas are property and casualty insurance, motor insurance and life insurance.
For its success, the VHV Group relies on the strengths of around 3,000 employees, modern
structures, efficient cost management and customer-oriented products.
OSS APM
Who is the VHV insurance (VHV Versicherung)
6. / /
Distributed tracing in the context of APM means observing distributed transactions (“traces”)
across the different applications / functions involved (“spans”).
Metrics, tags and logs / messages can be attached to the individual spans and evaluated for
(problem) analysis.
With distributed tracing, the spans from all applications involved must be forwarded to a
central backend for correlation and analysis.
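As a minimal sketch of the idea (not the data model of any specific tool), the following shows spans reported by different applications being grouped into traces by a shared trace ID. The field names, service names and sample values are assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    trace_id: str             # shared by all spans of one distributed transaction
    span_id: str              # unique within the trace
    parent_id: Optional[str]  # None for the root span
    service: str              # application that produced the span
    name: str                 # operation, e.g. an HTTP route or an SQL call
    duration_ms: float
    tags: dict = field(default_factory=dict)  # attached metadata
    logs: list = field(default_factory=list)  # attached log messages


def correlate(spans):
    """Group spans reported by different applications into traces."""
    traces = defaultdict(list)
    for span in spans:
        traces[span.trace_id].append(span)
    return dict(traces)


# Hypothetical spans arriving at the central backend from two applications.
reported = [
    Span("t1", "a", None, "web-shop", "GET /order", 120.0, tags={"http.status": 200}),
    Span("t1", "b", "a", "order-service", "SELECT orders", 35.0),
    Span("t2", "c", None, "web-shop", "GET /cart", 15.0),
]
traces = correlate(reported)
for trace_id, members in traces.items():
    print(trace_id, [f"{s.service}:{s.name}" for s in members])
```

The only thing the backend needs for correlation is that every application forwards the same trace ID with its spans.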
General introduction
What is Distributed-Tracing
11. / /
The benefits of (OSS) APM / Distributed Tracing
Part 1
• Cross-application measurement of (business-) processes
• Request rate, error rate and duration of transactions (R.E.D. metrics)
• Enriching transactions with additional (meta) data for error analysis (HTTP status code; order code; ...)
• Correlation of transactions with logs
• Comparing transactions:
• Releases / periods
• End-User interactions / actions
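The R.E.D. metrics mentioned above can be derived directly from recorded transactions. The sketch below assumes a hypothetical list of root spans, a 60-second observation window, and that HTTP status codes of 500 and above count as errors; none of these details come from our setup.

```python
from collections import defaultdict

# Hypothetical root spans: (service, duration in ms, HTTP status code)
requests = [
    ("crm", 120.0, 200),
    ("crm", 480.0, 500),
    ("crm", 95.0, 200),
    ("portal", 60.0, 200),
]
WINDOW_SECONDS = 60  # assumed observation window


def red_metrics(rows, window_seconds):
    """Compute request rate, error ratio and mean duration per service."""
    per_service = defaultdict(list)
    for service, duration_ms, status in rows:
        per_service[service].append((duration_ms, status))

    result = {}
    for service, samples in per_service.items():
        count = len(samples)
        errors = sum(1 for _, status in samples if status >= 500)
        result[service] = {
            "rate_per_s": count / window_seconds,
            "error_ratio": errors / count,
            "mean_duration_ms": sum(d for d, _ in samples) / count,
        }
    return result


metrics = red_metrics(requests, WINDOW_SECONDS)
for service, values in metrics.items():
    print(service, values)
```

In practice percentiles are usually more informative than the mean duration; the mean is used here only to keep the example short.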
12. / /
The benefits of (OSS) APM / Distributed Tracing
Part 2
• Early identification of performance problems
• Avoidance of support tickets
• Earlier integration of performance measurement into the development process
• Runtime behaviour / analysis of alternative software solutions
• Easy drill-down into the software code
• Clear, eye-catching display of complex metrics in dashboards
• Overview of dependencies / connections between applications
• Reveal hidden dependencies (why is application A talking to B?)
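A dependency overview like the one described above can be derived from parent/child relations between spans. The following is a minimal sketch with hypothetical span tuples and service names; it is not how any particular tool builds its service dependency graph.

```python
from collections import Counter

# Hypothetical spans: (span_id, parent_id, service)
spans = [
    ("1", None, "portal"),
    ("2", "1", "crm"),
    ("3", "2", "crm"),      # internal call within the same service
    ("4", "2", "billing"),
    ("5", "1", "billing"),
]


def dependency_edges(rows):
    """Count calls that cross a service boundary (caller -> callee)."""
    service_of = {span_id: service for span_id, _, service in rows}
    edges = Counter()
    for _, parent_id, service in rows:
        parent_service = service_of.get(parent_id)
        # Skip root spans and calls that stay inside one service.
        if parent_service is not None and parent_service != service:
            edges[(parent_service, service)] += 1
    return edges


edges = dependency_edges(spans)
for (caller, callee), calls in sorted(edges.items()):
    print(f"{caller} -> {callee}: {calls} call(s)")
```

An unexpected edge in this output is exactly the kind of hidden dependency ("why is application A talking to B?") that becomes visible once all applications report spans.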
14. / /
(License) cost reduction
• Commercial solutions can only be used for a limited number of instances
• Not all stages / levels
• Not all applications
• No vendor lock-in
• “Hardly / no influence” on the development
• (Micro) service oriented
• The simplest possible exchange of individual components
• Commercial solutions are usually a monolith / black box, and
scaling is only possible with the help of the vendor
• Cooperation with other (insurance) market participants
Organizational improvements
• Earliest possible inclusion of performance measurements in the
development process
• Better overview through a service dependency graph (flow map)
of the application landscape
• Easily create striking dashboards
• Self-service for the IT operations groups
• Less (know-how) dependence on “power users” thanks to a
comprehensive training concept
• Easy adaptation to/in the VHV processes
• Simple correlation between traces and (application) logs
• Inclusion of former black boxes (SAP / mainframe / non-Java)
OSS APM / Distributed-Tracing
Problem statement at the VHV
Commercial vs. OSS Solution
16. / /
1. The desired (open source) solution should consist of several (micro) services.
2. The solution must rely on open standards and (mostly) existing open-source software.
3. The individual components should be easy to replace.
4. The solution should be as simple as possible to scale (especially horizontally, by adding nodes).
5. The solution should fit easily into the VHV staging concept (development, test, training, production stage).
6. The solution should, if possible, rely on technologies that are already in use at VHV:
a) Linux (SLES)
b) Java
c) Elasticsearch
d) PostgreSQL/SQLite
OSS APM / Distributed Tracing
Concept of solution
18. / /
1. Micro-Services + Standards:
a) OpenCensus-Collector; Jaeger-Collector; Elasticsearch; InfluxDB; Grafana; inspect-IT configsrv
b) Standards: OpenCensus + OpenMetrics (+ OpenTelemetry coming soon)
2. VHV staging concept: a separate “inspect-IT stack” for each VHV main stage
3. Simple scaling:
a) Collectors, Elasticsearch, Grafana + inspect-IT services (EUM, config, baseline) scale horizontally.
b) InfluxDB is still a "problem child" here.
4. Technologies:
a) Linux (SLES) for Client + Server / Windows Client
b) Java (OpenJDK; AdoptOpenJDK; Oracle JDK; IBM JDK), version >= 1.8
c) PostgreSQL/SQLite (currently mostly SQLite, but a migration to PostgreSQL is planned)
d) Elasticsearch: used as storage for traces
e) InfluxDB: new technology; alternative metric stores are under consideration
f) Go: several components are implemented in Go, but are used as prebuilt binaries and are not
compiled in-house.
OSS APM / Distributed Tracing
Solution
19. / /
„inspect-IT stack“ architecture sketch
See https://openapm.io for alternatives
[Architecture diagram; layers: Browser, Java application, Java tracing framework, data transport, data correlation, data storage, data analysis + alerting]
20. / /
Core functions of inspect-IT
- Distributed tracing (from application to application / SDG - flow map)
- Recording of transaction metrics and traces
- Detection of business transactions
- Display of HTTP / SQL Query / MQ / SMTP / LDAP / SFTP transactions
- Automatic injection of the TraceID into the (Application-)Log output
- Measurement of the "end-user experience" when using the (observed) application
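To illustrate what injecting the TraceID into the log output means, here is a minimal sketch using Python's standard logging module. It only shows the general idea and is not the agent's implementation; the logger name, log format and trace ID value are assumptions made for the example.

```python
import contextvars
import io
import logging

# Holds the trace ID of the request currently being processed.
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Copy the active trace ID onto every log record."""

    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True


stream = io.StringIO()  # stands in for the application log file
handler = logging.StreamHandler(stream)
handler.setFormatter(
    logging.Formatter("%(levelname)s [trace=%(trace_id)s] %(message)s")
)
handler.addFilter(TraceIdFilter())

logger = logging.getLogger("demo.app")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(handler)

# In a real application the tracing layer would set this per request.
current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
logger.info("order accepted")

log_line = stream.getvalue().strip()
print(log_line)
```

With the trace ID in every log line, a log search for that ID returns exactly the messages belonging to one trace, which is what makes the correlation between traces and logs simple.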
21. / /
„inspect-IT“: Who/How/What
Under the name “inspect-IT stack” we group a collection of open-source tools that together provide the
functions of “distributed tracing” in our solution.
• inspect-IT Ocelot agent => instruments the individual Java applications
• OpenCensus-Collector => collects and routes the trace information
• Jaeger => central tracing instance; takes care of the preparation
(drill-down) of tracing information
• Elasticsearch => storage backend for Jaeger / traces -> trace correlation
• Telegraf => machine agent; collects all (server) metrics and stores them in InfluxDB
• InfluxDB => time-series database; stores all metrics
• Grafana => web UI for InfluxDB / Elasticsearch
22. /
A few figures for an overview of the use of inspect-IT at VHV
• “inspect-IT stack” stages: 4, currently with 5 VMs per stage (without tracing storage)
• VHV stages (application stages): > 20
• Number of inspect-IT agents: > 50 per VHV stage
• Number of different applications: > 100
• Data volume, metrics: ~ 60 GB per inspect-IT stage (7 days of raw data, then aggregated)
• Data volume, traces: > 250 GB per inspect-IT stage and day (up to 10 days retention time)
• Users: > 100 (IT only)
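The figures above allow a rough order-of-magnitude estimate of the storage involved. The sketch below assumes that all four inspect-IT stages carry the same volume and that the full 10-day retention is used; it ignores replication and index overhead, so it is an illustration rather than a sizing statement.

```python
TRACE_GB_PER_DAY = 250     # lower bound per inspect-IT stage ("> 250 GB")
TRACE_RETENTION_DAYS = 10  # upper bound ("up to 10 days")
METRIC_GB_PER_STAGE = 60   # approximate, 7 days of raw data
STACK_STAGES = 4           # assumption: every stage has the same volume

trace_gb_per_stage = TRACE_GB_PER_DAY * TRACE_RETENTION_DAYS
trace_gb_total = trace_gb_per_stage * STACK_STAGES
metric_gb_total = METRIC_GB_PER_STAGE * STACK_STAGES

print(f"traces per stage:  about {trace_gb_per_stage / 1000:.1f} TB")
print(f"traces all stages: about {trace_gb_total / 1000:.1f} TB")
print(f"metrics all stages: about {metric_gb_total} GB")
```

Under these assumptions the trace storage is larger than the metric storage by more than an order of magnitude, which is why the tracing storage is counted separately from the 5 VMs per stage.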
23. /
Some Screenshots from our Environment
Service Dependency Graph
24. /
Some Screenshots from our Environment
Service Dependency Graph / Single application
25. /
Some Screenshots from our Environment
JVM Metric of one CRM Application
26. /
Some Screenshots from our Environment
HTTP Metrics of one CRM Application
27. /
Some Screenshots from our Environment
BT Metrics of one CRM Application
28. /
Some Screenshots from our Environment
EUM/Browser Metrics of one CRM Application
30. / /
[Diagram; recoverable labels: VHV developer, IT-O…, end user, business user, Inte…, dashboard user, storage]
31. /
OpenCensus/OpenTracing/OpenTelemetry:
Who/How/What
• OpenCensus is an open-source standard and describes the functions required to implement "distributed tracing"
(APIs/Libs – Protocols – Receiver/Exporter).
• OpenCensus was initially designed by Google and Microsoft and is largely based on Google's own "distributed tracing" implementation.
-> The commercial versions are / were “Google Stackdriver” + “Azure Application Insights”
• OpenCensus describes how "distributed tracing" information can be exchanged between the individual systems involved
• OpenCensus also provides reference implementations (Libs) for a wide range of programming languages
• OpenTelemetry is the successor project to, and the continuation of, OpenCensus (v2) and OpenTracing (v2).
• OpenCensus + OpenTracing have now been merged into the "OpenTelemetry" project
(no more two standards / implementations for the same goal)
• OpenTelemetry has an expanded focus that includes the following information levels:
• Traces: from OpenTelemetry, OpenCensus and OpenTracing (interoperability)
• Metrics: Acquisition of application metrics, output of metrics, transport of metrics, storage of metrics
• Logs: Collection and transport of log messages. Transfer to a log collector
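As an illustration of how trace information is exchanged between systems, the sketch below builds and continues a W3C Trace Context `traceparent` header (version `00`, a 32-hex-digit trace ID, a 16-hex-digit parent span ID and 2 hex digits of flags), which is the default propagation format in OpenTelemetry. Which header format a given OpenCensus setup uses is not stated here, so treat the choice of format as an assumption; the function names are hypothetical.

```python
import re
import secrets

# traceparent: version "00", 32 hex trace ID, 16 hex parent span ID, 2 hex flags
TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled=True):
    """Start a new trace at the first application in the call chain."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def child_traceparent(incoming):
    """Continue an incoming trace: keep the trace ID, create a new span ID."""
    match = TRACEPARENT.match(incoming)
    if match is None:
        # Unparseable header: start a fresh trace instead of failing.
        return new_traceparent()
    trace_id, _parent_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


outgoing = new_traceparent()              # sent by application A
downstream = child_traceparent(outgoing)  # sent on by application B
print(outgoing)
print(downstream)
```

Because the trace ID stays the same from hop to hop, the spans of application A and application B end up in the same trace at the backend. The sketch does not check for the all-zero IDs that the specification treats as invalid.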
32. / /
International:
• Uber
• Red Hat
• Ticketmaster
• Grafana
Germany:
• Hermes Logistik (started 2018; change from monolith to microservices)
• Zalando
Companies that already work on distributed tracing with
OpenTelemetry / Jaeger