2. Insight into the inner workings of an application
becomes crucial at the latest when performance and
scalability issues are encountered. Gaining it is
especially challenging in distributed systems, such as
those built on Akka Cluster.
A popular open-source solution for monitoring on the
JVM in general, and Akka in particular, is Kamon. Having
recently reached its 1.0 milestone, it provides both
metrics collection and tracing for Akka applications,
whether they run standalone or distributed.
This talk gives an introduction to Kamon 1.0 with a
focus on its metrics features. It describes the basic
setup using Prometheus and Grafana and gives an
overview of the different modules and their APIs
for implementing custom metrics. The resulting setup
records both automatically exposed metrics
about Akka’s actor systems and metrics tailored
to the monitored application’s domain and service
level indicators.
Finally, lessons learned from a first-time user's
experience of getting started with Kamon will be reported.
The example of adding instrumentation to EMnify’s
mobile core application illustrates how easy it is to
get started and how to kill Prometheus on a daily
basis.
Abstract
3. • Steffen
• has a heart beating for infrastructure
• writes code at EMnify
• PhD in computer science, topic: software-based networks
• EMnify
• MVNO focussed on IoT
• runs virtualized mobile core network
• Würzburg/Berlin, Germany
About Me & Us
@StGebert
Slides available at st-g.de/speaking
10. • Tracing
• Per-request call graph
• Context propagation across nodes
• Example objectives:
• Request profiling
• Understanding call graph
• Metrics
• Time series data
• Counters / gauges / distributions
• Example objectives:
• Function call counts and latency
• Open DB connections
• User logins
• Generated revenue
Kamon: Feature Set
11. • Custom Metrics
• added to your code where it
makes sense
• Automatic Instrumentation
• integrations into Akka,
Akka HTTP, Play, JDBC, Servlet
• system and JVM metrics
Metrics
12. • Counter
• function calls
• customer buying our product
• Gauge
• number of open DB connections
• mailbox size
Custom Metric Types
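The difference between the two types can be sketched in a few lines. This is a minimal Python illustration of the semantics only; the `Counter` and `Gauge` classes are hypothetical stand-ins and not Kamon's API.

```python
class Counter:
    """Monotonically increasing count, e.g. function calls or purchases."""

    def __init__(self):
        self.value = 0

    def increment(self, times=1):
        if times < 0:
            raise ValueError("a counter can only go up")
        self.value += times


class Gauge:
    """Current level of something, e.g. open DB connections or mailbox size."""

    def __init__(self):
        self.value = 0

    def set(self, value):
        self.value = value


purchases = Counter()
open_connections = Gauge()

for _ in range(3):
    purchases.increment()    # three customers bought the product

open_connections.set(5)      # five connections are open right now
open_connections.set(2)      # later, only two are left

print(purchases.value, open_connections.value)
```

A counter only answers "how many so far", so rates are derived from it at query time; a gauge only answers "how many right now".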
13. • Histogram
• latencies
• shopping cart total prices
• Timer
• latencies
• RangeSampler
• number of open DB connections
• mailbox size
Custom Metric Types (2)
[Figure: histogram of a single sample; x-axis: value (10 to 50), y-axis: observations]
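How the three types on this slide relate can be sketched as follows. The classes are hypothetical Python stand-ins for the concepts, written under the assumption that a timer is a histogram of elapsed times and that a range sampler remembers the lowest and highest level seen between two samples; they do not reproduce Kamon's implementation.

```python
import time
from collections import Counter as Tally


class Histogram:
    """Counts how often each recorded value was observed."""

    def __init__(self):
        self.observations = Tally()

    def record(self, value):
        self.observations[value] += 1


class Timer(Histogram):
    """A histogram whose recorded values are elapsed times in nanoseconds."""

    def time(self, func):
        started = time.perf_counter_ns()
        result = func()
        self.record(time.perf_counter_ns() - started)
        return result


class RangeSampler:
    """Tracks a level that moves up and down and remembers its extremes."""

    def __init__(self):
        self.current = self.low = self.high = 0

    def increment(self):
        self.current += 1
        self.high = max(self.high, self.current)

    def decrement(self):
        self.current -= 1
        self.low = min(self.low, self.current)

    def sample(self):
        """Return (low, high, current) and restart the range at the current level."""
        snapshot = (self.low, self.high, self.current)
        self.low = self.high = self.current
        return snapshot


cart_totals = Histogram()
for total in (10, 20, 20, 50):
    cart_totals.record(total)

latency = Timer()
latency.time(lambda: sum(range(1000)))

mailbox = RangeSampler()
for _ in range(3):
    mailbox.increment()      # three messages arrive
mailbox.decrement()          # one message is processed
first = mailbox.sample()     # (0, 3, 2): the peak of 3 is not lost
second = mailbox.sample()    # (2, 2, 2): nothing happened since
```

Compared with a plain gauge, the range sampler still reports the peak of 3 even though the level had dropped to 2 by the time it was sampled.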
15. • Actor system metrics
• processed messages
• active actors
• unhandled messages
• dead letters
• Per actor performance metrics
• processing time (per message)
• time in mailbox
• mailbox sizes
• errors
Kamon Akka
[Figure: Actors A, B, and C, each with its own mailbox, and a message]
16. • Metrics related to
• routers
• dispatchers
• executors
• actor groups
• remoting (with kamon-akka-remote)
• Requirement (AOP)
• AspectJ Weaver or
• Kanela (Kamon Agent)
Kamon Akka (2)
18. Related Projects
[Figure columns: Targets, Time Series DB, Dashboard]
simple_client
DropWizard Metrics
Micrometer
Commercial Tools
Datadog, Dynatrace, Instana, NewRelic, etc.
19. • Time Series Database
• collection, storage & query of metrics data
• inspired by Google's Borgmon; a CNCF project
• Pull-based model
• scrapes configured targets
• HTTP endpoints on monitored targets
• Easy deployment
• statically linked Go binaries
• single YAML config file
• Alertmanager.. for alerting ;-)
Prometheus
20. • Integrated time series database
• on disk, no external dependency
• fixed retention period, no long-term storage / downsampling
• very efficient storage [1]
• query language PromQL
Prometheus TSDB
[1] Storing 16 bytes at scale, Fabian Reinartz @ PromCon 2017
25. • Tick interval (Kamon) and scrape frequency (Prometheus)
• both should match!
• usually (?) 30s or 60s
• for load tests, we went for 5s
• hope to go for 15s in production
• Deployment [for development / load tests]
• EC2 instances tagged in CloudFormation plus EC2 service discovery
• started simple (stupid): Prometheus in container on AWS ECS with EFS
Our Experiences with Kamon+Prometheus
Docker automated build config github.com/EMnify/prometheus-docker
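Why the tick interval and the scrape interval should match can be illustrated with a toy model. It assumes that a new snapshot becomes visible at every multiple of the tick interval and that each scrape sees only the latest one; `compare_intervals` is a hypothetical helper, and whether a skipped snapshot really loses information depends on the reporter, which this model does not capture.

```python
def compare_intervals(tick, scrape, duration):
    """Return (stale_scrapes, skipped_snapshots) over `duration` seconds."""
    stale = 0
    seen = set()
    last_seen = 0                      # 0 = no snapshot published yet
    for now in range(scrape, duration + 1, scrape):
        latest = now // tick           # number of the newest snapshot
        if latest == last_seen:
            stale += 1                 # this scrape saw nothing new
        else:
            seen.add(latest)
        last_seen = latest
    skipped = duration // tick - len(seen)
    return stale, skipped


matching = compare_intervals(tick=5, scrape=5, duration=60)
scrape_too_fast = compare_intervals(tick=60, scrape=15, duration=120)
scrape_too_slow = compare_intervals(tick=5, scrape=15, duration=60)

print(matching, scrape_too_fast, scrape_too_slow)
```

In this model only matching intervals give one fresh snapshot per scrape: scraping faster repeats old data, scraping slower never looks at some snapshots.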
• Few CPU resources + NFS storage + high cardinality = Prometheus dies
• High cardinality?
• akka_actor_processing_time_seconds_bucket{
class="com.example.SomethingFrequentlyUsed",
le="0.33", …
path="mysystem/some-supervisor/$aX"}
How to Kill Prometheus (Regularly)
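The cardinality problem is simple arithmetic: every distinct combination of label values is its own time series, so a per-actor `path` label multiplies the number of histogram buckets by the number of actors. A minimal sketch, with made-up bucket bounds and numbered paths standing in for Akka's generated names of unnamed actors:

```python
from itertools import product


def bucket_series(paths, bucket_bounds):
    """Enumerate the label sets of a `_bucket` metric: one per (path, le)."""
    return {(path, le) for path, le in product(paths, bucket_bounds)}


bounds = ["0.1", "0.33", "1", "+Inf"]   # hypothetical bucket bounds

# One long-lived, named actor: the series count stays fixed.
named = bucket_series(["mysystem/some-supervisor/worker"], bounds)

# 10,000 short-lived unnamed actors: every one adds new series.
unnamed_paths = [f"mysystem/some-supervisor/${n}" for n in range(10_000)]
unnamed = bucket_series(unnamed_paths, bounds)

print(len(named), len(unnamed))
```

The series of stopped actors stay in the database until the retention period expires, so the total keeps growing as long as new unnamed actors are created.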
27. • Define actor groups
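kamon.akka.actor-groups += "mygroup"
kamon.util.filters {
  # stop tracking each child of some-supervisor individually ...
  "akka.tracked-actor" {
    excludes = ["mysystem/some-supervisor/*"]
  }
  # ... and report them together as one group instead
  mygroup {
    includes = ["mysystem/some-supervisor/*"]
  }
}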
• Delete Prometheus data to recover
• Continue to watch out for metrics with unnamed actors
How to Fix Kamon to Not Kill Prometheus
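The effect of the two filters can be sketched as glob matching. The sketch assumes that a name is accepted when it matches at least one include pattern and no exclude pattern, that `*` does not cross a `/`, and that `**` does; `glob_to_regex` and `accepts` are hypothetical helpers, and the `mysystem/**` include is added only so that the example has something to accept.

```python
import re


def glob_to_regex(pattern):
    """Translate a glob: `**` matches anything, `*` anything except `/`."""
    parts = re.split(r"(\*\*|\*)", pattern)
    regex = "".join(
        ".*" if part == "**" else "[^/]*" if part == "*" else re.escape(part)
        for part in parts
    )
    return re.compile(regex)


def accepts(name, includes, excludes):
    """Accept a name matching some include pattern and no exclude pattern."""
    def matches(patterns):
        return any(glob_to_regex(p).fullmatch(name) for p in patterns)
    return matches(includes) and not matches(excludes)


tracked = {"includes": ["mysystem/**"],                # assumed for illustration
           "excludes": ["mysystem/some-supervisor/*"]}
group = {"includes": ["mysystem/some-supervisor/*"], "excludes": []}

child = "mysystem/some-supervisor/$aX"
other = "mysystem/user-registry"                       # hypothetical actor

child_tracked = accepts(child, **tracked)   # False: no per-actor series
other_tracked = accepts(other, **tracked)   # True: still tracked individually
child_grouped = accepts(child, **group)     # True: counted in "mygroup"
```

All children of the supervisor then contribute to a single set of group metrics, so creating more unnamed actors no longer creates more time series.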
28. • Limit the number of samples per scrape:
<scrape_config>
# Per-scrape limit on number of scraped samples that will be accepted.
[ sample_limit: <int> | default = 0 ]
• Watch for limit kicking in:
prometheus_target_scrapes_exceeded_sample_limit_total
How to Fix Prometheus to Not Kill Itself
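What the limit does can be modelled in a few lines. According to the Prometheus configuration documentation, a scrape that returns more samples than `sample_limit` is treated as failed as a whole, and 0 means no limit. The `Target` class below is a hypothetical simplification of that behaviour, with `exceeded_total` standing in for the counter named above.

```python
class Target:
    """Simplified model of one scrape target with a sample limit."""

    def __init__(self, sample_limit=0):
        self.sample_limit = sample_limit   # 0 means no limit
        self.exceeded_total = 0
        self.stored = []

    def scrape(self, samples):
        """Store the samples, or reject the entire scrape if over the limit."""
        samples = list(samples)
        if self.sample_limit and len(samples) > self.sample_limit:
            self.exceeded_total += 1
            return False
        self.stored.extend(samples)
        return True


target = Target(sample_limit=1000)
normal = target.scrape(range(500))        # accepted
exploded = target.scrape(range(40_000))   # rejected as a whole

print(normal, exploded, target.exceeded_total, len(target.stored))
```

The limit protects the server at the price of losing every sample from the offending target, which is why the counter is worth alerting on.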
30. • Hosted service
• by Kamon developers
• currently in private beta
• no price tags, yet
• Great user experience for us
• tailored to Akka monitoring
• distributions over time
• still, few rough edges
Kamino Hosted Service
[Figure columns: Targets, Time Series DB, Dashboard]
33. • Kamon offers wide range of APM features
• customized and automated metric collection
• works with both on-prem/OSS and SaaS "backends"
• super friendly community, thanks Ivan!
• distributed tracing
• Monitor your application (from the inside!)
• now!
• better start small
Summary & Conclusion
34. Find me at the Speaker’s Roundtable
Questions, please!
38. Setup with Kamon
• JVM: your application (port 80) with Kamon and kamon-prometheus (port 9095)
• Prometheus (port 9090): storage, retrieval, PromQL
• Prometheus scrapes kamon-prometheus and the Node Exporter (port 9100)
• Grafana (*magic*): uses Prometheus as a data source
41. • Kamon core trackable values
• highest trackable values for range sampler / histogram
• can be adjusted per metric
• Default Prometheus histogram buckets might not fit
• global default can be adjusted
• PR pending for overriding per metric [1]
Adjusting Value Ranges / Aggregation
[1] kamon-io/kamon-prometheus#12
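Why default buckets might not fit can be shown with a small experiment. The sketch below uses the default bounds of the official Prometheus client libraries, which are designed for latencies in seconds; the defaults of kamon-prometheus may differ, and `bucket_counts`, the mailbox sizes, and the adjusted bounds are made up for illustration.

```python
from bisect import bisect_left


def bucket_counts(values, upper_bounds):
    """Count values per bucket; the last slot is the implicit +Inf bucket."""
    counts = [0] * (len(upper_bounds) + 1)
    for value in values:
        # index of the first bucket whose upper bound is >= value
        counts[bisect_left(upper_bounds, value)] += 1
    return counts


latency_bounds = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
mailbox_sizes = [12, 40, 250, 900]

# Bounds meant for seconds: every mailbox size lands in +Inf.
misfit = bucket_counts(mailbox_sizes, latency_bounds)

# Bounds chosen for the value range: the distribution becomes visible.
adjusted = bucket_counts(mailbox_sizes, [10, 50, 100, 500, 1000])

print(misfit, adjusted)
```

With the misfitting bounds, every percentile query can only answer "more than 10".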
42. Histograms
[Figure: histogram over time (axes: t and value, with ticks at 10, 30, 50; shading: observations from 0 to max) next to a histogram of a single sample (x-axis: value 10 to 50, y-axis: observations)]
• Describe values better than avg/min/max do
• Can be aggregated across nodes
• Usually used to compute percentiles/quantiles
• Xth percentile: X% of the values are lower than <n>
• Median (= 50th percentile)
• SLO/SLA candidates: 90th/95th/99th percentile of response times
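Aggregation across nodes and percentile computation can be sketched as follows. The example uses the nearest-rank definition of a percentile and invented response times in milliseconds; `merge` and `percentile` are hypothetical helpers, not functions of Kamon or Prometheus.

```python
from collections import Counter


def merge(*histograms):
    """Aggregate per-node histograms by adding up their observation counts."""
    total = Counter()
    for histogram in histograms:
        total.update(histogram)
    return total


def percentile(histogram, p):
    """Smallest recorded value that at least p% of the observations do not exceed."""
    needed = p * sum(histogram.values()) / 100
    seen = 0
    for value in sorted(histogram):
        seen += histogram[value]
        if seen >= needed:
            return value
    raise ValueError("empty histogram")


# response time in ms -> number of observations, one histogram per node
node_a = Counter({10: 5, 20: 3, 50: 1})
node_b = Counter({10: 2, 30: 6, 50: 3})

cluster = merge(node_a, node_b)
median = percentile(cluster, 50)
p90 = percentile(cluster, 90)

print(median, p90)
```

Merging works because bucket counts simply add up; precomputed averages or percentiles from individual nodes cannot be combined this way.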