2. Optimisation opportunities
▹ Improve coordination
▸ Heavy contention among workers for the
same job, resulting in wasted operations
▹ Performance degradation in large series
▸ Larger file == slower read
▸ Larger file == slower write
6. Testing methodology (small series)
▹ POST 1000 generic resources spread
across 20 workers, 20 metrics each
▸ every 5 minutes
▸ 60 points/metric per POST
▸ 1.2 million points every 5 min
▸ 3 granularities x 8 aggregates
▸ ~500 points in most granular series
▹ 3 metricd services, 24 workers each
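To spell out the arithmetic behind the load figures above, a quick back-of-the-envelope check (an illustrative Python sketch, not part of the benchmark tooling):

    resources = 1000
    metrics_per_resource = 20
    points_per_metric_per_post = 60
    granularities = 3
    aggregates_per_granularity = 8

    metrics = resources * metrics_per_resource                      # 20,000 metrics
    points_per_cycle = metrics * points_per_metric_per_post         # 1,200,000 points per 5 min
    series_per_metric = granularities * aggregates_per_granularity  # 24 aggregated series

    print(metrics, points_per_cycle, series_per_metric)  # 20000 1200000 24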
10. Testing methodology (small series)
▹ POST 1000 generic resources spread
across 20 workers, 20 metrics each
▸ every 5 minutes
▸ 60 points/metric per POST
▸ 1.2 million points every 5 min
▸ 3 granularities x 8 aggregates
▸ ~500 points in most granular series
▹ 3 metricd services, 24 workers each
11. Amount of time required to compute 1.2M points across 20K metrics into 24 different aggregations.
12. Testing methodology (medium series)
▹ POST 500 generic resources spread across
20 workers, 20 metrics each
▸ every 5 minutes
▸ 720 points/metric per POST
▸ 7.2 million points every 5 min
▸ 3 granularities x 8 aggregates
▸ ~7000 points in most granular series
▹ 3 metricd services, 24 workers each
16. Release notes v3
▹ New storage format for new, back window and
aggregated series (struct serialisation replaces msgpack)
▹ Storage compression
▹ No-read append writes (Ceph only)
▹ Dynamic resource configuration
▹ Coordinated task scheduling (see the sketch below)
▹ Performance-related changes to aggregation logic
▹ Grafana 3.0 support
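The coordinated task scheduling item targets the contention problem called out in the optimisation slide: instead of every worker racing for every metric, each worker only claims the metrics that map to its partition. Gnocchi coordinates its workers through the tooz library; the plain-Python hashing below is only a sketch of the idea, not the actual implementation, and all names are illustrative:

    import hashlib

    def owned_by(metric_id, worker_index, total_workers):
        """Deterministically map a metric to exactly one worker.

        Every worker applies the same rule to the same backlog, so no two
        workers pick the same job and no work is wasted on contention.
        """
        digest = hashlib.md5(metric_id.encode("utf-8")).hexdigest()
        return int(digest, 16) % total_workers == worker_index

    # Example: worker 3 of 24 filters the backlog down to its own share.
    backlog = ["metric-%05d" % i for i in range(20000)]
    my_jobs = [m for m in backlog if owned_by(m, worker_index=3, total_workers=24)]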
17. Computation time per metric
Amount of time required to compute new measures into 24 different aggregates.
~60% less processing time at lower unprocessed sizes; ~40% less at higher unprocessed sizes.
18. Write throughput
Data generated using the benchmark tool in the client. 32 single-threaded API server processes, 4x12-thread client.
Gnocchi writes to disk but will be enhanced to write to memory (for Ceph).
22. Disk size
Live datapoints are 18 B in v2.1 and at worst 8 B (9 B in Ceph) in v3. Compression is applied, which lowers size further.
v2.1 datapoints were serialised using msgpack. In v3, the storage format is optimised for space efficiency and compression.
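A rough illustration of where the per-point saving comes from: if aggregated points sit on a fixed time grid, the timestamp can be implicit (the slot index) and only the value needs storing, which a general-purpose compressor then shrinks further. This is a sketch of the idea only, not the actual v3 storage layout; zlib is used just to keep the example dependency-free:

    import struct
    import zlib

    # 500 points on a fixed grid: the timestamp is the slot index, so each
    # live point costs one float64, i.e. 8 bytes at worst before compression.
    values = [42.0 + (i % 10) * 0.5 for i in range(500)]
    raw = struct.pack("<%dd" % len(values), *values)

    compressed = zlib.compress(raw)

    print(len(raw) / len(values))   # 8.0 bytes per point, uncompressed
    print(len(compressed))          # far smaller on a repetitive series like this one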
23. Testing methodology (short series)
▹ POST 1000 generic resources spread
across 20 workers, 20 metrics each
▸ every 5 minutes
▸ 60 points/metric per POST
▸ 1.2 million points every 5 min
▸ 3 granularities x 8 aggregates
▸ ~500 points in most granular series
▹ 3 metricd services, 24 workers each
24. Amount of time required to process 1.2M points across 20K metrics into 24 different aggregates per cycle.
25. Number of metrics processed per 5 s. No degradation between batches and more consistent processing rates.
v2.1.4 averages ~6 metrics (144 aggregates) calculated every 5 seconds per worker.
v3.0 averages ~10.88 metrics (261 aggregates) calculated every 5 seconds per worker.
26. Amount of Ceph IOPS required to process 1.2M points across 20K metrics into 24 different aggregates per cycle.
Fewer operations per second means lower hardware requirements.
27. Number of times a worker attempts to handle a metric that has already been handled.
Fewer contentions mean better job scheduling.
28. Time required to POST 1.2M points across 20K metrics under load.
20 workers making 50 POSTs of 20 metrics, 60 points per metric.
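For context, one of those benchmark POSTs looks roughly like the sketch below, assuming Gnocchi's batch-measures endpoint and token auth. The endpoint URL, token, and metric IDs are placeholders, and the actual benchmark tool lives in the client, so treat this as an approximation rather than its real code:

    import datetime
    import requests

    GNOCCHI = "http://localhost:8041"                       # placeholder endpoint
    TOKEN = "..."                                           # placeholder auth token
    METRICS = ["metric-uuid-%02d" % i for i in range(20)]   # 20 metrics per POST (placeholders)

    def one_post(start):
        payload = {
            metric_id: [
                {"timestamp": (start + datetime.timedelta(seconds=5 * i)).isoformat(),
                 "value": float(i)}
                for i in range(60)                          # 60 points per metric
            ]
            for metric_id in METRICS
        }
        r = requests.post(GNOCCHI + "/v1/batch/metrics/measures",
                          json=payload,
                          headers={"X-Auth-Token": TOKEN})
        r.raise_for_status()

    # 20 workers x 50 POSTs x 20 metrics x 60 points = 1.2M points per run.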
29. Testing methodology (medium series)
▹ POST 500 generic resources spread across
20 workers, 20 metrics each
▸ every 5 minutes
▸ 720 points/metric per POST
▸ 7.2 million points every 5 min
▸ 3 granularities x 8 aggregates
▸ ~5760 points in most granular series
▹ 3 metricd services, 24 workers each
30. Amount of time required to process 7.2M points across 10K metrics into 24 different aggregates per cycle.
32. Threading enabled.
Python threading only pays off when work is I/O-heavy (see: GIL). CPU usage has been minimised, so threading is now effective.
54 metricd workers with 8 threads each give roughly the same performance as 72 single-threaded metricd workers.
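To make the GIL point concrete: threads only overlap while they are blocked on I/O, which is where metricd now spends its time once the CPU-heavy aggregation work has been trimmed. A minimal, self-contained illustration of the pattern (the sleeps stand in for object-store reads and writes; none of this is metricd code):

    import time
    from concurrent.futures import ThreadPoolExecutor

    def process_metric(metric_id):
        # While one thread blocks on (simulated) storage I/O, the GIL is
        # released and the other threads in the pool make progress.
        time.sleep(0.05)   # fetch unprocessed measures
        time.sleep(0.05)   # write back the aggregated series
        return metric_id

    jobs = ["metric-%03d" % i for i in range(100)]

    with ThreadPoolExecutor(max_workers=8) as pool:
        done = list(pool.map(process_metric, jobs))
    # ~10s of simulated I/O completes in roughly 1/8th of the time with 8 threads.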
33. Effect of number of aggregates on processing time.
Fewer aggregates per metric means less time to process; more aggregates per metric means more time.
Note: the spike at batch 5 is due to compression logic triggered by that dataset.
34. Effect of series length on datapoint size.
Long, medium, and short series contain 83K, 1.4K, and 400 points respectively.
Shorter series are more likely to hold stale, to-be-deleted data and offer fewer compression opportunities (the latter applies only to Ceph).
35. Effect of series length on datapoint size.
Shorter series tend to have higher granularity and thus larger back window requirements.
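For reference, the granularity, aggregate, and back-window knobs discussed above all live in the archive policy attached to a metric. The policy below is a hypothetical example in the same 3-granularities-by-8-aggregates shape as these tests; the concrete numbers are assumptions for illustration, not the policy actually used in this deck:

    # Hypothetical archive policy, to be POSTed as JSON to /v1/archive_policy.
    archive_policy = {
        "name": "bench-3x8",
        # extra periods kept writable so late-arriving points can still land
        "back_window": 2,
        "aggregation_methods": [
            "mean", "min", "max", "sum", "std", "median", "count", "95pct",
        ],
        "definition": [                                   # granularity in seconds
            {"granularity": 30, "points": 500},           # fine series, short retention
            {"granularity": 300, "points": 1440},         # 5-minute series
            {"granularity": 3600, "points": 720},         # hourly series, long retention
        ],
    }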