Chronix: Long Term Storage and Retrieval Technology
for Anomaly Detection in Operational Data
Florian Lautenschlager,1
Michael Philippsen,2
Andreas Kumlehn,2
and Josef Adersberger1
1QAware GmbH, Munich, Germany 2University Erlangen-Nürnberg (FAU), Programming Systems Group, Erlangen
Chronix: Long Term Storage and Retrieval Technology
for Anomaly Detection in Operational Data
Florian Lautenschlager,1
Michael Philippsen,2
Andreas Kumlehn,2
and Josef Adersberger1
1QAware GmbH, Munich, Germany 2University Erlangen-Nürnberg (FAU), Programming Systems Group, Erlangen
Abstract
Anomalies in the runtime behavior of software systems, especially
in distributed systems, are inevitable, expensive, and hard to lo-
cate. To detect and correct such anomalies one has to automati-
cally collect, store, and analyze the operational data of the run-
time behavior, often represented as time series. There are efficient
means both to collect and analyze the runtime behavior. But
general-purpose time series databases do not focus
on the specific needs of anomaly detection. Chronix
is a domain specific time series database targeted at
anomaly detection in operational data.
Detecting Anomalies in Running Software matters
• Resource consumption: anomalous
memory consumption, high CPU usage, . . .
• Sporadic failure: blocking state,
deadlock, dirty read, . . .
• Security: port scanning activity,
short frequent login attempts, . . .
→ Economic or reputation loss.
Anomaly Detection Tool Chain for Operational Data
Collection
Framework
Analysis
FrameworkTime Series Database
Collects operational data
from a running application
Asks the database for data
and analyzes the data
Stores the time series data
General Purpose TSDB
• Brake shoe
• Resource hog
• Productivity obstacle
Chronix:
Domain specific TSDB
Domain specific sensors
and adaptors
Domain specific analysis
algorithms and tools
Application’s
Operational
Data
Types of operational data:
• Metrics: scalar values, e.g.,
rates, runtimes, total hits,
counters, …
• Events: single occurrences,
e.g., a user’s login, product
order, …
• Traces: sequences within a
software system, e.g., the
called methods, …
General-purpose TSDBs in Anomaly Detection
Requirements
G
raphiteInfluxDB
OpenTSDBK
airosDBProm
etheus
Generic
data model
Analysis
support
Lossless long
term storage
No support for data types
= Productivity obstacle
No support for analyses
= Productivity obstacle + Brake shoe
High memory footprint
= Performance hog
High storage demands
= Performance hog
Loss of historical data
= Brake shoe
What makes Chronix domain specific?
Option to pre-compute an extra representation of the data
Optional timestamp compression for almost-periodic time series
Records that meet the needs of the domain
Compression technique that suits the domain’s data
Underlying multi-dimensional storage
Domain specific query language with server-side evaluation
Domain specific commissioning of configuration parameters
How it works!
Example: Almost-periodic time series
Timestamp Value Metric Process Host
25.10.2016 00:00:01.546 218.34 ingestertime SmartHub QAMUC
25.10.2016 00:00:06.718 218.37 ingestertime SmartHub QAMUC
25.10.2016 00:00:11.891 218.49 ingestertime SmartHub QAMUC
25.10.2016 00:00:16.964 218.35 ingestertime SmartHub QAMUC
… … … … …
Optional Pre-compute Extras
Timestamp Value Metric Process Host SAX
25.10.2016 00:00:01.546 218.34 ingestertime SmartHub QAMUC A
25.10.2016 00:00:06.718 218.37 ingestertime SmartHub QAMUC B
25.10.2016 00:00:11.891 218.49 ingestertime SmartHub QAMUC C
25.10.2016 00:00:16.964 218.35 ingestertime SmartHub QAMUC B
… … … … … …
B
A
C
B
C
D
C
A
B
• Lossless storage that keeps all details as analyses may need them.
• Programming interface to add extra domain specific "columns".
• These "columns" speed up anomaly detection queries.
Optional Timestamp Compaction
Timestamp
25.10 … :01.546
25.10 … :06.718
25.10 … :11.891
25.10 … :16.964
…
Timestamp
25.10 … :01.546
5.172
5.173
5.073
…
Timestamp
25.10 … :01.546
5.172
0.001
0.1
…
Timestamp
25.10 … :01.546
5.172
-
-
…
Timestamp
25.10 … :01.546
5.172
…
space saved
space
saved
Drop diffs
below threshold
Calculate
deltas
Compute
diffs
between
them
If accumulated
drift > threshold store delta
• Date-Delta-Compaction for almost-periodic time series.
• Functionally lossless as all relevant details are kept.
• Degree of inaccuracy is a configuration parameter of
Domain Specifc Records
Process Host SAX
SmartHub QAMUC A
SmartHub QAMUC B
SmartHub QAMUC C
SmartHub QAMUC B
… … …
1
Record
metric: ingestertime
process: SmartHub
host: QAMUC
start: 25.10.2016 00:00:01.546
end: …
type: metric
data: Timestamp Value SAX
25.10.2016 00:00:01.546 218.34 A
5.172 218.37 B
- 218.49 C
- 218.35 B
1
BLOB
chunk
& convert
2
Timestamp Value Metric
25….:01.546 218.34 ingestertime
5.172 218.37 ingestertime
- 218.49 ingestertime
- 218.35 ingestertime
… … …
1
1 1 2
22
• Exploit repetitiveness and bundle "lines" into data chunks.
• Programming interface for a specifc time series record encoding.
• Chunk size is a configuration parameter of
Domain Specific Compression
Record
metric: ingestertime
process: SmartHub
host: QAMUC
start: 25.10.2016 00:00:01.546
end: …
type: metric
data: 00105e0 e6b0 343b 9
07bc 0804 e7d508040
Record
metric: ingestertime
process: SmartHub
host: QAMUC
start: 25.10.2016 00:00:01.546
end: …
type: metric
data: Timestamp Value SAX
25.10.2016 00:00:01.546 218.34 A
5.172 218.37 B
- 218.49 C
- 218.35 B Compressed BLOB
serialize
& compress
• Lossless compression techniques minimizes the record size.
• Domain data often has small increments, recurring patterns, etc.
• Choice of compression technqiue is a configuration parameter of
Multi-Dimensional Storage
Timestamp Value Metric Process Host
25.10.2016 00:00:01.546 218.34 ingestertime SmartHub QAMUC
25.10.2016 00:00:06.718 218.37 ingestertime SmartHub QAMUC
25.10.2016 00:00:11.891 218.49 ingestertime SmartHub QAMUC
25.10.2016 00:00:16.964 218.35 ingestertime SmartHub QAMUC
… … … … …
q=host:QAMUC AND metric:ingester*
AND type:[metric OR trace] AND end:NOW-7MONTH
• Explorative: Users can use the attributes to find a record.
• Correlating: Queries can use and combine all types.
Query Language & Server-Side eval.
Basic Graphite
InfluxDB
OpenTSDB
KairosDB
Prometheus
Chronix
distinct × × × ×
integral × × × ×
min/max/sum
bottom/top × × ×
first/last × ×
... . .. ... ... .. . ... .. .
nnderivative × × ×
movavg × × ×
divide/scale ×
High-level
sax [33] × × × × ×
fastdtw [38] × × × × ×
outlier × × × × ×
trend × × × × ×
frequency × × × × ×
grpsize × × × × ×
split × × × × ×
query
read
result
process
q=metric:ingester* & cf=outlier
• Basic functions & high-level built-in domain specific functions.
• Plug-in interface to add functions for server-side evalution.
Domain Specific Commissioning
0
10
20
30
40
50
60
70
80
90
0 5 10 50 100 200 1000
DDC threshold in ms
Ratesin%−Averageinms
Inaccuracy Rate
Average Deviation
Space Reduction
32
34
36
38
40
42
44
46
48
32 64 128 256 512 1024
Chunk size in kBytes
Totalaccesstimeinsec
gzip
LZ4
Snappy
• DDC-Threshold: 200 ms.
• Compression & Chunk Size: gzip + 128 kByte.
Easily detectpattern!
Fast!
Best Values!
Select the ideal
Compression!
Projects 1–3
Remove
Jitter!
Evaluation
Benchmark Client
Ubuntu 16.04.1 x64, 12 core, 32 GB Ram,
512 GB SSD
Java
Benchmark
Benchmark Server
Ubuntu 16.04.1 x64, 12 core, 32 GB Ram,
512 GB SSD
Docker InfluxDB KairosDB
ChronixOpenTSDB
Queries
Time Series
HTTP
Data of 5 Industry Projects.
Project 1 2 3 4 5 total
time series 1,080 8,567 4,538 500 24,055 38,740
pairs
(mio)
metric 2.4 331.4 162.6 3.9 3,762.3 4,262.6
lsof 0.0 0.0 0.0 0.4 0.0 0.4
strace 0.0 0.0 0.0 12.1 0.0 12.1
(a) Pairs and time series per project.
Project r 0.5 1 7 14 21 28 56 91 180
1 – 3 q 15 30 30 10 5 3 1 2 0 96
q 2 11 15 8 12 5 1 2 2 58
4 & 5 b 1 6 5 7 2 4 4 1 2 32
h 2 6 10 8 6 6 3 2 0 43
(b) Time ranges r (days) and occurrences of queries (q) for raw data retrieval,
and of queries with basic (b) and high-level (h) functions.
Memory footprint (in MBytes)
InfluxDB
OpenTSDB
KairosDB
Chronix
Initially 33 2,726 8,763 446
Import (max) 10,336 10,111 18,905 7,002
Query (max) 8,269 9,712 11,230 4,792
Chronix has a 34%–69% smaller memory footprint.
Storage demands (in GBytes)
Project Raw
Data
InfluxDB
OpenTSDB
KairosDB
Chronix
4 1.2 0.2 0.2 0.3 0.1
5 107.0 10.7 16.9 26.5 8.6
total 108.2 10.9 17.1 26.8 8.7
Chronix saves 20%–68% of the storage space.
Data retrieval times (in s)
r q InfluxDB
OpenTSDB
KairosDB
Chronix
0.5 2 4.3 2.8 4.4 0.9
1 11 5.2 5.6 6.6 5.3
7 15 34.1 17.4 26.8 7.0
14 8 36.2 14.2 25.5 4.0
21 12 76.5 29.8 55.0 6.0
28 5 7.9 3.9 5.6 0.5
56 1 35.4 12.4 24.1 1.2
91 2 47.5 15.5 33.8 1.1
180 2 96.7 36.7 66.6 1.1
total 343.8 138.3 248.4 27.1
Chronix saves 80%–92% on data retrieval time.
Times for b- and h-queries (in s)
Basic (b) InfluxDB
OpenTSDB
KairosDB
Chronix
4 Avg. 0.9 6.1 9.8 4.4
5 Max. 1.3 8.4 9.1 6.0
3 Min. 0.7 2.7 5.3 2.8
3 Dev. 6.7 16.7 21.1 2.3
5 Sum 0.7 6.0 12.0 2.0
4 Count 0.8 5.5 10.5 1.0
8 Perc. 10.2 25.8 34.5 8.6
High-level (h)
12 Outlier 30.7 29.1 117.6 18.9
14 Trend 162.7 50.4 100.6 30.2
11 Freq. 47.3 23.9 45.7 16.3
3 GroupSize 218.9 2927.8 206.3 29.6
3 Split 123.1 2893.9 47.9 37.2
75 total 604.0 5996.3 620.4 159.3
Chronix saves 73%–97% on analysis times.
Typical scenario.
Im
portant!
Neededfor
exploration!
Evaluation:
Projects
4 & 5
Conclusion
Chronix exploits the characteristics of the domain in many ways and
thus achieves better storage and query results. Chronix is open source. www.chronix.io
Acknowledgements
This research was in part supported by the Bavarian Ministry of Economic Affairs and Media, Energy and
Technology as an IuK-grant for the project DfD – Design for Diagnosability.

Chronix Poster for the Poster Session FAST 2017

  • 1.
    Chronix: Long TermStorage and Retrieval Technology for Anomaly Detection in Operational Data Florian Lautenschlager,1 Michael Philippsen,2 Andreas Kumlehn,2 and Josef Adersberger1 1QAware GmbH, Munich, Germany 2University Erlangen-Nürnberg (FAU), Programming Systems Group, Erlangen Chronix: Long Term Storage and Retrieval Technology for Anomaly Detection in Operational Data Florian Lautenschlager,1 Michael Philippsen,2 Andreas Kumlehn,2 and Josef Adersberger1 1QAware GmbH, Munich, Germany 2University Erlangen-Nürnberg (FAU), Programming Systems Group, Erlangen Abstract Anomalies in the runtime behavior of software systems, especially in distributed systems, are inevitable, expensive, and hard to lo- cate. To detect and correct such anomalies one has to automati- cally collect, store, and analyze the operational data of the run- time behavior, often represented as time series. There are efficient means both to collect and analyze the runtime behavior. But general-purpose time series databases do not focus on the specific needs of anomaly detection. Chronix is a domain specific time series database targeted at anomaly detection in operational data. Detecting Anomalies in Running Software matters • Resource consumption: anomalous memory consumption, high CPU usage, . . . • Sporadic failure: blocking state, deadlock, dirty read, . . . • Security: port scanning activity, short frequent login attempts, . . . → Economic or reputation loss. Anomaly Detection Tool Chain for Operational Data Collection Framework Analysis FrameworkTime Series Database Collects operational data from a running application Asks the database for data and analyzes the data Stores the time series data General Purpose TSDB • Brake shoe • Resource hog • Productivity obstacle Chronix: Domain specific TSDB Domain specific sensors and adaptors Domain specific analysis algorithms and tools Application’s Operational Data Types of operational data: • Metrics: scalar values, e.g., rates, runtimes, total hits, counters, … • Events: single occurrences, e.g., a user’s login, product order, … • Traces: sequences within a software system, e.g., the called methods, … General-purpose TSDBs in Anomaly Detection Requirements G raphiteInfluxDB OpenTSDBK airosDBProm etheus Generic data model Analysis support Lossless long term storage No support for data types = Productivity obstacle No support for analyses = Productivity obstacle + Brake shoe High memory footprint = Performance hog High storage demands = Performance hog Loss of historical data = Brake shoe What makes Chronix domain specific? Option to pre-compute an extra representation of the data Optional timestamp compression for almost-periodic time series Records that meet the needs of the domain Compression technique that suits the domain’s data Underlying multi-dimensional storage Domain specific query language with server-side evaluation Domain specific commissioning of configuration parameters How it works! Example: Almost-periodic time series Timestamp Value Metric Process Host 25.10.2016 00:00:01.546 218.34 ingestertime SmartHub QAMUC 25.10.2016 00:00:06.718 218.37 ingestertime SmartHub QAMUC 25.10.2016 00:00:11.891 218.49 ingestertime SmartHub QAMUC 25.10.2016 00:00:16.964 218.35 ingestertime SmartHub QAMUC … … … … … Optional Pre-compute Extras Timestamp Value Metric Process Host SAX 25.10.2016 00:00:01.546 218.34 ingestertime SmartHub QAMUC A 25.10.2016 00:00:06.718 218.37 ingestertime SmartHub QAMUC B 25.10.2016 00:00:11.891 218.49 ingestertime SmartHub QAMUC C 25.10.2016 00:00:16.964 218.35 ingestertime SmartHub QAMUC B … … … … … … B A C B C D C A B • Lossless storage that keeps all details as analyses may need them. • Programming interface to add extra domain specific "columns". • These "columns" speed up anomaly detection queries. Optional Timestamp Compaction Timestamp 25.10 … :01.546 25.10 … :06.718 25.10 … :11.891 25.10 … :16.964 … Timestamp 25.10 … :01.546 5.172 5.173 5.073 … Timestamp 25.10 … :01.546 5.172 0.001 0.1 … Timestamp 25.10 … :01.546 5.172 - - … Timestamp 25.10 … :01.546 5.172 … space saved space saved Drop diffs below threshold Calculate deltas Compute diffs between them If accumulated drift > threshold store delta • Date-Delta-Compaction for almost-periodic time series. • Functionally lossless as all relevant details are kept. • Degree of inaccuracy is a configuration parameter of Domain Specifc Records Process Host SAX SmartHub QAMUC A SmartHub QAMUC B SmartHub QAMUC C SmartHub QAMUC B … … … 1 Record metric: ingestertime process: SmartHub host: QAMUC start: 25.10.2016 00:00:01.546 end: … type: metric data: Timestamp Value SAX 25.10.2016 00:00:01.546 218.34 A 5.172 218.37 B - 218.49 C - 218.35 B 1 BLOB chunk & convert 2 Timestamp Value Metric 25….:01.546 218.34 ingestertime 5.172 218.37 ingestertime - 218.49 ingestertime - 218.35 ingestertime … … … 1 1 1 2 22 • Exploit repetitiveness and bundle "lines" into data chunks. • Programming interface for a specifc time series record encoding. • Chunk size is a configuration parameter of Domain Specific Compression Record metric: ingestertime process: SmartHub host: QAMUC start: 25.10.2016 00:00:01.546 end: … type: metric data: 00105e0 e6b0 343b 9 07bc 0804 e7d508040 Record metric: ingestertime process: SmartHub host: QAMUC start: 25.10.2016 00:00:01.546 end: … type: metric data: Timestamp Value SAX 25.10.2016 00:00:01.546 218.34 A 5.172 218.37 B - 218.49 C - 218.35 B Compressed BLOB serialize & compress • Lossless compression techniques minimizes the record size. • Domain data often has small increments, recurring patterns, etc. • Choice of compression technqiue is a configuration parameter of Multi-Dimensional Storage Timestamp Value Metric Process Host 25.10.2016 00:00:01.546 218.34 ingestertime SmartHub QAMUC 25.10.2016 00:00:06.718 218.37 ingestertime SmartHub QAMUC 25.10.2016 00:00:11.891 218.49 ingestertime SmartHub QAMUC 25.10.2016 00:00:16.964 218.35 ingestertime SmartHub QAMUC … … … … … q=host:QAMUC AND metric:ingester* AND type:[metric OR trace] AND end:NOW-7MONTH • Explorative: Users can use the attributes to find a record. • Correlating: Queries can use and combine all types. Query Language & Server-Side eval. Basic Graphite InfluxDB OpenTSDB KairosDB Prometheus Chronix distinct × × × × integral × × × × min/max/sum bottom/top × × × first/last × × ... . .. ... ... .. . ... .. . nnderivative × × × movavg × × × divide/scale × High-level sax [33] × × × × × fastdtw [38] × × × × × outlier × × × × × trend × × × × × frequency × × × × × grpsize × × × × × split × × × × × query read result process q=metric:ingester* & cf=outlier • Basic functions & high-level built-in domain specific functions. • Plug-in interface to add functions for server-side evalution. Domain Specific Commissioning 0 10 20 30 40 50 60 70 80 90 0 5 10 50 100 200 1000 DDC threshold in ms Ratesin%−Averageinms Inaccuracy Rate Average Deviation Space Reduction 32 34 36 38 40 42 44 46 48 32 64 128 256 512 1024 Chunk size in kBytes Totalaccesstimeinsec gzip LZ4 Snappy • DDC-Threshold: 200 ms. • Compression & Chunk Size: gzip + 128 kByte. Easily detectpattern! Fast! Best Values! Select the ideal Compression! Projects 1–3 Remove Jitter! Evaluation Benchmark Client Ubuntu 16.04.1 x64, 12 core, 32 GB Ram, 512 GB SSD Java Benchmark Benchmark Server Ubuntu 16.04.1 x64, 12 core, 32 GB Ram, 512 GB SSD Docker InfluxDB KairosDB ChronixOpenTSDB Queries Time Series HTTP Data of 5 Industry Projects. Project 1 2 3 4 5 total time series 1,080 8,567 4,538 500 24,055 38,740 pairs (mio) metric 2.4 331.4 162.6 3.9 3,762.3 4,262.6 lsof 0.0 0.0 0.0 0.4 0.0 0.4 strace 0.0 0.0 0.0 12.1 0.0 12.1 (a) Pairs and time series per project. Project r 0.5 1 7 14 21 28 56 91 180 1 – 3 q 15 30 30 10 5 3 1 2 0 96 q 2 11 15 8 12 5 1 2 2 58 4 & 5 b 1 6 5 7 2 4 4 1 2 32 h 2 6 10 8 6 6 3 2 0 43 (b) Time ranges r (days) and occurrences of queries (q) for raw data retrieval, and of queries with basic (b) and high-level (h) functions. Memory footprint (in MBytes) InfluxDB OpenTSDB KairosDB Chronix Initially 33 2,726 8,763 446 Import (max) 10,336 10,111 18,905 7,002 Query (max) 8,269 9,712 11,230 4,792 Chronix has a 34%–69% smaller memory footprint. Storage demands (in GBytes) Project Raw Data InfluxDB OpenTSDB KairosDB Chronix 4 1.2 0.2 0.2 0.3 0.1 5 107.0 10.7 16.9 26.5 8.6 total 108.2 10.9 17.1 26.8 8.7 Chronix saves 20%–68% of the storage space. Data retrieval times (in s) r q InfluxDB OpenTSDB KairosDB Chronix 0.5 2 4.3 2.8 4.4 0.9 1 11 5.2 5.6 6.6 5.3 7 15 34.1 17.4 26.8 7.0 14 8 36.2 14.2 25.5 4.0 21 12 76.5 29.8 55.0 6.0 28 5 7.9 3.9 5.6 0.5 56 1 35.4 12.4 24.1 1.2 91 2 47.5 15.5 33.8 1.1 180 2 96.7 36.7 66.6 1.1 total 343.8 138.3 248.4 27.1 Chronix saves 80%–92% on data retrieval time. Times for b- and h-queries (in s) Basic (b) InfluxDB OpenTSDB KairosDB Chronix 4 Avg. 0.9 6.1 9.8 4.4 5 Max. 1.3 8.4 9.1 6.0 3 Min. 0.7 2.7 5.3 2.8 3 Dev. 6.7 16.7 21.1 2.3 5 Sum 0.7 6.0 12.0 2.0 4 Count 0.8 5.5 10.5 1.0 8 Perc. 10.2 25.8 34.5 8.6 High-level (h) 12 Outlier 30.7 29.1 117.6 18.9 14 Trend 162.7 50.4 100.6 30.2 11 Freq. 47.3 23.9 45.7 16.3 3 GroupSize 218.9 2927.8 206.3 29.6 3 Split 123.1 2893.9 47.9 37.2 75 total 604.0 5996.3 620.4 159.3 Chronix saves 73%–97% on analysis times. Typical scenario. Im portant! Neededfor exploration! Evaluation: Projects 4 & 5 Conclusion Chronix exploits the characteristics of the domain in many ways and thus achieves better storage and query results. Chronix is open source. www.chronix.io Acknowledgements This research was in part supported by the Bavarian Ministry of Economic Affairs and Media, Energy and Technology as an IuK-grant for the project DfD – Design for Diagnosability.