Aurora is an open-source CLI application that collects OpenStack boot logs across 100 boot cycles and performs textual structure extraction and semantic log classification on them.
2. What is it?
● Analysis of OpenStack boot logs using data analysis techniques to identify, understand and debug problems during boot
● Extracting the textual structure of OpenStack logs and performing semantic log classification
3. How we approached it...
● Log Collection
● Log Preprocessing and substitutions
● Extracting structure through textual clustering
● Online Semantic Log Classification
4. Design of log collection automation...
Log Collection
5. Automation Architecture
[Architecture diagram: the host OS uses the VMRUN/VIX APIs to start the server and the OpenStack services and to spawn guest VMs; logs are collected from OpenStack running inside the guest OS.]
7. Data set...
● Logs of 100 boot-up cycles
● Total messages: 337,958 lines
● Alert categories: 7
● Size: 63.1 MB
8. Log structure
Nov 13 07:57:23 pesos 2013-11-13 07:57:23.330 17739 DEBUG nova.openstack.common.rpc.amqp [-] UNIQUE_ID is 7458faafe75b4be3b51fe21c39820cd7. _add_unique_id /opt/stack/nova/nova/openstack/common/rpc/amqp.py:341
Nov 13 07:57:23 pesos 2013-11-13 07:57:23.227 17739 AUDIT nova.compute.resource_tracker [-] Free disk (GB): 195
Severity levels: DEBUG -> INFO -> AUDIT -> WARNING -> ERROR -> CRITICAL -> TRACE
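The fields visible in the lines above (syslog timestamp, host, OpenStack timestamp, PID, severity, logger name, message) can be pulled apart with a regular expression. A minimal sketch, assuming this fixed layout; the pattern and field names are our own, not from the Aurora code:

```python
import re

# Illustrative regex for the OpenStack log lines shown above:
# syslog timestamp, host, service timestamp, PID, severity, logger, message.
LOG_PATTERN = re.compile(
    r'^(?P<syslog_ts>\w{3} \d{2} \d{2}:\d{2}:\d{2}) '
    r'(?P<host>\S+) '
    r'(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) '
    r'(?P<pid>\d+) '
    r'(?P<severity>DEBUG|INFO|AUDIT|WARNING|ERROR|CRITICAL|TRACE) '
    r'(?P<logger>\S+) '
    r'(?P<message>.*)$'
)

line = ('Nov 13 07:57:23 pesos 2013-11-13 07:57:23.227 17739 '
        'AUDIT nova.compute.resource_tracker [-] Free disk (GB): 195')
m = LOG_PATTERN.match(line)
if m:
    print(m.group('severity'), m.group('logger'))  # AUDIT nova.compute.resource_tracker
```

Splitting the severity and logger out of each message is what makes the later per-message preprocessing and clustering possible.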
9. Key Characteristics...
● Logs contain redundant and duplicate information
● Logs have an unknown message structure
● Log messages contain a small set of unique words but a large number of numbers and other symbols
● The distribution of words is different from that found in natural languages
● Log messages tend to be short but are of variable length
10. Substitutions...
● Non-word tokens are substituted:
○ numbers - ‘<num>’
○ paths - ‘<path>’
○ URLs - ‘<url>’
○ IPs - ‘<ip>’
○ annotations - ‘’ (removed)
○ keys - ‘<y>’
○ Unicode keys - ‘<x>’
● Total number of unique tokens
○ before - 1.7 lakh (170,000)
○ after - 1216
● Total number of unique log messages
○ before - 3.5 lakh (350,000)
○ after - 1047
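The substitutions above can be applied as an ordered list of regex rewrites, with more specific patterns run before more general ones (URL before path, IP before plain number). A sketch under that assumption; the exact patterns here are our own approximations, not Aurora's:

```python
import re

# Illustrative substitution rules, applied in order so that more
# specific patterns win over more general ones.
SUBSTITUTIONS = [
    (re.compile(r'https?://\S+'), '<url>'),              # URLs
    (re.compile(r'(/[\w.\-]+)+(:\d+)?'), '<path>'),      # filesystem paths
    (re.compile(r'\b\d{1,3}(\.\d{1,3}){3}\b'), '<ip>'),  # IPv4 addresses
    (re.compile(r'\b[0-9a-f]{32}\b'), '<y>'),            # 32-char hex keys
    (re.compile(r'\b\d+\b'), '<num>'),                   # plain numbers
]

def substitute(message: str) -> str:
    for pattern, token in SUBSTITUTIONS:
        message = pattern.sub(token, message)
    return message

msg = ('UNIQUE_ID is 7458faafe75b4be3b51fe21c39820cd7. _add_unique_id '
       '/opt/stack/nova/nova/openstack/common/rpc/amqp.py:341')
print(substitute(msg))  # UNIQUE_ID is <y>. _add_unique_id <path>
```

Collapsing variable tokens this way is what shrinks 170,000 unique tokens to 1216 and 350,000 unique messages to 1047.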
11. Extracting structure through textual clustering
● Modified DBSCAN algorithm
○ adapted to work on a per-message basis
● Key components of the algorithm
○ Measuring similarity
○ Determining the similarity threshold for clustering
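The per-message clustering can be sketched as a single greedy pass: each message either joins the first cluster whose representative is similar enough, or seeds a new cluster. This is a simplified stand-in for the modified DBSCAN described above, and the Jaccard token-overlap similarity and threshold value are placeholder choices of our own:

```python
# Simplified density-style clustering over individual log messages.
# Each message is compared against existing cluster representatives;
# the 0.7 threshold and Jaccard similarity are illustrative only.

def jaccard(a: str, b: str) -> float:
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def cluster_messages(messages, threshold=0.7):
    clusters = []
    for msg in messages:
        for cluster in clusters:
            # Compare against the cluster's representative (first member).
            if jaccard(msg, cluster[0]) >= threshold:
                cluster.append(msg)
                break
        else:
            clusters.append([msg])
    return clusters

logs = [
    'Free disk (GB): <num>',
    'Free disk (GB): <num>',
    'UNIQUE_ID is <y>',
]
print(len(cluster_messages(logs)))  # 2
```

Because the messages were normalised by the substitution step first, structurally identical lines collapse into the same cluster even when their original numbers and IDs differed.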
12. Measuring similarity
● Modified Levenshtein distance algorithm
○ uses entire tokens as the operation set
○ normalised (Normalised LD) to avoid the bias of giving more importance to longer strings
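The two modifications above amount to running the standard edit-distance dynamic program over tokens instead of characters, then dividing by the length of the longer message. A sketch of that idea; function names are our own:

```python
# Levenshtein distance with whole tokens as the unit of insertion,
# deletion and substitution, normalised so message length carries no bias.

def token_levenshtein(a: list, b: list) -> int:
    # Classic dynamic-programming edit distance, one row at a time.
    prev = list(range(len(b) + 1))
    for i, ta in enumerate(a, 1):
        curr = [i]
        for j, tb in enumerate(b, 1):
            cost = 0 if ta == tb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def normalised_similarity(x: str, y: str) -> float:
    a, b = x.split(), y.split()
    if not a and not b:
        return 1.0
    return 1.0 - token_levenshtein(a, b) / max(len(a), len(b))

print(normalised_similarity('Free disk (GB): <num>',
                            'Free memory (MB): <num>'))  # 0.5
```

Dividing by the longer token count keeps a one-token difference in a long message from looking more similar than the same difference in a short one.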