On March 14, we attended the Icinga Camp event in Berlin. As in previous years, it was a blast.
System diagnostics is a long journey: identifying a problem, analyzing the system status, and discovering the root causes. Technically, this means listing the factors that lead to the problem in order of importance; the root cause sits at the top of the list. This is what we do and how we work.
To make this clearer, let's take a Dynamics AX network as an example. The network contains several hosts with different functions. On the left we have three end users; the broker delivers the AX remote session and handles load balancing, the AX AOS processes the application logic, and finally the data is stored in the SQL Server.
What do we offer as Würth Phoenix? We offer a cloud solution that monitors these different components with dedicated measures, so each of these hosts is properly instrumented. The infrastructure of this cloud service is structured as follows: we install Telegraf on the RDC Broker, the AX RDS, the AX AOS, and the SQL Server. Telegraf collects the desired metrics and sends them to a server on the customer network: NetEye, powered by Icinga. At this stage, the "customer NetEye" is synchronized with the "Würth Phoenix NetEye" through NATS, which acts as a data channel. To manage and present the data effectively, we use InfluxDB and Grafana.
At this point we have our monitoring system in place. As you can see, for each element we collect different measures and different metrics.
InfluxDB ingests and stores the data, and Grafana lets us deliver graphs to our customers. The representation of the data is the real value of the solution: it feeds the anomaly detection and helps us understand which direction to take and where to focus.
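As a rough illustration of this step, here is a minimal sketch of how such metrics could be pulled from InfluxDB into Python for further analysis using the influxdb client library. The host name, database, measurement, and tag values below are invented for the example, not the actual schema:

```python
from influxdb import InfluxDBClient
import pandas as pd

# Hypothetical connection parameters: adjust host, port and database
# to match your own NetEye / InfluxDB setup.
client = InfluxDBClient(host="neteye.example.local", port=8086, database="telegraf")

# Fetch the last two weeks of CPU usage for one monitored host
# (measurement and tag names are assumptions).
query = (
    "SELECT mean(\"usage_user\") AS cpu_user "
    "FROM \"cpu\" WHERE \"host\" = 'ax-aos-01' "
    "AND time > now() - 14d GROUP BY time(5s)"
)
points = client.query(query).get_points()

# Organize the result as a Pandas time series for later processing.
df = pd.DataFrame(points).set_index("time")
print(df.head())
```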
How do we manage the data? With Python. It's easy to use and has a rich ecosystem of packages that let us handle the scenarios described above. We have several uncorrelated metrics, i.e. a multi-dimensional space: with NumPy we can create and work with tensors, with Pandas we can organize these tensors into time series, and with scikit-learn we can process the data and do machine learning.
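A minimal sketch of this toolchain, with metric names and values invented purely for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Invented example: three uncorrelated metrics sampled every 5 seconds.
index = pd.date_range("2019-03-14", periods=1000, freq="5S")
metrics = np.random.rand(1000, 3)            # a small tensor: samples x metrics

# Pandas turns the tensor into a labeled time series.
df = pd.DataFrame(metrics, index=index, columns=["cpu", "memory", "disk_io"])

# scikit-learn works directly on the underlying NumPy array,
# e.g. to normalize the features before machine learning.
X = StandardScaler().fit_transform(df.values)
print(X.shape)
```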
This slide aims to summarize what machine learning is. Machine learning algorithms are often categorized as supervised or unsupervised. Supervised machine learning algorithms apply what has been learned in the past to new data, using labeled examples to predict future events. Starting from the analysis of a known training dataset, the learning algorithm produces an inferred function to make predictions about the output values. After sufficient training, the system is able to provide targets for any new input. The learning algorithm can also compare its output with the correct, intended output and find errors in order to adjust the model accordingly.
In contrast, unsupervised machine learning algorithms are used when the training data is neither classified nor labeled. Unsupervised learning studies how systems can infer a function that describes a hidden structure in unlabeled data. The system does not determine the right output; instead, it explores the data and draws inferences that describe the hidden structures it finds.
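To make the distinction concrete, here is a minimal sketch with toy data (the arrays are invented): a supervised classifier learns from labeled examples, while an unsupervised algorithm only sees the raw samples:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Toy data: 4 samples with 2 features each.
X = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]])
y = np.array([0, 0, 1, 1])                  # labels: 0 = normal, 1 = abnormal

# Supervised: the classifier learns the mapping from X to the given labels.
clf = LogisticRegression().fit(X, y)
print(clf.predict([[0.85, 0.8]]))           # predicts the learned class

# Unsupervised: no labels, the algorithm discovers structure on its own.
km = KMeans(n_clusters=2, n_init=10).fit(X)
print(km.labels_)                           # clusters inferred from the data
```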
Our colleague Susanne has devised an anomaly detection model. It processes data host by host, taking data samples from the last two weeks with a sampling period of 5 seconds. As the machine learning algorithm, Susanne chose an unsupervised one, Isolation Forest, because the samples are not labeled as normal or abnormal. To turn the algorithm into a classifier with a separating border, there is a daily training phase. The output, as you can see on the slide, is a score between -0.5 and +0.5.
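This is not Susanne's actual code, but a minimal sketch of how such a model might look with scikit-learn's IsolationForest. The data loading is assumed; in scikit-learn, decision_function returns an anomaly score roughly in the range -0.5 to +0.5, where lower means more anomalous:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Assume X holds the last two weeks of metrics for one host,
# sampled every 5 seconds (rows = samples, columns = metrics).
rng = np.random.RandomState(42)
X = rng.normal(size=(1000, 4))              # placeholder for real host data

# Daily training phase: refit the model on the rolling two-week window.
model = IsolationForest(contamination=0.01, random_state=42).fit(X)

# decision_function yields a score roughly between -0.5 and +0.5;
# the lower the score, the more anomalous the sample.
scores = model.decision_function(X)
anomalies = scores < 0
print(scores.min(), scores.max(), anomalies.sum())
```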
This is the model output. The orange bars indicate periods in which an anomaly is more likely to occur. The graphs can be expanded or collapsed in order to drill down into a specific anomaly in a specific period.
As you can see, it is possible to analyze the root cause.
Francesco has sketched a possible next step: an anomaly prediction model. It would consider the whole network and process data from all of its hosts. Principal Component Analysis (PCA) would extract the most relevant features for analysis: it reduces the dimensionality of the data, and with it the amount of data to manage, the computing power required, and the size of the problem to analyze. In this case the algorithm would be supervised, with samples labeled normal or abnormal according to a cause-and-effect logic: if we observe a certain pattern now, we expect a certain status in the future. On this basis we would train a classifier to predict anomalies.
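A minimal sketch of such a pipeline, assuming the network-wide metrics and their labels are already available as arrays (the shapes, the placeholder data, and the choice of classifier are all assumptions for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

# Assume X holds metrics from the whole network (rows = time windows,
# columns = metrics from all hosts) and y the labels (0 = normal, 1 = abnormal).
rng = np.random.RandomState(0)
X = rng.normal(size=(500, 40))              # placeholder network-wide features
y = rng.randint(0, 2, size=500)             # placeholder labels

# PCA reduces dimensionality; the classifier then learns to predict anomalies.
model = make_pipeline(PCA(n_components=10),
                      RandomForestClassifier(n_estimators=100))
model.fit(X, y)

print(model.predict(X[:5]))                 # predicted status for new patterns
```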
But how do we label the data samples? We have several networks, and training a classifier requires a large number of labeled samples. So, how can we get labeled samples automatically?
Alyvix can be an effective solution. It measures the performance of individual transactions and of the whole flow, so we know precisely whether a flow is fast or slow, and whether there are slowdowns or breakdowns. In short, Alyvix allows us to assign a digital label.
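As a rough illustration, assuming the Alyvix transaction times are available as a Pandas series, the labeling could be as simple as applying a threshold. The threshold value, metric name, and sample data below are invented, not Alyvix output:

```python
import pandas as pd

# Hypothetical Alyvix transaction times in seconds per monitored period.
transactions = pd.Series(
    [1.2, 1.4, 5.8, 1.3, float("nan")],      # NaN = transaction broke (unavailable)
    index=pd.date_range("2019-03-14", periods=5, freq="5T"),
    name="google_search_seconds",
)

# Label a period as abnormal if the transaction is too slow or failed entirely.
SLOW_THRESHOLD = 3.0                          # assumed threshold, not from Alyvix
labels = (transactions > SLOW_THRESHOLD) | transactions.isna()
print(labels.astype(int))                     # 1 = abnormal, 0 = normal
```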
It is 'Visual' because Alyvix looks at graphical interfaces: if you can see something on your screen, Alyvix can see it too. It is 'Synthetic' because Alyvix behaves like a human user: if you can synthesize something (e.g. a musical instrument, a vitamin), it means you can reproduce it artificially, and that is exactly what Alyvix does by synthesizing graphical application states and the way to interact with them. Finally, it is a 'Monitoring' system because Alyvix (properly integrated with Icinga) keeps track of the performance measures of each application transaction in a given user interaction flow.
Alyvix provides GUI tools to design any application transaction, in terms of both its graphical aspects and its interaction modes. At its core, Alyvix relies on the following open source stack: Python as the programming language, Robot Framework for desktop automation, OpenCV and Pillow for image processing, Tesseract OCR for text recognition, and PyQt for GUI programming.
At the end of the day, what we would like to do is completely translate an entire user interaction flow. By synthesizing user transaction flows we obtain Alyvix keyword flows. In practice, what we get is a list of keywords in the Alyvix editor, i.e. an executable test case. Under the hood, keywords are Python methods within a Python module, the so-called AlyvixProxies of the test case.
Here is an example with a web service: getting results for a Google search. Alyvix opens a browser pointing at Google, then continuously tries to detect the target object. When the object appears on screen, Alyvix records the time elapsed and interacts with it, and this repeats until the end of the test case. The important thing to highlight is that the Alyvix engine is designed to output net performance: image processing, detection, and interaction times are excluded from the measurement. In other words, it has a precise and accurate measurement engine.
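This is not Alyvix's actual engine code, but a simplified sketch of the idea of net timing: the wait for the object to appear is measured, while the time spent on the detection itself is subtracted. The detect_on_screen helper and the simulated timings are entirely hypothetical:

```python
import time

APPEAR_AFTER = 3.0                            # simulated: object "appears" after 3 s

def detect_on_screen(target, started_at):
    """Hypothetical stand-in for image-based detection of a GUI object."""
    time.sleep(0.1)                           # pretend image processing takes 100 ms
    return time.perf_counter() - started_at >= APPEAR_AFTER

def measure_net_appearance(target, timeout=30.0, poll_interval=0.2):
    """Return the net time until `target` appears, excluding processing overhead."""
    start = time.perf_counter()
    processing = 0.0
    while time.perf_counter() - start < timeout:
        t0 = time.perf_counter()
        found = detect_on_screen(target, start)   # screenshot + template matching
        processing += time.perf_counter() - t0
        if found:
            # Net performance: elapsed wall time minus the detection overhead.
            return (time.perf_counter() - start) - processing
        time.sleep(poll_interval)
    raise TimeoutError(f"{target!r} did not appear within {timeout} seconds")

print(measure_net_appearance("google_search_results"))
```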
The goals of synthetic monitoring are: to check the availability of all defined transactions of a given use case (carrying out a task with any kind of application through its GUI), and to measure the response times of all defined transactions, until one of them breaks (i.e. it becomes unavailable because it was never 'painted' on screen).
This is an example of the final result. We can detect service downtimes and latency spikes, and we can assess the quality of the service from the end user's point of view. We can therefore use our synthetic data to assign a digital label and, consequently, to train our machine learning algorithm.
We would integrate synthetic users into the Dynamics AX network.
Putting the previous slides together, this would be the architecture of the anomaly prediction model.