1. ML Robustness in VEDLIoT
António Casimiro
University of Lisbon
HiPEAC 2022
Budapest, 20 June 2022
2. 2
Robustness and safety service
Local monitoring of input data correctness
Check characteristics of input features (general data)
Build ECDF of training data
Compare with ECDF of input data
Find outliers/drifts in data values (time series data)
Store input data (time series values)
Forecast next expected input value
Compare forecast value with received input value
Remote monitoring of model correctness
Replicated execution using the same input on a trusted model (not for time series data)
Periodically send input/output data to a remote node
Run the trusted model with the received input data
Compare outputs of local and remote models
3. 3
Domain monitoring using statistical methods
Being aware of a context change:
Measure the statistical distance between the training data distribution and the real-world data distribution
If there is a large difference between the distributions -> do not trust the ML output
Calculate the empirical cumulative distribution function (ECDF) by:
1. Ordering all unique observations in the data sample
2. Calculating the cumulative probability for each as the number of observations less than or equal to a given observation, divided by the total number of observations:
ECDF(x) = (number of observations <= x) / (total number of observations)
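The two steps above can be sketched in plain Python (an illustrative helper, not VEDLIoT code):

```python
def ecdf(sample):
    """Empirical cumulative distribution function.
    Step 1: order the unique observations.
    Step 2: for each, the cumulative probability is the number of
    observations <= that value, divided by the total count."""
    n = len(sample)
    xs = sorted(set(sample))                        # step 1
    ys = [sum(v <= x for v in sample) / n for x in xs]  # step 2
    return xs, ys

xs, ys = ecdf([3, 1, 2, 2, 5])
print(xs)  # [1, 2, 3, 5]
print(ys)  # [0.2, 0.6, 0.8, 1.0]
```

The same function is applied to a training sample and to a real-world sample, and the two resulting ECDFs are then compared.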
4. 4
In practice
Take 30 images from the training set and 30 images from the real world
Do that for images of a specific class, based on the classifier prediction
For each image, take the first pixel in the left corner
Consider one RGB color channel at a time
Calculate a first ECDF on the values of the 30 left-corner pixels from the training set
Do the same for the real-world images -> second ECDF
Statistical distance between data distributions
Apply a statistical distance and save the value
Do that for all the pixels and for the three color channels
Average the distances per color channel and compare the distances with a given threshold
Limitations
Too constrained by the color
Differences in brightness can fool the method
Images should be well aligned for proper comparison
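The pixel-wise procedure above can be sketched as follows. The slides do not name the statistical distance, so this sketch assumes the Kolmogorov-Smirnov distance (the maximum gap between two ECDFs); the function names and the 30-image batches are illustrative:

```python
import numpy as np

def ks_distance(a, b):
    """Kolmogorov-Smirnov distance: maximum gap between the ECDFs
    of the two samples (assumed distance; the slides leave it open)."""
    grid = np.union1d(a, b)
    ecdf_a = np.searchsorted(np.sort(a), grid, side="right") / len(a)
    ecdf_b = np.searchsorted(np.sort(b), grid, side="right") / len(b)
    return np.max(np.abs(ecdf_a - ecdf_b))

def channel_distances(train_imgs, world_imgs):
    """Average per-pixel distance for each RGB channel.
    Both inputs have shape (30, H, W, 3): 30 images per side."""
    h, w = train_imgs.shape[1:3]
    dists = np.zeros(3)
    for c in range(3):                      # one color channel at a time
        for i in range(h):
            for j in range(w):
                dists[c] += ks_distance(train_imgs[:, i, j, c],
                                        world_imgs[:, i, j, c])
    return dists / (h * w)                  # average over all pixels

# a context change would be flagged when a channel's average distance
# exceeds a chosen threshold
```

This makes the listed limitations concrete: the comparison is per pixel position and per channel, so misaligned images or a global brightness shift change every pixel's distribution at once.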
5. 5
Specific models for different environmental conditions
Can they perform better than a single generic one?
Experiment with a CNN object detector (YOLOv4)
Two different conditions: daytime and night
Driving dataset with 10 classes (pedestrian, rider, car, …)
Domain adaptation through split approach
Training three models:
Daytime model, only objects during daytime
Night model, only objects during night
Daytime and night model, both previous ones
Testing models:
On daytime images only
On night images only
Training sets: daytime images: 27,967; night images: 27,967; daytime + night images: 55,934
Test sets: daytime images: 3,929; night images: 3,929
6. 6
Object detector predicts:
Location of object (coordinates)
Class (e.g., a dog)
Confidence score (a value from 0 to 1)
Confidence score measures the confidence on:
Localization
How likely the box contains an object
How accurate is the box -> IoU
Classification
Precision
«When it guesses, how often does it guess correctly?»
Recall
«Has it guessed every time that it should have guessed?»
Performance metric
Confidence threshold
Positive detection if confidence score > threshold
Strict threshold -> less recall
Precision-recall (PR) curve
Shows trade-off precision/recall for varying threshold
mean Average Precision (mAP)
Summarizes such a plot over all classes
mAP@0.5: mAP at an IoU threshold of 50%
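As a minimal illustration of these metrics (the helper functions are mine, not from the talk), here is the IoU of two boxes and precision/recall at a given confidence threshold:

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def precision_recall(detections, n_ground_truth, threshold):
    """detections: (confidence, is_true_positive) pairs.
    A detection counts as positive only if confidence > threshold."""
    kept = [tp for conf, tp in detections if conf > threshold]
    tp = sum(kept)
    precision = tp / len(kept) if kept else 1.0  # convention for no detections
    recall = tp / n_ground_truth
    return precision, recall

dets = [(0.9, True), (0.8, True), (0.6, False), (0.4, True)]
print(precision_recall(dets, n_ground_truth=4, threshold=0.5))  # approx (0.667, 0.5)
```

Raising the threshold to 0.85 keeps only one detection: precision becomes 1.0 but recall drops to 0.25; that is the trade-off the PR curve plots, and mAP summarizes.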
7. 7
Results
The day-and-night model performs slightly better in both conditions
It was trained on twice as many images as the other two models
It "saw" more objects during training
Better to use a single model and train it on as much data as possible
Then monitor the output
Other ways to improve robustness and ensure correctness of the output:
Adversarial training
Uncertainty quantification
Explainability methods
Statistical distance between data distributions
Performance (mAP@0.5)
                      Day test set   Night test set
Day model             48.59%         —
Night model           —              47.82%
Day and night model   50.58%         49.90%
8. 8
Input monitoring (time series data)
Framework with an offline part (model training) and an online part (error detection in input data)
Detected errors (due to sensor or communication faults):
Omissions
Outliers
Drifts
[Diagram: the offline phase generates models (Self, Neighbors, Self + Neighbors) that the online phase uses]
9. 9
Offline training phase
Multiple MLP models are trained to forecast the next data value to be received by a target sensor:
A model using only past data from the target sensor
A model using data from the target sensor and from neighbor sensors (if they provide correlated data)
A model using only data from neighbor sensors (if they provide correlated data)
This approach makes it possible to distinguish real events (with impact on multiple sensors) from outliers (affecting only the target sensor)
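The distinction the slide describes can be sketched as a simple rule. This is illustrative: in the framework the two forecasts would come from the trained MLP models, and the tolerance is a hypothetical parameter:

```python
def classify_reading(value, self_forecast, neighbor_forecast, tol):
    """Distinguish a real event (neighbors see it too) from an outlier
    (only the target sensor deviates).

    value: new reading from the target sensor
    self_forecast: forecast from the model using the target's past data
    neighbor_forecast: forecast from the model using neighbor sensors
    tol: maximum acceptable forecast error
    """
    deviates_self = abs(value - self_forecast) > tol
    deviates_neighbors = abs(value - neighbor_forecast) > tol
    if not deviates_self:
        return "normal"
    if deviates_neighbors:
        # neighbor data does not support the reading -> local fault
        return "outlier"
    # the neighbor-based forecast agrees with the reading -> shared event
    return "event"
```

For example, a reading of 50.0 when both forecasts predict around 10.0 is classed as an outlier, while the same reading with a neighbor-based forecast of 49.8 is classed as a real event.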
10. 10
Online error detection
Sensor values go through the monitoring service
Omissions are detected using timers configured for periodic data
Sensor data (from multiple sensors) is stored and properly aligned to feed each model input
Using multiple forecasts (from running the multiple models), outliers can be detected
The service can also replace outliers for quality assurance
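Omission detection with timers can be sketched as a minimal monitor (the period and slack values are hypothetical; timestamps are passed in explicitly so the logic stays testable):

```python
class OmissionMonitor:
    """Flags an omission when a periodic sensor fails to deliver a new
    reading before its deadline (period plus some slack)."""

    def __init__(self, period, slack=0.1):
        self.period = period
        self.slack = slack
        self.deadline = None

    def on_reading(self, now):
        # each reading re-arms the timer for the next expected sample
        self.deadline = now + self.period + self.slack

    def omission(self, now):
        # true once the deadline for the next periodic reading has passed
        return self.deadline is not None and now > self.deadline

m = OmissionMonitor(period=1.0)
m.on_reading(0.0)
print(m.omission(1.0))  # False: still within period + slack
print(m.omission(1.2))  # True: the expected reading was omitted
```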
12. 12
Other ongoing work
Monitoring output through explainability
Provide more evidence to assess output correctness
Explainability methods are used to evaluate the contribution of each input pixel to the output
Monitoring model correctness
Periodically send images to a redundant, trustworthy remote model for comparison of results
Safety requirements on the architectural framework
Safety is one of the clusters of concern, addressed at several levels of abstraction
13. 13
Conclusion
Robustness and safety are important concerns, being addressed from several perspectives
• Monitoring methods for input and output data quality
• Monitoring methods for checking model correctness
• Architecting for safety
This proved useful for the quality block, as it made it possible to correlate events between sensors and distinguish environmental events from sensor errors. However, these models are only used when more than one sensor is available.
Intro
The integration of safety-critical systems prompts deep reflection on how to guarantee safety and protect society from accidents.
In contrast to typical software, ML control flows are specified by inscrutable weights, and models are trained and tested pointwise using specific cases, which has limited effectiveness at improving and assessing an ML system's completeness and coverage.
They rarely handle all test cases (uncertainty) and are susceptible to small changes in input (adversarial examples), even when they give us a high confidence score for a prediction (a score produced by the model that expresses how confident it is in the prediction's correctness).
Flowchart
--- off-line phase ---
1) Select the most suitable dataset for our goal. It is assumed that a trusted dataset is used to train, validate, and test the model, so it is correctly labelled and the images per class are balanced. Data augmentation techniques can be used to increase the amount of data.
2) Apply the Ranger technique to prevent the propagation of hardware transient faults (e.g., bit flips), which consists of applying limits to the ranges of the output values of the neural network's activation functions.
3) Train the model.
4) Generate adversarial examples to increase the adversarial robustness by retraining the model with these examples.
5) Evaluate the model's performance. If we consider it satisfactory, we can use it in the autonomous system. Otherwise it will be necessary to change the model or refine the data, and then repeat the training process.
6) Use explainability methods on the model and the test set to evaluate the contribution of each input feature to the output, assigning an importance score to each individual feature (each pixel).
7) Use this data to train a second model, with the same architecture as the first, which should give us more information on the output of the main model. Also in this case we evaluate the performance of the model, and if it is not satisfactory we can consider adding further data.
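Step 2 (Ranger) can be illustrated as a simple clamp. The bounds [0, 6] are a made-up example of what profiling fault-free runs might yield for one layer:

```python
import numpy as np

def ranger_clip(activations, low, high):
    """Restrict activation outputs to the value range observed during
    profiling; a value corrupted by a transient fault (e.g., a bit flip
    producing a huge number) is clamped back into range and cannot
    propagate through the network."""
    return np.clip(activations, low, high)

corrupted = np.array([0.3, 2.7e38, 1.1])  # one value hit by a bit flip
print(ranger_clip(corrupted, 0.0, 6.0).tolist())  # [0.3, 6.0, 1.1]
```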
--- on-line phase ---
8) Here the main model is monitored during its real-time operation. Real-world data is fed to the system, and the model returns predictions with an associated confidence score.
9) Each input will be passed to both models and the outputs for this input will be compared.
10.1) If both gave the same result and the estimated uncertainty is below a certain threshold, then we can trust the result of the main model and use this prediction for the subsequent operations of the system.
10.2) If the uncertainty is not above the threshold but the two models give different results, then the result of the second model is kept if its confidence score is high enough; otherwise we must pass control to the human.
10.3) If the uncertainty is high, the comparison between the two outputs is not taken into account, and in this case too control is given to the human.
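Steps 10.1-10.3 amount to the following decision rule (a sketch; the function and threshold names are mine, and the confidence score is assumed to be the second model's):

```python
def decide(main_pred, second_pred, uncertainty, confidence,
           unc_threshold, conf_threshold):
    """Online decision logic for the monitored model (steps 10.1-10.3)."""
    if uncertainty > unc_threshold:
        # 10.3: uncertainty too high -> ignore the comparison, hand over
        return "human"
    if main_pred == second_pred:
        # 10.1: models agree and uncertainty is low -> trust the main model
        return main_pred
    # 10.2: models disagree -> keep the second model's result only if
    # its confidence score is high enough
    return second_pred if confidence >= conf_threshold else "human"
```

For example, with an uncertainty threshold of 0.5 and a confidence threshold of 0.8, agreement at low uncertainty returns the main prediction, disagreement with a confident second model returns the second prediction, and everything else defers to the human.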