Seminar for the Manufacturing Analytics Group on my PhD thesis: Bayesian autoencoders
The three main contributions are:
1. Probabilistic formulation of the autoencoder, focusing on the likelihood and the need for a bottleneck
2. Uncertainty quantification for anomaly detection
3. Explainability for anomaly detection
1. Bayesian Autoencoders for anomaly detection in industrial environments:
Formulation & design, uncertainty quantification, and explainability
Bang Xiang Yong
2. Outline
1. Introduction & Background
• What is anomaly detection?
• What is an autoencoder?
• Problems with autoencoders
2. Contributions
• Probabilistic formulation and design of Bayesian Autoencoder
• Uncertainty quantification in anomaly detection
• Sensor explainability
3. Conclusion
3. Introduction
1. Anomaly / Outlier / Novelty / Out of distribution (OOD) detection, One-class classification
• Abundance of “healthy” or inlier data
• Want to detect data arising from "non-healthy" / anomalous conditions
2. Easy to determine outliers in low dimension (e.g. multivariate Gaussian, ± 2 std. dev)
• But not so in high-dimensional data (e.g. hundreds / thousands of dimensions)
Each data point / snapshot in time contains:
• K sensors
• D measurements
• Total = K x D features (e.g. 11 x 2000)
[Figure: example K sensors x D measurements data matrix; green/blue = healthy / inliers / in-distribution, red/orange = anomalies / OOD]
4. Datasets: Quality prediction (STRATH) and condition monitoring (ZEMA)
• Quality prediction (STRATH), Advanced Forming Research Centre (AFRC) Radial Forge (£2.3m):
  - Inputs: 98 sensors x 5620 features (measured during radial forging processes)
  - Outputs: measurement dimensions of a forged part (within / out of tolerance)
• Condition monitoring (ZEMA):
  - Inputs: 11 sensors x 2000 features; Outputs: degradation (0-100%), with <25% treated as healthy
  - Inputs: 17 sensors x (60-6000) features; Outputs: fault diagnosis of subsystems (accumulator, cooler, valve, pump)
5. Background
1. Neural networks are an attractive option:
• Flexible and scalable
• Able to handle any data type (image, text, audio, tabular, graphs, etc.)
• Rapid advancement in specialised hardware and software
2. Unsupervised learning is preferred over supervised learning:
• Lack of labelled data
• Data imbalance
3. The autoencoder is an example of an unsupervised neural network
• Requires only data labelled as inliers for training
6. What is an Autoencoder?
[Figure: autoencoder architecture with encoder, bottleneck layer (compression), and decoder; process measurements go in, reconstructed measurements come out. Signals from the training distribution are reconstructed well, while signals from an anomalous distribution are reconstructed poorly.]
Reconstruction error $= \lVert \mathbf{x} - \hat{\mathbf{x}} \rVert^2$, where $\hat{\mathbf{x}}$ is the reconstruction of the input $\mathbf{x}$.
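As a concrete illustration, here is a minimal sketch in PyTorch of a plain autoencoder and its reconstruction-error anomaly score. The layer sizes, class names, and the choice of mean squared error are illustrative assumptions, not the exact architecture used in the thesis.

```python
# Minimal sketch of reconstruction-error anomaly scoring with a plain autoencoder.
# Layer sizes and names are illustrative assumptions (input_dim = 11 x 2000 flattened).
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=22000, latent_dim=100):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

def anomaly_score(model, x):
    """Mean squared reconstruction error per sample: higher = more anomalous."""
    with torch.no_grad():
        x_hat = model(x)
    return ((x - x_hat) ** 2).mean(dim=-1)
```

Training minimises the same reconstruction error on inlier data only; at test time, samples whose score is much larger than the scores seen on inliers are flagged as anomalies.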
7. Several trust issues with AEs in high-stakes applications:
1. Lack of a sound theoretical ground for analysis
• Overreliance on analogies
• Why should anomalies have larger reconstruction errors?
• Is there a framework which explains what the AE is doing?
• Do we need a bottleneck?
2. Lack of uncertainty quantification
• Given a prediction, how uncertain is it?
3. Lack of explainability
• Given a prediction, which sensors are relevant?
8. Contribution 1: Formulation and design of BAE for anomaly detection
1. Likelihood (reconstruction loss):
• Isotropic Gaussian with variance 1 => mean squared error
• Bernoulli likelihood => binary cross-entropy
(*recommended in the PyTorch documentation for AE loss)
2. Prior (regulariser):
• Isotropic Gaussian distribution => L2 regularisation
3. Sample from the posterior over parameters due to intractability:
• Bayesian ensembling
• MC Dropout
• Bayes by backprop (variational inference)
• Markov chain Monte Carlo (MCMC)
4. Output of the training procedure:
• M posterior samples of AE parameters
Training (Bayes rule): $p(\theta \mid \mathcal{D}) \propto p(\mathcal{D} \mid \theta)\, p(\theta)$, with the Gaussian log-likelihood $\log p(\mathbf{x} \mid \theta) \propto -\lVert \mathbf{x} - \hat{\mathbf{x}}_{\theta} \rVert^2$.
Prediction for new data $\mathbf{x}^*$, using the $M$ posterior samples $\{\theta_m\}_{m=1}^{M}$:
• Posterior mean log-likelihood: $\mathbb{E}[\log p(\mathbf{x}^*)] \approx \frac{1}{M}\sum_{m=1}^{M} \log p(\mathbf{x}^* \mid \theta_m)$
• Posterior variance of log-likelihood: $\mathrm{Var}[\log p(\mathbf{x}^*)] \approx \frac{1}{M}\sum_{m=1}^{M} \big( \log p(\mathbf{x}^* \mid \theta_m) - \mathbb{E}[\log p(\mathbf{x}^*)] \big)^2$
The BAE is a parametric density estimation model.
Evaluation:
(1) AUROC > 0.5: better than random guessing
(2) GSS: geometric mean of sensitivity and specificity
[Figure: anomaly-score distributions; blue = inliers, orange = anomalies]
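The training-and-prediction workflow above can be sketched with the Bayesian ensembling option, reusing the hypothetical Autoencoder class from the earlier sketch. The helper names, ensemble size, optimiser settings, and the use of weight decay as the isotropic Gaussian prior are illustrative assumptions, not the baetorch implementation.

```python
# Sketch of BAE scoring via Bayesian ensembling: each ensemble member is one
# posterior sample of the AE parameters; the mean NLL over members is the anomaly
# score, and the variance of the NLL is its uncertainty.
import torch

def gaussian_nll(x, x_hat):
    # Isotropic Gaussian with variance 1 => NLL proportional to squared error
    return ((x - x_hat) ** 2).mean(dim=-1)

def train_bae_ensemble(make_model, train_loader, n_models=5, epochs=50, lr=1e-3):
    models = [make_model() for _ in range(n_models)]   # M posterior samples
    for model in models:
        # weight_decay acts as the isotropic Gaussian prior (L2 regularisation)
        opt = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=1e-4)
        for _ in range(epochs):
            for x in train_loader:                      # inlier data only
                loss = gaussian_nll(x, model(x)).mean()
                opt.zero_grad()
                loss.backward()
                opt.step()
    return models

def bae_predict(models, x):
    with torch.no_grad():
        nll = torch.stack([gaussian_nll(x, m(x)) for m in models])  # [M, batch]
    return nll.mean(dim=0), nll.var(dim=0)  # anomaly score, uncertainty
```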
9. Recent papers (e.g. Nalisnick et al., "Do Deep Generative Models Know What They Don't Know?" ICLR 2019)
showed that the likelihood of deep generative models, including autoencoders, may not be reliable for OOD detection (anomaly detection).
Example : FashionMNIST (in-distribution) vs MNIST (anomaly).
B. X. Yong, T. Pearce and A. Brintrup, "Bayesian Autoencoders: Analysing and Fixing the Bernoulli likelihood for
Out-of-Distribution Detection," ICML 2020 Workshop on Uncertainty and Robustness in Deep Learning.
1. Surprising, since the reconstructed outputs look very different from the inputs for anomalies. Is the culprit the choice of likelihood?
2. We find this is due to confoundedness: the Bernoulli likelihood assigns higher maximum values to inputs close to 0 (MNIST).
3. Propose two fixes:
i) Uncertainty of the log-likelihood
ii) Use another likelihood (e.g. Gaussian)
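A small numerical check (illustrative, using NumPy) makes the confound concrete: even for a perfect reconstruction, the maximum attainable Bernoulli log-likelihood depends on the input value itself, favouring images dominated by pixels near 0 or 1, such as MNIST's black background.

```python
# Even with a *perfect* reconstruction (x_hat == x), the maximum Bernoulli
# log-likelihood depends on the pixel value: pixels near 0 or 1 reach ~0,
# while mid-grey pixels are capped at -log(2). Darker images therefore score
# higher regardless of reconstruction quality.
import numpy as np

def bernoulli_loglik(x, x_hat, eps=1e-7):
    x_hat = np.clip(x_hat, eps, 1 - eps)
    return x * np.log(x_hat) + (1 - x) * np.log(1 - x_hat)

for x in [0.0, 0.1, 0.5, 0.9, 1.0]:
    print(f"pixel value {x:.1f} -> max log-likelihood {bernoulli_loglik(x, x):.3f}")
# e.g. pixel value 0.0 -> max log-likelihood -0.000
#      pixel value 0.5 -> max log-likelihood -0.693  (= -log 2)
```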
10. Does the AE need a bottleneck?
Most work describes the need for a bottleneck in AEs analogically:
• Did not compare against AEs without a bottleneck
1. Autoencoders have the objective of dimensionality reduction and do not target anomaly detection directly.
2. The main difficulty of applying autoencoders for anomaly detection lies in choosing the right degree of compression, i.e. dimensionality reduction.
3. If there were no compression, an autoencoder would just learn the identity function.
Ruff et al. “Deep One-Class Classification”. ICML (2018)
“This (bottleneck) ensures that only useful features are learned by the autoencoder, instead of merely copying the input data for
reconstructing the output”.
Chow et al. “Anomaly detection of defects on concrete structures with the convolutional autoencoder”. Advanced Engineering
Informatics 45. (2020)
“ The identity function seems a particularly trivial function to be trying to learn; but by placing constraints on the
network..” – http://ufldl.stanford.edu/tutorial/unsupervised/Autoencoders/
11. Stop strangling the AE!
Two ways to eliminate the bottleneck:
• Increase the latent dimension > input dimension (overcomplete latent layer)
• Introduce skip connections (à la U-Net)
Viewing the AE as a density estimation model:
• Will it benefit from higher capacity and better architectures?
[Figure: undercomplete AE (encoder, bottleneck layer, decoder) vs. bottleneck-free AE (encoder, wide overcomplete latent layer, decoder, skip connections)]
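A minimal sketch of a bottleneck-free AE combining both options; the layer widths, class name, and single skip connection are illustrative assumptions rather than the architecture used in the thesis.

```python
# Sketch of a bottleneck-free autoencoder: an overcomplete latent layer
# (latent_dim > input_dim) plus a U-Net-style skip connection from the encoder
# to the decoder.
import torch
import torch.nn as nn

class SkipAutoencoder(nn.Module):
    def __init__(self, input_dim=100, latent_dim=500):
        super().__init__()
        assert latent_dim > input_dim, "overcomplete: latent wider than input"
        self.enc1 = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Linear(256, latent_dim), nn.ReLU())
        self.dec1 = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU())
        self.dec2 = nn.Linear(256, input_dim)

    def forward(self, x):
        h1 = self.enc1(x)        # encoder hidden layer
        z = self.enc2(h1)        # wide (overcomplete) latent layer
        d1 = self.dec1(z) + h1   # skip connection from encoder to decoder
        return self.dec2(d1)
```

Even with this extra capacity to copy the input, the following slides report that the identity function was not learnt in practice.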
12. Visualisation on synthetic toy data
Encoder: 2-50-50-50-z, with z = 1 or z = 100
[Figure: estimated likelihood over the 2D input space; brighter region = higher likelihood]
Takeaway:
• The identity function was not learnt, despite not having a bottleneck
13. Another visualisation example:
• 1D density estimation
• Infinitely wide AE via a Gaussian process (either the NNGP or NTK method)
14. Takeaways:
1. AUROC >> 0.5 with or without a bottleneck
• The identity function is not learnt!
2. Removing the bottleneck may improve performance
• Allows a wider architecture search
Application to condition monitoring & quality inspection:
[Results: ZEMA tasks and STRATH tasks]
15. Contribution 2: Uncertainty in anomaly detection
Although we have used the BAE for predicting anomalies:
• We still lack a way to express uncertainty as a form of predictive confidence
[Diagram: input vector -> BAE -> anomaly score (is it an inlier/anomaly?) and anomaly uncertainty (can we trust the prediction? / does the BAE know?)]
What can we do with uncertainty?
1. Uncertainty as an indicator of predictive error -> useful when ground truth is not available
2. Filter away predictions with high uncertainty -> retain high-quality, accurate predictions
3. Handle uncertain predictions as exceptions -> refer them for further inspection (e.g. manually by a human / equipment)
Evaluate as anomaly detection + reject option:
(1) classify as inlier / (2) classify as anomaly / (3) reject, i.e. "don't know"
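A minimal sketch of the reject-option evaluation, assuming anomaly scores and uncertainties such as those returned by the hypothetical bae_predict above; the rejection fraction and the scikit-learn AUROC metric are illustrative choices.

```python
# Reject the fraction of test samples with the highest uncertainty, then score
# the remaining predictions. Arrays are 1D over test samples; labels: 1 = anomaly.
import numpy as np
from sklearn.metrics import roc_auc_score

def detect_with_reject(scores, uncertainties, labels, reject_frac=0.4):
    keep = uncertainties <= np.quantile(uncertainties, 1.0 - reject_frac)
    auroc_all = roc_auc_score(labels, scores)          # before rejection
    auroc_kept = roc_auc_score(labels[keep], scores[keep])  # after rejection
    return auroc_all, auroc_kept, keep

# Example usage (illustrative):
# auroc_all, auroc_kept, kept_mask = detect_with_reject(mean_nll, var_nll, y_true)
```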
16. [Figure: density of anomaly scores (blue = inliers, orange = anomalies), before and after rejecting the 40% of predictions with the highest uncertainties; AUROC improves from 0.93 to 1.0]
19. Result on ZEMA dataset (condition monitoring)
1. Uncertainty indicates predictive error
2. Rejecting uncertain predictions leads to higher accuracy
20. Takeaways:
• Not all uncertainties are the same!
• U-exceed is prone to overconfidence
• Combining epistemic and aleatoric uncertainties is better than having either one alone
21. Contribution 3: Explainable predictions for BAE
Given that the BAE predicts a sample as OOD, can we tell which sensors are important?
1. Formulation of sensor attribution methods for the BAE.
• No need for post-hoc explanation models.
• Naturally available to the BAE.
• Under an independent likelihood assumption, we can decompose the BAE predictions (both mean and variance) into importance scores for the sensor inputs.
Bang Xiang Yong, Alexandra Brintrup. "Coalitional Bayesian Autoencoders: Towards explainable unsupervised deep learning". Submitted to Applied Soft Computing, 2021.
[Figure: example explanation for condition monitoring]
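Under the independent (per-dimension) likelihood assumption, the decomposition can be sketched as below; the input layout [batch, K sensors, D features] and the reuse of the hypothetical ensemble from the earlier sketch are assumptions.

```python
# Because the Gaussian NLL factorises over input dimensions, the BAE's mean and
# variance of the NLL can be summed over each sensor's D features to give
# per-sensor importance scores.
import torch

def sensor_attributions(models, x):
    """x: tensor [batch, K, D]. Returns per-sensor mean NLL and its variance."""
    with torch.no_grad():
        # Per-dimension squared error = per-dimension Gaussian NLL (up to constants)
        nll = torch.stack([(x - m(x.flatten(1)).view_as(x)) ** 2 for m in models])
    per_sensor = nll.sum(dim=-1)  # sum over the D features of each sensor: [M, batch, K]
    return per_sensor.mean(dim=0), per_sensor.var(dim=0)  # each [batch, K]
```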
22. Contribution 3: Explainable predictions for BAE
2. Development of the "Coalitional BAE" to improve explanation quality.
• Centralised BAE: one BAE for all sensors.
• Coalitional BAE: one BAE for each sensor.
(Enforces sensor independence in the outputs)
[Diagram: Centralised BAE vs. Coalitional BAE, each with M samples of the BAE over K sensors with D features per sensor; both variants output per-sensor NLL scores.]
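A minimal sketch of the Coalitional BAE, reusing the hypothetical train_bae_ensemble and bae_predict helpers from the earlier sketches; the per-sensor data layout and loader structure are assumptions.

```python
# One independent BAE ensemble per sensor, so the explanation score of one sensor
# cannot be confounded by another sensor's signal.
import torch

def train_coalitional_bae(make_model, sensor_loaders, n_models=5):
    """sensor_loaders: list of K data loaders, one per sensor (inlier data only)."""
    return [train_bae_ensemble(make_model, loader, n_models=n_models)
            for loader in sensor_loaders]

def coalitional_explanations(coalition, x):
    """x: tensor [batch, K, D]. Returns per-sensor NLL means and variances."""
    means, variances = [], []
    for k, models in enumerate(coalition):               # one BAE per sensor k
        mean_k, var_k = bae_predict(models, x[:, k, :])  # score only sensor k's features
        means.append(mean_k)
        variances.append(var_k)
    return torch.stack(means, dim=1), torch.stack(variances, dim=1)  # each [batch, K]
```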
23. Contribution 3: Explainable predictions for BAE
3. Finding of misleading explanations due to correlation in the Centralised BAE outputs.
• e.g. falsely explaining non-drifting sensors as covariate-shifting / departing from the training distribution.
• The Coalitional BAE's explanation quality outperforms the Centralised BAE.
[Figure, top row: explanation scores vs. machine degradation (x-axis); blue = explanations for drifting sensors, orange = explanations for non-drifting sensors. Bottom row: correlation between the explanation scores for drifting and non-drifting sensors.]
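The bottom-row check can be sketched as a simple correlation between averaged explanation scores; the array shapes and sensor index sets are illustrative assumptions.

```python
# If the explanation score of a non-drifting sensor tracks that of a drifting
# sensor, the explanation is confounded by output correlation.
import numpy as np
from scipy.stats import pearsonr

def explanation_correlation(expl_scores, drifting_idx, stable_idx):
    """expl_scores: array [n_samples, K] of per-sensor explanation scores."""
    drift = expl_scores[:, drifting_idx].mean(axis=1)   # average over drifting sensors
    stable = expl_scores[:, stable_idx].mean(axis=1)    # average over non-drifting sensors
    r, _ = pearsonr(drift, stable)
    return r  # lower for the Coalitional BAE, high for a confounded Centralised BAE
```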
25. Conclusion
Contributed improvements to the trustworthiness of AEs for anomaly detection:
1. Formulation & Design of BAE
• Choice of likelihood: avoid the Bernoulli likelihood!
• Does the AE need a bottleneck? Stop strangling the AE!
2. Quantify uncertainty in prediction of anomaly
• Probabilistic workflow captures epistemic and aleatoric uncertainties
• Rejecting uncertain predictions leads to higher overall accuracy
• Need for high quality uncertainty estimation
3. Sensor explainability
• Due to correlation in the outputs, AEs are prone to misleading explanations
• Proposed “Coalitional BAE” as a fix
If you are working on anomaly detection / one-class classification, are interested in applying the latest developments of the BAE, or would like to use baetorch, please reach out to me!