History and Present Developments in Machine Learning, by Tom Dietterich, Emeritus Professor of Computer Science at Oregon State University and Chief Scientist at BigML.
*Machine Learning School in The Netherlands 2022.
Searching for Anomalies, by Thomas Dietterich, Distinguished Professor Emeritus in the School of EECS at Oregon State University and Chief Scientist of BigML.
*MLSEV 2020: Virtual Conference.
By popular demand, here is a case study of my first Kaggle competition from about a year ago. Hope you find it useful. Thank you again to my fantastic team.
DutchMLSchool. Machine Learning: A Technical Perspective, by BigML, Inc.
Main Conference: Introduction to Machine Learning.
DutchMLSchool: 1st edition of the Machine Learning Summer School in The Netherlands.
This presentation includes a step-by-step tutorial with screen recordings for learning RapidMiner. It also includes the step-by-step procedure to use its most interesting features: Turbo Prep and Auto Model.
A brief introduction to clustering with scikit-learn. In this presentation, we provide an overview, with real examples, of how to use and optimize k-means clustering.
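As a concrete starting point, a minimal k-means run with scikit-learn might look like the following sketch; the toy 2-D data and the choice of k=2 are illustrative assumptions, not examples from the presentation.

```python
# A minimal k-means example with scikit-learn; the toy 2-D data and the
# choice of k=2 are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_  # cluster assignment for each point
```

To tune k, a common approach is to compare `km.inertia_` (the within-cluster sum of squares) across several values of k and look for an elbow.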
Finding interesting patterns in data can lead to uncovering new knowledge. New patterns that haven’t occurred before can signify events of interest. Depending on context, these can be called novelties, anomalies, outliers or events. Whatever they are called, they are interesting because they tell a story different from the norm. In this talk, we will call them anomalies. Two diverse applications of anomaly detection are detecting fraudulent credit card transactions and identifying astronomical anomalies such as solar flares.
However, there are many challenges in anomaly detection including high false positive rates and low predictive accuracy. Ensemble learning is a way of combining many algorithms or models to obtain better predictive performance. Anomaly detection is generally an unsupervised task, that is, we do not train models using labelled data. Constructing an unsupervised anomaly detection ensemble is challenging because we do not know the labels. In this talk we discuss two topics in anomaly detection. First, we introduce an anomaly detection ensemble using Item Response Theory (IRT) – a class of models used in educational psychometrics. Using IRT we construct an ensemble that can downplay noisy, non-discriminatory methods and accentuate sharper methods.
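The IRT-based ensemble described in the talk is beyond a short sketch, but the naive baseline it improves on can be shown directly: rank-normalize each detector's raw scores and average them, so detectors with very different score scales can still be combined. The detector scores below are illustrative assumptions.

```python
# Naive unsupervised anomaly ensemble: rank-normalize each detector's raw
# scores, then average. This is NOT the talk's IRT method, only a baseline.
import numpy as np

def rank_normalize(scores):
    """Map raw scores to ranks in (0, 1]; higher raw score -> higher rank."""
    ranks = scores.argsort().argsort() + 1.0  # ranks 1..n (ties by order)
    return ranks / len(scores)

def ensemble_score(score_matrix):
    """score_matrix: (n_detectors, n_points) raw anomaly scores.
    Returns one combined score per point."""
    normalized = np.vstack([rank_normalize(s) for s in score_matrix])
    return normalized.mean(axis=0)

# Two detectors on three points; point 2 looks anomalous to both detectors.
raw = np.array([[0.1, 0.2, 9.0],
                [5.0, 4.0, 50.0]])
combined = ensemble_score(raw)
```

Plain averaging treats every detector equally; the IRT approach described above instead learns per-detector discrimination, allowing it to downplay noisy, non-discriminatory members.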
Then we explore anomaly detection in computer network security. With cyber incidents and data breaches becoming increasingly common, we have seen a massive increase in computer network attacks over the years. Anomaly detection methods, even though used to detect suspicious behaviour, are criticized for high false positive rates. In addition, computer networks produce a large amount of complex data. We go through the end-to-end process of detecting anomalies in this scenario and show how we can minimize false positives and visualise anomalies developing over time.
The Power of Auto ML and How It Works, by Ivo Andreev
Automated ML is an approach that minimizes the need for data science effort by enabling domain experts to build ML models without deep knowledge of algorithms, mathematics, or programming. The mechanism works by allowing end users to simply provide data; the system automatically does the rest by determining the approach to perform the particular ML task. At first this may sound discouraging to those aiming at the “sexiest job of the 21st century” - the data scientists. However, Auto ML should be considered a democratization of ML, rather than automatic data science.
In this session we will talk about how Auto ML works, how it is implemented by Microsoft, and how it can improve the productivity of even professional data scientists.
This slide deck gives a brief overview of supervised, unsupervised, and reinforcement learning. Algorithms discussed are Naive Bayes, k-nearest neighbours, SVM, decision trees, and Markov models.
It also covers the difference between regression and classification, the difference between supervised and reinforcement learning, the iterative functioning of Markov models, and machine learning applications.
In Part II of the Anomaly Detection Series, we discuss the challenges in analyzing temporal datasets and methods for outlier analysis. We focus on single time series and discuss point-outlier and sub-sequence methods.
Leveraging Machine Learning Techniques Predictive Analytics for Knowledge Dis..., by Kevin Mader
Review the basic principles of predictive analytics.
Be exposed to some of the existing validation methodologies to test predictive models.
Understand how to incorporate radiology data sources (PACS, RIS, etc) into predictive modeling
Learn how to interpret results and make visualizations.
May 2015 talk to SW Data Meetup by Professor Hendrik Blockeel from KU Leuven & Leiden University.
With increasing amounts of ever more complex forms of digital data becoming available, the methods for analyzing these data have also become more diverse and sophisticated. With this comes an increased risk of incorrect use of these methods, and a greater burden on the user to be knowledgeable about their assumptions. In addition, the user needs to know about a wide variety of methods to be able to apply the most suitable one to a particular problem. This combination of broad and deep knowledge is not sustainable.
The idea behind declarative data analysis is that the burden of choosing the right statistical methodology for answering a research question should no longer lie with the user, but with the system. The user should be able to simply describe the problem, formulate a question, and let the system take it from there. To achieve this, we need to find answers to questions such as: what languages are suitable for formulating these questions, and what execution mechanisms can we develop for them? In this talk, I will discuss recent and ongoing research in this direction. The talk will touch upon query languages for data mining and for statistical inference, declarative modeling for data mining, meta-learning, and constraint-based data mining. What connects these research threads is that they all strive to put intelligence about data analysis into the system, instead of assuming it resides in the user.
Hendrik Blockeel is a professor of computer science at KU Leuven, Belgium, and part-time associate professor at Leiden University, The Netherlands. His research interests lie mostly in machine learning and data mining. He has made a variety of research contributions in these fields, including work on decision tree learning, inductive logic programming, predictive clustering, probabilistic-logical models, inductive databases, constraint-based data mining, and declarative data analysis. He is an action editor for Machine Learning and serves on the editorial board of several other journals. He has chaired or organized multiple conferences, workshops, and summer schools, including ILP, ECMLPKDD, IDA and ACAI, and he has been vice-chair, area chair, or senior PC member for ECAI, IJCAI, ICML, KDD, ICDM. He was a member of the board of the European Coordinating Committee for Artificial Intelligence from 2004 to 2010, and currently serves as publications chair for the ECMLPKDD steering committee.
Machine learning and linear regression programming, by Soumya Mukherjee
Overview of AI and ML
Terminology awareness
Applications in real world
Use cases within Nokia
Types of Learning
Regression
Classification
Clustering
Linear Regression with a Single Variable in Python
A data science observatory based on RAMP - rapid analytics and model prototyping, by Akin Osman Kazakci
RAMP approach to analytics: Rapid Analytics and Model Prototyping; collaborative data challenges with in-built data science process management tools and analytics; An observatory of data science and scientists. Presented at the Design Theory Special Interest Group of International Design Society. Mines ParisTech and Centre for Data Science.
5 Practical Steps to a Successful Deep Learning Research, by Brodmann17
Deep Learning has gained huge popularity over the last several years, especially due to its remarkable progress in many domains.
Many resources are out there, including open-source implementations of recent research advancements. This vast availability is somewhat misleading, because when one actually wants to create a Deep Learning based product, one soon realizes that there is a large gap between these open-source implementations and a real production-grade Deep Learning product. Closing this gap can take months of work involving large costs, especially in manpower and compute power.
In this talk I will discuss, based on my experience leading the research at Brodmann17, several aspects we have found to be important for building Deep Learning based computer vision products.
Digital Transformation and Process Optimization in Manufacturing, by BigML, Inc
Keyanoush Razavidinani, Digital Services Consultant at A1 Digital, a BigML Partner, highlights why it is important to identify and reduce human bottlenecks in order to optimize processes and focus on important activities. Additionally, Guillem Vidal, Machine Learning Engineer at BigML, completes the session by showcasing how Machine Learning is put to use in the manufacturing industry with a use case to detect factory failures.
The Road to Production: Automating your Anomaly Detectors - by jao (Jose A. Ortega), Co-Founder and Chief Technology Officer at BigML.
*Machine Learning School in The Netherlands 2022.
Similar to DutchMLSchool 2022 - History and Developments in ML
DutchMLSchool 2022 - ML for AML Compliance, by BigML, Inc
Machine Learning for Anti-Money Laundering Compliance, by Kevin Nagel, Consultant and Data Scientist at INFORM.
*Machine Learning School in The Netherlands 2022.
DutchMLSchool 2022 - Multi Perspective Anomalies, by BigML, Inc
Multi Perspective Anomalies, by Jan W Veldsink, Master in the art of AI at Nyenrode, Rabobank, and Grio.
*Machine Learning School in The Netherlands 2022.
DutchMLSchool 2022 - My First Anomaly Detector, by BigML, Inc
My First Anomaly Detector: Practical Workshop, by Mercè Martín, VP of Bindings and Applications at BigML.
*Machine Learning School in The Netherlands 2022.
Introduction to End-to-End Machine Learning: Classification and Regression - Mercè Martín, VP of Bindings and Applications at BigML.
*Machine Learning School in The Netherlands 2022.
DutchMLSchool 2022 - A Data-Driven Company, by BigML, Inc
A Data-Driven Company: 21 Lessons for Large Organizations to Create Value from AI, by Richard Benjamins, Chief AI and Data Strategist at Telefónica.
*Machine Learning School in The Netherlands 2022.
DutchMLSchool 2022 - ML in the Legal Sector, by BigML, Inc
How Machine Learning Transforms and Automates Legal Services, by Arnoud Engelfriet, Co-Founder at Lynn Legal.
*Machine Learning School in The Netherlands 2022.
Machine Learning for Public Safety: Reducing Violence and Discrimination in Stadiums.
Speakers: Ramon van Ingen, Co-Founder at Siip, Entrepreneur, Researcher, and Pablo González, Machine Learning Engineer at BigML.
*Machine Learning School in The Netherlands 2022.
DutchMLSchool 2022 - Process Optimization in Manufacturing Plants, by BigML, Inc
Process Optimization in Manufacturing Plants, by Keyanoush Razavidinani, Digital Business Consultant at A1 Digital.
*Machine Learning School in The Netherlands 2022.
DutchMLSchool 2022 - Anomaly Detection at Scale, by BigML, Inc
Lessons Learned Applying Anomaly Detection at Scale, by Álvaro Clemente, Machine Learning Engineer at BigML.
*Machine Learning School in The Netherlands 2022.
DutchMLSchool 2022 - Citizen Development in AI, by BigML, Inc
Citizen Development in AI, by Jan W Veldsink, Master in the art of AI at Nyenrode, Rabobank, and Grio.
*Machine Learning School in The Netherlands 2022.
This new feature is a continuation of and improvement on our previous Image Processing release. Now, Object Detection lets you go a step further with your image data and allows you to locate objects and annotate regions in your images. Once your image regions are defined, you can train and evaluate Object Detection models, make predictions with them, and automate end-to-end Machine Learning workflows on a single platform. To make that possible, BigML enables Object Detection by introducing the regions optype.
As with any other BigML feature, Object Detection is available from the BigML Dashboard, API, and WhizzML for automation. Object Detection is extremely helpful to tackle a wide range of computer vision use cases such as medical image analysis, quality control in manufacturing, license plate recognition in transportation, people detection in security surveillance, among many others.
This new release brings Image Processing to the BigML platform, a feature that enhances our offering to solve image data-driven business problems with remarkable ease of use. Because BigML treats images as any other data type, this unique implementation allows you to easily use image data alongside text, categorical, numeric, date-time, and items data types as input to create any Machine Learning model available in our platform, both supervised and unsupervised.
Now, it is easier than ever to solve a wide variety of computer vision and image classification use cases in a single platform: label your image data, train and evaluate your models, make predictions, and automate your end-to-end Machine Learning workflows. As with any other BigML feature, Image Processing is available from the BigML Dashboard, API, and WhizzML, and it can be applied to solve use cases such as medical image analysis, visual product search, security surveillance, and vehicle damage detection, among others.
Machine Learning in Retail: Know Your Customers' Customer. See Your Future, by BigML, Inc
This session presents a quite common situation for those working in food and beverage (FnB) retail and highlights interesting insights for reducing waste.
Speaker: Stephen Kinns, CEO and Co-Founder at catsAi.
*ML in Retail 2021: Webinar.
Machine Learning in Retail: ML in the Retail Sector, by BigML, Inc
This is an introductory session about the role that Machine Learning is playing in the retail sector and how it is being deployed across the different areas of this industry.
Speaker: Atakan Cetinsoy, VP of Predictive Applications at BigML.
*ML in Retail 2021: Webinar.
ML in GRC: Machine Learning in Legal Automation, How to Trust a Lawyerbot, by BigML, Inc
This presentation analyzes the role that Machine Learning plays in legal automation with a real-world Machine Learning application.
Speaker: Arnoud Engelfriet, Co-Founder at Lynn Legal.
*ML in GRC 2021: Virtual Conference.
ML in GRC: Supporting Human Decision Making for Regulatory Adherence with Mac..., by BigML, Inc
This is a real-life Machine Learning use case about integrated risk.
Speakers: Thomas Rengersen, Product Owner of the Governance Risk and Compliance Tool for Rabobank, and Thomas Alderse Baas, Co-Founder and Director of The Bowmen Group.
*ML in GRC 2021: Virtual Conference.
ML in GRC: Cybersecurity versus Governance, Risk Management, and Compliance, by BigML, Inc
Some of these concepts (Cybersecurity, Governance, Risk Management, and Compliance) overlap and sometimes they can be confusing. This session helps us understand why those terms are key for any business to be successful.
Speaker: Jon Shende, Founding Investor at MyVayda.
*ML in GRC 2021: Virtual Conference.
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag..., by sameer shah
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
Adjusting primitives for graph: SHORT REPORT / NOTES, by Subhajit Sahu
Graph algorithms, like PageRank. Compressed Sparse Row (CSR) is an adjacency-list based graph representation that is
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Levelwise PageRank with Loop-Based Dead End Handling Strategy: SHORT REPORT ..., by Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation of ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. Slowdown on the GPU is likely caused by a large submission of small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
Global Situational Awareness of A.I. and where it's headed, by vikram sood
You can see the future first in San Francisco.
Over the past year, the talk of the town has shifted from $10 billion compute clusters to $100 billion clusters to trillion-dollar clusters. Every six months another zero is added to the boardroom plans. Behind the scenes, there’s a fierce scramble to secure every power contract still available for the rest of the decade, every voltage transformer that can possibly be procured. American big business is gearing up to pour trillions of dollars into a long-unseen mobilization of American industrial might. By the end of the decade, American electricity production will have grown tens of percent; from the shale fields of Pennsylvania to the solar farms of Nevada, hundreds of millions of GPUs will hum.
The AGI race has begun. We are building machines that can think and reason. By 2025/26, these machines will outpace college graduates. By the end of the decade, they will be smarter than you or I; we will have superintelligence, in the true sense of the word. Along the way, national security forces not seen in half a century will be un-leashed, and before long, The Project will be on. If we’re lucky, we’ll be in an all-out race with the CCP; if we’re unlucky, an all-out war.
Everyone is now talking about AI, but few have the faintest glimmer of what is about to hit them. Nvidia analysts still think 2024 might be close to the peak. Mainstream pundits are stuck on the wilful blindness of “it’s just predicting the next word”. They see only hype and business-as-usual; at most they entertain another internet-scale technological change.
Before long, the world will wake up. But right now, there are perhaps a few hundred people, most of them in San Francisco and the AI labs, that have situational awareness. Through whatever peculiar forces of fate, I have found myself amongst them. A few years ago, these people were derided as crazy—but they trusted the trendlines, which allowed them to correctly predict the AI advances of the past few years. Whether these people are also right about the next few years remains to be seen. But these are very smart people—the smartest people I have ever met—and they are the ones building this technology. Perhaps they will be an odd footnote in history, or perhaps they will go down in history like Szilard and Oppenheimer and Teller. If they are seeing the future even close to correctly, we are in for a wild ride.
Let me tell you what we see.
Analysis insight about a Flyball dog competition team's performance, by roli9797
Insights from my analysis of a Flyball dog competition team's performance last year. Find more: https://github.com/rolandnagy-ds/flyball_race_analysis/tree/main
Unleashing the Power of Data: Choosing a Trusted Analytics Platform, by Enterprise Wired
In this guide, we'll explore the key considerations and features to look for when choosing a trusted analytics platform that meets your organization's needs and delivers actionable intelligence you can trust.
3. BigML, Inc #DutchMLSchool
Outline
• Anomaly Detection Use Cases
• Four Basic Methods for Anomaly Detection with Engineered Features
• Benchmarking Study
• Incorporating Feedback
• Deep Versions of the Four Basic Methods
• Classifier-Based Anomaly Detection using the Max Logit Score
• Familiarity Hypothesis
• Challenges for the Future
5. Use Cases
• Data Cleaning
  • Remove corrupted data from the training data
  • Example: Typos in feature values, feature values interchanged, test results from two patients combined
• Fault Detection, Fraud Detection, Cyber Attack
  • At training or test time, faulty or illegal behavior creates anomalous data
• Open Category Detection
  • At test time, the classifier is given an instance of a novel category
  • Example: Self-driving car (trained in Europe) encounters a kangaroo (in Australia)
• Out-of-Distribution Detection
  • At test time, the classifier is given an instance collected in a different way
  • Example: Chest X-ray classifier trained only on front views is shown a side view
  • Example: Self-driving car trained in clear conditions must operate during rainy conditions
6. Protecting a Classifier

• Claim: every deployed ML classifier should include an anomaly detector to detect queries that lie outside the classifier's region of competence
• Also useful as a performance indicator to detect that the classifier needs to be retrained

[Diagram: a query x_q first goes to an anomaly detector; if A(x_q) > τ, the query is rejected; otherwise the classifier f, trained on examples (x_i, y_i), outputs ŷ = f(x_q)]
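The rejection pipeline on this slide can be sketched in a few lines. The detector (nearest-neighbor distance), the classifier (nearest class mean), and the 99% calibration quantile below are all illustrative choices of mine, not part of the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training set: two Gaussian classes in 2-D.
X0 = rng.normal(loc=[0, 0], scale=0.5, size=(100, 2))
X1 = rng.normal(loc=[4, 4], scale=0.5, size=(100, 2))
X_train = np.vstack([X0, X1])
y_train = np.array([0] * 100 + [1] * 100)

means = np.stack([X_train[y_train == k].mean(axis=0) for k in (0, 1)])

def anomaly_score(x):
    """A(x): distance to the nearest training point (a simple detector)."""
    return np.linalg.norm(X_train - x, axis=1).min()

def classify(x):
    """f(x): nearest class mean."""
    return int(np.argmin(np.linalg.norm(means - x, axis=1)))

# Calibrate tau as the 99th percentile of leave-one-out training scores.
train_scores = []
for i, x in enumerate(X_train):
    d = np.linalg.norm(X_train - x, axis=1)
    d[i] = np.inf                      # exclude the point itself
    train_scores.append(d.min())
tau = np.quantile(train_scores, 0.99)

def gated_predict(x_q):
    """Reject the query if A(x_q) > tau; otherwise return f(x_q)."""
    if anomaly_score(x_q) > tau:
        return "reject"
    return classify(x_q)

in_dist = gated_predict(np.array([0.1, -0.2]))     # near class 0
far_away = gated_predict(np.array([20.0, -15.0]))  # far from both classes
```

Any anomaly detector and classifier pair can be wired together this way; only the calibration of τ ties them to the training data.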
7. Anomaly Detection Definitions

• Definition: an "anomaly" is a data point generated by a process that is different from the process generating the "nominal" data
• Let D_0 be the probability distribution of the nominal process
• Let D_a be the probability distribution of the anomaly process
• Two formal settings:
  • Clean training data
  • Contaminated training data
8. Clean Training Data

• Given:
  • Training data x_1, x_2, …, x_N, all drawn from D_0, the "nominal" distribution
  • Test data x_{N+1}, …, x_{N+M}, drawn from a mixture of D_0 and D_a (the anomaly distribution)
• Find: the data points in the test data that belong to D_a
• Examples:
  • Protecting a classifier
  • Detecting manufacturing defects / equipment failure
9. Contaminated Training Data

• Given: training data x_1, x_2, …, x_N drawn from a mixture of D_0 and D_a (the anomaly distribution)
• Find: the data points in the training data that belong to D_a
• Use cases:
  • Data cleaning
  • Fraud detection, insider-threat detection
• The two settings can be combined: contaminated training data + separate contaminated test data
11. Theoretical Approaches to Anomaly Detection

• Distance-Based Methods
  • Anomaly score: A(x_q) = min_{x ∈ D} ‖x_q − x‖
• Density Estimation Methods
  • Model the joint distribution P_D(x) of the input data points x_1, … ∈ D
  • Surprise: A(x_q) = −log P_D(x_q)
• Quantile Methods
  • Find a smooth function f such that {x : f(x) ≥ 0} contains 1 − α of the training data
  • Anomaly score: A(x) = −f(x)
• Reconstruction Methods
  • Train an autoencoder, x ≈ D(E(x)), where E is the encoder and D is the decoder
  • Anomaly score: A(x_q) = ‖x_q − D(E(x_q))‖
12. Approach 1: Distance-Based Methods

• Define a distance d(x_i, x_j)
• A(x_q) = min_{x ∈ D} d(x_q, x)
• Requires a good distance metric
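A common smoothing of the distance-based score above (and the "kNN" detector in the benchmarking study later) uses the mean distance to the k nearest training points rather than the single minimum. A minimal numpy sketch, with k=5 as an arbitrary choice:

```python
import numpy as np

def knn_anomaly_scores(X_train, X_query, k=5):
    """Mean distance from each query point to its k nearest training points."""
    # Pairwise Euclidean distances, shape (n_query, n_train)
    d = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=2)
    d.sort(axis=1)                     # nearest distances first in each row
    return d[:, :k].mean(axis=1)

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))          # nominal cluster around the origin
queries = np.array([[0.0, 0.0],        # inlier
                    [8.0, 8.0]])       # far from all the data
scores = knn_anomaly_scores(X, queries)
```

With k = 1 this reduces exactly to A(x_q) = min_{x ∈ D} d(x_q, x).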
13. Isolation Forest [Liu, Ting, Zhou, 2008]

• Approximates the L1 (Manhattan) distance (Guha, et al., ICML 2016)
• Construct a fully random binary tree:
  • choose attribute j at random
  • choose a splitting threshold θ uniformly from [min(x_·j), max(x_·j)]
  • repeat until every data point is in its own leaf
• Let d(x_i) be the depth of point x_i
• Repeat L times; let d̄(x_i) be the average depth of x_i
• A(x_i) = 2^(−d̄(x_i) / r(x_i)), where r(x_i) is the expected depth
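The tree-growing procedure on this slide is short enough to implement directly. The sketch below follows the slide's variant (grow until every point is isolated, no subsampling or height cap as in the published algorithm), and uses the usual harmonic-number normalizer for the expected depth r:

```python
import numpy as np

rng = np.random.default_rng(0)

def grow_depths(X, idx, depth, depths):
    """Recursively split on random attributes/thresholds until each point
    in idx is isolated; record the leaf depth of every point."""
    if len(idx) <= 1:
        for i in idx:
            depths[i] = depth
        return
    j = rng.integers(X.shape[1])                  # random attribute
    lo, hi = X[idx, j].min(), X[idx, j].max()
    if lo == hi:                                  # cannot split further
        for i in idx:
            depths[i] = depth
        return
    theta = rng.uniform(lo, hi)                   # random threshold
    left = idx[X[idx, j] <= theta]
    right = idx[X[idx, j] > theta]
    grow_depths(X, left, depth + 1, depths)
    grow_depths(X, right, depth + 1, depths)

def isolation_forest_scores(X, n_trees=50):
    n = len(X)
    all_depths = np.zeros((n_trees, n))
    for t in range(n_trees):
        grow_depths(X, np.arange(n), 0, all_depths[t])
    d_bar = all_depths.mean(axis=0)
    # r: expected depth, approximated via the harmonic number
    r = 2 * (np.log(n - 1) + 0.5772156649) - 2 * (n - 1) / n
    return 2 ** (-d_bar / r)                      # A(x_i) = 2^(-d_bar/r)

X = rng.normal(size=(100, 2))
X = np.vstack([X, [[6.0, 6.0]]])                  # append one obvious outlier
scores = isolation_forest_scores(X)
```

Isolated points sit close to the root, so their average depth is small and their score approaches 1; deeply buried inliers score well below that.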
14. Approach 2: Density Estimation

• Given a data set x_1, …, x_N, where x_i ∈ R^d
• Assume the data are drawn i.i.d. from an unknown probability density: x_i ∼ P(x_i)
• Goal: estimate P
• Anomaly score: A(x_q) = −log P(x_q), the "surprisal" from information theory
• Why density estimation? It gives a more global view by combining distances to all data points
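As a concrete instance of the surprisal score, here is a minimal Gaussian kernel density estimate in numpy. The bandwidth of 0.5 is an arbitrary illustrative choice, not a recommendation from the talk:

```python
import numpy as np

def kde_log_density(X_train, X_query, bandwidth=0.5):
    """log P(x_q) under a Gaussian kernel density estimate of the data."""
    d = X_train.shape[1]
    diff = X_query[:, None, :] - X_train[None, :, :]
    sq = (diff ** 2).sum(axis=2) / (2 * bandwidth ** 2)
    # Per-kernel log densities, including the Gaussian normalizing constant
    log_kernels = -sq - d * np.log(np.sqrt(2 * np.pi) * bandwidth)
    # Numerically stable log-mean-exp over the kernels
    m = log_kernels.max(axis=1)
    return m + np.log(np.exp(log_kernels - m[:, None]).mean(axis=1))

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 2))                  # nominal data
queries = np.array([[0.0, 0.0], [6.0, 6.0]])
surprisal = -kde_log_density(X, queries)       # A(x_q) = -log P(x_q)
```

The query far from the data gets a much larger surprisal, because every kernel assigns it a vanishing density.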
15. Example: LODA (Pevny, 2015)

• Introduce sparse random projections Π_l into 1-dimensional space
• Fit a density estimator P_l(Π_l x) in each 1-d space
• A(x_q) = (1/L) Σ_{l=1}^{L} −log P_l(Π_l x_q)
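A minimal sketch of the LODA idea, using histograms as the 1-d density estimators. The bin count, projection count, and Laplace smoothing below are my simplifications; the paper's construction (e.g., adaptive histograms) differs in detail:

```python
import numpy as np

def fit_loda(X, n_projections=100, n_bins=30):
    """Fit sparse random 1-d projections with histogram density estimates."""
    n, d = X.shape
    rng = np.random.default_rng(0)
    # Sparse projections: about sqrt(d) nonzero Gaussian entries each
    k = max(1, int(round(np.sqrt(d))))
    W = np.zeros((n_projections, d))
    hists = []
    for l in range(n_projections):
        nz = rng.choice(d, size=k, replace=False)
        W[l, nz] = rng.normal(size=k)
        z = X @ W[l]
        counts, edges = np.histogram(z, bins=n_bins)
        # Laplace-smoothed bin probabilities converted to densities
        dens = (counts + 1) / (counts.sum() + n_bins) / np.diff(edges)
        hists.append((edges, dens))
    return W, hists

def loda_score(x, W, hists):
    """A(x) = average negative log density across the 1-d projections."""
    total = 0.0
    for l, (edges, dens) in enumerate(hists):
        z = x @ W[l]
        i = np.clip(np.searchsorted(edges, z) - 1, 0, len(dens) - 1)
        total += -np.log(dens[i])
    return total / len(hists)

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 6))
W, hists = fit_loda(X)
inlier = loda_score(np.zeros(6), W, hists)
outlier = loda_score(np.full(6, 8.0), W, hists)
```

Each projection alone is a weak detector; averaging the negative log densities over many projections is what makes the ensemble effective.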
16. Approach 3: Quantile Methods

• Vapnik's principle: we only need to estimate the "decision boundary" between nominal and anomalous
• Surround the data with a function f that captures 1 − ε of the training data
• One-Class Support Vector Machine (OCSVM): f is a hyperplane in "kernel space"
• Support Vector Data Description (SVDD): f is a sphere in "kernel space"
• Issue: ε must be chosen at learning time rather than at run time
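To make the "surround the data" idea concrete, here is a deliberately kernel-free caricature of SVDD: fit a sphere around the data whose radius captures 1 − ε of the training points. The choice ε = 0.05 and the mean-centered sphere are illustrative assumptions, not the actual SVDD optimization:

```python
import numpy as np

def fit_svdd_lite(X, eps=0.05):
    """Fit a sphere (center c, radius R) capturing 1 - eps of the data."""
    c = X.mean(axis=0)
    dists = np.linalg.norm(X - c, axis=1)
    R = np.quantile(dists, 1 - eps)    # radius covering 1 - eps of the points
    return c, R

def f(x, c, R):
    """f(x) >= 0 inside the sphere; anomaly score is A(x) = -f(x)."""
    return R - np.linalg.norm(x - c)

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 2))
c, R = fit_svdd_lite(X)
inside = f(np.zeros(2), c, R)          # a central point
outside = f(np.array([5.0, 5.0]), c, R)
```

Note how ε is baked in when R is computed: to change the target false-alarm rate at run time, the boundary would have to be refit, which is exactly the issue the slide raises.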
17. Approach 4: Reconstruction Methods

• NavLab self-driving van (Pomerleau, NIPS 1992)
  • Primary head: predict the steering angle from the input image
  • Secondary head: predict the input image (an "auto-encoder")
  • A(x_q) = ‖x_q − x̂_q‖
  • If reconstruction is poor, this suggests that the steering angle should not be trusted
• Principle: anomaly detection through failure
  • Define a task on which the learned system should fail for anomalies
18. Application: Finding Unusual Chemical Spectra

• NASA Mars Science Laboratory ChemCam instrument
  • Collects 6144 spectral bands on rock samples from a 7 m distance using laser stimulation
• Goal: active learning to find interesting spectra
• DEMUD (Wagstaff, et al., 2013)
  • Incremental PCA applied to samples one at a time
  • Fit only to the samples labeled as "uninteresting" by the user
  • Show the user the most un-uninteresting sample (the sample with the highest PCA reconstruction error)
  • Rapidly discovers interesting samples
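The reconstruction-error score underlying DEMUD can be sketched with plain batch PCA (DEMUD itself is incremental and interactive; this is only the scoring step, on synthetic data of my own construction):

```python
import numpy as np

def pca_reconstruction_scores(X_train, X_query, n_components=2):
    """Anomaly score = norm of the PCA reconstruction residual."""
    mu = X_train.mean(axis=0)
    # Principal directions from the SVD of the centered training data
    _, _, Vt = np.linalg.svd(X_train - mu, full_matrices=False)
    V = Vt[:n_components]                  # top principal components
    Z = (X_query - mu) @ V.T               # encode
    X_hat = Z @ V + mu                     # decode
    return np.linalg.norm(X_query - X_hat, axis=1)

rng = np.random.default_rng(5)
# Training data lie near a 2-D plane embedded in 5-D space
latent = rng.normal(size=(200, 2))
basis = rng.normal(size=(2, 5))
X = latent @ basis + 0.01 * rng.normal(size=(200, 5))

on_plane = rng.normal(size=(1, 2)) @ basis            # consistent with training
off_plane = on_plane + 3.0 * rng.normal(size=(1, 5))  # leaves the plane
scores = pca_reconstruction_scores(X, np.vstack([on_plane, off_plane]))
```

Samples consistent with the learned subspace reconstruct almost perfectly; samples with structure outside it get large residuals, and those are the ones DEMUD shows the user next.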
19. Benchmarking Study [Andrew Emmott, 2015, 2020]

• Distance-Based Methods
  • kNN: mean distance to the k nearest neighbors
  • LOF: Local Outlier Factor (Breunig, et al., 2000)
  • ABOD: kNN Angle-Based Outlier Detector (Kriegel, et al., 2008)
  • IFOR: Isolation Forest (Liu, et al., 2008)
• Density-Based Methods
  • RKDE: Robust Kernel Density Estimation (Kim & Scott, 2008)
  • EGMM: Ensemble Gaussian Mixture Model (our group)
  • LODA: Lightweight Online Detector of Anomalies (Pevny, 2016)
• Quantile-Based Methods
  • OCSVM: One-Class SVM (Schoelkopf, et al., 1999)
  • SVDD: Support Vector Data Description (Tax & Duin, 2004)
20. Benchmarking Methodology

• Select 19 data sets from the UC Irvine repository
• Choose one or more classes to be "anomalies"; the rest are "nominals"
• Manipulate:
  • Relative frequency
  • Point difficulty
  • Irrelevant features
  • Clusteredness
• 20 replicates of each configuration
• Result: 11,888 non-trivial benchmark datasets
21. Analysis of Variance

• Linear ANOVA: log(AUC / (1 − AUC)) ~ rf + pd + cl + ir + pset + algo
  • rf: relative frequency
  • pd: point difficulty
  • cl: normalized clusteredness
  • ir: irrelevant features
  • pset: "parent" set
  • algo: anomaly detection algorithm
• AUC: area under the ROC curve for the nominal vs. anomaly binary decision
• Assess the algo effect while controlling for all other factors
22. Benchmarking Study Results

• 19 UCI datasets; 9 leading "feature-based" algorithms; 11,888 non-trivial benchmark datasets
• Mean AUC effect for "nominal" vs. "anomaly" decisions, controlling for:
  • Parent data set
  • Difficulty of individual queries
  • Fraction of anomalies
  • Irrelevant features
  • Clusteredness of anomalies
• Baseline method: distance to the nominal mean ("tmd")
• Best methods: k-nearest neighbors and Isolation Forest
• Worst methods: kernel-based OCSVM and SVDD

[Bar chart: mean AUC effect per algorithm, in decreasing order knn, iforest, egmm, rkde, lof, abod, loda, svdd, tmd, ocsvm; values range from roughly 0.78 down to 0.62]
23. Incorporating User Feedback: Initial Work

• Show the top-ranked candidate to the user
• The user labels the candidate
• The label is used to update the anomaly detector
• Two methods:
  • AAD [Das, et al., ICDM 2016]
  • GLAD-OMD (a modified version of iForest) [Siddiqui, et al., KDD 2018]

[Diagram: data feeds the anomaly detector, which proposes its best candidate to the user; the user's yes/no label is fed back to update the detector, and confirmed anomalies go on to anomaly analysis]
24. User Feedback Yields Big Improvements in Anomaly Discovery

• APT Engagement 3 results
27. Distance-Based Methods

• K-nearest neighbor in the latent space
• Issue: what distance metric should be used?
• Cosine distance is the most popular: d(z_1, z_2) = 1 − (z_1 · z_2) / (‖z_1‖ ‖z_2‖)
28. Density-Based Methods

• Mahalanobis Method
  • Fit a joint multivariate Gaussian
  • Each class k has its own mean μ_k
  • Shared covariance matrix Σ
• Given a new x: −log P(x) ∝ min_k (x − μ_k)^T Σ^{−1} (x − μ_k)
  • This is the squared Mahalanobis distance to the nearest class mean
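The Mahalanobis method is simple enough to sketch directly. Below, the latent vectors z are synthetic stand-ins for classifier features, and the small ridge added before inversion is my own numerical safeguard:

```python
import numpy as np

def fit_mahalanobis(Z, y):
    """Per-class means plus one shared covariance, as in the Mahalanobis method."""
    classes = np.unique(y)
    means = np.stack([Z[y == k].mean(axis=0) for k in classes])
    centered = Z - means[np.searchsorted(classes, y)]
    cov = centered.T @ centered / len(Z)
    # Small ridge keeps the inverse well conditioned (my addition)
    return means, np.linalg.inv(cov + 1e-6 * np.eye(Z.shape[1]))

def mahalanobis_score(z, means, cov_inv):
    """A(z) = min over classes of the squared Mahalanobis distance."""
    diffs = means - z
    return min(d @ cov_inv @ d for d in diffs)

rng = np.random.default_rng(6)
Z = np.vstack([rng.normal([0, 0], 0.5, (100, 2)),    # class 0 features
               rng.normal([4, 0], 0.5, (100, 2))])   # class 1 features
y = np.array([0] * 100 + [1] * 100)
means, cov_inv = fit_mahalanobis(Z, y)
near = mahalanobis_score(np.array([0.1, 0.0]), means, cov_inv)
far = mahalanobis_score(np.array([2.0, 10.0]), means, cov_inv)
```

Because −log P(x) is proportional to this minimum squared distance (up to constants), thresholding the score is equivalent to thresholding the Gaussian density.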
29. Open Hybrid: Classification + Density Estimation (Tack, Li, Guo, Guo, 2020)

• Residual Flow deep density estimator (Chen, Behrmann, Duvenaud, et al., NeurIPS 2019)
• Standard cross-entropy supervised loss
• Claim: this helps focus P(x) on relevant aspects of the images
• Anomaly score: A(x_q) = −log P(x_q)
30. Quantile Method: Deep SVDD (Ruff, et al., ICML 2018)

• The method is somewhat tricky to work with:
  • Set the center c as the mean of a small set of points passed through the untrained network
  • Use no bias weights
  • These choices help prevent "hypersphere collapse"
31. Reconstruction Methods: Deep Autoencoders

• Encoder: z = E(x)
• Decoder: x̂ = D(z)
• Challenge: how can E and D be constrained so that the autoencoder fails on anomalies but succeeds on nominal images?
• Autoencoders often learn general-purpose image compression methods
33. Surprise: The Max Logit Score

• Garrepalli (2020)
• Train a classifier to optimize softmax likelihood (minimize "cross-entropy loss")
• The maximum logit score is better than two distance methods:
  • Isolation Forest
  • LOF (a nearest-neighbor method)

[Bar chart: AUROC of anomaly measures on latent representations for CIFAR-100: H(y|x) 0.68, max softmax probability 0.67, max BCE probability 0.63, max logit 0.72, Isolation Forest 0.51, LOF 0.44]
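The max logit score itself is a one-liner once a classifier produces logits. The logit values below are invented for illustration; in practice they would come from the trained network's final pre-softmax layer:

```python
import numpy as np

def max_logit_scores(logits):
    """Familiarity-based anomaly score: A(x) = -max_k logit_k(x).
    A large max logit means familiar; a small one flags likely novelty."""
    return -logits.max(axis=1)

# Hypothetical logits for three queries from a 5-class classifier
logits = np.array([
    [9.2, 0.3, -1.0, 0.5, 0.1],   # confidently class 0 (familiar)
    [4.0, 3.8, 3.5, 3.9, 3.7],    # ambiguous class, but strong activations
    [0.4, 0.2, 0.1, 0.3, 0.2],    # weak activations everywhere (likely novel)
])
scores = max_logit_scores(logits)
```

Note the contrast with the max softmax probability: the second query would look uncertain under softmax (the probabilities are nearly uniform), yet its large logits still mark it as familiar under the max logit score.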
34. More Evidence for Max Logit

• Vaze, Han, Vedaldi, Zisserman (2021): "Open-Set Recognition: A Good Classifier is All You Need" (ICLR 2022; arXiv 2110.06207)
• Carefully train a classifier using the latest tricks: standard cross-entropy combined with
  • a cosine learning-rate schedule
  • learning-rate warmup
  • RandAugment augmentations
  • label smoothing
• Anomaly score: max logit, A(x) = −max_k ℓ_k
• Protocol from Lawrence Neal et al. (2018)
35. Still More Evidence for Max Logit

• Novel-class difficulty based on semantic distance
  • CUB: bird species
  • Air: aircraft
  • ImageNet
37. How Are Open-Set Images Represented by Deep Learning? (Alex Guyer)

• DenseNet with a 384-dimensional latent space
• CIFAR-10: 6 known classes, 4 novel classes
• UMAP visualization: light green = novel classes; darker greens = known classes
• Many novel classes stay toward the center of the space; others overlap with known classes
• Training was not required to "pull them out" so that they could be discriminated
38. Similar Results from Other Groups

• [Tack, et al., NeurIPS 2020]
• [Vaze, et al., arXiv 2110.06207]
39. The Familiarity Hypothesis

• A convolutional neural network learns "features" that detect image patches relevant to the classification task
• The logit layer weights these features to make the classification decision
• Novel classes activate fewer of these features, so their activation vectors are smaller
• Hypothesis: the network doesn't detect that an elephant is novel because of its trunk and tusks, but because its head doesn't activate known features
• The network doesn't detect novelty; it detects the absence of familiarity
40. Evidence: Number of Activated Features (Alex Guyer, unpublished)

• Novel images strongly activate fewer features
• CIFAR-10: 6 known classes; 4 novel classes
• DenseNet (z has 324 dimensions)
• Choose an activation threshold θ and count the number of features whose activation exceeds θ
• OOD images activate fewer features
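The feature-counting evidence is easy to reproduce in spirit. The activations below are synthetic stand-ins I generated to mimic the claimed pattern (known-class images activating features more strongly than OOD images); only the counting logic mirrors the slide:

```python
import numpy as np

def count_activated(Z, theta=0.5):
    """Number of latent features whose activation exceeds the threshold theta."""
    return (Z > theta).sum(axis=1)

rng = np.random.default_rng(7)
d = 324   # latent dimensionality from the slide
# Hypothetical ReLU-style activations: known-class images activate many
# features strongly; OOD images activate fewer (the familiarity hypothesis)
known = np.clip(rng.normal(0.8, 0.5, (50, d)), 0, None)
ood = np.clip(rng.normal(0.2, 0.5, (50, d)), 0, None)
known_counts = count_activated(known)
ood_counts = count_activated(ood)
```

Comparing the two count distributions (e.g., their means or histograms) is exactly the kind of evidence the slide reports for real DenseNet activations.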
41. Which Features Are Responsible for the Drop in Activation?

• Are the features "on" the object or on the background?
• Strategy: blur the object and see how the feature activations change
  • Activations that change must be on the object
• Details:
  • PASCAL VOC segmented images
  • Blur the original image (31x31 kernel; sd = 31)
  • Form a composite image where the blurred region replaces the segmented region
• Blur tool: https://www.peko-step.com/en/tool/blur.html
42. Blurring Examples

• Note: this does not remove all object-related information (e.g., the object boundary), so we don't detect all on-object features
43. Blurring Effect

• Define the "blurring effect" of feature j on image i: BE(i, j) = z_ij − z̃_ij, where
  • z_ij is the activation of latent feature j on image i
  • z̃_ij is the activation of latent feature j on the blurred version of image i
• "Presence feature": BE(i, j) > 0
  • Blurring decreases the activity of the feature; its net effect is to measure the presence of one or more image patterns
  • Its activity is high when those patterns are present
• "Absence feature": BE(i, j) < 0
  • Blurring increases the activity of the feature; its net effect is to measure the absence of one or more image patterns
  • Its activity is high when those patterns are absent
44. "On-Object" Score of Feature j for Class k

• On average, the activation of a feature changes when the object (of class k) is blurred:
  OO(j, k) = (1/N_k) Σ_{i : y_i = k} (z_ij − z̃_ij)
• Feature j is a net presence feature for class k if OO(j, k) > 0.02
• Feature j is a net absence feature for class k if OO(j, k) < −0.02
• Otherwise, j is net neutral for class k
45. Feature Taxonomy

• Logit score: ℓ_ik = Σ_j w_jk z_ij
• Contribution of feature j in image i to class k:
  • c_ijk = w_jk z_ij (in normal images)
  • c̃_ijk = w_jk z̃_ij (in blurred images)
• Mean contributions over the images of class k:
  • c̄_jk = (1/N_k) Σ_{i : y_i = k} c_ijk
  • c̄̃_jk = (1/N_k) Σ_{i : y_i = k} c̃_ijk

                       w_jk > 0            w_jk < 0
  OO(j, k) > 0.02      positive presence   negative presence
  OO(j, k) < −0.02     positive absence    negative absence

• Sun & Li: On the Effectiveness of Sparsification for Detecting the Deep Unknowns. arXiv 2111.09805
46. Mean Feature Types for Class 3

[Figure: mean on-object index (0.00 to 1.00) for positive and negative features; red = presence features, blue = absence features]
47. Zoomed View: Blurring Reduces c̄_jk

• Blurring reduces the contribution of positive presence features (red dots)
• Blurring reduces the contribution of negative absence features (blue dots)

[Figure: mean unblurred contribution vs. mean blurred contribution, plotted against the on-object index for presence and absence features]
48. Decomposing the Logit Score: Four Cases

• Positive presence: w_jk > 0 and OO(j, k) > 0
• Positive absence: w_jk > 0 and OO(j, k) < 0
• Negative presence: w_jk < 0 and OO(j, k) > 0
• Negative absence: w_jk < 0 and OO(j, k) < 0
52. Decomposing the Novelty Scores

• The positive presence features dominate the max logit score
• The negative absence and positive absence features (purple and blue lines) make a small contribution
• Negative presence features make no contribution
• Conclusion: decreases in the activations of positive presence features account for most of the max logit score
53. Decreases in Positive Presence Features Account for Novelty Detection Accuracy

• Red line: trend of the positive presence contribution to the max logit score
• Black line: smoothed estimate of classification accuracy ("known" vs. "novel")
54. Can We Expect Computer Vision Systems to Perceive Things They Have Not Been Trained On?

• Blakemore, Colin, and Grahame F. Cooper. "Development of the brain depends on the visual environment." (1970): 477-478.
  • Kittens raised in environments with only horizontal or only vertical lines
  • "They were virtually blind for contours perpendicular to the orientation they had experienced."
• Chomsky: "poverty of the stimulus"
• Image source: Li Yang Ku, https://computervisionblog.wordpress.com/2013/06/01/cats-and-vision-is-vision-acquired-or-innate/
55. Implications

• Familiarity-based anomaly detection advantages:
  • Easy to implement: the anomaly signal (max logit) can be extracted from the classifier; no separate anomaly detection model is needed
  • Training on additional, auxiliary classes improves both classification and anomaly detection performance
• Familiarity-based anomaly detection weaknesses:
  • Partially occluded nominal objects will be flagged as anomalies
  • If an image contains both a novel object and a known object, the novel object will not be detected
  • Adversarial attacks can easily cause false anomalies and missed anomalies
57. Challenges for Anomaly Detection

• Representation learning: can we learn deep representations that can represent outliers?
• Nonstationarity: as the world changes, the anomaly detection model must also change
• Explanation: users often want to know why something was labeled anomalous, in order to provide feedback or take other actions
• Setting alarm thresholds: how can we set a threshold to control the false-alarm and missed-alarm rates?
• Incremental (continual) learning in deep networks: how can we efficiently update a trained neural network to incorporate user feedback?
• Anomaly detection in temporal, spatial, and spatio-temporal data, in video data, etc.
• Anomaly detection at multiple scales
59. Shallow and Deep Methods for Anomaly Detection

• Four basic methods: distances, densities, density quantiles, and reconstruction
  • Distances work best; Isolation Forest is very robust
• Anomaly detection in deep learning
  • The four basic methods have been extended to deep learning
  • They often do not work well when applied to learned representations
• The classifier max logit score gives very competitive performance
  • Computed as a side effect of standard deep classifiers
  • Measures familiarity rather than novelty, which makes it risky in many settings
• Advances in deep anomaly detection require learning better representations