Detecting anomalies in data is a vital task with numerous high-impact applications in areas such as security, finance, health care, law enforcement, and many others. With graph data becoming ubiquitous, techniques for structured graph data have recently come into focus. In this presentation, we review techniques for anomaly detection in plain static graphs.
1. Anomaly Detection in Plain Static Graphs
Prepared by: Javad Forough (javad.forough@aut.ac.ir)
Professor: Dr. Mousavi
Amirkabir University of Technology
2. Outline
Introduction
Outliers & graph anomalies
Challenges
Anomaly detection in static graphs
Anomalies in static plain graphs
Structure based methods
Community based methods
Anomalies in static attributed graphs
Structure based methods
Community based methods
Relational learning based methods
3. Introduction
Detecting anomalies in data is a vital task with numerous high-impact
applications in areas such as security, finance, health care, law enforcement,
and many others.
With graph data becoming ubiquitous, techniques for structured graph data have
recently come into focus.
The branch of data mining concerned with discovering rare occurrences in datasets
is called anomaly detection.
Application examples:
Detecting network intrusion or network failure
Detecting credit card fraud
4. Introduction
Application examples:
Detecting email and Web spam
Detecting auction fraud
Detecting securities fraud
Detecting malware/spyware
Data cleaning
And many others.
5. Introduction
Outliers & graph anomalies:
Many techniques have been developed in the past decades, especially for
spotting outliers and anomalies in unstructured collections of
multi-dimensional data points.
However, data objects cannot always be treated as points lying independently
in a multi-dimensional space.
They may exhibit inter-dependencies, which should be accounted for during
the anomaly detection process (Fig. 1).
Data instances in a wide range of disciplines, such as physics, biology,
social sciences, and information systems, are inherently related to one
another.
6. Introduction
Outliers & graph anomalies:
Types of outliers in graphs
1. Node outliers : vertices with unusual characteristics
2. Linkage outliers : edges with unusual characteristics
3. Subgraph outliers : parts of the graph which exhibit unusual
characteristics
7. Introduction
Fig. 1 Point-based outlier detection versus graph-based anomaly detection:
(a) clouds of points (multidimensional), (b) inter-linked objects (network)
8. Introduction
Researchers have recently intensified their study of methods for anomaly
detection in structured graph data.
Why graphs? We highlight four main reasons that make graph-based
approaches to anomaly detection vital and necessary:
1. Inter-dependent nature of the data
2. Powerful representation
3. Relational nature of problem domains
4. Robust machinery
9. Introduction
Challenges:
No unique definition of the anomaly detection problem exists.
The general definition of an anomaly or an outlier is a vague one: the definition
becomes meaningful only under a given context or application.
The very first definition of an outlier dates back to 1980, and is given by Hawkins
(1980)[1]:
Definition 1 (Hawkins’ Definition of Outlier, 1980) “An outlier is an observation that
differs so much from other observations as to arouse suspicion that it was
generated by a different mechanism.”
The above definition is quite general and thus makes the detection problem an
open-ended one.
The problem of anomaly detection has been defined in various ways in different
contexts.
10. Introduction
Challenges:
The problem has many definitions, often tailored to the specific application
domain, and also goes by various names such as outlier, anomaly, outbreak,
event, change, fraud, detection, etc.
In some applications, such as data cleaning, outliers are even called noise.
We provide a general definition for the graph anomaly detection problem as
follows.
Definition 2 (General Graph Anomaly Detection Problem):
Given a (plain/attributed, static/dynamic) graph database, find the graph
objects (nodes/edges/substructures) that are rare and that differ significantly
from the majority of the reference objects in the graph.
11. Introduction
Challenges:
For practical purposes, a record/point/graph-object is flagged as anomalous if its
rarity/likelihood/outlierness score exceeds a user-defined or an estimated threshold.
In other words, an anomaly is treated as a data object or a group of objects that is
rare (e.g., rare combination of categorical attribute values), isolated (e.g., far-away
points in n-dimensional spaces), and/or surprising (e.g., data instances that do not
fit well in our mental/statistical model, or need too many bits to describe under the
Minimum Description Length principle (Rissanen 1999)[2]).
There are two kinds of challenges associated with anomaly detection:
1. data-specific challenges
2. problem-specific challenges
12. Introduction
Challenges:
Data-specific challenges: Simply put, the challenges with respect to data are those of
working with big data; namely volume, velocity, and variety of massive, streaming,
and complex datasets. The same challenges generalize to graph data as well.
Scale and dynamics
Facebook ~ 2 billion users
The Web ~ 1 trillion pages
Cell phone ~ over 6 billion users
Complexity
the datasets are rich and complex in content
13. Introduction
Challenges:
Problem-specific challenges: Additional challenges arise with respect to
the anomaly detection task itself.
Lack of labels and noisy labels
Class imbalance and asymmetric error
Novel anomalies
Graph-specific challenges : All of the above +
Inter-dependent objects
Variety of definitions
Size of search space
14. Anomaly detection in static plain graphs
Outliers in clouds of data points:
Multi-dimensional outlier detection
Techniques:
density-based
distance-based
depth-based
distribution-based
clustering-based
classification-based
information theory-based
spectrum-based
subspace-based
16. Anomaly detection in static plain graphs
Anomaly detection in static graphs:
1. Plain graphs
only nodes and edges among those nodes, i.e., the graph structure.
2. Attributed graphs
Social network: users have various interests, work/live at
different locations, have various education levels, etc.
Relational links have various strengths, types, frequencies, etc.
17. Anomaly detection in static plain graphs
A general definition of the anomaly detection problem for static graphs can be
stated as follows:
Definition 3 (Static-graph anomaly detection problem) :
Given the snapshot of a (plain or attributed) graph database, find the
nodes and/or edges and/or substructures that are “few and different” or
deviate significantly from the patterns observed in the graph.
18. Anomaly detection in static plain graphs
Anomalies in static plain graphs
The only information is the graph structure
1. Structure based methods
Feature-based approaches
Proximity-based approaches
2. Community based methods
19. Anomaly detection in static plain graphs
Structure based methods:
Feature-based approaches:
Main idea: This group of approaches uses the graph representation to
extract structural, graph-centric features.
They use the given graph structure to compute various measures associated
with the nodes, dyads, triads, egonets, communities, as well as the
global graph structure[3].
These features have been used in several anomaly detection
applications, including Web spam[4] and network intrusion[5].
20. Anomaly detection in static plain graphs
Structure based methods:
Node-level features
1. (in/out) degrees
2. centrality measures
1. Eigenvector
2. Closeness
3. Betweenness
3. local clustering coefficient
4. degree assortativity
5. roles
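As an illustration, two of the node-level features listed above (degree and local clustering coefficient) can be computed directly from the graph structure. The sketch below assumes an undirected graph stored as a dict of neighbor sets; the function name `node_features` and the toy graph are hypothetical.

```python
from itertools import combinations

def node_features(adj):
    """Return {node: (degree, local clustering coefficient)} for an undirected graph."""
    features = {}
    for node, nbrs in adj.items():
        degree = len(nbrs)
        # Count edges that exist among the neighbors of `node`.
        links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
        possible = degree * (degree - 1) / 2
        clustering = links / possible if possible else 0.0
        features[node] = (degree, clustering)
    return features

# Toy graph (assumption): triangle a-b-c plus a pendant node d attached to c.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
feats = node_features(adj)
```

Each node then becomes a point in a small feature space, where standard point-based outlier detection can be applied.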
21. Anomaly detection in plain graphs
Structure based methods:
dyadic features:
1. Reciprocity
2. edge betweenness
3. number of common neighbors
Egonet features:
1. number of triangles
2. total weight
3. principal eigenvalue
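Two of the dyadic features above can be sketched in a few lines. This is only an illustration under assumed representations: directed edges as (source, target) tuples for reciprocity, and an undirected dict of neighbor sets for common neighbors. Both function names are hypothetical.

```python
def reciprocity(edges):
    """Fraction of directed edges (u, v) whose reverse edge (v, u) also exists."""
    edge_set = set(edges)
    if not edge_set:
        return 0.0
    mutual = sum(1 for u, v in edge_set if (v, u) in edge_set)
    return mutual / len(edge_set)

def common_neighbors(adj, u, v):
    """Number of nodes adjacent to both u and v in an undirected graph."""
    return len(adj[u] & adj[v])

# Toy data (assumption): one mutual pair (1, 2)/(2, 1) among four directed edges.
directed_edges = [(1, 2), (2, 1), (2, 3), (3, 1)]
undirected = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}

r = reciprocity(directed_edges)
cn = common_neighbors(undirected, "a", "b")
```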
22. Anomaly detection in static plain graphs
Structure based methods:
node-group-level:
1. Density
2. Modularity
3. Conductance
Global measures:
1. number of connected components
2. distribution of component sizes
3. principal eigenvalue
4. minimum spanning tree weight
5. average node degree
6. global clustering coefficient
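Some of the global measures above need only a graph traversal. The following sketch, assuming an undirected graph stored as a dict of neighbor sets (a hypothetical representation), computes the connected components, the distribution of their sizes, and the average node degree.

```python
from collections import deque

def connected_components(adj):
    """Return a list of components, each a list of nodes, using breadth-first search."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), []
        while queue:
            node = queue.popleft()
            component.append(node)
            for nbr in adj[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    queue.append(nbr)
        components.append(component)
    return components

def average_degree(adj):
    """Mean number of neighbors per node."""
    return sum(len(nbrs) for nbrs in adj.values()) / len(adj)

# Toy graph (assumption): a path 1-2-3, an edge 4-5, and an isolated node 6.
adj = {1: {2}, 2: {1, 3}, 3: {2}, 4: {5}, 5: {4}, 6: set()}
components = connected_components(adj)
sizes = sorted(len(c) for c in components)
avg_deg = average_degree(adj)
```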
23. Anomaly detection in static plain graphs
Structure based methods:
Oddball[6] :
The aim of this technique is to find anomalous nodes
It builds its solution on the analysis of ego networks
Input = a graph , output = list of node outlier candidates
Ego network :
one-step neighborhood around a central node “ego”
includes the central node, its direct neighbors and all the edges
among these nodes
In other words, the ego network is the subgraph of one-step
neighborhood of the central node
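The ego network definition above translates almost literally into code. A minimal sketch, assuming an undirected graph without self-loops stored as a dict of neighbor sets (the name `ego_network` is hypothetical):

```python
def ego_network(adj, ego):
    """Return (nodes, edges) of the one-step neighborhood subgraph around `ego`."""
    nodes = {ego} | adj[ego]
    # Keep every edge whose two endpoints both lie inside the neighborhood.
    edges = {frozenset((u, v)) for u in nodes for v in adj[u] if v in nodes}
    return nodes, edges

# Toy graph (assumption): triangle a-b-c, with a tail c-d-e.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}
nodes, edges = ego_network(adj, "c")
```

For ego `c`, node `e` is two steps away and is therefore excluded, while the edge a-b among the neighbors is included.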
25. Anomaly detection in static plain graphs
Oddball :
1. Ego network extraction: get all ego networks from the input graph.
2. Feature selection: choose features of ego networks that could indicate
anomalies; compute these features for all ego networks.
3. Analysis: pinpoint anomalies using any outlier detection method in point
clouds
Two of the features that are successful in detecting outliers are number of
nodes and number of edges in the ego network
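The three steps can be sketched end to end with the two features named above. Note the assumption in step 3: OddBall[6] defines its own scoring, whereas this sketch swaps in a simpler stand-in, namely the absolute residual from a least-squares line fitted to (log N, log E). The function names and the toy graph are hypothetical.

```python
import math

def egonet_counts(adj):
    """Steps 1-2: for every ego, return (number of nodes, number of edges) of its egonet."""
    counts = {}
    for ego, nbrs in adj.items():
        nodes = nbrs | {ego}
        # Each undirected edge inside the egonet is seen from both endpoints.
        edges = sum(1 for u in nodes for v in adj[u] if v in nodes) // 2
        counts[ego] = (len(nodes), edges)
    return counts

def residual_scores(counts):
    """Step 3 (stand-in): distance of each egonet from a log-log least-squares fit."""
    pts = {k: (math.log(n), math.log(e)) for k, (n, e) in counts.items() if e > 0}
    mean_x = sum(x for x, _ in pts.values()) / len(pts)
    mean_y = sum(y for _, y in pts.values()) / len(pts)
    var_x = sum((x - mean_x) ** 2 for x, _ in pts.values())
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pts.values())
    slope = cov / var_x if var_x else 0.0
    intercept = mean_y - slope * mean_x
    return {k: abs(y - (slope * x + intercept)) for k, (x, y) in pts.items()}

# Toy graph (assumption): a star with hub "h" and five leaves, plus a separate 4-clique.
leaves = [f"l{i}" for i in range(1, 6)]
clique = ["q1", "q2", "q3", "q4"]
adj = {"h": set(leaves)}
adj.update({leaf: {"h"} for leaf in leaves})
adj.update({q: set(clique) - {q} for q in clique})

counts = egonet_counts(adj)
scores = residual_scores(counts)
ranked = sorted(scores, key=scores.get, reverse=True)  # most unusual egonets first
```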
26. Anomaly detection in static plain graphs
Oddball:
Plotting the number of nodes against the number of edges reveals
near-cliques and stars.
The green line represents the maximum number of edges in an 𝑛-node ego
network (𝑛∗(𝑛−1)/2).
The blue line represents the minimum number of edges (𝑛−1).
The closer an ego network lies to either line, the more remarkable it is likely
to be.
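The two bounds follow from the structure of an ego network: the ego is linked to each of its 𝑛−1 neighbors (a pure star), and at most every pair of the 𝑛 nodes is linked (a clique). Written out, with E denoting the number of edges (a symbol introduced here for illustration):

```latex
\underbrace{n - 1}_{\text{star}} \;\le\; E \;\le\; \underbrace{\frac{n(n-1)}{2}}_{\text{clique}}

% Worked example for n = 5:
%   lower bound: 5 - 1 = 4 edges (star)
%   upper bound: 5 * 4 / 2 = 10 edges (clique)
```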
27. Anomaly detection in static plain graphs
Oddball:
Fig. 4 Revealing cliques in graph 𝐴
Fig. 5 Revealing stars in graph 𝐵
28. Anomaly detection in static plain graphs
Structure based methods:
Proximity-based approaches:
Main idea : This group of techniques exploits the graph structure to measure closeness
(or proximity) of objects in the graph
These methods capture the simple autocorrelation between these objects, where close-by
objects are considered to be likely to belong to the same class (e.g. malicious/benign or
infected/healthy)
Measuring the importance of the nodes in a graph
PageRank[7]
Personalized PageRank (PPR)[8]
SimRank[9]
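As a concrete example of such an importance measure, PageRank can be approximated by power iteration. This is a minimal sketch, not the formulation of [7] in every detail: it assumes a directed graph stored as a dict of out-link lists, a damping factor of 0.85, and uniform redistribution of the rank held by nodes without out-links.

```python
def pagerank(out_links, damping=0.85, iterations=100):
    """Approximate PageRank scores by repeated propagation of rank along out-links."""
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by dangling nodes (no out-links) is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        new_rank = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            for w in out_links[v]:
                new_rank[w] += damping * rank[v] / len(out_links[v])
        rank = new_rank
    return rank

# Toy graph (assumption): a, b and c all point to "hub"; "hub" points back to a.
out_links = {"a": ["hub"], "b": ["hub"], "c": ["hub"], "hub": ["a"]}
rank = pagerank(out_links)
top = max(rank, key=rank.get)
```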
29. Anomaly detection in static plain graphs
Community based methods
Main idea: Cluster- or community-based methods for graph anomaly detection rely on finding densely
connected groups of “close-by” nodes in the graph and spotting nodes and/or edges that have connections
across communities.
Two main problems[10]
P1 : how to find the community of a given node : ‘neighborhood of a node’
Use random-walk-with-restart-based PPR scores[8] of all the nodes with respect to the given node
nodes with high PPR scores constitute the neighborhood of a node
P2 : how to quantify the extent to which the given node is a bridge node
The pairwise PPR scores among all the neighbors of the given node are averaged to compute a
so-called “normality” score of the node
nodes with low normality ~ have neighbors with low pairwise proximity to one another ~ neighbors lie in
different, separate communities ~ the given node resembles a bridging node across communities
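The normality score can be sketched as follows. This is a simplified illustration rather than the exact procedure of [10]: it assumes a connected undirected graph as a dict of neighbor sets, a restart probability of 0.15, a fixed number of iterations, and averaging over ordered neighbor pairs (PPR is not symmetric). All function names are hypothetical.

```python
from itertools import permutations

def personalized_pagerank(adj, source, restart=0.15, iterations=100):
    """Random walk with restart: relevance of every node with respect to `source`."""
    score = {v: 0.0 for v in adj}
    score[source] = 1.0
    for _ in range(iterations):
        nxt = {v: 0.0 for v in adj}
        for v, nbrs in adj.items():
            for w in nbrs:
                nxt[w] += (1 - restart) * score[v] / len(nbrs)
        nxt[source] += restart  # the walker jumps back to the source
        score = nxt
    return score

def normality(adj, node, restart=0.15):
    """Average pairwise PPR score among the neighbors of `node`."""
    nbrs = sorted(adj[node])
    if len(nbrs) < 2:
        return 0.0
    ppr = {u: personalized_pagerank(adj, u, restart) for u in nbrs}
    pairs = list(permutations(nbrs, 2))
    return sum(ppr[u][v] for u, v in pairs) / len(pairs)

# Toy graph (assumption): two triangles joined only through the bridge node "x".
adj = {
    "a1": {"a2", "a3", "x"}, "a2": {"a1", "a3"}, "a3": {"a1", "a2"},
    "b1": {"b2", "b3", "x"}, "b2": {"b1", "b3"}, "b3": {"b1", "b2"},
    "x": {"a1", "b1"},
}
bridge_score = normality(adj, "x")
member_score = normality(adj, "a2")
```

In this toy graph the bridge node `x` receives a lower normality score than the community member `a2`, since its two neighbors sit in different triangles.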
techniques:
1. SCAN
2. AUTOPART
30. Anomaly detection in static plain graphs
SCAN[11] – Structural Clustering Algorithm for Networks
purpose : to identify node outliers
two types of nodes that play special roles:
1. Outliers : nodes that are marginally connected to clusters
2. Hubs : nodes that bridge clusters
clusters : groups of nodes that have a dense set of edges running within
the clusters, and have a relatively low number of edges that run between
the clusters
Hubs play a significant role
Outliers have no importance and may be discarded or isolated as noise
31. Anomaly detection in static plain graphs
SCAN :
Input : a graph and two parameters (ε,μ)
ε captures the rigorousness of the condition of a node to be considered
part of a cluster
μ determines the minimum number of vertices a cluster must have
Output : a list of clusters, hubs, and outliers
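For reference, the two parameters can be tied to the structural similarity used by SCAN. The definitions below follow the usual statement of [11] as far as we recall it, so treat the notation as an assumption: Γ(v) is the neighborhood of v including v itself, and μ enters as the minimum size of a core's ε-neighborhood, which is how it bounds the cluster size from below.

```latex
\Gamma(v) = \{\, w \mid (v, w) \in E \,\} \cup \{ v \}

\sigma(v, w) = \frac{|\Gamma(v) \cap \Gamma(w)|}{\sqrt{|\Gamma(v)| \, |\Gamma(w)|}}

N_{\varepsilon}(v) = \{\, w \in \Gamma(v) \mid \sigma(v, w) \ge \varepsilon \,\}

\mathrm{CORE}_{\varepsilon, \mu}(v) \iff |N_{\varepsilon}(v)| \ge \mu
```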
32. Anomaly detection in static plain graphs
SCAN :
A low ε ~ sets a low bar for being a member of a cluster
Increasing ε tightens the coherence inside a cluster, and the initial
all-encompassing cluster is broken up into smaller groups
Fig. 6 ε=0.7, μ=2    Fig. 7 ε=0.8, μ=2    Fig. 8 ε=0.9, μ=2
33. Anomaly detection in static plain graphs
SCAN :
In Fig. 6, the original interpretation is retrieved: clusters {1, 2, 3, 4, 5, 6}
and {8, 9, 10, 11, 12, 13}, with 7 as a hub and 14 as an outlier.
Fig. 7 further decomposes the two clusters, thus also identifying 10 as a
hub, because it neighbors two clusters.
In the extreme case in Fig. 8, the conditions to form a cluster are so strict
that none is identified, so all nodes are taken to be outliers.
It is worth noting that ε=0.7 and μ=7 would also lead to the extreme case,
because there is no combination of seven nodes that are closely connected.
34. Anomaly detection in static plain graphs
SCAN :
How does it work?
At the beginning, all nodes are labeled as unclassified
SCAN performs one pass of the nodes, and classifies them either as a
cluster member or a non-member based on structure connectivity
At the end, when all clusters are found, the non-members are classified
further as hubs or outliers, based on the cluster membership of their
neighbors
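The procedure above can be sketched compactly. This is a simplified reading of SCAN[11], not the authors' implementation: it assumes an undirected graph as a dict of neighbor sets, uses the structural similarity |Γ(v) ∩ Γ(w)| / sqrt(|Γ(v)| |Γ(w)|) with Γ including the node itself, and grows each cluster from a core node by breadth-first search. The toy graph is hypothetical.

```python
import math
from collections import deque

def scan(adj, eps, mu):
    """Return (clusters, hubs, outliers) for an undirected graph."""
    closed = {v: adj[v] | {v} for v in adj}  # neighborhood including the node itself

    def similarity(v, w):
        return len(closed[v] & closed[w]) / math.sqrt(len(closed[v]) * len(closed[w]))

    def eps_neighbors(v):
        return {w for w in closed[v] if similarity(v, w) >= eps}

    label, clusters = {}, []
    for v in adj:
        if v in label or len(eps_neighbors(v)) < mu:
            continue  # already a member, or not a core node
        cid, members, queue = len(clusters), {v}, deque([v])
        label[v] = cid
        while queue:
            u = queue.popleft()
            reachable = eps_neighbors(u)
            if len(reachable) < mu:
                continue  # border node: it joins the cluster but does not expand it
            for w in reachable:
                if w not in label:
                    label[w] = cid
                    members.add(w)
                    queue.append(w)
        clusters.append(members)

    hubs, outliers = set(), set()
    for v in adj:
        if v in label:
            continue
        # A non-member whose neighbors lie in two or more clusters is a hub.
        touched = {label[w] for w in adj[v] if w in label}
        (hubs if len(touched) >= 2 else outliers).add(v)
    return clusters, hubs, outliers

# Toy graph (assumption): two 4-cliques, node 5 between them, node 10 hanging off node 9.
adj = {
    1: {2, 3, 4}, 2: {1, 3, 4}, 3: {1, 2, 4}, 4: {1, 2, 3, 5},
    5: {4, 6},
    6: {5, 7, 8, 9}, 7: {6, 8, 9}, 8: {6, 7, 9}, 9: {6, 7, 8, 10},
    10: {9},
}
clusters, hubs, outliers = scan(adj, eps=0.7, mu=2)
```

With ε=0.7 and μ=2, the two cliques come out as clusters, node 5 as a hub, and node 10 as an outlier.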
35. Anomaly detection in static plain graphs
AUTOPART[12]-Parameter Free Graph Partitioning and Outlier Detection
capable of identifying anomalous edges
primary purpose is to (automatically) partition the graph into clusters
without user intervention ~ it is parameter free
After finding a partitioning – a set of clusters – it proposes a method to
measure the outlierness of edges that bridge separate clusters
This technique specifically uses the adjacency matrix as graph
representation
A partitioning is a reordering of rows and columns in a way that nodes
belonging to the same cluster are placed next to each other
the adjacency matrix is broken down to blocks
36. Anomaly detection in static plain graphs
AUTOPART :
the squares located on the diagonal of the matrix capture the edges running inside
the clusters
the rectangles represent the edges bridging the corresponding clusters.
Fig. 9 Nodes    Fig. 10 Groups
37. Anomaly detection in static plain graphs
AUTOPART:
A good partitioning yields homogeneous blocks, which in turn, can be compressed
efficiently
The total cost consists of a description cost and a code cost
Description cost : holds the information about the rectangular/square blocks. It is
the transmission cost of the following terms :
• Number of nodes
• Node permutation (which row represents which node)
• number of clusters
• number of nodes in each cluster
• number of ones in each block (the number of edges bridging the given clusters)
Code cost : holds the information about the content of the blocks. It is the
transmission cost of the blocks calculated using the Shannon entropy function
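One way to write the code cost down explicitly is given below. The symbols are introduced here for illustration and are not taken from [12]: n_{ij} is the number of cells in block (i, j), w_{ij} the number of ones in it, and k the number of clusters.

```latex
P_{ij} = \frac{w_{ij}}{n_{ij}}, \qquad
H(p) = -p \log_2 p - (1 - p) \log_2 (1 - p)

\text{code cost} = \sum_{i=1}^{k} \sum_{j=1}^{k} n_{ij} \, H(P_{ij}), \qquad
\text{total cost} = \text{description cost} + \text{code cost}

% Worked example: a block with 9 cells and a single one costs
%   9 \cdot H(1/9) \approx 9 \cdot 0.503 \approx 4.53 bits,
% while a homogeneous block (all zeros or all ones) costs 0 bits.
```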
38. Anomaly detection in static plain graphs
AUTOPART:
Description cost penalizes a high number of blocks
code cost penalizes heterogeneous blocks
a good partitioning maintains a balance between a low number of clusters
and a high homogeneity of blocks
The algorithm finds the tradeoff point between the two aspects and yields
a construction with the minimal total cost
39. Anomaly detection in static plain graphs
AUTOPART: how does it work?
Fig. 11 Flow of AUTOPART: start with the initial matrix (k = 1); STEP 1: find good clusters for fixed k; STEP 2: increase k (k = k + 1); repeat while the encoding cost is lowered; output the final partitioning, k*
40. Anomaly detection in static plain graphs
AUTOPART:
It starts with an initial adjacency matrix, where all nodes belong to one cluster (k = 1)
Inside the main loop, the total cost is iteratively reduced until no improvements can
be made, and the final partitioning together with the final cluster count 𝑘∗ is
output
The iterative reduction is made up of two steps :
1. first, a good partitioning given the number of clusters is found
2. Second, the number of clusters is increased to allow for better partitioning
Once the final partitioning is found, AUTOPART marks the anomalous edges
Outliers show deviation from the normal patterns, so they hurt attempts to
compress data
Therefore, the edges whose removal reduces the total cost the most are marked
as outliers
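The edge-scoring idea can be sketched as follows. This is an illustration under simplifying assumptions, not the procedure of [12] in full: the partitioning is taken as given, only the code cost (block size times the Shannon entropy of the block density) is recomputed, and the description cost is ignored. The function names and the toy graph are hypothetical.

```python
import math
from itertools import product

def binary_entropy(p):
    """Shannon entropy (in bits) of a binary source with P(1) = p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def code_cost(matrix, groups):
    """Sum over all blocks of (cells in block) * H(fraction of ones in block)."""
    k = max(groups) + 1
    size = [groups.count(g) for g in range(k)]
    ones = [[0] * k for _ in range(k)]
    for r, row in enumerate(matrix):
        for c, value in enumerate(row):
            ones[groups[r]][groups[c]] += value
    cost = 0.0
    for i, j in product(range(k), repeat=2):
        cells = size[i] * size[j]
        if cells:
            cost += cells * binary_entropy(ones[i][j] / cells)
    return cost

def rank_edges(matrix, groups):
    """Rank undirected edges by how much their removal lowers the code cost."""
    base = code_cost(matrix, groups)
    gains = {}
    n = len(matrix)
    for r in range(n):
        for c in range(r + 1, n):
            if matrix[r][c]:
                trial = [row[:] for row in matrix]
                trial[r][c] = trial[c][r] = 0
                gains[(r, c)] = base - code_cost(trial, groups)
    return sorted(gains.items(), key=lambda item: item[1], reverse=True)

# Toy graph (assumption): triangles {0, 1, 2} and {3, 4, 5} joined by the edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
matrix = [[0] * 6 for _ in range(6)]
for u, v in edges:
    matrix[u][v] = matrix[v][u] = 1
groups = [0, 0, 0, 1, 1, 1]  # cluster index of each node

ranked = rank_edges(matrix, groups)
```

Here the bridging edge (2, 3) is ranked first, because removing it makes the two off-diagonal blocks homogeneous.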
41. References
1. D. M. Hawkins, Identification of outliers, Springer, 1980.
2. Rissanen J (1999) Hypothesis selection and testing by the MDL principle. Comput J 42:260–269
3. Henderson K, Eliassi-Rad T, Faloutsos C, Akoglu L, Li L, Maruhashi K, Prakash BA, Tong H (2010)
Metricforensics: a multi-level approach for mining volatile graphs. In: Proceedings of the 16th ACM
international conference on knowledge discovery and data mining (SIGKDD), Washington, DC, pp
163–172
4. Becchetti L, Castillo C, Donato D, Leonardi S, Baeza-Yates R (2006) Link-based characterization and
detection of Web Spam. In: Second international workshop on adversarial information retrieval on
the web (AIRWeb)
5. Ding Q, Katenka N, Barford P, Kolaczyk ED, Crovella M (2012) Intrusion as (anti)social
communication: characterization and detection. In: Proceedings of the 18th ACM international
conference on knowledge discovery and data mining (SIGKDD), Beijing, China. ACM, pp 886–894
42. References
6. Akoglu L, McGlohon M, Faloutsos C (2010) OddBall: spotting anomalies in weighted graphs. In:
Proceedings of the 14th Pacific-Asia conference on knowledge discovery and data mining
(PAKDD), Hyderabad, India, pp 410–421
7. Brin S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. Comput Netw
30(1–7):107–117
8. Haveliwala TH (2003) Topic-sensitive pagerank: a context-sensitive ranking algorithm for web
search. IEEE Trans Knowl Data Eng 15(4):784–796
9. Jeh G, Widom J (2002) SimRank: a measure of structural-context similarity. In: Proceedings of the
8th ACM international conference on knowledge discovery and data mining (SIGKDD), Edmonton,
Alberta, pp 538–543
10. Sun J, Qu H, Chakrabarti D, Faloutsos C (2005) Neighborhood formation and anomaly detection in
bipartite graphs. In: Proceedings of the 5th IEEE international conference on data mining (ICDM),
Houston, TX. IEEE Computer Society, pp 418–425
11. X. Xu, N. Yuruk, Z. Feng and T. A. Schweiger, "SCAN: A Structural Clustering Algorithm for
Networks," in Proceedings of the 13th ACM SIGKDD international conference on Knowledge
discovery and data mining, 2007.
43. References
12. D. Chakrabarti, "Autopart: Parameter-free graph partitioning and outlier detection," in Knowledge
Discovery in Databases: PKDD 2004, Springer, 2004, pp. 112--124.