Anomaly detection in plain static graphs

Prepared by: Javad Forough (javad.forough@aut.ac.ir)
Professor: Dr. Mousavi
Amirkabir University of Technology
1

Outline
 Introduction
   Outliers & graph anomalies
   Challenges
 Anomaly detection in static graphs
   Anomalies in static plain graphs
     Structure based methods
     Community based methods
   Anomalies in static attributed graphs
     Structure based methods
     Relational learning based methods
2

Introduction
 Detecting anomalies in data is a vital task with numerous high-impact
applications in areas such as security, finance, health care, and law
enforcement.
 With graph data becoming ubiquitous, techniques for structured graph data have
recently come into focus.
 The branch of data mining concerned with discovering rare occurrences in datasets
is called anomaly detection.
 Application examples:
 Detecting network intrusion or network failure
 Detecting credit card fraud
3

Introduction
 Application examples:
 Detecting email and Web spam
 Detecting auction fraud
 Detecting securities fraud
 Detecting malware/spyware
 Data cleaning
 And many others
4

Introduction
 Outliers & graph anomalies:
 Many techniques have been developed over the past decades, especially for
spotting outliers and anomalies in unstructured collections of
multi-dimensional data points.
 However, data objects cannot always be treated as independent points lying in a
multi-dimensional space.
 They may exhibit inter-dependencies that should be accounted for during
the anomaly detection process (Fig. 1).
 Data instances in a wide range of disciplines, such as physics, biology,
social sciences, and information systems, are inherently related to one
another.
5

Introduction
 Outliers & graph anomalies:
 Types of outliers in graphs
1. Node outliers : vertices with unusual characteristics
2. Linkage outliers : edges with unusual characteristics
3. Subgraph outliers : parts of the graph which exhibit unusual
characteristics
6

Introduction
 Fig. 1: Point-based outlier detection versus graph-based anomaly detection.
(a) Clouds of points (multi-dimensional); (b) inter-linked objects (network).
7

Introduction
 Researchers have recently intensified their study of methods for anomaly
detection in structured graph data.
 Why graphs? Four main reasons make graph-based approaches to anomaly
detection vital and necessary:
1. Inter-dependent nature of the data
2. Powerful representation
3. Relational nature of problem domains
4. Robust machinery
8

Introduction
 Challenges:
 No unique definition of the anomaly detection problem exists.
 The general definition of an anomaly or an outlier is vague: it
becomes meaningful only in a given context or application.
 One of the earliest and most widely cited definitions of an outlier dates back to 1980 and is given by Hawkins
(1980)[1]:
 Definition 1 (Hawkins’ Definition of Outlier, 1980): “An outlier is an observation that
differs so much from other observations as to arouse suspicion that it was
generated by a different mechanism.”
 This definition is quite general and thus makes the detection problem an
open-ended one.
 As a result, the problem of anomaly detection has been defined in various ways in different
contexts.
9

Introduction
 Challenges:
 The problem has many definitions, often tailored to the specific application
domain, and goes by various names such as outlier, anomaly, outbreak,
event, change, and fraud detection.
 In some applications, such as data cleaning, outliers are even referred to as noise.
 We provide a general definition of the graph anomaly detection problem as
follows.
 Definition 2 (General Graph Anomaly Detection Problem):
Given a (plain/attributed, static/dynamic) graph database, find the graph
objects (nodes/edges/substructures) that are rare and that differ significantly
from the majority of the reference objects in the graph.
10

Introduction
 Challenges:
 For practical purposes, a record/point/graph-object is flagged as anomalous if its
rarity/likelihood/outlierness score exceeds a user-defined or estimated threshold.
 In other words, an anomaly is treated as a data object or a group of objects that is
rare (e.g., a rare combination of categorical attribute values), isolated (e.g., far-away
points in n-dimensional space), and/or surprising (e.g., data instances that do not
fit well in our mental/statistical model, or that need too many bits to describe under the
Minimum Description Length principle (Rissanen 1999)[2]).
 There are two challenges associated with anomaly detection:
1. data-specific challenges
2. problem-specific challenges
11
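The thresholding rule described above can be sketched in a few lines. This is only an illustration: the scores are made up, and estimating the threshold as median + k * MAD (median absolute deviation) is an assumption of this sketch, not a rule taken from the slides.

```python
from statistics import median

def flag_anomalies(scores, k=3.0):
    """Flag items whose score exceeds an estimated threshold.

    The threshold median + k * MAD is one possible estimate (an assumption
    of this sketch); a user-defined constant could be used instead.
    """
    values = list(scores.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    threshold = med + k * mad
    flagged = sorted(name for name, s in scores.items() if s > threshold)
    return flagged, threshold

# Hypothetical outlierness scores for five graph objects.
scores = {"a": 1.0, "b": 1.2, "c": 0.9, "d": 1.1, "e": 9.5}
flagged, threshold = flag_anomalies(scores)
print(flagged, round(threshold, 2))
```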

Introduction
 Challenges:
 Data-specific challenges: simply put, the challenges with respect to data are those of
working with big data, namely the volume, velocity, and variety of massive, streaming,
and complex datasets. The same challenges carry over to graph data.
 Scale and dynamics
 Facebook: ~2 billion users
 The Web: ~1 trillion pages
 Cell phones: over 6 billion users
 Complexity
 The datasets are rich and complex in content
12

Introduction
 Challenges:
 Problem-specific challenges: additional challenges arise with respect to
the anomaly detection task itself.
 Missing and noisy labels
 Class imbalance and asymmetric error costs
 Novel anomalies
 Graph-specific challenges: all of the above, plus
 Inter-dependent objects
 Variety of definitions
 Size of search space
13

Anomaly detection in static plain graphs
 Outliers in clouds of data points:
 Multi-dimensional outlier detection
 Techniques:
 density-based
 distance-based
 depth-based
 distribution-based
 clustering-based
 classification-based
 information theory-based
 spectrum-based
 subspace-based
14

Anomaly detection in static plain graphs
15

 Anomaly detection in static graphs:
1. Plain graphs
 Only nodes and the edges among those nodes, i.e. the graph structure.
2. Attributed graphs
 Nodes and/or edges also carry attributes.
 Social network: users have various interests, work/live at
different locations, have various education levels, etc.
 Relational links have various strengths, types, frequencies, etc.
16

 A general definition of the anomaly detection problem for static graphs can be
stated as follows:
 Definition 3 (Static-graph anomaly detection problem):
 Given a snapshot of a (plain or attributed) graph database, find the
nodes and/or edges and/or substructures that are “few and different” or
deviate significantly from the patterns observed in the graph.
17

 Anomalies in static plain graphs
 The only information is the graph structure
1. Structure based methods
 Feature-based approaches
 Proximity-based approaches
2. Community based methods
18

 Structure based methods:
 Feature-based approaches:
 Main idea: this group of approaches uses the graph representation to
extract structural, graph-centric features
 They use the given graph structure to compute various measures associated
with the nodes, dyads, triads, egonets, communities, as well as the
global graph structure[3]
 These features have been used in several anomaly detection
applications, including Web spam[4] and network intrusion[5] detection
19

 Node-level features
1. (In/out) degrees
2. Centrality measures
   a. Eigenvector
   b. Closeness
   c. Betweenness
3. Local clustering coefficient
4. Degree assortativity
5. Roles
20
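As a small illustration of node-level features, the sketch below computes the degree and the local clustering coefficient on a toy undirected graph. The graph and the function names are hypothetical and chosen only for this example.

```python
from itertools import combinations

# Hypothetical undirected toy graph: node -> set of neighbours.
graph = {
    1: {2, 3, 4},
    2: {1, 3},
    3: {1, 2},
    4: {1},
}

def degree(graph, node):
    """Number of neighbours of `node`."""
    return len(graph[node])

def local_clustering(graph, node):
    """Fraction of neighbour pairs of `node` that are themselves connected."""
    neighbours = graph[node]
    k = len(neighbours)
    if k < 2:
        return 0.0  # convention: undefined cases are reported as 0
    links = sum(1 for u, v in combinations(neighbours, 2) if v in graph[u])
    return 2.0 * links / (k * (k - 1))

# One feature vector (degree, clustering coefficient) per node.
features = {n: (degree(graph, n), local_clustering(graph, n)) for n in graph}
print(features)
```

Each node becomes a point in feature space, so any point-based outlier detector can then be applied to these vectors.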

Anomaly detection in static plain graphs
 Dyadic features:
1. Reciprocity
2. edge betweenness
3. number of common neighbors
 Egonet features:
1. number of triangles
2. total weight
3. principal eigenvalue
21
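Two of the dyadic features listed above, reciprocity and the number of common neighbours, can be sketched on a small directed graph as follows. The graph and helper names are hypothetical; treating edges as undirected when counting common neighbours is a choice made for this sketch.

```python
# Hypothetical directed toy graph: node -> set of successors.
out_edges = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"d"},
    "d": set(),
}

def is_reciprocal(out_edges, u, v):
    """True if both u -> v and v -> u exist."""
    return v in out_edges[u] and u in out_edges[v]

def reciprocity(out_edges):
    """Fraction of directed edges whose reverse edge also exists."""
    edges = [(u, v) for u, succ in out_edges.items() for v in succ]
    mutual = sum(1 for u, v in edges if u in out_edges[v])
    return mutual / len(edges) if edges else 0.0

def common_neighbours(out_edges, u, v):
    """Nodes adjacent to both u and v, ignoring edge direction."""
    def neighbours(x):
        predecessors = {p for p, succ in out_edges.items() if x in succ}
        return (out_edges[x] | predecessors) - {x}
    return (neighbours(u) & neighbours(v)) - {u, v}

print(is_reciprocal(out_edges, "a", "b"), reciprocity(out_edges))
print(common_neighbours(out_edges, "a", "b"))
```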

 Node-group-level features:
1. Density
2. Modularity
3. Conductance
 Global measures:
1. number of connected components
2. distribution of component sizes
3. principal eigenvalue
4. minimum spanning tree weight
5. average node degree
6. global clustering coefficient
22
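A rough sketch of two group-level measures (density, conductance) and one global measure (number of connected components) follows. The toy graph is hypothetical, and conductance is taken here as cut edges divided by the smaller of the two volumes, which is one common convention.

```python
# Hypothetical toy graph: two triangles joined by one edge, plus an isolated node.
graph = {
    1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4},
    4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5},
    7: set(),
}

def density(graph, group):
    """Internal edges divided by the maximum possible number of internal edges."""
    n = len(group)
    if n < 2:
        return 0.0
    # Every internal undirected edge is seen once from each endpoint.
    internal = sum(1 for u in group for v in graph[u] if v in group) // 2
    return internal / (n * (n - 1) / 2)

def conductance(graph, group):
    """Cut edges divided by the smaller of the two volumes (sums of degrees)."""
    cut = sum(1 for u in group for v in graph[u] if v not in group)
    vol_in = sum(len(graph[u]) for u in group)
    vol_out = sum(len(graph[u]) for u in graph if u not in group)
    smaller = min(vol_in, vol_out)
    return cut / smaller if smaller else 0.0

def connected_components(graph):
    """List of node sets, one per connected component (depth-first search)."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        components.append(component)
    return components

group = {1, 2, 3}
print(density(graph, group), conductance(graph, group), len(connected_components(graph)))
```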

 Oddball[6] :
 The aim of this technique is to find anomalous nodes
 It builds its solution on the analysis of ego networks
 Input = a graph , output = list of node outlier candidates
 Ego network :
 one-step neighborhood around a central node “ego”
 includes the central node, its direct neighbors and all the edges
among these nodes
 In other words, the ego network is the subgraph of one-step
neighborhood of the central node
23
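The ego network definition above translates almost directly into code. The sketch below uses a hypothetical adjacency dictionary and returns the node set and the edge set of the one-step neighbourhood.

```python
# Hypothetical undirected toy graph: node -> set of neighbours.
graph = {
    1: {2, 3, 4},
    2: {1, 3},
    3: {1, 2},
    4: {1, 5},
    5: {4},
}

def ego_network(graph, ego):
    """Return (nodes, edges) of the one-step neighbourhood of `ego`.

    Nodes are the ego and its direct neighbours; edges are all edges of the
    graph whose two endpoints both lie in that node set.
    """
    nodes = {ego} | graph[ego]
    edges = {frozenset((u, v)) for u in nodes for v in graph[u] if v in nodes}
    return nodes, edges

nodes, edges = ego_network(graph, 1)
print(sorted(nodes), len(edges))
```

Note that the edge between nodes 4 and 5 is left out of the ego network of node 1, because node 5 is two steps away.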

 Ego network
 Fig.2 Fig.3
24

 Oddball :
1. Ego network extraction: get all ego networks from the input graph.
2. Feature selection: choose features of ego networks that could indicate
anomalies; compute these features for all ego networks.
3. Analysis: pinpoint anomalies using any outlier detection method in point
clouds
 Two of the features that are successful in detecting outliers are number of
nodes and number of edges in the ego network
25
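The three steps above can be sketched as follows. This is a simplified illustration, not the exact procedure of [6]: the toy graph is hypothetical, and step 3 is assumed here to be a least-squares line fitted in log-log space, with the absolute log residual as the outlier score. The sketch also assumes that no node is isolated, so every ego network has at least one edge.

```python
import math

# Hypothetical toy graph: a star around node 0 linked to a clique {5, 6, 7, 8}.
graph = {
    0: {1, 2, 3, 4},
    1: {0}, 2: {0}, 3: {0},
    4: {0, 5},
    5: {4, 6, 7, 8},
    6: {5, 7, 8}, 7: {5, 6, 8}, 8: {5, 6, 7},
}

def egonet_features(graph):
    """Steps 1 and 2: number of nodes N and edges E of every ego network."""
    features = {}
    for ego, neighbours in graph.items():
        nodes = {ego} | neighbours
        # Every undirected edge inside the egonet is seen once from each end.
        edge_count = sum(1 for u in nodes for v in graph[u] if v in nodes) // 2
        features[ego] = (len(nodes), edge_count)
    return features

def fit_power_law(points):
    """Least-squares fit of log E = log C + theta * log N; returns (C, theta)."""
    xs = [math.log(n) for n, _ in points]
    ys = [math.log(e) for _, e in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    theta = sxy / sxx
    return math.exp(mean_y - theta * mean_x), theta

def outlier_scores(graph):
    """Step 3 (assumed variant): distance from the fitted line in log-log space."""
    features = egonet_features(graph)
    c, theta = fit_power_law(list(features.values()))
    return {
        node: abs(math.log(e) - math.log(c * n ** theta))
        for node, (n, e) in features.items()
    }

features = egonet_features(graph)
scores = outlier_scores(graph)
print(features[0], features[5], sorted(scores, key=scores.get, reverse=True)[:3])
```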

 Oddball:
 Plotting the number of nodes against the number of edges of each ego network reveals
near-cliques and stars
 The green line represents the maximum number of edges in an n-node ego
network, n(n−1)/2, which is attained by a clique
 The blue line represents the minimum number of edges, n−1, which is attained by a star
 The closer an ego network lies to either line, the more remarkable it is likely
to be.
26
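The two bounds can be turned into a simple position indicator. In the sketch below, the linear rescaling between the star bound and the clique bound is an assumption made for illustration; it is not a score defined in the slides.

```python
def edge_bounds(n):
    """Minimum (star) and maximum (clique) edge counts of an n-node ego network."""
    return n - 1, n * (n - 1) // 2

def clique_likeness(n, e):
    """0.0 for a perfect star, 1.0 for a perfect clique.

    Assumes n >= 3, because for smaller ego networks both bounds coincide.
    """
    low, high = edge_bounds(n)
    return (e - low) / (high - low)

print(edge_bounds(5), clique_likeness(5, 4), clique_likeness(5, 10))
```

Values close to 0 or 1 correspond to ego networks lying near the blue or the green line, respectively.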

 Oddball:
 Fig. 4: Revealing cliques in graph A
 Fig. 5: Revealing stars in graph B
27

 Proximity-based approaches:
 Main idea : This group of techniques exploits the graph structure to measure closeness
(or proximity) of objects in the graph
 These methods capture the simple autocorrelation between these objects, where close-by
objects are considered to be likely to belong to the same class (e.g. malicious/benign or
infected/healthy)
 Measuring the importance of the nodes in a graph
 PageRank[7]
 Personalized PageRank (PPR)[8]
 SimRank[9]
28
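As an illustration of a proximity measure, the sketch below computes Personalized PageRank by power iteration, written as a random walk with restart. The toy graph, the restart probability of 0.15, and the fixed number of iterations are assumptions of this sketch.

```python
# Hypothetical toy graph: two triangles joined by the edge 3-4.
graph = {
    1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4},
    4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5},
}

def personalized_pagerank(graph, source, restart=0.15, iterations=100):
    """Random walk with restart: with probability `restart`, jump back to `source`."""
    scores = {node: 0.0 for node in graph}
    scores[source] = 1.0
    for _ in range(iterations):
        updated = {node: 0.0 for node in graph}
        for node, mass in scores.items():
            neighbours = graph[node]
            if not neighbours:
                # Dangling node: send the walker back to the source.
                updated[source] += (1.0 - restart) * mass
                continue
            share = (1.0 - restart) * mass / len(neighbours)
            for neighbour in neighbours:
                updated[neighbour] += share
        # The total mass is 1, so the restart contributes exactly `restart`.
        updated[source] += restart
        scores = updated
    return scores

scores = personalized_pagerank(graph, source=1)
print({node: round(s, 3) for node, s in scores.items()})
```

Nodes in the same triangle as the source receive higher scores than nodes in the other triangle, which is the "close-by objects" intuition described above.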

 Main idea : The cluster or community-based methods for graph anomaly detection rely on finding densely
connected groups of “close-by” nodes in the graph and spot nodes and/or edges that have connections
across communities.
 Two main problems[10]
 P1 : how to find the community of a given node : ‘neighborhood of a node’
 Use random-walk-with-restart-based PPR scores[8] of all the nodes with respect to the given node
 nodes with high PPR scores constitute the neighborhood of a node
 P2: how to quantify the extent to which a given node is a bridge node
 The pairwise PPR scores among all the neighbors of the given node are averaged to compute a
so-called “normality” score of the node
 Nodes with low normality have neighbors with low pairwise proximity to one another; the neighbors lie in
different, separate communities, so the given node resembles a bridging node across communities
 techniques:
1. SCAN
2. AUTOPART
29
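The normality score of P2 can be sketched as the average pairwise PPR score among a node's neighbours. The method in [10] is formulated for bipartite graphs; the version below applies the same idea to a hypothetical general undirected graph, averages over both directions of each pair, and assumes every node has at least one neighbour.

```python
# Hypothetical toy graph: two triangles bridged by node "x".
graph = {
    "a1": {"a2", "a3", "x"}, "a2": {"a1", "a3"}, "a3": {"a1", "a2"},
    "x": {"a1", "b1"},
    "b1": {"b2", "b3", "x"}, "b2": {"b1", "b3"}, "b3": {"b1", "b2"},
}

def ppr(graph, source, restart=0.15, iterations=100):
    """Personalized PageRank by power iteration (no dangling nodes assumed)."""
    scores = {node: 0.0 for node in graph}
    scores[source] = 1.0
    for _ in range(iterations):
        updated = {node: 0.0 for node in graph}
        for node, mass in scores.items():
            share = (1.0 - restart) * mass / len(graph[node])
            for neighbour in graph[node]:
                updated[neighbour] += share
        updated[source] += restart
        scores = updated
    return scores

def normality(graph, node):
    """Average pairwise PPR score among the neighbours of `node`.

    Returns None when the node has fewer than two neighbours, since no
    pair exists in that case.
    """
    neighbours = sorted(graph[node])
    if len(neighbours) < 2:
        return None
    scores_from = {u: ppr(graph, u) for u in neighbours}
    pair_scores = [scores_from[u][v] for u in neighbours for v in neighbours if u != v]
    return sum(pair_scores) / len(pair_scores)

print(round(normality(graph, "x"), 3), round(normality(graph, "a2"), 3))
```

The bridge node "x" obtains a lower normality score than "a2", whose neighbours sit in the same community.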

 SCAN[11] – Structural Clustering Algorithm for Networks
 purpose : to identify node outliers
 two types of nodes that play special roles:
1. Outliers : nodes that are marginally connected to clusters
2. Hubs : nodes that bridge clusters
 clusters : groups of nodes that have a dense set of edges running within
the clusters, and have a relatively low number of edges that run between
the clusters
 hubs play a significant role
 Outliers are of little importance and may be discarded or isolated as noise
30

 SCAN :
 Input: a graph and two parameters (ε, μ)
 ε sets how strict the condition is for a node to be considered
part of a cluster
 μ determines the minimum number of vertices a cluster must have
 Output: a list of clusters, hubs, and outliers
31
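To make the role of ε and μ concrete, the sketch below uses the structural similarity and core-node test as I understand them from [11]: two nodes are similar when their closed neighbourhoods overlap strongly, and a node is a core when at least μ nodes are ε-similar to it. The slides do not spell these formulas out, so treat them as an assumption; the toy graph is hypothetical.

```python
import math

# Hypothetical toy graph: a triangle {1, 2, 3} with a pendant node 4 attached to 3.
graph = {
    1: {2, 3},
    2: {1, 3},
    3: {1, 2, 4},
    4: {3},
}

def structure(graph, node):
    """Closed neighbourhood: the node itself plus its neighbours."""
    return graph[node] | {node}

def structural_similarity(graph, u, v):
    """Shared closed neighbourhood, normalised by the geometric mean of the sizes."""
    shared = len(structure(graph, u) & structure(graph, v))
    return shared / math.sqrt(len(structure(graph, u)) * len(structure(graph, v)))

def epsilon_neighbourhood(graph, node, eps):
    """Members of the closed neighbourhood that are at least eps-similar to `node`."""
    return {
        v for v in structure(graph, node)
        if structural_similarity(graph, node, v) >= eps
    }

def is_core(graph, node, eps, mu):
    """A core node has at least mu nodes in its epsilon-neighbourhood."""
    return len(epsilon_neighbourhood(graph, node, eps)) >= mu

print(structural_similarity(graph, 1, 2), round(structural_similarity(graph, 3, 4), 3))
print(is_core(graph, 3, eps=0.8, mu=3), is_core(graph, 4, eps=0.8, mu=2))
```

Raising ε shrinks every ε-neighbourhood, so fewer nodes qualify as cores and clusters break apart.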

 SCAN :
 A low ε sets a low bar for being a member of a cluster
 Increasing ε tightens the coherence inside a cluster, and the initial
all-encompassing cluster is broken up into smaller groups
 Fig. 6: ε=0.7, μ=2; Fig. 7: ε=0.8, μ=2; Fig. 8: ε=0.9, μ=2
32

 SCAN :
 In Fig. 6, the original interpretation is retrieved: clusters {1, 2, 3, 4, 5, 6}
and {8, 9, 10, 11, 12, 13}, with 7 as a hub and 14 as an outlier.
 Fig. 7 further decomposes the two clusters, thus also identifying 10 as a
hub, because it neighbors two clusters.
 In the extreme case in Fig. 8, the conditions to form a cluster are so strict
that none is identified, so all nodes are taken to be outliers.
 It is worth noting that ε=0.7 and μ=7 would also lead to the extreme case,
because there is no combination of seven nodes that are closely connected.
33

 SCAN :
 How does it work?
 At the beginning, all nodes are labeled as unclassified
 SCAN performs one pass over the nodes and classifies each of them either as a
cluster member or a non-member based on structural connectivity
 At the end, when all clusters are found, the non-members are classified
further as hubs or outliers, based on the cluster membership of their
neighbors
34
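The final step, splitting non-members into hubs and outliers, can be sketched as follows. The cluster assignment is taken as given, and the rule "a non-member with neighbours in at least two different clusters is a hub" follows my reading of [11]. The toy graph and cluster labels are hypothetical and are not the graph shown in the figures.

```python
# Hypothetical toy graph: clusters A = {1, 2, 3} and B = {4, 5, 6},
# node 7 touches both clusters, node 8 touches only cluster B.
graph = {
    1: {2, 3, 7}, 2: {1, 3}, 3: {1, 2},
    4: {5, 6, 7}, 5: {4, 6}, 6: {4, 5, 8},
    7: {1, 4},
    8: {6},
}
cluster_of = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B"}

def classify_non_members(graph, cluster_of):
    """Label every node without a cluster as 'hub' or 'outlier'."""
    labels = {}
    for node in graph:
        if node in cluster_of:
            continue  # cluster members are left untouched
        neighbouring_clusters = {cluster_of[v] for v in graph[node] if v in cluster_of}
        labels[node] = "hub" if len(neighbouring_clusters) >= 2 else "outlier"
    return labels

labels = classify_non_members(graph, cluster_of)
print(labels)
```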

 AUTOPART[12]: Parameter-Free Graph Partitioning and Outlier Detection
 Capable of identifying anomalous edges
 Its primary purpose is to partition the graph into clusters automatically,
without user intervention, i.e. it is parameter-free
 After finding a partitioning (a set of clusters), it proposes a method to
measure the outlierness of edges that bridge separate clusters
 This technique specifically uses the adjacency matrix as the graph
representation
 A partitioning is a reordering of rows and columns such that nodes
belonging to the same cluster are placed next to each other
 The adjacency matrix is thereby broken down into blocks
35

 AUTOPART :
 the squares located on the diagonal of the matrix capture the edges running inside
the clusters
 the rectangles represent the edges bridging the corresponding clusters.
 Fig. 9: nodes; Fig. 10: groups
36

 AUTOPART:
 A good partitioning yields homogeneous blocks, which in turn, can be compressed
efficiently
 The total cost consists of a description cost and a code cost
 Description cost : holds the information about the rectangular/square blocks. It is
the transmission cost of the following terms :
• Number of nodes
• Node permutation (which row represents which node)
• number of clusters
• number of nodes in each cluster
• number of ones in each block (the number of edges bridging the given clusters)
 Code cost : holds the information about the content of the blocks. It is the
transmission cost of the blocks calculated using the Shannon entropy function
37
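The code cost can be illustrated with a short sketch: each block costs its number of cells times the binary Shannon entropy of its density of ones. Only the code cost is sketched here; the description cost is left out, and the toy adjacency matrix is hypothetical.

```python
import math

def binary_entropy(p):
    """Shannon entropy (in bits) of a 0/1 source with P(1) = p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # a homogeneous block costs nothing to encode
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def code_cost(adjacency, clusters):
    """Sum over all blocks of (number of cells) * H(density of ones)."""
    total = 0.0
    for rows in clusters:
        for cols in clusters:
            size = len(rows) * len(cols)
            ones = sum(adjacency[r][c] for r in rows for c in cols)
            total += size * binary_entropy(ones / size)
    return total

# Hypothetical toy graph: nodes 0 and 1 are each linked to nodes 2 and 3.
adjacency = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
]

print(code_cost(adjacency, [[0, 1, 2, 3]]))    # one cluster: mixed block
print(code_cost(adjacency, [[0, 1], [2, 3]]))  # two clusters: homogeneous blocks
```

Splitting the nodes into two clusters makes every block homogeneous, so the code cost drops to zero, while the (omitted) description cost would grow with the number of blocks.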

 AUTOPART:
 Description cost penalizes a high number of blocks
 code cost penalizes heterogeneous blocks
 a good partitioning maintains a balance between a low number of clusters
and a high homogeneity of blocks
 The algorithm finds the tradeoff point between the two aspects and yields
a construction with the minimal total cost
38

 AUTOPART: how does it work?
 Fig. 11: Flowchart of the AUTOPART loop. Start with the initial matrix (k=1). STEP 1: find good clusters for fixed k. STEP 2: increase k (k=k+1). Each pass lowers the encoding cost; the result is the final partitioning with k* clusters.
39

 AUTOPART:
 It starts with an initial adjacency matrix in which all nodes belong to one cluster (k = 1)
 Inside the main loop, the total cost is iteratively reduced until no improvement can
be made, and the final partitioning together with the final cluster count k* is
output
 Each iteration is made up of two steps:
1. First, a good partitioning for the given number of clusters is found
2. Second, the number of clusters is increased to allow for a better partitioning
 Once the final partitioning is found, AUTOPART marks the anomalous edges
 Outliers deviate from the normal patterns, so they hurt attempts to
compress the data
 Therefore, the edges whose removal reduces the total cost the most are marked
as outliers
40
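The edge-marking idea can be sketched as the number of bits saved by deleting one entry of the adjacency matrix. The sketch below makes several simplifying assumptions: it considers the code cost only (the change in description cost is ignored), it treats a single directed matrix entry, and it expects the entry to be 1. The matrix and the partition are hypothetical.

```python
import math

def block_bits(ones, size):
    """size * H(ones / size): bits needed to encode a block with `ones` ones."""
    if ones == 0 or ones == size:
        return 0.0
    p = ones / size
    return -size * (p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

def edge_outlierness(adjacency, clusters, edge):
    """Code-cost saving (in bits) from deleting the matrix entry `edge` = (row, col).

    Only the block containing the entry changes, so only that block is
    re-encoded. Assumes adjacency[row][col] == 1.
    """
    row, col = edge
    rows = next(group for group in clusters if row in group)
    cols = next(group for group in clusters if col in group)
    size = len(rows) * len(cols)
    ones = sum(adjacency[i][j] for i in rows for j in cols)
    return block_bits(ones, size) - block_bits(ones - 1, size)

# Hypothetical directed toy matrix: dense blocks between {0, 1} and {2, 3},
# plus one stray entry (0, 1) inside the otherwise empty block of {0, 1}.
adjacency = [
    [0, 1, 1, 1],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
]
clusters = [[0, 1], [2, 3]]

print(round(edge_outlierness(adjacency, clusters, (0, 1)), 3))
print(round(edge_outlierness(adjacency, clusters, (0, 2)), 3))
```

Deleting the stray entry (0, 1) saves bits, so it is the outlier candidate; deleting (0, 2) from a fully dense block would make the encoding longer.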

References
1. Hawkins DM (1980) Identification of outliers. Springer
2. Rissanen J (1999) Hypothesis selection and testing by the MDL principle. Comput J 42:260–269
3. Henderson K, Eliassi-Rad T, Faloutsos C, Akoglu L, Li L, Maruhashi K, Prakash BA, Tong H (2010)
Metricforensics: a multi-level approach for mining volatile graphs. In: Proceedings of the 16th ACM
international conference on knowledge discovery and data mining (SIGKDD), Washington, DC, pp
163–172
4. Becchetti L, Castillo C, Donato D, Leonardi S, Baeza-Yates R (2006) Link-based characterization and
detection of Web Spam. In: Second international workshop on adversarial information retrieval on
the web (AIRWeb)
5. Ding Q, Katenka N, Barford P, Kolaczyk ED, Crovella M (2012) Intrusion as (anti)social
communication: characterization and detection. In: Proceedings of the 18th ACM international
conference on knowledge discovery and data mining (SIGKDD), Beijing, China. ACM, pp 886–894
41

References
6. Akoglu L, McGlohon M, Faloutsos C (2010) OddBall: spotting anomalies in weighted graphs. In:
Proceedings of the 14th Pacific-Asia conference on knowledge discovery and data mining
(PAKDD), Hyderabad, India, pp 410–421
7. Brin S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. Comput Netw
30(1–7):107–117
8. Haveliwala TH (2003) Topic-sensitive pagerank: a context-sensitive ranking algorithm for web
search. IEEE Trans Knowl Data Eng 15(4):784–796
9. Jeh G, Widom J (2002) SimRank: a measure of structural-context similarity. In: Proceedings of the
8th ACM international conference on knowledge discovery and data mining (SIGKDD), Edmonton,
Alberta, pp 538–543
10. Sun J, Qu H, Chakrabarti D, Faloutsos C (2005) Neighborhood formation and anomaly detection in
bipartite graphs. In: Proceedings of the 5th IEEE international conference on data mining (ICDM),
Houston, TX. IEEE Computer Society, pp 418–425
11. Xu X, Yuruk N, Feng Z, Schweiger TA (2007) SCAN: a structural clustering algorithm for
networks. In: Proceedings of the 13th ACM international conference on knowledge
discovery and data mining (SIGKDD)
42

References
12. Chakrabarti D (2004) AutoPart: parameter-free graph partitioning and outlier detection. In: Knowledge
discovery in databases: PKDD 2004. Springer, pp 112–124
43
