Local Outlier Factor
Lab Report: Lab Development and Application of Data Mining and
Learning Systems 2015
Amr Koura
Abstract: Outlier detection has become an important problem in many real-world applications. In applications such as intrusion detection, finding outliers is more important than finding common patterns. In this paper we discuss one of the outlier detection algorithms, called "LOF: Local Outlier Factor". We present the algorithm in two modes: "batch mode", where the input data set is known in advance, and "incremental mode", where outliers must be detected on the fly while streaming data is received. The paper also shows the implementation details for the two modes and the integration with an open-source data mining project called "RealKD". In the first part, on the LOF batch mode, the paper provides a theoretical explanation of the algorithm and discusses its implementation details. In the second part, on the incremental mode, the paper shows how the algorithm computes outliers efficiently, such that insertion and deletion of points affect only a limited number of nearest neighbors and do not depend on the total number of points N in the data set. The implementation details and the integration with the RealKD library are also discussed.
1 Introduction
Knowledge discovery in databases (KDD) focuses on identifying understandable knowledge in existing data. Most KDD algorithms concentrate on computing patterns that match a large portion of the objects in a data set. However, in applications like intrusion detection, detecting rare events that deviate from the majority is more important than identifying common patterns.
Most outlier detection algorithms rely on a clustering algorithm: for clustering algorithms, outliers are points that lie outside the clusters and are considered noise. This approach depends heavily on the particular clustering algorithm and its parameters. In fact, there are very few algorithms that are directly concerned with outlier detection.
Most outlier detection algorithms treat outlierness as a binary property, so points are classified either as outliers or not. The Local Outlier Factor is an algorithm that quantifies the tendency of a point to be an outlier: it computes a local outlier factor for each example that expresses this tendency.
The algorithm builds on density-based clustering: it computes the density of each point and compares it to the density of its neighbors. The outlier factor is therefore local in the sense that only a restricted neighborhood of each point is taken into account.
Online detection of outliers plays an important role in many streaming applications. Automated identification of outliers in data streams is a hot research topic with many uses in modern applications such as security, image, and multimedia analysis. The incremental mode of the LOF algorithm can detect outliers efficiently: incremental LOF provides performance equivalent to batch-mode LOF and has O(N log N) time complexity, where N is the total number of points in the data set. The paper presents experiments on insertion and deletion of points using incremental LOF and compares the results with those obtained from the batch-mode test.
The paper also discusses the implementation details for the batch and incremental modes and shows how to integrate the code with the open-source project "RealKD".
2 Related Work
Most previous KDD outlier detection research built outlier detection approaches on top of clustering algorithms. Those algorithms were optimized for clustering and treated outliers as noise. There was therefore a need for algorithms designed solely to identify outliers.
The paper "LOF: Identifying Density-based Local Outliers" [1] was one of the very first to study the LOF algorithm. In that paper the authors explain the theory behind the LOF algorithm and the equations that define it.
The paper illustrates the problem with earlier distance-based algorithms using Figure 1. In this figure you can see a data set with a dense cluster C2, a less dense cluster C1, and two objects O1 and O2. According to the paper, if the distance between each object in C1 and its nearest neighbors is larger than the distance between O2 and C2, then we cannot find a minimum distance dmin that classifies O2 as an outlier without also classifying the objects in C1 as outliers. LOF solves this problem because it takes a local view of the points, rather than the global view of distance-based methods. Our implementation of the batch mode is based entirely on the equations provided in [1].
The demand for outlier detection in data stream applications has become an increasingly important and active research area. The incremental mode of the LOF algorithm in this paper is based on [2]. In that paper the author presents an efficient algorithm for detecting outliers in data stream applications and shows that insertion or deletion of a point depends on a limited number of neighbors and not on the total number of points N in the data set. In this paper we use the same algorithms and try them on a real example. Our implementation of the incremental mode of LOF is based on the algorithms provided in [2].

Figure 1: distance-based algorithm (source: http://www.dbs.ifi.lmu.de/Publikationen/Papers/LOF.pdf)
Our implementation should integrate with the open-source project "RealKD". RealKD is an open-source Java library designed to help users apply KDD algorithms to discover real knowledge from real data.
The repository for the RealKD library is https://bitbucket.org/realKD/realkd/wiki/Home. The library is published under the MIT license, and many algorithms are already implemented in it. An outlier detection algorithm based on support vector machines (SVM) was implemented previously, and our algorithm is intended to be the second outlier detection algorithm in the library. In the next section we discuss the interfaces we implement in order to integrate successfully, and we also describe how the user can call our algorithm with parameters from the command-line interface.
3 Local Outlier Factor Algorithm
The Local Outlier Factor algorithm is based on the concept of local density: the algorithm compares the density of each point with the density of its nearest neighbors. In the following subsections the paper discusses the theory behind the algorithm, the implementation details, and the integration. The first subsection deals with the "batch" mode, while the second deals with the "incremental" mode.
3.1 LOF Batch Mode:
In this section we discuss the theory behind the LOF batch mode, the implementation details of the algorithm, and its integration. The details of the algorithm are based on [1].
3.1.1 Formal Definition:
Let k-distance(p) be the distance from the object p to its k-th nearest neighbor. The set of the k nearest neighbors includes all objects within this distance, and can therefore contain more than k objects. We denote the set of k nearest neighbors as Nk(p):
Nk(p) = { q ∈ D \ {p} | d(q, p) ≤ k-distance(p) }
Definition: reachability distance of an object p w.r.t. an object o:
Let k be an integer. The reachability distance of an object p with respect to an object o is defined as:
reach-distk(p, o) = max{ k-distance(o), d(p, o) }
Figure 2: reach-dist(p1, o) and reach-dist(p2, o), for k = 4 (source: http://www.dbs.ifi.lmu.de/Publikationen/Papers/LOF.pdf)
Figure 2 illustrates the idea of the reachability distance between objects p and o. If object p is far away from object o, like p2, then the reachability distance equals the actual distance between p and o, d(p, o). But if object p is close to o, like p1, then the reachability distance equals the k-distance of object o. This shows the importance of the parameter k: the higher the value of k, the more similar the reachability distances for objects within the same neighborhood.
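The two definitions above translate directly into code. The following is a minimal Python sketch of k-distance and the reachability distance (not the report's Java/RealKD implementation; the function names and toy data are our own):

```python
from math import dist  # Euclidean distance, Python 3.8+

def k_distance(p, data, k):
    """Distance from p to its k-th nearest neighbor (p itself excluded)."""
    return sorted(dist(p, q) for q in data if q != p)[k - 1]

def reach_dist(p, o, data, k):
    """reach-dist_k(p, o) = max(k-distance(o), d(p, o))."""
    return max(k_distance(o, data, k), dist(p, o))
```

For a point far from o the maximum is the true distance d(p, o); for a point inside o's neighborhood it is capped below by k-distance(o), exactly as Figure 2 shows.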
Definition: local reachability density of an object p:
lrdK(p) = 1 / ( Σo∈NK(p) reach-distK(p, o) / |NK(p)| )
The local reachability density of an object p is the inverse of the average reachability distance to the k nearest neighbors of p.
Definition: local outlier factor of an object p:
The local outlier factor of an object p is defined as:
LOFK(p) = ( Σo∈NK(p) lrdK(o) / lrdK(p) ) / |NK(p)|
The LOF value of an object expresses the tendency of that object to be an outlier. The LOF value has a special property that helps to detect outliers: the LOF value of objects deep inside a cluster is approximately equal to 1, while objects outside the clusters have LOF values larger than 1 (proof in [1]).
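Putting the three definitions together, a compact (and deliberately naive, quadratic-time) Python sketch of lrd and LOF could look as follows; the assumption of distinct points with no distance ties at the k-th neighbor is ours, not the report's:

```python
from math import dist

def knn(p, data, k):
    """N_k(p): the k nearest neighbors of p (assumes no ties at rank k)."""
    return sorted((q for q in data if q != p), key=lambda q: dist(p, q))[:k]

def lrd(p, data, k):
    """Local reachability density: inverse of the mean reachability distance."""
    neigh = knn(p, data, k)
    # dist(o, knn(o, ...)[-1]) equals k-distance(o) under the no-ties assumption
    total = sum(max(dist(o, knn(o, data, k)[-1]), dist(p, o)) for o in neigh)
    return len(neigh) / total

def lof(p, data, k):
    """LOF_k(p): mean ratio of the neighbors' lrd to p's own lrd."""
    neigh = knn(p, data, k)
    return sum(lrd(o, data, k) for o in neigh) / (len(neigh) * lrd(p, data, k))
```

On a tight 2x2 grid of points plus one far-away point, the grid points score LOF ≈ 1 and the isolated point scores well above 1, matching the property quoted from [1].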
3.1.2 Implementation Details
In this subsection we show our implementation details for the batch-mode LOF algorithm.
The class "LOFOutlier" in package "de.unibonn.realkd.algorithms.outlier.LOF" contains all the code implementing the algorithm's equations. The class extends the abstract class "AbstractMiningAlgorithm", which contains the logic needed to be callable from the RealKD framework. We maintain an N*N matrix called "trainingMatrix" that contains the distances between all points in our data set. Another N*N matrix called "sortedTrainingMatrix" contains, for each point, the indices of all points sorted by distance. For example, if the nearest neighbor of the point with index 0 is the point with index 2, followed by the point with index 6, then the first row of this matrix looks like:
0 2 6 ...
So for the first point, the nearest neighbor is itself (index 0), then the point at index 2, then the point at index 6, and so on. This data structure was chosen for speed: with these matrices it is easy to find the k nearest neighbors, or the reverse k nearest neighbors (which we will need later in the incremental mode), quickly.
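As an illustration of this data structure, here is a Python sketch of the idea (not the actual Java code; the names mirror the report's trainingMatrix and sortedTrainingMatrix):

```python
from math import dist

def build_matrices(points):
    """Build the N*N pairwise distance matrix and, per row, the point
    indices sorted by distance, so row i lists i's neighbors from
    nearest to farthest (position 0 is the point itself)."""
    n = len(points)
    training = [[dist(p, q) for q in points] for p in points]
    sorted_training = [sorted(range(n), key=lambda j: training[i][j])
                       for i in range(n)]
    return training, sorted_training
```

The k nearest neighbors of point i are then sorted_training[i][1:k+1], and the reverse k nearest neighbors can be found by scanning the rows of the other points.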
The main inputs for this algorithm are the data set and the value of the parameter K. The class "FractionOFLOFParameter" is the placeholder for the K value. The main logic lives in the function "concreteCall", which is called by RealKD once the user specifies the algorithm name "LOF" and passes the K value. This function calls the "computeLofValues" function, which takes the input data set and the K value and calls the corresponding functions to compute the LRD and LOF for each example in the data set. To test this class, the user should call the program with the following parameters:
RealKD load "Path to input data set" "Path to input attribute file" "Path to input group file" run LOF "Numeric target attributes" KValue=[value of K]
For example:
RealKD load "/Users/XYZ/simpleTestFile/data.txt" "/Users/XYZ/simpleTestFile/attributes.txt" "/Users/XYZ/simpleTestFile/groups.txt" run LOF "Numeric Target attributes=Latitude,Longitude" KValue=3
3.2 Incremental LOF Mode
Designing an incremental LOF algorithm is motivated by two goals. First, the performance of the algorithm should be equivalent to the performance of the iterated "static" LOF algorithm. Second, because stream data is considered infinite, we need an efficient algorithm that can perform insertions and deletions efficiently without depending on the total number of points N; otherwise the performance would be O(N² log N). The paper in [2] shows an efficient incremental LOF algorithm for insertion and deletion. In each insertion/deletion operation, the algorithm updates only a limited number of neighbors, so it does not depend on the total number of records N. This improves the complexity compared with iterated static LOF and makes it O(N log N) rather than O(N² log N).
In this paper we show the details of the insertion and deletion parts; our implementation is based on the algorithms in [2].
3.2.1 Insertion
In the insertion part, the algorithm must keep track of the points whose values (k-distance, LRD, LOF) should be updated after inserting the new data point.
Figure 3 shows the general framework for inserting a new point in incremental LOF.
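The core of this bookkeeping is finding the reverse k nearest neighbors of the new point: only points that now count the new point among their k nearest neighbors can see their k-distance (and, transitively, their lrd and LOF) change. The following is a hedged Python sketch of that selection step only, with our own names and toy data, not the full insertion algorithm of [2]:

```python
from math import dist

def k_distance(p, data, k):
    """Distance from p to its k-th nearest neighbor within data."""
    return sorted(dist(p, q) for q in data if q != p)[k - 1]

def affected_by_insert(new_p, data, k):
    """Reverse k-NN of new_p: existing points whose k-distance changes
    because new_p entered their k-neighborhood."""
    updated = data + [new_p]
    return [q for q in data if dist(q, new_p) <= k_distance(q, updated, k)]
```

Inserting a point into the middle of a small cluster affects only the cluster members, not a far-away point, which is why the update cost is independent of N.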
3.2.2 Deletion
In data stream applications, there is a need to delete one or more examples, because of memory limitations or because the examples have become outdated.
As in the insertion part, the deletion part must keep track of the affected examples so that their k-distance, LRD, and LOF can be updated after deleting the required example.
Figure 4 shows the framework for deletion.
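Mirroring the insertion case, the affected set for a deletion consists of the points that had the deleted example among their k nearest neighbors (its reverse k-NN before removal). A minimal Python sketch of that selection step, under the same assumptions as above (our own names, not the report's Java code):

```python
from math import dist

def k_distance(p, data, k):
    """Distance from p to its k-th nearest neighbor within data."""
    return sorted(dist(p, q) for q in data if q != p)[k - 1]

def affected_by_delete(victim, data, k):
    """Points whose k-distance (and hence lrd/LOF) must be recomputed
    after removing victim: those that had victim among their k NN."""
    return [q for q in data
            if q != victim and dist(q, victim) <= k_distance(q, data, k)]
```

Deleting an isolated point affects nobody, while deleting a cluster member affects only its immediate neighborhood, again independent of N.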
3.2.3 Implementation Details:
In this section we show the implementation details, the integration, and the command-line interface for calling incremental LOF for both insertion and deletion.

Figure 3: incremental LOF insertion (source: http://www-ai.cs.uni-dortmund.de/LEHRE/FACHPROJEKT/SS12/paper/outlier/pokrajac2007.pdf)

Analogously to batch LOF, we create two classes for incremental LOF: first the "ILOFOutlierAdd" class, which contains the logic of the insertion algorithm, and second "ILOFOutlierDelete", which contains the logic of the deletion algorithm. For reusability, both classes extend the "LOFOutlier" class, because we need to reuse all the functions that compute LRD and LOF, and we need to use trainingMatrix and sortedTrainingMatrix as well.
3.2.3.1 Insertion: In the case of insertion, the algorithm needs the K value (already implemented in batch mode, so we reuse it) and the new data point to be inserted. The class corresponding to the new parameter is "ILOFNewDataParameter".
Again, the function "concreteCall" contains the main logic of the insertion algorithm. In this function, the code inserts the new example into the data set and updates the LOF values of only the limited number of affected neighbors, as specified by the algorithm.
To test incremental LOF addition, the user should call the program with the following parameters:
RealKD load "Path to input data set" "Path to input attribute file" "Path to input group file" run LOF "Numeric target attributes" KValue=[value of K] newPoint="delimited string of the new example"

Figure 4: incremental LOF deletion (source: http://www-ai.cs.uni-dortmund.de/LEHRE/FACHPROJEKT/SS12/paper/outlier/pokrajac2007.pdf)
For example, to insert a data sample with attributes "Alexandria", "31.205753", "29.924526", the user executes:
RealKD load "/Users/XYZ/simpleTestFile/data.txt" "/Users/XYZ/simpleTestFile/attributes.txt" "/Users/XYZ/simpleTestFile/groups.txt" run LOF "Numeric Target attributes=Latitude,Longitude" KValue=3 newPoint="Alexandria;31.205753;29.924526"
3.2.3.2 Deletion: In the case of deletion, the algorithm needs the K value (already implemented in batch mode, so we reuse it) and the index of the data point to be deleted. The class corresponding to the new parameter is "ILOFDeleteExample".
Again, the function "concreteCall" contains the main logic of the deletion algorithm. In this function, the code deletes the example at the index position passed by the user and updates the LOF values of only the limited number of affected neighbors, as specified by the algorithm.
To test incremental LOF deletion, the user should call the program with the following parameters:
RealKD load "Path to input data set" "Path to input attribute file" "Path to input group file" run LOF "Numeric target attributes" KValue=[value of K] deleteIndex=[index to be deleted]
For example, to delete the fourth data point in the data set (index = 3), the user executes:
RealKD load "/Users/XYZ/simpleTestFile/data.txt" "/Users/XYZ/simpleTestFile/attributes.txt" "/Users/XYZ/simpleTestFile/groups.txt" run LOF "Numeric Target attributes=Latitude,Longitude" KValue=3 deleteIndex=3
4 Experiments
In this section we show an experiment running the algorithm against a simple geographical data set that contains German and Egyptian cities with their coordinates (latitude and longitude). For simplicity, we select a small value, K = 3.
4.1 Running Batch Mode
In the data set we put nine German cities and one Egyptian city with their coordinates (latitude and longitude) and then run the algorithm with K = 3.
Here is the input data set; each record contains the city name, latitude, and longitude:
Berlin 52.520 13.380
Hamburg 53.550 10.000
Munchen 48.140 11.580
Koln 50.950 6.970
Frankfurt 50.120 8.680
Dortmund 51.510 7.480
Stuttgart 48.790 9.190
Essen 51.470 7.000
Bonn 50.730 7.100
Cairo 30.3 31.14
After running the program, the algorithm computes the LOF value for all cities; we can see that the Egyptian city has a large LOF value. The following program output shows the index of each city along with its LOF value:
1 1.191001325549464
2 1.1997290645736223
3 0.9628552264586343
4 0.7586643428289646
5 0.7359971660104047
6 0.7495005015494334
7 1.005038367007253
8 0.6938237614108347
9 0.8042180889618675
10 5.521620958353801
From the program output we see that the object outside the cluster (Cairo) has a LOF value larger than 1, while all other objects have LOF values approximately equal to 1.
Now let's add three more Egyptian cities:
Aswan 25.6833 32.6500
Alexandria 31.13 29.58
Hurghada 27.15 33.50
So the number of Egyptian cities is now 4, and they form their own cluster, as their number is larger than the value of K. When the algorithm asks for the 3 nearest neighbors of an Egyptian city, the list contains 3 cities from the same cluster, and the LOF values are therefore approximately equal to 1. When we run the program, we get the following output:
1 1.187791873035322
2 1.1898577015355898
3 0.968974318939152
4 0.7563190608212818
5 0.7411657187228445
6 0.7532750824303704
7 0.9872825814658073
8 0.6951376411639997
9 0.8009756608212486
10 0.77008269448957
11 0.7192686329698315
12 0.7903058153557572
13 0.7239698839427493
Now we can see that all points have LOF values approximately equal to 1. This matches our expectation, since all examples now lie within clusters.
4.2 Running incremental Mode
In this section we run incremental LOF insertion, compute the LOF values afterwards, and compare the result with the result obtained from running the batch mode in the previous subsection. We start with the same nine German cities, plus three Egyptian cities:
Cairo 30.3 31.14
Aswan 25.6833 32.6500
Alexandria 31.13 29.58
Then, by running incremental LOF addition, we add the fourth Egyptian city, Hurghada;27.15;33.50, and compute the LOF values for all cities. The user calls the program from the command line with the following parameters:
RealKD load "/Users/XYZ/simpleTestFile/data.txt" "/Users/XYZ/simpleTestFile/attributes.txt" "/Users/XYZ/simpleTestFile/groups.txt" run LOF "Numeric Target attributes=Latitude,Longitude" KValue=3 newPoint="Hurghada;27.15;33.50"
This is the output:
1 1.187791873035322
2 1.1898577015355898
3 0.968974318939152
4 0.7563190608212818
5 0.7411657187228445
6 0.7532750824303704
7 0.9872825814658073
8 0.6951376411639997
9 0.8009756608212486
10 0.77008269448957
11 0.7192686329698315
12 0.7903058153557572
13 0.7239698839427493
So we get the same result from running LOF in batch mode and in incremental mode.
5 Conclusion
In this paper we have discussed two modes of running the LOF algorithm: batch mode and incremental mode. By running an experiment on a simple 2-D geographical data set, we obtained the results expected from our theoretical understanding.
The algorithm computes a LOF value for each existing data example, and this value expresses the tendency of the data point to be an outlier. Data examples that lie inside a cluster have LOF values approximately equal to 1, while examples outside the clusters have LOF values larger than 1.
The incremental LOF algorithm has the same performance as the iterated static algorithm but better complexity, since it updates only a limited number of neighbors and does not depend on the total number of examples in the data set.
Our implementation of the batch mode has been integrated into the realkd development branch, while the incremental mode has been tested on a local machine and will be merged into the realkd development branch soon.
More research is required to find optimized ways to compute the k nearest neighbors and the reverse k nearest neighbors in order to improve performance. More research is also required on selecting a suitable K value and on determining the LOF threshold for identifying outliers when applying the algorithm to real-world data sets.
References
[1] Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, and Jörg Sander. LOF: Identifying density-based local outliers. SIGMOD Rec., 29(2):93-104, May 2000.
[2] Dragoljub Pokrajac. Incremental local outlier detection for data streams. In Proceedings of the IEEE Symposium on Computational Intelligence and Data Mining, pages 504-515, 2007.